ISBN (digital): 9798350390155
ISBN (print): 9798350390162
Aerial person re-identification (AReID) focuses on accurately matching target person images within a UAV camera network. Challenges arise from the broad field of view and arbitrary movement of UAVs, leading to foreground target rotation and background style variation. Existing AReID methods have provided limited solutions for the former, while the latter remains largely unexplored. This paper proposes a Rotation Exploration Vision Transformer (RoExViT) to tackle these dual challenges. Specifically, we design Multiple Rotation Tokens (MRT) to explore diverse rotational representations at the feature level, addressing foreground target rotation. To handle background style variation, we propose a Cross-Camera Similarity (CCS) loss that effectively minimizes the view gap among different cameras. Furthermore, we propose an Iteratively Adaptive Batch Construction (IABC) strategy to mitigate overfitting on small datasets. Extensive experiments show that our method outperforms state-of-the-art methods on PRAI-1581 and UAV-Human while also exhibiting outstanding performance on Market1501.
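The idea of pulling per-camera feature statistics together can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual CCS loss: the per-camera centroid averaging and pairwise squared-distance penalty are assumptions made for the sketch.

```python
import numpy as np

def cross_camera_similarity_loss(features, camera_ids):
    """Toy cross-camera similarity penalty: L2-normalize features,
    average them per camera, and take the mean squared distance
    over all pairs of camera centroids."""
    feats = np.asarray(features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cam_arr = np.asarray(camera_ids)
    cams = sorted(set(camera_ids))
    centroids = np.stack([feats[cam_arr == c].mean(axis=0) for c in cams])
    loss, pairs = 0.0, 0
    for i in range(len(cams)):
        for j in range(i + 1, len(cams)):
            loss += float(np.sum((centroids[i] - centroids[j]) ** 2))
            pairs += 1
    return loss / max(pairs, 1)
```

Driving such a loss toward zero encourages features from different cameras, and hence different background styles, to occupy the same region of the embedding space.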
Pseudo bounding box supervision is a promising approach for weakly supervised object localization (WSOL) with only image-level labels. However, the generated pseudo bounding boxes may be inaccurate or even completely ...
Incremental learning aims to overcome catastrophic forgetting when learning deep networks from sequential tasks. With impressive learning efficiency and performance, prompt-based methods adapt a fixed backbone to sequential tasks by learning task-specific prompts. However, existing prompt-based methods rely heavily on strong pretraining (typically on ImageNet-21k), and we find that their models can become trapped if the gap between the pretraining task and unknown future tasks is large. In this work, we develop a learnable Adaptive Prompt Generator (APG). The key is to unify the prompt retrieval and prompt learning processes into a learnable prompt generator, so that the whole prompting process can be optimized to effectively reduce the negative effects of the gap between tasks. To keep our APG from learning ineffective knowledge, we maintain a knowledge pool that regularizes APG with the feature distribution of each class. Extensive experiments show that our method significantly outperforms advanced methods in exemplar-free incremental learning without (strong) pretraining. Moreover, under strong pretraining, our method performs comparably to existing prompt-based models, showing that it can still benefit from pretraining. Code can be found at https://***/TOM-tym/APG
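The prompt-generation idea described above can be illustrated with a toy version: a generator maps a query feature to prompt tokens, which are then prepended to the ViT token sequence. The single linear layer, the dimensions, and the class name are illustrative assumptions, not the APG architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class AdaptivePromptGenerator:
    """Toy prompt generator: one linear map from a query feature
    to a set of prompt tokens (this map would be learnable and
    jointly optimized in a real model)."""

    def __init__(self, dim, n_prompts):
        self.dim = dim
        self.n_prompts = n_prompts
        self.W = rng.normal(0.0, 0.02, size=(dim, n_prompts * dim))

    def __call__(self, query):
        # query: (dim,) -> prompts: (n_prompts, dim)
        return (query @ self.W).reshape(self.n_prompts, self.dim)

def prepend_prompts(tokens, prompts):
    """Prepend generated prompts to a token sequence (seq_len, dim)."""
    return np.concatenate([prompts, tokens], axis=0)
```

Because the prompts are produced by a trainable function of the input rather than retrieved from a fixed pool, the whole prompting path stays differentiable, which is the property the abstract highlights.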
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating im...
Human skin can accurately sense subtle changes of both normal and shear forces. However, tactile sensors applied to robots are challenging in decoupling 3D forces due to the inability to develop adaptive models for co...
Strong demand exists for customizing a pretrained large text-to-image model, e.g. Stable Diffusion, to generate novel concepts, such as the users themselves. However, the newly added concept from previous customization methods often shows weaker combination abilities than the original ones, even when given several images during training. We thus propose a new personalization method that seamlessly integrates a unique individual into the pretrained diffusion model using just one facial photograph and only 1024 learnable parameters, in under 3 minutes. We can then effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pretrained large text encoder. Then, given one facial photo as the target identity, we generate its embedding by optimizing the weights of this basis while locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model shows a better concept combination ability than previous personalization methods. Moreover, our model can learn several new identities at once and have them interact with each other, where previous customization models fail. Project page is at: http://***. Code is at: https://***/ygtxr1997/CelebBasis.
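The basis-weight optimization described above can be illustrated with a toy linear version: the identity embedding is a weighted combination of fixed basis vectors, and only the weights are updated. The basis size, dimensions, objective, and learning rate below are stand-ins for illustration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "celeb basis": K fixed directions in a D-dim
# text-encoder embedding space (sizes are illustrative).
K, D = 16, 64
basis = rng.normal(size=(K, D))        # frozen basis vectors
target = rng.normal(size=D)            # stand-in target identity embedding
weights = np.zeros(K)                  # the only learnable parameters

def embed(w):
    # Identity embedding as a weighted combination of basis vectors.
    return w @ basis

# Gradient descent on 0.5 * ||embed(w) - target||^2, updating
# only the basis weights and nothing else.
lr = 0.005
for _ in range(200):
    err = embed(weights) - target
    weights -= lr * (basis @ err)      # gradient w.r.t. the weights
```

Because only the K weights move while the rest of the model stays frozen, the original concepts remain intact, which is the property the abstract credits for the improved concept combination ability.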
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-m...
Internet of Things (IoT) is growing with various applications linked in, and node failures are becoming more common as a result of malicious strikes and other issues. The cascading collapse induced by local node failu...
In unsupervised 3D face reconstruction, existing methods that model the canonical face typically exclude the skip connections between encoder-decoder pairs. Consequently, they have difficulty capturing the appearance details necessary for the task. However, directly applying the original skip connections merely causes these methods to degrade into a trivial 2D texture reconstruction algorithm. In this paper, we propose novel Reprogramming Skip Connections (RSCs), which avoid this degradation and improve 3D face reconstruction quality. Specifically, the proposed method filters out the inappropriate information causing degradation by aggregating the encoder features over spatial dimensions into several prototypes. These prototypes, which preserve beneficial information, are then combined with the corresponding decoder features with the help of expansion masks. Furthermore, we design a mask reconstruction consistency loss to improve the quality of the expansion masks. Our experiments verify the superiority of our method over other competitors.
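The aggregate-then-expand idea behind reprogrammed skip connections can be sketched roughly as follows. The softmax-attention aggregation, the soft assignment mask, and the additive injection are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def aggregate_prototypes(enc_feat, queries):
    """Aggregate flattened encoder features (HW, C) into a few
    prototypes via softmax attention against query vectors (K, C),
    which would be learnable in a real model."""
    attn = queries @ enc_feat.T                    # (K, HW) similarity scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1
    return attn @ enc_feat                         # (K, C) prototypes

def inject_prototypes(dec_feat, prototypes, mask):
    """Expand prototypes back to spatial positions with a soft
    assignment mask (HW, K) and add them to decoder features (HW, C)."""
    return dec_feat + mask @ prototypes
```

Routing the skip connection through a handful of prototypes instead of the raw feature map is what blocks the decoder from simply copying the input image.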