ISBN: (Print) 9781510674110; 9781510674103
Computer vision systems, such as object detectors, traditionally rely on supervised learning over predetermined categories, an approach that faces limitations when applied to infrared images due to dataset constraints. Emerging contrastive vision-language models, such as CLIP (Contrastive Language-Image Pre-Training), offer a transformative alternative: pre-training on extensive image-text pairs yields diverse visual representations integrated with language semantics. Our work proposes a novel zero-shot object detection approach for infrared images by extending the benefits of CLIP into this domain. We develop a two-stage system for detecting humans in infrared images: the first stage generates region proposals with a YOLO (You Only Look Once) object detector, and the second stage classifies them with CLIP. Compared with a YOLO model fine-tuned on infrared images, the proposed system demonstrates comparable performance, illustrating its efficacy as a zero-shot object detection approach. This method opens new avenues for infrared image processing that leverage the capabilities of foundation models.
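As a concrete illustration of the two-stage pipeline this abstract describes, the sketch below chains a pretrained YOLO proposer with CLIP classification of the proposed crops. The `ultralytics` and `clip` packages, the `yolov8n.pt` checkpoint, the prompt wording, and the score threshold are our assumptions for illustration; the paper does not publish this code.

```python
# Hypothetical sketch of the two-stage zero-shot pipeline: YOLO proposes
# regions, CLIP scores each crop against natural-language prompts.
import torch
import clip                      # openai/CLIP
from PIL import Image
from ultralytics import YOLO

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: region proposals from a pretrained YOLO model (assumed checkpoint).
proposer = YOLO("yolov8n.pt")

# Stage 2: CLIP compares each crop with class prompts (illustrative wording).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
prompts = ["a thermal infrared image of a person",
           "a thermal infrared image of the background"]
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(prompts).to(device))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

def detect_humans(path, score_thresh=0.5):
    """Return (box, score) pairs for crops CLIP assigns to the person prompt."""
    image = Image.open(path).convert("RGB")
    detections = []
    for box in proposer(image)[0].boxes.xyxy.tolist():
        crop = preprocess(image.crop(tuple(box))).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = clip_model.encode_image(crop)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
        if probs[0, 0].item() > score_thresh:   # "person" prompt wins
            detections.append((box, probs[0, 0].item()))
    return detections
```

Because CLIP only re-scores proposals, the detector itself never needs infrared labels; only the prompts encode the target class.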
The development of deep learning models for intelligent vehicles relies on large quantities of reliable data, among which large-scale and accurately labeled traffic-scene image data is conducive to promoting the research ...
Existing cross-domain classification and detection methods usually apply a consistency constraint between a target sample and its self-augmentation for unsupervised learning, without considering the essential source knowledge. In this paper, we propose a Source-guided Target Feature Reconstruction (STFR) module for cross-domain visual tasks, which applies source visual words to reconstruct the target features. Since the reconstructed target features contain source knowledge, they can serve as a bridge connecting the source and target domains; using them for consistency learning therefore enhances the target representation and reduces domain bias. Technically, source visual words are selected and updated according to the source feature distribution and applied to reconstruct a given target feature via a weighted combination strategy. Consistency constraints are then built between the reconstructed and original target features for domain alignment. Furthermore, STFR is theoretically connected to the optimal transport algorithm, which explains the rationality of the proposed module. Extensive experiments on nine benchmarks and two cross-domain visual tasks prove the effectiveness of the proposed STFR module, e.g., 1) cross-domain image classification: average accuracy of 91.0%, 73.9%, and 87.4% on Office-31, Office-Home, and VisDA-2017, respectively; 2) cross-domain object detection: mAP of 44.50% on Cityscapes -> Foggy Cityscapes, car AP of 78.10% on Cityscapes -> KITTI, and MR⁻² of 8.63%, 12.27%, 22.10%, and 40.58% on COCOPersons -> Caltech, CityPersons -> Caltech, COCOPersons -> CityPersons, and Caltech -> CityPersons, respectively.
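The weighted-combination reconstruction in this abstract can be sketched as an attention-like lookup over a bank of source visual words. Everything below (the EMA word update, the softmax weighting, the MSE consistency loss, and names such as `update_words`) is a hypothetical reading of the abstract, not the authors' implementation.

```python
# Minimal sketch of source-guided target feature reconstruction,
# assuming source visual words are kept as an EMA-updated codebook.
import torch
import torch.nn.functional as F

class STFR(torch.nn.Module):
    def __init__(self, num_words=64, dim=256, momentum=0.99):
        super().__init__()
        # Source visual words, updated from source feature statistics.
        self.register_buffer("words", torch.randn(num_words, dim))
        self.momentum = momentum

    @torch.no_grad()
    def update_words(self, source_feats):
        # EMA update: pull each word toward its assigned source features.
        assign = torch.cdist(source_feats, self.words).argmin(dim=1)
        for k in range(self.words.size(0)):
            mask = assign == k
            if mask.any():
                self.words[k] = (self.momentum * self.words[k]
                                 + (1 - self.momentum) * source_feats[mask].mean(0))

    def forward(self, target_feats):
        # Weighted combination: softmax similarity to the source words.
        weights = F.softmax(target_feats @ self.words.T, dim=-1)
        reconstructed = weights @ self.words
        # Consistency between reconstructed and original target features.
        loss = F.mse_loss(reconstructed, target_feats.detach())
        return reconstructed, loss
```

Because each reconstructed feature is a convex-like mixture of source words, the consistency loss pulls target features toward regions the source model already understands, which is the "bridge" role the abstract describes.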
Object detection is a critical component of autonomous vehicle perception systems. However, domain shifts between training environments and real-world scenarios often degrade detector performance. Cross-domain object detection aims to adapt detectors to unlabeled target domains using only labeled source data. Recent popular cross-domain object detection methods employ the mean teacher framework, which uses pseudo-labels generated by the teacher model to guide training on unlabeled real-world data. Despite its effectiveness, continuous training with noisy pseudo-labels leads to abnormal performance degradation in the later stages of training. To address this issue, we propose a novel Hybrid Matching Teacher (HMT) framework for cross-domain visual detection transformers, which enhances cross-domain knowledge transfer across the pseudo-label generation, filtering, and training processes. Specifically, we design a Feature Sparse Alignment (FSA) module to adapt DETR tokens and queries, generate domain-adaptive weights to initialize the teacher-student models, and mitigate the teacher model's inherent initial source bias. Next, a Localization-aware Pseudo-label Filtering (LPF) module ensures high-quality pseudo-labels by considering the consistency between the localization and classification tasks. Furthermore, to improve the efficiency of pseudo-label training, a Cross-view Hybrid Matching (CHM) module introduces an auxiliary matching branch that increases the number of positive queries matched with pseudo-labels. Extensive experiments demonstrate that our approach achieves state-of-the-art performance, outperforming previous methods by 3.1%, 8.5%, and 4.4% on adverse-weather, diverse-scene, and synthetic-to-real benchmarks, respectively.
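A minimal sketch of what localization-aware pseudo-label filtering could look like: keep only teacher boxes whose classification confidence and predicted localization quality are both high and mutually consistent. The joint geometric-mean score, the `consistency_gap` rule, and all threshold values below are our assumptions; the abstract describes the LPF module only at the level above.

```python
# Hypothetical localization-aware filter for teacher pseudo-labels,
# assuming the detector exposes per-box classification scores and a
# predicted localization quality (e.g., from an IoU-prediction head).
import torch

def filter_pseudo_labels(boxes, cls_scores, loc_scores,
                         tau=0.4, consistency_gap=0.3):
    """boxes: (N, 4); cls_scores, loc_scores: (N,) in [0, 1].
    Returns the kept boxes and their joint confidence scores."""
    joint = (cls_scores * loc_scores).sqrt()             # geometric mean
    consistent = (cls_scores - loc_scores).abs() < consistency_gap
    keep = (joint > tau) & consistent                    # both high and aligned
    return boxes[keep], joint[keep]
```

The consistency check discards boxes where a confident class score masks a poorly localized box (or vice versa), which is the failure mode that makes naive confidence thresholding accumulate noisy pseudo-labels over training.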