Cross-Domain Few-Shot Learning (CD-FSL) aims to recognize new classes from unseen domains, given limited training samples. The majority of state-of-the-art approaches for this task introduce new task-specific addition...
ISBN: 9781665487399 (digital); 9781665487399 (print)
On account of the explosive growth of large-scale transportation video, vehicle retrieval has recently come to play an important role in public transportation security and intelligent transport systems. Most vehicle retrieval algorithms are vision-based and consist of vehicle re-identification and vehicle tracking. However, the performance of vision-based vehicle retrieval algorithms is constrained by the limited information provided by traffic video streams. In this paper, we propose a contrastive cross-modal vehicle retrieval solution that maximizes the complementary value of natural language and vision representations. The framework of the proposed solution is as follows: (1) preprocess a source video in four ways to generate local motional semantics and global motional semantics; (2) correspondingly, preprocess the relevant description sentences in two ways, namely Textual Local Instance Semantics Extraction (TLISE) and Textual Local Motional Semantics Extraction (TLMSE); (3) use a two-stream architecture with four visual encoders and four text encoders to extract visual features and textual embeddings; (4) fuse the visual features and textual embeddings respectively by concatenating them along the feature channel in order of importance, and use the fused representations for retrieval. Using the proposed solution, we achieved an MRR score of 33.20%, ranking 7th in the AI City Challenge 2022 Track 2.
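A rough sketch of step (4): per-encoder embeddings are concatenated along the feature channel, and the fused text and video representations are compared for retrieval. The embedding sizes, encoder outputs, and cosine-similarity ranking below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def fuse(features):
    """Concatenate per-encoder embeddings along the feature channel,
    ordered by (assumed) importance, most important first."""
    return torch.cat(features, dim=-1)

# Hypothetical setup: four visual and four text encoders, each
# producing a 256-d embedding per video / per query sentence.
visual_feats = [torch.randn(8, 256) for _ in range(4)]   # 8 videos
text_feats = [torch.randn(8, 256) for _ in range(4)]     # 8 queries

v = F.normalize(fuse(visual_feats), dim=-1)   # (8, 1024)
t = F.normalize(fuse(text_feats), dim=-1)     # (8, 1024)

# Retrieval: rank videos for each text query by cosine similarity.
similarity = t @ v.T                           # (queries, videos)
ranking = similarity.argsort(dim=-1, descending=True)
```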
We introduce Temporal consistency for Test-time adaptation (TempT), a novel method for test-time adaptation on videos through the use of temporal coherence of predictions across sequential frames as a self-supervision...
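The truncated abstract gives only the core idea; below is a minimal sketch of a temporal-consistency objective of this kind, where predictions on consecutive frames are encouraged to agree. The loss form and shapes are assumptions, not TempT's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(model, frames):
    """Self-supervised test-time objective: predictions on consecutive
    frames of the same video should agree.

    frames: (T, C, H, W) tensor of sequential frames;
    model:  maps images to per-class logits."""
    probs = F.softmax(model(frames), dim=-1)   # (T, num_classes)
    # Penalize disagreement between each frame and its successor.
    return F.mse_loss(probs[:-1], probs[1:])
```

At test time, one would typically minimize such a loss with a few gradient steps (often restricted to a subset of parameters, e.g. normalization layers) before making the final predictions.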
ISBN: 9781665487399 (digital); 9781665487399 (print)
Measuring the perceptual quality of images automatically is an essential task in computer vision, as degradations in image quality can arise in many processes, from image acquisition and transmission to enhancement. Many Image Quality Assessment (IQA) algorithms have been designed to tackle this problem. However, it remains unsettled due to the varied types of image distortion and the lack of large-scale human-rated datasets. In this paper, we propose a novel algorithm based on the Swin Transformer [31] with fused features from multiple stages, which aggregates information from both local and global features to better predict quality. To address the issue of small-scale datasets, relative rankings of images are taken into account together with a regression loss to jointly optimize the model. Furthermore, effective data augmentation strategies are used to improve performance. For comparison with previous works, experiments are carried out on two standard IQA datasets and a challenge dataset. The results demonstrate the effectiveness of our work. The proposed method outperforms other methods on the standard datasets and ranks 2nd in the no-reference track of the NTIRE 2022 Perceptual Image Quality Assessment Challenge [53]. This verifies that our method is promising for diverse IQA problems and can thus be applied in real-world settings.
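A minimal sketch of such a joint objective, combining a regression loss with a pairwise margin ranking loss over relative rankings within a batch; the `margin` and weighting `alpha` are assumed hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def iqa_loss(pred, target, margin=0.1, alpha=1.0):
    """Joint objective: regress toward human scores while preserving
    the relative ranking of images within the batch.

    pred, target: (B,) predicted and ground-truth quality scores."""
    regression = F.mse_loss(pred, target)

    # Pairwise ranking: for every pair where target[j] > target[i],
    # encourage pred[j] > pred[i] by at least `margin`.
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)   # [i, j] = pred[j] - pred[i]
    diff_true = target.unsqueeze(0) - target.unsqueeze(1)
    mask = diff_true > 0
    ranking = (F.relu(margin - diff_pred[mask]).mean()
               if mask.any() else pred.new_zeros(()))

    return regression + alpha * ranking
```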
ISBN: 9781665487399 (print)
Image alignment, also known as image registration, is a critical block used in many computer vision problems. One of the key factors in alignment is efficiency, as inefficient aligners can add significant overhead to the overall problem. In the literature, several blocks perform the alignment operation, although most do not focus on efficiency. Therefore, an image alignment block that can operate in time and/or space and run on edge devices would benefit almost all networks dealing with multiple images. Given its wide usage and importance, we propose an efficient, cross-attention-based, multi-purpose image alignment block (XABA) suitable for edge devices. Using cross-attention, we exploit the relationships between features extracted from the images. To make cross-attention feasible for real-time image alignment and to handle large motions, we introduce a pyramidal, block-based cross-attention scheme. This also captures local relationships while reducing memory requirements and the number of operations. Efficient XABA models meet real-time requirements, running above 20 FPS on an NVIDIA Jetson Xavier at 30 W power consumption, in contrast to more powerful computers. Used as a sub-block in a larger network, XABA also improves multi-image super-resolution performance in comparison to other alignment methods.
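The paper's block-based scheme restricts attention to local windows for efficiency; the simplified sketch below shows only the coarse-to-fine pyramidal cross-attention idea, with assumed dimensions and stock modules (e.g. `nn.MultiheadAttention`) standing in for XABA's actual layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidCrossAttention(nn.Module):
    """Coarse-to-fine cross-attention between feature maps of two images.

    Attention at low resolution handles large motions cheaply; each
    result conditions the queries at the next, finer level."""

    def __init__(self, dim=64, levels=3, heads=4):
        super().__init__()
        self.levels = levels
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(levels)
        )
        self.pool = nn.AvgPool2d(2)

    def forward(self, feat_a, feat_b):
        # Build feature pyramids by repeated 2x downsampling.
        pyr_a, pyr_b = [feat_a], [feat_b]
        for _ in range(self.levels - 1):
            pyr_a.append(self.pool(pyr_a[-1]))
            pyr_b.append(self.pool(pyr_b[-1]))

        aligned = None
        for lvl in reversed(range(self.levels)):      # coarse -> fine
            a, b = pyr_a[lvl], pyr_b[lvl]
            B, C, H, W = a.shape
            q = a.flatten(2).transpose(1, 2)           # (B, HW, C)
            if aligned is not None:                    # inject coarser result
                up = F.interpolate(aligned, size=(H, W))
                q = q + up.flatten(2).transpose(1, 2)
            kv = b.flatten(2).transpose(1, 2)
            out, _ = self.attn[lvl](q, kv, kv)         # cross-attention
            aligned = out.transpose(1, 2).reshape(B, C, H, W)
        return aligned

# Usage: align features of image B to image A.
block = PyramidCrossAttention(dim=64)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```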
The association task of assigning detections to tracks in multi-person tracking has recently been improved by integrating a second matching stage for low-confidence detections that are usually discarded in the track...
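A sketch of such a two-stage association (in the spirit of ByteTrack-style matching), using plain IoU between boxes and the Hungarian algorithm; the thresholds and helper names are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(tracks, dets, iou_thresh=0.3):
    """Hungarian matching of track boxes to detection boxes by IoU."""
    if not tracks or not dets:
        return [], list(tracks)
    cost = np.array([[1 - iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols)
             if cost[r, c] <= 1 - iou_thresh]
    matched = {r for r, _ in pairs}
    unmatched = [tracks[i] for i in range(len(tracks)) if i not in matched]
    return [(tracks[r], dets[c]) for r, c in pairs], unmatched

def associate(tracks, detections, scores, high_thresh=0.6):
    """Two-stage association: match high-confidence detections first,
    then give leftover tracks a second chance against low-confidence
    detections that a single-stage tracker would discard."""
    high = [d for d, s in zip(detections, scores) if s >= high_thresh]
    low = [d for d, s in zip(detections, scores) if s < high_thresh]
    matches, leftover = match(tracks, high)   # stage 1
    second, _ = match(leftover, low)          # stage 2
    return matches + second
```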
ISBN: 9781665487399 (digital); 9781665487399 (print)
Human affective behavior analysis has received much attention in human-computer interaction (HCI). In this paper, we introduce our submission to the CVPR 2022 Competition on Affective Behavior Analysis in-the-wild (ABAW). To fully exploit affective knowledge from multiple views, we utilize multimodal features of spoken words, speech prosody, and facial expression, extracted from the video clips of the Aff-Wild2 dataset. Based on these features, we propose a unified transformer-based multimodal framework for Action Unit (AU) detection and expression recognition. Specifically, a static vision feature is first encoded from the current frame. At the same time, we clip the adjacent frames with a sliding window and extract three kinds of multimodal features from the resulting sequences of images, audio, and text. We then introduce a transformer-based fusion module that integrates the static vision features with the dynamic multimodal features. Its cross-attention module makes the fused output features focus on the crucial parts that facilitate the downstream detection tasks. We also leverage data balancing, data augmentation, and post-processing techniques to further improve model performance. In the official test of the ABAW3 Competition, our model ranks first in both the EXPR and AU tracks. Extensive quantitative evaluations and ablation studies on the Aff-Wild2 dataset demonstrate the effectiveness of our proposed method.
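A minimal sketch of a cross-attention fusion of this kind, where the static frame feature queries the windowed multimodal sequence; the module name, dimensions, and residual design are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Cross-attention fusion sketch: the static per-frame vision
    feature queries the dynamic multimodal sequence (windowed video,
    audio, and text features)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, static_feat, dynamic_feats):
        # static_feat: (B, dim) feature of the current frame
        # dynamic_feats: (B, T, dim) windowed multimodal features
        q = static_feat.unsqueeze(1)                       # (B, 1, dim)
        fused, _ = self.cross_attn(q, dynamic_feats, dynamic_feats)
        return self.norm(fused.squeeze(1) + static_feat)   # residual

# Usage: concatenate window features from the three modalities along time.
B, T, D = 4, 16, 512
vision_seq, audio_seq, text_seq = (torch.randn(B, T, D) for _ in range(3))
dynamic = torch.cat([vision_seq, audio_seq, text_seq], dim=1)  # (B, 3T, D)
out = FusionModule()(torch.randn(B, D), dynamic)               # (B, D)
```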
Although massive pre-trained vision-language models like CLIP show impressive generalization capabilities for many tasks, it often remains necessary to fine-tune them for improved performance on specific dataset...
As face masks continue to be a part of our daily lives, the challenge of reconstructing occluded faces remains relevant. While several approaches have been proposed for removing masks from neutral facial images, few h...
ISBN: 9781665487399 (digital); 9781665487399 (print)
The retail industry has seen rapid growth in artificial intelligence and computer vision applications. Among the various topics, automatic checkout (ACO) in retail stores and supermarkets has emerged as one of the critical tasks in this area. Several problems stem from real-world scenarios, such as object occlusion, blurring from scanning motion, and similarity between scanned items. Moreover, the challenge also comes from the difficulty of collecting training images that reflect realistic checkout scenarios, due to continuous updates of the products. This paper proposes a deep learning-based automatic checkout system (DeepACO) to recognize, localize, track, and count products as they move along a retail checkout conveyor belt. DeepACO follows the detect-and-track approach, i.e., applying trackers to detected bounding boxes. It also provides a complete pipeline for generating large training datasets under various environments from synthetic data. The proposed system has been evaluated on the 2022 AI City Challenge Track 4 benchmark. Compared with other state-of-the-art solutions, it shows outstanding results, achieving the top-2 position on test set A with an F1 score of 0.4783.
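A minimal sketch of the detect-and-track counting idea, assuming a detector and a tracker are supplied as callables; the `detector`/`tracker` interfaces and the virtual counting line are hypothetical details, not the DeepACO implementation.

```python
# Detect-and-track counting sketch. `detector` and `tracker` are
# hypothetical callables (e.g., a detection model and an IoU tracker),
# not the DeepACO components.
import cv2

def count_products(video_path, detector, tracker, line_x=640):
    """Count each tracked product once, when its box center first
    crosses a virtual counting line over the conveyor belt."""
    counted = set()
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = detector(frame)           # [(x1, y1, x2, y2, score), ...]
        tracks = tracker.update(boxes)    # [(track_id, x1, y1, x2, y2), ...]
        for tid, x1, y1, x2, y2 in tracks:
            if (x1 + x2) / 2 > line_x:    # center crossed the line
                counted.add(tid)          # set: each product counted once
    cap.release()
    return len(counted)
```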