ISBN:
(Print) 9798350301298
Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. It is essential to understand the effect of primitives, including words, image regions, and video frames, in order to improve compositional generalization capability. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a self-supervised learning-based framework that equips existing V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.
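The equivariance/invariance idea lends itself to a simple regularizer. Below is a minimal sketch assuming a generic V&L model and hypothetical sample variants (one with a semantically irrelevant primitive change, one with a semantics-changing change); the margin-based form is an assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def primitive_consistency_loss(model, sample, equivalent_variant, altered_variant, margin=1.0):
    """Hedged sketch of semantic invariance/equivariance regularizers.

    `sample` is an original V&L input; `equivalent_variant` has a semantically
    irrelevant primitive changed (prediction should not move), while
    `altered_variant` has a semantics-changing primitive edit (prediction
    should move). All three inputs are hypothetical placeholders.
    """
    z = model(sample)                   # prediction logits or an embedding
    z_inv = model(equivalent_variant)
    z_eq = model(altered_variant)

    # Invariance: keep outputs close when sample semantics are unchanged.
    loss_inv = F.mse_loss(z_inv, z)

    # Equivariance (contrastive flavour): push outputs apart, up to a margin,
    # when the primitive edit changes the sample's meaning.
    loss_eq = F.relu(margin - F.mse_loss(z_eq, z))

    return loss_inv + loss_eq
```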
ISBN:
(Print) 9798350301298
Visual place recognition (VPR) is a fundamental task of computer vision for visual localization. Existing methods are trained using image pairs that either depict the same place or not. Such a binary indication does not consider continuous relations of similarity between images of the same place taken from different positions, determined by the continuous nature of camera pose. The binary similarity induces a noisy supervision signal into the training of VPR methods, causing them to stall in local minima and require expensive hard-pair mining algorithms to guarantee convergence. Motivated by the fact that two images of the same place only partially share visual cues due to camera pose differences, we deploy an automatic re-annotation strategy to re-label VPR datasets. We compute graded similarity labels for image pairs based on available localization metadata. Furthermore, we propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks. We demonstrate that the use of the new labels and the GCL allows us to dispense with hard-pair mining and to train image descriptors that perform better in VPR by nearest neighbor search, obtaining results superior or comparable to those of methods that require expensive hard-pair mining and re-ranking techniques.
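A graded similarity label can replace the binary label directly in a contrastive objective. The following is a minimal PyTorch sketch of such a graded contrastive loss; the margin value and exact weighting are assumptions rather than the paper's published GCL formulation.

```python
import torch
import torch.nn.functional as F

def generalized_contrastive_loss(desc_a, desc_b, similarity, margin=0.5):
    """Sketch of a graded contrastive loss for VPR descriptors.

    `similarity` is a graded label in [0, 1] derived from localization metadata
    (1.0 = same place and pose, 0.0 = unrelated places). Margin and weighting
    are assumptions for illustration.
    """
    d = F.pairwise_distance(desc_a, desc_b)                  # Euclidean descriptor distance
    attract = similarity * d.pow(2)                          # pull pairs together in proportion to similarity
    repel = (1.0 - similarity) * F.relu(margin - d).pow(2)   # push dissimilar pairs beyond the margin
    return 0.5 * (attract + repel).mean()
```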
ISBN:
(Print) 9798350365474
In this research, a novel approach for autonomous spacecraft navigation, particularly in lunar contexts, is presented, focusing on vision-based techniques. The system combines lunar crater recognition with feature tracking to enhance the accuracy of spacecraft navigation. It was evaluated comprehensively in a purpose-built software simulation that replicates lunar conditions, allowing thorough testing and refinement. The methodology integrates established navigational methods with artificial intelligence algorithms, yielding high navigational accuracy. The system determines the spacecraft position with an average accuracy of approximately 270 m in the absolute navigation mode, while the relative mode exhibits average errors of 27.4 m and 0.8 m in determining the horizontal and vertical lander displacements relative to the terrain. Initial tests were conducted on embedded systems akin to those on board spacecraft. These tests are pivotal in demonstrating the system's operational viability within the limited-bandwidth and rapid-processing constraints characteristic of space missions. The promising results suggest potential applicability in real-world space missions, enhancing autonomous navigation capabilities in lunar and potentially other extraterrestrial environments.
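As a rough illustration only, one way to combine an absolute crater-based fix with a relative feature-tracking displacement is a simple complementary blend; the filter below and its weight are assumptions, not the navigation filter used in the paper.

```python
import numpy as np

def blend_position_estimates(absolute_fix, relative_delta, last_position, alpha=0.7):
    """Hedged sketch: fuse an absolute crater-recognition fix with a relative
    feature-tracking displacement via a complementary blend. The blend form
    and the weight `alpha` are illustrative assumptions."""
    dead_reckoned = np.asarray(last_position) + np.asarray(relative_delta)  # propagate with feature tracking
    return alpha * dead_reckoned + (1.0 - alpha) * np.asarray(absolute_fix)  # correct with the absolute fix
```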
ISBN:
(Print) 9798350301298
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation codes can be found at https://***/michaeltrs/DeepSatModels.
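The temporal-then-spatial factorization can be sketched as two stacked Transformer encoders, one attending over acquisition dates per patch location and one over patch locations per date. The layer sizes and token layout below are assumptions, not TSViT's exact configuration.

```python
import torch
import torch.nn as nn

class FactorizedTemporoSpatialEncoder(nn.Module):
    """Minimal sketch of a temporal-then-spatial factorized encoder for SITS."""

    def __init__(self, dim=128, depth=2, heads=4):
        super().__init__()
        t_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        s_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(t_layer, depth)  # attends across acquisition dates
        self.spatial = nn.TransformerEncoder(s_layer, depth)   # attends across patch locations

    def forward(self, tokens):
        # tokens: (batch, time, space, dim) patch embeddings of a SITS record
        b, t, s, d = tokens.shape
        x = tokens.permute(0, 2, 1, 3).reshape(b * s, t, d)    # temporal attention per location
        x = self.temporal(x)
        x = x.reshape(b, s, t, d).permute(0, 2, 1, 3).reshape(b * t, s, d)  # spatial attention per date
        x = self.spatial(x)
        return x.reshape(b, t, s, d)
```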
ISBN:
(Print) 9798350301298
Fashion representation learning involves the analysis and understanding of various visual elements at different granularities and the interactions among them. Existing works often learn fine-grained fashion representations at the attribute level without considering their relationships and inter-dependencies across different classes. In this work, we propose to learn an attribute and class-specific fashion representation duet to better model such attribute relationships and inter-dependencies by leveraging prior knowledge about the taxonomy of fashion attributes and classes. Through two sub-networks for the attributes and classes, respectively, our proposed embedding network progressively learns and refines the visual representation of a fashion image to improve its robustness for fashion retrieval. A multi-granularity loss consisting of attribute-level and class-level losses is proposed to introduce appropriate inductive bias to learn across different granularities of the fashion representations. Experimental results on three benchmark datasets demonstrate the effectiveness of our method, which outperforms the state-of-the-art methods by a large margin.
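A multi-granularity loss of this kind can be sketched as a weighted sum of an attribute-level term and a class-level term; the cross-entropy choice and the weights below are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn as nn

class MultiGranularityLoss(nn.Module):
    """Sketch of a multi-granularity objective combining attribute-level and
    class-level supervision (cross-entropy terms and weights are assumptions)."""

    def __init__(self, attr_weight=1.0, class_weight=1.0):
        super().__init__()
        self.attr_ce = nn.CrossEntropyLoss()
        self.class_ce = nn.CrossEntropyLoss()
        self.attr_weight = attr_weight
        self.class_weight = class_weight

    def forward(self, attr_logits, attr_labels, class_logits, class_labels):
        # Fine-grained supervision on attributes plus coarse supervision on classes.
        return (self.attr_weight * self.attr_ce(attr_logits, attr_labels)
                + self.class_weight * self.class_ce(class_logits, class_labels))
```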
ISBN:
(Print) 9798350365474
This study investigates the integration of vision language models (VLMs) to enhance the classification of situations within rugby match broadcasts. Accurately identifying situations in sports videos is important for understanding game dynamics and facilitating downstream tasks such as performance evaluation and injury prevention. Using a dataset comprising 18,000 labeled images extracted at 0.2-second intervals from 100 minutes of rugby match broadcasts, scene classification tasks covering contact plays (scrums, mauls, rucks, tackles, and lineouts), rucks, tackles, and lineouts, as well as multiclass classification, were performed. The study aims to validate the utility of VLM outputs in improving classification performance compared to using image data alone. Experimental results demonstrate substantial performance improvements across all tasks when VLM outputs are incorporated. Our analysis of prompts suggests that, when provided with appropriate contextual information through natural language, VLMs can effectively capture the context of a given image. These findings indicate that leveraging VLMs in sports analysis holds promise for developing image processing models capable of incorporating the tacit knowledge encoded within language models, as well as information conveyed through natural language descriptions.
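One plausible way to combine VLM outputs with image data, sketched below under assumed encoders and dimensions, is to embed the VLM's textual description of a frame and concatenate it with the image features before classification; the paper's actual fusion may differ.

```python
import torch
import torch.nn as nn

class ImageTextFusionClassifier(nn.Module):
    """Sketch: fuse image features with an embedding of the VLM's textual
    description of a broadcast frame (dimensions and concatenation are assumptions)."""

    def __init__(self, image_dim=512, text_dim=512, num_classes=5):
        super().__init__()
        self.classifier = nn.Linear(image_dim + text_dim, num_classes)

    def forward(self, image_feat, vlm_text_feat):
        # image_feat: features from an image backbone for one frame
        # vlm_text_feat: embedding of the VLM's natural-language description of the same frame
        fused = torch.cat([image_feat, vlm_text_feat], dim=-1)
        return self.classifier(fused)
```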
ISBN:
(Print) 9798350365474
Compound Expression Recognition (CER) plays a crucial role in interpersonal interactions. Because the complexity of human emotional expressions gives rise to compound expressions, both local and global facial expressions must be considered comprehensively for recognition. In this paper, to address this issue, we propose a solution for compound expression recognition based on ensemble learning. Specifically, we treat the task as classification and train three expression classification models based on convolutional networks (ResNet50), Vision Transformers, and multi-scale local attention networks, respectively. Then, using late fusion, we integrate the outputs of the three models to predict the final result, leveraging the strengths of the different models. Our method achieves high accuracy on RAF-DB and, in the sixth Affective Behavior Analysis in-the-wild (ABAW) Challenge, achieves an F1 score of 0.224 on the test set of C-EXPR-DB.
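Late fusion of the three classifiers can be sketched as a weighted average of their softmax outputs; the uniform weights below are an assumption about how the ensemble is combined.

```python
import torch

def late_fusion_predict(models, image, weights=None):
    """Sketch of late fusion over three expression classifiers (e.g. a ResNet50,
    a ViT, and a multi-scale local attention network). Uniform weights and
    softmax averaging are illustrative assumptions."""
    probs = [torch.softmax(m(image), dim=-1) for m in models]
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    fused = sum(w * p for w, p in zip(weights, probs))  # weighted average of class probabilities
    return fused.argmax(dim=-1)                          # predicted compound expression per image
```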
ISBN:
(Print) 9798350301298
This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token reduction approaches to improve the efficiency of ViT-based image classification networks, but these methods are not directly applicable to semantic segmentation, which we address in this work. We observe that, for semantic segmentation, multiple image patches can share a token if they contain the same semantic class, as they contain redundant information. Our approach leverages this by employing an efficient, class-agnostic policy network that predicts if image patches contain the same semantic class, and lets them share a token if they do. With experiments, we explore the critical design choices of CTS and show its effectiveness on the ADE20K, Pascal Context and Cityscapes datasets, various ViT backbones, and different segmentation decoders. With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44%, without diminishing the segmentation quality.
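The token-sharing step itself can be sketched as averaging the embeddings of patches that the policy network groups together; the grouping format and averaging below are assumptions about the mechanism.

```python
import torch

def share_tokens(patch_tokens, share_groups):
    """Sketch of the token-sharing step: patches predicted to belong to the same
    semantic class contribute one shared (averaged) token.

    patch_tokens: (num_patches, dim) embeddings of image patches
    share_groups: list of lists of patch indices, one list per shared token (assumed format)
    """
    shared = [patch_tokens[idx].mean(dim=0) for idx in share_groups]
    return torch.stack(shared)  # reduced token set fed to the ViT backbone
```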
ISBN:
(Print) 9798350365474
In this paper, we explore the cross-modal adaptation of pre-trained Vision Transformers (ViTs) for the audio-visual domain by incorporating a limited set of trainable parameters. To this end, we propose a Spatial-Temporal-Global Cross-Modal Adaptation (STG-CMA) to gradually equip the frozen ViTs with the capability for learning audio-visual representation, consisting of modality-specific temporal adaptation for temporal reasoning within each modality, cross-modal spatial adaptation for refining spatial information with cues from the counterpart modality, and cross-modal global adaptation for global interaction between the audio and visual modalities. Our STG-CMA reveals a meaningful finding: leveraging only a shared pre-trained image model with inserted lightweight adapters is sufficient for spatial-temporal modeling and feature interaction across the audio-visual modalities. Extensive experiments indicate that our STG-CMA achieves state-of-the-art performance on various audio-visual understanding tasks including AVE, AVS, and AVQA while requiring significantly fewer tunable parameters. The code is available at https://***/kaiw7/STG-CMA.
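A lightweight adapter inserted into a frozen ViT block typically takes a bottleneck form, as in the sketch below; the bottleneck ratio and activation are assumptions, and STG-CMA's actual spatial, temporal, and global adaptation paths are more elaborate.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Sketch of a lightweight adapter for a frozen ViT block; only the small
    down/up projections would be trained while the ViT weights stay frozen."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # trainable down-projection
        self.up = nn.Linear(bottleneck, dim)    # trainable up-projection
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adaptation of frozen features
```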
ISBN:
(Print) 9798350301298
Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high dynamic range conditions) are highly useful. However, creating training datasets for such environments is expensive and difficult because of the challenges of data collection and annotation. It would be desirable to obtain a robust model without the need for hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images of simple scenes. Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution because it does not consider the non-linearity of Image Signal Processors (ISPs) or the noise characteristics of image sensors. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur augmentation are applied to a RAW image before the non-linear ISP is applied, resulting in realistic intensities. Furthermore, we introduce a noise amount alignment method that calibrates the domain gap in noise properties caused by the augmentation. We show that our proposed noise-accounted RAW augmentation method doubles image recognition accuracy in challenging environments using only simple training data.
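The core idea, applying photometric augmentation in linear RAW space before the non-linear ISP, can be sketched as follows; the gain-jitter form is an assumption and the paper's noise-amount alignment step is omitted.

```python
import torch

def augment_raw_then_isp(raw, isp, color_gain_range=(0.8, 1.2)):
    """Sketch of RAW-space augmentation before the ISP.

    raw: linear RAW image tensor with channels first (C, H, W); isp: callable
    mapping RAW -> sRGB. The per-channel gain jitter is an illustrative choice.
    """
    low, high = color_gain_range
    gain = torch.empty(raw.shape[0], 1, 1).uniform_(low, high)  # per-channel colour jitter in linear space
    raw_aug = raw * gain
    # Intensities stay realistic because the non-linear ISP is applied after augmentation.
    return isp(raw_aug)
```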