Multimodal sentiment analysis aims to accurately assess the sentiment expressed in a given data source by integrating and analyzing multiple modalities, such as text and images. Extracting discriminative features for sentiment prediction is a powerful approach to addressing the challenges in multimodal sentiment analysis. Most methods in this domain leverage pre-trained unimodal models to extract features from individual modalities, which are then integrated via sophisticated fusion mechanisms. However, these unimodal models are often limited in their ability to process multimodal data jointly, risking the loss of semantic associations between the different modalities. This study addresses this problem by developing a simple end-to-end model that avoids the need for sophisticated ensemble techniques for feature extraction. Instead, the proposed method capitalizes on transfer learning by deploying a vision-language pre-trained model, which efficiently extracts both visual and textual features within a single framework. The extracted features are then integrated via the proposed feature interaction module, which captures latent semantic information in an image-text pair through explicit and implicit feature interaction. Finally, the resulting representations are passed to the classification module, improving performance on sentiment analysis tasks. The effectiveness of the proposed approach is substantiated through a rigorous experimental evaluation: assessments on two publicly available real-world datasets reveal significant improvements in sentiment analysis performance.
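As an illustration of the kind of explicit and implicit feature interaction described above, the following sketch fuses text and image token features with a cross-attention branch (implicit) and an element-wise product branch (explicit). The module design, feature dimensions, and pooling choices are assumptions for demonstration, not the paper's exact architecture.

```python
# Illustrative sketch only: the abstract does not specify the interaction module's
# exact design, so the explicit/implicit split and layer choices are assumptions.
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # implicit interaction: cross-modal attention from text tokens to image tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # explicit interaction: fuse pooled features by concatenation + projection
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, D), image_feats: (B, Li, D)
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        t = attended.mean(dim=1)      # pooled text feature attended over image tokens
        v = image_feats.mean(dim=1)   # pooled image feature
        explicit = t * v              # element-wise product as the explicit interaction
        return self.fuse(torch.cat([t, v, explicit], dim=-1))

# usage with random tensors standing in for the vision-language encoder outputs
txt = torch.randn(2, 32, 512)
img = torch.randn(2, 49, 512)
fused = FeatureInteraction()(txt, img)   # (2, 512), fed to the sentiment classifier
```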
Text-based person search aims to retrieve the most relevant pedestrian images from an image gallery based on textual descriptions. Most existing methods rely on two separate encoders to extract the image and text features, and then elaborately design various schemes to bridge the gap between the image and text modalities. However, the shallow interaction between the two modalities in these methods is still insufficient to eliminate the modality gap. To address this problem, we propose TransTPS, a transformer-based framework that enables deeper interaction between the two modalities through the transformer's self-attention mechanism, effectively alleviating the modality gap. In addition, owing to the small inter-class variance and large intra-class variance in the image modality, we further develop two techniques to overcome these limitations. Specifically, Cross-modal Multi-Granularity Matching (CMGM) is proposed to address the problem caused by small inter-class variance and to facilitate distinguishing pedestrians with similar appearance. Furthermore, a Contrastive Loss with Weakly Positive pairs (CLWP) is introduced to mitigate the impact of large intra-class variance and to contribute to the retrieval of more target images. Experiments on the CUHK-PEDES and RSTPReid datasets demonstrate that our proposed framework achieves state-of-the-art performance compared to previous methods.
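The abstract does not give the CLWP formulation; the sketch below shows one plausible reading, a soft-target contrastive loss in which image-text pairs that share a person ID but are not the annotated pair are kept as weakly positive targets. The weighting scheme, temperature, and embedding size are illustrative assumptions.

```python
# Hypothetical sketch of a contrastive loss with weakly positive pairs; the
# paper's exact CLWP formulation may differ. Requires PyTorch >= 1.10 for
# soft-target cross entropy.
import torch
import torch.nn.functional as F

def clwp_loss(img_emb, txt_emb, person_ids, temperature=0.07, weak_weight=0.5):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    same_id = person_ids.unsqueeze(0) == person_ids.unsqueeze(1)
    strong = torch.eye(len(person_ids), dtype=torch.bool)
    # soft targets: 1.0 for the annotated pair, weak_weight for other same-ID pairs
    targets = strong.float() + weak_weight * (same_id & ~strong).float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    return F.cross_entropy(logits, targets)                # soft-label cross entropy

loss = clwp_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randint(0, 4, (8,)))
```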
ISBN (print): 9798350349405; 9798350349399
In recent studies on domain adaptation, significant emphasis has been placed on learning shared knowledge from a source domain and transferring it to a target domain. Recently, the large vision-language pre-trained model CLIP has shown a strong ability in zero-shot recognition, and parameter-efficient tuning can further improve its performance on specific tasks. This work demonstrates that a simple domain prior boosts CLIP's zero-shot recognition in a specific domain. Moreover, CLIP's adaptation relies less on source-domain data owing to its diverse pre-training dataset. Furthermore, we create a benchmark for zero-shot adaptation and pseudo-labeling-based self-training with CLIP. Last but not least, we propose to improve the task generalization ability of CLIP from multiple unlabeled domains, which is a more practical and unique scenario. We believe our findings motivate a rethinking of domain adaptation benchmarks and the associated role of related algorithms in the era of CLIP.
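As a rough illustration of injecting a domain prior into CLIP's zero-shot recognition, the sketch below simply places the domain name in the prompt template. It assumes the OpenAI `clip` package; the template, class names, domain name, and image path (`example.jpg`) are placeholders rather than the paper's actual setup.

```python
# Zero-shot classification with a simple domain prior in the prompt.
# Assumes the OpenAI CLIP package: https://github.com/openai/CLIP
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["dog", "elephant", "giraffe"]     # placeholder class names
domain = "sketch"                            # the domain prior, e.g. a target domain name

prompts = clip.tokenize([f"a {domain} of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.t()).softmax(dim=-1)

print(classes[probs.argmax().item()])
```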
ISBN (print): 9783031270765; 9783031270772
Facial image inpainting aims to fill in visually realistic and semantically new content for masked or missing pixels in a face image. Although current methods have made progress toward high visual quality, the controllable diversity of face inpainting remains an open issue. This paper proposes a new facial image inpainting interaction mode that fills in semantic content based on texts or exemplar images provided by users. We use the powerful image-text representation abilities of the recently introduced Contrastive Language-Image Pre-training (CLIP) models to achieve this interactive face inpainting. We present our method in two parts. Specifically, we first explore a simple and effective optimization-based text-guided facial inpainting method in which a CLIP model is used as a loss network to iteratively modify the latent code in response to the text prompt. Next, we describe a multi-modal inpainting mapper network that maps the input conditions (e.g., text or image) into corresponding latent code changes, supporting guidance by different text prompts and exemplars within one model. We also introduce an exemplar-semantic similarity loss, which maps the inpainted facial image and the exemplar image into CLIP's embedding space to measure their similarity. This loss enables the generated image to include high-level semantic attributes from the exemplar image. Through extensive experiments, we demonstrate the effectiveness of our method in interactive facial inpainting guided by texts or exemplars.
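A minimal sketch of the optimization-based text-guided variant, assuming a CLIP model from the OpenAI `clip` package as the loss network: a latent code is updated iteratively to increase the CLIP similarity between the generated image and the text prompt. The generator below is a dummy stand-in for the pretrained face generator, and the prompt, latent size, learning rate, and step count are illustrative; none of these details are given in the abstract.

```python
# CLIP-as-loss-network optimization sketch with a dummy generator.
import torch
import clip

clip_model, _ = clip.load("ViT-B/32", device="cpu")
generator = torch.nn.Sequential(             # dummy: latent (1, 512) -> image (1, 3, 224, 224)
    torch.nn.Linear(512, 3 * 224 * 224), torch.nn.Sigmoid(),
    torch.nn.Unflatten(1, (3, 224, 224)),
)

text = clip.tokenize(["a smiling face with glasses"])    # example prompt
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

latent = torch.randn(1, 512, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)

for step in range(50):
    image = generator(latent)
    img_feat = clip_model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()   # push the image toward the text in CLIP space
    opt.zero_grad()
    loss.backward()
    opt.step()
```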
ISBN (print): 9781665405409
Inspired by the remarkable zero-shot generalization capacity of vision-language pre-trained models, we seek to leverage the supervision from a CLIP model to alleviate the burden of data labeling. However, such supervision inevitably contains label noise, which significantly degrades the discriminative power of the classification model. In this work, we propose Transductive CLIP, a novel framework for learning a classification network with noisy labels from scratch. First, a class-conditional contrastive learning mechanism is proposed to mitigate the reliance on pseudo labels and boost the tolerance to noisy labels. Second, an ensemble-label strategy is adopted for pseudo-label updating to stabilize the training of deep neural networks with noisy labels. By combining both techniques, the framework effectively reduces the impact of noisy labels from the CLIP model. Experiments on multiple benchmark datasets demonstrate substantial improvements over other state-of-the-art methods.
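The exact ensemble-label update is not given in the abstract; the sketch below shows one common reading, an exponential moving average of the classifier's soft predictions used as refined pseudo labels. The momentum value and the normalization step are assumptions.

```python
# Hypothetical ensemble-label style pseudo-label refinement.
import torch
import torch.nn.functional as F

def update_ensemble_labels(ensemble, logits, momentum=0.9):
    """ensemble: (N, C) running soft labels; logits: (N, C) current classifier outputs."""
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        ensemble.mul_(momentum).add_(probs, alpha=1.0 - momentum)   # moving average
        ensemble.div_(ensemble.sum(dim=-1, keepdim=True))           # keep rows normalized
    return ensemble

# initial noisy soft labels would come from CLIP zero-shot predictions
ensemble = torch.full((16, 10), 0.1)
new_logits = torch.randn(16, 10)
ensemble = update_ensemble_labels(ensemble, new_logits)
hard_pseudo = ensemble.argmax(dim=-1)   # refined targets for training the classifier
```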