Multimodal sentiment analysis aims to accurately assess the sentiment expressed in a given data source by integrating and analyzing multiple modalities, such as text and images. Extracting discriminative features for sentiment prediction is a powerful approach to addressing the challenges in multimodal sentiment analysis. Most methods in this domain leverage pre-trained unimodal models to extract features from individual modalities, which are then integrated via sophisticated fusion mechanisms. However, these unimodal models are often limited in their ability to process multimodal data jointly, risking the loss of semantic associations between the different modalities. This study addresses this problem by developing a simple end-to-end model that avoids the need for sophisticated ensemble techniques for feature extraction. Instead, the proposed method capitalizes on transfer learning by deploying a vision-language pre-trained model, which efficiently extracts both visual and textual features within a single framework. The extracted features are then integrated via the proposed feature interaction module, which captures latent semantic information in an image-text pair through explicit and implicit feature interaction. Finally, the resulting representations are passed to the classification module, improving performance on sentiment analysis tasks. The effectiveness of the proposed approach is substantiated through a rigorous experimental evaluation: assessments on two publicly available real-world datasets reveal significant improvements in sentiment analysis performance.
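As an illustration of the kind of explicit and implicit feature interaction described above, the following sketch fuses text and image token features with a cross-attention branch (implicit) and an element-wise product branch (explicit). The module design, feature dimensions, and pooling choices are assumptions for demonstration, not the paper's exact architecture.

```python
# Illustrative sketch only: the abstract does not specify the interaction module's
# exact design, so the explicit/implicit split and layer choices are assumptions.
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # implicit interaction: cross-modal attention from text tokens to image tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # explicit interaction: fuse pooled features by concatenation + projection
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, D), image_feats: (B, Li, D)
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        t = attended.mean(dim=1)      # pooled text feature attended over image tokens
        v = image_feats.mean(dim=1)   # pooled image feature
        explicit = t * v              # element-wise product as the explicit interaction
        return self.fuse(torch.cat([t, v, explicit], dim=-1))

# usage with random tensors standing in for the vision-language encoder outputs
txt = torch.randn(2, 32, 512)
img = torch.randn(2, 49, 512)
fused = FeatureInteraction()(txt, img)   # (2, 512), fed to the sentiment classifier
```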
Text-based person search aims to retrieve the most relevant pedestrian images from an image gallery based on textual descriptions. Most existing methods rely on two separate encoders to extract the image and text features, and then elaborately design various schemes to bridge the gap between the image and text modalities. However, the shallow interaction between the two modalities in these methods is still insufficient to eliminate the modality gap. To address this problem, we propose TransTPS, a transformer-based framework that enables deeper interaction between the two modalities through the transformer's self-attention mechanism, effectively alleviating the modality gap. In addition, owing to the small inter-class variance and large intra-class variance in the image modality, we further develop two techniques to overcome these limitations. Specifically, Cross-modal Multi-Granularity Matching (CMGM) is proposed to address the problem caused by small inter-class variance and to facilitate distinguishing pedestrians with similar appearance. Furthermore, a Contrastive Loss with Weakly Positive pairs (CLWP) is introduced to mitigate the impact of large intra-class variance and to contribute to the retrieval of more target images. Experiments on the CUHK-PEDES and RSTPReid datasets demonstrate that our proposed framework achieves state-of-the-art performance compared to previous methods.
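The abstract does not give the CLWP formulation; the sketch below shows one plausible reading, a soft-target contrastive loss in which image-text pairs that share a person ID but are not the annotated pair are kept as weakly positive targets. The weighting scheme, temperature, and embedding size are illustrative assumptions.

```python
# Hypothetical sketch of a contrastive loss with weakly positive pairs; the
# paper's exact CLWP formulation may differ. Requires PyTorch >= 1.10 for
# soft-target cross entropy.
import torch
import torch.nn.functional as F

def clwp_loss(img_emb, txt_emb, person_ids, temperature=0.07, weak_weight=0.5):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    same_id = person_ids.unsqueeze(0) == person_ids.unsqueeze(1)
    strong = torch.eye(len(person_ids), dtype=torch.bool)
    # soft targets: 1.0 for the annotated pair, weak_weight for other same-ID pairs
    targets = strong.float() + weak_weight * (same_id & ~strong).float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    return F.cross_entropy(logits, targets)                # soft-label cross entropy

loss = clwp_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randint(0, 4, (8,)))
```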
ISBN (print): 9798350349405; 9798350349399
In recent studies on domain adaptation, significant emphasis has been placed on learning shared knowledge from a source domain and transferring it to a target domain. Recently, the large vision-language pre-trained model CLIP has shown a strong ability in zero-shot recognition, and parameter-efficient tuning can further improve its performance on specific tasks. This work demonstrates that a simple domain prior boosts CLIP's zero-shot recognition in a specific domain. Moreover, CLIP's adaptation relies less on source-domain data owing to its diverse pre-training dataset. Furthermore, we create a benchmark for zero-shot adaptation and pseudo-labeling-based self-training with CLIP. Last but not least, we propose to improve the task generalization ability of CLIP from multiple unlabeled domains, which is a more practical and unique scenario. We believe our findings motivate a rethinking of domain adaptation benchmarks and the associated role of related algorithms in the era of CLIP.
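As a rough illustration of injecting a domain prior into CLIP's zero-shot recognition, the sketch below simply places the domain name in the prompt template. It assumes the OpenAI `clip` package; the template, class names, domain name, and image path (`example.jpg`) are placeholders rather than the paper's actual setup.

```python
# Zero-shot classification with a simple domain prior in the prompt.
# Assumes the OpenAI CLIP package: https://github.com/openai/CLIP
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["dog", "elephant", "giraffe"]     # placeholder class names
domain = "sketch"                            # the domain prior, e.g. a target domain name

prompts = clip.tokenize([f"a {domain} of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.t()).softmax(dim=-1)

print(classes[probs.argmax().item()])
```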
ISBN (print): 9783031270765; 9783031270772
Facial image inpainting aims to fill in visually realistic and semantically new content for masked or missing pixels in a face image. Although current methods have made progress toward high visual quality, the controllable diversity of face inpainting remains an open issue. This paper proposes a new facial image inpainting interaction mode that fills in semantic content based on texts or exemplar images provided by users. We use the powerful image-text representation abilities of the recently introduced Contrastive Language-Image Pre-training (CLIP) models to achieve this interactive face inpainting. We present our method in two parts. Specifically, we first explore a simple and effective optimization-based text-guided facial inpainting method in which a CLIP model is used as a loss network to iteratively modify the latent code in response to the text prompt. Next, we describe a multi-modal inpainting mapper network that maps the input conditions (e.g., text or image) into corresponding latent code changes, supporting guidance by different text prompts and exemplars within one model. We also introduce an exemplar-semantic similarity loss, which maps the inpainted facial image and the exemplar image into CLIP's embedding space to measure their similarity. This loss enables the generated image to include high-level semantic attributes from the exemplar image. Through extensive experiments, we demonstrate the effectiveness of our method in interactive facial inpainting guided by texts or exemplars.
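A minimal sketch of the optimization-based text-guided variant, assuming a CLIP model from the OpenAI `clip` package as the loss network: a latent code is updated iteratively to increase the CLIP similarity between the generated image and the text prompt. The generator below is a dummy stand-in for the pretrained face generator, and the prompt, latent size, learning rate, and step count are illustrative; none of these details are given in the abstract.

```python
# CLIP-as-loss-network optimization sketch with a dummy generator.
import torch
import clip

clip_model, _ = clip.load("ViT-B/32", device="cpu")
generator = torch.nn.Sequential(             # dummy: latent (1, 512) -> image (1, 3, 224, 224)
    torch.nn.Linear(512, 3 * 224 * 224), torch.nn.Sigmoid(),
    torch.nn.Unflatten(1, (3, 224, 224)),
)

text = clip.tokenize(["a smiling face with glasses"])    # example prompt
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

latent = torch.randn(1, 512, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)

for step in range(50):
    image = generator(latent)
    img_feat = clip_model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()   # push the image toward the text in CLIP space
    opt.zero_grad()
    loss.backward()
    opt.step()
```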
ISBN (print): 9781665405409
Inspired by the remarkable zero-shot generalization capacity of vision-language pre-trained models, we seek to leverage the supervision from a CLIP model to alleviate the burden of data labeling. However, such supervision inevitably contains label noise, which significantly degrades the discriminative power of the classification model. In this work, we propose Transductive CLIP, a novel framework for learning a classification network with noisy labels from scratch. First, a class-conditional contrastive learning mechanism is proposed to mitigate the reliance on pseudo labels and boost the tolerance to noisy labels. Second, an ensemble-label strategy is adopted for pseudo-label updating to stabilize the training of deep neural networks with noisy labels. By combining both techniques, the framework effectively reduces the impact of noisy labels from the CLIP model. Experiments on multiple benchmark datasets demonstrate substantial improvements over other state-of-the-art methods.
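The exact ensemble-label update is not given in the abstract; the sketch below shows one common reading, an exponential moving average of the classifier's soft predictions used as refined pseudo labels. The momentum value and the normalization step are assumptions.

```python
# Hypothetical ensemble-label style pseudo-label refinement.
import torch
import torch.nn.functional as F

def update_ensemble_labels(ensemble, logits, momentum=0.9):
    """ensemble: (N, C) running soft labels; logits: (N, C) current classifier outputs."""
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        ensemble.mul_(momentum).add_(probs, alpha=1.0 - momentum)   # moving average
        ensemble.div_(ensemble.sum(dim=-1, keepdim=True))           # keep rows normalized
    return ensemble

# initial noisy soft labels would come from CLIP zero-shot predictions
ensemble = torch.full((16, 10), 0.1)
new_logits = torch.randn(16, 10)
ensemble = update_ensemble_labels(ensemble, new_logits)
hard_pseudo = ensemble.argmax(dim=-1)   # refined targets for training the classifier
```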