Author Affiliations: Univ Chittagong, Dept Comp Sci & Engn, Chattogram 4331, Bangladesh; Charles Sturt Univ, Sch Comp Math & Engn, Bathurst NSW 2795, Australia; Southern Illinois Univ, Dept Comp Sci, Carbondale IL 62901, USA
Publication: Neurocomputing (NEUROCOMPUTING)
Year/Volume: 2025, Vol. 623
Core Indexing:
Subject Classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]
Keywords: Human desire understanding; Desire analysis; Sentiment analysis; Emotion analysis; Multimodal transformer; Vision-language models
Abstract: Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing various aspects of decision-making, communication, and social interactions. Their analysis, particularly in the context of multimodal data (such as images and texts) from social media, provides profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies overlooked the use of image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we propose a unified multimodal framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of the proposed method lies in the encoder module, which is built using two state-of-the-art multimodal vision-language models (VLMs). To effectively extract visual and contextualized embedding features from social media image and text pairs, we jointly fine-tune two pre-trained multimodal VLMs: the Vision-and-Language Transformer (ViLT) and the Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we apply an early fusion strategy to these embedding features to obtain combined, diverse feature representations. Moreover, we leverage a multi-sample dropout mechanism to enhance the generalization ability and expedite the training process of the proposed method. To evaluate our approach, we used the multimodal dataset MSED for the human desire understanding task. Through our experimental evaluation, we demonstrate that our method excels in capturing both visual and contextual information, resulting in superior performance compared to other state-of-the-art techniques. Specifically, our method outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.
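Note: The following is a minimal PyTorch sketch of the general idea described in the abstract (two VLM encoders, early fusion by concatenation, multi-sample dropout over a shared classifier head), not the authors' implementation. ViLT is loaded from the public Hugging Face checkpoint "dandelin/vilt-b32-mlm"; the "vault_encoder" argument, the 768-dimensional pooled outputs, the dropout rate, and the number of dropout samples are illustrative assumptions, since VAuLT is not distributed as a standard library model.

    # Sketch: early fusion of two VLM embeddings + multi-sample dropout classifier.
    import torch
    import torch.nn as nn
    from transformers import ViltModel

    class FusedDesireClassifier(nn.Module):
        def __init__(self, num_classes, vault_encoder, hidden=768, n_drop=5, p=0.3):
            super().__init__()
            self.vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
            self.vault = vault_encoder  # assumed VAuLT-style encoder returning (B, hidden)
            # several independent dropout masks for multi-sample dropout
            self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_drop))
            self.classifier = nn.Linear(2 * hidden, num_classes)

        def forward(self, vilt_inputs, vault_inputs):
            z_vilt = self.vilt(**vilt_inputs).pooler_output   # (B, hidden) image-text embedding
            z_vault = self.vault(**vault_inputs)               # (B, hidden), assumed pooled output
            fused = torch.cat([z_vilt, z_vault], dim=-1)       # early fusion by concatenation
            # multi-sample dropout: average the logits produced under each dropout mask
            logits = torch.stack([self.classifier(d(fused)) for d in self.dropouts])
            return logits.mean(dim=0)

Averaging the classifier outputs over several dropout masks (rather than a single pass) is what the abstract refers to as multi-sample dropout; it tends to stabilize training and improve generalization of the shared head.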