MMTF-DES: A fusion of multimodal transformer models for desire, emotion, and sentiment analysis of social media data

Authors: Aziz, Abdul; Chowdhury, Nihad Karim; Kabir, Muhammad Ashad; Chy, Abu Nowshed; Siddique, Md. Jawad

Author affiliations: Univ Chittagong, Dept Comp Sci & Engn, Chattogram 4331, Bangladesh; Charles Sturt Univ, Sch Comp Math & Engn, Bathurst, NSW 2795, Australia; Southern Illinois Univ, Dept Comp Sci, Carbondale, IL 62901, USA

Publication: Neurocomputing

Year/Volume: 2025, Vol. 623

Core indexing:

Subject classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]

Keywords: Human desire understanding; Desire analysis; Sentiment analysis; Emotion analysis; Multimodal transformer; Vision-language models

Abstract: Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing various aspects of decision-making, communication, and social interactions. Their analysis, particularly in the context of multimodal data (such as images and texts) from social media, provides profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies overlooked the use of image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we have proposed a unified multimodal-based framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of our proposed method lies in the encoder module, which is built using two state-of-the-art multimodal vision-language models (VLMs). To effectively extract visual and contextualized embedding features from social media image and text pairs, we jointly fine-tune two pre-trained multimodal VLMs: Vision-and-Language Transformer (ViLT) and Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we use an early fusion strategy on these embedding features to obtain combined diverse feature representations. Moreover, we leverage a multi-sample dropout mechanism to enhance the generalization ability and expedite the training process of our proposed method. To evaluate our proposed approach, we used the multimodal dataset MSED for the human desire understanding task. Through our experimental evaluation, we demonstrate that our method excels in capturing both visual and contextual information, resulting in superior performance compared to other state-of-the-art techniques. Specifically, our method outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.
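
The abstract names three concrete design elements: jointly fine-tuned ViLT and VAuLT encoders, early fusion of their image-text embeddings, and a multi-sample dropout classifier head. The PyTorch sketch below illustrates how such a fusion head could be wired up. It is a minimal illustration under assumptions, not the authors' implementation: both encoders are assumed to expose a `pooler_output` of the same hidden size, VAuLT is passed in as a generic placeholder module, and the names `MultiSampleDropoutHead` and `EarlyFusionModel` are hypothetical.

```python
# Minimal sketch of an early-fusion + multi-sample-dropout head (assumptions noted above).
import torch
import torch.nn as nn


class MultiSampleDropoutHead(nn.Module):
    """Classifier head that averages logits over several dropout samples."""

    def __init__(self, in_dim: int, num_classes: int,
                 num_samples: int = 5, p: float = 0.3):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each dropout mask yields one logit sample; averaging them acts as
        # a regularizer and tends to stabilize training of the shared head.
        logits = [self.classifier(drop(x)) for drop in self.dropouts]
        return torch.stack(logits, dim=0).mean(dim=0)


class EarlyFusionModel(nn.Module):
    """Concatenate pooled embeddings from two VLM encoders, then classify."""

    def __init__(self, vilt_encoder: nn.Module, vault_encoder: nn.Module,
                 hidden_size: int, num_classes: int):
        super().__init__()
        # e.g. vilt_encoder could be transformers.ViltModel.from_pretrained(
        #      "dandelin/vilt-b32-mlm"); vault_encoder is a placeholder here.
        self.vilt = vilt_encoder
        self.vault = vault_encoder
        self.head = MultiSampleDropoutHead(2 * hidden_size, num_classes)

    def forward(self, vilt_inputs: dict, vault_inputs: dict) -> torch.Tensor:
        vilt_emb = self.vilt(**vilt_inputs).pooler_output     # (B, H)
        vault_emb = self.vault(**vault_inputs).pooler_output  # (B, H)
        fused = torch.cat([vilt_emb, vault_emb], dim=-1)      # early fusion
        return self.head(fused)
```

Concatenating the two pooled embeddings before the classifier (early fusion) lets a single head see both encoders' views of the image-text pair jointly, while averaging logits over several dropout masks per forward pass is the multi-sample dropout trick the abstract credits with better generalization and faster convergence.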
