Search Results - Inner Mongolia University Library

HUMAN-COMPUTER DIALOGUE SYSTEMS-BASED image-text VISUALIZATION, FUSION AND MULTI-INTENT MODELS' CONSTRUCTION

FRACTALS-COMPLEX GEOMETRY PATTERNS AND SCALING IN NATURE AND SOCIETY, 2025

Authors: Meng, Wang; Obaiys, Suzan J.; Karaca, Yeliz; Jamaludin, Nur Amalina Binti

Affiliations: Univ Malaya, Fac Comp Sci & Informat Technol, Dept Comp Syst & Technol, Kuala Lumpur, Malaysia; Univ Massachusetts Chan Med Sch (UMASS), 55 Lake Ave North, Worcester, MA 01655, USA; Massachusetts Inst Technol (MIT), 77 Massachusetts Ave, Cambridge, MA 02139, USA; UPNM Natl Def Univ Malaysia, Ctr Fdn Studies, Kuala Lumpur 57000, Malaysia

In human-computer dialogue systems, intent recognition is a crucial task. It determines the intentions or purposes of users during their interactions with the system, which allows the system to choose appropriate actions or responses. Recognizing user intentions has become increasingly challenging with the rapid evolution of multimedia technology and the widespread use of social media platforms. Traditional unimodal approaches, which rely solely on either textual or visual information, may fail to capture the full intricacy of user intentions in multimedia content. To address this limitation, fusing the image and text modalities through multimodal technology has emerged as a promising solution for intent recognition. Compared with single-modality data such as images or text alone, multimodal data can carry more information and can therefore identify user intentions more accurately. In this paper, we propose and construct a multi-intent recognition method based on a vision-language pre-training (VLP) model and a cross-modality multi-head attention mechanism. The method consists of two equally important stages, multimodal representation and multimodal fusion, which explore how integrating image and text data can improve the accuracy of intent recognition in multimedia content. The effectiveness of the proposed image-text fusion approach for multi-intent recognition is demonstrated through comparative experiments against the baseline model on a public multimodal intent dataset. This dataset is the first benchmark for intent recognition in real-world multimodal scenes that includes both image and text modalities. The ultimate goal is to support informed decision-making based on interpretable models that incorporate field-specific observational and experimental aspects.
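The pipeline outlined in the abstract has three steps: multimodal representation, fusion with cross-modality multi-head attention, and multi-intent prediction. A minimal sketch can make these steps concrete. It assumes that the VLP encoders have already produced per-token text features and per-region image features of the same width, and the toy vectors below stand in for those outputs. Several choices here are illustrative assumptions, not the authors' published architecture: text tokens act as queries over image regions, the projections are identities, fusion is a residual connection followed by mean pooling, and each intent gets its own sigmoid. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    # Numerically stable softmax: subtract the max score before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text: Matrix, image: Matrix, num_heads: int) -> Matrix:
    """Text tokens (queries) attend over image regions (keys and values).

    Each width-d vector is split into num_heads slices of width d // num_heads;
    every head runs scaled dot-product attention on its own slice, and the head
    outputs are concatenated back to width d. Assumption: learned Q/K/V/output
    projections are omitted (identity matrices) to keep the sketch short.
    """
    d = len(text[0])
    if d % num_heads != 0:
        raise ValueError("feature width must be divisible by num_heads")
    dh = d // num_heads
    fused: Matrix = []
    for q in text:
        out: List[float] = []
        for h in range(num_heads):
            lo, hi = h * dh, (h + 1) * dh
            # One scaled attention score per image region for this head's slice.
            scores = [
                sum(q[i] * k[i] for i in range(lo, hi)) / math.sqrt(dh) for k in image
            ]
            weights = softmax(scores)
            # Weighted sum of the image regions' value slices.
            for i in range(lo, hi):
                out.append(sum(w * v[i] for w, v in zip(weights, image)))
        fused.append(out)
    return fused


def predict_intents(
    text: Matrix,
    image: Matrix,
    weights: Matrix,
    bias: List[float],
    num_heads: int = 2,
    threshold: float = 0.5,
) -> Tuple[List[float], List[int]]:
    """Fuse the two modalities, mean-pool, and score each intent independently."""
    attended = cross_modal_attention(text, image, num_heads)
    # Residual connection (assumed): add the text features to the attended image features.
    fused = [[t + a for t, a in zip(t_row, a_row)] for t_row, a_row in zip(text, attended)]
    # Mean-pool over text tokens to obtain one fused vector.
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    # Multi-label head (assumed): one sigmoid per intent, so several intents can fire.
    probs = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, pooled)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))
    labels = [i for i, p in enumerate(probs) if p >= threshold]
    return probs, labels


# Toy inputs standing in for VLP encoder outputs (d = 4 features); values are made up.
text_feats = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]  # 2 text tokens
image_feats = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]  # 3 regions
intent_w = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, -1.0, -1.0], [0.5, -0.5, 0.5, -0.5]]  # 3 intents
intent_b = [0.0, 0.0, 0.0]

probs, labels = predict_intents(text_feats, image_feats, intent_w, intent_b)
print("intent probabilities:", [round(p, 3) for p in probs])
print("predicted intents:", labels)
```

The head uses an independent sigmoid for each intent rather than a single softmax over all intents. This is what allows one image-text pair to receive several intent labels at once. In a trained model, the projections and classifier weights would be learned rather than set by hand.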

Keywords: Human-Computer Dialogue Systems; Multi-Intent Recognition; Vision-Language Pre-Training; Deep Learning; Feature Representation; Image-Text Data Processing; Multimodal Pre-Training Model; BERT; Computational Linguistics; Multimodal Fusion; Multi-Head Attention Mechanism; Sequential Information; Multimodal Fusion Construction Model
