In human-computer dialogue systems, intent recognition is a crucial step: it determines the intentions or purposes of users during their interactions with the system, which allows the system to select appropriate actions or responses. Recognizing user intentions has become increasingly challenging with the rapid evolution of multimedia technology and the widespread use of social media platforms. Traditional unimodal approaches, which rely solely on either textual or visual information, may fail to fully capture the intricacies of user intentions in multimedia content. To address this limitation, the fusion of image and text modalities through multimodal technology has emerged as a promising solution for intent recognition. Compared with single-modality data such as images alone or text alone, multimodal data carries more information and supports more accurate identification of user intentions. In this paper, we propose and construct a multi-intent recognition method based on a vision-language pre-training (VLP) model and a cross-modality multi-head attention mechanism. The method comprises two equally important stages, multimodal representation and multimodal fusion, and explores the integration of image and text data to improve the accuracy of intent recognition in multimedia content. The effectiveness of our approach to multi-intent recognition based on image and text fusion is demonstrated by comparative experiments against the baseline model on a public multimodal intent dataset. This dataset is the first benchmark for intent recognition in real-world multimodal scenes and includes both image and text modalities. The ultimate goal is to support informed decision-making based on interpretable models that account for field-specific observational and experimental aspects.
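The fusion stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for the features a VLP encoder would produce, the learned query/key/value projections of a real attention layer are omitted, and every name here (`cross_modal_attention`, `predict_intents`, `weight_rows`) is hypothetical. It only shows text tokens attending over image regions with several heads, followed by one sigmoid per intent so that more than one intent can be predicted at once.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_feats, image_feats, num_heads):
    """Text tokens (queries) attend over image regions (keys and values).

    Assumption: learned Q/K/V projections are omitted for brevity, so each
    head simply works on its own slice of the feature dimension.
    """
    dim = len(text_feats[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    fused = []
    for query in text_feats:
        token_out = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            q = query[lo:hi]
            # Scaled dot-product score between this token and every region.
            scores = [
                sum(a * b for a, b in zip(q, region[lo:hi])) / math.sqrt(head_dim)
                for region in image_feats
            ]
            weights = softmax(scores)
            # Weighted sum of the image-region slices for this head.
            for d in range(lo, hi):
                token_out.append(
                    sum(w * region[d] for w, region in zip(weights, image_feats))
                )
        fused.append(token_out)
    return fused


def predict_intents(fused, weight_rows, biases, threshold=0.5):
    """Mean-pool fused tokens, then score each intent with its own sigmoid."""
    dim = len(fused[0])
    pooled = [sum(tok[d] for tok in fused) / len(fused) for d in range(dim)]
    probs = []
    for row, bias in zip(weight_rows, biases):
        logit = sum(w * x for w, x in zip(row, pooled)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return [p >= threshold for p in probs], probs


# Placeholder features (assumption): 4 text tokens and 6 image regions, size 8.
rng = random.Random(0)
text_feats = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
image_feats = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
fused = cross_modal_attention(text_feats, image_feats, num_heads=2)

# Untrained classifier weights for 3 hypothetical intent labels.
weight_rows = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
labels, probs = predict_intents(fused, weight_rows, [0.0, 0.0, 0.0])
```

Independent sigmoids are used instead of a single softmax because a multi-intent setting allows several labels to be active for the same image-text pair. In practice the weights would be trained and the inputs would come from the pre-trained encoder.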