Author affiliations: Lanzhou University, School of Information Science & Engineering, Lanzhou 730000, China; University of Guelph, School of Computer Science, Guelph, ON N1G 2W1, Canada; University of Alberta, Department of Electrical & Computer Engineering, Edmonton, AB T6G 1H9, Canada; Zhejiang Sci-Tech University, School of Science, Hangzhou 310018, China; Chongqing University, School of Big Data & Software Engineering, Chongqing 401331, China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans. Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2412-2422
Subject classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (engineering or science degrees conferrable)]
Funding: China Scholarship Council; Science and Technology Project of Gansu [24JRRA388]; Science Foundation of Zhejiang Sci-Tech University [22062338-Y]
Keywords: Transformers; Annotations; Semantics; Feature extraction; Head; Computational modeling; Face recognition; Electronic mail; Collaboration; Correlation; Annotation ambiguity; CLIP; facial expression recognition; multimodal; vision transformer
Abstract: The ever-increasing demand for intuitive interaction in virtual reality has led to surging interest in facial expression recognition (FER). However, several issues are commonly seen in existing methods, including narrow receptive fields and homogeneous supervisory signals. To address these issues, this paper proposes a novel multimodal supervision-steering transformer for facial expression recognition in the wild, referred to as FER-former. Specifically, to overcome the limitation of narrow receptive fields, a hybrid feature-extraction pipeline is designed by cascading prevailing CNNs and transformers. To deal with the issue of homogeneous supervisory signals, a heterogeneous domain-steering supervision module is proposed that incorporates text-space semantic correlations to enhance image features, based on the similarity between image and text features. Additionally, a FER-specific transformer encoder is introduced to characterize conventional one-hot label-focusing tokens and CLIP-based text-oriented tokens in parallel for final classification. Through the collaboration of these multifarious token heads, global receptive fields with multimodal semantic cues are captured, delivering strong learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
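The heterogeneous supervision described in the abstract relies on measuring similarity between image features and CLIP-style text embeddings of the expression classes. The following is a minimal NumPy sketch of that similarity-to-logits step only; the function name, temperature value, and toy dimensions are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def clip_style_logits(image_feats, text_feats, temperature=0.07):
    """Cosine-similarity logits between L2-normalized image features and
    per-class text embeddings (CLIP-style matching; illustrative sketch)."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    # Each row holds one image's similarity to every class-text embedding,
    # scaled by a softmax temperature.
    return img @ txt.T / temperature

# Toy example: 2 images and 3 expression-class text embeddings of dim 4.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(2, 4))
text_feats = rng.normal(size=(3, 4))

logits = clip_style_logits(image_feats, text_feats)
# Softmax over classes turns similarities into a class distribution,
# which a cross-entropy loss could then supervise.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

In the paper's setting these class distributions would steer the image encoder toward text-space semantics; here the sketch only shows the similarity computation itself.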