Author affiliations: Lanzhou University, School of Information Science & Engineering, Lanzhou 730000, China; University of Guelph, School of Computer Science, Guelph, ON N1G 2W1, Canada; University of Alberta, Department of Electrical & Computer Engineering, Edmonton, AB T6G 1H9, Canada; Zhejiang Sci-Tech University, School of Science, Hangzhou 310018, China; Chongqing University, School of Big Data & Software Engineering, Chongqing 401331, China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans. Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2412-2422
Subject classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (engineering or science degrees conferrable)]
Funding: China Scholarship Council; Science and Technology Project of Gansu [24JRRA388]; Science Foundation of Zhejiang Sci-Tech University [22062338-Y]
Keywords: Transformers; Annotations; Semantics; Feature extraction; Head; Computational modeling; Face recognition; Electronic mail; Collaboration; Correlation; Annotation ambiguity; CLIP; facial expression recognition; multimodal; vision transformer
Abstract: The ever-increasing demand for intuitive interaction in virtual reality has led to surging interest in facial expression recognition (FER). However, several issues are commonly seen in existing methods, including narrow receptive fields and homogeneous supervisory signals. To address these issues, this paper proposes a novel multimodal supervision-steering transformer for facial expression recognition in the wild, referred to as FER-former. Specifically, to overcome the limitation of narrow receptive fields, a hybrid feature-extraction pipeline is designed by cascading prevailing CNNs and transformers. To deal with the issue of homogeneous supervisory signals, a heterogeneous domain-steering supervision module is proposed that incorporates text-space semantic correlations to enhance image features, based on the similarity between image and text features. Additionally, a FER-specific transformer encoder is introduced to characterize conventional one-hot label-focusing tokens and CLIP-based text-oriented tokens in parallel for final classification. Through the collaboration of these multifarious token heads, global receptive fields with multimodal semantic cues are captured, delivering strong learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
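The heterogeneous supervision described in the abstract relies on measuring similarity between image features and CLIP-style text embeddings of the expression classes. The following is a minimal NumPy sketch of that similarity-to-logits step only; the function name, temperature value, and toy dimensions are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def clip_style_logits(image_feats, text_feats, temperature=0.07):
    """Cosine-similarity logits between L2-normalized image features and
    per-class text embeddings (CLIP-style matching; illustrative sketch)."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    # Each row holds one image's similarity to every class-text embedding,
    # scaled by a softmax temperature.
    return img @ txt.T / temperature

# Toy example: 2 images and 3 expression-class text embeddings of dim 4.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(2, 4))
text_feats = rng.normal(size=(3, 4))

logits = clip_style_logits(image_feats, text_feats)
# Softmax over classes turns similarities into a class distribution,
# which a cross-entropy loss could then supervise.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```

In the paper's setting these class distributions would steer the image encoder toward text-space semantics; here the sketch only shows the similarity computation itself.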