Author Affiliations: Macau Univ Sci & Technol, Fac Innovat Engn, Sch Comp Sci & Engn, Macau 999078, Peoples R China; Chinese Acad Sci (CASIA), Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China; Beijing Key Lab Knowledge Engn Mat Sci, Beijing 100083, Peoples R China; Mashang Consumer Finance Co Ltd, Chongqing 400000, Peoples R China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2570-2581
Core Indexing:
Subject Classification: 0810 [Engineering - Information and Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (engineering or science degrees may be conferred)]
Funding: National Key Research and Development Program of China [2021YFE0205700]; Beijing Natural Science Foundation [JQ23016]; Chinese National Natural Science Foundation Projects; Science and Technology Development Fund of Macau Project [0123/2022/A3, 0044/2024/AGJ, 0070/2020/AMJ]
Keywords: Three-dimensional displays; Facial animation; Faces; Visualization; Feature extraction; Face recognition; Solid modeling; Synchronization; Decoding; Acoustics; Speech-driven 3D facial animation; PMMTalk; 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset
Abstract: Speech-driven 3D facial animation has improved substantially in recent years, yet most related works use only the acoustic modality and neglect visual and textual cues, leading to unsatisfactory precision and coherence. We argue that visual and textual cues are not trivial information. We therefore present a novel framework, PMMTalk, which uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation. The framework comprises three modules: a PMMTalk encoder, a cross-modal alignment module, and a PMMTalk decoder. Specifically, the PMMTalk encoder employs off-the-shelf talking-head generation and speech recognition technology to extract visual and textual information from speech, respectively. The cross-modal alignment module then aligns the audio-image-text features at the temporal and semantic levels. Finally, the PMMTalk decoder predicts lip-synced facial blendshape coefficients. In contrast to prior methods, PMMTalk requires only an additional random reference face image yet yields more accurate results. It is also artist-friendly: because it outputs facial blendshape coefficients, it integrates seamlessly into standard animation production workflows. Given the scarcity of 3D talking-face datasets, we further introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Code and datasets are available at PMMTalk.
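The abstract describes a three-stage pipeline: a pseudo multi-modal encoder that derives visual and textual features from speech, a cross-modal alignment module, and a decoder that outputs blendshape coefficients. The following is a minimal PyTorch sketch of that data flow only; all dimensions, module internals, and the use of cross-attention for alignment are illustrative assumptions, not the authors' implementation (the paper itself uses off-the-shelf talking-head generation and speech recognition models for the two branches).

```python
# Hypothetical sketch of the PMMTalk-style pipeline described in the abstract.
# Dimensions and module bodies are placeholders, not the published method.
import torch
import torch.nn as nn


class PseudoMultiModalEncoder(nn.Module):
    """Stand-in for the PMMTalk encoder: the paper extracts visual and
    textual features from speech; here both branches are placeholder MLPs."""
    def __init__(self, audio_dim=768, feat_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.visual_branch = nn.Linear(audio_dim, feat_dim)  # stand-in for talking-head features
        self.text_branch = nn.Linear(audio_dim, feat_dim)    # stand-in for ASR/text features

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. from a pretrained speech model
        return (self.audio_proj(audio_feats),
                self.visual_branch(audio_feats),
                self.text_branch(audio_feats))


class CrossModalAlignment(nn.Module):
    """One plausible reading of 'temporal and semantic alignment':
    cross-attention with audio as the query over visual+text features."""
    def __init__(self, feat_dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, audio, visual, text):
        context = torch.cat([visual, text], dim=1)           # (batch, 2*frames, feat_dim)
        fused, _ = self.attn(audio, context, context)
        return self.norm(audio + fused)                      # residual + norm


class BlendshapeDecoder(nn.Module):
    """Maps fused features to per-frame blendshape coefficients
    (52 is the common ARKit count; the abstract does not fix this)."""
    def __init__(self, feat_dim=256, num_blendshapes=52):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, num_blendshapes))

    def forward(self, fused):
        return self.head(fused)                              # (batch, frames, num_blendshapes)


if __name__ == "__main__":
    audio_feats = torch.randn(2, 100, 768)                   # dummy speech features
    enc, align, dec = PseudoMultiModalEncoder(), CrossModalAlignment(), BlendshapeDecoder()
    coeffs = dec(align(*enc(audio_feats)))
    print(coeffs.shape)                                      # torch.Size([2, 100, 52])
```

Predicting blendshape coefficients rather than raw vertex offsets is what makes the output artist-friendly: the coefficients drive a standard rig directly, so the result plugs into existing animation tooling.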