PMMTalk: Speech-Driven 3D Facial Animation From Complementary Pseudo Multi-Modal Features

Authors: Han, Tianshun; Gui, Shengnan; Huang, Yiqing; Li, Baihui; Liu, Lijian; Zhou, Benjia; Jiang, Ning; Lu, Quan; Zhi, Ruicong; Liang, Yanyan; Zhang, Du; Wan, Jun

Author Affiliations: Macau Univ Sci & Technol, Fac Innovat Engn, Sch Comp Sci & Engn, Macau 999078, Peoples R China; Chinese Acad Sci CASIA, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China; Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing 100083, Peoples R China; Beijing Key Lab Knowledge Engn Mat Sci, Beijing 100083, Peoples R China; Mashang Consumer Finance Co Ltd, Chongqing 400000, Peoples R China

Published in: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)

Year/Volume: 2025, Vol. 27

Pages: 2570-2581

Subject Classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (degrees conferrable in Engineering or Science)]

Funding: National Key Research and Development Program of China [2021YFE0205700]; Beijing Natural Science Foundation [JQ23016]; Chinese National Natural Science Foundation Projects; Science and Technology Development Fund of Macau Project [0123/2022/A3, 0044/2024/AGJ, 0070/2020/AMJ]

Keywords: Three-dimensional displays; Facial animation; Faces; Visualization; Feature extraction; Face recognition; Solid modeling; Synchronization; Decoding; Acoustics; Speech-driven 3D facial animation; PMMTalk; 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset

Abstract: Speech-driven 3D facial animation has improved considerably in recent years, yet most related works utilize only the acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, which uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation. The framework comprises three modules: the PMMTalk encoder, the cross-modal alignment module, and the PMMTalk decoder. Specifically, the PMMTalk encoder employs off-the-shelf talking head generation and speech recognition technology to extract visual and textual information from speech, respectively. Following this, the cross-modal alignment module aligns the audio-image-text features at the temporal and semantic levels. Subsequently, the PMMTalk decoder predicts lip-synced facial blendshape coefficients. In contrast to prior methods, PMMTalk requires only an additional random reference face image yet yields more accurate results. It is also artist-friendly, as it integrates seamlessly into standard animation production workflows by outputting facial blendshape coefficients. Finally, given the scarcity of 3D talking-face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Codes and datasets are available at PMMTalk.
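To make the pipeline described in the abstract concrete, below is a minimal, illustrative PyTorch-style sketch of the three-module design: stand-in encoder projections for the pseudo visual/textual features, a cross-modal alignment step, and a decoder predicting per-frame blendshape coefficients. This is not the authors' implementation; all module names, feature dimensions, the use of a single attention layer for alignment, and the blendshape count are assumptions made purely for illustration.

# Illustrative sketch only (not the paper's code): a minimal outline of the
# encoder / alignment / decoder pipeline described in the abstract.
# All dimensions and module choices below are assumptions.
import torch
import torch.nn as nn

class PMMTalkSketch(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=512, text_dim=512,
                 hidden_dim=256, num_blendshapes=52):
        super().__init__()
        # "PMMTalk encoder": in the paper, off-the-shelf talking-head generation
        # and speech recognition models derive pseudo visual/textual features
        # from speech; here they are replaced by simple linear projections.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Cross-modal alignment: a single multi-head attention layer standing in
        # for the temporal/semantic alignment module.
        self.align = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        # "PMMTalk decoder": maps aligned features to facial blendshape coefficients.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_blendshapes),
        )

    def forward(self, audio_feat, visual_feat, text_feat):
        # audio_feat: (B, T, audio_dim); visual_feat / text_feat are per-frame too.
        a = self.audio_proj(audio_feat)
        v = self.visual_proj(visual_feat)
        t = self.text_proj(text_feat)
        # Audio queries attend over the concatenated visual and textual cues.
        context = torch.cat([v, t], dim=1)
        aligned, _ = self.align(a, context, context)
        return self.decoder(aligned)  # (B, T, num_blendshapes)

# Usage with random tensors, purely to show the expected shapes.
model = PMMTalkSketch()
coeffs = model(torch.randn(1, 100, 768), torch.randn(1, 100, 512), torch.randn(1, 100, 512))
print(coeffs.shape)  # torch.Size([1, 100, 52])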
