Vision-Language Meets the Skeleton: Progressively Distillation With Cross-Modal Knowledge for 3D Action Representation Learning

Authors: Chen, Yang; He, Tian; Fu, Junfeng; Wang, Ling; Guo, Jingcai; Hu, Ting; Cheng, Hong

Affiliations: Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China; Univ Elect Sci & Technol China, Sch Automat Engn, Chengdu 611731, Peoples R China; Hong Kong Polytech Univ, Dept Comp, Hong Kong 999077, Peoples R China; Hong Kong Polytech Univ, Shenzhen Res Inst, Shenzhen 518000, Peoples R China; 1 Orthoped Hosp Chengdu, Chengdu 611731, Peoples R China

Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)

Year/Volume: 2025, Vol. 27

Pages: 2293-2303

Core Indexing:

Subject Classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (degrees conferrable in Engineering or Science)]

Funding: National Key Research and Development Program of China [2022YFE0133100]; Hong Kong RGC General Research Fund [152211/23E, 15216424/24E]; National Natural Science Foundation of China; PolyU Internal Fund [P0043932]; NVIDIA AI Technology Center (NVAITC)

Keywords: Skeleton; Noise measurement; Contrastive learning; Training; Annotations; Three-dimensional displays; Joints; Aerospace electronics; Supervised learning; Representation learning; Action recognition; contrastive learning; cross-modal; LMMs; self-supervised; vision-language

Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and it falls into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification and requires labor-intensive annotation of predefined action categories, while the latter applies skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (CV²L) based on cross-modal contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from vision-language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide how closely the vision-language knowledge prompts and the corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results.
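To make the "soft instance discrimination" idea concrete, below is a minimal PyTorch-style sketch of a cross-modal contrastive loss whose hard one-to-one targets are blended with intra-modal self-similarity softened targets, in the spirit of the abstract. This is not the authors' released implementation; the function name, the temperature tau, and the blending weight alpha are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of cross-modal contrastive learning
# with softened targets. Inputs are paired skeleton and prompt embeddings.
import torch
import torch.nn.functional as F

def soft_cross_modal_loss(skel_feat, prompt_feat, tau=0.1, alpha=0.5):
    """skel_feat, prompt_feat: (N, D) embeddings of N paired skeleton
    sequences and vision-language knowledge prompts."""
    z_s = F.normalize(skel_feat, dim=-1)
    z_p = F.normalize(prompt_feat, dim=-1)

    # Inter-modal similarities: skeleton i vs. prompt j.
    logits = z_s @ z_p.t() / tau                        # (N, N)

    # Hard instance-discrimination target: only the paired (diagonal) entries.
    hard_target = torch.eye(len(z_s), device=z_s.device)

    # Intra-modal self-similarity softens the target so that semantically
    # similar (possibly noisy) pairs are not forcibly pushed apart.
    soft_target = F.softmax(z_s @ z_s.t() / tau, dim=-1)

    # Blend hard and softened targets; both already sum to one per row.
    target = alpha * hard_target + (1.0 - alpha) * soft_target

    # Cross-entropy between the blended target and the cross-modal prediction.
    log_prob = F.log_softmax(logits, dim=-1)
    return -(target * log_prob).sum(dim=-1).mean()
```

In a progressive-distillation setting such as the one described above, the blending weight alpha would presumably be scheduled over training rather than fixed; this sketch shows only a single static blend.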
