Author Affiliations: Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China; Univ Elect Sci & Technol China, Sch Automat Engn, Chengdu 611731, Peoples R China; Hong Kong Polytech Univ, Dept Comp, Hong Kong 999077, Peoples R China; Hong Kong Polytech Univ, Shenzhen Res Inst, Shenzhen 518000, Peoples R China; 1 Orthoped Hosp Chengdu, Chengdu 611731, Peoples R China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2293-2303
Core Indexing:
Subject Classification: 0810 [Engineering - Information and Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (degrees awardable in Engineering or Science)]
Funding: National Key Research and Development Program of China [2022YFE0133100]; Hong Kong RGC General Research Fund [152211/23E, 15216424/24E]; National Natural Science Foundation of China; PolyU Internal Fund [P0043932]; NVIDIA AI Technology Center (NVAITC)
Keywords: Skeleton; Noise measurement; Contrastive learning; Training; Annotations; Three-dimensional displays; Joints; Aerospace electronics; Supervised learning; Representation learning; Action recognition; contrastive learning; cross-modal; LMMs; self-supervised; vision-language
Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and can be categorized into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification and requires labor-intensive annotation of predefined action categories, while the latter applies skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C²VL) based on cross-modal contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from vision-language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets that progressively control and guide how strongly vision-language knowledge prompts and their corresponding skeletons are pulled together during cross-modal representation learning. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During inference, our method requires only skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results.
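To make the soft instance discrimination idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of a cross-modal contrastive loss whose one-hot matching target is softened by intra-modal self-similarity and inter-modal cross-consistency. The function name, temperatures, and blending weights lambda_self and lambda_cross are illustrative assumptions, and the paper's progressive control of the blending during training is replaced here by fixed coefficients for brevity.

import torch
import torch.nn.functional as F

def soft_cross_modal_loss(z_skel, z_vl, tau=0.1, tau_t=0.05,
                          lambda_self=0.3, lambda_cross=0.3):
    """Soft instance discrimination between skeleton and vision-language embeddings.
    The one-hot matching target is blended with intra-modal self-similarity
    (skeleton vs. skeleton) and inter-modal cross-consistency (prompt vs. skeleton)
    soft targets. Illustrative sketch only; hyperparameters are assumptions."""
    z_skel = F.normalize(z_skel, dim=-1)   # (B, D) skeleton encoder outputs
    z_vl = F.normalize(z_vl, dim=-1)       # (B, D) vision-language prompt embeddings
    batch = z_skel.size(0)

    # Cross-modal similarity logits to be trained (skeleton -> prompt).
    logits = z_skel @ z_vl.t() / tau

    # Hard target: each skeleton is positive only with its own paired prompt.
    hard_target = torch.eye(batch, device=z_skel.device)

    with torch.no_grad():
        # Intra-modal self-similarity among skeletons.
        self_sim = F.softmax(z_skel @ z_skel.t() / tau_t, dim=-1)
        # Inter-modal cross-consistency between prompts and skeletons.
        cross_sim = F.softmax(z_vl @ z_skel.t() / tau_t, dim=-1)
        # Softened target: convex blend of hard and soft signals (weights sum to 1).
        soft_target = (1.0 - lambda_self - lambda_cross) * hard_target \
                      + lambda_self * self_sim + lambda_cross * cross_sim

    # KL divergence between the predicted cross-modal distribution and the softened
    # target, in the spirit of self-knowledge distillation with soft labels.
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_target, reduction="batchmean")

# Example usage with random embeddings (batch of 8, feature dim 256):
# loss = soft_cross_modal_loss(torch.randn(8, 256), torch.randn(8, 256))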