Author Affiliations: Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China; Univ Elect Sci & Technol China, Sch Automat Engn, Chengdu 611731, Peoples R China; Hong Kong Polytech Univ, Dept Comp, Hong Kong 999077, Peoples R China; Hong Kong Polytech Univ, Shenzhen Res Inst, Shenzhen 518000, Peoples R China; 1 Orthoped Hosp Chengdu, Chengdu 611731, Peoples R China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2293-2303
Core Indexing:
Subject Classification: 0810 [Engineering - Information and Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (degrees awardable in Engineering or Science)]
Funding: National Key Research and Development Program of China [2022YFE0133100]; Hong Kong RGC General Research Fund [152211/23E, 15216424/24E]; National Natural Science Foundation of China; PolyU Internal Fund [P0043932]; NVIDIA AI Technology Center (NVAITC)
Keywords: Skeleton; Noise measurement; Contrastive learning; Training; Annotations; Three-dimensional displays; Joints; Aerospace electronics; Supervised learning; Representation learning; Action recognition; contrastive learning; cross-modal; LMMs; self-supervised; vision-language
Abstract: Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and can be categorized into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification and requires labor-intensive annotation of predefined action categories, while the latter applies skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C²VL) based on cross-modal contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from vision-language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets that progressively control and guide how strongly vision-language knowledge prompts and their corresponding skeletons are pulled together during cross-modal representation learning. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During inference, our method requires only skeleton data as input for action recognition and no longer needs the vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results.
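To make the soft instance discrimination idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of a cross-modal contrastive loss whose one-hot matching target is softened by intra-modal self-similarity and inter-modal cross-consistency. The function name, temperatures, and blending weights lambda_self and lambda_cross are illustrative assumptions, and the paper's progressive control of the blending during training is replaced here by fixed coefficients for brevity.

import torch
import torch.nn.functional as F

def soft_cross_modal_loss(z_skel, z_vl, tau=0.1, tau_t=0.05,
                          lambda_self=0.3, lambda_cross=0.3):
    """Soft instance discrimination between skeleton and vision-language embeddings.
    The one-hot matching target is blended with intra-modal self-similarity
    (skeleton vs. skeleton) and inter-modal cross-consistency (prompt vs. skeleton)
    soft targets. Illustrative sketch only; hyperparameters are assumptions."""
    z_skel = F.normalize(z_skel, dim=-1)   # (B, D) skeleton encoder outputs
    z_vl = F.normalize(z_vl, dim=-1)       # (B, D) vision-language prompt embeddings
    batch = z_skel.size(0)

    # Cross-modal similarity logits to be trained (skeleton -> prompt).
    logits = z_skel @ z_vl.t() / tau

    # Hard target: each skeleton is positive only with its own paired prompt.
    hard_target = torch.eye(batch, device=z_skel.device)

    with torch.no_grad():
        # Intra-modal self-similarity among skeletons.
        self_sim = F.softmax(z_skel @ z_skel.t() / tau_t, dim=-1)
        # Inter-modal cross-consistency between prompts and skeletons.
        cross_sim = F.softmax(z_vl @ z_skel.t() / tau_t, dim=-1)
        # Softened target: convex blend of hard and soft signals (weights sum to 1).
        soft_target = (1.0 - lambda_self - lambda_cross) * hard_target \
                      + lambda_self * self_sim + lambda_cross * cross_sim

    # KL divergence between the predicted cross-modal distribution and the softened
    # target, in the spirit of self-knowledge distillation with soft labels.
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_target, reduction="batchmean")

# Example usage with random embeddings (batch of 8, feature dim 256):
# loss = soft_cross_modal_loss(torch.randn(8, 256), torch.randn(8, 256))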