Author Affiliations: Dongguan Univ Technol, Sch Comp Sci & Technol, Dongguan 523808, Guangdong, Peoples R China; Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sys, Shenzhen 518055, Guangdong, Peoples R China
Publication: PATTERN ANALYSIS AND APPLICATIONS (Pattern Anal. Appl.)
Year/Volume/Issue: 2025, Vol. 28, No. 2
Pages: 1-17
Subject Classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees may be awarded in Engineering or Science)]
Funding: Natural Science Foundation of Guangdong Province [2024A1515011754, 2023A1515011307, 2022A1515140119]; Intergovernmental International Scientific and Technological Innovation Cooperation Project of the National Key Research and Development Program [2025YFE0199900]; National Natural Science Foundation of China [62376261, U21A20487]
Keywords: Skeleton-based action recognition; Self-supervised learning; Masked autoencoders
Abstract: Skeleton-based human action recognition faces challenges owing to the limited availability of annotated data, which constrains the performance of supervised methods in learning representations of skeleton sequences. To address this issue, researchers have introduced self-supervised learning as a means of reducing the reliance on annotated data; this approach exploits the intrinsic supervisory signals embedded within the data itself. In this study, we demonstrate that considering relative positional relationships between joints, rather than relying on joint coordinates as absolute positional information, yields more effective representations of skeleton sequences. Based on this, we introduce the Masked Cosine Similarity Prediction (MCSP) framework, which takes randomly masked skeleton sequences as input and predicts the corresponding cosine similarity between masked joints. Comprehensive experiments show that the proposed MCSP self-supervised pre-training method effectively learns representations of skeleton sequences, improving model performance while decreasing dependence on extensive labeled datasets. After pre-training with MCSP, a vanilla transformer architecture is employed for fine-tuning in action recognition. The results obtained from six subsets of the NTU-RGB+D 60, NTU-RGB+D 120 and PKU-MMD datasets show that our method achieves significant performance improvements on five subsets. Compared to training from scratch, the performance improvements are 9.8%, 4.9%, 13%, 11.5%, and 3.6%, respectively, with top-1 accuracies of 92.9%, 97.3%, 89.8%, 91.2%, and 96.1% being achieved. Furthermore, our method achieves comparable results on the PKU-MMD Phase II dataset, reaching a top-1 accuracy of 51.5%. These results are competitive without the need for intricate designs, such as multi-stream model ensembles or extreme data augmentation. The source code of our MCSP is available at https://***/skyisyourlimit/MCSP.
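
To make the pre-training target described in the abstract concrete, the following is a minimal sketch in PyTorch, assuming a skeleton sequence stored as a (frames, joints, coordinates) tensor. The function names, the masking ratio, and the restriction of the loss to masked positions are illustrative assumptions; this does not reproduce the authors' exact MCSP implementation.

import torch
import torch.nn.functional as F

def cosine_similarity_targets(joints: torch.Tensor) -> torch.Tensor:
    # joints: (T, V, C) skeleton sequence with T frames, V joints, C coordinates.
    # Normalize each joint's coordinate vector, then take pairwise dot products,
    # giving a (T, V, V) matrix of cosine similarities per frame. These relative
    # quantities serve as regression targets instead of absolute coordinates.
    normed = F.normalize(joints, dim=-1)
    return normed @ normed.transpose(-1, -2)

def random_joint_mask(num_frames: int, num_joints: int, mask_ratio: float = 0.5) -> torch.Tensor:
    # Boolean mask over (frame, joint) positions; True marks a masked joint
    # whose similarities the model must predict. (Assumed masking scheme.)
    return torch.rand(num_frames, num_joints) < mask_ratio

# Toy usage: a 2-frame, 25-joint, 3-D skeleton sequence.
seq = torch.randn(2, 25, 3)
targets = cosine_similarity_targets(seq)      # (2, 25, 25) regression targets
mask = random_joint_mask(2, 25)               # (2, 25) masked-joint indicator
# A pre-training loss would compare the encoder's predicted similarities with
# `targets` only at rows/columns selected by `mask`, e.g. a masked MSE.

The sketch illustrates the key design choice stated in the abstract: the supervisory signal is built from relative relationships (cosine similarities) between joints rather than from their absolute coordinates.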