Author Affiliations: Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China; Nanjing Univ, Dept Comp Sci & Technol, Nanjing 210023, Peoples R China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2450-2462
Subject Classification: 0810 [Engineering - Information and Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (degree awardable in Engineering or Science)]
Funding: National Natural Science Foundation of China [62222207, 62072245, 61932020]; Natural Science Foundation of Jiangsu Province [BK20211520]; Open Foundation of the Key Lab (Center) of Anhui Jianzhu University, Anhui Province Key Laboratory of Intelligent Building & Building Energy Saving [IBES2024KF02]
Keywords: Motion segmentation; Videos; Dynamics; Few-shot learning; Prototypes; Transformers; Image recognition; Graph convolutional networks; Feature extraction; Face recognition; Action recognition; few-shot action recognition; few-shot learning; transformer
Abstract: Few-Shot Action Recognition (FSAR) aims to recognize actions from novel classes given only a few annotated training videos of those classes. Most FSAR methods follow few-shot image classification solutions by focusing solely on appearance-level matching between support and query videos, such as part-level, frame-level, and segment-level matching. However, these methods typically suffer from two main limitations: 1) they ignore the relationships among these part-, frame-, and segment-level features, and 2) they may mismatch same-class actions performed under fast-term and slow-term dynamics. To this end, we present a novel Hierarchical Motion-enhanced Matching (HM2) framework that hierarchically learns relation-aware multi-modal features and jointly promotes multi-modal matching, including appearance-level matching on segments, frames, and parts, as well as motion-level matching on dynamics. Specifically, we first propose a new Hierarchical Tokenizer (HT) to learn multi-modal features, which utilizes a hierarchical Transformer to learn appearance-level features, along with a Slow-Fast Aware Motion (SFAM) strategy to learn motion-level features covering fast- and slow-term dynamics. Next, we propose a new Relation-aware Matcher (RM) to match the multi-modal features, leveraging a Hierarchical Relational Graph Convolutional Network (H-RGCN) to capture the relationships among the appearance-level features. Further, a Dual Sample-to-Class Matching (DSCM) strategy is proposed to measure the bidirectional similarities between appearance- and motion-modal features via sample-to-class and class-to-sample matching. Extensive experiments on four standard FSAR benchmarks demonstrate significant performance improvements of HM2 over state-of-the-art methods.
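The abstract's bidirectional sample-to-class matching can be illustrated with a minimal sketch. This is not the authors' DSCM implementation; the function names, the cosine-similarity metric, and the max-then-mean aggregation are assumptions chosen only to show the idea of scoring a query against each class in both directions.

```python
# Illustrative sketch of bidirectional (sample-to-class and class-to-sample) matching.
# Assumed setup: one query video is represented by T token features, and each class's
# support set is pooled into a matrix of token features; scores use cosine similarity.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def bidirectional_matching(query_feats, class_feats):
    """Return a similarity score per class.

    query_feats: (T, D) token features of one query video.
    class_feats: dict mapping class_id -> (M, D) pooled support features of that class.
    """
    q = l2_normalize(query_feats)
    scores = {}
    for cls, s in class_feats.items():
        s = l2_normalize(s)
        sim = q @ s.T                    # (T, M) pairwise cosine similarities
        q2c = sim.max(axis=1).mean()     # sample-to-class: each query token finds its best class match
        c2q = sim.max(axis=0).mean()     # class-to-sample: each class token finds its best query match
        scores[cls] = 0.5 * (q2c + c2q)  # symmetric, bidirectional score
    return scores

# Toy usage: a 5-way task with random features; the predicted class is the argmax score.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 64))
support = {c: rng.normal(size=(16, 64)) for c in range(5)}
pred = max(bidirectional_matching(query, support), key=lambda c: bidirectional_matching(query, support)[c])
print("predicted class:", pred)
```

Averaging the two directions makes the score robust to cases where only one direction matches well, which is the intuition the abstract attributes to combining sample-to-class and class-to-sample matching.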