Transformer tracking always takes paired template and search images as encoder input and conducts feature extraction and target-search feature correlation by self- and/or cross-attention operations; thus the model complexity grows quadratically with the number of input tokens. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer-based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT) and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to gracefully aggregate the temporal dynamics and spatial appearance information of multi-frame templates along the space-time dimensions. MCPT aims to capture multi-scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi-scale features into a unified feature representation for tracking prediction. Our tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while maintaining a shorter sequence of attention tokens and fewer parameters and FLOPs than existing state-of-the-art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker yields a strong real-time tracking baseline with a good trade-off between tracking performance and inference efficiency.
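To make the complexity argument concrete, the sketch below shows a generic pooling-style attention in which keys and values are spatially downsampled before the score computation, so the cost of attending N tokens drops from O(N^2) to O(N * N / s^2) for a pool stride s. This is a minimal illustration of the pooled-attention idea, not the authors' MCPT/MAPT code; the names PooledAttention and pool_stride are our assumptions.

```python
# Minimal sketch of pooled attention (illustrative; not DPT's released code).
import torch
import torch.nn as nn

class PooledAttention(nn.Module):
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.num_heads = num_heads
        # Pooling shrinks the key/value token grid by pool_stride**2, so the
        # attention score matrix is (N x N / pool_stride**2) instead of (N x N).
        self.pool = nn.AvgPool2d(pool_stride, pool_stride)

    def forward(self, x, h, w):
        # x: (B, N, C) flattened template+search tokens on an h x w grid (N = h*w)
        B, N, C = x.shape
        head_dim = C // self.num_heads
        q = self.q(x).view(B, N, self.num_heads, head_dim).transpose(1, 2)
        # Downsample tokens spatially before computing keys and values.
        x2d = x.transpose(1, 2).view(B, C, h, w)
        pooled = self.pool(x2d).flatten(2).transpose(1, 2)      # (B, N', C), N' < N
        k, v = self.kv(pooled).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5     # (B, H, N, N')
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With stride 2, for example, 1024 input tokens attend to only 256 pooled keys, a 4x smaller score matrix per layer, which is the kind of saving that lets a pooling Transformer keep a shorter attention sequence and fewer FLOPs.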
Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and require high computational cost; further development is hindered by complex multi-person scenarios and the computational limitations of mobile devices. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive the active speaker representation. Secondly, knowledge distillation is introduced to transfer knowledge from the large model to the on-device model in order to control the model size. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms previous state-of-the-art methods.
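The two steps named in the abstract can be sketched generically as follows. This is a plausible reading under stated assumptions, not the paper's actual modules: the tensor shapes, the softmax-over-speakers selection, and the use of Hinton-style logit distillation are all our choices for illustration.

```python
# Generic sketches of (1) attention-based active-speaker selection and
# (2) logit distillation; illustrative only, not the paper's implementation.
import torch
import torch.nn.functional as F

def active_speaker_attention(audio, visual):
    """audio: (B, T, D) acoustic frames; visual: (B, S, T, D) lip features for
    S candidate speakers. Returns a (B, T, D) audio-aligned visual stream."""
    B, S, T, D = visual.shape
    # Attention score matrix between audio frames and each speaker's lips.
    scores = torch.einsum('btd,bstd->bst', audio, visual) / D ** 0.5  # (B, S, T)
    weights = scores.softmax(dim=1)           # speakers compete per frame
    # Weighted sum over speakers emphasizes the active speaker's features.
    return torch.einsum('bst,bstd->btd', weights, visual)

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    # Standard temperature-softened KL between teacher and student logits,
    # a common way to transfer a large model's knowledge to a small one.
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction='batchmean') * temperature ** 2
```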