Author Affiliations: Peng Cheng Lab, Shenzhen 518000, Peoples R China; City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China; Xiamen Univ, Sch Informat, Fujian Key Lab Sensing & Comp Smart City, Xiamen 361005, Peoples R China; Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
Publication: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (IEEE Trans Circuits Syst Video Technol)
Year/Volume/Issue: 2025, Vol. 35, No. 5
Pages: 4271-4286
Funding: National Natural Science Foundation of China [62132002, 62202249, 62102206]; Major Key Project of Peng Cheng Laboratory (PCL) [PCL2024A04-4]; Postdoctoral Science Foundation of China [2022M721732]; Postdoctoral Fellowship Program of China Postdoctoral Science Foundation [GZC20233362]; Chongqing Postdoctoral Innovative Talents Support Program
Subjects: Transformers; Target tracking; Visualization; Semantics; Feature extraction; Object tracking; Natural languages; Circuits and systems; Fuses; Adaptation models; Vision-language tracking; progressive joint vision-language transformer; semantic-aware instance encoder; channel communication patch interaction
Abstract: In recent years, vision-language tracking has attracted increasing attention in the tracking field. The critical challenge of the task is to fuse the semantic representations of language information with the visual representations of vision information. For this purpose, several vision-language tracking methods perform early or late fusion of visual and semantic features. However, these methods cannot take full advantage of the transformer architecture to excavate useful cross-modal context at various levels. To this end, we propose a new progressive joint vision-language transformer (PJVLT) to progressively align and refine visual embeddings with semantic embeddings for vision-language tracking. Specifically, to align visual signals with semantic signals, we insert a semantic-aware instance encoder layer (SAIEL) into each intermediate layer of the transformer encoder to perform progressive alignment of visual and semantic features. Furthermore, to highlight the multi-modal feature channels and patches corresponding to target objects, we propose a unified channel communication patch interaction layer (CCPIL), which is plugged into each intermediate layer of the transformer encoder to progressively activate target-aware channels and patches of the aligned multi-modal features for fine-grained tracking. In general, by progressively aligning and refining visual features with semantic features in the transformer encoder, PJVLT can adaptively excavate well-aligned vision-language context at coarse-to-fine levels, thereby highlighting target objects at various levels for more discriminative tracking. Experiments on several tracking datasets show that the proposed PJVLT achieves favorable performance in comparison with both conventional trackers and other vision-language trackers.
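The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch sketch of the progressive joint encoding pattern it describes: each intermediate encoder layer is followed by a semantic-aware alignment step (SAIEL-like, realized here as cross-attention from visual to language tokens) and a channel/patch gating step (CCPIL-like, realized here as sigmoid gates). All module names, gating designs, and hyper-parameters are assumptions for illustration; the paper's actual SAIEL and CCPIL implementations are not given in this record.

```python
# Hypothetical sketch of progressive vision-language joint encoding,
# assuming cross-attention alignment and sigmoid channel/patch gates.
import torch
import torch.nn as nn


class SemanticAwareAlignment(nn.Module):
    """SAIEL-like step (assumed): visual tokens attend to language tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # Queries are visual patch tokens; keys/values are language tokens.
        aligned, _ = self.cross_attn(query=vis, key=lang, value=lang)
        return self.norm(vis + aligned)  # residual fusion of the two modalities


class ChannelPatchGating(nn.Module):
    """CCPIL-like step (assumed): activate target-aware channels and patches."""

    def __init__(self, dim: int):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.patch_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel gate from the token-averaged descriptor; patch gate per token.
        c = self.channel_gate(x.mean(dim=1, keepdim=True))  # (B, 1, C)
        p = self.patch_gate(x)                               # (B, N, 1)
        return x * c * p                                     # broadcast re-weighting


class ProgressiveJointEncoder(nn.Module):
    """Interleaves standard encoder layers with alignment and gating steps,
    so the fusion is repeated progressively at every intermediate level."""

    def __init__(self, dim: int = 256, depth: int = 6, num_heads: int = 8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.layers = nn.ModuleList(make() for _ in range(depth))
        self.aligners = nn.ModuleList(
            SemanticAwareAlignment(dim, num_heads) for _ in range(depth))
        self.gates = nn.ModuleList(ChannelPatchGating(dim) for _ in range(depth))

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        for block, align, gate in zip(self.layers, self.aligners, self.gates):
            vis = block(vis)        # standard self-attention over visual tokens
            vis = align(vis, lang)  # progressive vision-language alignment
            vis = gate(vis)         # target-aware channel/patch refinement
        return vis


# Toy usage: 196 visual patch tokens and 12 language tokens, both 256-d.
encoder = ProgressiveJointEncoder()
fused = encoder(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 196, 256])
```

The design point illustrated is the per-layer interleaving: rather than a single early or late fusion, every encoder depth re-aligns and re-gates the visual tokens, which is what allows cross-modal context to be excavated at coarse-to-fine levels.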