Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction

Authors: Zhang, Wenyao; Wu, Letian; Zhang, Zequn; Yu, Tao; Ma, Chao; Jin, Xin; Yang, Xiaokang; Zeng, Wenjun

Author affiliations: Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China; Ningbo Inst Digital Twin, Eastern Inst Technol, Ningbo 315200, Peoples R China; Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China; Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China

Published in: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)

Year/Volume: 2025, Vol. 27

Pages: 2399-2411

Subject classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (degrees conferrable in Engineering or Science)]

Funding: NSFC; ZJNSFC [LQ23F010008]; High Performance Computing Center at Eastern Institute of Technology, Ningbo; Ningbo Institute of Digital Twin

Keywords: Visualization; Adaptation models; Tuning; Training; Electronic mail; Computational modeling; Tail; Pipelines; Overfitting; Nose; Attention; multimodal interaction; prompt learning; vision-language models

Abstract: Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods such as prompting, have shown impressive knowledge transferability to new downstream tasks, but they remain prone to catastrophic forgetting and overfitting due to large gaps among tasks. Furthermore, the underlying mechanisms of prompt-based tuning methods (especially visual prompting) remain largely unexplored: it is unclear why adaptation works when it relies solely on learnable parameters used as prompts. To address these challenges, we present a new prompt-based framework for vision-language models, termed Uni-prompt. Our framework transfers VLMs to downstream tasks by designing visual prompts from an attention perspective, which reduces the transfer/solution space and enables the vision model to focus on task-relevant regions of the input image while learning task-specific knowledge. Additionally, Uni-prompt aligns visual and text prompt learning through a masked-representation-modeling pretext task, which implicitly learns a global cross-modal matching between visual and language concepts for consistency. We conduct extensive experiments on few-shot classification and achieve significant improvements with Uni-prompt while requiring minimal extra parameter cost.
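The abstract describes two mechanisms: attention-based visual prompts that steer a frozen image encoder toward task-relevant regions, and a masked-representation-modeling pretext task that aligns visual and text prompts. The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of those two ideas only, in which the module names, dimensions, toy attention-prompt module, and simple reconstruction loss are illustrative assumptions rather than the authors' Uni-prompt method.

```python
# Minimal, self-contained sketch (not the authors' code) of:
# (1) attention-based visual prompts over frozen patch embeddings, and
# (2) a masked-representation alignment loss between visual and text prompt
#     features. All names, shapes, and encoders here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionVisualPrompt(nn.Module):
    """Learnable prompt tokens that attend over frozen patch embeddings."""

    def __init__(self, dim: int = 512, num_prompts: int = 4, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, N_patches, dim) from a frozen vision backbone.
        b = patch_embeds.size(0)
        queries = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Prompt tokens act as queries, so the attention weights concentrate
        # on the patches most relevant to the downstream task.
        prompted, _ = self.attn(queries, patch_embeds, patch_embeds)
        return prompted  # (B, num_prompts, dim)


def masked_alignment_loss(vis_feat: torch.Tensor,
                          txt_feat: torch.Tensor,
                          mask_ratio: float = 0.3) -> torch.Tensor:
    """Mask part of the visual prompt features and let the text-prompt
    features stand in for the masked positions, encouraging cross-modal
    consistency (a loose stand-in for masked representation modeling)."""
    mask = (torch.rand_like(vis_feat[..., :1]) < mask_ratio).float()
    masked_vis = vis_feat * (1.0 - mask)
    recon = masked_vis + mask * txt_feat
    return F.mse_loss(recon, vis_feat)


if __name__ == "__main__":
    B, N, D = 2, 49, 512                    # batch, patches, embed dim (assumed)
    patch_embeds = torch.randn(B, N, D)     # stand-in for frozen patch features
    txt_prompt_feat = torch.randn(B, 4, D)  # stand-in for text-prompt features

    visual_prompt = AttentionVisualPrompt(dim=D, num_prompts=4)
    vis_prompt_feat = visual_prompt(patch_embeds)

    loss = masked_alignment_loss(vis_prompt_feat, txt_prompt_feat)
    loss.backward()
    print("alignment loss:", loss.item())
```

In this sketch the learnable prompts are attention queries over frozen patch embeddings, so only the small prompt and attention modules are trained, mirroring the parameter-efficient setting the abstract describes; the masked alignment loss is one simple way to couple visual and text prompt features, not the specific objective used in the paper.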
