Author Affiliations: Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China; Ningbo Inst Digital Twin, Eastern Inst Technol, Ningbo 315200, Peoples R China; Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China; Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
Publication: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)
Year/Volume: 2025, Vol. 27
Pages: 2399-2411
Core Indexing:
Subject Classification: 0810 [Engineering - Information and Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (degrees awarded in Engineering or Science)]
Funding: NSFC; ZJNSFC [LQ23F010008]; High Performance Computing Center at Eastern Institute of Technology, Ningbo; Ningbo Institute of Digital Twin
Keywords: Visualization; Adaptation models; Tuning; Training; Electronic mail; Computational modeling; Tail; Pipelines; Overfitting; Nose; Attention; multimodal interaction; prompt learning; vision-language models
Abstract: Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods such as prompting, have shown impressive knowledge transferability to new downstream tasks, but they remain limited by the catastrophic forgetting and overfitting dilemma caused by large gaps among tasks. Furthermore, the underlying mechanisms of prompt-based tuning methods (especially visual prompting) remain largely unexplored: it is unclear why these methods work when adaptation relies solely on learnable parameters used as prompts. To address the above challenges, we present a new prompt-based framework for vision-language models, termed Uni-prompt. Our framework transfers VLMs to downstream tasks by designing visual prompts from an attention perspective that reduces the transfer/solution space, enabling the vision model to focus on task-relevant regions of the input image while also learning task-specific knowledge. Additionally, Uni-prompt aligns visual and text prompt learning through a pretext task based on masked representation modeling interactions, which implicitly learns a global cross-modal matching between visual and language concepts for consistency. We conduct extensive experiments on few-shot classification and achieve significant improvements with Uni-prompt while requiring minimal extra parameter cost.
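Illustrative sketch: the abstract describes two ideas, (1) learnable visual prompt tokens attached to a frozen encoder so that only a small set of parameters is tuned, and (2) a masked-representation pretext objective that ties visual and text prompt features together. The PyTorch code below is a minimal, hypothetical rendering of those two ideas under assumed shapes and an assumed cosine-based loss form; the names VisualPromptWrapper and masked_alignment_loss are invented for illustration and are not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPromptWrapper(nn.Module):
    """Prepend learnable prompt tokens to the patch tokens of a frozen encoder."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, num_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        # Only these prompt tokens receive gradients (parameter-efficient tuning).
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim)
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.encoder(torch.cat([prompts, patch_tokens], dim=1))


def masked_alignment_loss(visual_tokens, text_feat, mask_ratio=0.3):
    """Toy masked-representation pretext loss (assumed form): randomly mask
    visual tokens and penalize the cosine distance between masked tokens and
    the pooled text feature, encouraging cross-modal consistency."""
    b, n, _ = visual_tokens.shape
    mask = (torch.rand(b, n, device=visual_tokens.device) < mask_ratio).float()
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_feat, dim=-1).unsqueeze(1)      # (batch, 1, dim)
    cos_dist = 1.0 - (v * t).sum(-1)                     # (batch, num_tokens)
    return (cos_dist * mask).sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    # Stand-in for a frozen ViT trunk; real use would wrap a pre-trained VLM encoder.
    trunk = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
        num_layers=2,
    )
    model = VisualPromptWrapper(trunk)
    patches = torch.randn(4, 196, 768)     # dummy 14x14 patch embeddings
    text = torch.randn(4, 768)             # dummy pooled text-prompt features
    loss = masked_alignment_loss(model(patches), text)
    loss.backward()                        # gradients reach only the prompt tokens
    print(model.prompts.grad.shape)        # torch.Size([8, 768])

The design point the sketch illustrates is that the frozen backbone keeps pre-trained knowledge intact while the small prompt tensor absorbs task-specific adaptation, which is why the parameter cost stays minimal.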