Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction

Authors: Zhang, Wenyao; Wu, Letian; Zhang, Zequn; Yu, Tao; Ma, Chao; Jin, Xin; Yang, Xiaokang; Zeng, Wenjun

Author affiliations: Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China; Ningbo Inst Digital Twin, Eastern Inst Technol, Ningbo 315200, Peoples R China; Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China; Univ Sci & Technol China, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China

Published in: IEEE TRANSACTIONS ON MULTIMEDIA (IEEE Trans Multimedia)

Year/Volume: 2025, Vol. 27

Pages: 2399-2411

Subject classification: 0810 [Engineering - Information & Communication Engineering]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science & Technology (degrees conferrable in Engineering or Science)]

Funding: NSFC; ZJNSFC [LQ23F010008]; High Performance Computing Center at Eastern Institute of Technology, Ningbo; Ningbo Institute of Digital Twin

Keywords: Visualization; Adaptation models; Tuning; Training; Electronic mail; Computational modeling; Tail; Pipelines; Overfitting; Nose; Attention; multimodal interaction; prompt learning; vision-language models

Abstract: Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods such as prompting, have shown impressive knowledge transferability to new downstream tasks, but they remain prone to catastrophic forgetting and overfitting due to large gaps among tasks. Furthermore, the underlying mechanisms of prompt-based tuning methods (especially visual prompting) remain largely unexplored: it is unclear why adaptation works when it relies solely on learnable parameters used as prompts. To address these challenges, we present a new prompt-based framework for vision-language models, termed Uni-prompt. Our framework transfers VLMs to downstream tasks by designing visual prompts from an attention perspective, which reduces the transfer/solution space and enables the vision model to focus on task-relevant regions of the input image while learning task-specific knowledge. Additionally, Uni-prompt aligns visual and text prompt learning through a masked-representation-modeling pretext task, which implicitly learns a global cross-modal matching between visual and language concepts for consistency. We conduct extensive experiments on few-shot classification and achieve significant improvements with Uni-prompt while requiring minimal extra parameter cost.
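The abstract describes two mechanisms: attention-based visual prompts that steer a frozen image encoder toward task-relevant regions, and a masked-representation-modeling pretext task that aligns visual and text prompts. The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of those two ideas only, in which the module names, dimensions, toy attention-prompt module, and simple reconstruction loss are illustrative assumptions rather than the authors' Uni-prompt method.

```python
# Minimal, self-contained sketch (not the authors' code) of:
# (1) attention-based visual prompts over frozen patch embeddings, and
# (2) a masked-representation alignment loss between visual and text prompt
#     features. All names, shapes, and encoders here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionVisualPrompt(nn.Module):
    """Learnable prompt tokens that attend over frozen patch embeddings."""

    def __init__(self, dim: int = 512, num_prompts: int = 4, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, N_patches, dim) from a frozen vision backbone.
        b = patch_embeds.size(0)
        queries = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Prompt tokens act as queries, so the attention weights concentrate
        # on the patches most relevant to the downstream task.
        prompted, _ = self.attn(queries, patch_embeds, patch_embeds)
        return prompted  # (B, num_prompts, dim)


def masked_alignment_loss(vis_feat: torch.Tensor,
                          txt_feat: torch.Tensor,
                          mask_ratio: float = 0.3) -> torch.Tensor:
    """Mask part of the visual prompt features and let the text-prompt
    features stand in for the masked positions, encouraging cross-modal
    consistency (a loose stand-in for masked representation modeling)."""
    mask = (torch.rand_like(vis_feat[..., :1]) < mask_ratio).float()
    masked_vis = vis_feat * (1.0 - mask)
    recon = masked_vis + mask * txt_feat
    return F.mse_loss(recon, vis_feat)


if __name__ == "__main__":
    B, N, D = 2, 49, 512                    # batch, patches, embed dim (assumed)
    patch_embeds = torch.randn(B, N, D)     # stand-in for frozen patch features
    txt_prompt_feat = torch.randn(B, 4, D)  # stand-in for text-prompt features

    visual_prompt = AttentionVisualPrompt(dim=D, num_prompts=4)
    vis_prompt_feat = visual_prompt(patch_embeds)

    loss = masked_alignment_loss(vis_prompt_feat, txt_prompt_feat)
    loss.backward()
    print("alignment loss:", loss.item())
```

In this sketch the learnable prompts are attention queries over frozen patch embeddings, so only the small prompt and attention modules are trained, mirroring the parameter-efficient setting the abstract describes; the masked alignment loss is one simple way to couple visual and text prompt features, not the specific objective used in the paper.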
