
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Authors: Huang, Shubin; Wu, Qiong; Zhou, Yiyi; Chen, Weijie; Zhang, Rongsheng; Sun, Xiaoshuai

Author affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Institute of Artificial Intelligence, Xiamen University, 361005, China; Fuxi AI Lab, NetEase Inc., China

Publication: arXiv

Year/Volume/Issue: 2023

Core indexing:

Subject: Reinforcement learning

Abstract: Pre-trained language models (PLMs) have played an increasing role in vision-language (VL) learning, but they usually require a deep multi-modal branch for VL reasoning, resulting in excessive computation and memory overhead. Recently, visual prompting has emerged as a feasible way to adapt PLMs to VL tasks, but we notice that using all visual tokens greatly exacerbates the already high computation cost, and that token placement is also vital to performance. Based on these observations, we propose a novel transfer learning approach for PLMs in this paper, termed Dynamic Visual Prompting (DVP). Concretely, DVP first deploys a cross-attention module to obtain text-related and compact visual prompt tokens, thereby greatly reducing the input length of PLMs. To obtain the optimal placement, we also equip DVP with a reinforcement learning-based search algorithm, which can automatically merge DVP with PLMs for different VL tasks via a very short search process. In addition, we combine DVP with the recently popular adapter approach to keep most of the parameters of PLMs intact during adaptation, which also helps PLMs achieve a quick shift between single- and multi-modal tasks. We apply DVP to two representative PLMs, namely BERT and T5, and a recent large language model called LLaMA. Extensive experiments are conducted on a set of VL reasoning benchmarks including VQA2.0, GQA, SNLI-VE and ScienceQA. The experimental results not only show the merits of DVP in performance and efficiency, e.g. +2.28% accuracy and -80% FLOPs on VQA2.0, but also confirm its superiority in adapting pre-trained language models to VL tasks. Our code is anonymously released at https://***/hsb1357173526/Dynamic Visual Prompting. Copyright © 2023, The Authors. All rights reserved.
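
The following PyTorch-style sketch is a minimal illustration of the mechanism described in the abstract: a cross-attention module that condenses a long sequence of visual tokens into a few text-related prompt tokens before they enter the PLM. All class and parameter names, dimensions, and the text-conditioning scheme below are assumptions made for illustration; this is not the authors' released implementation.

import torch
import torch.nn as nn

class DynamicVisualPromptSketch(nn.Module):
    # Condense visual features into a handful of text-conditioned prompt tokens.
    def __init__(self, text_dim=768, vis_dim=1024, num_prompts=4, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)  # map image features to the PLM width
        self.query = nn.Parameter(torch.randn(num_prompts, text_dim) * 0.02)  # learnable prompt queries
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_emb, vis_feats):
        # text_emb:  (B, L_t, text_dim)  token embeddings from the PLM
        # vis_feats: (B, L_v, vis_dim)   patch features from a frozen image encoder
        vis = self.vis_proj(vis_feats)
        # Condition the learned queries on the mean-pooled text embedding so the
        # resulting prompts are text-related, as the abstract describes (assumed scheme).
        q = self.query.unsqueeze(0) + text_emb.mean(dim=1, keepdim=True)
        prompts, _ = self.attn(q, vis, vis)  # (B, num_prompts, text_dim)
        return prompts

if __name__ == "__main__":
    dvp = DynamicVisualPromptSketch()
    text = torch.randn(2, 20, 768)     # e.g. a tokenized question
    image = torch.randn(2, 196, 1024)  # e.g. 14x14 ViT patch features
    print(dvp(text, image).shape)      # torch.Size([2, 4, 768])

Feeding only these few prompt tokens into the PLM (4 here instead of 196 patch tokens) is what shortens the input sequence and lowers FLOPs; where in the PLM the prompts are inserted corresponds to the placement that DVP's reinforcement learning search optimizes.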
