
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Authors: Huang, Shubin; Wu, Qiong; Zhou, Yiyi; Chen, Weijie; Zhang, Rongsheng; Sun, Xiaoshuai

Author affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Institute of Artificial Intelligence, Xiamen University, 361005, China; Fuxi AI Lab, NetEase Inc., China

Publication: arXiv

Year/Volume/Issue: 2023

Core indexing:

Subject: Reinforcement learning

Abstract: Pre-trained language models (PLMs) have played an increasing role in vision-language (VL) learning, but they usually require a deep multi-modal branch for VL reasoning, resulting in excessive computation and memory overhead. Recently, visual prompting has emerged as a feasible way to adapt PLMs to VL tasks, but we notice that using all visual tokens greatly exacerbates the already high computation cost, and that token placement is also vital to performance. Based on these observations, we propose a novel transfer learning approach for PLMs in this paper, termed Dynamic Visual Prompting (DVP). Concretely, DVP first deploys a cross-attention module to obtain text-related and compact visual prompt tokens, thereby greatly reducing the input length of PLMs. To obtain the optimal placement, we also equip DVP with a reinforcement learning-based search algorithm, which can automatically merge DVP with PLMs for different VL tasks via a very short search process. In addition, we combine DVP with the recently popular adapter approach to keep most of the parameters of PLMs intact during adaptation, which also helps PLMs achieve a quick shift between single- and multi-modal tasks. We apply DVP to two representative PLMs, namely BERT and T5, and a recent large language model called LLaMA. Extensive experiments are conducted on a set of VL reasoning benchmarks including VQA2.0, GQA, SNLI-VE and ScienceQA. The experimental results not only show the merits of DVP in performance and efficiency, e.g. +2.28% accuracy and -80% FLOPs on VQA2.0, but also confirm its superiority in adapting pre-trained language models to VL tasks. Our code is anonymously released at https://***/hsb1357173526/Dynamic Visual Prompting. Copyright © 2023, The Authors. All rights reserved.
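
The following PyTorch-style sketch is a minimal illustration of the mechanism described in the abstract: a cross-attention module that condenses a long sequence of visual tokens into a few text-related prompt tokens before they enter the PLM. All class and parameter names, dimensions, and the text-conditioning scheme below are assumptions made for illustration; this is not the authors' released implementation.

import torch
import torch.nn as nn

class DynamicVisualPromptSketch(nn.Module):
    # Condense visual features into a handful of text-conditioned prompt tokens.
    def __init__(self, text_dim=768, vis_dim=1024, num_prompts=4, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)  # map image features to the PLM width
        self.query = nn.Parameter(torch.randn(num_prompts, text_dim) * 0.02)  # learnable prompt queries
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_emb, vis_feats):
        # text_emb:  (B, L_t, text_dim)  token embeddings from the PLM
        # vis_feats: (B, L_v, vis_dim)   patch features from a frozen image encoder
        vis = self.vis_proj(vis_feats)
        # Condition the learned queries on the mean-pooled text embedding so the
        # resulting prompts are text-related, as the abstract describes (assumed scheme).
        q = self.query.unsqueeze(0) + text_emb.mean(dim=1, keepdim=True)
        prompts, _ = self.attn(q, vis, vis)  # (B, num_prompts, text_dim)
        return prompts

if __name__ == "__main__":
    dvp = DynamicVisualPromptSketch()
    text = torch.randn(2, 20, 768)     # e.g. a tokenized question
    image = torch.randn(2, 196, 1024)  # e.g. 14x14 ViT patch features
    print(dvp(text, image).shape)      # torch.Size([2, 4, 768])

Feeding only these few prompt tokens into the PLM (4 here instead of 196 patch tokens) is what shortens the input sequence and lowers FLOPs; where in the PLM the prompts are inserted corresponds to the placement that DVP's reinforcement learning search optimizes.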
