FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Authors: Tong, Bo; Lai, Bokai; Zhou, Yiyi; Luo, Gen; Shen, Yunhang; Li, Ke; Sun, Xiaoshuai; Ji, Rongrong

Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Youtu Lab, Tencent, China; OpenGVLab, Shanghai AI Laboratory, China

Published in: arXiv

Year: 2024

Subject: Semantics

Abstract: Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., with slow responses and high latency. Recent efforts have been devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Unlike previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens while compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs that capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, which is also comprehensively compared against strong tiny MLLMs, e.g., InternVL2, MiniCPM-V2, and Qwen2-VL. The experimental results show that, compared with these advanced tiny MLLMs, FlashSloth greatly reduces the number of visual tokens, training memory, and computational complexity while retaining high performance on various vision-language (VL) tasks. Our code is released at: https://***/codefanw/FlashSloth. Copyright © 2024, The Authors. All rights reserved.
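The abstract names the mechanism (embedded visual compression) but not its implementation. As a purely illustrative sketch, assuming a query-based compressor of the kind common in MLLMs: a small set of learnable queries cross-attends to the dense patch features (to capture visually salient information) and to the instruction embeddings (to capture instruction-related information), emitting far fewer visual tokens than the original patch grid. All names and hyperparameters below (`VisualCompressor`, `num_queries=64`, etc.) are hypothetical assumptions, not FlashSloth's actual code.

```python
# Hypothetical sketch of embedded visual compression, NOT FlashSloth's
# actual implementation: learnable queries pool a dense patch grid into
# a small number of instruction-aware visual tokens.
import torch
import torch.nn as nn


class VisualCompressor(nn.Module):
    def __init__(self, dim=768, num_queries=64, num_heads=8):
        super().__init__()
        # A small set of learnable queries replaces the full patch grid.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.img_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, instruction_tokens):
        # patch_tokens: (B, N_patches, dim); instruction_tokens: (B, N_text, dim)
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Pool visually salient content from the dense patch grid.
        q, _ = self.img_attn(q, patch_tokens, patch_tokens)
        # Inject instruction-related context into the compressed tokens.
        ctx, _ = self.txt_attn(q, instruction_tokens, instruction_tokens)
        return self.norm(q + ctx)  # (B, num_queries, dim)


# Usage: 576 patch tokens (a 24x24 grid) compressed to 64 visual tokens.
compressor = VisualCompressor()
patches = torch.randn(2, 576, 768)
instruction = torch.randn(2, 16, 768)
print(compressor(patches, instruction).shape)  # torch.Size([2, 64, 768])
```

Under these assumptions, compressing before the language model is what would yield the reported savings in training memory and computation: the LLM's attention cost grows with sequence length, so shrinking hundreds of patch tokens to a few dozen shortens the input to every subsequent layer.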
