arXiv

DYNAMIC-LLAVA: EFFICIENT MULTIMODAL LARGE LANGUAGE MODELS VIA DYNAMIC VISION-LANGUAGE CONTEXT SPARSIFICATION

Authors: Huang, Wenxuan; Zhai, Zijie; Shen, Yunhang; Cao, Shaosheng; Zhao, Fei; Xu, Xiangfeng; Ye, Zheyu; Hu, Yao; Lin, Shaohui

Affiliations: East China Normal University, China; Xiamen University, China; Xiaohongshu Inc., China; Nanjing University, China; Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China

Publication: arXiv

Year: 2024

Subject: Decoding

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for each inference mode, i.e., prefill, decoding with KV cache, and decoding without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by ∼75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by ∼50% when decoding without KV cache and saves ∼50% of GPU memory overhead when decoding with KV cache, owing to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible degradation in understanding and generation ability, or even performance gains, compared to the full-context inference baselines. Code is available at https://***/Osilly/dynamic_llava . Copyright © 2024, The Authors. All rights reserved.
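The abstract's point that prefill-only vision reduction loses its benefit as decoding proceeds can be illustrated with rough KV-cache arithmetic. The sketch below is not the authors' method or code: the function names, the keep ratios, and the model dimensions (32 layers, 32 heads, head dimension 128, 576 vision tokens, fp16 values) are all assumptions chosen for illustration, and the numbers it produces are not results from the paper.

```python
# Back-of-the-envelope KV-cache sizing under context sparsification.
# All names, dimensions, and keep ratios are illustrative assumptions,
# not values or APIs taken from Dynamic-LLaVA.

def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_tokens * num_heads * head_dim * bytes_per_value


def cached_tokens(vision_tokens, prompt_tokens, generated_tokens,
                  vision_keep=1.0, generated_keep=1.0):
    """Tokens held in the cache when only a fraction of the vision and
    generated tokens is retained (prompt text tokens are always kept here)."""
    kept_vision = int(vision_tokens * vision_keep)
    kept_generated = int(generated_tokens * generated_keep)
    return kept_vision + prompt_tokens + kept_generated


if __name__ == "__main__":
    for generated in (0, 100, 1000):
        full = cached_tokens(576, 64, generated)
        vision_only = cached_tokens(576, 64, generated, vision_keep=0.2)
        both = cached_tokens(576, 64, generated,
                             vision_keep=0.2, generated_keep=0.5)
        print(f"generated={generated:5d}  "
              f"full={kv_cache_bytes(full) / 2**20:8.1f} MiB  "
              f"vision-only={kv_cache_bytes(vision_only) / 2**20:8.1f} MiB  "
              f"vision+generated={kv_cache_bytes(both) / 2**20:8.1f} MiB")
```

Under these assumed values, the gap between the full and vision-only columns stays fixed while the cache keeps growing with output length, so the relative saving shrinks; sparsifying the generated context as well keeps the saving proportional to the output length.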
