arXiv

Interleaved-Modal Chain-of-Thought

Authors: Gao, Jun; Li, Yongqi; Cao, Ziqiang; Li, Wenjie

Author affiliations: School of Computer Science and Technology, Soochow University; Department of Computer Science, The Hong Kong Polytechnic University, Hong Kong

Publication: arXiv

Year/Volume/Issue: 2024

Subject: Visual languages

Abstract: Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transferred to vision-language models (VLMs), text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which current VLMs struggle to do. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS inserts regions of the input image into the generated reasoning steps with negligible additional latency. ADS relies solely on the attention maps of VLMs and introduces no additional parameters, so it is a plug-and-play strategy that generalizes to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs with different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance improvements (up to 14%) and better interpretability compared with existing multimodal CoT prompting methods. © 2024, CC BY.
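
To make the attention-driven selection idea described in the abstract concrete, the following is a minimal illustrative sketch: score image patches by the model's attention and return pixel boxes for the most-attended ones, which could then be cropped and interleaved into the reasoning steps. The function name, the ViT-style 14x14 patch grid over a 224x224 image, and the top-k heuristic are assumptions for illustration, not the authors' implementation.

import numpy as np


def select_attended_regions(attn_over_patches: np.ndarray,
                            grid_size: int = 14,
                            image_size: int = 224,
                            top_k: int = 3):
    """Return pixel boxes (left, top, right, bottom) for the top-k attended patches.

    attn_over_patches: 1-D array of length grid_size**2, e.g. the attention
    weights from the most recently generated text token over the image patch
    tokens. (Illustrative assumption; the paper's exact attention source may differ.)
    """
    assert attn_over_patches.shape == (grid_size * grid_size,)
    patch_px = image_size // grid_size                 # pixels per patch side
    top_idx = np.argsort(attn_over_patches)[::-1][:top_k]  # highest attention first
    boxes = []
    for idx in top_idx:
        row, col = divmod(int(idx), grid_size)         # patch index -> grid position
        boxes.append((col * patch_px, row * patch_px,
                      (col + 1) * patch_px, (row + 1) * patch_px))
    return boxes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_attention = rng.random(14 * 14)               # stand-in for a real attention map
    print(select_attended_regions(fake_attention))     # three 16x16-pixel boxes

Under this reading of the abstract, the returned boxes would be cropped from the input image and fed back to the VLM as visual rationales paired with the generated text, without training any new parameters.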
