Search Results - Inner Mongolia University Library

B-AVIBench: Toward Evaluating the Robustness of Large Vision-Language Model on Black-Box Adversarial Visual-Instructions

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, vol. 20, pp. 1434-1446

Authors: Zhang, Hao; Shao, Wenqi; Liu, Hong; Ma, Yongqiang; Luo, Ping; Qiao, Yu; Zheng, Nanning; Zhang, Kaipeng. Affiliations: Xi An Jiao Tong Univ Natl Engn Res Ctr Visual Informat & Applicat Natl Key Lab Human Machine Hybrid Augmented Intell Xian 710049 Shaanxi Peoples R China; Xi An Jiao Tong Univ Inst Artificial Intelligence & Robot Xian 710049 Shaanxi Peoples R China; Shanghai Artificial Intelligence Lab Shanghai 200000 Peoples R China; Osaka Univ Inst Databil Sci Suita Osaka 5650871 Japan

Large vision-language models (LVLMs) have shown significant progress in responding well to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce B-AVIBench, a framework designed to analyze the robustness of LVLMs when facing various Black-box Adversarial Visual-Instructions (B-AVIs), including four types of image-based B-AVIs, ten types of text-based B-AVIs, and nine types of content bias B-AVIs (such as gender, violence, cultural, and racial biases, among others). We generate 316K B-AVIs encompassing five categories of multimodal capabilities (ten tasks) and content bias. We then conduct a comprehensive evaluation involving 14 open-source LVLMs to assess their performance. B-AVIBench also serves as a convenient tool for practitioners to evaluate the robustness of LVLMs against B-AVIs. Our findings and extensive experimental results shed light on the vulnerabilities of LVLMs, and highlight that inherent biases exist even in advanced closed-source LVLMs like GeminiProVision and GPT-4V. This underscores the importance of enhancing the robustness, security, and fairness of LVLMs. The source code and benchmark are available at https://***/zhanghao5201/B-AVIBench.

Keywords: Closed box; Visualization; Robustness; Adaptation models; Glass box; Benchmark testing; Cognition; Visual perception; Security; Electronic mail; large vision-language model; black-box adversarial visual-instructions; bias evaluation

MiniMedGPT: Efficient Large Vision-Language Model for Medical Visual Question Answering

PATTERN RECOGNITION LETTERS, 2025, vol. 189, pp. 8-16

Authors: Alsabbagh, Abdel Rahman; Mansour, Tariq; Al-Kharabsheh, Mohammad; Ebdah, Abdel Salam; Al-Emaryeen, Roa'a; Al-Nahhas, Sara; Mahafza, Waleed; Al-Kadi, Omar. Affiliations: Univ Jordan King Abdullah Sch Informat Technol 2 Amman 11942 Jordan; Jordan Univ Hosp Diagnost Radiol Dept Amman 11942 Jordan

While large vision-language models (LVLMs) like GPT-4 and Gemini demonstrate significant potential, their utilization in the medical domain remains largely unexplored. This is due to challenges attributed to prolonged training and language generation issues. Imbalances within medical Visual Question Answering (VQA) datasets further complicate the integration of LVLMs. In this paper, we present a novel approach named MiniMedGPT (Mini Medical Generative Pretrained Transformer). Inspired by MiniGPT4-v2, MiniMedGPT is specifically designed for efficient medical VQA. The framework of MiniMedGPT is built upon both medical and pretrained large language models and features an end-to-end versatile fine-tuning pipeline that enables alignment of medical VQA data in just 30 min within a single-stage framework. To address language generation shortcomings and dataset imbalances, we employ Gemini Vision Pro and MediCap as auxiliary components. Through comprehensive benchmarking and evaluations against 6 prominent medical VQA models across 2 well-known datasets, our approach achieves improved performance with the fewest trainable parameters among competitors across various performance metrics. This work can help train junior clinicians and has the potential to serve as a decision support tool for experienced radiologists.

Keywords: Medical VQA; Large vision-language model; MedGPT; Generative pre-trained transformers; Natural language processing

Cross-scene visual context parsing with large vision-language model

PATTERN RECOGNITION, 2025, vol. 166

Authors: Zhang, Guoqing; Kan, Shichao; Shi, Lu; Xu, Wanru; An, Gaoyun; Cen, Yigang. Affiliations: Beijing Jiaotong Univ State Key Lab Adv Rail Autonomous Operat Beijing 100044 Peoples R China; Beijing Jiaotong Univ Sch Comp Sci & Technol Beijing 100044 Peoples R China; Beijing Jiaotong Univ Visual Intelligence X Int Cooperat Joint Lab MOE Beijing 100044 Peoples R China; Cent South Univ Sch Comp Sci & Engn Changsha 410083 Hunan Peoples R China

Relation analysis is crucial for image-based applications such as visual reasoning and visual question answering. Current relation analysis such as scene graph generation (SGG) only focuses on building relationships among objects within a single image. However, in real-world applications, relationships among objects across multiple images, as seen in video understanding, may hold greater significance as they can capture global information. This is still a challenging and unexplored task. In this paper, we aim to explore the technique of Cross-Scene Visual Context Parsing (CS-VCP) using a large vision-language model. To achieve this, we first introduce a cross-scene dataset comprising 10,000 pairs of cross-scene visual instruction data, with each instruction describing the common knowledge of a pair of cross-scene images. We then propose a Cross-Scene Visual Symbiotic Linkage (CS-VSL) model to understand both cross-scene relationships and objects by analyzing the rationales in each scene. The model is pre-trained on 100,000 cross-scene image pairs and validated on 10,000 image pairs. Both quantitative and qualitative experiments demonstrate the effectiveness of the proposed method. Our method has been released on GitHub: https://***/gavin-gqzhang/CS-VSL.

Keywords: Scene graph generation; Large vision-language model; Image caption; Visual context parsing; Instruction tuning

Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model

IEEE SIGNAL PROCESSING LETTERS, 2024, vol. 31, pp. 2550-2554

Authors: Liu, Ye; Pan, Yan; Yin, Jian. Affiliations: Sun Yat Sen Univ Sch Comp Sci & Engn Guangzhou 510006 Peoples R China; Lizhi Inc Artificial Intelligence & Big Data Dept Guangzhou 510630 Peoples R China; Guangdong Key Lab Big Data Anal & Proc Guangzhou 510006 Peoples R China; Sun Yat Sen Univ Sch Artificial Intelligence Zhuhai 519000 Peoples R China

Deep hashing algorithms can transform high-dimensional features into low-dimensional hash codes, which can reduce storage space and improve computational efficiency in traditional information retrieval (IR) and large model related retrieval augmented generation (RAG) scenarios. In recent years, pre-trained convolutional or transformer networks are commonly chosen as the backbone in deep hashing frameworks. This involves incorporating local loss constraints among training samples, and then fine-tuning the model to generate hash codes. Due to the relatively limited local information of constraints among training samples, we propose to design the novel anchor constraint and structural constraint as internal global loss constraints with the vision transformer network, and augment external information by integrating the large vision-language model, thereby enhancing the performance of hash code generation. Additionally, to enhance the scalability of the novel deep hashing framework, we propose to incorporate the adapter module to extend its application from the image domain to the audio domain. By conducting comparative experiments and ablation analysis on various image and audio datasets, it can be confirmed that the proposed method achieves state-of-the-art retrieval results.
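
The general idea in the abstract above, turning a high-dimensional feature into a short binary code and then comparing codes cheaply, can be illustrated with a minimal sketch. This is not the paper's method: the learned backbone and its loss constraints are replaced here by a fixed random projection, and the function names (`make_projection`, `hash_code`, `hamming`, `retrieve`) are hypothetical names chosen for illustration.

```python
import random


def make_projection(in_dim, n_bits, seed=0):
    """Assumption: a random Gaussian projection stands in for a trained network."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_bits)]


def hash_code(features, projection):
    """Emit one bit per projection row: 1 if the dot product is non-negative, else 0."""
    return [
        1 if sum(w * x for w, x in zip(row, features)) >= 0 else 0
        for row in projection
    ]


def hamming(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, database_codes, top_k=1):
    """Return the indices of the top_k database codes closest in Hamming distance."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return ranked[:top_k]
```

Storing `n_bits` bits per item instead of a full feature vector is where the storage saving mentioned in the abstract comes from, and Hamming distance needs only bit comparisons at query time.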

Keywords: Codes; Transformers; Adaptation models; Training; Convolutional neural networks; Computer vision; Feature extraction; Internal global loss; large vision-language model; modality adapter module; multi-label deep hashing; vision transformer

FashionGPT: A Large Vision-Language Model for Enhancing Fashion Understanding

33rd International Conference on Artificial Neural Networks and Machine Learning (ICANN)

Authors: Song, Duanxiao; Gao, Dehong; Liu, Gongshen; Li, Xiaoyong. Affiliations: Shanghai Jiao Tong Univ Shanghai Peoples R China; Northwestern Polytech Univ Xian Peoples R China

ISBN: (print) 9783031723438; 9783031723445

Fashion understanding is a challenging multi-modal task of interpreting multiple aspects of fashion images. While traditional computer vision or multi-modal algorithms fall short in providing a comprehensive understanding, the large vision-language model (LVLM) offers a new approach. However, directly using LVLMs presents four major limitations, highlighting the need for a fashion-specific LVLM. Existing fashion datasets also reveal limitations in providing a coherent natural input that fits the LVLMs. To address this bottleneck, we introduce the FUND dataset featuring meticulously annotated textual descriptions for fashion images. Specifically, we build a fashion knowledge base and collect fashion images in various categories online. By leveraging an image segmentation model and GPT4, we refine the pre-annotations through manual modifications. Through instruct-tuning with FUND, we develop FashionGPT, a GPT-assisted LVLM based on a solid architecture with exceptional performance on fashion understanding. It is capable of generating coherent and multi-aspect descriptions for fashion images and greatly alleviates the four limitations. Extensive experiments quantitatively and qualitatively demonstrate the effectiveness of FashionGPT and the benefits of FUND, and showcase the broad applications in more tasks.

Keywords: Fashion Understanding; Large Vision-Language Model; Instruct Tuning

Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Lee, Younghwan; Luu, Tung M.; Lee, Donghoon; Yoo, Chang D. Affiliations: Electrical Engineering KAIST Daejeon Korea Republic of; Robotics Program KAIST Daejeon Korea Republic of

ISBN: (print) 9798350368741

In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline datasets requires significant human effort and domain expertise. Reinforcement learning with human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of LVLMs to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal. © 2025 IEEE.

Keywords: Large Vision-Language Model; Offline Reinforcement Learning; Reward Labeling

DiViCo: Disentangled Visual Token Compression for Efficient Large Vision-Language Model

IEEE Transactions on Circuits and Systems for Video Technology, 2025

Authors: Wang, Xin; Pan, Zirui; Chen, Hong; Zhu, Wenwu. Affiliations: Tsinghua University Beijing National Research Center for Information Science and Technology Department of Computer Science and Technology Beijing China

Large vision-language models have drawn much attention and become increasingly applicable in complicated multimodal tasks such as visual question answering and video grounding. However, they still suffer from inefficiency during the inference stage due to the computational overhead brought by the large number of visual tokens. Existing works either utilize an attention score (or visual-text relevance) to filter out the less significant visual tokens, or insert learnable projection layers to directly compress the tokens, which neglects the informative details in visual signals and introduces information loss, resulting in poor generalizability to test data. To solve these problems, in this paper we propose a novel Disentangled Visual Token Compression module, i.e., DiViCo, that effectively compresses the visual tokens and maintains good performance simultaneously. Concretely, we first select the top τ% of visual tokens according to their average attention scores, then predict the gap between these selected tokens and the original information by employing the chosen tokens in a disentangled and variational manner. Specifically, we model the mean and variance, sampling the predicted gap from the Gaussian prior. We further keep the informativeness of the compressed visual tokens via KL divergence, which ensures the generalizability of the model. Extensive experiments demonstrate the advantage of our proposed DiViCo module against several state-of-the-art baselines over various real-world datasets. Most notably, LLaVA-v1.5-7b equipped with DiViCo is able to reduce 67.7% FLOPs and save 51.7% time while maintaining 95.6% of the accuracy for LLaVA-v1.5-7b without any compression. © 1991-2012 IEEE.
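
The first step in the abstract above, keeping the top τ% of visual tokens ranked by average attention score, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the attention input is assumed to be a plain list of per-head score lists, the kept count is rounded up with at least one token retained, and the names `average_attention` and `select_top_tokens` are hypothetical. The variational gap-prediction stage is not covered.

```python
import math


def average_attention(attn):
    """attn[h][i] is the attention head h assigns to visual token i (assumed layout)."""
    n_heads = len(attn)
    n_tokens = len(attn[0])
    return [sum(attn[h][i] for h in range(n_heads)) / n_heads for i in range(n_tokens)]


def select_top_tokens(attn, tau_percent):
    """Return indices of the top tau_percent% tokens by mean attention, in original order."""
    scores = average_attention(attn)
    # Assumption: round the kept count up and always keep at least one token.
    keep = max(1, math.ceil(len(scores) * tau_percent / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Returning the indices in their original order keeps the positional layout of the surviving tokens intact, which is a design choice of this sketch rather than something the abstract specifies.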

Keywords: large vision-language model multimodal token compression

Applications of Large Vision-Language Models in Visual Inspection

Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2025, vol. 91, no. 3, pp. 333-336

Authors: Kato, Kunihito; Ueno, Shiryu; Yoshida, Haruto

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Kaul, Prannay; Li, Zhizhong; Yang, Hao; Dukler, Yonatan; Swaminathan, Ashwin; Taylor, C. J.; Soatto, Stefano. Affiliations: Univ Oxford VGG Oxford England; AWS AI Labs Oxford England

ISBN: (print) 9798350353006

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats (typically a multiple-choice response regarding a particular object or attribute), which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

Keywords: benchmark; hallucination; large language model; large vision-language model; LLM; LVLM

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Hu, Qidong; Dong, Xiaoyi; Zhang, Pan; Wang, Bin; He, Conghui; Wang, Jiaqi; Lin, Dahua; Zhang, Weiming; Yu, Nenghai. Affiliations: Univ Sci & Technol China Anhui Prov Key Lab Digital Secur Hefei Peoples R China; Shanghai AI Lab Shanghai Peoples R China; Chinese Univ Hong Kong Hong Kong Peoples R China

ISBN: (print) 9798350353006

Hallucination, a pervasive challenge for multi-modal large language models (MLLMs), has significantly impeded their real-world usage in settings that demand precise judgment. Existing methods mitigate this issue either by training with specifically designed data or by inferencing with external knowledge from other sources, incurring inevitable additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-Trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate the hallucination issue without additional data, knowledge, or training. Our approach begins with an interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglect of image tokens and in describing the image content with hallucination. Based on the observation, OPERA introduces a penalty term on the model logits during beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens and re-allocates the token selection if necessary. With extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality. Our code is at: https://github.com/shikiw/OPERA.

Keywords: Hallucination; Large vision-language model; LLM; Multimodal LLM
