检索结果-内蒙古大学图书馆

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive visual language models

INTERNATIONAL JOURNAL OF COMPUTER VISION 2025年 1-20页

作者： Liang, Jiawei Liang, Siyuan Liu, Aishan Cao, Xiaochun Sun Yat Sen Univ Shenzhen Campus Shenzhen Peoples R China Natl Univ Singapore Singapore Singapore Beihang Univ Beijing Peoples R China Minist Educ Key Lab Cyberspace Secur Zhengzhou Peoples R China

Autoregressive visual language models (VLMs) demonstrate remarkable few-shot learning capabilities within a multimodal context. Recently, multimodal instruction tuning has emerged as a technique to further refine instruction-following abilities. However, we uncover the potential threat posed by backdoor attacks on autoregressive VLMs during instruction tuning. Adversaries can implant a backdoor by inserting poisoned samples with triggers embedded in instructions or images to datasets, enabling malicious manipulation of the victim model's predictions with predefined triggers. However, the frozen visual encoder in autoregressive VLMs imposes constraints on learning conventional image triggers. Additionally, adversaries may lack access to the parameters and architectures of the victim model. To overcome these challenges, we introduce a multimodal instruction backdoor attack, namely VL-Trojan. Our approach facilitates image trigger learning through active reshaping of poisoned features and enhances black-box attack efficacy through an iterative character-level text trigger generation method. Our attack successfully induces target output during inference, significantly outperforming baselines (+15.68%) in ASR. Furthermore, our attack demonstrates robustness across various model scales, architectures and few-shot in-context reasoning scenarios. Our codes are available at https://***/JWLiang007/VL-Trojan.

关键词： visual language model Backdoor attack Instruction tuning

来源：评论

学校读者我要写书评

暂无评论

CILP-FGDI: Exploiting Vision-language model for Generalizable Person Re-Identification

引用

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 2025年 20卷 2132-2142页

作者： Zhao, Huazhong Qi, Lei Geng, Xin Southeast Univ Minist Educ Sch Comp Sci & Engn Nanjing 211189 Peoples R China Southeast Univ Minist Educ Key Lab New Generat Artificial Intelligence Techno Nanjing 211189 Peoples R China

The visual language model, known for its robust cross-modal capabilities, has been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive language-Image Pretraining), a vision-language model pretrained on large-scale image-text pairs to align visual and textual features, for acquiring fine-grained and domain-invariant representations in generalizable person re-identification. The adaptation of CLIP to the task presents two primary challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capabilities. To mitigate the first challenge thereby enhance the ability to learn fine-grained features, a three-stage strategy is proposed to boost the accuracy of text descriptions. Initially, the image encoder is trained to effectively adapt to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate textual descriptions (i.e., prompts) for each image. Finally, the text encoder with the learned prompts is employed to guide the training of the final image encoder. To enhance the model's generalization capabilities to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive (i.e., pulling together image features and domain-invariant prompts) and negative (i.e., pushing apart image features and domain-relevant prompts) views are used to train the image encoder. Collectively, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification. The effectiveness of the proposed method is validated through a comprehensive series of experiments conducted on multiple benchmarks. Our code is available at https://***/Qi5Lei/CLIP-FGDI.

关键词： Computational modeling visualization Feature extraction Training Computer vision Adaptation models Text to image Representation learning Lighting Image resolution visual language model generalizable person re-identification generalization capabilities

来源：评论

学校读者我要写书评

暂无评论

On Large visual language models for Medical Imaging Analysis: An Empirical Study

On Large Visual Language Models for Medical Imaging Analysis...

引用

9th IEEE/ACM International Conference on Connected Health - Applications, Systems and Engineering Technologies (CHASE)

作者： Minh-Hao Van Verma, Prateek Wu, Xintao Univ Arkansas Fayetteville AR 72701 USA

ISBN: (纸本)9798350345025;9798350345018

Recently, large language models (LLMs) have taken the spotlight in natural language processing. Further, integrating LLMs with vision enables the users to explore emergent abilities with multimodal data. visual language models (VLMs), such as LLaVA, Flamingo, or CLIP, have demonstrated impressive performance on various visio-linguistic tasks. Consequently, there are enormous applications of large models that could be potentially used in the biomedical imaging field. Along that direction, there is a lack of related work to show the ability of large models to diagnose the diseases. In this work, we study the zeroshot and few-shot robustness of VLMs on the medical imaging analysis tasks. Our comprehensive experiments demonstrate the effectiveness of VLMs in analyzing biomedical images such as brain MRIs, microscopic images of blood cells, and chest X-rays. While VLMs can not outperform classic vision models like CNN or ResNet, it is worth noting that VLMs can serve as chat assistants to provide pre-diagnosis before making decisions without the need for retraining or finetuning stages.

关键词： visual language model zero-shot learning medical imaging analysis

来源：评论

学校读者我要写书评

暂无评论

Exploring Vision language Pretraining with Knowledge Enhancement via Large language model 1

引用

2nd International Workshop on Trustworthy Artificial Intelligence for Healthcare (TAI4H)

作者： Tung, Chuenyuet Lin, Yi Yin, Jianing Ye, Qiaoyuchen Chen, Hao Hong Kong Univ Sci & Technol Hong Kong Peoples R China

ISBN: (数字)9783031677519

ISBN: (纸本)9783031677502;9783031677519

The integration of Vision-language Pretraining (VLP) models in the medical field represents a significant advancement in the development of AI-driven diagnostic tools. These models, which learn to understand and generate descriptions of visual content, have shown great promise in enhancing the interpretability and accuracy of medical image analysis. However, the application of VLP models in healthcare poses unique challenges, including the scarcity of labeled data and fine-grained nature of medical imaging. Our contributions include the development of a Medical visual language Pre-training (MVLP) model that leverages domain-specific knowledge to improve the alignment between medical images and radiology reports. By utilizing a triplet extraction method and encoding the medical entities with detailed descriptions by Med-PALM 2, we simplify language complexity and exploit the rich domain knowledge learned in Large language model, and implicitly build relationships between medical entities in the language embedding space. Our model demonstrates significant improvements in disease classification tasks, achieving competitive Area Under the Curve scores on benchmark datasets such as RSNA Pneumonia and ChestX-ray14.

关键词： visual language model Large language model Knowledge Enhancement

来源：评论

学校读者我要写书评

暂无评论

Semantic matters: A constrained approach for zero-shot video action recognition

引用

PATTERN RECOGNITION 2025年 162卷

作者： Quan, Zhenzhen Chen, Jialei Deguchi, Daisuke Sun, Jie Zhang, Chenkai Li, Yujun Murase, Hiroshi Shandong Univ Sch Informat Sci & Engn 72 Binhai Rd Qingdao 266237 Shandong Peoples R China Nagoya Univ Furo ChoChikusa Ku Nagoya Aichi Japan Shandong Univ Smart State Governance Lab 72 Binhai Rd Qingdao Shandong Peoples R China Shandong Prov Dept Justice 15743 Jingshi Rd Jinan Shandong Peoples R China

Zero-shot video action recognition has advanced significantly due to the adaptation of visual-language models, such as CLIP, to video domains. However, existing methods attempt to adapt CLIP to video tasks by leveraging temporal information, neglecting the semantic information (i.e. the latent categories and their relationships) within videos. In this paper, we propose a Semantic Constrained CLIP (SC-CLIP) approach that leverages semantic information to adjust CLIP for video recognition while ensuring its performance on unseen data. SC-CLIP comprises a semantic-related query generation module and a semantic constrained cross attention module. First, the semantic-related query generation module clusters dense tokens from CLIP to generate semantic-related mask. The semantic-related query is then derived by pooling the adapted CLIP output using the semantic-related mask. Next, the semantic constrained cross attention module feeds the generated semantic-related query back into CLIP to probe semantic-related values, enhancing their ability to leverage the vision-language matching capabilities of CLIP. By generating semantic-related query, the semantic information aids in distinguishing similar actions, thereby improving performance on unseen samples. Experimental results on three zero-shot action recognition benchmarks show improvements of up to 1.9% and 2% in harmonic mean under two settings. Code is available at https://***/quanzhenzhen/SC-CLIP.

关键词： visual language model Video action recognition Semantic constrained Zero-shot Semantic-related

来源：评论

学校读者我要写书评

暂无评论

GCD-Net: Global consciousness-driven open-vocabulary semantic segmentation network

引用

NEUROCOMPUTING 2025年 636卷

作者： Wu, Xing Xu, Zhenyao Qian, Quan Huang, Bin Shanghai Univ Sch Comp Engn & Sci Shanghai 200444 Peoples R China Shanghai Univ Shanghai Inst Adv Commun & Data Sci Shanghai 200444 Peoples R China Shanghai Univ Key Lab Silicate Cultural Rel Conservat Minist Educ Shanghai Peoples R China Harbin Engn Univ Harbin Peoples R China

Open-vocabulary semantic segmentation aims to achieve accurate classification of different categories of pixels, even if these categories are not explicitly labeled during training. The current research trend in this field emphasizes the utilization of pre-trained visual-language models to augment exploration capabilities. The core of these methods is to use image-level models to guide the segmentation process at the pixel level, thereby enhancing the model's ability to recognize and segment unseen categories during training. However, many approaches overlook global information, which may lead to a lack of comprehensive scene understanding when processing images. Thereby, GCD-Net is introduced as an innovative open-vocabulary semantic segmentation framework, which integrates a novel decoder with a hierarchical encoder to form an encoder-decoder architecture. The hierarchical encoder leverages a hierarchical backbone network to generate a pixel-level image-text cost map, which preserves spatial information effectively at different levels. The proposed decoder, known as the Feature Fusion Decoder, comprises three pivotal modules: the Global Feature Extraction Module, the visual Enhancement Module, and the Feature Aggregation Module. These modules cooperate to process hierarchical feature maps from different levels to capture global context information and effectively aggregate pixel blocks into semantic regions for high-quality open-vocabulary semantic segmentation. Experiments on multiple open-vocabulary semantic segmentation datasets demonstrate that GCD-Net achieves an mIoU score of 17.5% on PC-459 and 94.3% on PAS-20, verifying the effectiveness and superiority of the method.

关键词： Open-vocabulary Semantic segmentation visual language model Deep learning

来源：评论

学校读者我要写书评

暂无评论

LATTE: A Real-time Lightweight Attention-based Traffic Accident Anticipation Engine

引用

INFORMATION FUSION 2025年 122卷

作者： Zhang, Jiaxun Guan, Yanchen Wang, Chengyue Liao, Haicheng Zhang, Guohui Li, Zhenning Univ Macau State Key Lab Internet Things Smart City Macau Peoples R China Univ Macau Dept Civil & Environm Engn Macau Peoples R China Univ Macau Dept Comp & Informat Sci Macau Peoples R China Univ Hawaii Manoa Dept Civil Environm & Construct Engn Honolulu HI USA

Accurately predicting traffic accidents in real-time is a critical challenge in autonomous driving, particularly in resource-constrained environments. Existing solutions often suffer from high computational overhead or fail to adequately address the uncertainty of evolving traffic scenarios. This paper introduces LATTE, a Lightweight Attention-based Traffic Accident Anticipation Engine, which integrates computational efficiency with stateof-the-art performance. LATTE employs Efficient Multiscale Spatial Aggregation (EMSA) to capture spatial features across scales, Memory Attention Aggregation (MAA) to enhance temporal modeling, and Auxiliary Self-Attention Aggregation (AAA) to extract latent dependencies over extended sequences. Additionally, LATTE incorporates the Flamingo Alert-Assisted System (FAA), leveraging a vision-language model to provide realtime, cognitively accessible verbal hazard alerts, improving passenger situational awareness. Evaluations on benchmark datasets (DAD, CCD, A3D) demonstrate LATTE's superior predictive capabilities and computational efficiency. LATTE achieves state-of-the-art 89.74% Average Precision (AP) on DAD benchmark, with 5.4% higher mean Time-To-Accident (mTTA) than the second-best model, and maintains competitive mTTA at a Recall of 80% (TTA@R80) (4.04s) while demonstrating robust accident anticipation across diverse driving conditions. Its lightweight design delivers a 93.14% reduction in floating-point operations (FLOPs) and a 31.58% decrease in parameter count (Params), enabling real-time operation on resource-limited hardware without compromising performance. Ablation studies confirm the effectiveness of LATTE's architectural components, while visualizations and failure case analyses highlight its practical applicability and areas for enhancement. Our codes are available at https://***/icypear/***.

关键词： Accident anticipation Lightweight attention visual language model Autonomous driving

来源：评论

学校读者我要写书评

暂无评论

MoSCE-ReID: Mixture of semantic clustering experts for person re-identification

引用

NEUROCOMPUTING 2025年 626卷

作者： Ren, Kai Hu, Chuanping Xi, Hao Li, Yongqiang Fan, Jinhao Liu, Lihua Zhengzhou Univ Sch Elect & Informat Engn Zhengzhou 450000 Henan Peoples R China Zhengzhou Univ Sch Cyber Sci & Engn 100 Sci Ave Zhengzhou 450000 Henan Peoples R China Henan Remote Sensing Inst Zhengzhou 450003 Henan Peoples R China

This study advances the utilization of semantic information in person re-identification (ReID) by leveraging pre- trained vision-language models, addressing the current limitations in semantic information processing within ReID systems. While recent studies have explored CLIP integration for ReID tasks, their training approaches have inadvertently diminished semantic information by focusing primarily on indirect alignment between person IDs through text encoders and image features. Through comprehensive empirical analysis of semantic information's role in pedestrian ReID, we propose MoSCE-ReID, a mixed semantic clustering expert model. The framework incorporates two key components: a learnable Attribute Group Weight Extractor (AGWE) and a Mixed of LoRA Expert (MoLE) module, designed specifically for attribute group feature extraction. The final ReID decisions are made through the synergistic integration of attribute group features and global features. Extensive experiments across multiple public datasets demonstrate that our approach, by effectively incorporating person attribute group semantic information, achieves substantial performance improvements in ReID tasks, exhibiting superior generalization capabilities compared to existing frameworks.

关键词： Person re-identification visual language model Mixture of experts Prompt learning CLIP

来源：评论

学校读者我要写书评

暂无评论

CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification

引用

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS 2025年第1期21卷 1-20页

作者： Zhao, Huazhong Qi, Lei Geng, Xin Southeast Univ Sch Comp Sci & Engn Nanjing Peoples R China Southeast Univ Key Lab New Generat Artificial Intelligence Techno Minist Educ Nanjing Peoples R China

Recent advancements in pre-trained vision-language models like CLIP have shown promise in person re-identification (ReID) applications. However, their performance in generalizable person ReID tasks remains suboptimal. The large-scale and diverse image-text pairs used in CLIP's pre-training may lead to a lack or insufficiency of certain fine-grained features. In light of these challenges, we propose a hard sample mining method called Depth-First Graph Sampler (DFGS), based on depth-first search, designed to offer sufficiently challenging samples to enhance CLIP's ability to extract fine-grained features. DFGS can be applied to both the image encoder and the text encoder in CLIP. By leveraging the powerful cross-modal learning capabilities of CLIP, we aim to apply our DFGS method to extract challenging samples and form mini-batches with high discriminative difficulty, providing the image model with more efficient and challenging samples that are difficult to distinguish, thereby enhancing the model's ability to differentiate between individuals. Our results demonstrate significant improvements over other methods, confirming the effectiveness of DFGS in providing challenging samples that enhance CLIP's performance in generalizable person Re-ID.

关键词： visual language model Generalizable person re-identification Depth first search

来源：评论

学校读者我要写书评

暂无评论

Bi-LORA: A Vision-language Approach for Synthetic Image Detection

引用

EXPERT SYSTEMS 2025年第2期42卷

作者： Keita, Mamadou Hamidouche, Wassim Eutamene, Hessen Bougueffa Taleb-Ahmed, Abdelmalik Camacho, David Hadid, Abdenour Univ Polytech Hauts Defrance Lab IEMN Valenciennes France Khalifa Univ Abu Dhabi U Arab Emirates Tech Univ Madrid Madrid Spain Sorbonne Univ Sorbonne Ctr Artificial Intelligence Abu Dhabi U Arab Emirates

Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progress has captured significant interest, it has also raised concerns about the high challenge in distinguishing real images from their synthetic counterparts. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-language models (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images. The pivotal conceptual shift in our methodology revolves around reframing binary classification as an image captioning task, leveraging the distinctive capabilities of cutting-edge VLM, notably bootstrapping language image pre-training (BLIP)2. Rigorous and comprehensive experiments are conducted to validate the effectiveness of our proposed approach, particularly in detecting unseen diffusion-generated images from unknown diffusion-based generative models during training, showcasing robustness to noise, and demonstrating generalisation capabilities to GANs. The experiments show that Bi-LORA outperforms state of the art models in cross-generator tasks because it leverages multi-modal learning, open-world visual knowledge, and benefits from robust, high-level semantic understanding. By combining visual and textual knowledge, it can handle variations in the data distribution (such as those caused by different generators) and maintain strong performance across different domains. Its ability to transfer knowledge, robustly extract features and perform zero-shot learning also contributes to its generalisation capabilities, making it more adaptable to new generators. The experimental results showcase an impressive average accur

关键词： deepfake diffusion models generative adversarial nets image captioning large language model low rank adaptation text-to-image generation visual language model

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：