Autoregressive visual language models (VLMs) demonstrate remarkable few-shot learning capabilities within a multimodal context. Recently, multimodal instruction tuning has emerged as a technique to further refine inst...
详细信息
Autoregressive visual language models (VLMs) demonstrate remarkable few-shot learning capabilities within a multimodal context. Recently, multimodal instruction tuning has emerged as a technique to further refine instruction-following abilities. However, we uncover the potential threat posed by backdoor attacks on autoregressive VLMs during instruction tuning. Adversaries can implant a backdoor by inserting poisoned samples with triggers embedded in instructions or images to datasets, enabling malicious manipulation of the victim model's predictions with predefined triggers. However, the frozen visual encoder in autoregressive VLMs imposes constraints on learning conventional image triggers. Additionally, adversaries may lack access to the parameters and architectures of the victim model. To overcome these challenges, we introduce a multimodal instruction backdoor attack, namely VL-Trojan. Our approach facilitates image trigger learning through active reshaping of poisoned features and enhances black-box attack efficacy through an iterative character-level text trigger generation method. Our attack successfully induces target output during inference, significantly outperforming baselines (+15.68%) in ASR. Furthermore, our attack demonstrates robustness across various model scales, architectures and few-shot in-context reasoning scenarios. Our codes are available at https://***/JWLiang007/VL-Trojan.
The visual language model, known for its robust cross-modal capabilities, has been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive language-Image Pretrainin...
详细信息
The visual language model, known for its robust cross-modal capabilities, has been extensively applied in various computer vision tasks. In this paper, we explore the use of CLIP (Contrastive language-Image Pretraining), a vision-languagemodel pretrained on large-scale image-text pairs to align visual and textual features, for acquiring fine-grained and domain-invariant representations in generalizable person re-identification. The adaptation of CLIP to the task presents two primary challenges: learning more fine-grained features to enhance discriminative ability, and learning more domain-invariant features to improve the model's generalization capabilities. To mitigate the first challenge thereby enhance the ability to learn fine-grained features, a three-stage strategy is proposed to boost the accuracy of text descriptions. Initially, the image encoder is trained to effectively adapt to person re-identification tasks. In the second stage, the features extracted by the image encoder are used to generate textual descriptions (i.e., prompts) for each image. Finally, the text encoder with the learned prompts is employed to guide the training of the final image encoder. To enhance the model's generalization capabilities to unseen domains, a bidirectional guiding method is introduced to learn domain-invariant image features. Specifically, domain-invariant and domain-relevant prompts are generated, and both positive (i.e., pulling together image features and domain-invariant prompts) and negative (i.e., pushing apart image features and domain-relevant prompts) views are used to train the image encoder. Collectively, these strategies contribute to the development of an innovative CLIP-based framework for learning fine-grained generalized features in person re-identification. The effectiveness of the proposed method is validated through a comprehensive series of experiments conducted on multiple benchmarks. Our code is available at https://***/Qi5Lei/CLIP-FGDI.
Recently, large languagemodels (LLMs) have taken the spotlight in natural language processing. Further, integrating LLMs with vision enables the users to explore emergent abilities with multimodal data. visual langua...
详细信息
ISBN:
(纸本)9798350345025;9798350345018
Recently, large languagemodels (LLMs) have taken the spotlight in natural language processing. Further, integrating LLMs with vision enables the users to explore emergent abilities with multimodal data. visual language models (VLMs), such as LLaVA, Flamingo, or CLIP, have demonstrated impressive performance on various visio-linguistic tasks. Consequently, there are enormous applications of large models that could be potentially used in the biomedical imaging field. Along that direction, there is a lack of related work to show the ability of large models to diagnose the diseases. In this work, we study the zeroshot and few-shot robustness of VLMs on the medical imaging analysis tasks. Our comprehensive experiments demonstrate the effectiveness of VLMs in analyzing biomedical images such as brain MRIs, microscopic images of blood cells, and chest X-rays. While VLMs can not outperform classic vision models like CNN or ResNet, it is worth noting that VLMs can serve as chat assistants to provide pre-diagnosis before making decisions without the need for retraining or finetuning stages.
The integration of Vision-language Pretraining (VLP) models in the medical field represents a significant advancement in the development of AI-driven diagnostic tools. These models, which learn to understand and gener...
详细信息
ISBN:
(数字)9783031677519
ISBN:
(纸本)9783031677502;9783031677519
The integration of Vision-language Pretraining (VLP) models in the medical field represents a significant advancement in the development of AI-driven diagnostic tools. These models, which learn to understand and generate descriptions of visual content, have shown great promise in enhancing the interpretability and accuracy of medical image analysis. However, the application of VLP models in healthcare poses unique challenges, including the scarcity of labeled data and fine-grained nature of medical imaging. Our contributions include the development of a Medical visuallanguage Pre-training (MVLP) model that leverages domain-specific knowledge to improve the alignment between medical images and radiology reports. By utilizing a triplet extraction method and encoding the medical entities with detailed descriptions by Med-PALM 2, we simplify language complexity and exploit the rich domain knowledge learned in Large languagemodel, and implicitly build relationships between medical entities in the language embedding space. Our model demonstrates significant improvements in disease classification tasks, achieving competitive Area Under the Curve scores on benchmark datasets such as RSNA Pneumonia and ChestX-ray14.
Zero-shot video action recognition has advanced significantly due to the adaptation of visual-languagemodels, such as CLIP, to video domains. However, existing methods attempt to adapt CLIP to video tasks by leveragi...
详细信息
Zero-shot video action recognition has advanced significantly due to the adaptation of visual-languagemodels, such as CLIP, to video domains. However, existing methods attempt to adapt CLIP to video tasks by leveraging temporal information, neglecting the semantic information (i.e. the latent categories and their relationships) within videos. In this paper, we propose a Semantic Constrained CLIP (SC-CLIP) approach that leverages semantic information to adjust CLIP for video recognition while ensuring its performance on unseen data. SC-CLIP comprises a semantic-related query generation module and a semantic constrained cross attention module. First, the semantic-related query generation module clusters dense tokens from CLIP to generate semantic-related mask. The semantic-related query is then derived by pooling the adapted CLIP output using the semantic-related mask. Next, the semantic constrained cross attention module feeds the generated semantic-related query back into CLIP to probe semantic-related values, enhancing their ability to leverage the vision-language matching capabilities of CLIP. By generating semantic-related query, the semantic information aids in distinguishing similar actions, thereby improving performance on unseen samples. Experimental results on three zero-shot action recognition benchmarks show improvements of up to 1.9% and 2% in harmonic mean under two settings. Code is available at https://***/quanzhenzhen/SC-CLIP.
Open-vocabulary semantic segmentation aims to achieve accurate classification of different categories of pixels, even if these categories are not explicitly labeled during training. The current research trend in this ...
详细信息
Open-vocabulary semantic segmentation aims to achieve accurate classification of different categories of pixels, even if these categories are not explicitly labeled during training. The current research trend in this field emphasizes the utilization of pre-trained visual-languagemodels to augment exploration capabilities. The core of these methods is to use image-level models to guide the segmentation process at the pixel level, thereby enhancing the model's ability to recognize and segment unseen categories during training. However, many approaches overlook global information, which may lead to a lack of comprehensive scene understanding when processing images. Thereby, GCD-Net is introduced as an innovative open-vocabulary semantic segmentation framework, which integrates a novel decoder with a hierarchical encoder to form an encoder-decoder architecture. The hierarchical encoder leverages a hierarchical backbone network to generate a pixel-level image-text cost map, which preserves spatial information effectively at different levels. The proposed decoder, known as the Feature Fusion Decoder, comprises three pivotal modules: the Global Feature Extraction Module, the visual Enhancement Module, and the Feature Aggregation Module. These modules cooperate to process hierarchical feature maps from different levels to capture global context information and effectively aggregate pixel blocks into semantic regions for high-quality open-vocabulary semantic segmentation. Experiments on multiple open-vocabulary semantic segmentation datasets demonstrate that GCD-Net achieves an mIoU score of 17.5% on PC-459 and 94.3% on PAS-20, verifying the effectiveness and superiority of the method.
Accurately predicting traffic accidents in real-time is a critical challenge in autonomous driving, particularly in resource-constrained environments. Existing solutions often suffer from high computational overhead o...
详细信息
Accurately predicting traffic accidents in real-time is a critical challenge in autonomous driving, particularly in resource-constrained environments. Existing solutions often suffer from high computational overhead or fail to adequately address the uncertainty of evolving traffic scenarios. This paper introduces LATTE, a Lightweight Attention-based Traffic Accident Anticipation Engine, which integrates computational efficiency with stateof-the-art performance. LATTE employs Efficient Multiscale Spatial Aggregation (EMSA) to capture spatial features across scales, Memory Attention Aggregation (MAA) to enhance temporal modeling, and Auxiliary Self-Attention Aggregation (AAA) to extract latent dependencies over extended sequences. Additionally, LATTE incorporates the Flamingo Alert-Assisted System (FAA), leveraging a vision-languagemodel to provide realtime, cognitively accessible verbal hazard alerts, improving passenger situational awareness. Evaluations on benchmark datasets (DAD, CCD, A3D) demonstrate LATTE's superior predictive capabilities and computational efficiency. LATTE achieves state-of-the-art 89.74% Average Precision (AP) on DAD benchmark, with 5.4% higher mean Time-To-Accident (mTTA) than the second-best model, and maintains competitive mTTA at a Recall of 80% (TTA@R80) (4.04s) while demonstrating robust accident anticipation across diverse driving conditions. Its lightweight design delivers a 93.14% reduction in floating-point operations (FLOPs) and a 31.58% decrease in parameter count (Params), enabling real-time operation on resource-limited hardware without compromising performance. Ablation studies confirm the effectiveness of LATTE's architectural components, while visualizations and failure case analyses highlight its practical applicability and areas for enhancement. Our codes are available at https://***/icypear/***.
This study advances the utilization of semantic information in person re-identification (ReID) by leveraging pre- trained vision-languagemodels, addressing the current limitations in semantic information processing w...
详细信息
This study advances the utilization of semantic information in person re-identification (ReID) by leveraging pre- trained vision-languagemodels, addressing the current limitations in semantic information processing within ReID systems. While recent studies have explored CLIP integration for ReID tasks, their training approaches have inadvertently diminished semantic information by focusing primarily on indirect alignment between person IDs through text encoders and image features. Through comprehensive empirical analysis of semantic information's role in pedestrian ReID, we propose MoSCE-ReID, a mixed semantic clustering expert model. The framework incorporates two key components: a learnable Attribute Group Weight Extractor (AGWE) and a Mixed of LoRA Expert (MoLE) module, designed specifically for attribute group feature extraction. The final ReID decisions are made through the synergistic integration of attribute group features and global features. Extensive experiments across multiple public datasets demonstrate that our approach, by effectively incorporating person attribute group semantic information, achieves substantial performance improvements in ReID tasks, exhibiting superior generalization capabilities compared to existing frameworks.
Recent advancements in pre-trained vision-languagemodels like CLIP have shown promise in person re-identification (ReID) applications. However, their performance in generalizable person ReID tasks remains suboptimal....
详细信息
Recent advancements in pre-trained vision-languagemodels like CLIP have shown promise in person re-identification (ReID) applications. However, their performance in generalizable person ReID tasks remains suboptimal. The large-scale and diverse image-text pairs used in CLIP's pre-training may lead to a lack or insufficiency of certain fine-grained features. In light of these challenges, we propose a hard sample mining method called Depth-First Graph Sampler (DFGS), based on depth-first search, designed to offer sufficiently challenging samples to enhance CLIP's ability to extract fine-grained features. DFGS can be applied to both the image encoder and the text encoder in CLIP. By leveraging the powerful cross-modal learning capabilities of CLIP, we aim to apply our DFGS method to extract challenging samples and form mini-batches with high discriminative difficulty, providing the image model with more efficient and challenging samples that are difficult to distinguish, thereby enhancing the model's ability to differentiate between individuals. Our results demonstrate significant improvements over other methods, confirming the effectiveness of DFGS in providing challenging samples that enhance CLIP's performance in generalizable person Re-ID.
Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progre...
详细信息
Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progress has captured significant interest, it has also raised concerns about the high challenge in distinguishing real images from their synthetic counterparts. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-languagemodels (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images. The pivotal conceptual shift in our methodology revolves around reframing binary classification as an image captioning task, leveraging the distinctive capabilities of cutting-edge VLM, notably bootstrapping language image pre-training (BLIP)2. Rigorous and comprehensive experiments are conducted to validate the effectiveness of our proposed approach, particularly in detecting unseen diffusion-generated images from unknown diffusion-based generative models during training, showcasing robustness to noise, and demonstrating generalisation capabilities to GANs. The experiments show that Bi-LORA outperforms state of the art models in cross-generator tasks because it leverages multi-modal learning, open-world visual knowledge, and benefits from robust, high-level semantic understanding. By combining visual and textual knowledge, it can handle variations in the data distribution (such as those caused by different generators) and maintain strong performance across different domains. Its ability to transfer knowledge, robustly extract features and perform zero-shot learning also contributes to its generalisation capabilities, making it more adaptable to new generators. The experimental results showcase an impressive average accur
暂无评论