Search Results - Inner Mongolia University Library

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Stojnic, Vladan; Kalantidis, Yannis; Tolias, Giorgos. Affiliations: Czech Tech Univ FEE VRG Prague Czech Republic; NAVER LABS Europe Meylan France

ISBN: (print) 9798350353006

Vision-language models (VLMs) have demonstrated impressive performance on zero-shot classification, i.e. classification when provided merely with a list of class names. In this paper, we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP, a method based on label propagation (LP) that utilizes geodesic distances for classification. We tailor LP to graphs containing both text and image features and further propose an efficient method for performing inductive inference based on a dual solution and a sparsification step. We perform extensive experiments to evaluate the effectiveness of our method on 14 common datasets and show that ZLaP outperforms the latest related works. Code: https://***/vladan-stojnic/ZLaP

Keywords: label propagation; vision-language models; zero-shot classification
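
For readers unfamiliar with label propagation, the sketch below shows the classic closed-form variant on a k-nearest-neighbour graph. It is not the ZLaP algorithm from the abstract above: the dual solution, the sparsification step, and the text/image-specific graph construction are omitted. The function name `label_propagation`, the parameters `k` and `alpha`, and the toy 2-D features are assumptions made for illustration only, with two seed nodes standing in for class-name embeddings.

```python
import numpy as np

def label_propagation(features, seed_labels, num_classes, k=3, alpha=0.9):
    """Classic label propagation on a kNN graph (illustrative, not ZLaP).

    features: (n, d) array of L2-normalised feature vectors.
    seed_labels: length-n int array; class id for seed nodes, -1 otherwise.
    Returns the predicted class id for every node.
    """
    n = features.shape[0]
    sim = features @ features.T
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbour

    # Keep the k most similar neighbours per node, then symmetrise.
    W = np.zeros((n, n))
    nn = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = np.clip(sim[rows, nn.ravel()], 0.0, None)
    W = np.maximum(W, W.T)

    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]

    # One-hot seed matrix Y; rows of non-seed nodes stay zero.
    Y = np.zeros((n, num_classes))
    seeded = seed_labels >= 0
    Y[seeded, seed_labels[seeded]] = 1.0

    # Closed-form solution Z = (I - alpha * S)^{-1} Y.
    Z = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return Z.argmax(axis=1)

# Toy data: two well-separated clusters of four points each.
points = np.array([
    [1.0, 0.0], [0.98, 0.2], [0.95, 0.3], [0.9, 0.1],   # cluster A
    [0.0, 1.0], [0.2, 0.98], [0.3, 0.95], [0.1, 0.9],   # cluster B
])
points = points / np.linalg.norm(points, axis=1, keepdims=True)

# One seed per cluster plays the role of a class-name embedding.
seeds = np.array([0, -1, -1, -1, 1, -1, -1, -1])
preds = label_propagation(points, seeds, num_classes=2)
print(preds)
```

On this toy graph each cluster forms its own connected component, so every point should inherit the class of the seed in its cluster.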

Multiple Prompt Fusion for Zero-Shot Lesion Detection Using Vision-Language Models

26th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)

Authors: Guo, Miaotian; Yi, Huahui; Qin, Ziyuan; Wang, Haiying; Men, Aidong; Lao, Qicheng. Affiliations: Beijing Univ Posts & Telecommun Sch Artificial Intelligence Beijing Peoples R China; Sichuan Univ West China Biomed Big Data Ctr West China Hosp Sichuan University Sichuan Peoples R China; Shanghai Artificial Intelligence Lab Shanghai Peoples R China

ISBN: (print) 9783031439032; 9783031439049

The success of large-scale pre-trained vision-language models (VLM) has provided a promising direction of transferring natural image representations to the medical domain by providing a well-designed prompt with medical expert-level knowledge. However, one prompt has difficulty in describing the medical lesions thoroughly enough and containing all the attributes. Besides, the models pre-trained with natural images fail to detect lesions precisely. To solve this problem, fusing multiple prompts is vital to assist the VLM in learning a more comprehensive alignment between textual and visual modalities. In this paper, we propose an ensemble guided fusion approach to leverage multiple statements when tackling the phrase grounding task for zero-shot lesion detection. Extensive experiments are conducted on three public medical image datasets across different modalities and the detection accuracy improvement demonstrates the superiority of our method.

Keywords: Vision-language models; Lesion detection; Multiple prompts; Prompt fusion; Ensemble learning

Towards Better Vision-Inspired Vision-Language Models

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Cao, Yun-Hao; Ji, Kaixiang; Huang, Ziyuan; Zheng, Chuanyang; Liu, Jiajia; Wang, Jian; Chen, Jingdong; Yang, Ming. Affiliations: Nanjing Univ Natl Key Lab Novel Software Technol Nanjing Jiangsu Peoples R China; Ant Grp Hangzhou Zhejiang Peoples R China

ISBN: (print) 9798350353006

Vision-language (VL) models have achieved unprecedented success recently, in which the connection module is the key to bridge the modality gap. Nevertheless, the abundant visual clues are not sufficiently exploited in most existing methods. On the vision side, most existing approaches only use the last feature of the vision tower, without using the low-level features. On the language side, most existing methods only introduce shallow vision-language interactions. In this paper, we present a vision-inspired vision-language connection module, dubbed as VIVL, which efficiently exploits the vision cue for VL models. To take advantage of the lower-level information from the vision tower, a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers, which enriches the visual cue with negligible parameters and computation overhead. To enhance VL interactions, we propose deep vision-conditioned prompts (DVCP) that allow deep interactions of vision and language features efficiently. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task, which greatly improves the data efficiency. When used as a plug-in module, VIVL consistently improves the performance for various backbones and VL frameworks, delivering new state-of-the-art results on multiple benchmarks, e.g., NoCaps and VQAv2.

Keywords: deep learning; vision-language models; feature pyramid; deep prompt

The Arrival of Artificial Intelligence Large Language Models and Vision-Language Models: A Potential to Possible Change in the Paradigm of Healthcare Delivery in Dermatology

JOURNAL OF INVESTIGATIVE DERMATOLOGY, 2024, vol. 144, no. 6, pp. 1186-1188

Authors: Gupta, Aditya K.; Talukder, Mesbah; Wang, Tong; Daneshjou, Roxana; Piguet, Vincent. Affiliations: Mediprobe Res Inc London ON Canada; Univ Toronto Dept Med Div Dermatol Toronto ON Canada; BRAC Univ Sch Pharm Dhaka Bangladesh; Stanford Sch Med Dept Dermatol Redwood City CA USA; Stanford Sch Med Dept Biomed Data Sci Redwood City CA USA; Womens Coll Hosp Div Dermatol Toronto ON Canada

Concept-Based Analysis of Neural Networks via Vision-Language Models

1st International Symposium on AI Verification (SAIV)

Authors: Mangal, Ravi; Narodytska, Nina; Gopinath, Divya; Hu, Boyue Caroline; Roy, Anirban; Jha, Susmit; Pasareanu, Corina S. Affiliations: Carnegie Mellon Univ Pittsburgh PA 15213 USA; VMware Res Palo Alto CA USA; NASA Ames Moffett Field CA USA; Univ Toronto Toronto ON Canada; SRI Int Menlo Pk CA USA

ISBN: (print) 9783031651113; 9783031651120

The analysis of vision-based deep neural networks (DNNs) is highly desirable but it is very challenging due to the difficulty of expressing formal specifications for vision tasks and the lack of efficient verification procedures. In this paper, we propose to leverage emerging multimodal, vision-language, foundation models (VLMs) as a lens through which we can reason about vision models. VLMs have been trained on a large body of images accompanied by their textual description, and are thus implicitly aware of high-level, human-understandable concepts describing the images. We describe a logical specification language Conspec designed to facilitate writing specifications in terms of these concepts. To define and formally check Conspec specifications, we build a map between the internal representations of a given vision model and a VLM, leading to an efficient verification procedure of natural-language properties for vision models. We demonstrate our techniques on a ResNet-based classifier trained on the RIVAL-10 dataset using CLIP as the multimodal model.

Keywords: Formal analysis; Concepts; Vision-language models

Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing

Conference on Safe and Trustworthy Machine Learning (SaTML)

Authors: Nirala, Ashutosh; Joshi, Ameya; Sarkar, Soumik; Hegde, Chinmay. Affiliations: Iowa State Univ Ames IA 50011 USA; New York Univ New York NY USA

ISBN: (print) 9798350349511; 9798350349504

A key benefit of deep vision-language models such as CLIP is that they enable zero-shot open vocabulary classification; the user has the ability to define novel class labels via natural language prompts at inference time. However, while CLIP-based zero-shot classifiers have demonstrated competitive performance across a range of domain shifts, they remain highly vulnerable to adversarial attacks. Therefore, ensuring the robustness of such models is crucial for their reliable deployment in the wild. In this work, we introduce Open Vocabulary Certification (OVC), a fast certification method designed for open-vocabulary models like CLIP via randomized smoothing techniques. Given a base "training" set of prompts and their corresponding certified CLIP classifiers, OVC relies on the observation that a classifier with a novel prompt can be viewed as a perturbed version of nearby classifiers in the base training set. Therefore, OVC can rapidly certify the novel classifier using a variation of incremental randomized smoothing. By using a caching trick, we achieve approximately two orders of magnitude acceleration in the certification process for novel prompts. To achieve further (heuristic) speedups, OVC approximates the embedding space at a given input using a multivariate normal distribution, bypassing the need for sampling via forward passes through the vision backbone. We demonstrate the effectiveness of OVC through experimental evaluation using multiple vision-language backbones on the CIFAR-10 and ImageNet test datasets.

Keywords: vision-language models; CLIP; certified robustness; randomized smoothing

Unsupervised Prototype Adapter for Vision-Language Models

6th Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Authors: Zhang, Yi; Zhang, Ce; Hu, Xueting; He, Zhihai. Affiliations: Harbin Inst Technol Harbin Peoples R China; Southern Univ Sci & Technol Shenzhen Peoples R China; Pengcheng Lab Shenzhen Peoples R China

ISBN: (print) 9789819984282; 9789819984299

Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class. Utilizing these selected samples, we generate class prototypes, which serve as the initialization for the learnable prototype model. After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks. Our extensive experimental results on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and also the state-of-the-art UPL method by large margins.

Keywords: Vision-language models; Contrastive Language-Image Pre-training; Unsupervised Learning; Image Recognition
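
The pipeline in the abstract above (pick the most confident unlabeled samples per class, average them into class prototypes, then add the prototype prediction to the zero-shot prediction) can be sketched roughly as follows. This is a simplified illustration, not the UP-Adapter implementation: the fine-tuning of the prototype model is omitted, small hand-made 2-D vectors stand in for CLIP embeddings, and the names `build_prototypes`, `predict`, `top_m`, `temperature`, and `beta` are all assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def build_prototypes(image_feats, text_feats, top_m=2, temperature=100.0):
    """Average the top_m most confident pseudo-labelled images per class."""
    logits = temperature * image_feats @ text_feats.T
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    pseudo = probs.argmax(axis=1)                  # zero-shot pseudo-labels
    conf = probs.max(axis=1)                       # confidence of each label

    # Fall back to the text embedding if a class receives no pseudo-labels.
    prototypes = text_feats.copy()
    for c in range(text_feats.shape[0]):
        idx = np.flatnonzero(pseudo == c)
        if idx.size == 0:
            continue
        best = idx[np.argsort(-conf[idx])[:top_m]]
        prototypes[c] = image_feats[best].mean(axis=0)
    return l2_normalize(prototypes)

def predict(image_feats, text_feats, prototypes, beta=1.0):
    """Residual combination: zero-shot logits plus weighted prototype logits."""
    zero_shot_logits = image_feats @ text_feats.T
    proto_logits = image_feats @ prototypes.T
    return (zero_shot_logits + beta * proto_logits).argmax(axis=1)

# Stand-ins for class-name text embeddings and unlabeled image embeddings.
text_feats = l2_normalize(np.array([[1.0, 0.2], [0.2, 1.0]]))
image_feats = l2_normalize(np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.3],   # images resembling class 0
    [0.0, 1.0], [0.1, 0.9], [0.3, 0.8],   # images resembling class 1
]))

prototypes = build_prototypes(image_feats, text_feats)
final_preds = predict(image_feats, text_feats, prototypes)
print(final_preds)
```

In the method described by the abstract the prototypes only initialise a learnable model that is then fine-tuned; here they are used directly, which keeps the sketch short but leaves out the step the abstract describes as fine-tuning.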

The Potential of Vision-Language Models for Content Moderation of Children's Videos

22nd IEEE International Conference on Machine Learning and Applications, ICMLA 2023

Authors: Ahmed, Syed Hammad; Hu, Shengnan; Sukthankar, Gita. Affiliation: University of Central Florida Department of Computer Science Orlando United States

ISBN: (print) 9798350345346

Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation, because there are many reasons why a video might be inappropriate beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot setting. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in-depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos as they are not well represented in the CLIP training data. © 2023 IEEE.

Keywords: prompt engineering; video content moderation; vision-language models

GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery

15th Indian Conference on Computer Vision, Graphics and Image Processing

Authors: Solanki, Bhupendra; Nair, Ashwin R.; Singha, Mainak; Mukhopadhyay, Souradeep; Jha, Ankit; Banerjee, Biplab. Affiliations: Indian Inst Technol Mumbai Maharashtra India; Indian Inst Sci Educ & Res Thiruvananthapuram Thiruvananthapuram Kerala India; Indian Inst Sci Bangalore Karnataka India; LNM Inst Informat Technol Jaipur Rajasthan India

ISBN: (print) 9798400710759

Generalized Category Discovery (GCD) aims to cluster unlabeled images into known and novel categories using labeled images from known classes. To address the challenge of transferring features from known to unknown classes while mitigating model bias, we introduce GraphVL, a novel approach for vision-language modeling in GCD, leveraging CLIP. Our method integrates a graph convolutional network (GCN) with CLIP's text encoder to preserve class neighborhood structure. We also employ a lightweight visual projector for image data, ensuring discriminative features through margin-based contrastive losses for image-text mapping. This neighborhood preservation criterion effectively regulates the semantic space, making it less sensitive to known classes. Additionally, we learn textual prompts from known classes and align them to create a more contextually meaningful semantic feature space for the GCN layer using a contextual similarity loss. Finally, we represent unlabeled samples based on their semantic distance to class prompts from the GCN, enabling semi-supervised clustering for class discovery and minimizing errors. Our experiments on seven benchmark datasets consistently demonstrate the superiority of GraphVL when integrated with the CLIP backbone.

Keywords: Class Discovery; Contrastive Learning; Graph Convolutional Networks; Unsupervised learning; Vision-language models

Cross-Modal Concept Learning and Inference for Vision-Language Models

NEUROCOMPUTING, 2024, vol. 583

Authors: Zhang, Yi; Zhang, Ce; Tang, Yushun; He, Zhihai. Affiliations: Harbin Inst Technol Harbin 150001 Peoples R China; Southern Univ Sci & Technol Shenzhen 518055 Peoples R China; Pengcheng Lab Shenzhen 518000 Peoples R China

Large-scale pre-trained vision-language models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this image-scale matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance of the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.

Keywords: Vision-language models; Concept learning; Few-shot learning; Domain generalization
