Search Results - Inner Mongolia University Library

CLIP4STR: A Simple Baseline for Scene Text Recognition With Pre-Trained Vision-Language Model

IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, vol. 33, pp. 6893-6904

Authors: Zhao, Shuai; Quan, Ruijie; Zhu, Linchao; Yang, Yi. Affiliations: Univ Technol Sydney Australian Artificial Intelligence Inst ReLER Lab Ultimo NSW 2007 Australia; Nanyang Technol Univ Coll Comp & Data Sci Singapore 308232 Singapore; Zhejiang Univ ReLER Lab CCAI Hangzhou 310027 Zhejiang Peoples R China

Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a simple yet strong baseline for future STR research with VLMs.

Keywords: Visualization; Text recognition; Decoding; Training data; Marine vehicles; Dogs; Birds; Automobiles; Airplanes; Semantics; vision-language model; scene text recognition; CLIP

Continual Learning of Image Classes With Language Guidance From a Vision-Language Model

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, vol. 34, no. 12, pp. 13152-13163

Authors: Zhang, Wentao; Huang, Yujun; Zhang, Weizhuo; Zhang, Tong; Lao, Qicheng; Yu, Yue; Zheng, Wei-Shi; Wang, Ruixuan. Affiliations: Sun Yat Sen Univ Sch Comp Sci & Engn Guangzhou 510275 Peoples R China; Minist Educ Key Lab Machine Intelligence & Adv Comp Guangzhou 510275 Peoples R China; Peng Cheng Lab Shenzhen 518066 Peoples R China; Beijing Univ Posts & Telecommun Sch Artificial Intelligence Beijing 100876 Peoples R China

Current deep learning models often catastrophically forget the knowledge of old classes when continually learning new ones. State-of-the-art approaches to continual learning of image classes often require retaining a small subset of old data to partly alleviate the catastrophic forgetting issue, and their performance would be degraded sharply when no old data can be stored due to privacy or safety concerns. In this study, inspired by human learning of visual knowledge with the effective help of language, we propose a novel continual learning framework based on a pre-trained vision-language model (VLM) without retaining any old data. Rich prior knowledge of each new image class is effectively encoded by the frozen text encoder of the VLM, which is then used to guide the learning of new image classes. The output space of the frozen text encoder is unchanged over the whole process of continual learning, through which image representations of different classes become comparable during model inference even when the image classes are learned at different times. Extensive empirical evaluations on multiple image classification datasets under various settings confirm the superior performance of our method over existing ones. The source code is available at https://***/Fatflower/CIL_LG_VLM/.

Keywords: Visualization; Continuing education; Adaptation models; Data models; Task analysis; Knowledge engineering; Semantics; Continual learning; vision-language model; language guidance

Unsupervised graph reasoning distillation hashing for multimodal hamming space search with vision-language model

INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, vol. 13, no. 2, pp. 1-18

Authors: Sun, Lina; Dong, Yumin. Affiliation: Chongqing Normal Univ Sch Comp & Informat Sci Chongqing 401331 Peoples R China

Multimodal hash technology maps high-dimensional multimodal data into hash codes, which greatly reduces the cost of data storage and improves query speed through the Hamming similarity calculation. However, existing unsupervised methods still face two key obstacles: (1) With the evolution of large multimodal models, how can the multimodal matching relationships of large models be efficiently distilled to train a powerful student model? (2) Existing methods do not consider other adjacencies between multimodal instances, resulting in limited similarity representation. To address these obstacles, a method called Unsupervised Graph Reasoning Distillation Hashing (UGRDH) is proposed. The UGRDH approach uses CLIP as the teacher model, thus extracting fine-grained multimodal features and relations for teacher-student distillation. Specifically, the multimodal features of the teacher are used to construct a similarity-complementary relation graph matrix, and the proposed graph convolution auxiliary network performs feature aggregation guided by the relation graph matrix to generate a more discriminative hash code. In addition, a cross-attention module is designed to reason about potential instance relations to enable effective teacher-student distilled learning. Finally, UGRDH greatly improves search precision while remaining lightweight. Experimental results show that our method achieves about 1.5%, 3%, and 2.8% performance improvements on MS COCO, NUS-WIDE, and MIRFlickr, respectively.

Keywords: Knowledge distillation; Deep multimodal hashing; Hamming space search; vision-language model
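
The Hamming-space search mentioned in the UGRDH abstract above can be illustrated with a minimal Python sketch. It is not the authors' code: the 8-bit codes, the item names, and the helper functions `hamming_distance` and `hamming_search` are all hypothetical, and a real system would obtain the codes from a trained hashing network. The sketch only shows why retrieval over binary codes is cheap: comparing two codes is an XOR followed by a bit count.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two binary hash codes differ."""
    return bin(a ^ b).count("1")


def hamming_search(query: int, database: Dict[str, int], top_k: int = 2) -> List[str]:
    """Return the top_k item names ranked by ascending Hamming distance to the query."""
    ranked = sorted(database, key=lambda name: hamming_distance(query, database[name]))
    return ranked[:top_k]


# Hypothetical 8-bit codes for items from two modalities (assumption, for illustration only).
codes = {
    "image_cat": 0b10110010,
    "text_cat": 0b10110011,
    "image_car": 0b01001101,
}
query_code = 0b10110110
nearest = hamming_search(query_code, codes)
print(nearest)
```

Because image and text codes share one Hamming space in this toy setup, a single query code can retrieve items of either modality.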

A Vision-language Model Based on Prompt Learner for Few-shot Medical Images Diagnosis

27th International Conference on Computer Supported Cooperative Work in Design (CSCWD)

Authors: Chang, Tianyou; Chen, Shizhan; Fan, Guodong; Feng, Zhiyong. Affiliation: Tianjin Univ Coll Intelligence & Comp Tianjin Peoples R China

ISBN: (print) 9798350349184; 9798350349191

In the real world, it can be challenging to annotate a large-scale dataset for all medical images, making few-shot medical image classification an important task. The latest advancements in pre-trained vision-language models such as CLIP have demonstrated excellent performance in zero-shot natural image recognition and show advantages in medical applications. However, we have found that deploying such models in practical applications faces challenges in terms of engineering effort. It requires specialized medical domain knowledge and is time-consuming, as even slight variations in wording can have a significant impact on performance. Inspired by recent research on prompt learning in the field of Natural Language Processing (NLP), we propose a simple approach called Prompt Learner (PoLe) for automating the design of prompts in pre-trained vision-language models. This is a simple method specifically designed to fine-tune vision-language models, similar to CLIP, for downstream image recognition. In particular, PoLe models context tokens using continuous vectors that can automatically learn from medical images, thus avoiding the tedious process of handcrafting prompts. Additionally, this approach maintains the frozen state of the large-scale pre-trained parameters, saving computational resources. Through extensive experiments on 5 medical image datasets, we have demonstrated that PoLe surpasses manually designed prompts with just one or two shots, and further training with more shots significantly improves the performance of image classification. For instance, when trained with 16 shots, the average improvement is approximately 20% (with a maximum improvement of over 31%). PoLe effectively transforms CLIP into a powerful few-shot learner. In terms of recognition performance, adjusting the CLIP model using PoLe yields better results than manually designed prompts for CLIP. When enhancing CLIP, PoLe demonstrates stronger learning capabilities compared to other few-s

Keywords: Medical images; CLIP; Learnable prompt; vision-language model; ChatGPT

Subsampling of Frequent Words in Text for Pre-training a Vision-Language Model

1st Workshop on Large Generative Models Meet Multimodal Applications (LGM3A)

Authors: Liang, Mingliang; Larson, Martha. Affiliation: Radboud Univ Nijmegen Nijmegen Netherlands

ISBN: (print) 9798400702839

In this paper, we introduce Subsampling of Frequent Words for Contrastive Language-Image Pre-training (SW-CLIP), a novel approach for training vision-language models (VLMs). SW-CLIP uses frequency-based subsampling of words, which was previously proposed for training skip-gram models in natural language processing, and applies it to the textual training data of VLMs. We report on experiments that demonstrate the ability of frequency-based subsampling to speed up training and also to deliver a substantial improvement in accuracy in a number of downstream zero-shot (i.e., transfer) classification tasks. We notice that the classification test sets on which SW-CLIP seems to be particularly effective are those in which the labels of the classes occur infrequently as words in the training data, and thus have a high probability of being retained during frequency-based subsampling of the model training data. Overall, the advantages of SW-CLIP demonstrated in this paper serve to motivate further work on text subsampling for the training of VLMs. Our code and pre-trained weights are available at https://***/Anastasiais-ml/sw_***

Keywords: vision-language model; subsampling; frequent words; zero-shot image classification
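
The frequency-based subsampling referred to in the SW-CLIP abstract above can be sketched as follows. This is an illustration of the skip-gram (word2vec) rule, in which a word with relative frequency f(w) is kept with probability min(1, sqrt(t / f(w))), and not the authors' released code. The threshold value, the toy captions, and the function names `keep_probability` and `subsample_captions` are assumptions made for the example.

```python
import math
import random
from collections import Counter
from typing import List


def keep_probability(relative_freq: float, threshold: float) -> float:
    """word2vec-style keep probability: sqrt(t / f(w)), capped at 1."""
    return min(1.0, math.sqrt(threshold / relative_freq))


def subsample_captions(captions: List[str], threshold: float, seed: int = 0) -> List[str]:
    """Randomly drop frequent words from whitespace-tokenized captions."""
    rng = random.Random(seed)
    tokens = [word for caption in captions for word in caption.split()]
    counts = Counter(tokens)
    total = len(tokens)
    result = []
    for caption in captions:
        kept = [
            word
            for word in caption.split()
            if rng.random() < keep_probability(counts[word] / total, threshold)
        ]
        # A caption can become empty if all of its words are dropped.
        result.append(" ".join(kept))
    return result


# Toy captions (assumption): "a", "photo", and "of" are frequent, the class names are rare.
captions = ["a photo of a dog", "a photo of a cat", "a photo of a rare axolotl"]
shortened = subsample_captions(captions, threshold=0.05)
print(shortened)
```

Rare words such as class labels have a keep probability close to 1, which matches the abstract's observation that infrequent labels are likely to be retained, while shorter captions reduce the text-side cost per training step.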

Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-Language Model

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Dong, Jihao; Yang, Hua; Pan, Renjie. Affiliations: Shanghai Jiao Tong Univ Inst Image Commun & Network Engn Shanghai Peoples R China; Shanghai Key Lab Digital Media Proc & Transmiss Shanghai Peoples R China; Shanghai Jiao Tong Univ AI Inst China MoE Key Lab Artificial Intelligence Shanghai Peoples R China

ISBN: (print) 9798350390155; 9798350390162

Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. In addition, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Based on these facts, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract global context of image and local features of object to Improve interaction Features in images (IF). On the other hand, we propose a Verb Semantic Improvement (VSI) module to enhance textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.

Keywords: Human-object interaction; transformer-based; vision-language model; zero-shot learning

SWEEPMM: A HIGH-QUALITY MULTIMODAL DATASET FOR SWEEPING ROBOTS IN HOME SCENARIOS FOR VISION-LANGUAGE MODEL

49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Xu, Weichen; Xu, Xinxin; Fu, Tianhao; Cao, Jian; Xu, Xiaoyang; Huang, Yuetian; Cao, Xixin; Zhang, Xing. Affiliations: Peking Univ Sch Software & Microelect Beijing Peoples R China; Peking Univ Shenzhen Grad Sch Beijing Peoples R China

ISBN: (print) 9798350344868; 9798350344851

Embodied intelligence based on vision-language models aims to learn from interactions and derive general intelligence. However, existing generalized vision-language models cannot understand domain knowledge in home scenarios due to the lack of sweeping robot multimodal datasets. In this paper, we propose the first multimodal dataset for sweeping robots, called SweepMM. We create textual data such as room type, scene descriptions, and moving recommendations using various approaches including rule-based, manual-based, and off-the-shelf model-based methods. Based on this dataset, we fine-tune the first generative pretrained model for sweeping robots, called SweepGPM. This model enables human-robot dialogue and surpasses previous state-of-the-art methods by 0.8% in room type recognition, 0.4% in obstacle detection, and 8.0% in lost item search, demonstrating the potential of embodied intelligence in sweeping robots.

Keywords: Sweeping robot; Benchmark dataset; vision-language model; Embodied intelligence

Recognition of Heat-Induced Food State Changes by Time-Series Use of Vision-Language Model for Cooking Robot

18th International Conference on Intelligent Autonomous Systems (IAS)

Authors: Kanazawa, Naoaki; Kawaharazuka, Kento; Obinata, Yoshiki; Okada, Kei; Inaba, Masayuki. Affiliation: Univ Tokyo 7-3-1 Hongo Bunkyo Ku Tokyo Japan

ISBN: (digital) 9783031448515

ISBN: (print) 9783031448508; 9783031448515

Cooking tasks are characterized by large changes in the state of the food, which is one of the major challenges in robot execution of cooking tasks. In particular, cooking using a stove to apply heat to the foodstuff causes many special state changes that are not seen in other tasks, making it difficult to design a recognizer. In this study, we propose a unified method for recognizing changes in the cooking state of robots by using the vision-language model that can discriminate open-vocabulary objects in a time-series manner. We collected data on four typical state changes in cooking using a real robot and confirmed the effectiveness of the proposed method. We also compared the conditions and discussed the types of natural language prompts and the image regions that are suitable for recognizing the state changes.

Keywords: Cooking robot; Robot recognition; vision-language model; State change recognition

INTEGRATING EXPERT KNOWLEDGE WITH VISION-LANGUAGE MODEL FOR MEDICAL IMAGE RETRIEVAL

21st IEEE International Symposium on Biomedical Imaging (ISBI)

Authors: Wei, Xiaoyang; Vagena, Zografoula; Kurtz, Camille; Cloppet, Florence. Affiliations: Univ Paris Cite France Lab Informat Paris Descartes LIPADE Paris France; Univ Paris Cite France Data Intelligence Inst Paris diiP Paris France

ISBN: (print) 9798350313345; 9798350313338

Content-Based Image Retrieval (CBIR) is an image search technique that can offer diagnostic guidance when facing difficult cases in radiology. State-of-the-art approaches propose to extract image features using vision-language models which learn image representations from supervision of text in medical literature. However, existing methods seldom take expert knowledge in the medical domain into account. In this article, we propose a knowledge-and-language-guided contrastive visual representation learning framework for image retrieval. Our method consists of two steps: (1) modeling relationships between medical concepts and medical images using a knowledge graph, and translating each node in the graph into a knowledge embedding; (2) injecting knowledge embeddings into a vision-language model by aligning image representations using both encoded textual input and knowledge embeddings. Our experiments show that the proposed framework achieves comparable results to state-of-the-art methods on CBIR tasks using much less training data. Our code is publicly available at https://***/Wxy-24/KL-CVR.

Keywords: Image retrieval; Representation learning; vision-language model; Knowledge graph

Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Authors: Shi, Jiang-Xin; Zhang, Chi; Wei, Tong; Li, Yu-Feng. Affiliations: Nanjing Univ Natl Key Lab Novel Software Technol Sch Artificial Intelligence Nanjing Peoples R China; Nanjing Univ Natl Key Lab Novel Software Technol Nanjing Peoples R China; Southeast Univ Sch Comp Sci & Engn Key Lab Comp Network & Informat Integrat Minist Educ Dhaka Bangladesh

ISBN: (print) 9798400704901

Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed distributions and might not have abundant samples for all the classes; 2) there might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework to achieve efficient and long-tailed generalization, termed Candle. During the training process, we propose compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage the extracted features to obtain visual and textual prototypes for prediction. To make full use of multi-modal information, we also propose cross-modal attention to enrich the features from both modalities. For effective generalization, we introduce virtual prototypes for new classes to make up for their lack of training images. Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets while substantially reducing the training time, demonstrating the superiority of our approach. The source code is available at https://***/shijxcs/Candle.

Keywords: long-tail learning; vision-language model; new class generalization
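
The idea of predicting from prototypes built on features extracted by a frozen model, as mentioned in the Candle abstract above, can be sketched generically as nearest-prototype classification with cosine similarity. This is only a rough illustration, not Candle itself: it omits the compensating logit-adjusted loss, the cross-modal attention, and the virtual prototypes, and the 2-D feature vectors, class names, and function names below are hypothetical stand-ins for real CLIP features.

```python
import math
from typing import Dict, List


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_prototypes(features: Dict[str, List[List[float]]]) -> Dict[str, List[float]]:
    """Average the normalized features of each class, then re-normalize the mean."""
    prototypes = {}
    for label, vectors in features.items():
        unit_vectors = [normalize(v) for v in vectors]
        mean = [sum(column) / len(unit_vectors) for column in zip(*unit_vectors)]
        prototypes[label] = normalize(mean)
    return prototypes


def predict(feature: List[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype has the highest cosine similarity to the feature."""
    query = normalize(feature)
    return max(
        prototypes,
        key=lambda label: sum(a * b for a, b in zip(query, prototypes[label])),
    )


# Toy 2-D "features" (assumption); in practice they would come from a frozen encoder.
train_features = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(train_features)
print(predict([0.8, 0.3], prototypes))
```

Since the encoder is never updated in this setup, only the prototypes need to be computed, which is one reason prototype-based adaptation can be inexpensive.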
