Search Results - Inner Mongolia University Library

Synth-CLIP: Synthetic data make CLIP generalize better in data-limited scenarios

NEURAL NETWORKS, 2025, vol. 184, p. 107083

Authors: Liu, Mushui; He, Weijie; Lu, Ziqian; Dan, Jun; Yu, Yunlong; Li, Yingming; Li, Xi; Han, Jungong. Affiliations: Zhejiang Univ Coll Informat Sci & Elect Engn Hangzhou Peoples R China; Zhejiang Univ Sch Aeronaut & Astronaut Hangzhou Peoples R China; Zhejiang Univ Coll Comp Sci & Technol Hangzhou Peoples R China; Univ Sheffield Dept Comp Sci Sheffield England

Prompt learning is a powerful technique that enables the transfer of vision-language models (VLMs) like CLIP to downstream tasks. However, when the prompt-based methods are fine-tuned solely on base classes, they often struggle to generalize to novel classes lacking visual samples during training, especially in scenarios with limited training data. To address this challenge, we propose an innovative approach called Synth-CLIP that leverages synthetic data to enhance CLIP's generalization capability for base classes and the general capability for novel classes. Synth-CLIP fine-tunes the pre-trained CLIP model by seamlessly integrating tailored prompts that are both domain-specific and domain-shared, specifically designed for visual samples, reorganizing visual features from real and synthetic domains into the semantic space. This approach efficiently expands the data pool and enriches category diversity. Moreover, based on semantic structure consistency, we introduce a cross-domain feature alignment loss to match the real and synthetic samples in the feature embedding space. By aligning the visual and semantic distributions, the synthetic data from base and novel classes provide crucial discriminative information, enabling the model to rebalance the decision boundaries even in the absence of real novel visual samples. Experimental results on three model generalization tasks demonstrate that our method performs very competitively across various benchmarks. Notably, Synth-CLIP outperforms the recent competitor PromptSRC by an average improvement of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.

Keywords: vision-language models; CLIP; Generalization; Synthetic samples; Data limited scenarios

Open-Vocabulary Action Localization With Iterative Visual Prompting

IEEE ACCESS, 2025, vol. 13, pp. 56908-56917

Authors: Wake, Naoki; Kanehira, Atsushi; Sasabuchi, Kazuhiro; Takamatsu, Jun; Ikeuchi, Katsushi. Affiliations: Microsoft Appl Robot Res Redmond WA 98052 USA

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at (https://***/VLM-Video-Action-Localization/).

Keywords: Location awareness; Pipelines; Timing; Robots; Visualization; Iterative methods; Hands; Benchmark testing; Indexes; Grasping; Open-vocabulary action localization; vision-language models; large language models; GPT; action localization
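
The iterative narrowing described in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: the `select` callback is a hypothetical stand-in for the VLM query on the concatenated, index-labeled frames, and the names `sample_times`, `localize`, `closest_to` and the default values of `n`, `iterations`, and `shrink` are assumptions made purely for illustration.

```python
from typing import Callable, List


def sample_times(start: float, end: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps covering [start, end]."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def localize(
    select: Callable[[List[float]], int],
    duration: float,
    n: int = 5,
    iterations: int = 4,
    shrink: float = 0.5,
) -> float:
    """Estimate one action boundary by repeatedly narrowing the sampling window.

    `select` receives the sampled timestamps and returns the index of the
    sample it judges closest to the boundary (hypothetical VLM stand-in).
    """
    start, end = 0.0, duration
    best = start
    for _ in range(iterations):
        times = sample_times(start, end, n)
        best = times[select(times)]
        # Assumption: the next window is centered on the chosen sample and
        # is `shrink` times as wide as the current one, clipped to the video.
        half = (end - start) * shrink / 2.0
        start = max(0.0, best - half)
        end = min(duration, best + half)
    return best


def closest_to(target: float) -> Callable[[List[float]], int]:
    """Toy selector that knows the true boundary; used only for the demo."""
    def select(times: List[float]) -> int:
        return min(range(len(times)), key=lambda i: abs(times[i] - target))
    return select


if __name__ == "__main__":
    print(localize(closest_to(37.0), duration=100.0))
```

In a real pipeline the same loop would presumably run once for the start and once for the end of the action, with `select` replaced by a model call; how many frames to tile and how fast to shrink the window are tuning choices that the abstract does not specify.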

Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, vol. 133, no. 7, pp. 4134-4157

Authors: Cui, Shuang; Li, Yi; Li, Jiangmeng; Tang, Xiongxin; Su, Bing; Xu, Fanjiang; Xiong, Hui. Affiliations: Chinese Acad Sci Natl Key Lab Space Integrated Informat Syst Inst Software Beijing Peoples R China; Univ Chinese Acad Sci Beijing Peoples R China; Renmin Univ China Gaoling Sch Artificial Intelligence Beijing Key Lab Big Data Management & Anal Methods Beijing Peoples R China; Hong Kong Univ Sci & Technol Guangzhou Thrust Artificial Intelligence Guangzhou Peoples R China; Hong Kong Univ Sci & Technol Dept Comp Sci & Engn Guangzhou Peoples R China

Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inferences. In this work, we gauge the intrinsic reason behind the performance degradation, which is identified as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which primarily rely on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains only requiring unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.

Keywords: Continual test-time adaptation; Single image defocus deblurring; Causality; vision-language models

RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP

COMPUTER VISION AND IMAGE UNDERSTANDING, 2025, vol. 251

Authors: Jha, Ankit; Singha, Mainak; Bhattacharya, Avigyan; Banerjee, Biplab. Affiliations: LNM Inst Informat Technol Jaipur 302031 India; Indian Inst Technol Mumbai 400076 India

Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP's vision encoder fails to generalize well on puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS3Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS3Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least 1.3% in domain and class generalization tasks.

Keywords: Prompt Learning; Remote Sensing; Self-supervised Learning; vision-language models

LAMARS: Large Language Model-Based Anticipation Mechanism Acceleration in Real-Time Robotic Systems

IEEE ACCESS, 2025, vol. 13, pp. 3864-3880

Authors: Gao, Yifang; Luo, Wei; Wang, Xuye; Zhang, Shunshun; Goh, Patrick. Affiliations: Univ Sains Malaysia Sch Elect & Elect Engn Nibong Tebal 14300 Penang Malaysia; Beijing Jiaotong Univ Sch Elect Engn Beijing 100044 Peoples R China; Univ Sains Malaysia Sch Pharmaceut Sci Gelugor 11800 Penang Malaysia; Guangxi Univ Sci & Technol Sch Automat Liuzhou 545006 Peoples R China

Large language models (LLMs) have assumed an increasingly crucial role in robotic systems because of their ability to leverage the extensive knowledge they possess in robotic inference and task handling. Although LLMs offer significant potential, their integration into robotic systems poses substantial challenges, particularly with regard to computational efficiency and latency. To address this challenge, this study presents LAMARS, an LLM-based anticipation mechanism designed to accelerate real-time robotic systems. LAMARS leverages the predictive power and zero-shot capabilities of LLMs combined with an anticipation mechanism and vision-language processing to position a robot in advance for upcoming tasks. This reduces latency and optimizes path planning without requiring expensive training data. Our evaluations in a realistic simulation environment and with a variation of the RLBench dataset demonstrated that LAMARS achieved an average success rate of 0.79 and improves efficiency by up to 52.4% compared to existing methods, significantly lowering path planning costs. These results indicate that LAMARS effectively accelerates directive execution, making it a promising solution to minimize delays in real-time robotic systems.

Keywords: Robot kinematics; Real-time systems; Hidden Markov models; Visualization; Costs; Data models; Planning; Adaptation models; Predictive models; Vectors; Human-robot interaction; large language models; latency reduction; vision-language models; path planning

Global-local prompts guided image-text embedding, alignment and aggregation for multi-label zero-shot learning

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2025, vol. 106

Authors: Song, Tiecheng; Huang, Yu; Yang, Feng; Qin, Anyong; Zhao, Yue; Gao, Chenqiang. Affiliations: Chongqing Univ Posts & Telecommun Sch Commun & Informat Engn Chongqing 400065 Peoples R China; Sun Yat Sen Univ Sch Intelligent Syst Engn Shenzhen Campus Shenzhen 518107 Guangdong Peoples R China

Multi-label zero-shot learning (MLZSL) aims to classify images into multiple unseen label classes, which is a practical yet challenging task. Recent methods have used vision-language models (VLM) for MLZSL, but they do not adequately consider the global and local semantic relationships to align images and texts, yielding limited classification performance. In this paper, we propose a novel MLZSL approach, named global-local prompts guided image-text embedding, alignment and aggregation (GLP-EAA) to alleviate this problem. Specifically, based on the parameter-frozen VLM, we divide the image into patches and explore a simple adapter to obtain global and local image embeddings. Meanwhile, we design global-local prompts to obtain text embeddings of different granularities. Then, we introduce global-local alignment losses to establish image-text consistencies at different granularity levels. Finally, we aggregate global and local scores to compute the multi-label classification loss. The aggregated scores are also used for inference. As such, our approach integrates prompt learning, image-text alignment and classification score aggregation into a unified learning framework. Experimental results on NUS-WIDE and MS-COCO datasets demonstrate the superiority of our approach over state-of-the-art methods for both ZSL and generalized ZSL tasks.

Keywords: multi-label zero-shot learning; vision-language models; Prompt learning; Alignment
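
The final step in the abstract above, aggregating global and local scores, might look roughly like the sketch below. The abstract does not say how the two score sets are combined, so max-pooling the per-patch scores and taking a weighted sum with weight `alpha` are assumptions, as are the function names and the decision threshold.

```python
from typing import List, Sequence


def aggregate_scores(
    global_scores: Sequence[float],
    local_scores: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Combine one global score per label with per-patch local scores.

    local_scores[p][c] is the score of label c on patch p.
    Assumption: local evidence is max-pooled over patches, then mixed
    with the global score by a weighted sum.
    """
    if not local_scores:
        raise ValueError("at least one patch is required")
    aggregated = []
    for label, g in enumerate(global_scores):
        best_local = max(patch[label] for patch in local_scores)
        aggregated.append(alpha * g + (1.0 - alpha) * best_local)
    return aggregated


def predict_labels(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Multi-label decision: keep every label whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    scores = aggregate_scores(
        [0.2, 0.8, 0.1],
        [[0.9, 0.1, 0.2], [0.3, 0.4, 0.1]],
    )
    print(scores, predict_labels(scores))
```

Max-pooling lets a small object that fills a single patch still raise its label's score, which is one plausible reason to keep local scores alongside the global one.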

MMTF-DES: A fusion of multimodal transformer models for desire, emotion, and sentiment analysis of social media data

NEUROCOMPUTING, 2025, vol. 623

Authors: Aziz, Abdul; Chowdhury, Nihad Karim; Kabir, Muhammad Ashad; Chy, Abu Nowshed; Siddique, Md. Jawad. Affiliations: Univ Chittagong Dept Comp Sci & Engn Chattogram 4331 Bangladesh; Charles Sturt Univ Sch Comp Math & Engn Bathurst NSW 2795 Australia; Southern Illinois Univ Dept Comp Sci Carbondale IL 62901 USA

Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing various aspects of decision-making, communication, and social interactions. Their analysis, particularly in the context of multimodal data (such as images and texts) from social media, provides profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies overlooked the use of image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we have proposed a unified multimodal-based framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of our proposed method lies in the encoder module, which is built using two state-of-the-art multimodal vision-language models (VLMs). To effectively extract visual and contextualized embedding features from social media image and text pairs, we jointly fine-tune two pre-trained multimodal VLMs: Vision-and-Language Transformer (ViLT) and Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we use an early fusion strategy on these embedding features to obtain combined diverse feature representations. Moreover, we leverage a multi-sample dropout mechanism to enhance the generalization ability and expedite the training process of our proposed method. To evaluate our proposed approach, we used the multimodal dataset MSED for the human desire understanding task. Through our experimental evaluation, we demonstrate that our method excels in capturing both visual and contextual information, resulting in superior performance compared to other state-of-the-art techniques. Specifically, our method outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.

Keywords: Human desire understanding; Desire analysis; Sentiment analysis; Emotion analysis; Multimodal transformer; vision-language models

A two-step concept-based approach for enhanced interpretability and trust in skin lesion diagnosis

COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, vol. 28, pp. 71-79

Authors: Patricio, Cristiano; Teixeira, Luis F.; Neves, Joao C. Affiliations: Univ Beira Interior Covilha Portugal; NOVA LINCS Lisbon Portugal; Univ Porto Fac Engn Porto Portugal; INESC TEC Porto Portugal

The main challenges hindering the adoption of deep learning-based systems in clinical settings are the scarcity of annotated data and the lack of interpretability and trust in these systems. Concept Bottleneck Models (CBMs) offer inherent interpretability by constraining the final disease prediction on a set of human-understandable concepts. However, this inherent interpretability comes at the cost of greater annotation burden. Additionally, adding new concepts requires retraining the entire system. In this work, we introduce a novel two-step methodology that addresses both of these challenges. By simulating the two stages of a CBM, we utilize a pretrained Vision Language Model (VLM) to automatically predict clinical concepts, and an off-the-shelf Large Language Model (LLM) to generate disease diagnoses grounded on the predicted concepts. Furthermore, our approach supports test-time human intervention, enabling corrections to predicted concepts, which improves final diagnoses and enhances transparency in decision-making. We validate our approach on three skin lesion datasets, demonstrating that it outperforms traditional CBMs and state-of-the-art explainable methods, all without requiring any training and utilizing only a few annotated examples. The code is available at https://***/CristianoPatricio/2step-concept-based-skin-diagnosis.

Keywords: Concept bottleneck models; vision-language models; Interpretability; Skin cancer; Dermoscopy

CLIP-guided black-box domain adaptation of image classification

SIGNAL IMAGE AND VIDEO PROCESSING, 2024, vol. 18, no. 5, pp. 4637-4646

Authors: Tian, Liang; Ye, Mao; Zhou, Lihua; He, Qichen. Affiliations: Univ Elect Sci & Technol China Sch Comp Sci & Engn Chengdu 611731 Peoples R China

Recently, the significant success of large pre-trained models has attracted great attention. How to make full use of these models is an important issue. Black-box domain adaptation trains a target model using a cloud API offered by a large pre-trained model, without access to model details or source data. Existing black-box domain adaptation methods for image classification always use the prediction results from the cloud API, but this information is very limited. On the other hand, the recently proposed visual-language model (CLIP), trained on a large number of extensive datasets, aligns visual features and text features in a common space, which provides useful auxiliary information. In this work, we propose a new black-box domain adaptation method guided by CLIP (BBC). The key idea is to generate more accurate pseudo-labels. Two strategies are adopted. The first, called generation of joint pseudo-labels, combines the predictions from the cloud API and the CLIP model. The second is a structure-preserved pseudo-labeling strategy, which further generates much better pseudo-labels from the previously stored predictions of the k-closest neighbors. Experiments on three benchmark datasets show that our method achieves state-of-the-art results by a large margin.

Keywords: Black-box domain adaptation; vision-language models; Image classification
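
The two pseudo-labeling strategies named in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the BBC method itself: the abstract does not give the combination rule, so a weighted average of the two probability vectors is assumed, and the neighbor step is assumed to average the stored predictions of the k nearest samples under squared Euclidean distance. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def joint_pseudo_label(
    api_probs: Sequence[float],
    clip_probs: Sequence[float],
    weight: float = 0.5,
) -> int:
    """Assumed joint rule: weighted average of cloud-API and CLIP predictions."""
    mixed = [weight * a + (1.0 - weight) * c for a, c in zip(api_probs, clip_probs)]
    return argmax(mixed)


def neighbor_pseudo_label(
    query: Sequence[float],
    bank_feats: Sequence[Sequence[float]],
    bank_probs: Sequence[Sequence[float]],
    k: int = 2,
) -> int:
    """Assumed neighbor rule: average the stored predictions of the k
    closest samples in feature space and take the most likely class."""
    def sq_dist(feat: Sequence[float]) -> float:
        return sum((q - f) ** 2 for q, f in zip(query, feat))

    order = sorted(range(len(bank_feats)), key=lambda i: sq_dist(bank_feats[i]))
    nearest = order[:k]
    n_classes = len(bank_probs[0])
    avg: List[float] = [
        sum(bank_probs[i][c] for i in nearest) / len(nearest)
        for c in range(n_classes)
    ]
    return argmax(avg)


if __name__ == "__main__":
    print(joint_pseudo_label([0.6, 0.4], [0.1, 0.9]))
    feats = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
    print(neighbor_pseudo_label([0.1, 0.2], feats, probs, k=2))
```

The neighbor step is what makes the labeling "structure-preserved" in this reading: a sample inherits the consensus of nearby samples in feature space rather than relying on its own, possibly noisy, prediction.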

Source bias reduction for source-free domain adaptation

SIGNAL IMAGE AND VIDEO PROCESSING, 2024, vol. 18, no. SUPPL 1, pp. 883-893

Authors: Tian, Liang; Ye, Mao; Zhou, Lihua; Wang, Zhenbin. Affiliations: Univ Elect Sci & Technol China Sch Comp Sci & Engn Chengdu 611731 Peoples R China; Sichuan Univ Coll Comp Sci Chengdu 610044 Peoples R China

Source-free domain adaptation (SFDA) mainly addresses the problem of not being able to access the source domain data during the model migration process. Although significant breakthroughs have been achieved, current works have reached a performance ceiling. The key problem is that the source bias of the adapted model is difficult to eliminate. In this work, we propose a novel SFDA method named rectiFication Upon SEmantic information (FUSE). The key idea is to reduce the source bias of the adapted model with the help of a pre-trained vision-language model (e.g., CLIP). Two strategies are adopted. The first, named source bias reduction, restricts the impact of samples with inconsistent predictions between the source and pre-trained models. Samples classified with high confidence by the pre-trained model automatically assume the task of supervision. The second adjusts the pre-trained model to fit the distribution of the target domain. The features that better represent class centers are extracted. Besides these two strategies, we also adopt a pseudo-labeling method to further improve the performance of the adapted model. Experiments on three benchmark datasets show that our method achieves state-of-the-art results.

Keywords: Source-free domain adaptation; vision-language models; Source bias reduction
