Search results - Inner Mongolia University Library

Consistent prompt learning for vision-language models

KNOWLEDGE-BASED SYSTEMS, 2025, vol. 310

Authors: Zhang, Yonggang; Tian, Xinmei. Affiliations: Hong Kong Baptist Univ Dept Comp Sci Hong Kong Peoples R China; Univ Sci & Technol China Natl Engn Lab Brain Inspired Intelligence Technol Hefei 230000 Anhui Peoples R China

Pre-trained vision-language models, such as CLIP, have shown remarkable capabilities across various downstream tasks by learning prompts that consist of context concatenated with a class name; for example, 'a photo of a [dog]' with [dog] as a class prior. Advanced prompt-learning methods typically initialize and optimize the context (for example, 'a photo of a') for downstream task adaptation. However, context optimization typically leads to poor generalization performance over novel classes or datasets sampled from different distributions. This may be attributed to prompt inconsistency; namely, prompts optimized using one image distribution may differ from those optimized using a different image distribution. To improve the generalization performance of optimized prompts, we propose the novel consistent prompt learning (CPL) approach that identifies and addresses the image distribution that causes prompt inconsistency by performing distributional exploration. CPL identifies and mitigates prompt inconsistency in an adversarial training scheme, in which prompt inconsistency is measured as the similarity discrepancy between images and two different prompts. Specifically, CPL calculates two similarities between a query image and two prompts, and determines the prompt inconsistency through the discrepancy between these two similarities. Subsequently, CPL performs distributional exploration to enlarge the discrepancy and uses an adversarial-training approach to mitigate the discrepancy. Consequently, the model predictions are insensitive to prompt changes, and the optimized prompt performs well under various image distributions. Comprehensive experiments show that the proposed CPL method performs favorably on four types of representative tasks across 11 datasets, improving on existing prompt-learning methods and achieving state-of-the-art performance.

Keywords: Prompt learning; Vision-language models; Domain generalization; Domain adaptation
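
As a rough illustration of the inconsistency measure described in the abstract above, the following Python sketch computes two similarities between a query image embedding and two prompt embeddings and takes the gap between them. The use of cosine similarity, the absolute difference, and all names (`cosine`, `prompt_inconsistency`, the list-of-floats embeddings) are assumptions made for illustration; the abstract does not state the paper's exact formulation, and the distributional exploration and adversarial training steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_inconsistency(image_emb, prompt_a_emb, prompt_b_emb):
    """Gap between the image's similarity to prompt A and to prompt B.

    Assumption: the discrepancy is taken as an absolute difference of
    cosine similarities; this is a hypothetical choice, not the paper's.
    """
    sim_a = cosine(image_emb, prompt_a_emb)
    sim_b = cosine(image_emb, prompt_b_emb)
    return abs(sim_a - sim_b)
```

Under this sketch, a value near zero means the two prompts score the image almost identically, which is the behavior the abstract describes as insensitivity to prompt changes.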

H2R Bridge: Transferring vision-language models to few-shot intention meta-perception in human robot collaboration

JOURNAL OF MANUFACTURING SYSTEMS, 2025, vol. 80, pp. 524-535

Authors: Wu, Duidi; Zhao, Qianyou; Fan, Junming; Qi, Jin; Zheng, Pai; Hu, Jie. Affiliations: Shanghai Jiao Tong Univ Sch Mech Engn Shanghai 200240 Peoples R China; State Key Lab Mech Syst & Vibrat Shanghai 200240 Peoples R China; Hong Kong Polytech Univ Dept Ind & Syst Engn Hong Kong Peoples R China

Human-robot collaboration enhances efficiency by enabling robots to work alongside human operators in shared tasks. Accurately understanding human intentions is critical for achieving a high level of collaboration. Existing methods heavily rely on case-specific data and face challenges with new tasks and unseen categories, while often limited data is available under real-world conditions. To bolster the proactive cognitive abilities of collaborative robots, this work introduces a Visual-language-Temporal approach, conceptualizing intent recognition as a multimodal learning problem with HRC-oriented prompts. A large model with prior knowledge is fine-tuned to acquire industrial domain expertise, then enables efficient rapid transfer through few-shot learning in data-scarce scenarios. Comparisons with state-of-the-art methods across various datasets demonstrate the proposed approach achieves new benchmarks. Ablation studies confirm the efficacy of the multimodal framework, and few-shot experiments further underscore meta-perceptual potential. This work addresses the challenges of perceptual data and training costs, building a human-robot bridge (H2R Bridge) for semantic communication, and is expected to facilitate proactive HRC and further integration of large models in industrial applications.

Keywords: Human-robot collaboration; Intent recognition; Few-shot learning; Vision-language models

APOVIS: Automated pixel-level open-vocabulary instance segmentation through integration of pre-trained vision-language models and foundational segmentation models

IMAGE AND VISION COMPUTING, 2025, vol. 154

Authors: Ma, Qiujie; Yang, Shuqi; Zhang, Lijuan; Lan, Qing; Yang, Dongdong; Chen, Honghan; Tan, Ying. Affiliations: Southwest Minzu Univ State Ethn Affairs Commiss Key Lab Comp Syst Chengdu Peoples R China; Hosp Chengdu Univ TCM Chengdu Peoples R China

In recent years, substantial advancements have been achieved in vision-language integration and image segmentation, particularly through the use of pre-trained models like BERT and Vision Transformer (ViT). Within the domain of open-vocabulary instance segmentation (OVIS), accurately identifying an instance's positional information is critical, as it directly influences the precision of subsequent segmentation tasks. However, many existing methods rely on supplementary networks to generate pseudo-labels, such as multiple anchor frames containing object positional information. While these pseudo-labels aid visual language models in recognizing the absolute position of objects, they often compromise the overall efficiency and performance of the OVIS pipeline. In this study, we introduce a novel Automated Pixel-level OVIS (APOVIS) framework aimed at enhancing OVIS. Our approach automatically generates pixel-level annotations by leveraging the matching capabilities of pre-trained vision-language models for image-text pairs alongside a foundational segmentation model that accepts multiple prompts (e.g., points or anchor boxes) to guide the segmentation process. Specifically, our method first utilizes a pre-trained vision-language model to match instances within image-text pairs to identify relative positions. Next, we employ activation maps to visualize the instances, enabling us to extract instance location information and generate pseudo-label prompts that direct the segmentation process. These pseudo-labels then guide the segmentation model to execute pixel-level segmentation, enhancing both the accuracy and generalizability of object segmentation across images. Extensive experimental results demonstrate that our model significantly outperforms current state-of-the-art models in object detection accuracy and pixel-level instance segmentation on the COCO dataset. Additionally, the generalizability of our approach is validated through image-text pair data inference tas

Keywords: Open-vocabulary instance segmentation; Vision-language models; Foundational segmentation models; Object detection; Zero-shot instance segmentation

Advancements in Vision-Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

REMOTE SENSING, 2025, vol. 17, no. 1, p. 162

Authors: Tao, Lijie; Zhang, Haokui; Jing, Haizhao; Liu, Yu; Yan, Dawei; Wei, Guoting; Xue, Xizhe. Affiliations: Northwestern Polytech Univ Sch Cybersecur Int Cooperat Dept Xian 710072 Peoples R China; Zhejiang Lab Inst Intelligent Percept Hangzhou 311500 Peoples R China; Nanjing Univ Sci & Technol Sch Comp Sci & Engn Nanjing 210094 Peoples R China; Tech Univ Munich Dept Aerosp & Geodesy DE-80333 Munich Germany

Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in vision-language models (VLMs) have pushed this enthusiasm to new heights. Differing from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLM, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.

Keywords: vision-language models; remote sensing

Integrating With Multimodal Information for Enhancing Robotic Grasping With Vision-Language Models

IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2025, vol. 22, pp. 13073-13086

Authors: Zhao, Zhou; Zheng, Dongyuan; Chen, Yizi; Luo, Jing; Wang, Yanjun; Huang, Panfeng; Yang, Chenguang. Affiliations: Cent China Normal Univ Sch Comp Sci Wuhan 430079 Peoples R China; Hubei Engn Res Ctr Intelligent Detect & Identifica Wuhan 430205 Peoples R China; ETH Inst Cartog & Geoinformat CH-8093 Zurich Switzerland; Wuhan Univ Technol Sch Automat Wuhan 430070 Peoples R China; Shanghai Jiao Tong Univ Inst Marine Equipment Shanghai 200240 Peoples R China; Northwestern Polytech Univ Sch Astronaut Natl Key Lab Aerosp Flight Dynam Xian 710072 Peoples R China; Northwestern Polytech Univ Res Ctr Intelligent Robot Sch Astronaut Xian 710072 Peoples R China; Univ Liverpool Dept Comp Sci Liverpool L69 3BX England

As robots grow increasingly intelligent and utilize data from various sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has emerged as a critical area of focus. However, the effective combination of different data modalities poses a considerable challenge, especially in complex and dynamic settings where accurate object recognition and manipulation are essential. In this paper, we introduce a novel framework integrating with Multimodal Information for Grasping Synthesis with vision-language models (MIG) designed to improve robotic grasping capabilities. This framework incorporates visual data, textual information, and human-derived prior knowledge. We start by creating target object masks based on this prior knowledge, which are then used to segregate the target objects from their surroundings in the image. Subsequently, we employ language cues to refine the visual representations of these objects. Finally, our system executes precise grasping actions using visual and textual data synthesis, thus facilitating more effective and contextually aware robotic grasping. We carry out experiments using the OCID-VLG dataset. We observe that our methodology surpasses current state-of-the-art (SOTA) techniques, delivering improvements of 9.91% and 5.70% for top-1 and top-5 predictions in grasp accuracy, respectively. Moreover, when applied to the reconstructed Grasp-MultiObject dataset, our approach demonstrates even more substantial enhancements, achieving gains of 17.63% and 22.76% over SOTA methods for top-1 and top-5 predictions, respectively. Note to Practitioners: As robotic systems evolve, the challenge of enabling them to function effectively in complex environments has become increasingly apparent. This paper introduces a solution that integrates multiple sources of data (visual, textual, and human knowledge) to enhance robotic grasping capabilities. The practical problems addressed include the

Keywords: Robots; Grasping; Visualization; Robot sensing systems; Object recognition; Automation; Training; Robot kinematics; Accuracy; Hands; Robot learning; robotic grasping; multimodal fusion; vision-language models; human-computer interaction

Bootstrapping Vision-Language Models for Frequency-Centric Self-Supervised Remote Physiological Measurement

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, pp. 1-22

Authors: Yue, Zijie; Shi, Miaojing; Wang, Hanli; Ding, Shuai; Chen, Qijun; Yang, Shanlin. Affiliations: Tongji Univ Coll Elect & Informat Engn Shanghai Peoples R China; Tongji Univ Shanghai Inst Intelligent Sci & Technol Shanghai Peoples R China; Hefei Univ Technol Sch Management Hefei Peoples R China

Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly based on supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To address this, self-supervised learning has recently gained attention; however, due to the lack of ground-truth PPG signals, its performance is limited. In this paper, we propose a novel frequency-centric self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of frequency-related generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods. Our code will be available at https://***/yuezijie/Bootstrapping-VLM-for-Frequency-centric-Self-supervised-Remote-Physiological-Measurement.

Keywords: Remote physiological measurement; Vision-language models; Frequency-related generative and contrastive learning; Facial video analysis

Pseudo-Prompt Generating in Pre-trained Vision-Language Models for Multi-label Medical Image Classification

7th Chinese Conference on Pattern Recognition and Computer Vision

Authors: Ye, Yaoqin; Zhang, Junjie; Shi, Hongwei. Affiliations: Sichuan Univ Coll Comp Sci Chengdu Peoples R China

ISBN: (print) 9789819784950; 9789819784967

The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods are limited in leveraging extensive pre-trained knowledge from broader datasets, and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in producing class-specific prompts for unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach against leading medical vision-language and multi-label prompt learning methods. The source code is available at https://***/fallingnight/PsPG.

Keywords: Prompt Learning; Medical Image Recognition; Multi-label Classification; Vision-language models

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

7th Chinese Conference on Pattern Recognition and Computer Vision

Authors: Gao, Zhengqing; Ao, Xiang; Zhang, Xu-Yao; Liu, Cheng-Lin. Affiliations: Chinese Acad Sci Inst Automat MAIS Beijing Peoples R China; Univ Chinese Acad Sci Sch Artificial Intelligence Beijing Peoples R China

ISBN: (print) 9789819786190; 9789819786206

Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of the text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research has focused on fine-tuning such models for downstream tasks. Prompt tuning methods have achieved huge improvements by learning context vectors on few-shot data. However, through evaluation under the open-set adaptation setting, with test data including new classes, we find a dilemma: learned prompts generalize worse than hand-crafted prompts. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach, which leverages the maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image at test time. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average considering both base and new classes. The code is available at https://***/gaozhengqing/TTPT.

Keywords: Vision-language models; Test-time adaptation; Prompt tuning
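
To make the idea of a score-weighted, input-conditioned prompt in the abstract above more concrete, here is a small Python sketch. It computes a maximum-softmax score over image-to-class similarities, which is how the MCM score is commonly defined, and then uses it to mix a learned prompt vector with a hand-crafted one. The abstract does not say how the scores are turned into weights, so the convex combination, the temperature default, and the names `mcm_score` and `blend_prompts` are all hypothetical and should not be read as the authors' method.

```python
import math


def mcm_score(similarities, temperature=1.0):
    """Largest softmax probability over temperature-scaled similarities."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    return max(exps) / sum(exps)


def blend_prompts(learned, handcrafted, weight):
    """Per-image mix of two prompt embeddings.

    Assumption: a simple convex combination, with `weight` applied to the
    learned prompt and (1 - weight) to the hand-crafted prompt.
    """
    return [weight * l + (1.0 - weight) * h
            for l, h in zip(learned, handcrafted)]
```

In this sketch a confident match (score close to 1) leans on the learned prompt, while an uncertain one falls back toward the hand-crafted prompt; the real weighting scheme may differ.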

Fine-grained multi-modal prompt learning for vision-language models

NEUROCOMPUTING, 2025, vol. 636

Authors: Liu, Yunfei; Deng, Yunziwei; Liu, Anqi; Liu, Yanan; Li, Shengyang. Affiliations: Chinese Acad Sci Technol & Engn Ctr Space Utilizat Beijing 100094 Peoples R China; Chinese Acad Sci Key Lab Space Utilizat Beijing 100094 Peoples R China; Univ Chinese Acad Sci Beijing 100094 Peoples R China

Recently, advanced pre-trained vision-language models have demonstrated outstanding performance in many downstream tasks via prompt learning. Prompt learning provides task-specific prompt information to exploit beneficial knowledge stored in pre-trained models and promote generalization ability for downstream tasks. However, previous work mainly focused on single-modal prompt tuning (with only one prompt per modality) and salient distinguishing features, which cannot flexibly and dynamically adjust the two representation spaces on downstream tasks and makes it hard to capture subtle discriminative knowledge, resulting in suboptimal solutions. In this work, we propose a novel Fine-Grained Multi-modal Prompt Learning framework, denoted as FGMPL, based on the contrastive language-image pre-trained model (CLIP). To help the pre-trained CLIP model learn and represent more effective features, we design a dual-grained visual prompt scheme to learn global discrepancies as well as specify the subtle discriminative details among visual classes, and transform random vectors with class names in the class-aware text prompt into class-specific discrepancy representations. Moreover, in contrast to previous prompt approaches, we use a shared latent semantic space to generate visual and text prompts to encourage cross-modal interaction. Furthermore, a multimodal prompt tuning evaluator is proposed, which makes the vision and text prompts semantically aligned and mutually enhancing, promoting cross-modal collaborative reasoning to further improve FGMPL. Comprehensive experiments on popular image recognition benchmarks show that our approach has superior generalization and few-shot capabilities.

Keywords: Multi-modal prompt learning; Vision-language models; Fine-grained image recognition

A novel approach with vision-language models for custom e-commerce product listings

Multimedia Tools and Applications, 2025, pp. 1-30

Authors: Huynh Ngoc Nhu, Y.; Nguyen, Quoc-Dung; Kingkan, Cherdsak. Affiliations: Applied Artificial Intelligence Institute (A2I2) Deakin University Burwood Melbourne Australia; School of Engineering and Technology Asian Institute of Technology Khlong Nueng Thailand; Faculty of Mechanical - Electrical and Computer Engineering School of Technology Van Lang University Ho Chi Minh City Viet Nam

This study introduces an innovative approach to enhancing e-commerce product listings through subject-driven text-to-image generation, leveraging advanced AI technologies. Focused on transforming consumer first impressions, it blends personalized visual styles with online retail needs, striking a balance between standardization and customization. The research develops a unique method for image synthesis, improving upon existing AI models such as DreamBooth and Textual Inversion. This work not only equips online sellers with dynamic visual tools but also significantly enriches AI applications in e-commerce, offering both practical and academic contributions to the field. Our proposed model is evaluated based on various numerical and human-based evaluation metrics. The experimental results show that our model achieves a significant performance compared to other baseline models. Our model is further analyzed and discussed under correlation analysis, visual quality assessment, and ablation study to ensure its practical applicability and user satisfaction. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.

Keywords: E-commerce; Generative models; Human-computer interaction; Personalization in retail; Subject-driven text-to-image synthesis; Vision-language models; Visual styles
