
Refine Search Results

Document Type

  • 90 Conference papers
  • 70 Journal articles
  • 1 Dissertation

Holdings

  • 161 Electronic resources
  • 0 Print holdings

Date Distribution

Subject Classification

  • 153 Engineering
    • 124 Computer Science and Technology...
    • 35 Electrical Engineering
    • 15 Software Engineering
    • 12 Information and Communication Engineering
    • 12 Control Science and Engineering
    • 9 Surveying and Mapping Science and Technology
    • 8 Electronic Science and Technology (...
    • 7 Biomedical Engineering (...
    • 4 Mechanical Engineering
    • 4 Instrument Science and Technology
    • 4 Materials Science and Engineering (...
    • 2 Transportation Engineering
    • 1 Aeronautics and Astronautics Science and Tech...
    • 1 Environmental Science and Engineering (...
    • 1 Bioengineering
  • 31 Medicine
    • 22 Clinical Medicine
    • 8 Special Medicine
    • 4 Basic Medicine (...
    • 1 Integrated Traditional Chinese and Western Medicine
    • 1 Medical Technology (...
  • 23 Science
    • 10 Geophysics
    • 8 Physics
    • 6 Chemistry
    • 5 Biology
    • 3 Geography
    • 1 Astronomy
    • 1 Geology
  • 5 Management
    • 4 Management Science and Engineering (...
    • 1 Library, Information and Archival Manage...
  • 1 Philosophy
    • 1 Philosophy
  • 1 Agronomy

Topics

  • 161 vision-language ...
  • 16 large language m...
  • 13 prompt learning
  • 11 clip
  • 11 few-shot learnin...
  • 10 visualization
  • 7 contrastive lear...
  • 6 foundation model...
  • 6 remote sensing
  • 6 training
  • 6 adaptation model...
  • 5 object detection
  • 5 deep learning
  • 5 feature extracti...
  • 5 image classifica...
  • 4 long-tailed reco...
  • 4 computational mo...
  • 4 artificial intel...
  • 4 computer vision
  • 4 domain generaliz...

Institutions

  • 4 chinese acad sci...
  • 4 carnegie mellon ...
  • 4 univ chinese aca...
  • 3 inesc tec porto
  • 3 sichuan univ col...
  • 3 univ chinese aca...
  • 3 zhejiang univ pe...
  • 3 chinese univ hon...
  • 2 shanghai ai lab ...
  • 2 ecole polytech f...
  • 2 tsinghua univ de...
  • 2 harbin inst tech...
  • 2 univ porto fac e...
  • 2 cent south univ ...
  • 2 beijing univ pos...
  • 2 city univ hong k...
  • 2 china univ geosc...
  • 2 sichuan univ col...
  • 2 tech univ munich...
  • 2 westlake univ sc...

Authors

  • 4 banerjee biplab
  • 4 zhang yi
  • 4 jha ankit
  • 3 wang donglin
  • 3 singha mainak
  • 3 ding kun
  • 3 zhang ce
  • 3 tuia devis
  • 2 men aidong
  • 2 li haifeng
  • 2 zhang min
  • 2 liu xuyang
  • 2 chen honggang
  • 2 ma chao
  • 2 guo miaotian
  • 2 yang yang
  • 2 ricci elisa
  • 2 ye mao
  • 2 tian liang
  • 2 patricio cristia...

Language

  • 159 English
  • 1 Other
Search query: "Subject = Vision-language Models"
161 records; showing results 1-10.
Large vision-language models enabled novel objects 6D pose estimation for human-robot collaboration
ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING, 2025, Vol. 95
Authors: Xia, Wanqing; Zheng, Hao; Xu, Weiliang; Xu, Xun. Univ Auckland Dept Mech & Mechatron Engn, Auckland, New Zealand
Six-Degree-of-Freedom (6D) pose estimation is essential for robotic manipulation tasks, especially in human-robot collaboration environments. Recently, 6D pose estimation has been extended from seen objects to novel o...
Consistent prompt learning for vision-language models
KNOWLEDGE-BASED SYSTEMS, 2025, Vol. 310
Authors: Zhang, Yonggang; Tian, Xinmei. Hong Kong Baptist Univ Dept Comp Sci, Hong Kong, Peoples R China; Univ Sci & Technol China Natl Engn Lab Brain Inspired Intelligence Technol, Hefei 230000, Anhui, Peoples R China
Pre-trained vision-language models, such as CLIP, have shown remarkable capabilities across various downstream tasks by learning prompts that consist of context concatenated with a class name; for example, 'a photo...
H2R Bridge: Transferring vision-language models to few-shot intention meta-perception in human robot collaboration
JOURNAL OF MANUFACTURING SYSTEMS, 2025, Vol. 80, pp. 524-535
Authors: Wu, Duidi; Zhao, Qianyou; Fan, Junming; Qi, Jin; Zheng, Pai; Hu, Jie. Shanghai Jiao Tong Univ Sch Mech Engn, Shanghai 200240, Peoples R China; State Key Lab Mech Syst & Vibrat, Shanghai 200240, Peoples R China; Hong Kong Polytech Univ Dept Ind & Syst Engn, Hong Kong, Peoples R China
Human-robot collaboration enhances efficiency by enabling robots to work alongside human operators in shared tasks. Accurately understanding human intentions is critical for achieving a high level of collaboration. Ex...
APOVIS: Automated pixel-level open-vocabulary instance segmentation through integration of pre-trained vision-language models and foundational segmentation models
IMAGE AND VISION COMPUTING, 2025, Vol. 154
Authors: Ma, Qiujie; Yang, Shuqi; Zhang, Lijuan; Lan, Qing; Yang, Dongdong; Chen, Honghan; Tan, Ying. Southwest Minzu Univ State Ethn Affairs Commiss Key Lab Comp Syst, Chengdu, Peoples R China; Hosp Chengdu Univ TCM, Chengdu, Peoples R China
In recent years, substantial advancements have been achieved in vision-language integration and image segmentation, particularly through the use of pre-trained models like BERT and Vision Transformer (ViT). Within the...
Fine-grained multi-modal prompt learning for vision-language models
NEUROCOMPUTING, 2025, Vol. 636
Authors: Liu, Yunfei; Deng, Yunziwei; Liu, Anqi; Liu, Yanan; Li, Shengyang. Chinese Acad Sci Technol & Engn Ctr Space Utilizat, Beijing 100094, Peoples R China; Chinese Acad Sci Key Lab Space Utilizat, Beijing 100094, Peoples R China; Univ Chinese Acad Sci, Beijing 100094, Peoples R China
Recently advanced pre-trained vision-language models have demonstrated outstanding performance in many downstream tasks via prompt learning. Prompt learning provides task-specific prompt information to exploit benefic...
Advancements in Vision-Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
REMOTE SENSING, 2025, Vol. 17, No. 1, p. 162
Authors: Tao, Lijie; Zhang, Haokui; Jing, Haizhao; Liu, Yu; Yan, Dawei; Wei, Guoting; Xue, Xizhe. Northwestern Polytech Univ Sch Cybersecur Int Cooperat Dept, Xian 710072, Peoples R China; Zhejiang Lab Inst Intelligent Percept, Hangzhou 311500, Peoples R China; Nanjing Univ Sci & Technol Sch Comp Sci & Engn, Nanjing 210094, Peoples R China; Tech Univ Munich Dept Aerosp & Geodesy, DE-80333 Munich, Germany
Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in vision-language models (VLMs) have pushed this enthusiasm to new heights. Dif...
Integrating With Multimodal Information for Enhancing Robotic Grasping With Vision-Language Models
IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2025, Vol. 22, pp. 13073-13086
Authors: Zhao, Zhou; Zheng, Dongyuan; Chen, Yizi; Luo, Jing; Wang, Yanjun; Huang, Panfeng; Yang, Chenguang. Cent China Normal Univ Sch Comp Sci, Wuhan 430079, Peoples R China; Hubei Engn Res Ctr Intelligent Detect & Identifica, Wuhan 430205, Peoples R China; ETH Inst Cartog & Geoinformat, CH-8093 Zurich, Switzerland; Wuhan Univ Technol Sch Automat, Wuhan 430070, Peoples R China; Shanghai Jiao Tong Univ Inst Marine Equipment, Shanghai 200240, Peoples R China; Northwestern Polytech Univ Sch Astronaut Natl Key Lab Aerosp Flight Dynam, Xian 710072, Peoples R China; Northwestern Polytech Univ Res Ctr Intelligent Robot Sch Astronaut, Xian 710072, Peoples R China; Univ Liverpool Dept Comp Sci, Liverpool L69 3BX, England
As robots grow increasingly intelligent and utilize data from various sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has ...
IFShip: Interpretable fine-grained ship classification with domain knowledge-enhanced vision-language models
PATTERN RECOGNITION, 2025, Vol. 166
Authors: Guo, Mingning; Wu, Mengwei; Shen, Yuxiang; Li, Haifeng; Tao, Chao. Cent South Univ Sch Geosci & Info Phys, Changsha 410083, Peoples R China
End-to-end interpretation currently dominates the remote sensing fine-grained ship classification (RS-FGSC) task. However, the inference process remains uninterpretable, leading to criticisms of these models as "...
Bootstrapping Vision-Language Models for Frequency-Centric Self-Supervised Remote Physiological Measurement
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, pp. 1-22
Authors: Yue, Zijie; Shi, Miaojing; Wang, Hanli; Ding, Shuai; Chen, Qijun; Yang, Shanlin. Tongji Univ Coll Elect & Informat Engn, Shanghai, Peoples R China; Tongji Univ Shanghai Inst Intelligent Sci & Technol, Shanghai, Peoples R China; Hefei Univ Technol Sch Management, Hefei, Peoples R China
Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly super...
Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction
IEEE TRANSACTIONS ON MULTIMEDIA, 2025, Vol. 27, pp. 2399-2411
Authors: Zhang, Wenyao; Wu, Letian; Zhang, Zequn; Yu, Tao; Ma, Chao; Jin, Xin; Yang, Xiaokang; Zeng, Wenjun. Shanghai Jiao Tong Univ AI Inst MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China; Ningbo Inst Digital Twin Eastern Inst Technol, Ningbo 315200, Peoples R China; Southeast Univ Sch Automat, Nanjing 210096, Peoples R China; Univ Sci & Technol China Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods like prompting, have shown impressive knowledge transferability on new downstream tasks, but they are still prone to be...