Search Results - Inner Mongolia University Library

Ground4Act: Leveraging visual-language model for collaborative pushing and grasping in clutter

IMAGE AND VISION COMPUTING, 2024, Vol. 151

Authors: Yang, Yuxiang; Guo, Jiangtao; Li, Zilong; He, Zhiwei; Zhang, Jing. Affiliations: Hangzhou Dianzi Univ Sch Elect & Informat Hangzhou Peoples R China; Univ Sydney Sch Comp Sci Sydney NSW Australia

The challenge in robotics is to enable robots to transition from visual perception and language understanding to performing tasks such as grasping and assembling objects, bridging the gap between "seeing" and "hearing" and "doing". In this work, we propose Ground4Act, a two-stage approach for collaborative pushing and grasping in clutter using a visual-language model. In the grounding stage, Ground4Act extracts target features from multimodal data via visual grounding. In the action stage, it embeds a collaborative pushing and grasping framework to generate the action's position and direction. Specifically, we propose a DQN-based reinforcement learning pushing policy that uses RGBD images as the state space to determine the push action's pixel-level coordinates and direction. Additionally, a least squares-based linear fitting grasping policy takes the target mask from the grounding stage as input to achieve efficient grasping. Simulations and real-world experiments demonstrate Ground4Act's superior performance. The simulation suite, source code, and trained models will be made publicly available.

Keywords: visual grounding; Collaborative pushing and grasping; Deep reinforcement learning; visual-language model
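
The abstract above mentions a least squares-based linear fit on the target mask but gives no details. The following is only a minimal sketch of one plausible reading, assuming a binary mask given as nested lists and an orthogonal (total) least-squares line through the foreground pixels. The function name `fit_grasp_from_mask` and the choice of the centroid as grasp point are hypothetical, not taken from the paper.

```python
import math


def fit_grasp_from_mask(mask):
    """Fit a line to the foreground pixels of a binary mask.

    Assumption: an orthogonal least-squares fit; the paper's exact
    formulation is not stated in the abstract.
    Returns (center_row, center_col, angle_rad), where angle_rad is the
    orientation of the fitted line measured from the column (x) axis.
    """
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    n = len(pts)
    mean_r = sum(r for r, _ in pts) / n
    mean_c = sum(c for _, c in pts) / n
    # Second-order central moments of the pixel coordinates.
    s_rr = sum((r - mean_r) ** 2 for r, _ in pts)
    s_cc = sum((c - mean_c) ** 2 for _, c in pts)
    s_rc = sum((r - mean_r) * (c - mean_c) for r, c in pts)
    # Direction that minimizes the summed squared orthogonal distances.
    angle = 0.5 * math.atan2(2 * s_rc, s_cc - s_rr)
    return mean_r, mean_c, angle


# Toy mask: a horizontal bar, so the fitted line runs along the columns.
bar = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
center_row, center_col, angle = fit_grasp_from_mask(bar)
```

In such a sketch the gripper would typically close perpendicular to the fitted line, that is, at `angle + math.pi / 2`; that convention is likewise an assumption.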

Constraint embedding for prompt tuning in vision-language pre-trained model

MULTIMEDIA SYSTEMS, 2025, Vol. 31, No. 1, pp. 1-16

Authors: Cheng, Keyang; Wei, Liutao; Tang, Jingfeng; Zhan, Yongzhao. Affiliation: Jiangsu Univ Sch Comp Sci & Commun Engn Zhenjiang 212013 Jiangsu Peoples R China

Prompt tuning, which fine-tunes the feature distributions in pre-trained vision-language (VL) models by adding learnable tokens or contexts into image and text branches, has emerged as a popular method for enhancing task-specific performance. However, this approach may result in overfitting to specific target data distributions, thereby undermining the original generalization capabilities of frozen models such as CLIP. To tackle this issue, a novel framework named constraint embedding for prompt tuning (CEPT) is proposed for optimizing the learnable prompt tokens. To maintain the feature extraction capabilities of the pre-trained CLIP model while extracting relevant data features for downstream tasks, the block consistency constraint (BCC) approach is proposed. This approach adjusts the feature extraction step by ensuring that block-wise embeddings are aligned, thereby preserving the original generalization performance of the pre-trained VL model. Additionally, to achieve a more harmonious distribution of image-text features in the potential space, the distribution constraint (DC) strategy is introduced. This strategy enhances multimodal data feature alignment by evenly dispersing different classes of data features and concentrating the same class of image features within the potential space. Finally, CEPT surpasses the state of the art in base-to-novel generalization, achieving a harmonic mean improvement of over 1.04%. Additionally, for few-shot learning, it demonstrates an average improvement of 1.63% across five few-shot scenarios.

Keywords: visual-language model; Prompt tuning; Block-wise embedding; Potential space; Few-shot learning
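
For readers unfamiliar with the harmonic mean reported in base-to-novel evaluations, the sketch below shows how that score is usually computed from base-class and novel-class accuracy. The function name and the accuracy values are illustrative assumptions, not figures from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy.

    The score is high only when both accuracies are high, so it penalizes
    a model that overfits base classes at the expense of novel ones.
    """
    if base_acc <= 0 or novel_acc <= 0:
        raise ValueError("accuracies must be positive")
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Illustrative values only (not results from the paper).
score = harmonic_mean(80.0, 60.0)
```

With these illustrative inputs the harmonic mean (about 68.57) sits below the arithmetic mean of 70, which reflects the penalty for the gap between base and novel accuracy.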

Contrastive Pretraining for Computational Pathology with visual-language models

22nd IEEE International Symposium on Biomedical Imaging, ISBI 2025

Authors: Zhou, Qifeng; Dang, Thao M.; Guo, Yuzhi; Ma, Hehuan; Zhong, Wenliang; Na, Saiyang; Gao, Jean; Huang, Junzhou. Affiliation: The University of Texas at Arlington Department of Computer Science and Engineering Arlington United States

ISBN: (print) 9798331520526

In computational pathology, effectively capturing visual-language embeddings from extensive pathology image-text pairs has become increasingly crucial for diverse downstream tasks. Although prior studies have fine-tuned models like CLIP using large pathology image-text datasets, these models encounter limitations due to their separate processing of text and images, restricting their ability to capture essential cross-modal relationships critical in pathology. Recent advancements in large language models (LLMs) have led to the development of vision-language models (VLMs) that demonstrate enhanced multimodal capabilities, including stronger language comprehension and reasoning skills compared to CLIP. However, while VLMs show potential for multimodal embedding, previous efforts have primarily focused on text-based tasks, leaving their application to multimodal pathology data largely unexplored. In this work, we introduce a VLM-based framework designed to integrate and align pathology visual-language embeddings within a single model. We validate our framework's effectiveness through cross-modal retrieval on pathology image-caption datasets and zero-shot patch classification across seven pathology image datasets, demonstrating its superiority over CLIP-based models and underscoring its potential for advancing pathology research. © 2025 IEEE.

Keywords: computational pathology; contrastive learning; visual-language model; zero-shot learning
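
Zero-shot patch classification with aligned image and text embeddings is commonly done by picking the class whose text embedding is most similar to the image embedding. The sketch below illustrates that general idea with cosine similarity; the embeddings, class labels, and function names are made-up placeholders, and the paper's actual model and scoring are not described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding best matches the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 2-D embeddings standing in for real encoder outputs (assumption).
text_embs = {"tumor": [1.0, 0.0], "normal": [0.0, 1.0]}
label, scores = zero_shot_classify([0.9, 0.1], text_embs)
```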

Enhancing generalization in camera trap image recognition: Fine-tuning visual language models

NEUROCOMPUTING, 2025, Vol. 634

Authors: Yang, Zihe; Tian, Ye; Wang, Lifeng; Zhang, Junguo. Affiliations: Beijing Forestry Univ Sch Technol Beijing 100083 Peoples R China; Beijing Forestry Univ Key Lab State Forestry & Grassland Adm Forestry Eq Beijing 100083 Peoples R China; Beijing Forestry Univ Res Ctr Biodivers Intelligent Monitoring Beijing 100083 Peoples R China

This study introduces a novel fine-tuning approach for enhancing the generalization capabilities of visual language models in the context of wildlife monitoring, particularly for camera trap image recognition. We introduce Ecological visual language models (Eco-VLMs), models fine-tuned using an ecological subset of the ImageNet1K dataset (ImageNet1K-E), aimed at reducing the reliance on spurious correlations that affect the performance of models like CLIP when applied to specialized domains. By employing text augmentation techniques and expanding species names with rich descriptors, Eco-VLMs are optimized to extract more distinctive features from images, thereby improving their discriminative capabilities for wildlife features. Meanwhile, a random contrastive loss is proposed to improve the diversity of training data and the generalization of Eco-VLMs. The proposed Eco-CLIP and Eco-SigLIP models are rigorously evaluated against various camera trap datasets and demonstrate superior performance, with average F1 scores improved by 4.44% and 3.79% compared to the standard CLIP and SigLIP models. Intrinsic evaluations further confirm that Eco-VLMs have acquired a broader ecological knowledge base, highlighting their enhanced generalization abilities. This research contributes to the field by addressing the limitations of current visual language models in specialized ecological applications and underscores the potential of Eco-VLMs for improving wildlife monitoring efforts.

Keywords: Camera trap; visual-language model; Text augmentation; Contrastive loss

Pixel-level semantic parsing in complex industrial scenarios using large vision-language models

INFORMATION FUSION, 2025, Vol. 116

Authors: Ji, Xiaofeng; Gong, Faming; Wang, Nuanlai; Zhao, Yanpu; Ma, Yuhui; Shi, Zhuang. Affiliations: China Univ Petr East China Coll Comp Sci & Technol Qingdao 266580 Peoples R China; Aerosp Informat Res Inst QiLu Lab 32 Jinan 250100 Peoples R China

The emergence of vision-language models, particularly Contrastive Language-Image Pre-Training (CLIP), has significantly improved the performance of numerous visual tasks, demonstrating notable zero-shot transfer abilities. CLIP's remarkable generalization ability offers substantial innovation potential for smart manufacturing and public safety surveillance, potentially accelerating the advancement of Industry 5.0. However, most current research focuses on public datasets, with limited investigation into complex industrial scenarios. The semantic structures and image qualities of these industrial scenarios differ significantly from the datasets used to train CLIP, presenting challenges for its effectiveness in industrial applications. This paper presents a Context-Aware Masked CLIP (CAM-CLIP) framework for high-performance pixel-level semantic parsing in complex industrial scenarios under few-shot conditions. The framework autonomously detects and identifies objects in industrial scenarios based on textual descriptions, enhancing safety monitoring and anomaly detection. We constructed a dedicated dataset using offshore drilling platforms as a case study and conducted empirical validation. Results demonstrate that CAM-CLIP achieved an 80.7 mIoU in pixel-level semantic parsing of offshore drilling platforms with a limited sample size, outperforming state-of-the-art methods by 8.47 mIoU. This study extends CLIP's applicability to industrial settings and offers a model for future implementations. It advances semantic parsing in industrial scenarios and promotes the development of intelligent, interpretable systems.

Keywords: visual-language model; CAM-CLIP; Semantic segmentation in industrial; Pixel-level semantic parsing; Complex industrial scenarios
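
The mIoU figures quoted in the abstract refer to mean intersection over union. As a reminder of how that metric is typically computed, here is a minimal sketch over flattened label lists. The function name, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for flattened label lists.

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as zero.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class present in pred or target")
    return sum(ious) / len(ious)


# Toy example: four pixels, two classes.
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12; reported scores such as 80.7 are this quantity expressed as a percentage.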

EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion

APPLIED INTELLIGENCE, 2025, Vol. 55, No. 1, pp. 1-16

Authors: Ren, Kai; Hu, Chuanping; Xi, Hao; Li, Yongqiang; Fan, Jinhao; Liu, Lihua. Affiliations: Univ Zhengzhou Sch Elect & Informat Engn Zhengzhou Henan Peoples R China; Zhengzhou Univ Sch Cyber Sci & Engn Zhengzhou Henan Peoples R China; Henan Remote Sensing Inst Zhengzhou Henan Peoples R China

Fine-grained visual features generally require higher image input resolutions, which in turn necessitate a larger parameter count for general visual models to effectively analyze these features. However, the substantial computational demands of larger models present significant challenges to research in this domain. To address these challenges, our research integrates descriptions of fine-grained visual information from images. We propose an innovative Expert method for Describing Image Regions (EDIR) based on knowledge distillation and triple fusion techniques. Our method comprises a Knowledge-Distilled Expert Network (KDEN) and a Triple Information Set Fusion Network (TIFN) that combine global and regional image descriptions in a controlled prompting manner. Unlike existing studies, our approach not only extracts global and regional image features independently but also relates their spatial information. Our EDIR method reduces visual model parameters by 6.7 times compared to CogVLM, improves ImageNet-1K zero-shot detection accuracy by 0.68%, increases the CIDEr score on NoCaps by 1.9 points, and achieves an average improvement of 1.39% in hallucination accuracy. It also increases the average inference frame rate to 32.92 FPS, representing a 5.82-fold improvement over the baseline.

Keywords: Image caption; visual-language model; Fine-grained description; Regional description; Knowledge distillation

Adaptive Face Recognition for Multi-Type Occlusions

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, Vol. 34, No. 11, pp. 11400-11412

Authors: Liu, Yuxi; Luo, Guibo; Weng, Zhenyu; Zhu, Yuesheng. Affiliations: Peking Univ Shenzhen Grad Sch Sch Elect & Comp Engn Shenzhen 518055 Peoples R China; Nanyang Technol Univ Sch Elect & Elect Engn Singapore 639798 Singapore

Due to the prevalence of influenza outbreaks and outdoor scenarios with various obstructing decorations, recognizing faces with occlusions has become a pressing challenge. However, current research mainly focuses on facial recognition with one kind of occlusion and does not provide compatible solutions for different kinds of common occlusions like glasses, sunglasses, and masks. Therefore, this paper proposes an Adaptive Multi-Type Occluded Face Recognition model (AMOFR) to handle multiple occlusion types simultaneously. In AMOFR, a generator is developed to produce diverse occluded face images for training by simulating various occlusion types on unoccluded face images. Subsequently, an occlusion type-based adapter is formulated to address a range of occlusion scenarios, guided by prompts from a visual-language model. To enhance overall performance by leveraging complete facial information, a feature-level knowledge distillation loss function is implemented, facilitating joint learning of unoccluded-face and occluded-face features. Furthermore, a new sunglasses-wearing dataset (CALFW-SUNGLASSES) is generated for more comprehensive testing of AMOFR and for further occlusion recognition research. Experimental results on datasets containing different types of occlusions demonstrate that AMOFR achieves significantly higher accuracy than other advanced face recognition models. The implementation code of AMOFR is available at https://***/LIU-YUXI/Adaptive-Multi-occlusion-Face-Recognition.

Keywords: Face recognition with occlusion; adapter; knowledge distillation; visual-language model
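
A feature-level knowledge distillation loss generally pulls a student's features (here, those of an occluded face) toward a teacher's features (those of the matching unoccluded face). The sketch below uses a plain mean squared error as a stand-in. The abstract does not specify the actual loss, so both the MSE choice and the name `feature_distillation_loss` are assumptions.

```python
def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature vectors.

    Assumption: MSE is used here only as a generic example of a
    feature-level distillation objective.
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    if not student_feat:
        raise ValueError("feature vectors must be non-empty")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


# Toy features: occluded-face (student) vs. unoccluded-face (teacher).
loss = feature_distillation_loss([1.0, 2.0], [1.0, 4.0])
```

In training, a term like this would usually be added to the recognition loss with a weighting factor, so that the occluded-face branch learns features close to those extracted from complete faces.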

Military Image Captioning for Low-Altitude UAV or UGV Perspectives

DRONES, 2024, Vol. 8, No. 9, p. 421

Authors: Pan, Lizhi; Song, Chengtian; Gan, Xiaozheng; Xu, Keyu; Xie, Yue. Affiliations: Beijing Inst Technol Sch Mechatron Engn Beijing 100081 Peoples R China; Sci & Technol Electromech Dynam Control Lab Xian 710065 Peoples R China

Low-altitude unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), which boast high-resolution imaging and agile maneuvering capabilities, are widely utilized in military scenarios and generate a vast amount of image data that can be leveraged for textual intelligence generation to support military decision making. Military image captioning (MilitIC), as a visual-language learning task, provides innovative solutions for military image understanding and intelligence generation. However, the scarcity of military image datasets hinders the advancement of MilitIC methods, especially those based on deep learning. To overcome this limitation, we introduce an open-access benchmark dataset, which was termed the Military Objects in Real Combat (MOCO) dataset. It features real combat images captured from the perspective of low-altitude UAVs or UGVs, along with a comprehensive set of captions. Furthermore, we propose a novel encoder-augmentation-decoder image-captioning architecture with a map augmentation embedding (MAE) mechanism, MAE-MilitIC, which leverages both image and text modalities as a guiding prefix for caption generation and bridges the semantic gap between visual and textual data. The MAE mechanism maps both image and text embeddings onto a semantic subspace constructed by relevant military prompts, and augments the military semantics of the image embeddings with attribute-explicit text embeddings. Finally, we demonstrate through extensive experiments that MAE-MilitIC surpasses existing models in performance on two challenging datasets, which provides strong support for intelligence warfare based on military UAVs and UGVs.

Keywords: unmanned aerial vehicle (UAV); military image captioning; unmanned ground vehicle (UGV); image understanding; visual-language model

Real-time object detection method with single-domain generalization based on YOLOv8

JOURNAL OF REAL-TIME IMAGE PROCESSING, 2024, Vol. 21, No. 6, pp. 1-12

Authors: Zhou, Yipeng; Qian, Huaming. Affiliation: Harbin Engn Univ Coll Intelligent Syst Sci & Engn Harbin 150001 Peoples R China

The prevailing models for object detection are often beset by a dearth of generalizability across domains. Specifically, while these models may perform exceptionally well on a given dataset, their efficacy can plummet when confronted with novel domains that lie beyond their training purview. The single-domain generalization methods based on Faster R-CNN are constrained by the underlying strategies, which not only exhibit slow speeds and suboptimal accuracy levels but also demonstrate inadequate generalization. This paper proposes a Complementary Pseudo Multi-domain Generation Method based on YOLOv8 (Y-CPMG). The methodology fortifies the generalization prowess by fabricating a spectrum of pseudo domain information within the feature space. To elaborate, we harness the capabilities of a pre-trained visual-language model, leveraging textual prompts to extract domain-specific feature enhancements. These enhancements are then amalgamated with the original images to simulate multi-domain scenarios. Building on this foundation, we delve deeper into the nuances of the real world by introducing normalization perturbation (NP) to uncover a variety of latent domain styles. This approach addresses potential limitations in visual-language models when emulating scenes of diverse styles. Empirical evaluations conducted across a spectrum of weather-diverse public datasets have demonstrated that the proposed method achieves a marked enhancement in performance for the task of domain generalization object detection. With an input dimension of 3 × 608 × …

Keywords: Object detection; Single domain generalization; Pseudo multi-domain; visual-language model; Normalization perturbation
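
Normalization perturbation is usually described as randomly rescaling the per-channel mean and standard deviation of shallow feature maps to imitate unseen styles. The sketch below shows that idea on a single channel given as a flat list. The formula y = alpha * (x - mean) + beta * mean, the Gaussian sampling of the factors, the noise scale, and the function names are assumptions for illustration; the abstract does not state how Y-CPMG configures NP.

```python
import random


def perturb_channel(values, alpha, beta):
    """Rescale one channel's statistics.

    alpha scales the standard deviation and beta scales the mean:
    y = alpha * (x - mean) + beta * mean.
    """
    if not values:
        raise ValueError("channel must be non-empty")
    mean = sum(values) / len(values)
    return [alpha * (x - mean) + beta * mean for x in values]


def normalization_perturbation(values, noise_std=0.75, rng=random):
    """Sample the two factors around 1 and perturb the channel.

    Assumption: Gaussian factors with the given noise_std (hypothetical default).
    """
    alpha = rng.gauss(1.0, noise_std)
    beta = rng.gauss(1.0, noise_std)
    return perturb_channel(values, alpha, beta)


# Deterministic example: double the spread, keep the mean.
perturbed = perturb_channel([1.0, 2.0, 3.0], alpha=2.0, beta=1.0)
```

With `alpha=2.0` and `beta=1.0` the channel `[1.0, 2.0, 3.0]` becomes `[0.0, 2.0, 4.0]`: the mean stays at 2 while the spread doubles, which is the kind of style change the perturbation is meant to simulate.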

Advance One-Shot Multispectral Instance Detection With Text's Supervision

IEEE SIGNAL PROCESSING LETTERS, 2024, Vol. 31, pp. 1605-1609

Authors: Feng, Chen; Cheng, Jian; Xiao, Yang; Cao, Zhiguo. Affiliation: Huazhong Univ Sci & Technol Sch Artificial Intelligence & Automat Wuhan 430074 Peoples R China

One key issue within one-shot multispectral instance detection (OMID) is to extract features with strong instance discriminative power, domain adaptation capability, and instance-wise generality. Existing methods generally rely only on visual clues. Comparatively, text is advantageous due to its structured information, high semantics, and low noise. Inspired by the recent emergence of large image-text datasets and breakthrough visual-language models, we propose to advance OMID with text's supervision for the first time. To this end, our key idea is to establish the relationship between one-shot multispectral instances and ImageNet class labels via the CLIP model. In particular, we retrieve, rank, and ensemble the text features of ImageNet labels using the instance image feature as the query. The resulting instance image and text features are then realigned and fused to obtain a multimodal feature. Meanwhile, a multispectral contrastive learning approach is proposed to drive multimodal feature learning for OMID. Note that all the procedures are trained end-to-end in a unified network. In this way, the instance discriminative power and domain adaptation capability are facilitated simultaneously. Experiments on two tailored multispectral instance detection datasets verify the effectiveness of our method.

Keywords: Feature extraction; Task analysis; Training; visualization; Representation learning; Semantics; Noise; Multispectral instance detection; text supervision; visual-language model; multimodal feature learning
