Search Results - Inner Mongolia University Library

15th Indian Conference on Computer Vision, Graphics and Image Processing

Authors: Solanki, Bhupendra; Nair, Ashwin R.; Singha, Mainak; Mukhopadhyay, Souradeep; Jha, Ankit; Banerjee, Biplab. Affiliations: Indian Inst Technol, Mumbai, Maharashtra, India; Indian Inst Sci Educ & Res Thiruvananthapuram, Thiruvananthapuram, Kerala, India; Indian Inst Sci, Bangalore, Karnataka, India; LNM Inst Informat Technol, Jaipur, Rajasthan, India

ISBN: (print) 9798400710759

Generalized Category Discovery (GCD) aims to cluster unlabeled images into known and novel categories using labeled images from known classes. To address the challenge of transferring features from known to unknown classes while mitigating model bias, we introduce GraphVL, a novel approach for vision-language modeling in GCD, leveraging CLIP. Our method integrates a graph convolutional network (GCN) with CLIP's text encoder to preserve class neighborhood structure. We also employ a lightweight visual projector for image data, ensuring discriminative features through margin-based contrastive losses for image-text mapping. This neighborhood preservation criterion effectively regulates the semantic space, making it less sensitive to known classes. Additionally, we learn textual prompts from known classes and align them to create a more contextually meaningful semantic feature space for the GCN layer using a contextual similarity loss. Finally, we represent unlabeled samples based on their semantic distance to class prompts from the GCN, enabling semi-supervised clustering for class discovery and minimizing errors. Our experiments on seven benchmark datasets consistently demonstrate the superiority of GraphVL when integrated with the CLIP backbone.

Keywords: Class Discovery; Contrastive Learning; Graph Convolutional Networks; Unsupervised learning; vision-language models
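
The abstract above says a graph convolutional network is applied to class-level text features so that class neighborhood structure is preserved, but it does not give the layer definition. As a rough illustration only, the sketch below shows the standard symmetric-normalized graph convolution (self-loops added, ReLU activation). The function names, the adjacency matrix, and the choice of this particular layer are assumptions, not details taken from GraphVL.

```python
from math import sqrt


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step over class nodes (hypothetical helper).

    adj:    n x n adjacency between classes (assumed symmetric, non-negative)
    feats:  n x d class features, e.g. text embeddings of class prompts
    weight: d x k projection matrix (learnable in a real model)
    """
    n = len(adj)
    # Add self-loops so each class keeps its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
    norm = [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in out]
```

With two mutually connected classes and an identity projection, each output row becomes the average of the two input features, which is the neighborhood-smoothing effect a graph layer of this kind provides.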

Cross-Modal Concept Learning and Inference for Vision-Language Models

NEUROCOMPUTING, 2024, Vol. 583

Authors: Zhang, Yi; Zhang, Ce; Tang, Yushun; He, Zhihai. Affiliations: Harbin Inst Technol, Harbin 150001, Peoples R China; Southern Univ Sci & Technol, Shenzhen 518055, Peoples R China; Pengcheng Lab, Shenzhen 518000, Peoples R China

Large-scale pre-trained vision-language models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this image-scale matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance of the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.

Keywords: vision-language models; Concept learning; Few-shot learning; Domain generalization

Reflectance estimation for proximity sensing by vision-language models: utilizing distributional semantics for low-level cognition in robotics

ADVANCED ROBOTICS, 2024, Vol. 38, No. 18, pp. 1287-1306

Authors: Osada, Masashi; Ricardez, Gustavo A. Garcia; Suzuki, Yosuke; Taniguchi, Tadahiro. Affiliations: Ritsumeikan Univ, Coll Informat Sci & Engn, Kusatsu, Japan; Kanazawa Univ, Coll Sci & Engn, Kanazawa, Japan

Large language models (LLMs) and vision-language models (VLMs) have been increasingly used in robotics for high-level cognition, but their use for low-level cognition, such as interpreting sensor information, remains underexplored. In robotic grasping, estimating the reflectance of objects is crucial for successful grasping, as it significantly impacts the distance measured by proximity sensors. We investigate whether LLMs can estimate reflectance from object names alone, leveraging the embedded human knowledge in distributional semantics, and whether the latent structure of language in VLMs positively affects image-based reflectance estimation. In this paper, we verify that (1) LLMs such as GPT-3.5 and GPT-4 can estimate an object's reflectance using only text as input; and (2) VLMs such as CLIP can increase their generalization capabilities in reflectance estimation from images. Our experiments show that GPT-4 can estimate an object's reflectance using only text input with a mean error of 14.7%, lower than the image-only ResNet. Moreover, CLIP achieved the lowest mean error of 11.8%, while GPT-3.5 obtained a competitive 19.9% compared to ResNet's 17.8%. These results suggest that the distributional semantics in LLMs and VLMs increases their generalization capabilities, and the knowledge acquired by VLMs benefits from the latent structure of language.

Keywords: vision-language models; large language models; low-level cognition in robotics; proximity sensors; reflectance estimation

Reflex-based open-vocabulary navigation without prior knowledge using omnidirectional camera and multiple vision-language models

ADVANCED ROBOTICS, 2024, Vol. 38, No. 18, pp. 1307-1317

Authors: Kawaharazuka, Kento; Obinata, Yoshiki; Kanazawa, Naoaki; Tsukamoto, Naoto; Okada, Kei; Inaba, Masayuki. Affiliations: Univ Tokyo, Grad Sch Informat Sci & Technol, Dept Mechanoinformat, Tokyo, Japan

Various robot navigation methods have been developed, but they are mainly based on Simultaneous Localization and Mapping (SLAM), reinforcement learning, etc., which require prior map construction or learning. In this study, we consider the simplest method that does not require any map construction or learning, and execute open-vocabulary navigation of robots without any prior knowledge. We applied an omnidirectional camera and pre-trained vision-language models to the robot. The omnidirectional camera provides a uniform view of the surroundings, thus eliminating the need for complicated exploratory behaviors including trajectory generation. By applying multiple pre-trained vision-language models to this omnidirectional image and incorporating reflective behaviors, we show that navigation becomes simple and does not require any prior setup. Interesting properties and limitations of our method are discussed based on experiments with the mobile robot Fetch.

Keywords: Reflex-based control; omnidirectional camera; vision-language models

Few-Shot Image Classification of Crop Diseases Based on Vision-Language Models

SENSORS, 2024, Vol. 24, No. 18, p. 6109

Authors: Zhou, Yueyue; Yan, Hongping; Ding, Kun; Cai, Tingting; Zhang, Yan. Affiliations: China Univ Geosci, Sch Informat Engn, Beijing 100083, Peoples R China; Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China

Accurate crop disease classification is crucial for ensuring food security and enhancing agricultural productivity. However, the existing crop disease classification algorithms primarily focus on a single image modality and typically require a large number of samples. Our research counters these issues by using pre-trained vision-language models (VLMs), which enhance the multimodal synergy for better crop disease classification than the traditional unimodal approaches. Firstly, we apply the multimodal model Qwen-VL to generate meticulous textual descriptions for representative disease images selected through clustering from the training set, which will serve as prompt text for generating classifier weights. Compared to solely using the language model for prompt text generation, this approach better captures and conveys fine-grained and image-specific information, thereby enhancing the prompt quality. Secondly, we integrate cross-attention and SE (Squeeze-and-Excitation) attention into the training-free mode VLCD (vision-language model for Crop Disease classification) and the training-required mode VLCD-T (VLCD-Training), respectively, for prompt text processing, enhancing the classifier weights by emphasizing the key text features. The experimental outcomes conclusively prove our method's heightened classification effectiveness in few-shot crop disease scenarios, tackling the data limitations and intricate disease recognition issues. It offers a pragmatic tool for agricultural pathology and reinforces the smart farming surveillance infrastructure.

Keywords: few-shot learning; crop disease classification; vision-language models; attention mechanisms

Compositional Kronecker Context Optimization for Vision-Language Models

NEUROCOMPUTING, 2024, Vol. 608

Authors: Ding, Kun; Li, Xiaohui; Yu, Qiang; Wang, Ying; Zhang, Haojian; Xiang, Shiming. Affiliations: Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China; Chinese Acad Sci, Inst Automat, Engn Lab Intelligent Ind Vis, Beijing, Peoples R China; Chinese Acad Sci, Inst Automat, Res Ctr Aerosp Informat, Beijing, Peoples R China

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning context with satisfactory base-to-new, domain and cross-task generalization ability simultaneously while adapting to new tasks is a challenge. To tackle such a challenge, existing methods mainly exploit knowledge distillation with auxiliary text data written by human experts. However, we instead explore a new technique route by structuring the prompts without resorting to extra text data. As a result, we obtain a new lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying Kronecker product on several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data and the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that, compared with existing methods, CK-CoOp can not only achieve comparable or even better performance under base-to-new, domain and cross-task generalization evaluation without the help of auxiliary text data, but also has the merits of fewer learnable parameters and efficient training and inference speed.

Keywords: vision-language models; Prompt tuning; Structural context optimization; Few-shot image recognition
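
The abstract above describes context vectors built as linear combinations of dictionary base vectors, where each base vector has a fixed part and a learnable part formed by a Kronecker product of small matrices. The following is a minimal sketch of that composition. How the two parts are merged (summed here), all shapes, and all function names are assumptions for illustration and are not taken from the CK-CoOp paper.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def context_vectors(coeffs, fixed_dict, factors):
    """Compose context vectors from a dictionary (hypothetical helper).

    coeffs:     num_context x dict_size mixing coefficients (learnable)
    fixed_dict: dict_size x dim non-learnable base vectors
    factors:    small learnable matrices whose Kronecker product
                has shape dict_size x dim
    """
    learnable = factors[0]
    for factor in factors[1:]:
        learnable = kron(learnable, factor)
    # Assumption: the fixed and learnable components are added together.
    dictionary = [
        [f + l for f, l in zip(f_row, l_row)]
        for f_row, l_row in zip(fixed_dict, learnable)
    ]
    # Each context vector is a linear combination of dictionary rows.
    return matmul(coeffs, dictionary)
```

The parameter saving comes from the factorization: a Kronecker product of an a x b matrix and a c x d matrix yields an (a*c) x (b*d) matrix while storing only a*b + c*d values.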

Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval

ACM Transactions on Multimedia Computing, Communications, and Applications, 1000

Authors: Taichi Nishimura; Shota Nakada; Masayoshi Kondo. Affiliations: LY Corporation, Japan

In this paper, we propose an efficient and high-performance method for partially relevant video retrieval. The method aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an \(N\times N\) grid layout. This reduces the number of visual encodings to \(\frac{1}{N^{2}}\) and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size \(N\), image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning and hybrid QASIR that combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) the fine-tuning QASIR enhances VLMs to learn super images effectively, and (2) the hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.

Keywords: vision-language models; Super images; Partially relevant video retrieval
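
The abstract above describes super images as video frames rearranged into an N x N grid, so that one visual encoding covers N^2 frames. A minimal sketch of that rearrangement follows. Frames are modeled as plain 2D lists of pixel values, and the row-major tile order and zero-padding of an incomplete final grid are assumptions, as is the function name.

```python
def make_super_images(frames, n):
    """Tile consecutive frames into n x n grids (hypothetical helper).

    frames: list of frames, each a 2D list with H rows and W columns
    n:      grid size; each super image holds n * n frames
    Returns a list of super images, each with n*H rows and n*W columns.
    """
    if not frames:
        return []
    h, w = len(frames[0]), len(frames[0][0])
    blank = [[0] * w for _ in range(h)]  # assumption: pad with zero frames
    per_image = n * n
    supers = []
    for start in range(0, len(frames), per_image):
        chunk = frames[start:start + per_image]
        chunk = chunk + [blank] * (per_image - len(chunk))
        rows = []
        for grid_row in range(n):
            tiles = chunk[grid_row * n:(grid_row + 1) * n]
            # Concatenate the y-th pixel row of each tile in this grid row.
            for y in range(h):
                row = []
                for tile in tiles:
                    row.extend(tile[y])
                rows.append(row)
        supers.append(rows)
    return supers
```

For example, 8 frames with `n = 2` yield 2 super images, so the visual backbone runs 2 times instead of 8.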

A Slim Prompt-Averaged Consistency prompt learning for vision-language model

KNOWLEDGE-BASED SYSTEMS, 2025, Vol. 310

Authors: He, Siyu; Wang, Shengsheng; Long, Sifan. Affiliations: Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China; Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, Minist Educ, Changchun 130012, Peoples R China

Recent advancements in prompt tuning have enhanced the adaptation of large pre-trained models to target tasks. However, existing methods struggle to establish an effective balance between task-specific knowledge and generalizable knowledge during tuning, skewing too heavily towards one at the expense of the other. To address this issue, we propose a Slim Prompt-Averaged Consistency (SPAC) prompt learning approach. Specifically, SPAC introduces a temporal ensembling-based averaged-prompt module and leverages a multifaceted consistency mechanism to ensure knowledge consistency under the guidance of averaged-prompt. Additionally, SPAC employs the contrastive learning strategy to further enhance the learning of target task representations based on positive and negative sample pairs. Furthermore, considering the notable resource consumption of existing prompt formats, we refine the prompt format, significantly reducing resource consumption during training and inference. Extensive experiments on 11 benchmark datasets demonstrate that our approach outperforms others in few-shot prompt learning transfer tasks, including base-to-novel generalization and cross-dataset transfer, while consuming fewer resources.

Keywords: Prompt learning; Generalization; Few-shot learning; vision-language models
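
The abstract above mentions a temporal-ensembling averaged prompt that guides a consistency mechanism, without giving the update rule. One common way to realize temporal ensembling is an exponential moving average over training steps, sketched below together with a simple squared-error consistency term. The EMA form, the momentum value, the penalty, and the function names are all assumptions and may differ from what SPAC actually uses.

```python
def update_averaged_prompt(avg_prompt, current_prompt, momentum=0.9):
    """Exponential moving average of a prompt vector (hypothetical helper).

    avg_prompt:     running average from earlier training steps
    current_prompt: prompt parameters after the latest update
    momentum:       weight kept on the running average (assumed value)
    """
    return [
        momentum * a + (1.0 - momentum) * c
        for a, c in zip(avg_prompt, current_prompt)
    ]


def consistency_penalty(avg_prompt, current_prompt):
    """Mean squared distance between the current and averaged prompt."""
    diffs = [(a - c) ** 2 for a, c in zip(avg_prompt, current_prompt)]
    return sum(diffs) / len(diffs)
```

In a training loop of this kind, the penalty would be added to the task loss so the current prompt does not drift far from the averaged one, which changes more slowly.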

Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, Vol. 133, No. 6, pp. 3481-3518

Authors: Liu, Kangcheng; Wang, Chaoqun; Han, Xiaodong; Liu, Yong-Jin; Chen, Baoquan. Affiliations: Hunan Univ, Coll Elect & Informat Engn, Changsha, Peoples R China; CALTECH, Div Engn & Appl Sci, Pasadena, CA 91125, USA; Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China; Shandong Univ, Sch Control Sci & Engn, Jinan, Peoples R China; Minjiang Univ, Sch Control Engn, Fuzhou, Peoples R China; Peking Univ, Sch Artificial Intelligence, Beijing, Peoples R China

Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Our proposed approach enhances both the local coherence as well as the overall feature discrimination. Moreover, we have designed the linguistic foreground-aware regional point sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: (https://***/KangchengLiu/FAC_Foreground_Aware_Contrast).

Keywords: Self-supervised learning; vision-language models; Representation learning; Data-efficient learning; 3D vision

A prompt-free vision-language model for environmental perception in automated driving systems

PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART D-JOURNAL OF AUTOMOBILE ENGINEERING, 2025

Authors: Lin, Chaojun; Shi, Ying; Hu, Qin; Zhang, Lei. Affiliations: Wuhan Univ Technol, Dept Automat, 122 Luoshi Rd, Wuhan 430070, Peoples R China

Environmental perception is a critical component of automated driving systems. Advancing environmental perception algorithms toward applications in open-world road scenarios is a current research trend. However, traditional single-modality detectors exhibit weak generalization capabilities, while modern vision-language models require manual input text prompts to function properly. Since conventional methods fail to meet the requirements of open-world environmental perception, this study proposes a prompt-free vision-language model to address this problem. The proposed model first introduces a prompt memory pretraining strategy, which stores text prompt memory by pretraining the model on large-scale object detection datasets. Subsequently, a dynamic prompt generation module is proposed to identify foreground categories within the vision modality input. It queries the prior text prompt memory to generate the corresponding text prompts. Extensive experiments demonstrate that the proposed method significantly outperforms conventional detectors and modern vision-language models, even when relying solely on vision modality input. The code and trained models are available at https://***/unbelieboomboom/prompt_free_G_DINO.

Keywords: Automated driving systems; environmental perception; object detection; vision-language models; dynamic prompt generation
