Search Results - Inner Mongolia University Library

Recent Advances of Foundation Language Models-based Continual Learning: A Survey

ACM COMPUTING SURVEYS 2025, Vol. 57, No. 5, pp. 1-38

Authors: Yang, Yutao; Zhou, Jie; Ding, Xuan wen; Huai, Tianyu; Liu, Shunyu; Chen, Qin; Xie, Yuan; He, Liang. Affiliation: East China Normal Univ Sch Comp Sci & Technol Shanghai Peoples R China

Recently, foundation language models (LMs) have marked significant achievements in the domains of natural language processing and computer vision. Unlike traditional neural network models, foundation LMs obtain a great ability for transfer learning by acquiring rich common sense knowledge through pre-training on extensive unsupervised datasets with a vast number of parameters. Despite these capabilities, LMs still struggle with catastrophic forgetting, hindering their ability to learn continuously like humans. To address this, continual learning (CL) methodologies have been introduced, allowing LMs to adapt to new tasks while retaining learned knowledge. However, a systematic taxonomy of existing approaches and a comparison of their performance are still lacking. In this article, we delve into a comprehensive review, summarization, and classification of the existing literature on CL-based approaches applied to foundation language models, such as pre-trained language models, large language models, and vision-language models. We divide these studies into offline and online CL, which consist of traditional methods, parameter-efficient-based methods, instruction tuning-based methods and continual pre-training methods. Additionally, we outline the typical datasets and metrics employed in CL research and provide a detailed analysis of the challenges and future work for LMs-based continual learning.

Keywords: Continual learning; foundation language models; pre-trained language models; large language models; vision-language models; survey
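
The abstract above mentions catastrophic forgetting and the metrics used in continual learning research without naming them. As a minimal sketch, assuming the two metrics widely used in the field (average accuracy and backward transfer) rather than the survey's own list, the following shows how forgetting is usually quantified. The accuracy matrix is made-up illustrative data, and the function names are hypothetical.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks, measured after training on the final task."""
    final = acc[-1]
    return sum(final) / len(final)


def backward_transfer(acc: List[List[float]]) -> float:
    """Mean change in accuracy on earlier tasks between the moment each task
    was learned and the end of training. Negative values indicate forgetting."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    deltas = [acc[-1][i] - acc[i][i] for i in range(num_tasks - 1)]
    return sum(deltas) / len(deltas)


# Illustrative (made-up) numbers, not results from the survey:
# acc_matrix[t][i] is the accuracy on task i after training on task t.
acc_matrix = [
    [0.90, 0.10, 0.10],
    [0.70, 0.88, 0.12],
    [0.60, 0.80, 0.86],
]

if __name__ == "__main__":
    print("average accuracy:", average_accuracy(acc_matrix))
    print("backward transfer:", backward_transfer(acc_matrix))
```

In this toy matrix the accuracy on the first task drops from 0.90 to 0.60 as later tasks are learned, so the backward transfer comes out negative, which is the signature of catastrophic forgetting.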

Progressive Visual Prompt Learning with Contrastive Feature Re-formation

INTERNATIONAL JOURNAL OF COMPUTER VISION 2025, Vol. 133, No. 2, pp. 511-526

Authors: Xu, Chen; Zhu, Yuhan; Shen, Haocheng; Chen, Boheng; Liao, Yixuan; Chen, Xiaoxin; Wang, Limin. Affiliations: Nanjing Univ State Key Lab Novel Software Technol Nanjing Peoples R China; VIVO AI Lab Shenzhen Peoples R China; Shanghai AI Lab Shanghai Peoples R China

Prompt learning has recently emerged as a compelling alternative to the traditional fine-tuning paradigm for adapting pre-trained vision-language (V-L) models to downstream tasks. Drawing inspiration from the success of prompt learning in Natural Language Processing, pioneering research efforts have concentrated predominantly on text-based prompting strategies. By contrast, visual prompting within V-L models remains underexploited. Directly transposing existing visual prompt methods, tailored for Vision Transformers (ViT), into V-L models often leads to suboptimal performance or training instability. To mitigate these challenges, in this paper, we propose a novel structure called Progressive Visual Prompt (ProVP). This design aims to strengthen the interaction among prompts from adjacent layers, thereby enabling more effective propagation of image embeddings to deeper layers in a nearly instance-specific manner. Additionally, to address the common issue of generalization deterioration during the training of learnable prompts, we further introduce a contrastive feature re-formation technique for visual prompt learning. This method prevents significant deviations of prompted visual features from the fixed CLIP visual feature distribution, which improves generalization. Combining ProVP and the contrastive feature re-formation technique, our proposed method, ProVP-Ref, significantly stabilizes the training process and enhances both the adaptation and generalization capabilities of visual prompt learning in V-L models. To demonstrate the efficacy of our approach, we evaluate ProVP-Ref across 11 image datasets, achieving state-of-the-art results on 7 of these datasets in both few-shot learning and base-to-new generalization settings. To the best of our knowledge, this is the first study to showcase the exceptional performance of visual prompts in V-L models compared to previous text prompting methods i

Keywords: vision-language models; Prompt learning; Transfer learning
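
The abstract above describes keeping prompted visual features close to the fixed CLIP feature distribution through a contrastive objective. The sketch below illustrates that general idea with an InfoNCE-style loss in plain Python. It is an assumption-laden illustration, not the ProVP-Ref loss itself: pairing each prompted feature with the frozen feature at the same index, the temperature value, and all function names are hypothetical choices made for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_reformation_loss(
    prompted: Sequence[Vector],
    frozen: Sequence[Vector],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss: prompted[i] should be most similar to frozen[i]
    (the positive) and less similar to every other frozen feature."""
    total = 0.0
    for i, p in enumerate(prompted):
        logits = [cosine(p, f) / temperature for f in frozen]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_norm - logits[i]  # cross-entropy with target index i
    return total / len(prompted)


if __name__ == "__main__":
    frozen_feats = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]   # stays near the frozen features
    drifted = [[0.1, 0.9], [0.9, 0.1]]   # has moved toward the wrong feature
    print(contrastive_reformation_loss(aligned, frozen_feats))
    print(contrastive_reformation_loss(drifted, frozen_feats))
```

Features that stay near their frozen counterparts yield a lower loss than features that have drifted, so minimizing a term of this kind alongside the task loss discourages the deviation the abstract refers to.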

Image-text aggregation for open-vocabulary semantic segmentation

NEUROCOMPUTING 2025, Vol. 630

Authors: Cheng, Shengyang; Huang, Jianyong; Wang, Xiaodong; Huang, Lei; Wei, Zhiqiang. Affiliations: Ocean Univ China Fac Informat Sci & Engn Qingdao 266100 Peoples R China; Qingdao Educ Equipment & Informat Technol Ctr Qingdao 266022 Peoples R China

Existing works on open-vocabulary semantic segmentation explore utilizing large-scale vision-language models. Recent methods have relied mostly on visual features while treating text features as supporting components. Our method explores the potential of image-related text information. This paper proposes a novel open-vocabulary semantic segmentation method based on image-text aggregation (ITA). We design a dominant category unearthing module to mine text features strongly correlated with the image, facilitating the aggregation of image-text information. Additionally, we employ a detail enhancement module to mitigate the problem of losing image details. Moreover, our ITA accomplishes single-stage semantic segmentation via the image-text aggregation module. It outperforms the two-stage methods, which have the inherent challenges of inaccurate cropped image recognition and multiple forwarding, and thus demonstrates better performance and efficiency. Experimental results on multiple widely used benchmark datasets demonstrate that our ITA achieves excellent segmentation performance compared with the state-of-the-art open-vocabulary semantic segmentation methods. The code is available at https://***/huanglab-research/ITA.

Keywords: Open-vocabulary; Semantic segmentation; vision-language models

Image-text feature learning for unsupervised visible-infrared person re-identification

IMAGE AND VISION COMPUTING 2025, Vol. 158

Authors: Guo, Jifeng; Pang, Zhiqi. Affiliations: Guilin Univ Aerosp Technol Coll Comp Sci & Engn Guilin 541000 Guangxi Peoples R China; Harbin Inst Technol Fac Comp Harbin 150001 Heilongjiang Peoples R China

Visible-infrared person re-identification (VI-ReID) focuses on matching infrared and visible images of the same person. To reduce labeling costs, unsupervised VI-ReID (UVI-ReID) methods typically use clustering algorithms to generate pseudo-labels and iteratively optimize the model based on these pseudo-labels. Although existing UVI-ReID methods have achieved promising performance, they often overlook the effectiveness of text semantics in inter-modality matching and modality-invariant feature learning. In this paper, we propose an image-text feature learning (ITFL) method, which not only leverages text semantics to enhance intra-modality identity-related learning but also incorporates text semantics into inter-modality matching and modality-invariant feature learning. Specifically, ITFL first performs modality-aware feature learning to generate pseudo-labels within each modality. Then, ITFL employs modality-invariant text modeling (MTM) to learn a text feature for each cluster in the visible modality, and utilizes inter-modality dual-semantics matching (IDM) to match inter-modality positive clusters. To obtain modality-invariant and identity-related image features, we not only introduce a cross-modality contrastive loss in ITFL to mitigate the impact of modality gaps, but also develop a text semantic consistency loss to further promote modality-invariant feature learning. Extensive experimental results on VI-ReID datasets demonstrate that ITFL not only outperforms existing unsupervised methods but also competes with some supervised approaches.

Keywords: Unsupervised learning; Visible-infrared person re-identification; Contrastive learning; vision-language models

VideoQA in the Era of LLMs: An Empirical Study

INTERNATIONAL JOURNAL OF COMPUTER VISION 2025, pp. 1-24

Authors: Xiao, Junbin; Huang, Nanxin; Qin, Hangyu; Li, Dongyang; Li, Yicong; Zhu, Fengbin; Tao, Zhulin; Yu, Jianxing; Lin, Liang; Chua, Tat-Seng; Yao, Angela. Affiliations: Natl Univ Singapore Singapore Singapore; Commun Univ China Beijing Peoples R China; Sun Yat Sen Univ Guangzhou Peoples R China

Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively: they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs' QA capability under standard conditions yet highlight their severe deficiency in robustness and interpretability, suggesting an urgent need for rationales in Video-LLM development.

Keywords: Video question answering; Multimodal LLMs; Video analysis; Empirical study; vision-language models

Synth-CLIP: Synthetic data make CLIP generalize better in data-limited scenarios

NEURAL NETWORKS 2025, Vol. 184, p. 107083

Authors: Liu, Mushui; He, Weijie; Lu, Ziqian; Dan, Jun; Yu, Yunlong; Li, Yingming; Li, Xi; Han, Jungong. Affiliations: Zhejiang Univ Coll Informat Sci & Elect Engn Hangzhou Peoples R China; Zhejiang Univ Sch Aeronaut & Astronaut Hangzhou Peoples R China; Zhejiang Univ Coll Comp Sci & Technol Hangzhou Peoples R China; Univ Sheffield Dept Comp Sci Sheffield England

Prompt learning is a powerful technique that enables the transfer of vision-language models (VLMs) like CLIP to downstream tasks. However, when the prompt-based methods are fine-tuned solely on base classes, they often struggle to generalize to novel classes lacking visual samples during training, especially in scenarios with limited training data. To address this challenge, we propose an innovative approach called Synth-CLIP that leverages synthetic data to enhance CLIP's generalization capability for base classes and the general capability for novel classes. Synth-CLIP fine-tunes the pre-trained CLIP model by seamlessly integrating tailored prompts that are both domain-specific and domain-shared, specifically designed for visual samples, reorganizing visual features from real and synthetic domains into the semantic space. This approach efficiently expands the data pool and enriches category diversity. Moreover, based on semantic structure consistency, we introduce a cross-domain feature alignment loss to match the real and synthetic samples in the feature embedding space. By aligning the visual and semantic distributions, the synthetic data from base and novel classes provide crucial discriminative information, enabling the model to rebalance the decision boundaries even in the absence of real novel visual samples. Experimental results on three model generalization tasks demonstrate that our method performs very competitively across various benchmarks. Notably, Synth-CLIP outperforms the recent competitor PromptSRC by an average improvement of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.

Keywords: vision-language models; CLIP; Generalization; Synthetic samples; Data limited scenarios

Open-Vocabulary Action Localization With Iterative Visual Prompting

IEEE ACCESS 2025, Vol. 13, pp. 56908-56917

Authors: Wake, Naoki; Kanehira, Atsushi; Sasabuchi, Kazuhiro; Takamatsu, Jun; Ikeuchi, Katsushi. Affiliation: Microsoft Appl Robot Res Redmond WA 98052 USA

Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at (https://***/VLM-Video-Action-Localization/).

Keywords: Location awareness; Pipelines; Timing; Robots; Visualization; Iterative methods; Hands; Benchmark testing; Indexes; Grasping; Open-vocabulary action localization; vision-language models; large language models; GPT; action localization
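
The iterative narrowing described in the abstract above (sample frames, let a VLM pick the frame closest to an action boundary, then resample in a smaller window around that frame) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the VLM call is replaced by a hypothetical `choose` callback, the image concatenation and frame-index labeling step is omitted, and the sample count, round count, and shrink factor are illustrative defaults.

```python
from typing import Callable, List


def sample_indices(lo: int, hi: int, k: int) -> List[int]:
    """Return up to k evenly spaced frame indices in [lo, hi], without duplicates."""
    if hi <= lo:
        return [lo]
    if k < 2:
        return [(lo + hi) // 2]
    step = (hi - lo) / (k - 1)
    return sorted({round(lo + i * step) for i in range(k)})


def narrow_to_frame(
    num_frames: int,
    choose: Callable[[List[int]], int],  # hypothetical stand-in for the VLM query
    k: int = 8,            # frames shown per round (assumed)
    rounds: int = 4,       # number of narrowing iterations (assumed)
    shrink: float = 0.5,   # window size ratio between rounds (assumed)
) -> int:
    """Iteratively shrink the sampling window around the frame picked by `choose`."""
    lo, hi = 0, num_frames - 1
    best = (lo + hi) // 2
    for _ in range(rounds):
        candidates = sample_indices(lo, hi, k)
        best = choose(candidates)
        half = max(1, int((hi - lo) * shrink / 2))
        lo = max(0, best - half)
        hi = min(num_frames - 1, best + half)
    return best


def nearest_to(target: int) -> Callable[[List[int]], int]:
    """Mock selector used in place of a real VLM: picks the candidate nearest to target."""
    return lambda candidates: min(candidates, key=lambda i: abs(i - target))


if __name__ == "__main__":
    # Pretend the action starts at frame 437 of a 1000-frame video.
    print(narrow_to_frame(1000, nearest_to(437), k=8, rounds=6))
```

Each round costs one selector query regardless of video length, which is why a window-narrowing scheme suits models that cannot ingest a long video at once. Running the routine twice, once for the start and once for the end of the action, would give both temporal boundaries.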

RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP

COMPUTER VISION AND IMAGE UNDERSTANDING 2025, Vol. 251

Authors: Jha, Ankit; Singha, Mainak; Bhattacharya, Avigyan; Banerjee, Biplab. Affiliations: LNM Inst Informat Technol Jaipur 302031 India; Indian Inst Technol Mumbai 400076 India

Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges perceptibly. Along with this, it is observed that CLIP's vision encoder fails to generalize well on puzzled or corrupted RS images. In response, we propose a novel solution utilizing self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS3Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of SSL tasks and textual features. Empirical findings demonstrate that RS3Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least 1.3% in domain and class generalization tasks.

Keywords: Prompt Learning; Remote Sensing; Self-supervised Learning; vision-language models

Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

INTERNATIONAL JOURNAL OF COMPUTER VISION 2025, pp. 1-24

Authors: Cui, Shuang; Li, Yi; Li, Jiangmeng; Tang, Xiongxin; Su, Bing; Xu, Fanjiang; Xiong, Hui. Affiliations: Chinese Acad Sci Natl Key Lab Space Integrated Informat Syst Inst Software Beijing Peoples R China; Univ Chinese Acad Sci Beijing Peoples R China; Renmin Univ China Gaoling Sch Artificial Intelligence Beijing Key Lab Big Data Management & Anal Methods Beijing Peoples R China; Hong Kong Univ Sci & Technol Guangzhou Thrust Artificial Intelligence Guangzhou Peoples R China; Hong Kong Univ Sci & Technol Dept Comp Sci & Engn Guangzhou Peoples R China

Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inferences. In this work, we gauge the intrinsic reason behind the performance degradation, which is identified as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which primarily rely on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains only requiring unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.

Keywords: Continual test-time adaptation; Single image defocus deblurring; Causality; vision-language models
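
The abstract above notes that traditional CTTA methods rely mainly on entropy minimization, which fits poorly with pixel-level regression. A minimal sketch of that objective, assuming a classifier that outputs one logit per class, shows why: the quantity being minimized is the Shannon entropy of a predicted class distribution, and a deblurring network that regresses pixel values produces no such distribution. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax prediction.
    Entropy-minimization test-time adaptation updates the model so that
    this value decreases on unlabeled target inputs."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(prediction_entropy([0.0, 0.0, 0.0, 0.0]))   # uncertain prediction
    print(prediction_entropy([10.0, 0.0, 0.0, 0.0]))  # confident prediction
```

A uniform prediction has the maximum entropy for its number of classes, while a confident prediction has entropy near zero, so minimizing this term pushes a classifier toward confident outputs on the target domain.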

LAMARS: Large Language Model-Based Anticipation Mechanism Acceleration in Real-Time Robotic Systems

IEEE ACCESS 2025, Vol. 13, pp. 3864-3880

Authors: Gao, Yifang; Luo, Wei; Wang, Xuye; Zhang, Shunshun; Goh, Patrick. Affiliations: Univ Sains Malaysia Sch Elect & Elect Engn Nibong Tebal 14300 Penang Malaysia; Beijing Jiaotong Univ Sch Elect Engn Beijing 100044 Peoples R China; Univ Sains Malaysia Sch Pharmaceut Sci Gelugor 11800 Penang Malaysia; Guangxi Univ Sci & Technol Sch Automat Liuzhou 545006 Peoples R China

Large language models (LLMs) have assumed an increasingly crucial role in robotic systems because of their ability to leverage the extensive knowledge they possess in robotic inference and task handling. Although LLMs offer significant potential, their integration into robotic systems poses substantial challenges, particularly with regard to computational efficiency and latency. To address this challenge, this study presents LAMARS, an LLM-based anticipation mechanism designed to accelerate real-time robotic systems. LAMARS leverages the predictive power and zero-shot capabilities of LLMs combined with an anticipation mechanism and vision-language processing to position a robot in advance for upcoming tasks. This reduces latency and optimizes path planning without requiring expensive training data. Our evaluations in a realistic simulation environment and with a variation of the RLBench dataset demonstrated that LAMARS achieved an average success rate of 0.79 and improved efficiency by up to 52.4% compared to existing methods, significantly lowering path planning costs. These results indicate that LAMARS effectively accelerates directive execution, making it a promising solution to minimize delays in real-time robotic systems.

Keywords: Robot kinematics; Real-time systems; Hidden Markov models; Visualization; Costs; Data models; Planning; Adaptation models; Predictive models; Vectors; Human-robot interaction; large language models; latency reduction; vision-language models; path planning
