Search Results - Inner Mongolia University Library

A vision-language model with multi-granular knowledge fusion in medical imaging

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2025, Vol. 28, No. 1, pp. 1-21

Authors: Chen, Kai; Li, Yunxin; Zhu, Xiwen; Zhang, Wentai; Hu, Baotian. Affiliations: Huaqiao Univ Sch Comp Sci & Technol 668 Jimei Ave Xiamen 361021 Fujian Peoples R China; Harbin Inst Technol Shenzhen Sch Comp Sci & Technol 6 Pingshan 1st Rd Nanshan Dist Shenzhen 518055 Guangdong Peoples R China; Peking Univ First Hosp Dept Thorac Surg 8 Xishiku St Beijing 100034 Peoples R China

The rapid expansion of radiological imaging data has placed a significant burden on radiologists, increasing the risk of diagnostic errors. Vision-language models offer a promising solution to alleviate this workload and improve diagnostic accuracy within the medical imaging domain. However, most current models rely solely on training data to activate general-purpose performance, which often results in inadequate understanding and generation of high-quality outputs in complex and specialized medical scenarios due to insufficient domain knowledge. To address this limitation, we propose a vision-language model with Multi-Granular Knowledge Fusion (MGKF) that integrates diverse sources of knowledge to enhance performance across medical imaging tasks. Our model dynamically incorporates multi-granular knowledge, including medical entities, their definitions, and retrieved auxiliary knowledge. We improve the semantic alignment of visual and textual information through fine-tuning, introduce a pre-generation mechanism to incorporate this multi-granular knowledge, and ultimately enhance the model's ability to apply medical knowledge during inference. Experimental results across multiple medical imaging tasks, including Medical Report Generation, Medical Image Captioning, and Medical Visual Question Answering, demonstrate the effectiveness of the proposed MGKF model. This work provides valuable insights into the integration of specialized knowledge in medical imaging and contributes to reducing diagnostic errors.

Keywords: Medical imaging; Vision-language model; Multi-granular knowledge fusion

Joint feature extraction and alignment in object tracking with vision-language model

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, Vol. 152

Authors: Zhu, Hong; Lu, Qingyang; Xue, Lei; Yuan, Guanglin; Zhang, Kaihua. Affiliations: Natl Univ Def Technol Coll Elect Engn Hefei 230037 Peoples R China; Army Artillery & Air Def Acad PLA Hefei 230031 Peoples R China; Anhui Key Lab Polarizat Imaging Detect Technol Hefei 230031 Peoples R China; Nanjing Univ Informat Sci & Technol Coll Comp Sci Nanjing 210044 Peoples R China

Vision-language tracking is an emerging research topic that focuses on locating the target object in a video sequence using its language description. The main challenge is to model the correspondence between the vision-language input references and the test image, that is, vision-language fusion learning. Nevertheless, current vision-language trackers still follow the conventional framework of vision-only tracking, overlooking the heterogeneity between vision and language. To tackle this problem, we present a novel framework for vision-language tracking that combines joint feature extraction, alignment, and interaction. Before fusion, we first perform joint feature extraction and modality alignment in a vision-language embedding space learned by our proposed Adapter-equipped model, so as to obtain a semantically unified feature representation. The model, named A-CLIP, integrates lightweight Adapters into CLIP (Contrastive Language-Image Pre-training) and is an efficient way of transferring the large-scale foundation model to the downstream tracking task. It inherits the remarkable generalization ability of the large-scale model, providing a solution to the challenge of limited training data in vision-language tracking. Further, a transformer-based deep fusion is designed to model the multi-source correlations, highlighting the localization-relevant cues for accurate reasoning. Our proposed method has been extensively evaluated on three benchmark datasets, and the quantitative and qualitative results demonstrate its superior performance compared to state-of-the-art methods.
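The abstract does not specify how the Adapters in A-CLIP are built. As a rough illustration of the general idea of a lightweight adapter added to a frozen backbone, the sketch below uses a common residual bottleneck form, x + W_up · ReLU(W_down · x). The class name `BottleneckAdapter`, the dimensions, and the zero initialization of the up-projection are all assumptions for illustration, not the authors' design.

```python
import random


def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up * relu(W_down * x).

    Only the two small projection matrices would be trained; the backbone
    feature x is passed through unchanged and then corrected.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random values.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero-initialized (an assumption)
        # so the adapter starts out as the identity mapping.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(matvec(self.w_down, x))
        delta = matvec(self.w_up, hidden)
        # Residual connection keeps the frozen backbone feature intact.
        return [xi + di for xi, di in zip(x, delta)]
```

Because the up-projection starts at zero in this sketch, a freshly created adapter returns its input unchanged, so inserting it does not disturb the pre-trained features before training begins.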

Keywords: Vision-language tracking; Joint feature extraction and alignment; Vision-language model; Transformer

Situation classification of living environment by daily life support robot using pre-trained large-scale vision-language model

ADVANCED ROBOTICS, 2025, Vol. 39, No. 7, pp. 323-337

Authors: Obinata, Yoshiki; Kawaharazuka, Kento; Kanazawa, Naoaki; Yamaguchi, Naoya; Tsukamoto, Naoto; Yanokura, Iori; Kitagawa, Shingo; Okada, Kei; Inaba, Masayuki. Affiliations: Univ Tokyo Grad Sch Informat Sci & Technol Dept Mechanoinformat Bunkyo Ku Tokyo Japan

Various conditions exist in individual daily life environments. It is important for a daily life support robot to observe states in the daily life environment and perform tasks depending on the living environment. Today, pre-trained vision-language models have been developed and are good at the general interpretation of images. With these backgrounds, we propose a method to classify situations in real daily life environments for situation-aware task execution using the pre-trained vision-language model. Our classifier requires no additional training and is robust to minor pose changes of objects and robots. In our experiments, we have successfully clustered a variety of situations, ranging from object situations to human actions, and executed tasks based on the situation by mapping cluster results to tasks.

Keywords: vision-language model; daily life support; computer vision

A Dual-State-Based Surface Anomaly Detection Model for Rail Transit Trains Using Vision-Language Model

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2025, Vol. 74

Authors: Lei, Kaiyan; Qi, Zhiquan. Affiliations: Univ Chinese Acad Sci Sch Comp Sci & Technol Beijing 101408 Peoples R China; Chinese Acad Sci Res Ctr Fictitious Econ & Data Sci Beijing 100190 Peoples R China

Anomaly detection on the surface of rail transit train bodies (RTTB-AD) is quite challenging due to the scarcity of anomalies, the complexity and variability of the detection environment, and the exceptionally high identification rate required in practical applications. This article proposes a novel differential-based anomaly detection model (DSE-AD) for the surface of rail train bodies based on a vision-language model. It utilizes the differences between historical and current images of the same position on the same train type to achieve anomaly localization, while addressing the interference of nonanomalous changes caused by the environment. Specifically, we first propose a normal-abnormal dual-state contrast prompt suitable for rail trains, and align the image features at a fine-grained level with the prompt features from the pretrained encoder to obtain a task-specific dual-state feature representation. Next, we propose the dual-state difference enhancement (DSDE) module, which utilizes a learnable difference attention matrix to enhance the anomaly-specific dual-state information, allowing the model to focus on the anomaly semantics. Finally, an anomaly highlight module (AHM) is designed for the inference process to reduce nonanomalous predictions by improving the discrimination of abnormal features. Experiments show that DSE-AD is able to adapt to the complex and variable detection environment and outperforms other methods in both same-domain and cross-domain detection, especially for unknown anomalies. It also shows robustness to interfering changes between the historical and current images, as well as faster convergence and independence from the scale of the pretrained model.

Keywords: Dual-state features (DSFs); rail transit train; surface anomaly detection; vision-language model; visual inspection

MammoVLM: A generative large vision-language model for mammography-related diagnostic assistance

INFORMATION FUSION, 2025, Vol. 118

Authors: Cao, Zhenjie; Deng, Zhuo; Ma, Jie; Hu, Jintao; Ma, Lan. Affiliations: Tsinghua Univ Shenzhen Int Grad Sch Shenzhen 518055 Peoples R China; AI Lab Pingan Tech Shenzhen Peoples R China; Chinese Univ Hong Kong Dept Anat & Cellular Pathol Hong Kong Peoples R China; Shenzhen Peoples Hosp Radiol Dept Shenzhen 518020 Peoples R China

Inspired by the recent success of large language models (LLMs) in the general domain, many large multimodal models, such as vision-language models, have been developed to tackle problems across modalities. In the realm of breast cancer, which is now the most deadly cancer worldwide, mammography serves as the primary screening approach for early detection. There is a practical need for patients to have a diagnostic assistant for their follow-up Q&A regarding their mammography screening. We believe large vision-language models have great potential to address this need. However, applying off-the-shelf large models directly in medical scenarios normally provides unsatisfactory results. In this work, we present MammoVLM, a large vision-language model to assist patients with problems related to mammograms. MammoVLM has a sparse visual-MoE module that attends to different encoders based on the densities of the input image. Besides, we build a novel projection module, UMiCon, that leverages unimodal and multimodal contrastive learning training strategies to improve the alignment between visual and textual features. GLM-4 9B, an open-source LLM, is attached after previous multimodal modules to generate answers after supervised fine-tuning. We build our own dataset with 33,630 mammogram studies with diagnostic reports from 30,495 patients. MammoVLM has shown extraordinary potential in multi-round interactive dialogues. Our experimental results show that it has not only beaten other leading VLMs but also shows a professional capability similar to that of a junior radiologist.

Keywords: Mammogram; Multimodal foundation model; Vision-language model; Breast cancer; Medical Q&A; Diagnostic assistance

HiE-VL: A Large Vision-Language Model with Hierarchical Adapter for Handwritten Mathematical Expression Recognition

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Guo, Hong-Yu; Yin, Fei; Xu, Jian; Liu, Cheng-Lin. Affiliations: School of Artificial Intelligence University of Chinese Academy of Sciences Beijing China; State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation of Chinese Academy of Sciences Beijing China

ISBN: (print) 9798350368741

Large vision-language models (LVLMs) have shown impressive capabilities across various domains, but existing LVLMs have limited performance in dense perception and structured learning problems, such as Handwritten Mathematical Expression Recognition (HMER). The primary challenges stem from the complexity of formula images comprising multiple symbols and complicated inter-symbol relationships. This poses difficulties to LVLMs with locality insensitive visual encoders and structure-agnostic vision-language projectors. To overcome these challenges, we propose HiE-VL, the first LVLM for HMER containing: (1) a primitive-aware high-resolution visual encoder, (2) a hierarchical adapter, (3) a math-context enhanced large language model (LLM). Specifically, the adopted visual encoder allows locating and recognizing symbols in complex formula images. The hierarchical adapter functions as a vision-language projector to progressively capture primitive and structure information for facilitating expression decoding. The whole model is optimized in a two-stage training pipeline. In experiments on two benchmark datasets of HMER, our model achieves significantly higher performance than existing LVLMs like GPT-4V and state-of-the-art HMER models. Our codes are available at https://***/guohy17/HiE-VL. © 2025 IEEE.

Keywords: handwritten mathematical expression recognition; instruction tuning; pre-training; vision-language model

Context-aware prompt learning for test-time vision recognition with frozen vision-language model

PATTERN RECOGNITION, 2025, Vol. 162

Authors: Yin, Junhui; Zhang, Xinyu; Wu, Lin; Wang, Xiaojie. Affiliations: Beijing Univ Posts & Telecommun Beijing Peoples R China; Univ Adelaide Adelaide Australia; Swansea Univ Swansea Wales

Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization capabilities across various downstream tasks. However, their performance significantly degrades when test inputs exhibit different distributions. In this paper, we explore the concept of test-time prompt tuning (TTPT), which facilitates the adaptation of the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose Context-aware Prompt Learning (CaPL) for test-time visual recognition tasks, which empowers a pre-trained vision-language model with labeled examples as context information on downstream tasks. Specifically, CaPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, CaPL employs an efficient language-to-vision translator to explore the textual prior information for visual prompt learning. Further, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across different modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompt. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets. The code is available at: https://***/yjh576/CaPL.

Keywords: In-context learning; Prompt learning; Vision-language model; Vision recognition; Test-time adaptation

Controlling vision-language model for enhancing image restoration

IMAGE AND VISION COMPUTING, 2025, Vol. 158

Authors: Shao, Mingwen; Liu, Weihan; Meng, Lingzhuang; Wan, Yecong. Affiliations: China Univ Petr East China Qingdao Inst Software Coll Comp Sci & Technol State Key Lab Chem Safety Qingdao 266580 Peoples R China

Restoring low-quality images to their original high-quality state remains a significant challenge due to inherent uncertainties, particularly in blind image restoration scenarios where the nature of degradation is unknown. Despite recent advances, many restoration techniques still grapple with robustness and adaptability across diverse degradation conditions. In this paper, we introduce an approach to augment the restoration model by exploiting the robust prior features of CLIP, a large-scale vision-language model, to enhance its proficiency in handling a broader spectrum of degradation tasks. We integrate the robust priors from CLIP into the pre-trained image restoration model via cross-attention mechanisms, and we design a Prior Adapter to modulate these features, thereby enhancing the model's restoration performance. Additionally, we introduce an innovative prompt learning framework that harnesses CLIP's multimodal alignment capabilities to fine-tune pre-trained restoration models. Furthermore, we utilize CLIP's contrastive loss to ensure that the restored images align more closely with the prompts of clean images in CLIP's latent space, thereby improving the quality of the restoration. Through comprehensive experiments, we demonstrate the effectiveness and robustness of our method, showcasing its superior adaptability to a wide array of degradation tasks. Our findings emphasize the potential of integrating vision-language models such as CLIP to advance the cutting-edge in image restoration.
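The abstract mentions using CLIP's contrastive loss to pull restored images toward clean-image prompts in the latent space, without giving the formula. One plausible reading is a softmax cross-entropy over temperature-scaled cosine similarities between an image embedding and a set of prompt embeddings. The sketch below assumes that form; the function name `prompt_contrastive_loss`, the temperature of 0.07, and the toy 2-D embeddings are illustrative assumptions, and no real CLIP model is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prompt_contrastive_loss(image_emb, prompt_embs, target_index,
                            temperature=0.07):
    """Cross-entropy of the target prompt under softmax(similarity / T).

    image_emb:    embedding of the restored image (stand-in for a CLIP
                  image feature; hypothetical in this sketch).
    prompt_embs:  embeddings of text prompts, e.g. "clean" and "degraded".
    target_index: index of the prompt the image should align with.
    """
    logits = [cosine(image_emb, p) / temperature for p in prompt_embs]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target_index]
```

Under this assumed form, the loss is small when the image embedding lies closer to the clean prompt than to the others, and grows as the embedding drifts toward a degraded prompt.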

Keywords: Image restoration; Prompt learning; Vision-language model; CLIP

Multimodal multitask similarity learning for vision language model on radiological images and reports

NEUROCOMPUTING, 2025, Vol. 636

Authors: Yu, Yang; Wang, Jiahao; Liu, Weide; Mien, Ivan Ho; Krishnaswamy, Pavitra; Yang, Xulei; Cheng, Jun. Affiliations: ASTAR Machine Intellect Dept Inst Infocomm Res I R 2 1 Fusionopolis Way21-01 Connexis Singapore 138632 Singapore; Natl Univ Singapore NUS Mechanobiol Inst MBI 5A Engn Dr 1 Singapore 117411 Singapore; ASTAR Inst Infocomm Res I 2 R Healthcare & Medtech Div 1 Fusionopolis Way21-01 Connexis Singapore 138632 Singapore; Natl Neurosci Inst NNI Dept Neuroradiol 11 Jln Tan Tock Seng Singapore 308433 Singapore

In recent years, large-scale vision-language models (VLM) have shown promise in learning general representations for various medical image analysis tasks. However, current medical VLM methods typically employ contrastive learning approaches that have limited ability to capture nuanced yet crucial medical knowledge, particularly within similar medical images, and do not explicitly consider the uneven and complementary semantic information contained in different modalities. To address these challenges, we propose a novel Multimodal Multitask Similarity Learning (M2SL) method that learns joint representations of image-text pairs and captures the relational similarity between different modalities via a coupling network. Our method also notably leverages the rich information in the text inputs to construct a knowledge-driven semantic similarity matrix as the supervision signal. We conduct extensive experiments for cross-modal retrieval and zero-shot classification tasks on radiological images and reports and demonstrate substantial performance gains over existing methods. Our method also accommodates low-resource settings with limited training data availability and has significant implications for enhancing VLM development.

Keywords: Diagnostic imaging; Multimodal representation learning; Vision-language model

Continual Learning of Image Classes With Language Guidance From a Vision-Language Model

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, Vol. 34, No. 12, pp. 13152-13163

Authors: Zhang, Wentao; Huang, Yujun; Zhang, Weizhuo; Zhang, Tong; Lao, Qicheng; Yu, Yue; Zheng, Wei-Shi; Wang, Ruixuan. Affiliations: Sun Yat Sen Univ Sch Comp Sci & Engn Guangzhou 510275 Peoples R China; Minist Educ Key Lab Machine Intelligence & Adv Comp Guangzhou 510275 Peoples R China; Peng Cheng Lab Shenzhen 518066 Peoples R China; Beijing Univ Posts & Telecommun Sch Artificial Intelligence Beijing 100876 Peoples R China

Current deep learning models often catastrophically forget the knowledge of old classes when continually learning new ones. State-of-the-art approaches to continual learning of image classes often require retaining a small subset of old data to partly alleviate the catastrophic forgetting issue, and their performance would be degraded sharply when no old data can be stored due to privacy or safety concerns. In this study, inspired by human learning of visual knowledge with the effective help of language, we propose a novel continual learning framework based on a pre-trained vision-language model (VLM) without retaining any old data. Rich prior knowledge of each new image class is effectively encoded by the frozen text encoder of the VLM, which is then used to guide the learning of new image classes. The output space of the frozen text encoder is unchanged over the whole process of continual learning, through which image representations of different classes become comparable during model inference even when the image classes are learned at different times. Extensive empirical evaluations on multiple image classification datasets under various settings confirm the superior performance of our method over existing ones. The source code is available at https://***/Fatflower/CIL_LG_VLM/.
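The point that a frozen text encoder keeps classes comparable across learning stages can be illustrated with a toy classifier that scores an image embedding against one fixed text embedding per class. This is only a sketch of that idea: the class name `TextAnchoredClassifier`, the hand-written 3-D embeddings, and the use of cosine similarity are assumptions, and the authors' actual training procedure is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TextAnchoredClassifier:
    """Toy classifier whose class anchors are fixed text embeddings.

    In the setting described above, each anchor would come from a frozen
    text encoder; here the anchors are supplied by hand (hypothetical).
    """

    def __init__(self):
        self.class_text_embs = {}

    def add_class(self, name, text_emb):
        # Adding a new class never modifies the anchors of old classes,
        # since the text encoder that produced them is assumed frozen.
        self.class_text_embs[name] = list(text_emb)

    def predict(self, image_emb):
        # Pick the class whose text anchor is most similar to the image.
        return max(self.class_text_embs,
                   key=lambda name: cosine(image_emb,
                                           self.class_text_embs[name]))
```

Because old anchors stay untouched when a class is added, an image that matched an earlier class before the update is scored against the same anchor afterwards, which is what keeps classes learned at different times directly comparable.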

Keywords: Visualization; Continuing education; Adaptation models; Data models; Task analysis; Knowledge engineering; Semantics; Continual learning; Vision-language model; Language guidance
