Search Results - Inner Mongolia University Library

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

JMIR FORMATIVE RESEARCH, 2024, Vol. 8, No. 1, p. e32690

Authors: Ji, Jia; Hou, Yongshuai; Chen, Xinyu; Pan, Youcheng; Xiang, Yang. Affiliations: Shenzhen Inst Informat Technol, Shenzhen, Peoples R China; Peng Cheng Lab, 2 Xingke 1st St, Shenzhen 518000, Peoples R China; Harbin Inst Technol, Shenzhen, Peoples R China

Background: The automatic generation of radiology reports, which seeks to create a free-text description from a clinical radiograph, is emerging as a pivotal intersection between clinical medicine and artificial intelligence. Leveraging natural language processing technologies can accelerate report creation, enhancing health care quality and standardization. However, most existing studies have not yet fully tapped into the combined potential of advanced language and vision models. Objective: The purpose of this study was to explore the integration of pretrained vision-language models into radiology report generation. This would enable the vision-language model to automatically convert clinical images into high-quality textual reports. Methods: In our research, we introduced a radiology report generation model named ClinicalBLIP, building upon the foundational InstructBLIP model and refining it using clinical image-to-text data sets. A multistage fine-tuning approach via low-rank adaptation was proposed to deepen the semantic comprehension of the visual encoder and the large language model for clinical imagery. Furthermore, prior knowledge was integrated through prompt learning to enhance the precision of the reports generated. Experiments were conducted on both the IU X-RAY and MIMIC-CXR data sets, with ClinicalBLIP compared to several leading methods. Results: Experimental results revealed that ClinicalBLIP obtained superior scores of 0.570/0.365 and 0.534/0.313 on the IU X-RAY/MIMIC-CXR test sets for the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluations, respectively. This performance notably surpasses that of existing state-of-the-art methods. Further evaluations confirmed the effectiveness of the multistage fine-tuning and the integration of prior information, leading to substantial improvements. Conclusions: The proposed ClinicalBLIP model demonstrated robus

Keywords: clinical image; radiology report generation; vision-language model; multistage fine-tuning; prior knowledge
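
The abstract above mentions fine-tuning via low-rank adaptation (LoRA). As a rough illustration of that general technique only, and not of the authors' ClinicalBLIP code, the sketch below shows how a frozen weight matrix W is combined with a trainable low-rank update, W + (alpha / rank) * B A. All names, sizes, and values here are assumptions chosen for illustration.

```python
# Minimal low-rank adaptation (LoRA) sketch in pure Python.
# Hypothetical names and toy sizes; not ClinicalBLIP's actual configuration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be updated during fine-tuning.
    """
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Number of parameters in the low-rank factors A and B."""
    return rank * (d_out + d_in)


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2x3 weight
a = [[0.5, 0.5, 0.5]]                   # 1x3 factor (rank 1)
b_zero = [[0.0], [0.0]]                 # 2x1 factor, zero-initialised
b_trained = [[1.0], [2.0]]              # 2x1 factor after some hypothetical training

print(lora_effective_weight(w, a, b_zero, alpha=2, rank=1))
print(lora_effective_weight(w, a, b_trained, alpha=2, rank=1))
print(trainable_params(768, 768, 8), "vs full", 768 * 768)
```

With B initialised to zero the adapted layer starts out identical to the pretrained one, and the low-rank factors hold far fewer parameters than the full matrix, which is what makes this style of fine-tuning comparatively cheap.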

CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

SENSORS, 2024, Vol. 24, No. 22, p. 7371

Authors: Zhao, Xiaoqing; Xu, Miaomiao; Silamu, Wushour; Li, Yanbing. Affiliation: Xinjiang Univ, Coll Comp Sci & Technol, 777 Huarui St, Urumqi 830017, Peoples R China

This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-language models have become the foundation for various downstream tasks. CLIP exhibits robustness in recognizing both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. As research in scene text recognition requires substantial linguistic knowledge, we introduce the pre-trained vision-language model CLIP and the pre-trained language model Llama. Our approach builds upon CLIP's image and text encoders, featuring two encoder-decoder branches: one visual branch and one cross-modal branch. The visual branch provides initial predictions based on image features, while the cross-modal branch refines these predictions by addressing the differences between image features and textual semantics. We incorporate the large language model Llama2-7B in the cross-modal branch to assist in correcting erroneous predictions generated by the decoder. To fully leverage the potential of both branches, we employ a dual prediction and refinement decoding scheme during inference, resulting in improved accuracy. Experimental results demonstrate that CLIP-Llama achieves state-of-the-art performance on 11 STR benchmark tests, showcasing its robust capabilities. We firmly believe that CLIP-Llama lays a solid and straightforward foundation for future research in scene text recognition based on vision-language models.

Keywords: scene text recognition; vision-language model; pre-trained language model

Unsupervised graph reasoning distillation hashing for multimodal Hamming space search with vision-language model

INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, Vol. 13, No. 2, p. 16

Authors: Sun, Lina; Dong, Yumin. Affiliation: Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China

Multimodal hash technology maps high-dimensional multimodal data into hash codes, which greatly reduces the cost of data storage and improves query speed through the Hamming similarity calculation. However, existing unsupervised methods still face two key obstacles: (1) With the evolution of large multimodal models, how can the multimodal matching relationships of large models be efficiently distilled to train a powerful student model? (2) Existing methods do not consider other adjacencies between multimodal instances, resulting in limited similarity representation. To address these obstacles, a method called Unsupervised Graph Reasoning Distillation Hashing (UGRDH) is proposed. The UGRDH approach uses CLIP as the teacher model, thus extracting fine-grained multimodal features and relations for teacher-student distillation. Specifically, the multimodal features of the teacher are used to construct a similarity-complementary relation graph matrix, and the proposed graph convolution auxiliary network performs feature aggregation guided by the relation graph matrix to generate a more discriminative hash code. In addition, a cross-attention module was designed to reason about potential instance relations to enable effective teacher-student distilled learning. Finally, UGRDH greatly improves search precision while remaining lightweight. Experimental results show that our method achieves about 1.5%, 3%, and 2.8% performance improvements on MS COCO, NUS-WIDE, and MIRFlickr, respectively.

Keywords: Knowledge distillation; Deep multimodal hashing; Hamming space search; vision-language model
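
The abstract above relies on Hamming similarity between hash codes for fast retrieval. The following is a minimal sketch of that retrieval step alone, assuming the hash codes have already been produced by some model. The 8-bit codes, item names, and the `search` helper are hypothetical and are not taken from UGRDH.

```python
# Hamming-space search over binary hash codes stored as Python integers.
# The codes and item names below are made up for illustration.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bit positions at which two hash codes differ."""
    return bin(code_a ^ code_b).count("1")


def search(query: int, database: dict, top_k: int = 2) -> list:
    """Return the names of the top_k items closest to the query in Hamming distance."""
    ranked = sorted(database.items(), key=lambda item: hamming_distance(query, item[1]))
    return [name for name, _ in ranked[:top_k]]


database = {
    "img_cat": 0b10110010,
    "txt_cat": 0b10110011,
    "img_car": 0b01001101,
}
query = 0b10110110

print(search(query, database))
```

Because the distance reduces to an XOR followed by a bit count, comparing a query against many stored codes is much cheaper than comparing high-dimensional floating-point features, which is the storage and speed benefit the abstract refers to.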

Practical Techniques for Vision-Language Segmentation Model in Remote Sensing

ISPRS TC II Mid-term Symposium on Role of Photogrammetry for a Sustainable World

Authors: Lin, Yuting; Suzuki, Kumiko; Sogo, Shinichiro. Affiliation: Kokusai Kogyo Co Ltd, Tokyo, Japan

Traditional semantic segmentation models often struggle with poor generalizability in zero-shot scenarios such as recognizing attributes unseen in the training labels. On the other hand, vision-language models (VLMs) have shown promise in improving performance on zero-shot tasks by leveraging semantic information from textual inputs and fusing this information with visual features. However, existing VLM-based methods do not perform as effectively on remote sensing data due to the lack of such data in their training datasets. In this paper, we introduce a two-stage fine-tuning approach for a VLM-based segmentation model using a large remote sensing image-caption dataset, which we created using an existing image-caption model. Additionally, we propose a modified decoder and a visual prompt technique using a saliency map to enhance segmentation results. Through these methods, we achieve superior segmentation performance on remote sensing data, demonstrating the effectiveness of our approach.

Keywords: Segmentation of Remote Sensing Data; vision-language model; Fine-tuning; Visual Prompting

UMPA: Unified multi-modal prompt with adapter for vision-language models

MULTIMEDIA SYSTEMS, 2025, Vol. 31, No. 2, pp. 1-11

Authors: Jin, Zhengwei; Wei, Yun. Affiliation: Univ Shanghai Sci & Technol, Sch Opt Elect & Comp Engn, Shanghai 200093, Peoples R China

Large-scale multi-modal pretrained models, such as CLIP, have shown remarkable generalization in vision-language tasks. However, transferring large models to downstream tasks requires large-scale computing resources, so adapters have been proposed to enable fine-tuning for downstream tasks. As input text, a prompt can guide CLIP to learn the correspondence between images and texts when performing specific tasks, but performance is very sensitive to the choice of prompt template. In this paper, we propose Unified Multi-modal Prompt with Adapter for vision-language models (UMPA), based on CLIP, for parameter-efficient fine-tuning (PEFT). Learnable prompt design can improve the adaptation ability of the model, and adapters enable lightweight fine-tuning. By applying learnable prompts and adapters on both the vision and language branches while all pre-trained parameters are frozen, the model improves the alignment between the vision and language representations. The model can learn separate visual representations across different stages to extract stage-wise feature relationships progressively, achieving comprehensive learning of rich visual information. Considering the additional learning modules, we apply image attention masking to prevent overfitting in downstream tasks. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness and strong few-shot generalization ability of our approach.

Keywords: vision-language model; CLIP; Parameter-efficient fine-tuning; Image classification

Towards label-free defect detection in additive manufacturing via dual-classifier semi-supervised learning for vision-language models

JOURNAL OF INTELLIGENT MANUFACTURING, 2025, pp. 1-16

Authors: Wang, Kang; Liu, Lanqing; Xu, Cheng; Zou, Jing; Lin, Haoneng; Fang, Naiyu; Jiang, Jingchao. Affiliations: Nanyang Technol Univ, Singapore, Singapore; Hong Kong Polytech Univ, Hung Hom, Kowloon, Hong Kong, Peoples R China; Univ Exeter, Exeter, England

Complex components can now be fabricated in innovative ways thanks to additive manufacturing (AM) technology, but it also presents a severe challenge in the detection of defects, primarily due to the extensive effort required to label defect samples. In order to address the labeling problem in AM defect identification, this work proposes a strategy called the Dual-Classifier Semi-supervised Learning (DCSL) method. DCSL reduces the requirement for intensive tagging and improves detection accuracy by utilizing both labeled and unlabeled data. Two distinct classifiers, namely the one-hot classifier and the semantic classifier, are designed to train the defect class labels from different perspectives. The one-hot classifier adopts a one-hot encoding of labels, treating each class independently from the visual perspective, while the semantic classifier employs a distributed representation of labels, grouping potentially similar classes from the natural language view. The incorporation of dual classifiers transforms the proposed DCSL into a vision-language model, leveraging the semantic classifier to improve pseudo-labeling quality by identifying semantic relationships among class labels. Extensive tests on the publicly accessible AM defect dataset confirm that DCSL is more effective than current state-of-the-art techniques for AM defect identification. The findings show that our DCSL is designed to enhance image discrimination skills towards label-free defect identification by training classifiers with different data perspectives in a synergistic manner. Given its innovative approach to defect detection, which can improve the effectiveness and precision of quality control and ultimately aid in the broad adoption of intelligent manufacturing across numerous industries, this study has the potential to usher in a new era of embodied intelligence in the current manufacturing system.

Keywords: Defect detection; Additive manufacturing; Semi-supervised learning; vision-language model; Label free; Embodied intelligence

Federated fine-grained prompts for vision-language models based on open-vocabulary object detection

APPLIED INTELLIGENCE, 2025, Vol. 55, No. 7, pp. 1-15

Author: Li, Yu. Affiliation: China Univ Petr East China, Sch Comp Sci & Technol, Qingdao 266580, Peoples R China

Vision-language models can be used for open-vocabulary object detection. Existing methods suffer from low matching accuracy between prompts and image regions, as well as limited generalization capability, because they adopt a data-centralized model training approach that ignores data heterogeneity. To alleviate these issues, we propose a federated fine-grained prompt learning method called FFPLearning for open-vocabulary object detection using vision-language models. Specifically, FFPLearning quantifies the quality of proposals using pre-fused EoG (Energy of Gradient) and IoU (Intersection over Union) scores and organizes them into individual groups. Then learnable fine-grained prompts are trained to align the grouped region proposals in the feature space. A momentum update algorithm is designed to assess the quality of each participating client in the federated learning. Additionally, a Transformer-based feedback aggregation algorithm is designed to thoroughly leverage the semantic information from prompts and aggregate them based on the qualities of clients. Comprehensive evaluations on COCO and LVIS datasets demonstrate that FFPLearning is very effective, with +5.8 Novel AP50 and +3.3 APr improvements compared with existing state-of-the-art methods.

Keywords: Federated learning; vision-language model; Open-vocabulary object detection
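
The abstract above scores region proposals partly by IoU (Intersection over Union). For reference, the standard IoU of two axis-aligned boxes can be sketched as follows. This shows only the generic metric, not FFPLearning's pre-fused EoG and IoU scoring, and the `(x1, y1, x2, y2)` box format is an assumption.

```python
# Intersection over Union for two axis-aligned boxes.
# Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.

def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
```

A proposal that overlaps a reference box closely scores near 1 and a disjoint one scores 0, which is why IoU is a common ingredient when ranking or grouping proposals by quality.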

LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, Vol. 47, No. 3, pp. 1877-1893

Authors: Xu, Peng; Shao, Wenqi; Zhang, Kaipeng; Gao, Peng; Liu, Shuo; Lei, Meng; Meng, Fanqing; Huang, Siyuan; Qiao, Yu; Luo, Ping. Affiliations: Shanghai AI Lab, OpenGVLab, Shanghai 200232, Peoples R China; Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China; Peking Univ, Beijing 100871, Peoples R China

Large vision-language models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite this success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs such as InstructBLIP and LLaVA, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs such as visual question answering and object hallucination on 42 in-domain text-related visual benchmarks, while the latter provides the user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs such as model configurations, modality alignment mechanisms, and training data affect multimodal understanding. By conducting a comprehensive comparison of these features on quantitative and arena evaluation, our study uncovers several innovative findings, which establish a fundamental framework for the development and evaluation of innovative strategies aimed at enhancing multimodal techniques.

Keywords: vision-language model; multimodal evaluation; large language model; multi-turn evaluation

RelVid: Relational Learning with Vision-Language Models for Weakly Video Anomaly Detection

SENSORS, 2025, Vol. 25, No. 7, p. 2037

Authors: Wang, Jingxin; Li, Guohan; Liu, Jiaqi; Xu, Zhengyi; Chen, Xinrong; Wei, Jianming. Affiliations: Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China; ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China; Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100049, Peoples R China; Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China

Weakly supervised video anomaly detection aims to identify abnormal events in video sequences without requiring frame-level supervision, which is a challenging task in computer vision. Traditional methods typically rely on low-level visual features with weak supervision from a single backbone branch, which often struggle to capture the distinctive characteristics of different categories. This limitation reduces their adaptability to real-world scenarios. In real-world situations, the boundary between normal and abnormal events is often unclear and context-dependent. For example, running on a track may be considered normal, but running on a busy road could be deemed abnormal. To address these challenges, RelVid is introduced as a novel framework that improves anomaly detection by expanding the relative feature gap between classes extracted from a single backbone branch. The key innovation of RelVid lies in the integration of auxiliary tasks, which guide the model to learn more discriminative features, significantly boosting the model's performance. These auxiliary tasks, including text-based anomaly detection and feature reconstruction learning, act as additional supervision, helping the model capture subtle differences and anomalies that are often difficult to detect in weakly supervised settings. In addition, RelVid incorporates two other components: class activation feature learning for improved feature discrimination and a temporal attention module for capturing sequential dependencies. This approach enhances the model's robustness and accuracy, enabling it to better handle complex and ambiguous scenarios. Evaluations on two widely used benchmark datasets, UCF-Crime and XD-Violence, demonstrate the effectiveness of RelVid. Compared to state-of-the-art methods, RelVid achieves superior performance in both detection accuracy and robustness.

Keywords: vision-language model; Adapter; weakly video anomaly detection; feature learning

VaVLM: Toward Efficient Edge-Cloud Video Analytics With Vision-Language Models

IEEE TRANSACTIONS ON BROADCASTING, 2025

Authors: Zhang, Yang; Wang, Hanling; Bai, Qing; Liang, Haifeng; Zhu, Peican; Muntean, Gabriel-Miro; Li, Qing. Affiliations: Xian Technol Univ, Sch Optoelect Engn, Xian 710021, Shaanxi, Peoples R China; Pengcheng Lab, Dept Adv Interdisciplinary Res, Shenzhen 518055, Guangdong, Peoples R China; Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen 518055, Guangdong, Peoples R China; Northern Optoelect Co Ltd NORTHEO, Dept Qual Safety, Xian 710043, Shaanxi, Peoples R China; Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect, Xian 710072, Shaanxi, Peoples R China; Dublin City Univ, Sch Elect Engn, Dublin 9, Ireland

The advancement of Large language models (LLMs) with vision capabilities in recent years has elevated video analytics applications to new heights. To address the limited computing and bandwidth resources on edge devices, edge-cloud collaborative video analytics has emerged as a promising paradigm. However, most existing edge-cloud video analytics systems are designed for traditional deep learning models (e.g., image classification and object detection), where each model handles a specific task. In this paper, we introduce VaVLM, a novel edge-cloud collaborative video analytics system tailored for vision-language models (VLMs), which can support multiple tasks using a single model. VaVLM aims to enhance the performance of VLM-powered video analytics systems in three key aspects. First, to reduce bandwidth consumption during video transmission, we propose a novel Region-of-Interest (RoI) generation mechanism based on the VLM's understanding of the task and scene. Second, to lower inference costs, we design a task-oriented inference trigger that processes only a subset of video frames using an optimized inference logic. Third, to improve inference accuracy, the model is augmented with additional information from both the environment and auxiliary analytics models during the inference stage. Extensive experiments on real-world datasets demonstrate that VaVLM achieves an 80.3% reduction in bandwidth consumption and an 89.5% reduction in computational cost compared to baseline methods.

Keywords: Visual analytics; Analytical models; Computational modeling; Image edge detection; Cognition; Bandwidth; Artificial neural networks; Streaming media; Servers; Object detection; Video analytics; vision-language model; large language model; edge computing
