Search Results - Inner Mongolia University Library

Soccer-CLIP: Vision Language Model for Soccer Action Spotting

IEEE ACCESS, 2025, vol. 13, pp. 44354-44365

Authors: Shin, Yoonho; Park, Sanghoon; Han, Youngsub; Jeon, Byoung-Ki; Lee, Soonyoung; Kang, Byung Jun
Affiliations: LG UPlus Seoul 07795 South Korea; LG AI Res Seoul 07336 South Korea

In the rapidly advancing field of computer vision, the application of multimodal models, specifically vision-language frameworks, has shown substantial promise for complex tasks such as video-based action spotting. This paper introduces Soccer-CLIP, a vision-language model specially designed for soccer action spotting. Soccer-CLIP incorporates an innovative domain-specific prompt engineering strategy, leveraging large language models (LLMs) to refine textual representations for precise alignment with soccer-specific actions. Our model integrates both visual and textual features to enhance recognition accuracy of critical soccer events. With the temporal augmentation techniques devised for input videos, Soccer-CLIP builds upon existing methodologies to address the inherent challenges of temporally sparse event annotations within video sequences. Evaluations on the SoccerNet Action Spotting benchmark demonstrate that Soccer-CLIP outperforms previous state-of-the-art models, exploring the effectiveness of our model's capacity to capture domain-specific contextual nuances. This work represents a significant advancement in automated sports analysis, providing a robust and adaptable framework for broader applications in video recognition and temporal action localization tasks.

Keywords: Sports Videos Accuracy Visualization Data models Context modeling Adaptation models Prompt engineering Deep learning Streams Action spotting multimodal model prompt engineering SoccerNet-v2 temporal augmentation video augmentation vision language model video recognition

Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces

EXPERT SYSTEMS WITH APPLICATIONS, 2025, vol. 265

Authors: Chen, Zhiling; Chen, Hanning; Imani, Mohsen; Chen, Ruimin; Imani, Farhad
Affiliations: Univ Connecticut Sch Mech Aerosp & Mfg Engn Storrs CT 06269 USA; Univ Calif Irvine Dept Comp Sci Irvine CA USA

Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety gear, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to traditional object detection limitations in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, requiring them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, visual prompt, safety gear detection, and fine-grained verification. Scene recognition identifies the current scenario to determine the necessary safety gear. Visual prompt formulates specific visual cues needed for the detection process. Safety gear detection identifies whether the required safety gear is being worn according to the specified scenario. Lastly, fine-grained verification assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times that are 21x faster.

Keywords: Personal protective equipment; Zero-shot object detection; Vision language model; Large language model

Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Lee, Donghoon; Luu, Tung M.; Lee, Younghwan; Yoo, Chang D.
Affiliations: Robotics Program KAIST Daejeon Korea Republic of; Electrical Engineering KAIST Daejeon Korea Republic of

ISBN: (print) 9798350368741

Recent research highlights the potential of multimodal foundation models in tackling complex decision-making challenges. However, their large parameters make real-world deployment resource-intensive and often impractical for constrained systems. Reinforcement learning (RL) shows promise for task-specific agents but suffers from high sample complexity, limiting practical applications. To address these challenges, we introduce LVLM to Policy (LVLM2P), a novel framework that distills knowledge from large vision-language models (LVLM) into more efficient RL agents. Our approach leverages the LVLM as a teacher, providing instructional actions based on trajectories collected by the RL agent, which helps reduce less meaningful exploration in the early stages of learning, thereby significantly accelerating the agent's learning progress. Additionally, by leveraging the LVLM to suggest actions directly from visual observations, we eliminate the need for manual textual descriptors of the environment, enhancing applicability across diverse tasks. Experiments show that LVLM2P significantly enhances the sample efficiency of baseline RL algorithms. The code is available at https://***/i22024/LVLM2P. © 2025 IEEE.

Keywords: Knowledge Distillation; Reinforcement Learning; Vision language model

VLM-MSGraph: Vision Language Model-enabled Multi-hierarchical Scene Graph for robotic assembly

ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING, 2025, vol. 94

Authors: Li, Shufei; Yan, Zhijie; Wang, Zuoxu; Gao, Yiping
Affiliations: Beihang Univ Sch Mech Engn & Automat Beijing Peoples R China; City Univ Hong Kong Dept Syst Engn Hong Kong Peoples R China; Huazhong Univ Sci & Technol State Key Lab Digital Mfg Equipment & Technol Wuhan Peoples R China

Intelligent robotic assembly is becoming a pivotal component of the manufacturing sector, driven by growing demands for flexibility, sustainability, and resilience. Robots in manufacturing environments need perception, decision-making, and manipulation skills to support the flexible production of diverse products. However, traditional robotic assembly systems typically rely on time-consuming training processes specific to fixed settings, lacking generalization and zero-shot learning capabilities. To address these challenges, this paper introduces a vision language model-enabled Multi-hierarchical Scene Graph (VLM-MSGraph) approach for robotic assembly, featuring generalized assembly sequence learning and 3D manipulation in open scenarios. The MSGraph incorporates high-level task planning structured as triplets, organized by multiple VLM agents. At a low level, the MSGraph retains 3D spatial relationships between industrial parts, enabling the robot to perform assembly tasks while accounting for object geometry for effective manipulation. Assembly drawings, physics simulations, and assembly tasks in a laboratory setting are used to evaluate the proposed system, advancing flexible automation in robotics.

Keywords: Smart manufacturing; Vision language model; Scene graph; Robotic assembly; Flexible automation

Alzheimer's disease recognition using graph neural network by leveraging image-text similarity from vision language model

SCIENTIFIC REPORTS, 2025, vol. 15, no. 1, pp. 1-14

Authors: Lee, Byounghwa; Bang, Jeong-Uk; Song, Hwa Jeon; Kang, Byung Ok
Affiliation: Elect & Telecommun Res Inst ETRI Integrated Intelligence Res Sect Daejeon 34129 South Korea

Alzheimer's disease (AD), a progressive neurodegenerative condition, notably impacts cognitive functions and daily activity. One method of detecting dementia involves a task where participants describe a given picture, and extensive research has been conducted using the participants' speech and transcribed text. However, very few studies have explored the modality of the image itself. In this work, we propose a method that predicts dementia automatically by representing the relationship between images and texts as a graph. First, we transcribe the participants' speech into text using an automatic speech recognition system. Then, we employ a vision language model to represent the relationship between the parts of the image and the corresponding descriptive sentences as a bipartite graph. Finally, we use a graph convolutional network (GCN), considering each subject as an individual graph, to classify AD patients through a graph-level classification task. In experiments conducted on the ADReSSo Challenge datasets, our model surpassed the existing state-of-the-art performance by achieving an accuracy of 88.73%. Additionally, ablation studies that removed the relationship between images and texts demonstrated the critical role of graphs in improving performance. Furthermore, by utilizing the sentence representations learned through the GCN, we identified the sentences and keywords critical for AD classification.

Keywords: Alzheimer's disease; Bipartite graph; Dementia; Multimodal; Graph neural network; Vision language model
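
The abstract above describes representing image parts and descriptive sentences as a bipartite graph built from image-text similarity. The listing gives no implementation details, so what follows is only a minimal sketch under stated assumptions: the embeddings are toy vectors standing in for vision language model outputs, and the cosine measure, the `threshold` value, and the function name `build_bipartite_edges` are hypothetical choices for illustration, not the authors' method. The GCN classification step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_bipartite_edges(region_embs, sentence_embs, threshold=0.5):
    """Return (region_index, sentence_index, weight) edges.

    An edge links an image region to a sentence when their cosine
    similarity is at least `threshold` (an assumed cut-off).
    """
    edges = []
    for i, region in enumerate(region_embs):
        for j, sentence in enumerate(sentence_embs):
            weight = cosine(region, sentence)
            if weight >= threshold:
                edges.append((i, j, weight))
    return edges


# Toy stand-ins for embeddings of two image regions and three sentences.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
edges = build_bipartite_edges(regions, sentences)
```

Each subject would yield one such edge list, which a graph-level classifier could then consume.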

Vision Language Model Empowered Surgical Planning

2024 International Conference on Intelligent Robotics and Automatic Control

Authors: Chen, Yihe; Yu, Runsheng; Wang, Xin; Wang, Wensheng; Tan, Ning; Zhang, Youzhi
Affiliations: Nanjing Univ Sch Artificial Intelligence Nanjing Peoples R China; Hong Kong Univ Sci & Technol Dept Comp Sci Hong Kong Peoples R China; Sun Yat Sen Univ Sch Comp Sci & Engn Guangzhou Peoples R China; Chinese Acad Sci Hong Kong Inst Sci & Innovat Ctr Artificial Intelligence & Robot Hong Kong Peoples R China

ISBN: (print) 9798350389814; 9798350389807

The integration of a flexible endoscope with a surgical manipulator is crucial in minimally invasive surgery (MIS), facilitating detailed visualization of the operative field within the patient's body. During MIS, the remote center of motion (RCM) constraints are essential for achieving visual servoing control and ensuring accurate tracking control of the robotic endoscope. Existing work requires the exact trajectory for the tracking control and does not connect both tasks with the RCM constraints. In this paper, we exploit GPT-V to develop Vision Language Model Empowered surgical Planning (VLM-EP), which uses environmental observations and task descriptions to finish the tracking task without the exact trajectory and connects both tasks through an exploration procedure within the in vivo safety range. Our simulated experiments show that VLM-EP significantly outperforms the state-of-the-art control-based baseline. We demonstrate a practical implementation of VLM-EP in real-world scenarios, which shows that VLM-EP effectively handles the tracking control task and the visual servoing control task.

Keywords: vision language model visual servoing endoscope motion planning

ZEN-IQA: Zero-Shot Explainable and No-Reference Image Quality Assessment With Vision Language Model

IEEE ACCESS, 2024, vol. 12, pp. 70973-70983

Author: Miyata, Takamichi
Affiliation: Chiba Inst Technol Chiba 2750016 Japan

No-reference image quality assessment (NR-IQA), which aims to estimate the perceptual quality of a degraded image without accessing the corresponding original image, is a key challenge in low-level computer vision. Recent advances in deep learning have enabled the development of high-performance NR-IQA methods. However, such methods are limited, as they are highly dependent on the training dataset. Recognizing this limitation and avoiding task-specific training, an alternative method has been proposed that employs pre-trained visual language models for zero-shot NR-IQA; however, this approach does not provide any basis for decision-making and is not explainable. In this study, we propose ZEN-IQA, a new zero-shot and explainable NR-IQA method. Utilizing a new approach involving carefully constructed prompt pairs and triplets makes the evaluation process more intuitive and easier to understand. Our comparative analysis reveals that ZEN-IQA not only has high interpretability, but also outperforms methods using handcrafted features and state-of-the-art deep learning methods trained based on datasets that differ from the test set. We also applied ZEN-IQA to images before and after image processing and conducted experiments to evaluate how perceptual quality changes with image processing.

Keywords: Image quality Brightness Computational modeling Zero-shot learning Training Task analysis Semantics Perceptual quality image quality assessment no reference image quality assessment zero shot learning vision language model antonym prompt pairing
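
The abstract and keywords above mention zero-shot quality assessment with antonym prompt pairs. The listing does not give ZEN-IQA's actual formulation, so the following is only a rough sketch of the general idea of scoring an image against a positive prompt and its antonym: the toy embeddings, the `temperature` value, and the function name `antonym_pair_score` are all assumptions made for illustration, and the prompt triplets mentioned in the abstract are not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def antonym_pair_score(image_emb, positive_emb, negative_emb, temperature=0.01):
    """Two-way softmax probability that the image matches the positive
    prompt rather than its antonym. `temperature` is an assumed value."""
    s_pos = cosine(image_emb, positive_emb) / temperature
    s_neg = cosine(image_emb, negative_emb) / temperature
    # Softmax over two logits equals a sigmoid of their difference.
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


# Toy stand-ins for an image embedding and the text embeddings of a
# hypothetical prompt pair such as "good photo" / "bad photo".
image = [0.9, 0.1, 0.2]
good_prompt = [1.0, 0.0, 0.0]
bad_prompt = [0.0, 1.0, 0.0]
score = antonym_pair_score(image, good_prompt, bad_prompt)
```

A score above 0.5 means the image sits closer to the positive prompt than to its antonym; running several such pairs, one per quality attribute, is one way such a method could expose the basis for its decision.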

Enhancing metal additive manufacturing training with the advanced vision language model: A pathway to immersive augmented reality training for non-experts

JOURNAL OF MANUFACTURING SYSTEMS, 2024, vol. 75, pp. 257-269

Authors: Fan, Haolin; Zhang, Hongji; Ma, Changyu; Wu, Tongzi; Fuh, Jerry Ying Hsi; Li, Bingbing
Affiliations: Natl Univ Singapore Dept Mech Engn Singapore 117575 Singapore; Calif State Univ Northridge Auton Res Ctr STEAHM ARCS Northridge CA 91330 USA

This paper introduces an innovative training system for the Renishaw AM400 metal printer, leveraging the synergy of the advanced vision language model (VLM) with Augmented Reality (AR) within the Digital Twins (DT) framework. Aimed at overcoming the limitations of conventional training methods in metal additive manufacturing (AM), our system integrates AR to provide an immersive learning environment, enhancing the real-world experience with interactive digital overlays. The core of the system lies in its use of VLM, which, pre-trained on diverse datasets, excels in processing multi-modal data, thereby offering nuanced and contextually relevant guidance for trainees. Key experiments demonstrate the system's effectiveness, particularly highlighting the usage of VLM as an Artificial Intelligence (AI) agent to integrate external tools like YOLO-v7 for valve state classification and CRAFT for control panel text recognition. This approach significantly improves recognition accuracy, operational understanding, and human-machine interaction, especially for nonexpert users, making complex metal AM operations more accessible. The research not only showcases the potential of AR and VLM in industrial training but also sets a new standard for smart manufacturing practices, indicating broader applications in various industrial domains.

Keywords: Metal additive manufacturing; Vision language model; Augmented reality; Non-expert training; Digital twins

Rugby Scene Classification Enhanced by Vision Language Model

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Nonaka, Naoki; Fujihira, Ryo; Koshiba, Toshiki; Maeda, Akira; Seita, Jun
Affiliations: RIKEN Informat R&D & Strategy Headquarters Adv Data Sci Project Wako Saitama Japan; Hakata Knee & Sports Clin Fukuoka Japan

ISBN: (print) 9798350365474

This study investigates the integration of vision language models (VLM) to enhance the classification of situations within rugby match broadcasts. The importance of accurately identifying situations in sports videos is emphasized for understanding game dynamics and facilitating downstream tasks like performance evaluation and injury prevention. Utilizing a dataset comprising 18,000 labeled images extracted at 0.2-second intervals from 100 minutes of rugby match broadcasts, scene classification tasks including contact plays (scrums, mauls, rucks, tackles, lineouts), rucks, tackles, lineouts, and multiclass classification were performed. The study aims to validate the utility of VLM outputs in improving classification performance compared to using solely image data. Experimental results demonstrate substantial performance improvements across all tasks with the incorporation of VLM outputs. Our analysis of prompts suggests that, when provided with appropriate contextual information through natural language, VLMs can effectively capture the context of a given image. The findings of our study indicate that leveraging VLMs in the domain of sports analysis holds promise for developing image processing models capable of incorporating the tacit knowledge encoded within language models, as well as information conveyed through natural language descriptions.

Keywords: Rugby; Scene classification; Vision language model

QViLa: Quantum Infused Vision-Language Model for Enhanced Multimodal Understanding

SN Computer Science, 2024, vol. 5, no. 8, p. 1023

Authors: Mukesh, K.; Jayaprakash, S.L.; Kumar, R. Prasanna
Affiliation: Department of Computer Science and Engineering Amrita School of Computing Amrita Vishwa Vidyapeetham Tamilnadu Chennai 601103 India

Vision-language models have emerged as transformative tools, revolutionizing the integration of visual and textual information and forging pathways for nuanced interpretations across various applications. The evolution of these models underscores the challenge of achieving seamless modality fusion, particularly in aligning raw pixel values of images with high-level semantics of text, a bottleneck that often hinders optimal cross-modal representations. Addressing this, our research introduces a quantum-enhanced multimodal framework. The main objective of our proposed work is the integration of classical vision-language transformers with a quantum-augmented layer, aimed at enhancing the fusion of extracted feature embeddings, thereby bridging the modality gap. Quantum computing techniques offer an innovative approach to information processing, which paves the way for richer and more intricate visual-textual representations. Furthermore, the shared self-attention mechanism accentuates the model’s ability to detect complex modality interactions. The quantum-enhanced framework is empirically evaluated on the VQA v2 dataset. This evaluation not only considers accuracy across diverse question categories but also the model’s computational efficiency, emphasizing the pivotal contributions of quantum computations in achieving heightened accuracy levels. Further exploration into the influence of different quantum feature maps aided in identifying the optimal model variant. Our findings highlight the quantum layer’s pivotal role in improving the efficacy of classical vision language models. © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Multimodal learning; Quantum computing; Vision language model; Visual reasoning
