A vision-driven driver monitoring system plays a vital role in guaranteeing driving safety. Recent advances focus on learning-based methods to realize driver monitoring, benefiting from the powerful capability of data-driven feature extraction. Although these methods achieve acceptable performance, training on massive data significantly increases labor costs. Thus, it is intuitive to explore training-free vision-based driver state recognition in the era of large language models (LLMs) and multi-modal large language models (MLLMs). Two issues should be considered. First, a general prompt might not guide the MLLM to focus on human-centric visual appearances, resulting in insufficient understanding of the driver's contextual cues. Second, the inherent uncertainty of the MLLM might impact reasoning precision, which has not been comprehensively considered for MLLM-based driver state recognition. In this paper, we propose a novel training-free driver state recognition method via human-centric context and self-uncertainty-driven MLLM (HSUM). Specifically, a human-centric context generator (HCG) is first proposed based on a context-specific prompt: the MLLM is guided to capture human-centric contextual cues as a scene graph, so that the contextual interaction of objects with their surroundings can be represented effectively. Then, a self-uncertainty response enumerator (SRE) is proposed to exploit the uncertainty of the MLLM: potential reasoning responses are enumerated repeatedly based on the assembly of the human-centric context and an uncertainty-specific prompt. Furthermore, to reveal the precise reasoning result from the enumerated responses, we introduce the Dempster-Shafer evidence theory (DST)-based combination rule to conduct evidence-aware fusion (EAF), from which the precise answer can be derived theoretically.
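As an illustration of the evidence-aware fusion step, the minimal sketch below applies Dempster's rule of combination to mass functions built from repeatedly enumerated MLLM answers. It assumes singleton driver-state hypotheses; the candidate states, function names, and sample responses are hypothetical and not taken from the paper.

```python
from collections import Counter

def response_to_mass(responses, states):
    """Turn one round of enumerated MLLM answers into a mass function
    over candidate driver states (singleton hypotheses only)."""
    counts = Counter(r for r in responses if r in states)
    total = sum(counts.values())
    if total == 0:
        # fully uncertain: assign all mass to the whole frame of discernment
        return {frozenset(states): 1.0}
    return {frozenset([s]): counts[s] / total for s in counts}

def combine_dempster(m1, m2):
    """Dempster's rule of combination for two mass functions whose
    focal elements are frozensets over the same frame."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict; combination undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Hypothetical usage: three repeated prompting rounds over the same frame.
states = ["safe driving", "drowsy", "distracted", "phone use"]
rounds = [
    ["distracted", "distracted", "phone use"],
    ["distracted", "phone use", "distracted"],
    ["distracted", "drowsy", "distracted"],
]
mass = response_to_mass(rounds[0], states)
for r in rounds[1:]:
    mass = combine_dempster(mass, response_to_mass(r, states))
best = max(mass, key=mass.get)
print(sorted(best), round(mass[best], 3))  # ['distracted'] 1.0
```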
Authors: Zhu, Jian; Wang, Hanli; Shi, Miaojing
Tongji Univ, Dept Comp Sci & Technol, Shanghai 200092, Peoples R China
Tongji Univ, Key Lab Embedded Syst & Serv Comp, Minist Educ, Shanghai 200092, Peoples R China
Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on the given image and textual question. Representative works first recognize objects in images and then associate them with key words in texts. However, existing approaches do not consider the exact positions of objects in a human-like three-dimensional (3D) manner, leaving them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal large language models (MLLMs) have been used as powerful tools for several multi-modal tasks but not yet for VCR, which requires elaborate reasoning on specific visual objects referred to by texts. In light of the above, an MLLM-enhanced pseudo-3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence introduce object depth into VCR frameworks to infer 3D positions of objects in images. Then, a depth-aware Transformer is proposed to encode depth differences between objects into the attention mechanism of the Transformer to discriminatively associate objects with visual scenes guided by depth. To further associate the answer with the depth of the visual scene, each word in the answer is tagged with a pseudo depth to realize depth-aware association between answer words and objects. On the other hand, BLIP-2 is employed as the MLLM to process images and texts, and the referring expressions in texts involving specific visual objects are modified with linguistic object labels to serve as comprehensible MLLM inputs. Finally, a parameter optimization technique is devised to fully consider the quality of data batches based on multi-level reasoning confidence. Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches. The source code of this work can be found at https://***.
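A minimal sketch of the depth-aware attention idea is given below: pairwise depth differences between object tokens are added as a bias to the attention logits, so objects at similar depths attend to each other more strongly. The module name, the scalar depth_scale parameter, and the single-head formulation are assumptions for illustration, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class DepthBiasedAttention(nn.Module):
    """Single-head attention whose logits are biased by pairwise depth
    differences between object tokens (a sketch, not the paper's exact layer)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # hypothetical scalar controlling how strongly depth gaps suppress attention
        self.depth_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x, depth):
        # x: (B, N, D) object features; depth: (B, N) pseudo depth per object
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5            # (B, N, N)
        depth_diff = (depth.unsqueeze(-1) - depth.unsqueeze(-2)).abs()  # (B, N, N)
        logits = logits - self.depth_scale * depth_diff                 # closer in depth -> higher weight
        attn = logits.softmax(dim=-1)
        return attn @ v

# Hypothetical usage with 5 detected objects and feature size 256.
feats = torch.randn(2, 5, 256)
depths = torch.rand(2, 5)
out = DepthBiasedAttention(256)(feats, depths)
print(out.shape)  # torch.Size([2, 5, 256])
```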
Large language models (LLMs) and multi-modal large language models (MLLMs) represent the cutting edge in artificial intelligence. This review provides a comprehensive overview of their capabilities and potential impact on radiology. Unlike most existing literature reviews focusing solely on LLMs, this work examines both LLMs and MLLMs, highlighting their potential to support radiology workflows such as report generation, image interpretation, EHR summarization, differential diagnosis generation, and patient education. By streamlining these tasks, LLMs and MLLMs could reduce radiologist workload, improve diagnostic accuracy, support interdisciplinary collaboration, and ultimately enhance patient care. We also discuss key limitations, such as the limited capacity of current MLLMs to interpret 3D medical images and to integrate information from both image and text data, as well as the lack of effective evaluation methods. Ongoing efforts to address these challenges are introduced.
Workers' unsafe behavior is one of the major causes of accidents in electric power production. Intelligent monitoring of workers' unsafe behaviors can effectively prevent the expansion of safety risks, thereby blocking their escalation into accidents. Electric power production processes are diverse in nature and require frequent switching of operating scenarios. This makes it difficult to identify what is "unsafe", since worker behaviors within a given electrical context also exhibit variability and diversity. Existing methods have insufficient generalization and adaptability, which makes them inadequate for electric power production. Therefore, this paper proposes Safety Generative Pre-trained Transformers (SafetyGPT), an autonomous safety-risk agent based on a multi-modal large language model, which incorporates a human-machine collaborative monitoring mode for workers' unsafe behaviors. SafetyGPT loads the electric power production video, and backend supervisors issue instructions to SafetyGPT based on task requirements. The model encodes visual and textual features into corresponding tokens, realizes multi-modal feature alignment and fusion through a cross-attention mechanism, and then generates targeted responses through the large language model. The proposed method is applied to real production-site data to confirm its effectiveness and superiority through comparison with other methods designed to identify unsafe behaviors. Experimental results show that the accuracy of the proposed method for identifying unsafe behaviors in complex environments is 96.5%, and that it can generate reasonable recommended plans based on the identification results, assist backend supervisors in making decisions, and effectively improve the safety level of power production.
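The sketch below illustrates the kind of cross-attention alignment-and-fusion step described above, where instruction tokens attend to frame tokens before being handed to the language model. The class name, dimensions, and residual design are assumptions, not SafetyGPT's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend to visual tokens via cross-attention (a sketch of the
    alignment-and-fusion step described in the abstract, not SafetyGPT's code)."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(text_tokens + fused)   # residual fusion

# Hypothetical shapes: 32 instruction tokens, 196 frame-patch tokens, dim 512.
text = torch.randn(1, 32, 512)
vision = torch.randn(1, 196, 512)
fused = CrossModalFusion(512)(text, vision)
print(fused.shape)  # torch.Size([1, 32, 512])
# `fused` would then be passed to the LLM to generate the risk-assessment response.
```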
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing additional information to expand the incomplete boundaries. To this end, we propose ETC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo-temporal boundaries and subsequently refines these expanded boundaries to achieve precise ones. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further clarify the noise in the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, learning to balance the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones to obtain more precise boundaries. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
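As a rough illustration of a proposal-level contrastive objective, the sketch below computes an InfoNCE-style loss that pulls a sentence query toward one positive temporal proposal and pushes it away from the others; the exact formulation in ETC may differ, and all shapes and names here are assumed.

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(query_emb, proposal_embs, pos_idx, temperature=0.07):
    """InfoNCE-style proposal-level contrastive objective: pull the query toward
    the positive proposal and push it from the others (a sketch under assumed
    shapes, not ETC's exact loss)."""
    # query_emb: (D,), proposal_embs: (P, D), pos_idx: index of the positive proposal
    query_emb = F.normalize(query_emb, dim=-1)
    proposal_embs = F.normalize(proposal_embs, dim=-1)
    logits = proposal_embs @ query_emb / temperature          # (P,) similarities
    target = torch.tensor(pos_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Hypothetical usage: 1 sentence query, 6 candidate temporal proposals.
loss = proposal_contrastive_loss(torch.randn(256), torch.randn(6, 256), pos_idx=2)
print(float(loss))
```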
Recently, the wide application of large language models (LLMs) to Visual Question Answering (VQA) has significantly boosted progress in this field. Despite these advancements, LLMs cannot fully perceive and comprehend visual information from images. Therefore, how to fully mine visual information is very important for language models to effectively handle the VQA task. In response to this challenge, we propose a straightforward yet effective Vision-Centric Framework (VCF) for VQA, which mainly includes an adaptive visual perceptron module, a multi-source feature fusion module, and a large language model. The adaptive visual perceptron module effectively condenses and integrates the extensive visual information sequence output by the visual encoder using a fixed number of query embeddings. The multi-source feature fusion module concentrates on extracting fine-grained visual perception information by fusing visual features of different scales. Finally, by channeling their outputs, the language model leverages its extensive implicit knowledge to produce a more nuanced and precise synthesis of the visual information, ultimately delivering the answer. The synergy and complementarity of the two modules jointly enhance the robustness of the model. Through extensive experiments, VCF achieves nearly state-of-the-art results on datasets such as VQAv2, OK-VQA, GQA, Text-VQA, and others. A series of ablation experiments has also been conducted to demonstrate the efficacy of the proposed modules. Additionally, VCF achieves better or equivalent performance compared to some larger-scale models, such as LLaVa-1.5 and Pink.
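The following sketch shows how a fixed number of learnable query embeddings can condense a long visual token sequence via cross-attention, in the spirit of the adaptive visual perceptron module described above; the class name, token counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    """Condense a long visual token sequence into a fixed number of query
    embeddings via cross-attention (a sketch of the adaptive visual perceptron
    idea; layer names and sizes are assumptions)."""
    def __init__(self, dim, num_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, visual_tokens):            # visual_tokens: (B, N, D), N can be large
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return out                                # (B, num_queries, D) for the LLM

# Hypothetical usage: 577 ViT patch tokens condensed to 32 tokens.
condensed = QueryResampler(1024)(torch.randn(2, 577, 1024))
print(condensed.shape)  # torch.Size([2, 32, 1024])
```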
ISBN (Print): 9789819610440; 9789819610457
Emotion semantic inconsistency is a ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across various modalities such as text, audio, and video. Each modality may convey distinct aspects of sentiment, due to the subtle and nuanced expression of human beings, leading to inconsistency that may hinder the predictions of artificial agents. In this work, we introduce a modality-conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs when handling multi-modal emotion analysis. Our research presents a new challenge and offers valuable insights for the future development of sentiment analysis systems.
Mathematical reasoning remains an ongoing challenge for AI models, especially for geometry problems, which require both linguistic and visual signals. As the vision encoders of most MLLMs are trained on natural scenes...
ISBN (Print): 9798350359329; 9798350359312
With the growing interest in large language models (LLMs), integrating visual tasks has led to the development of multi-modal large language models (MLLMs). Despite their advancements, MLLMs face challenges in accuracy and generalization, often due to resource and time constraints. Addressing these issues, our paper introduces a novel multi-Agent Collaborative Network for MLLMs (MLLM network). This framework harnesses collective intelligence and cooperation among multiple agents to enhance the accuracy and generalizability of MLLMs. The collaborative nature of our framework, featuring inter-layer neuron interaction and information exchange, facilitates superior processing and integration of multi-modal data. This leads to marked improvements in performance. The findings underscore the efficacy and potential of the proposed framework, presenting a robust solution for complex multi-modal challenges in machine learning and artificial intelligence. Our experimental evaluations demonstrate that this approach significantly surpasses traditional single-MLLM architectures in task accuracy and generalization.
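Since the abstract does not spell out the collaboration protocol, the sketch below shows only a generic multi-agent scheme: several agents answer, see each other's answers as extra context, re-answer, and a majority vote picks the final output. The function name, the number of rounds, and the stub agents are all hypothetical.

```python
from collections import Counter

def collaborative_answer(agents, question, rounds=2):
    """Let several MLLM agents answer, share the pooled answers as extra context,
    and re-answer before a final majority vote (a generic collaboration sketch;
    the paper's inter-layer interaction scheme is not specified in the abstract)."""
    answers = [agent(question) for agent in agents]
    for _ in range(rounds - 1):
        context = f"{question}\nOther agents answered: {answers}"
        answers = [agent(context) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical usage with stub agents standing in for real MLLM calls.
agents = [lambda q: "cat", lambda q: "cat", lambda q: "dog"]
print(collaborative_answer(agents, "What animal is in the image?"))  # cat
```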
ISBN (Print): 9798400704369
The prevalence of check fraud, particularly with stolen checks sold on platforms such as Telegram, creates significant challenges for both individuals and financial institutions. This underscores the urgent need for innovative solutions to detect and prevent such fraud on social media platforms. While deep learning techniques show great promise in detecting objects and extracting information from images, their effectiveness in addressing check fraud is hindered by the lack of comprehensive, open-source, large training datasets specifically for check information extraction. To bridge this gap, this paper introduces "CheckGuard," a large labeled image-to-text cross-modal dataset designed for check information extraction. CheckGuard comprises over 7,000 real-world stolen check image segments from more than 15 financial institutions, featuring a variety of check styles and layouts. These segments have been manually labeled, resulting in over 50,000 samples across seven key elements: Drawer, Payee, Amount, Date, Drawee, Routing Number, and Check Number. The dataset supports various tasks such as visual question answering (VQA) on checks and check image captioning. Our paper details the rigorous data collection, cleaning, and annotation processes that make CheckGuard a valuable resource for researchers in check fraud detection, machine learning, and multimodal large language models (MLLMs). We not only benchmark state-of-the-art (SOTA) methods on this dataset to assess their performance but also explore potential enhancements. Our application of parameter-efficient fine-tuning (PEFT) techniques to the SOTA MLLMs demonstrates significant performance improvements, providing valuable insights and practical approaches for enhancing model efficacy on this task. As an evolving project, CheckGuard will continue to be updated with new data, enhancing its utility and driving further advancements in the field. Our PEFT-based MLLM code is available at: https://***/feiz
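As an illustration of the parameter-efficient fine-tuning direction mentioned above, the sketch below wraps a frozen linear layer with a LoRA-style low-rank update so that only the small rank-r factors are trained; it is a generic example, not the CheckGuard fine-tuning code, and the layer sizes are assumed.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update W*x + (alpha/r) * B(A(x)),
    the core idea behind LoRA-style PEFT (a sketch, not the CheckGuard training code)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical usage: adapt one projection layer of an MLLM text decoder.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(1, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank factors are updated
```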