A vision-driven driver monitoring system plays a vital role in guaranteeing driving safety. Recent advances focus on learning-based methods to realize driver monitoring, benefiting from the powerful capability of data-driven feature extraction. Although these methods achieve acceptable performance, training on massive data significantly increases labor costs. Thus, it is intuitive to explore training-free vision-based driver state recognition in the era of large language models (LLMs) and multi-modal large language models (MLLMs). Two issues should be considered. First, a general prompt might not guide the MLLM to focus on human-centric visual appearances, resulting in insufficient understanding of the driver's contextual cues. Second, the inherent uncertainty of the MLLM might impact reasoning precision, which has not been comprehensively considered for MLLM-based driver state recognition. In this paper, we propose a novel training-free driver state recognition method via human-centric context and self-uncertainty-driven MLLM (HSUM). Specifically, a human-centric context generator (HCG) is first proposed based on a context-specific prompt: the MLLM is guided to capture human-centric contextual cues as a scene graph, so that the contextual interaction of objects with their surroundings can be represented effectively. Then, a self-uncertainty response enumerator (SRE) is proposed to exploit the uncertainty of the MLLM: potential reasoning responses are enumerated repeatedly based on the assembly of the human-centric context and an uncertainty-specific prompt. Furthermore, to reveal the precise reasoning result from the enumerated responses, we introduce the Dempster-Shafer evidence theory (DST)-based combination rule to conduct evidence-aware fusion (EAF), from which the precise answer can be derived theoretically.
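As an illustration of the evidence-aware fusion step, the minimal sketch below applies Dempster's rule of combination to mass functions built from repeatedly enumerated MLLM answers. It assumes singleton driver-state hypotheses; the candidate states, function names, and sample responses are hypothetical and not taken from the paper.

```python
from collections import Counter

def response_to_mass(responses, states):
    """Turn one round of enumerated MLLM answers into a mass function
    over candidate driver states (singleton hypotheses only)."""
    counts = Counter(r for r in responses if r in states)
    total = sum(counts.values())
    if total == 0:
        # fully uncertain: assign all mass to the whole frame of discernment
        return {frozenset(states): 1.0}
    return {frozenset([s]): counts[s] / total for s in counts}

def combine_dempster(m1, m2):
    """Dempster's rule of combination for two mass functions whose
    focal elements are frozensets over the same frame."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict; combination undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Hypothetical usage: three repeated prompting rounds over the same frame.
states = ["safe driving", "drowsy", "distracted", "phone use"]
rounds = [
    ["distracted", "distracted", "phone use"],
    ["distracted", "phone use", "distracted"],
    ["distracted", "drowsy", "distracted"],
]
mass = response_to_mass(rounds[0], states)
for r in rounds[1:]:
    mass = combine_dempster(mass, response_to_mass(r, states))
best = max(mass, key=mass.get)
print(sorted(best), round(mass[best], 3))  # ['distracted'] 1.0
```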
Authors: Zhu, Jian; Wang, Hanli; Shi, Miaojing
Tongji Univ, Dept Comp Sci & Technol, Shanghai 200092, Peoples R China
Tongji Univ, Key Lab Embedded Syst & Serv Comp, Minist Educ, Shanghai 200092, Peoples R China
Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on the given image and textual question. Representative works first recognize objects in images and then associate them with key words in texts. However, existing approaches do not consider the exact positions of objects in a human-like three-dimensional (3D) manner, leaving them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal large language models (MLLMs) have been used as powerful tools for several multi-modal tasks but not yet for VCR, which requires elaborate reasoning on specific visual objects referred to by texts. In light of the above, an MLLM-enhanced pseudo-3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence introduce object depth into VCR frameworks to infer 3D positions of objects in images. Then, a depth-aware Transformer is proposed to encode depth differences between objects into the attention mechanism of the Transformer to discriminatively associate objects with visual scenes guided by depth. To further associate the answer with the depth of the visual scene, each word in the answer is tagged with a pseudo depth to realize depth-aware association between answer words and objects. On the other hand, BLIP-2 is employed as the MLLM to process images and texts, and the referring expressions in texts involving specific visual objects are modified with linguistic object labels to serve as comprehensible MLLM inputs. Finally, a parameter optimization technique is devised to fully consider the quality of data batches based on multi-level reasoning confidence. Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches. The source code of this work can be found at https://***.
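A minimal sketch of the depth-aware attention idea is given below: pairwise depth differences between object tokens are added as a bias to the attention logits, so objects at similar depths attend to each other more strongly. The module name, the scalar depth_scale parameter, and the single-head formulation are assumptions for illustration, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class DepthBiasedAttention(nn.Module):
    """Single-head attention whose logits are biased by pairwise depth
    differences between object tokens (a sketch, not the paper's exact layer)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # hypothetical scalar controlling how strongly depth gaps suppress attention
        self.depth_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x, depth):
        # x: (B, N, D) object features; depth: (B, N) pseudo depth per object
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5            # (B, N, N)
        depth_diff = (depth.unsqueeze(-1) - depth.unsqueeze(-2)).abs()  # (B, N, N)
        logits = logits - self.depth_scale * depth_diff                 # closer in depth -> higher weight
        attn = logits.softmax(dim=-1)
        return attn @ v

# Hypothetical usage with 5 detected objects and feature size 256.
feats = torch.randn(2, 5, 256)
depths = torch.rand(2, 5)
out = DepthBiasedAttention(256)(feats, depths)
print(out.shape)  # torch.Size([2, 5, 256])
```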
Large language models (LLMs) and multi-modal large language models (MLLMs) represent the cutting edge in artificial intelligence. This review provides a comprehensive overview of their capabilities and potential impact on radiology. Unlike most existing literature reviews focusing solely on LLMs, this work examines both LLMs and MLLMs, highlighting their potential to support radiology workflows such as report generation, image interpretation, EHR summarization, differential diagnosis generation, and patient education. By streamlining these tasks, LLMs and MLLMs could reduce radiologist workload, improve diagnostic accuracy, support interdisciplinary collaboration, and ultimately enhance patient care. We also discuss key limitations, such as the limited capacity of current MLLMs to interpret 3D medical images and to integrate information from both image and text data, as well as the lack of effective evaluation methods. Ongoing efforts to address these challenges are introduced.
Workers' unsafe behavior is one of the major causes of accidents in electric power production. Intelligent monitoring of workers' unsafe behaviors can effectively prevent the expansion of safety risks, thereby blocking their escalation into accidents. Electric power production processes are diverse in nature and require frequent switching of operating scenarios. This makes it difficult to identify what is "unsafe", since worker behaviors within a given electrical context also exhibit variability and diversity. Existing methods have insufficient generalization and adaptability, which makes them inadequate for electric power production. Therefore, this paper proposes Safety Generative Pre-trained Transformers (SafetyGPT), an autonomous safety-risk agent based on a multi-modal large language model, which incorporates a human-machine collaborative monitoring mode for workers' unsafe behaviors. SafetyGPT loads the electric power production video, and backend supervisors issue instructions to SafetyGPT based on task requirements. The model encodes visual and textual features into corresponding tokens, realizes multi-modal feature alignment and fusion through a cross-attention mechanism, and then generates targeted responses through the large language model. The proposed method is applied to real production-site data to confirm its effectiveness and superiority through comparison with other methods designed to identify unsafe behaviors. Experimental results show that the accuracy of the proposed method for identifying unsafe behaviors in complex environments is 96.5%, and that it can generate reasonable recommended plans based on the identification results, assist backend supervisors in making decisions, and effectively improve the safety level of power production.
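The sketch below illustrates the kind of cross-attention alignment-and-fusion step described above, where instruction tokens attend to frame tokens before being handed to the language model. The class name, dimensions, and residual design are assumptions, not SafetyGPT's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend to visual tokens via cross-attention (a sketch of the
    alignment-and-fusion step described in the abstract, not SafetyGPT's code)."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        fused, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return self.norm(text_tokens + fused)   # residual fusion

# Hypothetical shapes: 32 instruction tokens, 196 frame-patch tokens, dim 512.
text = torch.randn(1, 32, 512)
vision = torch.randn(1, 196, 512)
fused = CrossModalFusion(512)(text, vision)
print(fused.shape)  # torch.Size([1, 32, 512])
# `fused` would then be passed to the LLM to generate the risk-assessment response.
```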
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing additional information to expand the incomplete boundaries. To this end, we propose ETC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo-temporal boundaries and subsequently refines these expanded boundaries to achieve precise ones. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further clarify the noise in the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, learning to balance the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones to obtain more precise boundaries. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
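As a rough illustration of a proposal-level contrastive objective, the sketch below computes an InfoNCE-style loss that pulls a sentence query toward one positive temporal proposal and pushes it away from the others; the exact formulation in ETC may differ, and all shapes and names here are assumed.

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(query_emb, proposal_embs, pos_idx, temperature=0.07):
    """InfoNCE-style proposal-level contrastive objective: pull the query toward
    the positive proposal and push it from the others (a sketch under assumed
    shapes, not ETC's exact loss)."""
    # query_emb: (D,), proposal_embs: (P, D), pos_idx: index of the positive proposal
    query_emb = F.normalize(query_emb, dim=-1)
    proposal_embs = F.normalize(proposal_embs, dim=-1)
    logits = proposal_embs @ query_emb / temperature          # (P,) similarities
    target = torch.tensor(pos_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

# Hypothetical usage: 1 sentence query, 6 candidate temporal proposals.
loss = proposal_contrastive_loss(torch.randn(256), torch.randn(6, 256), pos_idx=2)
print(float(loss))
```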
Recently, the wide application of large language models (LLMs) to Visual Question Answering (VQA) has significantly boosted progress in this field. Despite these advancements, LLMs cannot fully perceive and comprehend visual information from images. Therefore, how to fully mine visual information is very important for language models to effectively handle the VQA task. In response to this challenge, we propose a straightforward yet effective Vision-Centric Framework (VCF) for VQA, which mainly includes an adaptive visual perceptron module, a multi-source feature fusion module, and a large language model. The adaptive visual perceptron module effectively condenses and integrates the extensive visual information sequence output by the visual encoder using a fixed number of query embeddings. The multi-source feature fusion module concentrates on extracting fine-grained visual perception information by fusing visual features of different scales. Finally, by channeling their outputs, the language model leverages its extensive implicit knowledge to produce a more nuanced and precise synthesis of the visual information, ultimately delivering the answer. The synergy and complementarity of the two modules jointly enhance the robustness of the model. Through extensive experiments, VCF achieves nearly state-of-the-art results on datasets such as VQAv2, OK-VQA, GQA, Text-VQA, and others. A series of ablation experiments has also been conducted to demonstrate the efficacy of the proposed modules. Additionally, VCF achieves better or equivalent performance compared to some larger-scale models, such as LLaVa-1.5 and Pink.
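The following sketch shows how a fixed number of learnable query embeddings can condense a long visual token sequence via cross-attention, in the spirit of the adaptive visual perceptron module described above; the class name, token counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    """Condense a long visual token sequence into a fixed number of query
    embeddings via cross-attention (a sketch of the adaptive visual perceptron
    idea; layer names and sizes are assumptions)."""
    def __init__(self, dim, num_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, visual_tokens):            # visual_tokens: (B, N, D), N can be large
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return out                                # (B, num_queries, D) for the LLM

# Hypothetical usage: 577 ViT patch tokens condensed to 32 tokens.
condensed = QueryResampler(1024)(torch.randn(2, 577, 1024))
print(condensed.shape)  # torch.Size([2, 32, 1024])
```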
ISBN (Print): 9789819610440; 9789819610457
Emotion semantic inconsistency is a ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across various modalities such as text, audio, and video. Each modality may convey distinct aspects of sentiment, due to the subtle and nuanced expression of human beings, leading to inconsistency that may hinder the predictions of artificial agents. In this work, we introduce a modality-conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs when handling multi-modal emotion analysis. Our research presents a new challenge and offers valuable insights for the future development of sentiment analysis systems.
Mathematical reasoning remains an ongoing challenge for AI models, especially for geometry problems, which require both linguistic and visual signals. As the vision encoders of most MLLMs are trained on natural scenes...
ISBN (Print): 9798350359329; 9798350359312
With the growing interest in large language models (LLMs), integrating visual tasks has led to the development of multi-modal large language models (MLLMs). Despite their advancements, MLLMs face challenges in accuracy and generalization, often due to resource and time constraints. Addressing these issues, our paper introduces a novel multi-Agent Collaborative Network for MLLMs (MLLM network). This framework harnesses collective intelligence and cooperation among multiple agents to enhance the accuracy and generalizability of MLLMs. The collaborative nature of our framework, featuring inter-layer neuron interaction and information exchange, facilitates superior processing and integration of multi-modal data. This leads to marked improvements in performance. The findings underscore the efficacy and potential of the proposed framework, presenting a robust solution for complex multi-modal challenges in machine learning and artificial intelligence. Our experimental evaluations demonstrate that this approach significantly surpasses traditional single-MLLM architectures in task accuracy and generalization.
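Since the abstract does not spell out the collaboration protocol, the sketch below shows only a generic multi-agent scheme: several agents answer, see each other's answers as extra context, re-answer, and a majority vote picks the final output. The function name, the number of rounds, and the stub agents are all hypothetical.

```python
from collections import Counter

def collaborative_answer(agents, question, rounds=2):
    """Let several MLLM agents answer, share the pooled answers as extra context,
    and re-answer before a final majority vote (a generic collaboration sketch;
    the paper's inter-layer interaction scheme is not specified in the abstract)."""
    answers = [agent(question) for agent in agents]
    for _ in range(rounds - 1):
        context = f"{question}\nOther agents answered: {answers}"
        answers = [agent(context) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical usage with stub agents standing in for real MLLM calls.
agents = [lambda q: "cat", lambda q: "cat", lambda q: "dog"]
print(collaborative_answer(agents, "What animal is in the image?"))  # cat
```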
ISBN (Print): 9798400704369
The prevalence of check fraud, particularly with stolen checks sold on platforms such as Telegram, creates significant challenges for both individuals and financial institutions. This underscores the urgent need for innovative solutions to detect and prevent such fraud on social media platforms. While deep learning techniques show great promise in detecting objects and extracting information from images, their effectiveness in addressing check fraud is hindered by the lack of comprehensive, open-source, large training datasets specifically for check information extraction. To bridge this gap, this paper introduces "CheckGuard," a large labeled image-to-text cross-modal dataset designed for check information extraction. CheckGuard comprises over 7,000 real-world stolen check image segments from more than 15 financial institutions, featuring a variety of check styles and layouts. These segments have been manually labeled, resulting in over 50,000 samples across seven key elements: Drawer, Payee, Amount, Date, Drawee, Routing Number, and Check Number. The dataset supports various tasks such as visual question answering (VQA) on checks and check image captioning. Our paper details the rigorous data collection, cleaning, and annotation processes that make CheckGuard a valuable resource for researchers in check fraud detection, machine learning, and multimodal large language models (MLLMs). We not only benchmark state-of-the-art (SOTA) methods on this dataset to assess their performance but also explore potential enhancements. Our application of parameter-efficient fine-tuning (PEFT) techniques to the SOTA MLLMs demonstrates significant performance improvements, providing valuable insights and practical approaches for enhancing model efficacy on this task. As an evolving project, CheckGuard will continue to be updated with new data, enhancing its utility and driving further advancements in the field. Our PEFT-based MLLM code is available at: https://***/feiz
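As an illustration of the parameter-efficient fine-tuning direction mentioned above, the sketch below wraps a frozen linear layer with a LoRA-style low-rank update so that only the small rank-r factors are trained; it is a generic example, not the CheckGuard fine-tuning code, and the layer sizes are assumed.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update W*x + (alpha/r) * B(A(x)),
    the core idea behind LoRA-style PEFT (a sketch, not the CheckGuard training code)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical usage: adapt one projection layer of an MLLM text decoder.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(1, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank factors are updated
```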