Search Results - Inner Mongolia University Library

22nd International Conference on Business Process Management (BPM)

Authors: Voelter, Marvin Hadian, Raheleh Kampik, Timotheus Breitmayer, Marius Reichert, Manfred SAP Berlin Germany Ulm Univ Ulm Germany

ISBN: (print) 9783031786655; 9783031786662

This paper investigates the vision capabilities of multimodal Generative Pre-trained Transformers (GPTs) to auto-generate structured process models from diagram- and text-based documents. We introduce a dataset of 123 process models and corresponding documentation, emphasizing real-world element distributions. Using evaluation metrics for process model similarity, this enables ground truth-based assessment of process model generation. We evaluate commercial GPT capabilities with zero-, one-, and few-shot prompting strategies. Our results indicate that generative vision models can be useful tools for semi-automated process modeling based on multimodal documents. More importantly, the dataset and evaluation metrics as well as the open-source evaluation code provide a structured framework for continued systematic evaluations moving forward.

Keywords: Generative Vision models; multimodal large language models; Document Analysis; Business Process Management

PhysID: Physics-based Interactive Dynamics from a Single-view Image

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Gothe, Sourabh Vasant Chattopadhyay, Ayon Kiran, Gunturi Venkata Sai Phani Pratik Agarwal, Vibhav Vachhani, Jayesh Rajkumar Ghosh, Sourav Parameswaranath, V.M. Barath Raj, K.R. Samsung R&D Institute India - Bangalore India

ISBN: (print) 9798350368741

Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either using pre-recorded video responses or requiring multi-view images as input. In this paper, we present PhysID, which streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various multimodal large language models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness. © 2025 IEEE.

Keywords: 3D Modelling; Diffusion; multimodal large language models; Physics-Based Rendering

User-in-the-Loop Evaluation of Multimodal LLMs for Activity Assistance

2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025

Authors: Verghese, Mrinal Chen, Brian Eghbalzadeh, Hamid Nagarajan, Tushar Desai, Ruta Carnegie Mellon University United States Samsung Research America United States Meta Reality Labs Research United States Meta Fundamental AI Research United States

ISBN: (print) 9798331510831

Our research investigates the capability of modern multimodal reasoning models, powered by large language models (LLMs), to facilitate vision-powered assistants for multi-step daily activities. Such assistants must be able to 1) encode relevant visual history from the assistant's sensors, e.g., camera, 2) forecast future actions for accomplishing the activity, and 3) replan based on the user in the loop. To evaluate the first two capabilities, grounding visual history and forecasting in short and long horizons, we conduct benchmarking of two prominent classes of multimodal LLM approaches, Socratic models [46] and Vision Conditioned Language Models (VCLMs) [31], on video-based action anticipation tasks using offline datasets. These offline benchmarks, however, do not allow us to close the loop with the user, which is essential to evaluate the replanning capabilities and measure successful activity completion in assistive scenarios. To that end, we conduct a first-of-its-kind user study, with 18 participants performing 3 different multi-step cooking activities while wearing an egocentric observation device called Aria [37] and following assistance from multimodal LLMs. We find that the Socratic approach outperforms VCLMs in both offline and online settings. We further highlight how grounding long visual history, common in activity assistance, remains challenging in current models, especially for VCLMs, and demonstrate that offline metrics do not indicate online performance. © 2025 IEEE.

Keywords: multimodal large language models; video understanding

Perceive. Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025

Authors: Amoroso, Roberto Zhang, Gengyuan Koner, Rajat Baraldi, Lorenzo Cucchiara, Rita Tresp, Volker LMU Munich Germany MCML Germany University of Modena and Reggio Emilia Italy IIT-CNR Italy

ISBN: (print) 9798331510831

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in multimodal large language models (MLLMs) have transformed video QA by leveraging their exceptional common-sense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA. © 2025 IEEE.

Keywords: multimodal large language models; question-guided temporal modeling; video question answering

Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation

NEURAL NETWORKS, 2025, vol. 184, p. 107059

Authors: Zhang, Xiaofeng Zeng, Fanshuo Gu, Chaochen Shanghai Jiao Tong Univ 800 Dongchuan Rd Shanghai 200240 Peoples R China Cent South Univ 932 South Lushan Rd Changsha 410083 Hunan Peoples R China

Recently, the field of multimodal large language models (MLLMs) has grown rapidly, with many large vision-language models (LVLMs) relying on sequential visual representations. In these models, images are broken down into numerous tokens before being fed into the large language model (LLM) alongside text prompts. However, the opaque nature of these models poses significant challenges to their interpretability, particularly when dealing with complex reasoning tasks. To address this issue, we utilized Grad-CAM to investigate the interaction dynamics between images and text within complex reasoning processes. Our information flow analysis revealed a distinct pattern: it tends to converge in the initial layers and then disperse as it progresses through deeper layers. This pattern suggests that the early stages of processing focus on the interaction between visual and textual elements, while later stages engage in deeper reasoning. We developed Simignore, a novel image token reduction technique based on this insight. Simignore enhances the model's complex reasoning capabilities by calculating the similarity between image and text embeddings, thereby ignoring tokens that are not semantically relevant. Extensive experiments across different MLLM architectures have shown that our approach consistently improves performance in complex reasoning tasks. This work not only contributes to the advancement of MLLM interpretability but also provides a robust framework for future research in this area. The paper's source code can be accessed from https://***/FanshuoZeng/Simignore.

Keywords: multimodal large language models; Information flow; Image-text similarity
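
The idea of dropping image tokens by their similarity to the text can be illustrated with a toy sketch. The abstract only says that Simignore computes similarity between image and text embeddings and ignores semantically irrelevant image tokens; the scoring rule used here (maximum cosine similarity to any text token), the top-k cutoff, and every name in the code are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_image_tokens(image_tokens, text_tokens, keep):
    """Return the sorted indices of the `keep` image tokens most similar to the text.

    Hypothetical scoring rule: each image token is scored by its highest
    cosine similarity to any text token. Assumes `text_tokens` is non-empty.
    """
    scores = []
    for idx, img in enumerate(image_tokens):
        best = max(cosine(img, txt) for txt in text_tokens)
        scores.append((best, idx))
    # Highest score first; ties are broken by the lower token index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(idx for _, idx in scores[:keep])


# Toy 2-D embeddings standing in for real model embeddings.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_tokens = [[1.0, 0.1]]
kept = select_image_tokens(image_tokens, text_tokens, keep=2)
print(kept)
```

In this toy input, the image tokens at indices 1 and 3 point away from the single text embedding, so they are the ones left out. A real model would apply such a filter to high-dimensional embeddings before the tokens reach the language model.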

MS-RRBR: A Multi-Model Synergetic Framework for Restricted and Repetitive Behavior Recognition in Children with Autism

APPLIED SCIENCES-BASEL, 2025, vol. 15, no. 3, pp. 1577-1577

Authors: Wang, Yonggu Shao, Yifan Yu, Zengyi Wang, Zihan Zhejiang Univ Technol Coll Educ Hangzhou 310023 Peoples R China

Restricted and Repetitive Behaviors (RRBs) are hallmark features of children with autism spectrum disorder (ASD) and are also one of the diagnostic criteria for the condition. Traditional methods of RRBs assessment through manual observation are limited by low diagnostic efficiency and uncertainty in outcomes. As a result, AI-assisted screening for autism has emerged as a promising research direction. In this study, we explore the synergy of visual foundation models and multimodal large language models (MLLMs), proposing a Multi-Model Synergistic Restricted and Repetitive Behavior Recognition method (MS-RRBR). Based on this method, we developed an interpretable multi-model autonomous question-answering system. To evaluate the effectiveness of our approach, we collected and annotated the Autism Restricted and Repetitive Behavior Dataset (ARRBD), which includes 10 ASD-related behaviors easily observable from various visual perspectives. Experimental results on the ARRBD dataset demonstrate that our multi-model collaboration outperforms single-model approaches, achieving the highest recognition accuracy of 94.94%. The MS-RRBR leverages the extensive linguistic knowledge of GPT-4o to enhance the zero-shot visual recognition capabilities of the MLLM, while also providing clear explanations for system decisions. This approach holds promise for providing timely, reliable, and accurate technical support for clinical diagnosis and educational rehabilitation in ASD.

Keywords: autism spectrum disorder; restricted and repetitive behaviors; action recognition; multimodal large language models; GPT-4o

From Vision to Perception: Transforming Art Experience for the Blind with C-ArtQA

JOURNAL OF IMAGING SCIENCE AND TECHNOLOGY, 2025, vol. 69, no. 1, pp. 1-11

Authors: Guo, Jia Hsieh, Yung-Cheng Zhejiang Univ Hangzhou 310058 Peoples R China

Blind and low vision (BLV) individuals face unique challenges due to a lack of objective explanations and shared artistic vocabulary. This study introduces Cultural ArtQA (C-ArtQA), a benchmark designed to assess whether current multimodal large language models (MLLMs; GPT-4V and Gemini) meet BLV needs by integrating structured visual art descriptions into auditory and tactile domains. The approach categorizes art into Visual, Multimodal Extended, and Imagery Perceptions, distributed across 19 fine-grained categories. The study employs visual question answering with 361 questions generated from a dataset of modern artworks, selected for their accessibility and cultural richness by BLV volunteers and art experts. Results indicate that GPT-4V excels in Visual and Imagery Perceptions while both models underperform in Multimodal Extended Perceptions, highlighting areas for improvement in AI's support for BLV individuals. This study lays the foundation for developing MLLMs to meet the visual art appreciation needs of the BLV community.

Keywords: multimodal large language models; visual art; blind and low vision; Imagery Perceptions

Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data

COGNITIVE COMPUTATION, 2024, vol. 16, no. 6, pp. 3484-3504

Authors: Xie, Peijin Liu, Bingquan Harbin Inst Technol Fac Comp Harbin Peoples R China

Does the model demonstrate exceptional proficiency in "item counting," "color recognition," or other Fundamental Visual Comprehension Capabilities (FVCCs)? There have been remarkable advancements in the multimodal field: pretrained general Vision Language Models (VLMs) exhibit strong performance across a range of intricate Visual Language (VL) tasks, and multimodal large language models (MLLMs) exhibit novel visual reasoning abilities that emerge from only a few examples. However, models tend to encounter difficulties when confronted with texts supplemented with specific details by simple visual phrases. Moreover, there is a scarcity of datasets in sufficient quantity, variety, and composability to enable the evaluation of each FVCC using statistical metrics. Accordingly, we decomposed the complete VL task into 9 M simple Visual Phrase Triplets (VPTs) across 16 categories representing 16 distinct FVCCs from the structural scene graph. Then, we reconstructed a Multilevel Scene Graph (MLSG) for each image and introduced our unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark consisted of three exams and evaluated the performance of 8 widely used VLMs and 10 MLLMs respectively. The results demonstrate the performance of each model across 16 classes in FVCC, as well as their lower and upper limits under conditions of increased text complexity or unnoised image input. Finally, we enhanced the efficiency of MLLMs and evoked their In-Context Learning characteristics by appending multiple VPT-generated QA pairs of identical types to the conversation history without tuning. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations on FVCC.

Keywords: Vision language models; multimodal large language models; Visual reasoning; Multilevel scene graph

Human 0, MLLM 1: Unlocking New Layers of Automation in Language-Conditioned Robotics with Multimodal LLMs

21st International Conference on Mechatronics-Mechatronika

Authors: ElMallah, Raley Zamani, Nima Lee, Chi-Guhn Univ Toronto Mech & Ind Engn Toronto ON Canada Cobionix Corp Kitchener ON Canada

ISBN: (print) 9798350394917; 9798350394900

Language-conditioned robotics has seen tremendous growth in frameworks that aim to improve the success rates of robots acting upon the environment according to free-form language instructions. However, most existing frameworks leverage a human in the loop to assist with critical functions. Humans are mainly involved in ensuring that a human-requested task is feasible, resetting the robot when it diverges from achieving the requested goal, and deciding if it has completed the task. As human involvement limits the scalability of language-conditioned robotics, we propose automating these human functions through Multimodal Large Language Models in the Loop (MLLM-IL). We conduct experiments leveraging multimodal large language models, specifically OpenAI's GPT-4 and Google Gemini, to evaluate their potential in automating crucial functions. The introduced new layers of automation include analyzing task feasibility, assessing task progress, and detecting task success. We investigate how different factors, including the choice of LLM, image resolution of the input images, and the structure of the prompt, affect the performance of the LLMs in achieving the target functions. Results show significant zero-shot success with feasibility analysis accuracies exceeding 90%. Our work demonstrates the immense potential of utilizing MLLM-IL to complement existing frameworks in language-conditioned robotics, opening the space for a wealth of new applications.

Keywords: language-conditioned Robotics; Intelligent Automation; Man-machine Interactions; Vision for Robots; multimodal large language models

Feeling Textiles through AI: An Exploration into Multimodal Language Models and Human Perception Alignment

Companion International Conference on Multimodal Interaction

Authors: Zhong, Shu Gatti, Elia Cho, Youngjun Obrist, Marianna UCL Dept Comp Sci London England

ISBN: (print) 9798400704628

Human-artificial intelligence (AI) alignment ensures that AI systems align with human goals and behaviors. This paper introduces perceptual alignment as a critical aspect of this alignment, focusing on the concurrence between human judgments and AI evaluations across sensory modalities. We particularly explore how multimodal large language models (MLLMs), which process both visual and textual data, interpret the tactile qualities of textiles, a significant challenge in online shopping environments. Our research analyzes six vision-based MLLMs to see how they describe the tactile experience of textiles and compares these AI-generated descriptions with human assessments. Through semantic similarity measures and in-person evaluations, we investigate the extent of alignment between human perceptions and AI descriptions. Our findings indicate significant variability in the AI's ability to interpret different textiles, highlighting both the potential and limitations of current AI models in achieving perceptual alignment. This work contributes to understanding the complexities of aligning AI capabilities with human touch sensory experiences.

Keywords: Human-AI Alignment; Human-AI interaction; Touch Experience; Textile Hand; multimodal large language models
