Search Results - Inner Mongolia University Library

DDC-Chat: Achieving accurate distracted driver classification through instruction tuning of visual language model

JOURNAL OF SAFETY SCIENCE AND RESILIENCE, 2025, Vol. 6, No. 2, pp. 250-264

Authors: Liao, Chupei; Lin, Kuoyi. Affiliation: Guilin Univ Elect Technol, Sch Business, Guilin 541004, Guangxi, Peoples R China

Driver behavior is a critical factor in road safety, highlighting the need for advanced methods in Distracted Driving Classification (DDC). In this study, we introduce DDC-Chat, a novel classification method based on a visual large language model (VLM). DDC-Chat is an interactive multimodal system built upon LLAVA-Plus, fine-tuned specifically for addressing distracted driving detection. It utilizes logical reasoning chains to activate visual skills, including segmentation and pose detection, through end-to-end training. Furthermore, instruction tuning allows DDC-Chat to continuously incorporate new visual skills, enhancing its ability to classify distracted driving behavior. Our extensive experiments demonstrate that DDC-Chat achieves state-of-the-art performance on public DDC datasets, surpassing previous benchmarks. In evaluations on the 100-Driver dataset, the model exhibits superior results in both zero-shot and few-shot learning contexts, establishing it as a valuable tool for improving driving safety by accurately identifying driver distraction. Due to the computational intensity of inference, DDC-Chat is optimized for deployment on remote servers, with data streamed from in-vehicle monitoring systems for real-time analysis.

Keywords: Classifying distracted driving; visual language model; LLAVA-plus; Logical chain

RAVL: A Retrieval-Augmented Visual Language Model Framework for Knowledge-Based Visual Question Answering

13th International Conference on Natural language Processing and Chinese Computing

Authors: Chai, Naiquan; Zou, Dongsheng; Liu, Jiyuan; Wang, Hao; Yang, Yuming; Song, Xinyi. Affiliation: Chongqing Univ, Sch Comp Sci, Chongqing, Peoples R China

ISBN: (print) 9789819794362; 9789819794379

Knowledge-based visual question answering (VQA) requires external knowledge in addition to the image content to answer questions. Recent studies convert images to text descriptions and then generate answers or acquire implicit knowledge using a large language model (LLM). These methods achieve encouraging results with the strong knowledge retrieval and reasoning capabilities of LLMs. However, methods that incorporate LLMs are limited by the discrepancies between images and their text descriptions presented to LLMs. To address this challenge, we present RAVL, a retrieval-augmented visual language model (VLM) framework for knowledge-based VQA. Specifically, we first fine-tune a VLM on the knowledge-based VQA task with inputs consisting of retrieved knowledge and image-question pairs to adapt the VLM to inputs with retrieved knowledge. After that, we adapt the retrieval module to the fine-tuned VLM using supervision signals provided by the VLM, enabling the retrieved knowledge to improve the VLM perplexity. RAVL overcomes the limitation of visual information loss and improves the effectiveness of VLMs with external knowledge. We conduct experiments on the OK-VQA dataset, and our method achieves 65.73% accuracy, surpassing the previous state-of-the-art method (+3.63%).

Keywords: Knowledge-based visual question answering; Retrieval augmentation; visual language model
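
The abstract above says the retrieval module is adapted so that retrieved knowledge improves the VLM's perplexity. As a reminder of what that quantity measures, the following minimal sketch computes perplexity from per-token log-probabilities and ranks candidate passages by it. The passage names and log-probability values are invented for illustration, and the ranking step is only an assumption about how such a signal could be used; it is not the RAVL training procedure.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities of the gold answer tokens under a VLM,
# conditioned on each candidate retrieved passage (values are invented).
answer_logprobs = {
    "passage_a": [-0.2, -0.1, -0.3],
    "passage_b": [-1.5, -0.9, -2.0],
}

# Lower perplexity means the passage made the answer easier to predict,
# so it would be the more useful passage under this (assumed) criterion.
ranked = sorted(answer_logprobs, key=lambda p: perplexity(answer_logprobs[p]))
```

Here `ranked` places the passage with the lower answer perplexity first; a retriever could, under this assumption, be trained to prefer such passages.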

Visual Language Model for Keyword Spotting on Historical Mongolian Document Images

29th Chinese Control and Decision Conference

Authors: Hongxi Wei; Guanglai Gao. Affiliation: School of Computer Science, Inner Mongolia University

ISBN: (print) 9781509046584

The Bag-of-Visual-Words (BoVW) approach has attracted some attention in the field of keyword ***, the BoVW approach discards the spatial relations of the visual ***, a visual language model is integrated into the BoVW framework in this study so as to add the spatial *** accomplish the process of keyword spotting, two well-known retrieval schemes, the query likelihood model and KL divergence, have been *** experimental results show that the visual language model can significantly improve the performance of keyword spotting on a collection of historical Mongolian document images compared with the original BoVW ***, the influence of different codebook sizes on the performance has been analyzed in this *** the most appropriate size of the codebook has been determined.

Keywords: visual language model; Query Likelihood model; KL Divergence; Smoothing; Keyword Spotting
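
The abstract and keywords above name two retrieval schemes, the query likelihood model and KL divergence, together with smoothing. The sketch below shows both scores over bags of visual-word ids. It is a plain unigram illustration: it does not model the spatial relations that the paper's visual language model adds, the choice of Dirichlet smoothing and the value of `mu` are assumptions, and the tiny "documents" are invented data.

```python
import math
from collections import Counter

def smoothed_lm(words, collection_counts, collection_len, mu=100.0):
    """Dirichlet-smoothed unigram model: p(w|d) = (c(w,d) + mu*p(w|C)) / (|d| + mu)."""
    counts = Counter(words)
    n = len(words)

    def prob(w):
        # Assumes w occurs somewhere in the collection, otherwise p would be 0.
        p_c = collection_counts[w] / collection_len
        return (counts[w] + mu * p_c) / (n + mu)

    return prob

def query_likelihood(query, doc_prob):
    """Log-probability of the query's visual words under the document model."""
    return sum(math.log(doc_prob(w)) for w in query)

def neg_kl(query, doc_prob):
    """Negative KL(query || document), using the query's maximum-likelihood model."""
    q = Counter(query)
    n = len(query)
    return -sum((c / n) * math.log((c / n) / doc_prob(w)) for w, c in q.items())

# Toy collection: each word image is a list of quantized visual-word ids.
docs = {"img1": [1, 1, 2, 3], "img2": [4, 5, 5, 6]}
collection = Counter(w for d in docs.values() for w in d)
total = sum(collection.values())
query = [1, 2]

models = {name: smoothed_lm(ws, collection, total) for name, ws in docs.items()}
ql_rank = sorted(docs, key=lambda d: query_likelihood(query, models[d]), reverse=True)
kl_rank = sorted(docs, key=lambda d: neg_kl(query, models[d]), reverse=True)
```

With a maximum-likelihood query model, as used here, ranking by negative KL divergence gives the same order as ranking by query likelihood; the two differ once the query model itself is smoothed or expanded.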

Leveraging Visual Language Model and Generative Diffusion Model for Zero-Shot SAR Target Recognition

REMOTE SENSING, 2024, Vol. 16, No. 16, p. 2927

Authors: Wang, Junyu; Sun, Hao; Tang, Tao; Sun, Yuli; He, Qishan; Lei, Lin; Ji, Kefeng. Affiliation: Natl Univ Def Technol, Coll Elect Sci & Technol, Changsha 410073, Peoples R China

Simulated data play an important role in SAR target recognition, particularly under zero-shot learning (ZSL) conditions caused by the lack of training samples. The traditional SAR simulation method is based on manually constructing target 3D models for electromagnetic simulation, which is costly and limited by the target's prior knowledge base. Also, the unavoidable discrepancy between simulated SAR and measured SAR makes the traditional simulation method more limited for target recognition. This paper proposes an innovative SAR simulation method based on a visual language model and generative diffusion model by extracting target semantic information from optical remote sensing images and transforming it into a 3D model for SAR simulation to address the challenge of SAR target recognition under ZSL conditions. Additionally, to reduce the domain shift between the simulated domain and the measured domain, we propose a domain adaptation method based on dynamic weight domain loss and classification loss. The effectiveness of semantic information-based 3D models has been validated on the MSTAR dataset and the feasibility of the proposed framework has been validated on the self-built civilian vehicle dataset. The experimental results demonstrate that the first proposed SAR simulation method based on a visual language model and generative diffusion model can effectively improve target recognition performance under ZSL conditions.

Keywords: SAR simulation; target recognition; visual language model; generative diffusion model; domain adaption

Automatic Findings Generation for Distress Images Using In-Context Few-Shot Learning of Visual Language Model Based on Image Similarity and Text Diversity

JOURNAL OF ROBOTICS AND MECHATRONICS, 2024, Vol. 36, No. 2, pp. 353-364

Authors: Watanabe, Yuto; Ogawa, Naoki; Maeda, Keisuke; Ogawa, Takahiro; Haseyama, Miki. Affiliations: Hokkaido Univ, Grad Sch Informat Sci & Technol, Kita 14 Nishi 9, Kita-ku, Sapporo 0600814, Japan; Hokkaido Univ, Fac Informat Sci & Technol, Kita 14 Nishi 9, Kita-ku, Sapporo 0600814, Japan

This study proposes an automatic findings generation method that performs in-context few-shot learning of a visual language model. The automatic generation of findings can reduce the burden of creating inspection records for infrastructure facilities. However, the findings must include the opinions and judgments of engineers, in addition to what is recognized from the image; therefore, the direct generation of findings is still challenging. With this background, we introduce in-context few-shot learning that focuses on image similarity and text diversity in the visual language model, which enables text output with a highly accurate understanding of both vision and language. Based on a novel in-context few-shot learning strategy, the proposed method comprehensively considers the characteristics of the distress image and diverse findings and can achieve high accuracy in generating findings. In the experiments, the proposed method outperformed the comparative methods in generating findings for distress images captured during bridge inspections.

Keywords: automatic findings generation; infrastructure maintenance; large language model; visual language model; in-context few-shot learning
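
The abstract above describes choosing in-context examples by image similarity and text diversity but does not give the selection rule. One way to picture such a trade-off is a greedy, maximal-marginal-relevance-style pick, sketched below. This is a swapped-in illustration, not the authors' strategy: the function `select_examples`, the weight `lam`, and the two-dimensional embeddings are all hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def select_examples(query_img, pool, k=2, lam=0.5):
    """Greedily pick examples whose images resemble the query image
    while their finding texts stay dissimilar to those already chosen."""
    selected = []
    remaining = list(pool)
    while remaining and len(selected) < k:
        def score(ex):
            relevance = cosine(query_img, ex["img"])
            redundancy = max(
                (cosine(ex["txt"], s["txt"]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Invented embeddings: e1 and e2 are near-duplicates in both image and text.
pool = [
    {"id": "e1", "img": [1.0, 0.0], "txt": [1.0, 0.0]},
    {"id": "e2", "img": [0.9, 0.1], "txt": [1.0, 0.1]},
    {"id": "e3", "img": [0.7, 0.7], "txt": [0.0, 1.0]},
]
chosen_ids = [ex["id"] for ex in select_examples([1.0, 0.0], pool)]
```

In this toy pool the second pick skips the near-duplicate `e2` in favor of `e3`, whose finding text differs from the first pick even though its image is less similar to the query.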

OpenECAD: An efficient visual language model for editable 3D-CAD design

COMPUTERS & GRAPHICS-UK, 2024, Vol. 124

Authors: Yuan, Zhe; Shi, Jianqi; Huang, Yanhong. Affiliations: East China Normal Univ, Software Engn Inst, Shanghai, Peoples R China; Natl Trusted Embedded Software Engn Technol Res Ct, Shanghai, Peoples R China

Computer-aided design (CAD) tools are utilized in the manufacturing industry for modeling everything from cups to spacecraft. These programs are complex to use and typically require years of training and experience to master. Structured and well-constrained 2D sketches and 3D constructions are crucial components of CAD modeling. A well-executed CAD model can be seamlessly integrated into the manufacturing process, thereby enhancing production efficiency. Deep generative models of 3D shapes and 3D object reconstruction models have garnered significant research interest. However, most of these models produce discrete forms of 3D objects that are not editable. Moreover, the few models based on CAD operations often have substantial input restrictions. In this work, we fine-tuned pre-trained models to create OpenECAD models (0.55B, 0.89B, 2.4B and 3.1B), leveraging the visual, logical, coding, and general capabilities of visual language models. OpenECAD models can process images of 3D designs as input and generate highly structured 2D sketches and 3D construction commands, ensuring that the designs are editable. These outputs can be directly used with existing CAD tools' APIs to generate project files. To train our network, we created a series of OpenECAD datasets. These datasets are derived from existing public CAD datasets, adjusted and augmented to meet the specific requirements of vision language model (VLM) training. Additionally, we have introduced an approach that utilizes dependency relationships to define and generate sketches, further enriching the content and functionality of the datasets.

Keywords: Small language model; visual language model; Computer aided design; Geometric deep learning

LATENT TOPIC VISUAL LANGUAGE MODEL FOR OBJECT CATEGORIZATION

International Conference on Signal Processing and Multimedia Applications

Authors: Lei Wu; Nenghai Yu; Jing Liu; Mingjing Li. Affiliations: Department of EEIS, University of Science and Technology of China; Institute of Automation, Chinese Academy of Sciences

ISBN: (print) 9789898425720

This paper presents a latent topic visual language model to handle the variation problem in object categorization. Variations including different views, styles, poses, etc., have greatly affected the spatial arrangement and distribution of visual features, on which previous categorization models largely depend. Taking the object variations as hidden topics within each category, the proposed model explores the relationship between object variations and visual feature arrangement in the traditional visual language modeling process. With this improvement, the accuracy of object categorization is further boosted. Experiments on the Caltech101 dataset have shown that this model makes sense and is effective.

Keywords: Multimedia content analysis; Latent topic model; visual language model; Object categorization

KN-VLM: KNowledge-guided Vision-and-language model for visual abductive reasoning

MULTIMEDIA SYSTEMS, 2025, Vol. 31, No. 2, pp. 1-16

Authors: Tan, Kuo; Qi, Zhaobo; Zhong, Jianping; Xu, Yuanrong; Zhang, Weigang. Affiliation: Harbin Inst Technol, Sch Comp Sci & Technol, Weihai 264209, Peoples R China

Visual abductive reasoning strives to deduce the most suitable hypothesis that effectively explains the underlying visual context, garnering considerable attention in the academic community. However, recent efforts are inherently limited by their exclusive reliance on visual information, overlooking the invaluable commonsense and the semantic/causal relationships, leading to inaccurate abductive reasoning outcomes. To tackle the above issue, we propose a simple but powerful KNowledge-guided Vision-and-language model (KN-VLM), which primarily consists of a visual reasoning branch and a knowledge reasoning branch. The visual reasoning branch utilizes a powerful visual embedding model followed by a visual-Qformer to capture visual features. The knowledge reasoning branch aims to acquire two complementary types of knowledge, commonsense knowledge and complemented knowledge. The former aims to extract the intricate and detailed conceptual knowledge embedded within the observed video, which deepens the model's comprehension of the presented video content. The latter utilizes the external knowledge base to further augment the understanding of the interconnections and causal relationships among these concepts, thereby strengthening the model's abductive reasoning capability. After that, the effective fusion of the two branches completes the abductive reasoning task, which generates descriptions for the observed and explanation events. Experimental results on the VAR and CookReasoning datasets show that our model achieves promising performance.

Keywords: Commonsense knowledge; visual language model; visual abductive reasoning; Dense video captioning

Spatiotemporal-Aware Visual Captioning using Vision-Language Pre-Training Model

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Authors: Wu, Shuai; Yang, Weidong; Wu, Shuyan. Affiliations: School of Computer Science, Fudan University, Shanghai, China; Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China

ISBN: (print) 9798350368741

Current visual captioning technologies typically transform 3D/2D visual information into one-dimensional sequential data and employ language models to generate corresponding descriptions. This approach, however, compromises the spatiotemporal information in visual data, making it difficult for models to capture temporal variations and the relative spatial relationships between objects. To address this issue, we propose STPos-VC, a pre-trained vision-language model that maps visual information from the visual vector space to the textual vector space through a visual-text mapper and generates natural language descriptions using a decoder. The mapper incorporates three-dimensional rotational position encoding, which effectively preserves the relative spatiotemporal positional relationships. Furthermore, we pre-train the model on a mixed dataset comprising images and videos through a visual question-answering framework, enabling the model to perform well even with small sample sizes. Experimental results across multiple datasets demonstrate that, compared to existing methods, STPos-VC achieves superior performance in both general-purpose and domain-specific applications. © 2025 IEEE.

Keywords: Multimodality; Pre-training; Spatiotemporal position encoding; visual language model
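
The abstract above states that the mapper uses three-dimensional rotational position encoding to preserve relative spatiotemporal positions, without giving the formulation. The sketch below shows one common way to extend rotary embeddings to three axes: split the feature vector into three chunks and rotate each chunk by one coordinate (time, height, width). The chunk split, the base of 10000, and the helper names are assumptions for illustration and may differ from what STPos-VC does.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base**(-i/d), i = 0, 2, 4, ..."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, t, h, w):
    """Rotate three equal, even-length chunks of `vec` by the t, h and w positions."""
    n = len(vec) // 3
    return rope_1d(vec[:n], t) + rope_1d(vec[n:2 * n], h) + rope_1d(vec[2 * n:3 * n], w)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary 12-dimensional query and key features (4 dimensions per axis).
q = [0.5, -1.0, 2.0, 0.3, 1.1, -0.7, 0.9, 0.2, -0.4, 1.5, 0.6, -0.8]
key = [1.0, 0.2, -0.5, 0.7, 0.3, 0.3, -1.2, 0.8, 0.1, -0.6, 0.4, 1.0]

# The same relative offset (-1, -2, +1) at two different absolute positions.
s1 = dot(rope_3d(q, 3, 5, 2), rope_3d(key, 2, 3, 3))
s2 = dot(rope_3d(q, 10, 9, 7), rope_3d(key, 9, 7, 8))
```

Because each chunk is rotated, the dot product between an encoded query and an encoded key depends only on their offset along each axis, so `s1` and `s2` agree up to floating-point error. That is the sense in which a rotary scheme keeps relative positions.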

Enabling High-Level Worker-Centric Semantic Understanding of Onsite Images Using Visual Language Models with Attention Mechanism and Beam Search Strategy

BUILDINGS, 2025, Vol. 15, No. 6, p. 959

Authors: Deng, Hui; Fu, Kejie; Yu, Binglin; Li, Huimin; Duan, Rui; Deng, Yichuan; Lin, Jia-rui. Affiliations: South China Univ Technol, Sch Civil Engn & Transportat, Guangzhou 510641, Peoples R China; State Key Lab Subtrop Bldg & Urban Sci, Guangzhou 510641, Peoples R China; Tsinghua Univ, Dept Civil Engn, Beijing 100084, Peoples R China

Visual information is becoming increasingly essential in construction management. However, a significant portion of this information remains underutilized by construction managers due to the limitations of existing image processing algorithms. These algorithms primarily rely on low-level visual features and struggle to capture high-order semantic information, leading to a gap between computer-generated image semantics and human interpretation. However, current research lacks a comprehensive justification for the necessity of employing scene understanding algorithms to address this issue. Moreover, the absence of large-scale, high-quality open-source datasets remains a major obstacle, hindering further research progress and algorithmic optimization in this field. To address this issue, this paper proposes a construction scene visual language model based on attention mechanism and encoder-decoder architecture, with the encoder built using ResNet101 and the decoder built using LSTM (long short-term memory). The addition of the attention mechanism and beam search strategy improves the model, making it more accurate and generalizable. To verify the effectiveness of the proposed method, a publicly available construction scene visual-language dataset containing 16 common construction scenes, SODA-ktsh, is built and verified. The experimental results demonstrate that the proposed model achieves a BLEU-4 score of 0.7464, a CIDEr score of 5.0255, and a ROUGE_L score of 0.8106 on the validation set. These results indicate that the model effectively captures and accurately describes the complex semantic information present in construction images. Moreover, the model exhibits strong generalization, perceptual, and recognition capabilities, making it well suited for interpreting and analyzing intricate construction scenes.

Keywords: visual language model; construction scene; image scene understanding; image captioning; attention mechanism
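
The abstract above credits a beam search strategy for part of the captioning model's accuracy. The following minimal sketch shows generic beam search over a toy next-token table, to illustrate why keeping several hypotheses can beat a greedy choice. The table, the tokens, and the function names are invented; a real decoder such as the paper's LSTM would supply the next-token probabilities, and practical implementations often add length normalization, which is omitted here.

```python
import math

def beam_search(step_fn, start, end, beam_width=2, max_len=10):
    """Keep the `beam_width` highest log-probability partial sequences at each step."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that emitted the end token stop expanding.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])

# Hypothetical next-token probabilities, keyed by the previous token only.
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"worker": 0.55, "crane": 0.45},
    "the": {"worker": 0.9, "crane": 0.1},
    "worker": {"</s>": 1.0},
    "crane": {"</s>": 1.0},
}

def toy_step(seq):
    return TABLE[seq[-1]]

best_seq, best_score = beam_search(toy_step, "<s>", "</s>", beam_width=2)
greedy_seq, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_width=1)
```

With a beam width of 1 the search commits to the locally most probable first token and ends with a lower-probability caption; with a width of 2 it recovers the sequence whose overall probability is higher.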
