Search Results - Inner Mongolia University Library

MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models

SENSORS, 2025, Vol. 25, No. 1, p. 258

Authors: Li, Chao; Liao, Yonghao; Ding, Caichang; Ye, Zhiwei. Affiliations: Hubei Univ Technol, Sch Comp Sci, Wuhan 430068, Peoples R China; Hubei Engn Univ, Sch Comp & Informat Sci, Xiaogan 432000, Peoples R China

Large visual language models like Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and performance of visual language models. We conducted extensive experiments and significantly improved performance on three datasets (ε = 4/255). Compared with traditional manually designed prompts, the accuracy and robustness increased by an average of 17.84% and 10.85%, respectively. Moreover, our method still delivers a substantial performance improvement under different attack methods. With our efficient settings, compared with traditional manual prompts, our average accuracy and robustness are improved by 32.16% and 21.00%, respectively, under three different attacks.

Keywords: multi-modal; adversarial robustness; visual language models; prompt tuning

The future of action recognition: are multi-modal visual language models the key?

SIGNAL IMAGE AND VIDEO PROCESSING, 2025, Vol. 19, No. 4, pp. 1-12

Authors: Gumuskaynak, Enes; Eken, Suleyman. Affiliations: Hisar Hlth Res Ctr, Med Biochem, Istanbul, Turkiye; Kocaeli Univ, Informat Syst Engn, TR-41001 Izmit, Kocaeli, Turkiye

This study investigates the potential of visual language models (VLMs) for action recognition, a critical task in video analysis. Traditional action recognition methods predominantly rely on visual features, often struggling with challenges such as complex actions, varied environments, and high intra-class variability. VLMs, which integrate visual and textual data, offer a promising solution by leveraging contextual information to enhance recognition accuracy and robustness. We evaluate several state-of-the-art multi-modal VLMs, including Moondream2, Florence-2-large, PaliGemma-3B, and Meta Chameleon 7B, on the UCF101 and Kinetics-400 action recognition datasets. The performance of these models is analyzed in their fine-tuning-free states, providing insights into their applicability and effectiveness in action recognition tasks. Our results indicate that while these models demonstrate substantial potential, further fine-tuning and optimization could unlock even greater performance. This study contributes to the understanding of VLMs' capabilities in action recognition and highlights areas for future research and development.

Keywords: visual language models; Action recognition; Multi-modal models; Video analysis; UCF101 Dataset; Kinetics-400 dataset

ReplanVLM: Replanning Robotic Tasks With Visual Language Models

IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, Vol. 9, No. 11, pp. 10201-10208

Authors: Mei, Aoran; Zhu, Guo-Niu; Zhang, Huaxiang; Gan, Zhongxue. Affiliation: Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China

Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analytics and generation, as well as their broad knowledge of the world. However, they fall short in decoding visual cues. LLMs have limited direct perception of the world, which leads to a deficient grasp of the current state of the world. By contrast, the emergence of visual language models (VLMs) fills this gap by integrating visual perception modules, which can enhance the autonomy of robotic task planning. Despite these advancements, VLMs still face challenges, such as the potential for task execution errors, even when provided with accurate instructions. To address such issues, this letter proposes a ReplanVLM framework for robotic task planning. In this study, we focus on error correction interventions. An internal error correction mechanism and an external error correction mechanism are presented to correct errors under corresponding phases. A replan strategy is developed to replan tasks or correct error codes when task execution fails. Experimental results on real robots and in simulation environments have demonstrated the superiority of the proposed framework, with higher success rates and robust error correction capabilities in open-world tasks.

Keywords: Robots; Planning; Chatbots; Error correction; Visualization; Error correction codes; Adaptation models; Visual perception; Predictive models; Computational modeling; AI-enabled robotics; task planning; human-robot collaboration; visual language models

GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

21st IEEE International Conference on Mechatronics and Automation (IEEE ICMA)

Authors: Mei, Aoran; Wang, Jianhua; Zhu, Guo-Niu; Gan, Zhongxue. Affiliation: Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China

ISBN: (Print) 9798350388084; 9798350388077

With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%. Videos of our experiments are available at https://***/sam-MKCPP7Y.

Keywords: Task planning; Multi-agent; visual language models; Zero-sum game theory; Decision-making

A Design of Interface for Visual-Impaired People to Access Visual Information from Images Featuring Large Language Models and Visual Language Models

CHI Conference on Human Factors in Computing Systems (CHI)

Author: Zhang, Zhe-Xin. Affiliation: Univ Tsukuba, Digital Nat Grp, Tsukuba, Ibaraki, Japan

ISBN: (Print) 9798400703317

We propose an interface design that lets visually impaired people access visual information from images using large language models (LLMs), visual language models (VLMs), and Segment-Anything. We use Semantic-Segment-Anything to generate the segmentation of semantic objects in images. The segmentation includes two parts: a term set describing the semantic object, and a segmented mask that represents the shape of the semantic object. We provide two methods for the visually impaired user to access the information of the semantic object and its peripheral information in the image. In one method, the LLM summarizes the term set to create a description. In the other method, the image with the object masked is provided to a visual language model, which is prompted to respond with a description. In both methods, the mask, after processing, can be accessed on a dot display, and the description is presented to the user in a synthesized voice.

Keywords: Human-Computer Interaction; Large language models; visual language models; Segment-Anything

Unbiased Scene Graph Generation via Visual Language Models

24th International Conference on Control, Automation and Systems

Authors: Kim, Eunseo; Park, Han-Mu. Affiliation: Korea Elect Technol Inst, Artificial Intelligence Res Ctr, Seoul 13488, South Korea

ISBN: (Print) 9798331517939; 9788993215380

Scene graph generation has become increasingly important as it bridges the gap between linguistic and visual information of scenes, facilitating a high-dimensional understanding of scenes. In this paper, we analyze the limitations of current scene graph generation methods induced by the inherent semantic relationship biases embedded in existing datasets. Furthermore, we propose a method to enhance scene graphs by leveraging visual language models (VLMs). This approach leverages the strengths of VLMs in understanding and generating triplets with semantic predicates, ensuring unbiased and fine-grained scene graphs.

Keywords: Scene graph generation; Multimodal understanding; Large language models; visual language models

Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models via Diffusion Models

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2025, Vol. 20, pp. 1333-1348

Authors: Guo, Qi; Pang, Shanmin; Jia, Xiaojun; Liu, Yang; Guo, Qing. Affiliations: Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China; Agcy Sci Technol & Res, Ctr Frontier AI Res, Singapore 138632, Singapore; Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore; Agcy Sci Technol & Res, Inst High Performance Comp, Singapore 138632, Singapore

Adversarial attacks, particularly targeted transfer-based attacks, can be used to assess the adversarial robustness of large visual-language models (VLMs), allowing for a more thorough examination of potential security flaws before deployment. However, previous transfer-based adversarial attacks incur high costs due to high iteration counts and complex method structure. Furthermore, due to the unnaturalness of adversarial semantics, the generated adversarial examples have low transferability. These issues limit the utility of existing methods for assessing robustness. To address these issues, we propose AdvDiffVLM, which uses diffusion models to generate natural, unrestricted and targeted adversarial examples via score matching. Specifically, AdvDiffVLM uses Adaptive Ensemble Gradient Estimation (AEGE) to modify the score during the diffusion model's reverse generation process, ensuring that the produced adversarial examples have natural adversarial targeted semantics, which improves their transferability. Simultaneously, to improve the quality of adversarial examples, we use the GradCAM-guided Mask Generation (GCMG) to disperse adversarial semantics throughout the image rather than concentrating them in a single area. Finally, AdvDiffVLM embeds more target semantics into adversarial examples after multiple iterations. Experimental results show that our method generates adversarial examples 5x to 10x faster than state-of-the-art (SOTA) transfer-based adversarial attacks while maintaining higher quality adversarial examples. Furthermore, compared to previous transfer-based adversarial attacks, the adversarial examples generated by our method have better transferability. Notably, AdvDiffVLM can successfully attack a variety of commercial VLMs in a black-box environment, including GPT-4V. The code is available at https://***/gq-max/AdvDiffVLM.

Keywords: Adversarial attack; visual language models; diffusion models; score matching

ReTrackVLM: Transformer-Enhanced Multi-Object Tracking with Cross-Modal Embeddings and Zero-Shot Re-Identification Integration

APPLIED SCIENCES-BASEL, 2025, Vol. 15, No. 4, p. 1907

Author: Bayraktar, Ertugrul. Affiliation: Yildiz Tech Univ, Dept Mechatron Engn, TR-34349 Istanbul, Turkiye

Multi-object tracking (MOT) is an important task in computer vision, particularly in complex, dynamic environments with crowded scenes and frequent occlusions. Traditional tracking methods often suffer from identity switches (IDSws) and fragmented tracks (FMs), which limit their ability to maintain consistent object trajectories. In this paper, we present a novel framework, called ReTrackVLM, that integrates multimodal embeddings from a visual language model (VLM) with a zero-shot re-identification (ReID) module to enhance tracking accuracy and robustness. ReTrackVLM leverages the rich semantic information from VLMs to distinguish objects more effectively, even under challenging conditions, while the zero-shot ReID mechanism enables robust identity matching without additional training. The system also includes a motion prediction module, powered by Kalman filtering, to handle object occlusions and abrupt movements. We evaluated ReTrackVLM on several widely used MOT benchmarks, including MOT15, MOT16, MOT17, MOT20, and DanceTrack. Our approach achieves state-of-the-art results, with improvements of 1.5% MOTA and a reduction of 10.3% in IDSws compared to existing methods. ReTrackVLM also excels in tracking precision, recording a 91.7% precision on MOT17. However, in extremely dense scenes, the framework faces challenges with slight increases in IDSws. Despite the computational overhead of using VLMs, ReTrackVLM demonstrates the ability to track objects effectively in diverse scenarios.

Keywords: multimodal embeddings; transformer-based multi-object tracking; visual language models; visual object detection; zero-shot re-identification

Visualising the language practices of lower secondary students: outlines for practice-based models of multilingualism

APPLIED LINGUISTICS REVIEW, 2024, Vol. 15, No. 5, pp. 2035-2059

Authors: Storto, Andre; Haukas, Asta; Tiurikova, Irina. Affiliations: Univ Bergen, Bergen, Norway; Univ Bergen, Dept Foreign Languages, Bergen, Norway

The multilingual turn in applied linguistics has produced a number of models that approach multilingualism from a variety of disciplinary and theoretical perspectives. However, fully developed models of multilingualism that focus on the language practices of individuals and groups are still lacking. This paper helps address this gap by introducing visual models that represent the contexts of practice and attitudes to the languages in the repertoire of lower secondary pupils in Norway. The paper starts by introducing the rich linguistic scenario in Norway and the role of language learning in developing students' multilingual abilities. After a brief discussion on the role of practice in language learning, we provide an outline of current models of multilingualism, situating our visual models, the Ungsprak Practice-Based Models of Multilingualism (UPMM), in the field. The paper then focuses on the properties of the UPMM, which represent data collected from an online questionnaire answered by 593 students in lower secondary school and allow for the exploration of data both from the perspective of the whole group of participants and from an individual perspective. Particular attention is paid to the interactive features of the models, which can be used by teachers and educators as pedagogical tools for exploring multilingualism and language learning. The paper concludes with a discussion of the contexts of practice for the languages in the participants' repertoires based on the visual models.

Keywords: exploratory research; foreign language learning; multilingualism; visual language models

Semantic Scene Understanding with Large Language Models on Unmanned Aerial Vehicles

DRONES, 2023, Vol. 7, No. 2, p. 114

Authors: de Curto, J.; de Zarza, I.; Calafate, Carlos T. Affiliations: Ctr Intelligent Multidimens Data Anal, HK Sci Pk, Shatin, Hong Kong, Peoples R China; Univ Politecn Valencia, Dept Informat Sistemas & Comp, Valencia 46022, Spain; GOETHE Univ Frankfurt Main, Informat & Math, D-60323 Frankfurt, Germany; Univ Oberta Catalunya, Estudis Informat Multimedia & Telecomun, Barcelona 08018, Spain

Unmanned Aerial Vehicles (UAVs) are able to provide instantaneous visual cues and a high-level data throughput that could be further leveraged to address complex tasks, such as semantically rich scene understanding. In this work, we built on the use of large language models (LLMs) and visual language models (VLMs), together with a state-of-the-art detection pipeline, to provide thorough zero-shot UAV scene literary text descriptions. The generated texts achieve a Gunning Fog median grade level in the range of 7-12. Applications of this framework could be found in the filming industry and could enhance user experience in theme parks or in the advertisement sector. We demonstrate a low-cost, highly efficient, state-of-the-art practical implementation of microdrones in a well-controlled and challenging setting, in addition to proposing the use of standardized readability metrics to assess LLM-enhanced descriptions.

Keywords: scene understanding; large language models; visual language models; CLIP; GPT-3; YOLOv7; UAV
