Search Results - Inner Mongolia University Library

An Experimental Evaluation of Smart Sensors for Pedestrian Attribute Recognition Using Multi-Task Learning and Vision Language Models

SENSORS, 2025, Vol. 25, No. 6

Authors: Greco, Antonio; Saggese, Alessia; Sansone, Carlo; Vento, Bruno
Affiliations: Univ Salerno, I-84084 Fisciano, Italy; Univ Naples Federico II, I-80125 Naples, Italy

This paper presents the experimental evaluation and analyzes the results of the first edition of the pedestrian attribute recognition (PAR) contest, the international competition which focused on smart visual sensors based on multi-task computer vision methods for the recognition of binary and multi-class pedestrian attributes from images. The participant teams designed intelligent sensors based on vision-language models, transformers and convolutional neural networks that address the multi-label recognition problem leveraging task interdependencies to enhance model efficiency and effectiveness. Participants were provided with the MIVIA PAR Dataset, containing 105,244 annotated pedestrian images for training and validation, and their methods were evaluated on a private test set of over 20,000 images. In the paper, we analyze the smart visual sensors proposed by the participating teams, examining the results in terms of accuracy, standard deviation and confusion matrices and highlighting the correlations between design choices and performance. The results of this experimental evaluation, conducted in a challenging and realistic framework, suggest possible directions for future improvements in these smart sensors that are thoroughly discussed in the paper.

Keywords: pedestrian attribute recognition contest; multi-task learning; vision language models

Mutual Prompt Leaning for Vision Language Models

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, Vol. 133, No. 3, pp. 1258-1276

Authors: Long, Sifan; Zhao, Zhen; Yuan, Junkun; Tan, Zichang; Liu, Jiangjiang; Feng, Jingyuan; Wang, Shengsheng; Wang, Jingdong
Affiliations: Jilin Univ, Coll Comp Sci & Technol, 2699 Qianjin St, Changchun 130012, Jilin, Peoples R China; Jilin Univ, Key Lab Symbol Computat & Knowledge Engn, Minist Educ, 2699 Qianjin St, Changchun 130012, Jilin, Peoples R China; Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China; Univ Sydney, Sch Elect & Informat Engn, Sydney, Australia; Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China

Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.
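The "harmonic mean" in the abstract above is not defined there. A minimal sketch, assuming the base-to-new evaluation convention commonly used for CoCoOp-style prompt learning, where the harmonic mean combines accuracy on base (seen) classes with accuracy on new (unseen) classes. The function name and the example accuracies are hypothetical and are not taken from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies (both in percent).

    Assumption: this is the usual base/new trade-off score; the abstract
    itself does not spell out the formula.
    """
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies, chosen only to illustrate the formula.
h = harmonic_mean(80.0, 70.0)
print(round(h, 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well by improving base-class accuracy at the expense of new classes.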

Keywords: Vision language models; Prompt learning; Visual prompt; Mutual learning; Visual recognition

Assessing the spatial accuracy of geocoding flood-related imagery using Vision Language Models

SPATIAL INFORMATION RESEARCH, 2025, Vol. 33, No. 2

Authors: Schmidt, Sebastian; Fragachan, Eleonor Diaz; Arifi, Dorian; Hanny, David; Resch, Bernd
Affiliations: Univ Salzburg, Dept Geoinformat Z GIS, Schillerstr 30, A-5020 Salzburg, Austria; Res & Innovat Eviden, C Albarracin25, Madrid 28037, Spain; IT U Interdisciplinary Transformat Univ Austria, Geosocial Artificial Intelligence, Altenberger Str 66c, A-4040 Linz, Austria; Harvard Univ, Ctr Geog Anal, 1737 Cambridge St, Cambridge, MA 02138, USA

While the capabilities of large language models and visual language models for various classification tasks have advanced significantly, their potential for location inference remains largely underexplored. Therefore, this study evaluates the performance of four prominent models - BLIP-2, LLaVA1.6, OpenFlamingo, and GPT-4o - for geocoding flood-related images from Flickr. Model inferences are compared against the original photo locations and human-labelled assessments. Our findings reveal that GPT-4o achieves the highest spatial accuracy (median deviation of 89.12 km). OpenFlamingo geocodes the highest number of images (90.7%), albeit with fluctuating quality (median 408.35 km), while still outperforming the human annotators. LLaVA1.6 geocodes only 18.9% of all images, while BLIP-2 exhibits the highest median deviation (1,781 km). We observe a spatial bias in our results, with inferences being most accurate in Central Europe. Additionally, model results improve when images feature recognisable landmarks. The proposed workflow could significantly increase the amount of geocoded web-based data available for disaster management, though further research is required to enhance accuracy across diverse geographic contexts.
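The spatial-accuracy figures above are median deviations in kilometres between inferred and original photo locations. The abstract does not say how distances were computed, so the following is only a sketch under the assumption of great-circle (haversine) distance on a spherical Earth. The function names and the sample coordinates are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_deviation_km(predicted, actual):
    """Median distance between paired predicted and actual (lat, lon) tuples."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))


# Hypothetical coordinates, for illustration only.
predicted = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
actual = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0)]
print(median_deviation_km(predicted, actual))
```

A median rather than a mean keeps a handful of wildly misplaced inferences from dominating the score. Images a model declines to geocode would have to be excluded or counted separately, which is why the abstract reports coverage (for example 90.7% versus 18.9%) alongside the deviation.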

Keywords: Location inference; Disaster management; Vision language models; Geocoding; Flickr

Perceptual visual security index: Analyzing image content leakage for Vision Language Models

JOURNAL OF INFORMATION SECURITY AND APPLICATIONS, 2025, Vol. 89

Authors: Hu, Lishuang; Xiang, Tao; Guo, Shangwei; Li, Xiaoguo; Yang, Ying
Affiliations: Chongqing Univ, Coll Comp Sci, Chongqing 401331, Peoples R China; Agcy Sci Technol & Res, Singapore 138632, Singapore

During the training phase of vision language models (VLMs), the privacy storage and sharing of images are of paramount importance. While the Visual Security Index (VSI) is commonly used for content leakage analysis, it usually focuses on comparing content similarity between plain and protected or encrypted images, neglecting the threat model of visual security. In this paper, considering the functionality of the human visual capability, we comprehensively analyze the system model of VSIs and propose a novel perceptual visual security index (PVSI) to evaluate the content leakage of perceptually encrypted images for VLMs. In particular, we take visual perception (VP) as the adversary's capability and present the definition of VSI under an honest-but-curious threat model. To evaluate the content leakage of encrypted images under the VP assumption, we first present a robust feature descriptor and obtain the semantic content sets of both plain and encrypted images. Then, we propose a systematic method to reduce the impact of different encryption algorithms. We further evaluate the similarity between semantic content sets to obtain the proposed PVSI. We also analyze the consistency between the proposed visual security definition and PVSI. Extensive experiments are performed on five publicly available image databases. Our experimental results demonstrate that compared with many existing state-of-the-art visual security metrics, the proposed PVSI exhibits better performance not only on images generated from specific image encryption algorithms but also on publicly available image databases.

Keywords: Data privacy; Privacy leakage; Visual security index; Vision language models

Learning with Enriched Inductive Biases for Vision-Language Models

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, Vol. 133, No. 6, pp. 3746-3761

Authors: Yang, Lingxiao; Zhang, Ru-Yuan; Chen, Qi; Xie, Xiaohua
Affiliations: Sun Yat sen Univ, Sch Syst Sci & Engn, Guangzhou, Peoples R China; Shanghai Jiao Tong Univ, Brain Hlth Inst, Natl Ctr Mental Disorders, Shanghai Mental Hlth Ctr, Sch Med, Shanghai, Peoples R China; Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China; Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Peoples R China; Pazhou Lab Huangpu, Guangzhou, Peoples R China

Vision-language models, pre-trained on large-scale image-text pairs, serve as strong foundation models for transfer learning across a variety of downstream tasks. For few-shot generalization tasks, i.e., when the model is trained on few-shot samples and then tested on unseen categories or datasets, there is a balance to be struck between generalization and discrimination when tweaking these models. Existing approaches typically rely on one or two strategies during training to learn task-specific knowledge, while preserving as much task-agnostic representation as possible. However, these methods overlook the importance of other useful inductive biases, thereby limiting their generalization capabilities. In this work, we propose a method - Learning with Enriched Inductive Biases (LwEIB) - to explore multiple inductive biases at the text, model, and optimization levels. Specifically, we first propose to enrich the handcrafted text prompt with Large Language Model-generated descriptions for each category. To better capture structural cues in both linguistics and vision, we design two new adapters for text and image encoders, respectively. Additionally, we propose a slow-fast optimization method to explore different degrees of adaptation more efficiently, learning task-specific representations while maintaining task-agnostic ones. We empirically validate the effectiveness of LwEIB on three widely used benchmarks. Remarkably, our LwEIB outperforms numerous state-of-the-art methods across all evaluation metrics, demonstrating its efficacy and versatility. Our code is available at https://***/ZjjConan/VLM-LwEIB.

Keywords: Vision language models; Inductive biases; Few-shot adaptation; Transformer

Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data

COGNITIVE COMPUTATION, 2024, Vol. 16, No. 6, pp. 3484-3504

Authors: Xie, Peijin; Liu, Bingquan
Affiliations: Harbin Inst Technol, Fac Comp, Harbin, Peoples R China

Does the model demonstrate exceptional proficiency in "item counting," "color recognition," or other Fundamental Visual Comprehension Capabilities (FVCCs)? There have been remarkable advancements in the multimodal field: pretrained general vision language models exhibit strong performance across a range of intricate Visual Language (VL) tasks, and Multimodal Large Language Models (MLLMs) show emergent visual reasoning abilities from only a few examples. However, models tend to encounter difficulties when confronted with texts supplemented with specific details by simple visual phrases. Moreover, there is a scarcity of datasets in sufficient quantity, variety, and composability to enable the evaluation of each FVCC using statistical metrics. Accordingly, we decomposed the complete VL task into 9 M simple Visual Phrase Triplets (VPTs) across 16 categories representing 16 distinct FVCCs from the structural scene graph. Then, we reconstructed a Multilevel Scene Graph (MLSG) for each image and introduced our unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark consisted of three exams and evaluated the performance of 8 widely used VLMs and 10 MLLMs. The results demonstrate the performance of each model across 16 classes in FVCC, as well as their lower and upper limits under conditions of increased text complexity or unnoised image input. Finally, we enhanced the efficiency of MLLMs and evoked their In-Context Learning characteristics by appending multiple VPT-generated QA pairs of identical types to the conversation history without tuning. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations on FVCC.

Keywords: Vision language models; Multimodal large language models; Visual reasoning; Multilevel scene graph

Integrating Text-to-Image and Vision Language Models for Synergistic Dataset Generation: The Creation of Synergy-General-Multimodal Pairs

2nd International Workshop on Generalizing from Limited Resources in the Open World (GLOW)

Authors: Huang, Mao Xun; Huang, Hen-Hsen
Affiliations: Natl Chengchi Univ, Dept Management Informat Syst, Taipei, Taiwan; Acad Sinica, Inst Informat Sci, Taipei, Taiwan

ISBN: (print) 9789819761241; 9789819761258

This study presents the creation of the Synergy-General-Multimodal Pairs dataset through an innovative integration of vision language models (VLMs) and text-to-image (T2I) technologies. The code and dataset used in this research are publicly available for replication and further research. The code can be accessed at GitHub Repository and the dataset at Dataset Link. We developed a cyclical generation process that begins with generating initial narratives using either VLMs or large language models (LLMs), which are then visualized by a T2I model. This initiates a feedback loop where each generated image inspires a new narrative, creating a rich sequence of text-image pairs. This iterative approach enhances the diversity and complexity of the dataset, fostering advancements in multimodal research by providing a voluminous and varied resource. Key experimental results show significant improvements: the mean BERTScore increased by 15% (from 0.54 to 0.625), BLEU score by 20% (from 0.026 to 0.032), and ROUGE-L score by 18% (from 0.20 to 0.235). These results demonstrate substantial enhancements in the multimodal model's performance. The dataset is specifically designed to support the development and fine-tuning of models for enhanced performance and generalization in tasks requiring deep multimodal understanding and generation.
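The percentage gains quoted above can be checked against the score endpoints given in the same sentence. The sketch below assumes the percentages are relative changes, (after - before) / before, which the abstract does not state explicitly; the helper name is hypothetical. Under that assumption the endpoints give roughly 15.7%, 23.1%, and 17.5%, so the quoted 15%, 20%, and 18% appear to be rounded or approximate.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Endpoints as quoted in the abstract: (before, after).
scores = {
    "BERTScore": (0.54, 0.625),
    "BLEU": (0.026, 0.032),
    "ROUGE-L": (0.20, 0.235),
}

changes = {name: relative_change_pct(b, a) for name, (b, a) in scores.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:.1f}%")
```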

Keywords: Multimodal generalization; Dataset construction; Vision language models

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

4th Annual International Conference on Multimedia Retrieval (ICMR)

Authors: Zhu, Hongyi; Huang, Jia-Hong; Rudinac, Stevan; Kanoulas, Evangelos
Affiliations: Univ Amsterdam, Amsterdam, Netherlands

ISBN: (print) 9798400706028

Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries, retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, constraining their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
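The abstract reports its improvement "in terms of recall" without giving the exact definition or cutoff. A minimal sketch, assuming the common recall@k form for queries with several relevant ground-truth images: the share of relevant images that appear among the top k retrieved results. The function name, image IDs, and cutoff below are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found in the top-k of a ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    # A set intersection counts each relevant item at most once.
    hits = set(ranked_ids[:k]) & relevant
    return len(hits) / len(relevant)


# Hypothetical ranking for one query with three relevant images.
ranked = ["img_a", "img_b", "img_c", "img_d"]
relevant = {"img_b", "img_d", "img_e"}
print(recall_at_k(ranked, relevant, k=3))
```

In a multi-turn setting such as the one described above, the same metric would be recomputed after each query rewrite to see whether the refined query surfaces more of the relevant images.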

Keywords: Interactive Image Retrieval; Query Rewriting; Vision language models; Large language models

Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis

2024 IEEE International Conference on Big Data, BigData 2024

Authors: Verma, Prateek; Van, Minh-Hao; Wu, Xintao
Affiliations: University of Arkansas, Department of Electrical Engineering and Computer Science, Fayetteville, United States

ISBN: (print) 9798350362480

Vision language models (VLMs) such as LLaVA, ChatGPT-4, and Gemini have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data, showing impressive performance on tasks such as natural image captioning, visual question answering, and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM), shows unprecedented performance at isolating objects from unforeseen images. Because medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is essential to evaluate their performance on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM quantitatively with classification, segmentation and counting tasks. We observed that ChatGPT and Gemini were impressively able to comprehend the visual features in microscopy images, while SAM was quite capable at isolating artifacts in a general sense. However, the performance was not close to that of a domain expert - the models were readily encumbered by the introduction of impurities, defects, object overlaps and diversity present in the images. © 2024 IEEE.

Keywords: biology; electron microscopy images; materials science; vision language models; zero-shot evaluation

Non-autoregressive Sequence-to-Sequence Vision-Language Models

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Shi, Kunyu; Dong, Qi; Goncalves, Luis; Tu, Zhuowen; Soatto, Stefano
Affiliations: AWS AI Labs, Seattle, WA 98101, USA

ISBN: (print) 9798350353006

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.
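The abstract says the Query-CTC loss "marginalizes over multiple inference paths". The details of Query-CTC are not given here, so the sketch below shows only the standard CTC collapse rule it presumably builds on: merge consecutive repeats, then drop blank symbols. Several different frame-level paths then map to the same output sequence, and those are the paths a CTC-style loss sums over. The blank symbol and the toy paths are hypothetical.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(path):
    """Standard CTC collapse: merge consecutive repeats, then remove blanks."""
    merged = [token for token, _ in groupby(path)]
    return [token for token in merged if token != BLANK]


# Two different decoder paths that collapse to the same output sequence.
path_1 = ["a", "a", "_", "a", "b", "_"]
path_2 = ["a", "_", "a", "b", "b", "b"]
print(ctc_collapse(path_1), ctc_collapse(path_2))
```

Because every position of a path can be predicted in parallel and collapsed afterwards, decoding does not need one sequential step per output token, which is the latency advantage the abstract describes.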

Keywords: CTC; Non-autoregressive; Vision language models
