Object detection in complex traffic scenarios is crucial for Intelligent Transportation Systems (ITS). At present, most real-time traffic object detection methods rely on YOLO-style vision-only detectors, which limits their potential for further improvement. Vision-Language Object Detection (VLOD) has recently made promising progress, yet its adoption in ITS remains limited. Previous VLOD methods use text features only in the classification task, without fully exploring their impact on the regression process for object localization. Moreover, existing multi-modal fusion approaches fail to fuse text features with multi-scale image features at the corresponding scales, which harms the representation capability of the model. In this work, we examine these limitations and introduce Zone-YOLO to raise VLOD to a new level. Specifically, we propose Scale-Aware Modal Fusion (SAMF) to fully exploit the text and image features and to fuse the multi-modal representations seamlessly at different scales with channel- and modal-wise enhancement. Moreover, we present a novel Zone Prompt learning method that introduces text features into the regression process and captures the zone-class-entity triple co-occurrence, which significantly improves the localization performance of the model. Extensive experiments show that Zone-YOLO outperforms the comparative methods by a considerable margin, achieving 55.1 AP, 72.1 AP50, and 71.2 APL on COCO. Competitive results on BDD100K and VisDrone2019 further demonstrate the superiority of Zone-YOLO for efficient traffic object detection.
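For illustration only, the sketch below shows one way a text embedding could be fused with multi-scale image features via channel-wise gating, in the spirit of the scale-aware fusion idea described above; the class name, parameter names, and gating design are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: fuse a text embedding with multi-scale image
# features via channel-wise gating (one gate per pyramid scale). All names
# and design choices here are assumptions, not the Zone-YOLO code.
import torch
import torch.nn as nn


class ScaleAwareFusion(nn.Module):
    def __init__(self, text_dim, img_channels):
        super().__init__()
        # One projection per feature-pyramid scale, so the text embedding
        # modulates channels at the matching resolution.
        self.gates = nn.ModuleList([nn.Linear(text_dim, c) for c in img_channels])

    def forward(self, text_emb, img_feats):
        fused = []
        for gate, feat in zip(self.gates, img_feats):
            # (B, C) channel weights from the text, broadcast over H x W.
            weights = torch.sigmoid(gate(text_emb))[:, :, None, None]
            fused.append(feat + feat * weights)  # residual channel-wise enhancement
        return fused


if __name__ == "__main__":
    text = torch.randn(2, 512)  # e.g. a pooled class-prompt embedding
    feats = [torch.randn(2, c, s, s) for c, s in [(256, 80), (512, 40), (1024, 20)]]
    out = ScaleAwareFusion(512, [256, 512, 1024])(text, feats)
    print([o.shape for o in out])
```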
In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive Language-Image Pretraining (CLIP), a vision-language model pretrained on an extensive dataset, achieves strong performance in image recognition. In this study, we harness the power of multimodality for image clustering, shifting from a single-modality to a multimodal framework by exploiting the describability of the CLIP image encoder. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to the image features, we effectively create a common descriptive language for each cluster. The text centroids are trained using the assignments produced by a standard clustering algorithm as pseudo-labels, so that each centroid learns a common description of its cluster. Although the method only adds text centroids and assigns the image features in the shared embedding space to them, clustering performance improves significantly over the standard clustering algorithm, especially on complex datasets. With the proposed method, the normalized mutual information score rises by 32% on the Stanford40 dataset and 64% on ImageNet-Dog compared with the k-means clustering algorithm.
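As a rough, hypothetical sketch of this procedure, the snippet below derives pseudo-labels with k-means on precomputed CLIP image features and then learns one centroid vector per cluster by maximizing similarity to the images assigned to it; the actual method derives text centroids through the CLIP text space, and the hyperparameters here are assumptions.

```python
# Hypothetical sketch: k-means pseudo-labels on CLIP image features, then
# learnable centroid vectors trained to match the images assigned to them.
# Temperature, learning rate, and step count are arbitrary assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def learn_text_centroids(image_feats, k, steps=200, temperature=0.07):
    """image_feats: (N, D) array of CLIP image embeddings."""
    pseudo = KMeans(n_clusters=k, n_init=10).fit_predict(image_feats)

    feats = F.normalize(torch.tensor(image_feats, dtype=torch.float32), dim=-1)
    labels = torch.tensor(pseudo)
    centroids = torch.nn.Parameter(0.02 * torch.randn(k, feats.shape[1]))
    opt = torch.optim.Adam([centroids], lr=1e-2)

    for _ in range(steps):
        sims = feats @ F.normalize(centroids, dim=-1).T      # (N, k) cosine similarities
        loss = F.cross_entropy(sims / temperature, labels)   # pull images toward their centroid
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        final = (feats @ F.normalize(centroids, dim=-1).T).argmax(dim=1)
    return final.numpy(), centroids.detach()


if __name__ == "__main__":
    fake = np.random.randn(500, 512).astype(np.float32)  # stand-in for CLIP features
    assignments, centroids = learn_text_centroids(fake, k=10)
    print(assignments.shape, centroids.shape)
```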
The escalating demand for artificial intelligence (AI) systems that can monitor and supervise human errors and abnormalities in healthcare presents unique challenges. Recent advances in vision-language models suggest that such monitoring AI can be realized by understanding both visual and textual concepts and their semantic correspondences. However, vision-language models have seen limited success in the medical domain: current models and learning strategies for photographic images and captions call for a web-scale corpus of image-text pairs, which is rarely feasible in medicine. To address this, we present the Medical Cross-attention Vision-Language model (Medical X-VL), which combines key components tailored to the medical domain: self-supervised unimodal models trained on medical data and a fusion encoder to bridge them, momentum distillation, sentence-wise contrastive learning for medical reports, and sentence-similarity-adjusted hard negative mining. We experimentally demonstrated that our model enables various zero-shot tasks for monitoring AI, ranging from zero-shot classification to zero-shot error correction. Our model outperformed current state-of-the-art models on two medical image datasets, suggesting a novel clinical application of monitoring AI to alleviate human errors. Our method also demonstrates a more specialized capacity for fine-grained understanding, a distinct advantage in the medical domain.
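The zero-shot classification setting mentioned above typically reduces to scoring candidate text prompts by image-text similarity; the sketch below illustrates that generic step with placeholder embeddings and labels rather than the Medical X-VL release.

```python
# Generic zero-shot classification by image-text similarity; the embeddings and
# label prompts below are placeholders, not outputs of Medical X-VL.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(image_emb, text_embs, labels):
    """image_emb: (D,), text_embs: (num_labels, D) from a paired text encoder."""
    sims = F.normalize(text_embs, dim=-1) @ F.normalize(image_emb, dim=0)
    probs = sims.softmax(dim=0)
    return labels[int(probs.argmax())], probs


if __name__ == "__main__":
    labels = ["no finding", "cardiomegaly", "pleural effusion"]
    image_emb = torch.randn(256)               # stand-in for an encoded image
    text_embs = torch.randn(len(labels), 256)  # stand-in for encoded label sentences
    print(zero_shot_classify(image_emb, text_embs, labels)[0])
```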
In recent decades, the growing deployment of Closed-Circuit Television (CCTV) systems for crime prevention and facility security has accelerated the importance of intelligent surveillance technologies. Among the primary challenges in this field are varying viewpoints and adverse weather conditions, which significantly compromise the accuracy of human tracking and anomaly detection. Moreover, conventional surveillance systems often focus only on specific events within limited scenarios, which restricts their applicability. Existing deep learning approaches also adapt poorly to environmental variations, mainly because of the high maintenance costs involved in data collection. To address these challenges, we present a comprehensive surveillance system that uses deep learning to enhance human tracking and anomaly detection across diverse environments. Our approach includes novel object filtering algorithms that decrease false positive rates and improve tracking precision. Additionally, our system can monitor multiple types of abnormal events, such as intrusion, loitering, abandonment, and arson. We further introduce a prompt-based recognition mechanism that enables active user participation in identifying abnormal scenes. Extensive evaluations on the Korea Internet & Security Agency CCTV datasets demonstrate significant performance gains from our system, particularly under challenging weather conditions. Moreover, our system achieved competitive accuracy on the ABODA and FireNet datasets even without additional training. This research establishes a new baseline for practical surveillance solutions focused on comprehensive monitoring across various abnormal scenarios.
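As an illustration of what an object-filtering step like the one mentioned above might look like, the sketch below drops detections that are low-confidence, implausibly small, or outside a region of interest; the thresholds and data structures are purely hypothetical, not the paper's algorithm.

```python
# Hypothetical object-filtering step for a tracking pipeline: keep detections
# that are confident, large enough, and centered inside a region of interest.
from dataclasses import dataclass


@dataclass
class Detection:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    label: str


def filter_detections(detections, roi, min_score=0.4, min_area=400.0):
    """roi = (x1, y1, x2, y2) region in which events should be monitored."""
    kept = []
    for d in detections:
        area = max(0.0, d.x2 - d.x1) * max(0.0, d.y2 - d.y1)
        cx, cy = (d.x1 + d.x2) / 2.0, (d.y1 + d.y2) / 2.0
        inside = roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]
        if d.score >= min_score and area >= min_area and inside:
            kept.append(d)
    return kept


if __name__ == "__main__":
    dets = [
        Detection(10, 10, 60, 120, 0.9, "person"),
        Detection(500, 5, 510, 15, 0.8, "person"),  # tiny box, likely noise
    ]
    print(len(filter_detections(dets, roi=(0, 0, 640, 480))))  # -> 1
```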
People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to their vision loss, pBLV have difficulty accessing and identifying potential tripping hazards independently. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they require constant training and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore frame our research question as: How can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Motivated by the prevalence of large pretrained foundation models in assistive robotics applications, owing to the accurate perception and robust contextual understanding that extensive pretraining provides in real-world scenarios, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method first applies a large image tagging model (i.e., the Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and the user query are then integrated into a prompt tailored specifically for pBLV using prompt engineering. By combining the prompt and the input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the surrounding environment.
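The prompt-assembly step in this pipeline can be illustrated with a short sketch: combine the tags returned by the tagging model with the user's query into a single instruction for the vision-language model. The template wording and function name below are assumptions, not the paper's exact prompt.

```python
# Sketch of the prompt-assembly step: merge image tags (e.g. from RAM) with the
# user's question into one instruction for the vision-language model. The
# template text and function name are assumptions, not the paper's prompt.
def build_pblv_prompt(tags, user_query):
    tag_str = ", ".join(sorted(set(tags)))
    return (
        "You are assisting a person with blindness or low vision.\n"
        f"Objects detected in the scene: {tag_str}.\n"
        f"User question: {user_query}\n"
        "Describe the scene in detail and warn about any tripping hazards."
    )


if __name__ == "__main__":
    prompt = build_pblv_prompt(
        ["chair", "cable", "doorway"], "Is it safe to walk straight ahead?"
    )
    print(prompt)  # this text, plus the image, would be passed to InstructBLIP
```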
ISBN (Print): 9798400716256
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions remains a challenging problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not been explored yet, even though text and pose attributes have independently proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented vision-language model (VLM) for VAR. Notably, our scheme achieves accuracies of 92.81% and 73.02% on two popular human video action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even without any video-data pre-training, and accuracies of 96.11% and 75.75% after Kinetics pre-training.
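As a schematic and deliberately simplified sketch of combining pose, RGB, and text features for action recognition, the snippet below performs late fusion by concatenating clip-level features from the three modalities before a classifier head; the feature dimensions and the fusion design are assumptions and do not reproduce the paper's architecture.

```python
# Simplified late-fusion sketch: concatenate pose, RGB, and text features and
# classify. Feature dimensions and the fusion design are assumptions.
import torch
import torch.nn as nn


class TriModalFusionHead(nn.Module):
    def __init__(self, rgb_dim=768, pose_dim=256, text_dim=512, num_classes=101):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(rgb_dim + pose_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb_feat, pose_feat, text_feat):
        # Late fusion: concatenate clip-level features from the three modalities.
        return self.classifier(torch.cat([rgb_feat, pose_feat, text_feat], dim=-1))


if __name__ == "__main__":
    head = TriModalFusionHead()
    logits = head(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 512))
    print(logits.shape)  # (4, 101), e.g. the UCF-101 classes
```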
Zero-shot remote sensing scene classification aims to solve the scene classification problem on unseen categories and has attracted numerous research attention in the remote sensing field. Existing methods mostly use ...