Object detection in complex traffic scenarios is crucial for Intelligent Transportation Systems (ITS). At present, most real-time traffic object detection methods rely on YOLO-style vision-only detectors, which limits their potential for further improvement. Vision-Language Object Detection (VLOD) has recently made promising progress, yet its adoption in ITS remains limited. Previous VLOD methods use text features only in the classification task, without fully exploring their impact on the regression process for object localization. Moreover, existing multi-modal fusion approaches fail to fuse text features with multi-scale image features at the corresponding scales, which harms the representation capability of the model. In this work, we examine these limitations and introduce Zone-YOLO to raise VLOD to a new level. Specifically, we propose Scale-Aware Modal Fusion (SAMF) to fully exploit the text and image features and to fuse the multi-modal representations seamlessly at different scales with channel- and modal-wise enhancement. Moreover, we present a novel Zone Prompt learning method that introduces text features into the regression process and captures the zone-class-entity triple co-occurrence, which significantly improves the localization performance of the model. Extensive experiments show that Zone-YOLO outperforms the comparative methods by a considerable margin, achieving 55.1 AP, 72.1 AP50, and 71.2 APL on COCO. Competitive results on BDD100K and VisDrone2019 further demonstrate the superiority of Zone-YOLO for efficient traffic object detection.
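For illustration only, the sketch below shows one way a text embedding could be fused with multi-scale image features via channel-wise gating, in the spirit of the scale-aware fusion idea described above; the class name, parameter names, and gating design are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: fuse a text embedding with multi-scale image
# features via channel-wise gating (one gate per pyramid scale). All names
# and design choices here are assumptions, not the Zone-YOLO code.
import torch
import torch.nn as nn


class ScaleAwareFusion(nn.Module):
    def __init__(self, text_dim, img_channels):
        super().__init__()
        # One projection per feature-pyramid scale, so the text embedding
        # modulates channels at the matching resolution.
        self.gates = nn.ModuleList([nn.Linear(text_dim, c) for c in img_channels])

    def forward(self, text_emb, img_feats):
        fused = []
        for gate, feat in zip(self.gates, img_feats):
            # (B, C) channel weights from the text, broadcast over H x W.
            weights = torch.sigmoid(gate(text_emb))[:, :, None, None]
            fused.append(feat + feat * weights)  # residual channel-wise enhancement
        return fused


if __name__ == "__main__":
    text = torch.randn(2, 512)  # e.g. a pooled class-prompt embedding
    feats = [torch.randn(2, c, s, s) for c, s in [(256, 80), (512, 40), (1024, 20)]]
    out = ScaleAwareFusion(512, [256, 512, 1024])(text, feats)
    print([o.shape for o in out])
```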
In recent years, deep neural networks pretrained on large-scale datasets have been used to address data deficiency and achieve better performance through prior knowledge. Contrastive Language-Image Pretraining (CLIP), a vision-language model pretrained on an extensive dataset, achieves strong performance in image recognition. In this study, we harness the power of multimodality for image clustering, shifting from a single-modality to a multimodal framework by exploiting the describability of the CLIP image encoder. The importance of this shift lies in the ability of multimodality to provide richer feature representations. By generating text centroids corresponding to the image features, we effectively create a common descriptive language for each cluster. The text centroids are trained using the assignments produced by a standard clustering algorithm as pseudo-labels, so that each centroid learns a common description of its cluster. Although the method only adds text centroids and assigns the image features in the shared embedding space to them, clustering performance improves significantly over the standard clustering algorithm, especially on complex datasets. With the proposed method, the normalized mutual information score rises by 32% on the Stanford40 dataset and 64% on ImageNet-Dog compared with the k-means clustering algorithm.
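As a rough, hypothetical sketch of this procedure, the snippet below derives pseudo-labels with k-means on precomputed CLIP image features and then learns one centroid vector per cluster by maximizing similarity to the images assigned to it; the actual method derives text centroids through the CLIP text space, and the hyperparameters here are assumptions.

```python
# Hypothetical sketch: k-means pseudo-labels on CLIP image features, then
# learnable centroid vectors trained to match the images assigned to them.
# Temperature, learning rate, and step count are arbitrary assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def learn_text_centroids(image_feats, k, steps=200, temperature=0.07):
    """image_feats: (N, D) array of CLIP image embeddings."""
    pseudo = KMeans(n_clusters=k, n_init=10).fit_predict(image_feats)

    feats = F.normalize(torch.tensor(image_feats, dtype=torch.float32), dim=-1)
    labels = torch.tensor(pseudo)
    centroids = torch.nn.Parameter(0.02 * torch.randn(k, feats.shape[1]))
    opt = torch.optim.Adam([centroids], lr=1e-2)

    for _ in range(steps):
        sims = feats @ F.normalize(centroids, dim=-1).T      # (N, k) cosine similarities
        loss = F.cross_entropy(sims / temperature, labels)   # pull images toward their centroid
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        final = (feats @ F.normalize(centroids, dim=-1).T).argmax(dim=1)
    return final.numpy(), centroids.detach()


if __name__ == "__main__":
    fake = np.random.randn(500, 512).astype(np.float32)  # stand-in for CLIP features
    assignments, centroids = learn_text_centroids(fake, k=10)
    print(assignments.shape, centroids.shape)
```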
The escalating demand for artificial intelligence (AI) systems that can monitor and supervise human errors and abnormalities in healthcare presents unique challenges. Recent advances in vision-language models suggest that such monitoring AI can be realized by understanding both visual and textual concepts and their semantic correspondences. However, vision-language models have seen limited success in the medical domain: current models and learning strategies for photographic images and captions call for a web-scale corpus of image-text pairs, which is rarely feasible in medicine. To address this, we present the Medical Cross-attention Vision-Language model (Medical X-VL), which combines key components tailored to the medical domain: self-supervised unimodal models trained on medical data and a fusion encoder to bridge them, momentum distillation, sentence-wise contrastive learning for medical reports, and sentence-similarity-adjusted hard negative mining. We experimentally demonstrated that our model enables various zero-shot tasks for monitoring AI, ranging from zero-shot classification to zero-shot error correction. Our model outperformed current state-of-the-art models on two medical image datasets, suggesting a novel clinical application of monitoring AI to alleviate human errors. Our method also demonstrates a more specialized capacity for fine-grained understanding, a distinct advantage in the medical domain.
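The zero-shot classification setting mentioned above typically reduces to scoring candidate text prompts by image-text similarity; the sketch below illustrates that generic step with placeholder embeddings and labels rather than the Medical X-VL release.

```python
# Generic zero-shot classification by image-text similarity; the embeddings and
# label prompts below are placeholders, not outputs of Medical X-VL.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(image_emb, text_embs, labels):
    """image_emb: (D,), text_embs: (num_labels, D) from a paired text encoder."""
    sims = F.normalize(text_embs, dim=-1) @ F.normalize(image_emb, dim=0)
    probs = sims.softmax(dim=0)
    return labels[int(probs.argmax())], probs


if __name__ == "__main__":
    labels = ["no finding", "cardiomegaly", "pleural effusion"]
    image_emb = torch.randn(256)               # stand-in for an encoded image
    text_embs = torch.randn(len(labels), 256)  # stand-in for encoded label sentences
    print(zero_shot_classify(image_emb, text_embs, labels)[0])
```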
In recent decades, the growing deployment of Closed-Circuit Television (CCTV) systems for crime prevention and facility security has accelerated the importance of intelligent surveillance technologies. Among the primary challenges in this field are varying viewpoints and adverse weather conditions, which significantly compromise the accuracy of human tracking and anomaly detection. Moreover, conventional surveillance systems often focus only on specific events within limited scenarios, which restricts their applicability. Existing deep learning approaches also adapt poorly to environmental variations, mainly because of the high maintenance costs involved in data collection. To address these challenges, we present a comprehensive surveillance system that uses deep learning to enhance human tracking and anomaly detection across diverse environments. Our approach includes novel object filtering algorithms that decrease false positive rates and improve tracking precision. Additionally, our system can monitor multiple types of abnormal events, such as intrusion, loitering, abandonment, and arson. We further introduce a prompt-based recognition mechanism that enables active user participation in identifying abnormal scenes. Extensive evaluations on the Korea Internet & Security Agency CCTV datasets demonstrate significant performance gains from our system, particularly under challenging weather conditions. Moreover, our system achieved competitive accuracy on the ABODA and FireNet datasets even without additional training. This research establishes a new baseline for practical surveillance solutions focused on comprehensive monitoring across various abnormal scenarios.
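As an illustration of what an object-filtering step like the one mentioned above might look like, the sketch below drops detections that are low-confidence, implausibly small, or outside a region of interest; the thresholds and data structures are purely hypothetical, not the paper's algorithm.

```python
# Hypothetical object-filtering step for a tracking pipeline: keep detections
# that are confident, large enough, and centered inside a region of interest.
from dataclasses import dataclass


@dataclass
class Detection:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    label: str


def filter_detections(detections, roi, min_score=0.4, min_area=400.0):
    """roi = (x1, y1, x2, y2) region in which events should be monitored."""
    kept = []
    for d in detections:
        area = max(0.0, d.x2 - d.x1) * max(0.0, d.y2 - d.y1)
        cx, cy = (d.x1 + d.x2) / 2.0, (d.y1 + d.y2) / 2.0
        inside = roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]
        if d.score >= min_score and area >= min_area and inside:
            kept.append(d)
    return kept


if __name__ == "__main__":
    dets = [
        Detection(10, 10, 60, 120, 0.9, "person"),
        Detection(500, 5, 510, 15, 0.8, "person"),  # tiny box, likely noise
    ]
    print(len(filter_detections(dets, roi=(0, 0, 640, 480))))  # -> 1
```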
People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to their vision loss, pBLV have difficulty accessing and identifying potential tripping hazards independently. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they require constant training and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore frame our research question as: How can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Motivated by the prevalence of large pretrained foundation models in assistive robotics applications, owing to the accurate perception and robust contextual understanding that extensive pretraining provides in real-world scenarios, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method first applies a large image tagging model (i.e., the Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and the user query are then integrated into a prompt tailored specifically for pBLV using prompt engineering. By combining the prompt and the input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the surrounding environment.
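The prompt-assembly step in this pipeline can be illustrated with a short sketch: combine the tags returned by the tagging model with the user's query into a single instruction for the vision-language model. The template wording and function name below are assumptions, not the paper's exact prompt.

```python
# Sketch of the prompt-assembly step: merge image tags (e.g. from RAM) with the
# user's question into one instruction for the vision-language model. The
# template text and function name are assumptions, not the paper's prompt.
def build_pblv_prompt(tags, user_query):
    tag_str = ", ".join(sorted(set(tags)))
    return (
        "You are assisting a person with blindness or low vision.\n"
        f"Objects detected in the scene: {tag_str}.\n"
        f"User question: {user_query}\n"
        "Describe the scene in detail and warn about any tripping hazards."
    )


if __name__ == "__main__":
    prompt = build_pblv_prompt(
        ["chair", "cable", "doorway"], "Is it safe to walk straight ahead?"
    )
    print(prompt)  # this text, plus the image, would be passed to InstructBLIP
```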
ISBN (Print): 9798400716256
Video Action Recognition (VAR) is a challenging task due to its inherent complexities. Though different approaches have been explored in the literature, designing a unified framework to recognize a large number of human actions remains a challenging problem. Recently, Multi-Modal Learning (MML) has demonstrated promising results in this domain. In the literature, the 2D skeleton or pose modality has often been used for this task, either independently or in conjunction with the visual information (RGB modality) present in videos. However, the combination of pose, visual information, and text attributes has not been explored yet, even though text and pose attributes have independently proven effective in numerous computer vision tasks. In this paper, we present the first pose-augmented vision-language model (VLM) for VAR. Notably, our scheme achieves accuracies of 92.81% and 73.02% on two popular human video action recognition benchmark datasets, UCF-101 and HMDB-51, respectively, even without any video-data pre-training, and accuracies of 96.11% and 75.75% after Kinetics pre-training.
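As a schematic and deliberately simplified sketch of combining pose, RGB, and text features for action recognition, the snippet below performs late fusion by concatenating clip-level features from the three modalities before a classifier head; the feature dimensions and the fusion design are assumptions and do not reproduce the paper's architecture.

```python
# Simplified late-fusion sketch: concatenate pose, RGB, and text features and
# classify. Feature dimensions and the fusion design are assumptions.
import torch
import torch.nn as nn


class TriModalFusionHead(nn.Module):
    def __init__(self, rgb_dim=768, pose_dim=256, text_dim=512, num_classes=101):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(rgb_dim + pose_dim + text_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb_feat, pose_feat, text_feat):
        # Late fusion: concatenate clip-level features from the three modalities.
        return self.classifier(torch.cat([rgb_feat, pose_feat, text_feat], dim=-1))


if __name__ == "__main__":
    head = TriModalFusionHead()
    logits = head(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 512))
    print(logits.shape)  # (4, 101), e.g. the UCF-101 classes
```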
Zero-shot remote sensing scene classification aims to solve the scene classification problem on unseen categories and has attracted numerous research attention in the remote sensing field. Existing methods mostly use ...