ISBN:
(Print) 9798350377712; 9798350377705
Spatial cognition refers to the ability to acquire knowledge about one's surroundings and to use this information to identify one's location, acquire resources, and navigate back to familiar places. People with blindness and low vision (pBLV) face significant challenges with spatial cognition because it relies heavily on visual input. Without the full range of visual cues, pBLV individuals often find it difficult to form a comprehensive understanding of their environment, leading to obstacles in scene recognition and precise object localization, especially in unfamiliar environments. This limitation extends to their ability to independently detect and avoid potential tripping hazards, making navigation and interaction with their environment more challenging. In this paper, we present a pioneering wearable platform tailored to enhance the spatial cognition of pBLV through the integration of multi-modal foundation models. The proposed platform integrates a wearable camera with an audio module and leverages the advanced capabilities of vision-language foundation models (i.e., GPT-4 and GPT-4V) for the nuanced processing of visual and textual data. Specifically, we employ vision-language models to bridge the gap between visual information and the proprioception of visually impaired users, offering more intelligible guidance by aligning visual data with the natural perception of space and movement. We then apply prompt engineering to guide the large language model to act as an assistant tailored specifically to pBLV users and produce accurate answers. Another innovation in our model is the incorporation of a chain-of-thought reasoning process, which enhances the accuracy and interpretability of the model, facilitating more precise responses to complex user inquiries across diverse environmental contexts. To assess the practical impact of our proposed wearable platform, we carried out a series of real-world experiments across three tasks that are commonly
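The prompt-engineering and chain-of-thought steps described above could be sketched as follows. This is a minimal hypothetical illustration, not the authors' actual prompts: the function name `build_pblv_prompt` and all prompt wording are assumptions introduced here.

```python
def build_pblv_prompt(user_question: str, scene_description: str) -> list:
    """Assemble chat messages that steer a vision-language model to act as a
    pBLV assistant, with an explicit chain-of-thought instruction."""
    system = (
        "You are an assistant for blind and low-vision users. "
        "Describe spatial layout relative to the user's body "
        "(e.g., 'two steps ahead, slightly to your left'). "
        "Reason step by step: first list detected objects, then their "
        "positions, then any hazards, and finally give concise guidance."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user",
         "content": f"Scene: {scene_description}\nQuestion: {user_question}"},
    ]
```

The message list would then be passed to a chat-style vision-language API together with the camera frame; the step-by-step instruction is what makes the intermediate reasoning inspectable.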
ISBN:
(Print) 9781728198354
Machine drawing has recently become a hot research topic in the computer vision and robotics domains. However, decomposing a given target image from raster space into an ordered stroke sequence and reconstructing those strokes is a challenging task. In this work, we focus on the drawing task for images in various styles, where the distribution of stroke parameters differs. We propose a multi-stage, environment-model-based reinforcement learning (RL) drawing framework with a fine-grained perceptual reward that guides the agent to accurately draw both the details and the overall outline of the target image. The experiments show that the visual quality of our method slightly outperforms the SOTA method in the nature and doodle styles, while it outperforms the SOTA approaches by a large margin, with high efficiency, in the sketch style.
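One simple way to make a drawing reward "fine-grained" is to score each stroke by how much it reduces a per-patch distance to the target, rather than a single whole-image distance. The sketch below is a hypothetical stand-in for the paper's perceptual reward (the abstract does not specify the exact formulation); the function name and the plain L2-on-patches metric are assumptions.

```python
import numpy as np

def fine_grained_reward(canvas_before, canvas_after, target, patch=8):
    """Reward = mean per-patch decrease in squared L2 distance to the target.
    Arrays are (H, W, C) with H and W divisible by `patch`."""
    def patch_l2(a, b):
        h, w = a.shape[:2]
        d = (a - b) ** 2
        # split H and W into (n_patches, patch) blocks, sum inside each block
        return d.reshape(h // patch, patch, w // patch, patch, -1).sum(axis=(1, 3, 4))
    before = patch_l2(canvas_before, target)
    after = patch_l2(canvas_after, target)
    return (before - after).mean()
```

A stroke that improves any local patch earns reward even if the global error barely moves, which is the intuition behind rewarding both details and the overall outline.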
The classification of temporal signals plays a significant role in deep learning tasks. However, it poses unique challenges due to the need for specialized architectures that can effectively capture the temporal depen...
ISBN:
(Print) 9781728198354
Two obstacles, the scarcity of annotated samples and the difficulty of preserving multi-scale hierarchical representations, hinder the advancement of vision Transformer-based aerial object detection. The emergence of self-supervised learning has inspired some solutions to the first issue. However, most of them focus on single-scale features, which conflicts with solving the second issue. To bridge this gap, this paper proposes a novel pyramid masked image modeling (MIM) framework, termed PyraMIM, for self-supervised pretraining in aerial scenarios. Without manual annotation, PyraMIM establishes pyramid representations during pretraining, which can be seamlessly adapted to downstream aerial object detection for improved performance. Experimental results demonstrate the effectiveness and superiority of our method.
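A key detail in any pyramid MIM scheme is keeping the masked regions spatially consistent across scales, so that every pyramid level hides the same image content. The abstract does not describe PyraMIM's masking procedure, so the following is only an assumed sketch: sample a random mask at the coarsest resolution and upsample it to the finer levels.

```python
import numpy as np

def pyramid_masks(base_hw=(7, 7), levels=3, ratio=0.6, seed=0):
    """Boolean masks, coarsest to finest, aligned across pyramid levels."""
    rng = np.random.default_rng(seed)
    h, w = base_hw
    coarse = rng.random((h, w)) < ratio  # mask sampled at the coarsest level
    masks = [coarse]
    for _ in range(levels - 1):
        coarse = coarse.repeat(2, axis=0).repeat(2, axis=1)  # 2x upsample
        masks.append(coarse)
    return masks
```

Because each finer mask is an exact upsampling of the coarser one, the reconstruction targets at all levels refer to the same hidden regions, which is what lets a single pretext task supervise a whole feature pyramid.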
Similar to the human multi-sensory perception system, robots can also benefit from cross-modal learning. The connection between visual input and tactile perception is potentially important for automated operations...
In precision agriculture, it is crucial to have a reliable system for identifying diseases and suggesting measures to maintain crop health and enhance yield performance. Addressing the persistent challenge of accurate...
ISBN:
(Print) 9798350349405; 9798350349399
Blood cell detection is a typical small-scale object detection problem in computer vision. In this paper, we propose a CST-YOLO model for blood cell detection based on the YOLOv7 architecture and enhance it with the CNN-Swin Transformer (CST), a new attempt at CNN-Transformer fusion. We also introduce three other useful modules into our CST-YOLO to improve small-scale object detection precision: Weighted Efficient Layer Aggregation Networks (W-ELAN), Multiscale Channel Split (MCS), and Concatenate Convolutional Layers (CatConv). Experimental results show that the proposed CST-YOLO achieves 92.7%, 95.6%, and 91.1% mAP@0.5 on three blood cell datasets, respectively, outperforming state-of-the-art object detectors, e.g., RT-DETR, YOLOv5, and YOLOv7. Our code is available at https://***/mkang315/CST-YOLO.
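The Multiscale Channel Split idea can be illustrated schematically: split the channel dimension into groups, process each group at a different receptive-field size, and concatenate. The abstract gives no implementation details, so everything below is an assumed toy version that uses box filters of sizes 1, 3, and 5 as stand-ins for the module's actual per-branch convolutions.

```python
import numpy as np

def multiscale_channel_split(x, n_groups=3):
    """x: (C, H, W). Split channels into groups, smooth each group at a
    different scale (box filters as conv stand-ins), then re-concatenate."""
    groups = np.array_split(x, n_groups, axis=0)
    out = []
    for k, g in zip((1, 3, 5), groups):
        if k == 1:
            out.append(g.astype(float))  # identity branch
            continue
        pad = k // 2
        p = np.pad(g, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
        s = np.zeros_like(g, dtype=float)
        for dy in range(k):          # accumulate the k x k box filter
            for dx in range(k):
                s += p[:, dy:dy + g.shape[1], dx:dx + g.shape[2]]
        out.append(s / (k * k))
    return np.concatenate(out, axis=0)
```

The output keeps the input's channel count, so the module can drop into a backbone without changing downstream layer shapes.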
Federated Learning (FL) is a pivotal new paradigm for decentralized training on heterogeneous data. Recently, fine-tuning of Vision-Language Models (VLMs) has been extended to the federated setting to improve overall p...
ISBN:
(Print) 9798350368185; 9798350368178
Known image processing software systems use filtering methods such as the Gaussian filter, the median filter, and others, which often do not perform satisfactorily on certain types of noise. This leads to a partial loss of the useful signal and a deterioration in image quality. This work is devoted to improving the quality of image filtering under various kinds of noise. A model of the additive interaction of the signal with impulse noise is proposed. A new method of least finite differences (MLFD) for image filtering has been developed, and an interactive web service implementing MLFD was created. Various filtering methods have been proposed to reduce the effect of impulse noise on images; the proposed processing relies on a priori information about the type of noise. The web service is a flexible and efficient solution for filtering digital images that is not available in well-known graphics packages, and it offers a good compromise between filtering quality and simplicity compared to complex and resource-intensive systems for graphic image processing.
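The abstract does not spell out the MLFD algorithm, but its name suggests choosing pixel values that minimize local finite differences. As a rough, assumed stand-in for that idea, the sketch below replaces each interior pixel with whichever candidate (the pixel itself or one of its four neighbours) minimizes the sum of absolute finite differences to the neighbourhood, which suppresses isolated impulse spikes.

```python
import numpy as np

def least_finite_difference_filter(img):
    """Toy impulse-noise filter in the spirit of 'least finite differences'.
    img: 2-D grayscale array; borders are left untouched."""
    out = img.astype(float).copy()
    h, w = img.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nbrs = [img[y - 1, x], img[y + 1, x], img[y, x - 1], img[y, x + 1]]
            candidates = [img[y, x]] + nbrs
            # pick the candidate whose finite differences to the
            # neighbourhood are smallest
            out[y, x] = min(candidates,
                            key=lambda v: sum(abs(float(v) - float(n)) for n in nbrs))
    return out
```

An impulse pixel has large differences to all its neighbours, so it loses to any neighbour value and is replaced, while pixels already consistent with their surroundings are kept.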
ISBN:
(Print) 9798350349405; 9798350349399
Text detection is a fundamental task in computer vision that involves identifying and locating text within images or videos. It has been the subject of extensive research, with numerous approaches primarily tailored for open-scene text, but there are limited studies dedicated to practical industries such as e-commerce. E-commerce images are designed to capture human attention, and effective text detection can amplify this marketing strategy. Yet identifying text in e-commerce images poses particular challenges due to their distinct visual attributes, which set them apart from open-scene images. Therefore, this paper aims to address this gap by exploring how human attention can aid text detection in e-commerce images. The proposed model merges high-level text features with low-level and saliency features and exploits both local and semantic characteristics of image regions. Leveraging visual cues, the low-level and saliency features aid in predicting the saliency map, which is then employed to guide text detection. The proposed method achieves better localization of text, outperforming current state-of-the-art models on the benchmark e-commerce SalECI dataset. The code for this study is available at https://***/bebbieyin/SalientTextDet.
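One simple way a predicted saliency map can "guide" text detection is to modulate the text-region score map by normalized saliency. The fusion rule below is a hypothetical sketch, not the paper's actual mechanism; the function name and the blending parameter `alpha` are assumptions.

```python
import numpy as np

def saliency_weighted_text_score(text_score, saliency, alpha=0.5):
    """Modulate per-pixel text scores by visual saliency.
    alpha=0 ignores saliency; alpha=1 weights scores fully by saliency."""
    saliency = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-8)
    return text_score * (1 - alpha + alpha * saliency)
```

Text in salient regions keeps its full score while text in low-saliency background is attenuated, which matches the intuition that e-commerce images place the marketing-relevant text where attention is drawn.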