ISBN (print): 9781665405409
Multi-modal approaches to human activity recognition (HAR) have recently been shown to improve recognition accuracy. However, the restricted computational resources of wearable devices such as smartwatches cannot directly support such advanced methods. To tackle this issue, this study introduces an end-to-end Vision-to-Sensor Knowledge Distillation (VSKD) framework. In this framework, only time-series data, i.e., accelerometer data, is needed from wearable devices during the testing phase. The framework therefore not only reduces the computational demands on wearable devices, but also produces a learning model that closely matches the performance of the computationally expensive multi-modal approach. In order to retain local temporal relationships and facilitate visual deep learning models, we first convert the time-series data to two-dimensional images using a Gramian Angular Field (GAF) based encoding method. We adopt a multi-scale TRN with BN-Inception as the teacher network and ResNet18 as the student network. A novel loss function, named Distance and Angle-wise Semantic Knowledge loss (DASK), is proposed to mitigate the modality variation between the vision and sensor domains. Extensive experimental results on the UTD-MHAD, MMAct, and Berkeley-MHAD datasets demonstrate the competitiveness of the proposed VSKD model, which can be deployed on wearable devices.
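As a rough illustration of the GAF encoding step mentioned above, the sketch below converts a single accelerometer axis into a Gramian Angular Summation Field image. The rescaling range, signal length, and choice of the summation (rather than difference) variant are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Encode a 1-D time series as a Gramian Angular Summation Field image."""
    # Rescale the series to [-1, 1] so arccos is well defined.
    x_min, x_max = x.min(), x.max()
    x_scaled = 2.0 * (x - x_min) / (x_max - x_min + 1e-8) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)

    # Polar encoding: each sample becomes an angle.
    phi = np.arccos(x_scaled)

    # GASF entry (i, j) = cos(phi_i + phi_j), expanded with the identity
    # cos(a + b) = cos(a)cos(b) - sin(a)sin(b) to avoid an explicit double loop.
    cos_phi, sin_phi = np.cos(phi), np.sin(phi)
    return np.outer(cos_phi, cos_phi) - np.outer(sin_phi, sin_phi)

# Example: a 128-sample accelerometer axis becomes a 128x128 image
# that a visual backbone (teacher or student) can consume.
signal = np.sin(np.linspace(0, 4 * np.pi, 128))
gaf_image = gramian_angular_field(signal)
print(gaf_image.shape)  # (128, 128)
```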
ISBN (print): 9781665464680
To address the slow training and poor generalization of deep reinforcement learning models in path planning, this paper proposes a multi-controller model that combines Dueling DQN with fuzzy control (DDFC). At the beginning of training, fuzzy control provides a large number of positive samples for the Dueling DQN model, improving training efficiency while ensuring that the mobile robot already has a certain obstacle-avoidance ability in the early stage. A negative-feedback shaping reward function and a corresponding state space are designed to alleviate the sparse-reward problem. Because the membership function of traditional fuzzy control cannot handle the different situations that arise while the robot is moving, an improved membership function is designed that adapts as the situation changes. Simulation results show that the improved model enables the mobile robot to avoid obstacles effectively and speeds up convergence. It also performs well in different scenes, improving the generalization ability of the model.
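To make the multi-controller idea concrete, the sketch below pairs a standard Dueling DQN head with a switch that hands action selection to a rule-based fuzzy controller during an early warm-up phase, which is how such a model can collect positive samples before the learned policy takes over. The names `fuzzy_controller` and `warmup_steps` and the network sizes are hypothetical placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # state value V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # advantages A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.backbone(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        return value + advantage - advantage.mean(dim=-1, keepdim=True)

def select_action(step, warmup_steps, fuzzy_controller, q_net, state):
    """Multi-controller switch: fuzzy rules drive the robot early on,
    supplying positive samples; afterwards the Dueling DQN policy takes over."""
    if step < warmup_steps:
        return fuzzy_controller(state)   # rule-based obstacle avoidance
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax())
```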
Image captioning is the process of generating a descriptive sentence for a given image in human-understandable language; such a sentence is known as the caption of the image. The automatic image caption generated is a re...
Anomalous sound detection (ASD) encounters difficulties with domain shift, where the sounds of machines in target domains differ significantly from those in source domains due to varying operating conditions. Existing...
Interest in autonomous robots is growing due to their diverse uses. Autonomous robots are equipped with various sensors for stable operation. As the sensor data increases, the system for sensor signal processing an...
Although pre-processing the raw point cloud is important, limited research has been conducted on learning-based approaches to point cloud upsampling. The PU-EdgeFormer [1] model stands out for its exceptional performanc...
ISBN (print): 9781665405409
Inspired by the remarkable zero-shot generalization capacity of vision-language pre-trained models, we seek to leverage the supervision from the CLIP model to alleviate the burden of data labeling. However, such supervision inevitably contains label noise, which significantly degrades the discriminative power of the classification model. In this work, we propose Transductive CLIP, a novel framework for learning a classification network with noisy labels from scratch. Firstly, a class-conditional contrastive learning mechanism is proposed to mitigate the reliance on pseudo labels and boost tolerance to noisy labels. Secondly, ensemble labels are adopted as a pseudo-label updating strategy to stabilize the training of deep neural networks with noisy labels. By combining both techniques, the framework effectively reduces the impact of noisy labels from the CLIP model. Experiments on multiple benchmark datasets demonstrate substantial improvements over other state-of-the-art methods.
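One common way to realize the ensemble-label idea above is to keep a running, temporally ensembled soft-label estimate per sample and refresh it with each new batch of model predictions. The sketch below shows that generic pattern; the momentum value and function name are assumptions, not the paper's exact update rule.

```python
import torch

def update_ensemble_labels(ensemble_probs, batch_logits, indices, momentum=0.9):
    """Temporally ensemble per-sample class probabilities as pseudo labels.

    ensemble_probs: (N, C) running soft labels for the whole dataset
    batch_logits:   (B, C) current model outputs for a mini-batch
    indices:        (B,)   dataset indices of the mini-batch samples
    """
    with torch.no_grad():
        batch_probs = torch.softmax(batch_logits, dim=1)
        # Exponential moving average smooths out noisy single-step predictions.
        ensemble_probs[indices] = (
            momentum * ensemble_probs[indices] + (1.0 - momentum) * batch_probs
        )
    # Hard pseudo labels for the next training step.
    return ensemble_probs[indices].argmax(dim=1)
```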
ISBN (print): 9781728198354
Image semantic segmentation is a dense prediction task in computer vision that has been dominated by deep learning techniques in recent years. UNet, a symmetric encoder-decoder end-to-end Convolutional Neural Network (CNN) with skip connections, has shown promising performance. Aiming to process multiscale feature information efficiently, we propose a new Densely Connected Swin-UNet (DCS-UNet) with multiscale information aggregation for medical image segmentation. Firstly, inspired by the Swin Transformer, which models long-range dependencies via shifted-window self-attention, this work adopts fully ViT-based network blocks with a shifted-window approach, resulting in a purely self-attention-based U-shaped segmentation network. The relevant layers, including feature sampling and image tokenization, are re-designed in the ViT fashion. Secondly, a full-scale deep supervision scheme is developed to process the aggregated feature maps of various resolutions generated by different decoder levels. Thirdly, dense skip connections are proposed that allow semantic feature information to be thoroughly transferred from different encoder levels to lower-level decoders. Our proposed method is validated on a public benchmark MRI cardiac segmentation dataset, with comprehensive validation metrics showing competitive performance against other encoder-decoder variants. The code is available at https://***/ziyangwang007/VIT4UNet.
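As a simplified picture of how dense skip connections can aggregate multiscale encoder features at one decoder stage, the sketch below resizes feature maps from several encoder levels to a common resolution and fuses them with a 1x1 convolution. The channel counts, resolutions, and use of plain convolution (instead of Swin-style attention blocks) are illustrative assumptions, not the DCS-UNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSkipFusion(nn.Module):
    """Fuse feature maps from several encoder levels into one decoder stage
    by resizing them to a common resolution and concatenating channels."""
    def __init__(self, in_channels: list, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, encoder_feats, target_hw):
        resized = [
            F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False)
            for f in encoder_feats
        ]
        return self.proj(torch.cat(resized, dim=1))

# Example: three encoder levels (shallow to deep) fused at 56x56 for one decoder stage.
feats = [
    torch.randn(1, 96, 56, 56),
    torch.randn(1, 192, 28, 28),
    torch.randn(1, 384, 14, 14),
]
fusion = DenseSkipFusion([96, 192, 384], out_channels=96)
print(fusion(feats, target_hw=(56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```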
Localization is a fundamental element required in various applications across fields such as vehicle navigation, smart factories, automation systems, and product shipping services. This paper discusses the fusion of t...
ISBN (digital): 9781665496209
ISBN (print): 9781665496209
In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate coherent paragraph descriptions of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and visual and non-visual elements (e.g., relations, activities). Furthermore, we propose to train our VLCap model under a contrastive VL learning loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing SOTA methods on both accuracy and diversity metrics. Source code: https://***/UARK-AICV/VLCAP
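As a rough sketch of what a contrastive vision-language loss can look like, the function below computes a symmetric InfoNCE-style objective over paired vision and language features, pulling matched pairs together and pushing mismatched pairs apart. The temperature, feature dimensions, and batch construction are assumptions for the example, not the exact loss used by VLCap.

```python
import torch
import torch.nn.functional as F

def contrastive_vl_loss(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired vision/language features (CLIP-style sketch)."""
    v = F.normalize(vision_feats, dim=-1)
    l = F.normalize(language_feats, dim=-1)
    logits = v @ l.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with a batch of 8 paired 512-d features.
loss = contrastive_vl_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```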