This study addresses deficiencies in the analysis of local parameters of target features in human motion video images during rapid local-feature extraction. These deficiencies lead to inaccurate description ...
Keyword Spotting (KWS), i.e., the capability to identify vocal commands as they are pronounced, is becoming one of the most important features of Human-Machine Interfaces (HMI), thanks in part to the pervasive diffusion of...
A spatial-temporal neural network video smoke detection algorithm is proposed to solve the problems associated with the incorrect classification of static smoke-like backgrounds in the face ...
ISBN (print): 9798400701085
Test-time adaptation (TTA) aims at boosting the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Though TTA on image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively introducing image-based TTA methods into video tasks may achieve limited performance, since these methods do not consider the special nature of video tasks, e.g., the motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features using two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme to align fast motion and slow appearance features, thereby enhancing the motion encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy to learn a joint feature space for fast- and slow-sampled clips, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme that provides better adaptation supervision by selecting a more reliable pseudo-negative label than the pseudo-positive label used in prior TTA methods. This technique reduces the adaptation difficulty often caused by poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.
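As an illustration of the stochastic pseudo-negative sampling the abstract describes, the following is a minimal PyTorch sketch. The sampling distribution, function names, and the exact loss form are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch of stochastic pseudo-negative sampling for test-time
# adaptation, assuming a classifier that returns logits for a batch of
# video clips. All names and the sampling distribution are illustrative.
import torch
import torch.nn.functional as F

def pseudo_negative_loss(logits: torch.Tensor) -> torch.Tensor:
    """Push probability mass away from a class sampled from the model's
    *least* confident predictions; on out-of-distribution test data this
    pseudo-negative is more reliable than trusting the argmax label."""
    probs = F.softmax(logits, dim=-1)                          # (B, C)
    # Sample a pseudo-negative class from the inverted probabilities,
    # so low-confidence classes are picked more often.
    neg_dist = (1.0 - probs) / (1.0 - probs).sum(dim=-1, keepdim=True)
    neg_labels = torch.multinomial(neg_dist, num_samples=1)    # (B, 1)
    # Minimize the probability assigned to the sampled pseudo-negative.
    p_neg = probs.gather(1, neg_labels)
    return -torch.log(1.0 - p_neg + 1e-8).mean()
```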
Video streaming services typically employ traditional codecs, such as H.264, to encode videos into multiple bitrate representations. These codecs are tightly limited by discrete quantization parameters (QPs), resulting ...
ISBN (print): 9781665405409
Video Object Segmentation (VOS) is a fundamental task in video recognition with many practical applications. It aims at predicting segmentation masks of multiple objects across an entire video. Recent VOS research has achieved remarkable performance. However, as a video processing task, the inference speed of a VOS method is also essential. VOS can be considered an extension of semantic segmentation from a static image to a dynamic image sequence. Following this idea, we propose a fast VOS framework based on YOLACT, a real-time static image segmentation framework. We employ a fast online training technique that extends YOLACT to handle dynamic video sequences, achieving competitive performance (77.2 J&F at 30.9 FPS on DAVIS17) among fast VOS methods. Moreover, by linearly combining mask bases to generate masks for arbitrary objects, our method can process multi-object videos with minimal extra computation.
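The YOLACT-style mask assembly this abstract relies on can be sketched in a few lines: each object's mask is a linear combination of shared mask bases (prototypes), so adding objects costs one coefficient vector each. Shapes and names below are assumptions, not the authors' code.

```python
# Illustrative YOLACT-style mask assembly: per-object masks are linear
# combinations of shared "mask bases", so multi-object inference is a
# single matrix multiplication over the prototype tensor.
import torch

def assemble_masks(prototypes: torch.Tensor,   # (K, H, W) shared mask bases
                   coeffs: torch.Tensor        # (N, K) one row per object
                   ) -> torch.Tensor:          # (N, H, W) object masks
    K, H, W = prototypes.shape
    # Weighted sum of the K prototypes per object, then a sigmoid.
    masks = coeffs @ prototypes.view(K, H * W)  # (N, H*W)
    return torch.sigmoid(masks.view(-1, H, W))
```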
Recently, deep learning models have become more prominent due to their tremendous performance on real-time tasks like face recognition, object detection, natural language processing (NLP), instance segmentation, image classification, gesture recognition, and video classification. Image captioning is one of the critical tasks in NLP and computer vision (CV): it converts an image to text, with the model automatically producing descriptive text for the input image. To this end, this article develops a Lightning Search Algorithm (LSA) with a Hybrid Convolutional Neural Network Image Captioning System (LSAHCNN-ICS) for NLP. The introduced LSAHCNN-ICS is an end-to-end model that employs a convolutional neural network (CNN)-based ShuffleNet as the encoder and an HCNN as the decoder. In the encoding part, the ShuffleNet model derives feature descriptors of the image. In the decoding part, the text description is generated by the proposed hybrid convolutional neural network (HCNN) model. To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy, which represents the innovation of the study. The simulation analysis of the presented LSAHCNN-ICS technique is performed on benchmark databases, and the obtained results demonstrate the superiority of the LSAHCNN-ICS algorithm over other recent methods, with maximum Consensus-based Image Description Evaluation (CIDEr) scores of 43.60, 59.54, and 135.14 on the Flickr8k, Flickr30k, and MSCOCO datasets, respectively.
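To make the encoder/decoder split concrete, here is a schematic PyTorch sketch: a ShuffleNet encoder produces image feature descriptors and a convolutional decoder emits caption tokens. The 1-D CNN decoder is a generic stand-in for the paper's HCNN, the LSA hyperparameter search is omitted, and every name is an assumption.

```python
# Schematic encoder/decoder captioning model: ShuffleNet features condition
# a causal 1-D conv decoder (a placeholder for the paper's HCNN).
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class CaptionSketch(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        self.encoder = shufflenet_v2_x1_0(weights=None)
        self.encoder.fc = nn.Identity()        # expose 1024-d image features
        self.project = nn.Linear(1024, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Left-padded conv so each output step sees only earlier tokens.
        self.decoder = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=2)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.project(self.encoder(images))           # (B, E)
        x = self.embed(tokens) + img_feat[:, None, :]           # (B, T, E)
        x = self.decoder(x.transpose(1, 2))[..., : tokens.size(1)]  # causal trim
        return self.out(x.transpose(1, 2))                      # (B, T, vocab)
```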
With the continuous advancement of seeker technology and image processing techniques, the precision of guided weapons has increasingly improved. However, due to the rigidly fixed structure between the seeker and the g...
ISBN (print): 9798350351439; 9798350351422
Diabetic retinopathy (DR) is a sight-threatening condition associated with diabetes, characterized by damage to the retinal blood vessels. Key to the automation of DR staging is the identification of various symptoms directly or closely associated with retinal blood vessels, as well as the count of these symptoms in the four quadrants of the retina separated by the optic disc. Precise identification of the optic disc (OD) and blood vessels in fundus images is therefore crucial for DR stage diagnosis but is often time-consuming and requires expert analysis. This study introduces a thresholding-based approach for the automated localization of the OD and the detection of blood vessels in fundus images of diabetic patients. Our algorithm is more robust than some deep learning-based algorithms, achieving more accurate results, particularly in advanced DR stages, where the resemblance between various symptoms and blood vessels complicates vessel extraction. Additionally, our computer vision system performs OD localization and blood vessel segmentation in real time. Experimental results on a dataset selected by an ophthalmologist from a Kaggle dataset to ensure data quality show that the proposed algorithm achieves accuracy above 94% for both OD localization and blood vessel detection, outperforming some state-of-the-art algorithms.
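For readers unfamiliar with thresholding-based vessel extraction, the following is a generic OpenCV sketch of the kind of pipeline the abstract describes (green channel, CLAHE, black-hat morphology, Otsu threshold). It is an illustration of the technique class, not the authors' exact algorithm; kernel sizes and parameters are assumptions.

```python
# Generic thresholding-based vessel segmentation for fundus images.
import cv2
import numpy as np

def segment_vessels(fundus_bgr: np.ndarray) -> np.ndarray:
    green = fundus_bgr[:, :, 1]              # vessels contrast best in green
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(green)
    # Black-hat emphasizes dark, thin structures (vessels) on a bright retina.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    blackhat = cv2.morphologyEx(enhanced, cv2.MORPH_BLACKHAT, kernel)
    # Otsu's threshold turns the enhanced vessel map into a binary mask.
    _, mask = cv2.threshold(blackhat, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Opening removes small speckle left by the threshold.
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
```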
ISBN (print): 9798350318920; 9798350318937
We present a lightweight model for high-resolution portrait matting. The model does not use any auxiliary inputs such as trimaps or background captures, and it achieves real-time performance for HD videos and near-real-time performance for 4K. Our model is built upon a two-stage framework: a low-resolution network for coarse alpha estimation followed by a refinement network for local region improvement. However, a naive implementation of the two-stage model suffers from poor matting quality when no auxiliary inputs are used. We address this performance gap by leveraging the vision transformer (ViT) as the backbone of the low-resolution network, motivated by the observation that the tokenization step of ViT can reduce spatial resolution while retaining as much pixel information as possible. To inform local regions of their context, we propose a novel cross-region attention (CRA) module in the refinement network that propagates contextual information across neighboring regions. We demonstrate that our method achieves superior results and outperforms other baselines on three benchmark datasets while using only 1/20 of the FLOPs of the existing state-of-the-art model.
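The two-stage coarse-to-fine layout the abstract describes can be sketched as below. The ViT backbone and the cross-region attention module are reduced to injected placeholders; every module name and the residual-refinement formulation are assumptions, not the authors' implementation.

```python
# Schematic two-stage matting: a low-resolution network predicts a coarse
# alpha, then a refinement network improves local regions at full size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageMatting(nn.Module):
    def __init__(self, coarse_net: nn.Module, refine_net: nn.Module,
                 scale: int = 4):
        super().__init__()
        self.coarse_net = coarse_net    # e.g. ViT encoder + small decoder
        self.refine_net = refine_net    # e.g. CNN with cross-region attention
        self.scale = scale

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Stage 1: coarse alpha at 1/scale resolution.
        small = F.interpolate(image, scale_factor=1 / self.scale,
                              mode="bilinear", align_corners=False)
        coarse = torch.sigmoid(self.coarse_net(small))          # (B, 1, h, w)
        coarse_up = F.interpolate(coarse, size=image.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Stage 2: refine locally, conditioned on image + coarse alpha.
        residual = self.refine_net(torch.cat([image, coarse_up], dim=1))
        return (coarse_up + residual).clamp(0, 1)
```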