ISBN (Print): 9798400701085
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/un-supervised learning during testing in real-world applications. Although TTA for image-based tasks has seen significant progress, TTA techniques for video remain scarce. Naively applying image-based TTA methods to video tasks yields limited performance, since these methods do not consider the special nature of video, e.g., motion information. In this paper, we propose leveraging motion cues in videos to design a new test-time learning scheme for video classification. We extract spatial appearance and dynamic motion clip features using two sampling rates (i.e., slow and fast) and propose a fast-to-slow unidirectional alignment scheme that aligns fast motion features with slow appearance features, thereby enhancing the motion encoding ability. Additionally, we propose a slow-fast dual contrastive learning strategy that learns a joint feature space for fast- and slow-sampled clips, guiding the model to extract discriminative video features. Lastly, we introduce a stochastic pseudo-negative sampling scheme that provides better adaptation supervision by selecting a more reliable pseudo-negative label than the pseudo-positive label used in prior TTA methods; this reduces the adaptation difficulty often caused by poor performance on out-of-distribution test data before adaptation. Our approach significantly improves performance on various video classification backbones, as demonstrated through extensive experiments on two benchmark datasets.
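To make the pseudo-negative idea concrete, here is a minimal sketch of stochastic pseudo-negative sampling for test-time adaptation, assuming a classifier that outputs per-clip logits. The bottom-k candidate pool, the loss form, and all names are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch: suppress a randomly chosen class from the k least-likely classes,
# which is a more reliable supervision signal than trusting the argmax
# pseudo-positive label on out-of-distribution test clips.
import torch
import torch.nn.functional as F

def pseudo_negative_loss(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    probs = F.softmax(logits, dim=1)                    # (B, C)
    _, bottom_k = probs.topk(k, dim=1, largest=False)   # k least-likely classes
    # Stochastic choice among the bottom-k candidates.
    choice = torch.randint(0, k, (logits.size(0), 1), device=logits.device)
    neg_labels = bottom_k.gather(1, choice)             # (B, 1)
    neg_probs = probs.gather(1, neg_labels).squeeze(1)  # (B,)
    # Minimizing -log(1 - p_neg) pushes the pseudo-negative class to zero.
    return -torch.log(1.0 - neg_probs + 1e-8).mean()

logits = torch.randn(8, 400)   # e.g., a batch of clips over 400 action classes
loss = pseudo_negative_loss(logits)
```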
With the continuous advancement of seeker technology and image-processing techniques, the precision of guided weapons has increasingly improved. However, due to the rigidly fixed structure between the seeker and the g...
ISBN (Print): 9781665405409
Video Object Segmentation (VOS) is a fundamental task in video recognition with many practical applications. It aims at predicting segmentation masks for multiple objects across an entire video. Recent VOS research has achieved remarkable performance. However, as a video-processing task, the inference speed of a VOS method is also essential. VOS can be considered an extension of semantic segmentation from a static image to a dynamic image sequence. Following this idea, we propose a fast VOS framework based on YOLACT, a real-time static image segmentation framework. We employ a fast online training technique that lets YOLACT handle dynamic video sequences, achieving competitive performance (77.2 J&F at 30.9 FPS on DAVIS17) among fast VOS methods. Moreover, by linearly combining mask bases to generate masks for arbitrary objects, our method can process multi-object videos with minimal extra computation.
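The "linearly combining mask bases" step is the YOLACT-style mask assembly. A minimal sketch, with assumed shapes and names, shows why extra objects are nearly free: each one costs only a vector-matrix product over shared prototypes.

```python
# Sketch: per-object masks as sigmoid of a linear combination of K shared
# prototype mask bases, in the spirit of YOLACT's mask assembly.
import torch

def assemble_masks(prototypes: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """prototypes: (K, H, W) mask bases shared across all objects.
    coeffs: (N, K) per-object combination coefficients.
    Returns (N, H, W) soft masks."""
    k, h, w = prototypes.shape
    lin = coeffs @ prototypes.view(k, h * w)   # one matmul per object set
    return torch.sigmoid(lin.view(-1, h, w))

masks = assemble_masks(torch.randn(32, 138, 138), torch.randn(5, 32))  # 5 objects
```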
Image mosaicing combines overlapping images of the same scene into a larger, seamless image. This work aims to develop a model that efficiently merges images while evaluating its performance in terms of runtime and th...
ISBN (Digital): 9781665496209
ISBN (Print): 9781665496209
In the past few years, several efforts have been devoted to reducing individual sources of latency in video delivery, including acquisition, coding, and network transmission. The goal is to improve the quality of experience in applications requiring real-time interaction. Nevertheless, these efforts are fundamentally constrained by technological and physical limits. In this paper, we investigate a radically different approach that can arbitrarily reduce the overall latency by means of video extrapolation. We propose two latency compensation schemes in which video extrapolation is performed either at the encoder or at the decoder side. Since a loss of fidelity is the price to pay for compensating latency arbitrarily, we study the latency-fidelity compromise using three recent video prediction schemes. Our preliminary results show that, by accepting a quality loss, we can compensate a typical latency of 100 ms with a loss of 8 dB in PSNR using the best extrapolator. This approach is promising but also suggests that further work is needed in video prediction to pursue zero-latency video transmission.
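A rough sketch of the bookkeeping behind extrapolation-based compensation: the display must be fed frames predicted far enough ahead to hide the end-to-end delay. The helper below is an illustrative assumption; the paper's learned predictors stand in for the placeholder arithmetic.

```python
# Sketch: how many future frames a decoder-side extrapolator must predict
# to mask a given end-to-end latency at a given frame rate.
import math

def frames_to_extrapolate(latency_ms: float, fps: float) -> int:
    return math.ceil(latency_ms * fps / 1000.0)

# The paper's example regime: 100 ms of latency at 30 fps
# requires predicting 3 frames ahead.
print(frames_to_extrapolate(100, 30))  # -> 3
```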
ISBN (Print): 9789819916382; 9789819916399
When an image algorithm is applied directly to video by processing the frames one by one, an obvious pixel-flickering phenomenon appears; this is the problem of temporal inconsistency. In this paper, a temporal consistency enhancement algorithm based on pixel flicker correction is proposed to enhance video temporal consistency. The algorithm consists of a temporal stabilization module (TSM-Net), an optical flow constraint module, and a loss calculation module. The innovation of TSM-Net is that ConvGRU networks are embedded layer by layer in the decoder in a dual-channel parallel structure, which effectively enhances the network's ability to extract information in the temporal domain through feature fusion. This paper also proposes a hybrid loss based on optical flow, which sums the temporal loss and the spatial loss to better balance their respective roles during training; it improves temporal consistency while preserving perceptual similarity. Since the algorithm does not require optical flow during testing, it runs in real time. Experiments on public datasets verify the effectiveness of the pixel flicker correction algorithm.
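A minimal sketch of the flow-based hybrid loss idea: a temporal term penalizing change between the current output and the flow-warped previous output, plus a spatial term tying the output to its per-frame target. The warp, the L1 terms, and the weights are illustrative assumptions, not the paper's exact losses.

```python
# Sketch: temporal (flicker) loss via backward warping with optical flow,
# summed with a spatial fidelity loss, as in the hybrid-loss idea above.
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` (B,C,H,W) with optical `flow` (B,2,H,W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # (B,2,H,W)
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0   # normalize x to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0   # normalize y to [-1, 1]
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)

def hybrid_loss(out_t, out_prev, flow, target_t, alpha=1.0, beta=1.0):
    temporal = F.l1_loss(out_t, warp(out_prev, flow))  # flicker penalty
    spatial = F.l1_loss(out_t, target_t)               # per-frame fidelity
    return alpha * temporal + beta * spatial
```

Note that the flow enters only the loss, which is consistent with the abstract's point that no optical flow is needed at test time.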
ISBN (Print): 9798350302233
In this paper, we experimentally demonstrate an optical camera communications (OCC) system that uses a wearable light-emitting diode (LED) array as the transmitter. Wearable devices are powerful tools for supporting Internet of Things (IoT) systems because of their sensing, processing, and communication capabilities. The term "wearable devices" refers to a wide range of products that can be integrated into clothing and accessories, allowing real-time data detection, storage, and exchange without human intervention. This paper presents a practical evaluation of an LED-based wearable transmitter for an OCC system to demonstrate its feasibility. In particular, an LED array attached to the body is modulated using on-off keying to transmit data via visible light, and a smartphone camera captures video of the user wearing the device while moving slightly in a static position in the room. Finally, the data is decoded from the video frames using an image-processing algorithm that tracks the source and demodulates the signal.
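A rough sketch of the receiver side of such an on-off-keying link: average the tracked LED region's intensity per frame and slice it against a threshold. The one-symbol-per-frame assumption, the midpoint slicer, and the names are illustrative; source tracking is assumed to be handled elsewhere.

```python
# Sketch: OOK demodulation from per-frame mean intensities of the LED ROI.
import numpy as np

def demodulate_ook(roi_means: np.ndarray) -> list[int]:
    """roi_means: per-frame mean intensity of the tracked LED region,
    one frame per symbol. Returns the recovered bit sequence."""
    threshold = 0.5 * (roi_means.max() + roi_means.min())  # midpoint slicer
    return [int(v > threshold) for v in roi_means]

frames = np.array([12, 200, 198, 15, 14, 205])  # toy intensity trace
print(demodulate_ook(frames))                    # -> [0, 1, 1, 0, 0, 1]
```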
ISBN (Digital): 9781665496209
ISBN (Print): 9781665496209
Point cloud video streaming is a fundamental application of immersive multimedia: objects represented as sets of points are streamed and displayed to remote users. Given the high bandwidth requirements of this content, small changes in the network and/or encoding can affect the users' perceived quality in unexpected ways. To counteract service degradation as fast as possible, real-time Quality of Experience (QoE) assessment is needed. As subjective evaluations are not feasible in real time due to their inherent cost and duration, low-complexity objective quality assessment is a must. Traditional No-Reference (NR) objective metrics at the client side are best suited to this task; however, they lack accuracy with respect to human perception. In this paper, we present a cluster-based objective NR QoE assessment model for point cloud video. By combining Machine Learning (ML)-based clustering and prediction techniques with NR pixel-based features (e.g., blur and noise), the model shows high correlation (up to a 0.977 Pearson Linear Correlation Coefficient (PLCC)) and low Root Mean Squared Error (RMSE) (down to 0.077 on a zero-to-one scale) against objective benchmarks when evaluated on an adaptive streaming point cloud dataset consisting of sixteen source videos and 453 sequences in total.
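A minimal sketch of the cluster-then-predict pattern, assuming per-sequence NR pixel features (blur, noise) and an objective quality score as the regression target. The specific models (k-means, random forests) and the synthetic data are assumptions for illustration only.

```python
# Sketch: cluster sequences by NR features, then fit one quality regressor
# per cluster; new sequences are routed to their cluster's model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((453, 2))   # NR features per sequence: [blur, noise]
y = rng.random(453)        # objective quality scores on a zero-to-one scale

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
models = {c: RandomForestRegressor(random_state=0)
             .fit(X[km.labels_ == c], y[km.labels_ == c])
          for c in range(4)}

def predict_qoe(x: np.ndarray) -> float:
    """Route a feature vector to its cluster's regressor."""
    c = int(km.predict(x.reshape(1, -1))[0])
    return float(models[c].predict(x.reshape(1, -1))[0])

print(predict_qoe(np.array([0.3, 0.7])))
```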
ISBN (Print): 9781510666313; 9781510666320
Video super-resolution is the task of converting low-resolution video into high-resolution video. Existing methods with good visual quality are mainly based on convolutional neural networks (CNNs), but their architectures are heavy, resulting in slow inference. To address this problem, this paper proposes a real-time video super-resolution Transformer (RVSRT) that completes the super-resolution task quickly while preserving the visual fluency of video frame transitions. Unlike traditional CNN-based methods, this paper does not process video frames separately with different network modules along the temporal dimension, but instead batches adjacent frames through a single end-to-end Transformer with a UNet-style structure. Moreover, this paper creatively adds two-stage interpolation sampling before and after the end-to-end network to maximally exploit traditional computer-vision interpolation. Experimental results show that, compared with the SOTA TMNet [1], RVSRT has only 20% of the network size (2.3M vs. 12.3M parameters) while achieving comparable performance, and its speed is 80% higher (26.2 fps vs. 14.3 fps at a frame size of 720×576).
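A rough sketch of wrapping a lightweight network with classical interpolation on both sides, in the spirit of the two-stage sampling described above: cheap resizing handles the raw scaling while the network contributes a learned residual. The wrapper, its placeholder body, and the scale factor are assumptions, not the paper's architecture.

```python
# Sketch: classical bicubic interpolation supplies the base upscaling;
# a (placeholder) learned module adds residual detail on top.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpWrappedSR(nn.Module):
    def __init__(self, body: nn.Module, scale: int = 4):
        super().__init__()
        self.body, self.scale = body, scale

    def forward(self, lr: torch.Tensor) -> torch.Tensor:  # (B, C, h, w)
        base = F.interpolate(lr, scale_factor=self.scale,
                             mode="bicubic", align_corners=False)
        return base + self.body(base)   # learned residual over bicubic

net = InterpWrappedSR(nn.Conv2d(3, 3, 3, padding=1))  # stand-in for the Transformer
hr = net(torch.randn(1, 3, 144, 180))                  # -> (1, 3, 576, 720)
```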
Modern-day computer vision applications are frequently implemented using machine learning approaches. While these implementations can perform very well, the performance is heavily dependent on sufficient and accurate ...