ISBN (Print): 9798400701085
End-to-end video coding for both machine and human vision has become an emerging research topic. In complicated systems such as the large-scale Internet of Video Things (IoVT), feature streams and video streams can be separately encoded and delivered for machine judgement and human viewing. In this paper, we propose a deep scalable video codec (deepSVC) to support three-layer scalability from machine to human vision. First, we design a semantic layer that encodes semantic features extracted from the captured video for machine analysis. This layer employs a conditional semantic compression (CSC) method to remove redundancies between semantic features. Second, we design a structure layer that can be combined with the semantic layer to predict the captured video at low quality. This layer effectively estimates video frames from the semantic layer with an interlayer frame prediction (IFP) network. Third, we design a texture layer that can be combined with the above two layers to reconstruct high-quality video signals. This layer also exploits the IFP network to improve its coding efficiency. In large-scale IoVT systems, deepSVC can deliver the semantic layer for regular use and transmit the other layers on demand. Experimental results indicate that the proposed deepSVC outperforms popular codecs for machine and human vision. Compared with the scalable extension of H.265/HEVC (SHVC), deepSVC reduces the average bit-per-pixel (bpp) by 25.51%/27.63%/59.87% at the same mAP/PSNR/MS-SSIM. Source code is available at: https://***/LHB116/deepSVC.
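To make the three-layer decoding path concrete, below is a minimal PyTorch sketch of how a scalable decoder of this kind could be organized. All module names and architectures (the semantic decoder, the IFP predictor, and the structure/texture refinements) are illustrative assumptions, not the paper's actual networks; only the layering logic follows the abstract.

```python
# Hypothetical sketch of a three-layer scalable decode path (machine -> human vision).
# Module internals are placeholders; only the layered control flow mirrors deepSVC.
import torch
import torch.nn as nn

class ScalableDecoderSketch(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        # Semantic layer: recovers semantic features for machine analysis.
        self.semantic_dec = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # Interlayer frame prediction (IFP): predicts pixels from semantic features.
        self.ifp = nn.Conv2d(feat_ch, 3, 3, padding=1)
        # Structure/texture layers refine the prediction using coded residual info.
        self.structure_dec = nn.Conv2d(3 + feat_ch, 3, 3, padding=1)
        self.texture_dec = nn.Conv2d(3 + feat_ch, 3, 3, padding=1)

    def forward(self, sem_latent, struct_latent=None, tex_latent=None):
        # Layer 1 (always decoded): semantic features for machine vision.
        sem_feat = self.semantic_dec(sem_latent)
        out = {"semantic": sem_feat}
        if struct_latent is not None:
            # Layer 2: low-quality video predicted from the semantic layer via IFP.
            pred = self.ifp(sem_feat)
            low_q = pred + self.structure_dec(
                torch.cat([pred, struct_latent], dim=1))
            out["low_quality"] = low_q
            if tex_latent is not None:
                # Layer 3: high-quality reconstruction on top of the first two layers.
                out["high_quality"] = low_q + self.texture_dec(
                    torch.cat([low_q, tex_latent], dim=1))
        return out

# Decode progressively: semantic only, then +structure, then +texture on demand.
dec = ScalableDecoderSketch()
sem = torch.randn(1, 64, 32, 32)  # stand-in for an entropy-decoded latent
out = dec(sem, torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(sorted(out))  # ['high_quality', 'low_quality', 'semantic']
```

The point of the structure is that each dictionary entry only requires the bitstreams up to its layer, which is what lets an IoVT system ship the semantic layer continuously and the pixel layers only when a human needs to view the video.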
Deep-learning-based video coding methods have demonstrated superior performance compared to classical video coding standards in recent years. The vast majority of existing deep video coding (DVC) networks are based on convolutional neural networks (CNNs), whose main drawback is that, limited by the size of the receptive field, they cannot effectively handle long-range dependencies or recover local detail. How to better capture and process the overall structure as well as the local texture information is therefore a core issue in video coding. Notably, the transformer employs a self-attention mechanism that captures dependencies between any two positions in the input sequence without being constrained by distance. This is an effective solution to the problem described above. In this paper, we propose end-to-end transformer-based adaptive video coding (TAVC). First, we compress the motion vectors and residuals through a compression network built on the vision transformer (ViT) and design a ViT-based motion compensation network. Second, because video coding must adapt to inputs of different resolutions, we introduce a position encoding generator (PEG) as adaptive position encoding (APE) to maintain translation invariance across video coding tasks at different resolutions. Experiments show that, under the multiscale structural similarity (MS-SSIM) metric, this method achieves significant performance gains over conventional engineering codecs such as x264, x265, and VTM-15.2, as well as a clear improvement over CNN-based DVC methods. Under the peak signal-to-noise ratio (PSNR) metric, TAVC also achieves good performance.
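The PEG/APE idea can be illustrated with a short sketch. The common position encoding generator design replaces fixed-size learned position embeddings with a depthwise convolution over the 2-D token grid, so the positional signal is computed from local neighborhoods and transfers to any input resolution. The sketch below assumes that standard design; the paper's exact PEG configuration may differ.

```python
# Sketch of a position encoding generator (PEG): a depthwise 3x3 conv over the
# token grid injects position information that is translation-invariant and
# resolution-agnostic, so no fixed-length embedding table is needed.
import torch
import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Depthwise conv: the positional signal depends only on a local
        # neighborhood, so it generalizes across token-grid sizes.
        self.proj = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N == h * w
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.proj(x) + x  # conditional positional encoding, residual form
        return x.flatten(2).transpose(1, 2)

# The same module handles latents from videos of different resolutions:
peg = PEG(dim=96)
for h, w in [(30, 52), (68, 120)]:
    out = peg(torch.randn(2, h * w, 96), h, w)
    assert out.shape == (2, h * w, 96)
```

Because the conv weights are shared across positions, inserting such a module between transformer blocks gives each token position-dependent context without tying the codec to one training resolution.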
ISBN (Print): 9781450394789
Deep-learning-based video compression techniques have been growing rapidly in recent years. This paper adopts the Conditional Augmented Normalizing Flow video codec (CANF-VC) [8] as our base system. To improve the quality of the condition signal (image) for CANF, we propose a two-layer structure learning-based video codec. At a low cost in extra bit rate, the low-resolution base layer provides side information that improves the quality of the motion-compensated reference frame through a super-resolution module with a merge-net. In addition, the base layer provides information to the skip-mask generator; the skip-mask guides the coding mechanism to reduce the number of transmitted samples for the high-resolution enhancement layer. Experimental results indicate that the proposed two-layer coding scheme provides 22.19% PSNR BD-rate savings and 49.59% MS-SSIM BD-rate savings over H.265 (HM 16.20) on the UVG test sequences.
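As a rough illustration of the skip-mask mechanism, the sketch below upsamples the base layer, derives a binary mask from it, and reconstructs masked positions from the upsampled base layer alone, while unmasked positions add the coded enhancement residual. The tiny mask network, the 0.5 threshold, and plain bilinear upsampling (standing in for the paper's super-resolution module with merge-net) are assumptions for illustration, not the authors' design.

```python
# Hedged sketch: base-layer-driven skip mask for a two-layer codec.
# Where the mask says "skip", the decoder reuses the upsampled base layer and
# no enhancement-layer residual needs to be transmitted for those samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipMaskGenerator(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, base_up):
        # Soft mask in [0, 1]; thresholded to a hard skip decision at inference.
        return self.net(base_up)

def reconstruct_enhancement(base_frame, enh_residual, gen, scale=2):
    # Upsample the base layer to enhancement resolution (a super-resolution
    # module with merge-net plays this role in the paper).
    base_up = F.interpolate(base_frame, scale_factor=scale,
                            mode="bilinear", align_corners=False)
    skip = (gen(base_up) > 0.5).float()
    # Skipped positions reuse the base layer; the rest add the coded residual.
    return skip * base_up + (1 - skip) * (base_up + enh_residual)

gen = SkipMaskGenerator()
hq = reconstruct_enhancement(torch.randn(1, 3, 64, 64),   # low-res base frame
                             torch.randn(1, 3, 128, 128), # enhancement residual
                             gen)
print(hq.shape)  # torch.Size([1, 3, 128, 128])
```

The rate saving comes from the skipped fraction of samples: the enhancement layer only spends bits where the base layer's prediction is judged insufficient.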