ISBN: (print) 9798350349405; 9798350349399
This paper proposes integrating residual blocks into neural representation for videos (NeRV)-based architectures to improve the reconstruction of detailed patterns and high-level features. Additionally, a coding pipeline is introduced that places the implicit neural decoder in a real-life video streaming framework. Specifically, DeepCABAC is employed for model compression: it applies a quantization scheme followed by the context-adaptive binary arithmetic coding (CABAC) entropy coding algorithm, which generates the bitstream. Our method outperforms NeRV, as well as x264 and x265, achieving BD-rate gains against NeRV of -12.06% using PSNR and -14.25% using MS-SSIM. Furthermore, it exhibits superior subjective quality compared to NeRV, which is attributed to enhanced high-level feature reconstruction. This observed behavior encourages the application of our method to other NeRV-based models, such as E-NeRV.
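The quantize-then-entropy-code step described above can be illustrated with a rough sketch. It does not use DeepCABAC or CABAC: it assumes a plain uniform quantizer and uses the zeroth-order empirical entropy of the quantized levels as a stand-in for the rate an entropy coder might approach. The function names, the step size, and the synthetic weights are all hypothetical.

```python
import numpy as np

def quantize_uniform(weights, step):
    """Map each weight to the nearest integer multiple of `step` (assumed quantizer)."""
    return np.round(weights / step).astype(np.int64)

def dequantize(levels, step):
    """Reconstruct approximate weights from integer levels."""
    return levels.astype(np.float64) * step

def empirical_entropy_bits(levels):
    """Zeroth-order entropy of the levels, in bits per symbol."""
    _, counts = np.unique(levels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in for network weights (not data from the paper).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=10_000)
step = 0.01

levels = quantize_uniform(weights, step)
recon = dequantize(levels, step)
max_err = float(np.max(np.abs(weights - recon)))  # bounded by step / 2
bits_per_weight = empirical_entropy_bits(levels)
```

A larger step lowers the estimated bits per weight at the cost of a larger reconstruction error, which is the rate-distortion trade-off that BD-rate comparisons summarize.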
The technology for autonomous navigation on inland waterways is worth investigating, and navigable water surface segmentation is a key part of this technology. Semantic segmentation methods based on deep learning can distinguish between water surface areas and non-water surface areas. However, existing semantic segmentation methods cannot meet the requirements of the water surface segmentation task in terms of both segmentation precision and real-time performance. In this study, a Swap Attention Bilateral Segmentation Network (SA-BiSeNet) is proposed to improve segmentation performance while preserving model inference speed by using an attention mechanism to better fuse the features of the two branches of the dual-branch down-sampling network. Specifically, a novel Swap Attention Module is designed to model the dependency between the features of the spatial detail branch and those of the semantic branch, thus expanding the receptive field of each branch to the other's global context. This design fuses features effectively and thus enhances feature representation. Experiments were conducted on the inland waterway dataset USVInland to verify the performance of SA-BiSeNet in terms of segmentation precision and inference speed. SA-BiSeNet achieved 93.65% Mean IoU and maintained the same level of FPS as the baseline.
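For reference, the Mean IoU metric reported above can be computed as in the following minimal sketch. The abstract does not state the exact evaluation protocol (for example, per-image versus dataset-level accumulation), so this version simply averages per-class IoU over one pair of label maps. The function name `mean_iou` and the convention that class 1 is water are assumptions for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average the intersection-over-union of each class present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: leave it out of the mean
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy label maps: 1 = water, 0 = non-water (assumed convention).
target = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]])
score = mean_iou(pred, target, num_classes=2)
```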
ISBN: (print) 9781510673878; 9781510673861
Surgical image and video applications using endoscopic datasets have been actively investigated to develop advanced surgical assistant systems. These applications are particularly crucial for understanding surgical scenes during procedures. Specifically, segmentation techniques identify anatomical structures and surgical instruments, quality control methods refine surgical techniques, and action recognition helps discern surgical steps. Performance across different downstream tasks has improved significantly thanks to advances in deep neural networks and the large training datasets available. However, the exploration of surgical action recognition remains limited. Existing methods face challenges in real-world settings, mainly because they adapt poorly to a dynamic imaging environment. In this study, we present a framework for surgical action recognition in endoscopic datasets that leverages video masked autoencoders (VideoMAE), which have shown promise in video analysis with limited data. Additionally, we incorporate a temporal data augmentation technique to represent diverse imaging conditions and to address the issue of using low-quality, single-source data. For our experiments, we use VideoMAE V2 pre-trained on Unlabeled Hybrid datasets and fine-tune the model on the CholecT45 dataset for validation. Our proposed method shows the effectiveness of using the VideoMAE structure with focal loss, particularly for action recognition tasks in surgical scenarios.
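The focal loss mentioned above down-weights well-classified examples by scaling the cross-entropy term with $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class. The sketch below shows the standard softmax form in NumPy. The abstract does not give the exact variant or hyperparameters used (and a multi-label sigmoid form is also possible), so the function name, the value $\gamma = 2$, and the example logits are assumptions.

```python
import numpy as np

def focal_loss(logits, labels, gamma=2.0):
    """Mean softmax focal loss: -(1 - p_t)^gamma * log(p_t)."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(labels)), labels]
    pt = np.exp(log_pt)
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))

# Hypothetical logits for three clips and four action classes.
logits = [[4.0, 0.0, 0.0, 0.0],   # confident and correct
          [0.0, 0.0, 0.0, 0.0],   # uncertain
          [0.0, 3.0, 0.0, 0.0]]   # confident and wrong
labels = [0, 2, 0]

loss_focal = focal_loss(logits, labels, gamma=2.0)
loss_ce = focal_loss(logits, labels, gamma=0.0)  # gamma = 0 gives plain cross-entropy
```

With $\gamma = 0$ the expression reduces to ordinary cross-entropy, which makes it easy to compare how much the easy examples contribute under each setting.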
ISBN: (print) 9798350344868; 9798350344851
Commonly used datasets for evaluating video codecs are all of very high quality and are not representative of the video typically used in video conferencing scenarios. We present the Video Conferencing Dataset (VCD) for evaluating video codecs for real-time communication, the first such dataset focused on video conferencing. VCD includes a wide variety of camera qualities and spatial and temporal information. It includes both desktop and mobile scenarios and two types of video background processing. We report the compression efficiency of H.264, H.265, H.266, and AV1 in low-delay settings on VCD and compare it with the non-video conferencing datasets UVC, MLC-JVC, and HEVC. The results show that the source quality and the scenarios have a significant effect on the compression efficiency of all the codecs. VCD enables the evaluation and tuning of codecs for this important scenario. VCD is publicly available as an open-source dataset at https://***/microsoft/VCD.
To address the problem of managing dedicated parking zones arising from the increasing number of electric vehicles and vehicles for the physically challenged, this paper proposes a license plate recognition (LPR)-based parking control system that combines the YOLO and MobileNet algorithms. These two algorithms are designed for real-time object detection and efficient preprocessing, respectively, and can operate in real time in resource-constrained edge-device environments. In tests using data from more than 51,000 vehicles, the system achieved an accuracy of 95.76% in classifying electric vehicles and 97.18% in classifying vehicles for the physically challenged. The average CPU and RAM utilizations of the system were 34.54% and 45.04%, respectively. In addition, the processing time per image was approximately 1.04 s, demonstrating the system's potential to run reliably on edge devices. These results are expected to facilitate the efficient resolution of parking management problems in smart cities and the effective operation of parking zones reserved for electric vehicles and vehicles for the physically challenged.
The box filter is well known for image smoothing thanks to its effectiveness and computational efficiency. However, it cannot preserve edges. In contrast, edge-preserving methods cannot match the computational performance of the box filter. To tackle this issue, we present a one-sided box filter that preserves edges much better than the box filter while retaining similarly high computational performance. More specifically, we apply the box filter on nine one-sided local windows and then select the most likely candidate as the result. This selection introduces non-linearity, which preserves edges and corners. Several numerical experiments are conducted to confirm this edge-preserving property. The filter inherits from the box filter the constant computational complexity $O(1)$ with respect to the window size and the linear complexity $O(N)$ with respect to the total number of pixels. We numerically confirm that this filter is the fastest among edge-preserving methods, including classical and state-of-the-art approaches: it is at least $10\times$ faster than the other edge-preserving methods. Thanks to its edge-preserving property and high computational performance, the proposed one-sided box filter can be deployed in a wide range of applications that require both edge preservation and high performance, such as real-time video processing, augmented reality, and view synthesis.
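The idea of filtering over several one-sided windows and keeping one candidate per pixel can be sketched as follows. The abstract specifies neither the nine windows nor the selection rule, so this sketch assumes one full window, four half windows, and four quarter windows, and it assumes the candidate closest to the input pixel is selected. The function names are hypothetical. For clarity the window means are computed by direct summation, so the sketch does not reproduce the $O(1)$ per-pixel cost, which would require running sums or an integral image.

```python
import numpy as np

def window_mean(img, r, dy_range, dx_range):
    """Mean over offsets dy in [dy0, dy1] and dx in [dx0, dx1] around each pixel."""
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")  # replicate borders
    (dy0, dy1), (dx0, dx1) = dy_range, dx_range
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(dy0, dy1 + 1):
        for dx in range(dx0, dx1 + 1):
            acc += padded[r + dy : r + dy + h, r + dx : r + dx + w]
    return acc / ((dy1 - dy0 + 1) * (dx1 - dx0 + 1))

def one_sided_box_filter(img, r=1):
    """Keep, per pixel, the window mean closest to the input value (assumed rule)."""
    img = np.asarray(img, dtype=np.float64)
    neg, pos, full = (-r, 0), (0, r), (-r, r)
    windows = [
        (full, full),                                    # ordinary box window
        (full, neg), (full, pos), (neg, full), (pos, full),  # half windows
        (neg, neg), (neg, pos), (pos, neg), (pos, pos),      # quarter windows
    ]
    candidates = np.stack([window_mean(img, r, dy, dx) for dy, dx in windows])
    best = np.argmin(np.abs(candidates - img), axis=0)
    return np.take_along_axis(candidates, best[None], axis=0)[0]

# A vertical step edge: the plain box filter blurs it, the one-sided filter keeps it.
step = np.zeros((5, 6))
step[:, 3:] = 1.0
blurred = window_mean(step, 1, (-1, 1), (-1, 1))
preserved = one_sided_box_filter(step, r=1)
```

On this step image every pixel has at least one window lying entirely on its own side of the edge, so the selected candidate equals the input and the edge survives unchanged.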
To meet the demand for strong real-time display in large integrated video networks, this paper proposes a lossless, strong real-time video transmission and display model based on the Fibre Channel protocol. By accessing...
Face regions containing rich semantic information appear frequently in videos. As video resolution increases dramatically, face regions will inevitably attract more attention. This paper proposes a face perception-based coding scheme to improve the visual quality of the face regions in UHD videos. A specially tailored face perception model is first used to locate the face regions precisely and quickly. Then, a face perception map is generated using a hierarchical mapping algorithm. Finally, the face perception map is employed as guidance to optimize the encoding process, including mode decision, block partition, and bit allocation. The proposed method is implemented on HEVC to demonstrate its effectiveness. Experimental results on a set of 4K test sequences show that the proposed method clearly improves the objective and subjective quality of the face regions while causing only a slight quality decline over the rest of the frame. Additionally, the computation required for mode decision and block partition is reduced, thereby saving encoding time.
ISBN: (print) 9781728198354
Space-time memory (STM) network methods have been dominant in semi-supervised video object segmentation (SVOS) due to their remarkable performance. In this work, we identify three key aspects where we can improve such methods: i) supervisory signal, ii) pretraining, and iii) spatial awareness. We then propose TrickVOS, a generic, method-agnostic bag of tricks that addresses each aspect with i) a structure-aware hybrid loss, ii) a simple decoder pretraining regime, and iii) a cheap tracker that imposes spatial constraints on model predictions. Finally, we propose a lightweight network and show that, when trained with TrickVOS, it achieves results competitive with state-of-the-art methods on the DAVIS and YouTube benchmarks while being one of the first STM-based SVOS methods that can run in real time on a mobile device.
The High Efficiency Video Coding (HEVC) standard achieves high compression efficiency at the expense of increased computational complexity. The HEVC encoder performs a hierarchical search for the optimal Coding Unit (CU) partitioning based on rate-distortion optimization. Various solutions have been proposed to reduce the encoding time, among which machine learning-based methods are the more effective. Yet deep learning tools have a relatively high computational load. In this paper, therefore, a new low-complexity convolutional neural network is designed, called the Convolutional Neural Network-based CTU Partitioner (CNNCP), which reduces the computational complexity of HEVC encoding. The CNNCP takes the Coding Tree Unit (CTU) luminance component and the quantization parameter (QP) as inputs and outputs the CU depth matrix at once. Because the CNNCP does not follow the hierarchical approach, it has a fixed computation structure that facilitates the use of parallel processing tools. The CNNCP has a simple structure with a minimal number of parameters and thus minimal computational complexity. It has been trained and tested on a large database for all QP values. The results show that it reduces the encoding time by more than 90%, which makes it suitable for real-time applications.