ISBN (print): 0819452114
The lack of eye contact in video conferencing degrades the user experience. This problem has been known and studied for many years. There are hardware-based solutions to the eye-gaze problem; however, these specialized systems are not generally accessible. This paper assumes that the depth of the scene captured by a monocular camera is normally distributed over the XY plane of the real world, i.e., the plane parallel to the view plane. The three-dimensional normal curve is first estimated from the user's face model. Rotating this normal curve then rectifies the orientation of the face. The proposed approach is fast and effective: it retains the advantages of 3D modelling while avoiding the complex steps of registration, texture mapping and rendering.
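The depth model and rotation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the Gaussian parameters (`amplitude`, `sigma_x`, `sigma_y`), rotating about the camera's X axis (pitch, as when the camera sits above the screen), and orthographic projection are all hypothetical choices. The function returns where each pixel lands after rotation, which a separate warping step could use to resample the image.

```python
import numpy as np


def gaussian_depth(h, w, amplitude, sigma_x, sigma_y):
    """Hypothetical depth map: a 2D normal curve centred on the image (the face)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    cx, cy = (w - 1) / 2, (h - 1) / 2
    return amplitude * np.exp(
        -((xs - cx) ** 2 / (2 * sigma_x ** 2) + (ys - cy) ** 2 / (2 * sigma_y ** 2))
    )


def rotate_surface(depth, pitch_deg):
    """Rotate the (x, y, z) surface about the X axis and project orthographically.

    Returns the new pixel coordinates (x', y') and the rotated depth z' per pixel.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    cx, cy = (w - 1) / 2, (h - 1) / 2
    # Centre the coordinates so the rotation pivots about the face centre
    pts = np.stack([xs - cx, ys - cy, depth], axis=-1)  # shape (h, w, 3)
    t = np.deg2rad(pitch_deg)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(t), -np.sin(t)],
                      [0.0, np.sin(t), np.cos(t)]])
    rotated = pts @ rot_x.T
    # Orthographic projection: keep x and y, shifted back to pixel coordinates
    return rotated[..., 0] + cx, rotated[..., 1] + cy, rotated[..., 2]


depth = gaussian_depth(5, 5, amplitude=10.0, sigma_x=1.0, sigma_y=1.0)
new_x, new_y, new_z = rotate_surface(depth, pitch_deg=10.0)
```

A rotation of 0 degrees leaves every pixel in place, while larger pitch angles move the raised centre of the surface more than the flat periphery, which is the parallax effect the gaze correction relies on.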
ISBN (print): 9781665475921
Recently, deep learning-based video compression algorithms have achieved competitive performance in terms of Bjontegaard delta (BD) rate, especially those that adopt super-resolution networks as post-processing modules in downsampling-based video compression (DBC) frameworks. However, because traditional codecs are non-differentiable, DBC frameworks mainly focus on improving the super-resolution modules while neglecting to optimize the downscaling modules. In practical applications, it is crucial to improve video compression performance without modifying the decoder client. We propose a context-aware processing network (CPN) that is compatible with standard codecs, introduces no computational burden on the client, and preserves critical information and essential structures during downscaling. The proposed CPN acts as a precoder cascaded with a standard codec, improving compression performance on the server before encoding and transmission. In addition, a surrogate codec is employed to simulate the degradation process of standard codecs and to backpropagate gradients for optimizing the CPN. Experimental results show that the proposed method outperforms the latest pre-processing networks and achieves competitive performance relative to the latest DBC frameworks.
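The role of the surrogate codec can be illustrated with a toy example. The sketch below is not the CPN: it replaces the network with a single hypothetical scalar gain `w` applied after 2x average-pool downscaling of a 1D signal, and it replaces the paper's surrogate codec with additive uniform noise, a common differentiable proxy for quantization. Because the noise does not depend on `w`, the gradient of the reconstruction error passes straight through the "codec" to the precoder, which is the property a surrogate is there to provide.

```python
import numpy as np


def downscale(x):
    """2x average pooling of a 1D signal (length must be even)."""
    return x.reshape(-1, 2).mean(axis=1)


def upscale(y):
    """Nearest-neighbour 2x upsampling, standing in for the fixed decoder side."""
    return np.repeat(y, 2)


def surrogate_codec(y, step, rng):
    """Differentiable proxy for quantization: add uniform noise one step wide."""
    return y + rng.uniform(-step / 2, step / 2, size=y.shape)


def train_precoder(x, step=0.1, lr=0.05, iters=500, seed=0):
    """Learn a hypothetical scalar precoder gain w by gradient descent."""
    rng = np.random.default_rng(seed)
    w = 0.5
    d = downscale(x)
    for _ in range(iters):
        recon = upscale(surrogate_codec(w * d, step, rng))
        # dL/dw for L = mean((recon - x)^2); the noise term does not depend on w
        grad = np.mean(2 * (recon - x) * upscale(d))
        w -= lr * grad
    return w


x = np.array([1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 5.0, 1.0])
w = train_precoder(x)
```

For average pooling paired with nearest-neighbour upsampling, the expected loss is minimized at `w = 1`, so the learned gain settles close to 1; a real precoder would replace `w` with a network and the noise model with a learned codec simulator.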
ISBN (print): 9781665475921
The VCIP 2022 "Tire pattern image classification based on lightweight network challenge" aims to design lightweight networks that correctly classify images of tire surface tread patterns and indentations with low computational overhead. To this end, we present a novel lightweight tire tread classification network. Concretely, we adopt ShuffleNet-V2-x0.5 as our backbone. To reduce computational complexity, we introduce Space-To-Depth and Anti-Alias Downsampling modules to pre-process the input image. Moreover, to enhance the classification ability of our model, we adopt a knowledge distillation strategy with a Vision Transformer as the teacher network. To ensure robustness, we pre-train the model on ImageNet and fine-tune it on the challenge's training set. Experiments on the challenge dataset demonstrate that our model achieves superior performance, with 99.00% classification accuracy, 25.51M FLOPs, and 0.20M parameters.
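Space-To-Depth rearranges each non-overlapping `b x b` spatial block into channels, so the spatial resolution drops by a factor of `b` in each direction without discarding any pixels, which reduces the cost of the subsequent convolutions. A minimal NumPy sketch for a single `(C, H, W)` image follows; the block size and the channel ordering (row offset, column offset, channel) are assumptions, and the challenge entry's actual module may order channels differently.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Rearrange (C, H, W) into (C * block**2, H // block, W // block)."""
    c, h, w = x.shape
    if h % block or w % block:
        raise ValueError("H and W must be divisible by the block size")
    # Split each spatial axis into (grid position, offset inside the block)
    x = x.reshape(c, h // block, block, w // block, block)
    # Move the two in-block offsets to the front so they become channels
    x = x.transpose(2, 4, 0, 1, 3)
    return x.reshape(block * block * c, h // block, w // block)


img = np.arange(16, dtype=float).reshape(1, 4, 4)
out = space_to_depth(img)  # shape (4, 2, 2)
```

With `block=2`, output channel 0 holds the pixels at even rows and even columns, channel 1 even rows and odd columns, channel 2 odd rows and even columns, and channel 3 odd rows and odd columns.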
ISBN (print): 9798350343557
Automod is an artificial intelligence content moderation system designed to detect similarities and inconsistencies (nonconformities) in user-generated visual content such as images and videos. With the similarity module installed, labor savings of 15% were achieved, and the nonconformity detection models reached F1 scores of 90% and higher. The system can evaluate more than 100,000 images per day, and its load was tested. Similarly, keyframes extracted from at least 65,000 videos per day were passed through the nonconformity models under load testing.
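The abstract does not say how the similarity module compares images. One common, lightweight technique for near-duplicate detection is the average hash: shrink the image to an 8 x 8 grid of block means, threshold at the overall mean, and compare the resulting 64-bit fingerprints by Hamming distance. The sketch assumes grayscale input whose sides are divisible by 8, and the `threshold` of 5 differing bits is an arbitrary illustrative value; it shows the technique, not Automod's method.

```python
import numpy as np


def average_hash(img, hash_size=8):
    """64-bit fingerprint: block means thresholded at their overall mean."""
    h, w = img.shape
    small = img.reshape(hash_size, h // hash_size, hash_size, w // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).ravel()


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return int(np.count_nonzero(a != b))


def is_near_duplicate(img_a, img_b, threshold=5):
    """Hypothetical decision rule: few differing bits means 'similar'."""
    return hamming(average_hash(img_a), average_hash(img_b)) <= threshold


rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 255
print(is_near_duplicate(img, img))
```

Fingerprints are cheap to store and compare, which suits a daily volume in the hundreds of thousands; a production system would typically add an index for fast neighbour lookup.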
ISBN (print): 9781728185514
Learning-based image codecs produce compression artifacts that differ from the blocking and blurring degradation introduced by conventional image codecs such as JPEG, JPEG 2000 and HEIC. In this paper, a crowdsourcing-based subjective quality evaluation procedure was used to benchmark a representative set of end-to-end deep learning-based image codecs submitted to the MMSP'2020 Grand Challenge on Learning-Based Image Coding and the JPEG AI Call for Evidence. For the first time, a double-stimulus methodology with a continuous quality scale was applied to evaluate this type of image codec. The subjective experiment is one of the largest ever reported, comprising more than 240 pair comparisons evaluated by 118 naive subjects. The results of benchmarking learning-based image coding solutions against conventional codecs are organized into a dataset of differential mean opinion scores, which is made publicly available together with the stimuli.
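A differential mean opinion score (DMOS) for one stimulus is commonly computed from per-subject difference scores between the reference and the test image and reported with a 95% confidence interval. The sketch assumes that convention (reference score minus test score, normal-approximation interval with z = 1.96); the paper's exact scoring and any subject screening are not described in the abstract.

```python
import numpy as np


def dmos(ref_scores, test_scores, z=1.96):
    """Return (DMOS, confidence-interval half-width) for one stimulus.

    ref_scores[i] and test_scores[i] are subject i's ratings of the
    reference and the test image on a continuous scale.
    """
    diff = np.asarray(ref_scores, dtype=float) - np.asarray(test_scores, dtype=float)
    mean = diff.mean()
    # Sample standard deviation (ddof=1) for the interval estimate
    half_width = z * diff.std(ddof=1) / np.sqrt(diff.size)
    return mean, half_width


score, ci = dmos([80, 90, 70], [60, 70, 65])
```

Here the difference scores are 20, 20 and 5, giving a DMOS of 15 with a half-width of about 9.8; higher DMOS means the test image was rated further below its reference.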
ISBN (print): 9781665475921
Currently, action recognition is predominantly performed on video data processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of the actions in a task. To this end, we propose the Multimodal Audio-image and Video Action Recognizer (MAiVAR), a CNN-based audio-image-to-video fusion model that exploits both video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with the video representation, outperforming either modality alone on a large-scale action recognition dataset.
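One simple way to turn audio into an image a CNN can consume is a log-magnitude spectrogram, and one simple way to combine modalities is a weighted average of per-class probabilities. The sketch below shows both. It is an assumption-laden illustration: the abstract does not specify which audio-image representations MAiVAR uses or how its fusion is performed, and the window size, hop length and `audio_weight` are hypothetical parameters.

```python
import numpy as np


def spectrogram_image(signal, n_fft=256, hop=128):
    """Log-magnitude STFT as a (frequency, time) image of the audio track."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=0))  # (n_fft // 2 + 1, n_frames)
    return np.log1p(magnitude)


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def late_fusion(audio_logits, video_logits, audio_weight=0.5):
    """Weighted average of the two modalities' class probabilities."""
    return audio_weight * softmax(audio_logits) + (1 - audio_weight) * softmax(video_logits)


sr = 8000
t = np.arange(sr) / sr
spec = spectrogram_image(np.sin(2 * np.pi * 1000 * t))  # 1 kHz tone
fused = late_fusion(np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0]))
```

The 1 kHz tone shows up as a bright horizontal line at frequency bin 1000 * 256 / 8000 = 32, the kind of structure a 2D CNN can pick up from the audio image.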
ISBN (print): 9781665475921
Because of their large memory and computational requirements, traditional deep learning networks cannot run on mobile or embedded devices. In this paper, we propose a new mobile architecture that combines MobileNetV2 with pruning to further reduce FLOPs and the number of parameters. The performance of MobileNetV2 has been widely demonstrated, and pruning not only compresses the model further but also helps prevent overfitting. We conducted ablation experiments on the CIIP Tire Data for different pruning combinations. In addition, we introduce a global hyperparameter to effectively balance accuracy and precision. Experiments show that an accuracy of 98.3% is maintained with a model size of only 804.5 KB, outperforming the baseline method.
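The abstract does not state the pruning criterion. A common choice is magnitude-based filter pruning: score every convolutional filter by the L1 norm of its weights and remove the lowest-scoring fraction across all layers at once. In the sketch below that fraction is a single global hyperparameter `prune_ratio`; interpreting the paper's global hyperparameter this way is an assumption.

```python
import numpy as np


def global_filter_masks(layer_weights, prune_ratio):
    """Keep-masks per layer from a global L1-norm ranking of filters.

    layer_weights: list of conv weight arrays shaped (out_channels, in, kh, kw).
    prune_ratio:   hypothetical global fraction of filters to remove.
    """
    # L1 norm of each output filter, layer by layer
    norms = [np.abs(w).reshape(w.shape[0], -1).sum(axis=1) for w in layer_weights]
    flat = np.concatenate(norms)
    n_prune = int(round(prune_ratio * flat.size))
    keep = np.ones(flat.size, dtype=bool)
    # Drop the globally weakest filters, regardless of which layer they sit in
    keep[np.argsort(flat, kind="stable")[:n_prune]] = False
    # Split the flat mask back into one boolean mask per layer
    splits = np.cumsum([n.size for n in norms])[:-1]
    return np.split(keep, splits)


layer_a = np.arange(1, 5, dtype=float).reshape(4, 1, 1, 1)  # filter norms 1, 2, 3, 4
layer_b = np.array([0.5, 5.0, 6.0, 0.1]).reshape(4, 1, 1, 1)
masks = global_filter_masks([layer_a, layer_b], prune_ratio=0.5)
pruned_a = layer_a[masks[0]]  # surviving filters of the first layer
```

A global ranking lets layers with many weak filters shrink more than layers that matter, rather than cutting every layer by the same amount; in a full network the next layer's input channels must be pruned to match.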
ISBN (print): 9798331529543; 9798331529550
Blind Image Quality Assessment (BIQA) is essential in computational vision for predicting the visual quality of digital images without reference counterparts. Despite advances brought by convolutional neural networks (CNNs), a significant challenge in BIQA remains the long-tail distribution of image quality scores, which leads to biased training and reduced model generalization. To address this, we restructured the KonIQ-10k dataset into an imbalanced version named KonIQ-10k-LT, manipulating the quality-score distributions so that the training and validation sets have opposing distributions: certain quality scores become more frequent in the training set and less frequent in the validation set. Experimental results show a significant performance decline of BIQA models on KonIQ-10k-LT compared with the original KonIQ-10k, highlighting the challenge posed by the long-tail distribution. To mitigate this issue, we propose a Proportion Weighted Balancing (PWB) method as a baseline, designed to enhance the robustness and generalization ability of BIQA models. Our findings demonstrate that the proposed PWB method improves the performance and reliability of BIQA models under these challenging conditions.
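The abstract names Proportion Weighted Balancing but not its formula. A plausible reading, used here purely as an assumption, is inverse-proportion weighting: bin the quality scores, weight each training sample by the reciprocal of its bin's share of the training set, and use those weights in the regression loss so that rare score ranges count more. The bin count and score range are hypothetical parameters.

```python
import numpy as np


def proportion_weights(scores, n_bins, lo, hi):
    """Per-sample weights inversely proportional to the share of each score bin."""
    scores = np.asarray(scores, dtype=float)
    # Map each score in [lo, hi] to a bin index in [0, n_bins - 1]
    idx = np.clip(((scores - lo) / (hi - lo) * n_bins).astype(int), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    proportions = counts[idx] / scores.size
    weights = 1.0 / proportions
    # Normalize so the average weight is 1 and the loss scale is unchanged
    return weights / weights.mean()


def weighted_mse(pred, target, weights):
    """Mean squared error with per-sample weights."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(weights * err ** 2))


w = proportion_weights([1, 1, 1, 2], n_bins=4, lo=1.0, hi=5.0)
loss = weighted_mse([1.2, 0.9, 1.0, 2.5], [1, 1, 1, 2], w)
```

For scores `[1, 1, 1, 2]` with four bins over `[1, 5]`, the single sample in the rare bin receives three times the weight of each common-bin sample, and the weights average to 1.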
ISBN (print): 9781728185514
In video coding, compressing high-frequency components, including noise and visually imperceptible content, has long been a difficult problem: these components consume a large amount of bandwidth while providing limited quality improvement. Directly applying denoising methods degrades coding performance and is therefore unsuitable for video coding. In this work, we propose a video pre-processing approach based on an edge-preserving filter designed specifically for video coding, whose parameters are optimized for rate-distortion (R-D) performance. The proposed pre-processing method removes components that are not R-D cost-effective for the video encoder while keeping important structural components, leading to higher coding efficiency and better subjective quality. Compared with conventional denoising filters, the proposed R-D optimized edge-preserving filter improves coding efficiency by up to 5.2% in BD-rate (a BD-rate of -5.2%) with low computational complexity.
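The abstract does not specify which edge-preserving filter is used or how its parameters are R-D optimized. As an illustration only, the sketch uses a bilateral filter (a standard edge-preserving filter) and picks its range parameter `sigma_r` by minimizing a Lagrangian cost D + λR. Because a real rate requires running the encoder, `rate_proxy` (mean absolute gradient) is a hypothetical stand-in, and `lam` and the candidate values are arbitrary.

```python
import numpy as np


def bilateral_filter(img, radius=2, sigma_s=1.5, sigma_r=10.0):
    """Edge-preserving smoothing: neighbours are weighted by distance and similarity."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = padded[radius + dy : radius + dy + h, radius + dx : radius + dx + w]
            spatial = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            # Large intensity differences (edges) get near-zero weight
            rng_w = np.exp(-((shifted - img) ** 2) / (2 * sigma_r ** 2))
            num += spatial * rng_w * shifted
            den += spatial * rng_w
    return num / den


def rate_proxy(img):
    """Hypothetical rate estimate: mean absolute vertical plus horizontal gradient."""
    return float(np.abs(np.diff(img, axis=0)).mean() + np.abs(np.diff(img, axis=1)).mean())


def rd_select(img, sigma_r_candidates, lam):
    """Pick the sigma_r minimizing D + lam * R, with D as MSE to the input."""
    best_cost, best_sigma = None, None
    for s in sigma_r_candidates:
        out = bilateral_filter(img, sigma_r=s)
        cost = float(np.mean((out - img) ** 2)) + lam * rate_proxy(out)
        if best_cost is None or cost < best_cost:
            best_cost, best_sigma = cost, s
    return best_sigma


noisy = 50 + np.random.default_rng(0).normal(0, 5, (32, 32))
chosen = rd_select(noisy, [5.0, 15.0, 30.0], lam=1.0)
```

A small `sigma_r` keeps strong edges intact while still averaging out low-amplitude noise, which is why edge-preserving filters suit pre-processing better than plain low-pass denoising.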
ISBN (print): 9781728180687
In this work, an efficient and robust learning-based JPEG2000 architecture is proposed. It uses machine learning techniques to predict and encode the decision bits in the embedded block coding with optimized truncation (EBCOT) process. First, we apply non-locally weighted ridge regression to predict the quantized wavelet coefficients in the LL subband. Next, during the EBCOT process, we perform inter/intra-subband prediction and inter/intra-bit-plane symbol prediction to estimate the activity of the decision bit using a deep learning architecture. The binary prediction result is then treated as an additional context, and the decision bit is finally coded using an advanced context-based adaptive binary arithmetic coder. Simulations show that the proposed framework achieves the same visual quality as conventional codecs with bitrate savings of up to 30%.
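Why an extra prediction context saves bits can be shown with the ideal code length of an adaptive binary coder: the sum of -log2 p over the coded bits, where p comes from running per-context counts. The sketch uses a Laplace (add-one) estimator and computes the ideal length instead of running an actual arithmetic coder; both are simplifying assumptions, and the `predictions` list is a hypothetical stand-in for the paper's deep-learning bit prediction.

```python
import math

import numpy as np


def adaptive_code_length(bits, contexts):
    """Ideal code length (in bits) of an adaptive binary coder with per-context counts."""
    counts = {}  # context -> (number of 0s seen, number of 1s seen)
    total = 0.0
    for bit, ctx in zip(bits, contexts):
        n0, n1 = counts.get(ctx, (0, 0))
        # Laplace estimate of the probability of the bit actually being coded
        p = ((n1 if bit else n0) + 1) / (n0 + n1 + 2)
        total -= math.log2(p)
        counts[ctx] = (n0 + (bit == 0), n1 + (bit == 1))
    return total


rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=1000).tolist()
predictions = bits  # hypothetical perfect predictor, for illustration only
no_context = adaptive_code_length(bits, [0] * len(bits))
with_context = adaptive_code_length(bits, predictions)
```

With a single context, random bits cost roughly one bit each; when the prediction is used as the context, each context sees an almost constant stream and the cost collapses. Real predictors are imperfect, so the saving depends on how sharply the prediction separates the bit statistics.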