ISBN (print): 9798331529543; 9798331529550
In streaming media services, video transcoding is a common practice to alleviate bandwidth demands. Unfortunately, traditional methods employing a uniform rate factor (RF) across all videos often result in significant inefficiencies. Content-adaptive encoding (CAE) techniques address this by dynamically adjusting encoding parameters based on video content characteristics. However, existing CAE methods are often tightly coupled with specific encoding strategies, leading to inflexibility. In this paper, we propose a model that predicts both RF-quality and RF-bitrate curves, which can be combined to derive a comprehensive bitrate-quality curve. This approach facilitates flexible adjustments to the encoding strategy without necessitating model retraining. The model leverages codec features, content features, and anchor features to predict the bitrate-quality curve accurately. Additionally, we introduce an anchor suspension method to enhance prediction accuracy. Experiments confirm that the actual quality metric (VMAF) of the compressed video stays within +/- 1 of the target, achieving an accuracy of 99.14%. By combining our quality improvement strategy with the rate-quality curve prediction model, we conducted online A/B tests, observing a +0.107% improvement in both video views and video completions and a +0.064% increase in app usage duration. Our model has been deployed in the Xiaohongshu App.
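The curve derivation the abstract describes can be illustrated with a small sketch: given RF-quality and RF-bitrate curves predicted at a few anchor RFs, the two compose into a bitrate-quality curve, and the RF for a target VMAF follows by inverting the monotone quality curve. All curve values, the anchor RFs, and the function names below are illustrative assumptions, not figures from the paper.

```python
# Sketch: derive a bitrate-quality mapping from predicted RF->VMAF and
# RF->bitrate curves and pick the RF that hits a target VMAF.
# The sampled values below are hypothetical per-video predictions.

def interp(x, xs, ys):
    """Piecewise-linear interpolation of y(x) over sorted sample points."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside sampled range")

rfs     = [18, 23, 28, 33]            # anchor rate factors (lower = higher quality)
vmaf    = [96.0, 91.0, 84.0, 75.0]    # predicted RF-quality curve
bitrate = [5200, 2600, 1300, 700]     # predicted RF-bitrate curve (kbps)

def rf_for_target_vmaf(target):
    """Invert the monotone RF-quality curve to find the RF hitting `target`."""
    # VMAF decreases with RF, so interpolate over the reversed axis.
    return interp(target, vmaf[::-1], rfs[::-1])

rf = rf_for_target_vmaf(88.0)         # RF expected to land near VMAF 88
kbps = interp(rf, rfs, bitrate)       # corresponding point on the bitrate curve
```

Because the encoding strategy only consumes the derived bitrate-quality curve, a different target (quality cap, bitrate cap) can be applied without retraining, which is the flexibility the paper emphasizes.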
ISBN (print): 9798331529543; 9798331529550
AI-generated images (AGIs) are increasingly utilized across diverse domains due to their ability to quickly produce high-quality visuals. However, assessing the quality of AGIs remains challenging due to their inherent variability and distinctive distortions. To address these challenges, we propose a novel AGI quality assessment method named SIRQA, which enhances feature representation by integrating visual features with textual prompts, effectively measuring the alignment between the generated images and the described content to improve the precision of quality assessment. Specifically, SIRQA employs self-ranking and inter-ranking mechanisms to refine feature representation. The self-ranking mechanism maintains consistency between feature distances and sampling scales, ensuring that features from similar sampling scales are positioned closer together. Additionally, the inter-ranking mechanism sorts the weighted similarity scores between images and prompts to align with the ranking in the label space. Extensive experiments on the AGIQA3K and PKUI2IQA datasets show that our SIRQA outperforms eight state-of-the-art algorithms in terms of both Spearman's rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC).
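The inter-ranking idea, ordering image-prompt similarity scores to agree with the label-space ranking, can be sketched as a pairwise objective. The hinge formulation below is one common way to realize such an ordering constraint; the paper's exact loss, weighting, and margin may differ.

```python
# Sketch of an inter-ranking-style objective: penalize image-prompt similarity
# scores whose pairwise order disagrees with the order of the quality labels.

def pairwise_ranking_loss(scores, labels, margin=0.1):
    """Sum of hinge penalties for score pairs ordered against their labels."""
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                # scores[i] should exceed scores[j] by at least `margin`
                loss += max(0.0, margin - (scores[i] - scores[j]))
    return loss
```

A correctly ordered score list with sufficient gaps incurs zero loss, while any inversion relative to the labels contributes a positive penalty that a gradient-based trainer would push against.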
ISBN (print): 9798331529543; 9798331529550
Neural Radiance Fields (NeRF) have demonstrated exceptional performance in generating novel views of scenes by learning implicit volumetric representations from calibrated RGB images, without depth information. A major limitation is the need for large training datasets in neural network-based view synthesis frameworks. The challenge of effective data augmentation for view synthesis remains unresolved. NeRF models require extensive scene coverage from multiple views to accurately estimate radiance and density. Insufficient coverage reduces the model's ability to interpolate or extrapolate unseen parts of the scene effectively. In this paper, we propose a novel pipeline that addresses this data augmentation issue using depth map information. We use depth image-based rendering (DIBR) to compensate for the shortage of training views for NeRF. Experimental results indicate that our approach enhances the quality of rendered images using the NeRF framework, achieving an average peak signal-to-noise ratio (PSNR) increase of 7.2 dB, with a maximum improvement of 12 dB.
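DIBR in its simplest horizontal-shift form can be sketched in a few lines: each pixel is forward-warped by a disparity derived from its depth to synthesize a virtual view. This is a generic illustration of the technique, not the paper's pipeline; real systems also handle hole filling and the baseline/focal product here is a placeholder.

```python
# Sketch of depth image-based rendering (DIBR): forward-warp one image row to
# a virtual viewpoint using per-pixel depth; None marks disoccluded pixels.

def dibr_warp(row, depth_row, baseline_focal=64.0):
    """Warp `row` by disparity = baseline*focal / depth, with a z-buffer."""
    out = [None] * len(row)
    zbuf = [float("inf")] * len(row)
    for x, (v, z) in enumerate(zip(row, depth_row)):
        disp = int(round(baseline_focal / z))    # nearer pixels shift more
        nx = x - disp
        if 0 <= nx < len(row) and z < zbuf[nx]:  # keep the nearest surface
            out[nx], zbuf[nx] = v, z
    return out
```

Warped views generated this way from the existing images and their depth maps become extra training samples, which is the augmentation role DIBR plays in the pipeline described above.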
ISBN (print): 9798331529543; 9798331529550
The structural similarity of point clouds presents challenges in accurately recognizing and segmenting semantic information at the demarcation points of complex scenes or objects. In this study, we propose a multi-scale graph transformer network (MGTN) for 3D point cloud semantic segmentation. First, a multi-scale graph convolution (MSG-Conv) is devised to address the limitations faced by existing methods when simultaneously extracting local and global features of point cloud data with varying densities. Subsequently, we employ a graph-transformer (G-T) module to enhance edge details and spatial position information in the point cloud, thereby improving recognition accuracy for small objects and easily confused elements such as columns and beams. Extensive testing on the ShapeNet parts and S3DIS datasets was conducted to demonstrate the effectiveness of MGTN. Compared to the baseline network DGCNN, our proposed MGTN achieves substantial performance improvements, with notable mIoU increases of 1.5% and 18.5% on the ShapeNet parts and S3DIS datasets, respectively. Additionally, MGTN outperforms the recent CFSA-Net by 2.3% and 3.4% in OA and mIoU, respectively.
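The multi-scale neighborhood idea behind an MSG-Conv-style layer can be sketched as building kNN graphs at several values of k, so that both sparse and dense regions yield usable local structure. The function names and scales below are illustrative; the paper's actual convolution aggregates learned edge features over such neighborhoods.

```python
# Sketch: per-point neighbor sets at multiple scales, the graph-construction
# step that a multi-scale graph convolution would consume.

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (excluding itself)."""
    order = sorted(range(len(points)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(points[i], points[j])))
    return order[1:k + 1]

def multi_scale_neighbors(points, i, scales=(2, 4)):
    """One neighbor set per scale; features from each are fused downstream."""
    return {k: knn(points, i, k) for k in scales}
```

A small k captures fine local geometry while a large k approximates broader context, which is how a multi-scale graph layer sees both at once.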
ISBN (print): 9798331529543; 9798331529550
Due to the substantial storage requirements of 4D medical images, achieving efficient compression of such images is a crucial topic. Existing traditional image/video coding methods have achieved remarkable results in most compression tasks, but their performance in encoding 4D medical images remains poor. This is because these methods cannot fully exploit the spatio-temporal correlations in 4D images. Recently, implicit neural representation (INR) based image/video compression methods have made significant progress, with coding performance comparable to traditional methods. However, like traditional methods, they also suffer from significant performance losses in 4D medical image compression. In this paper, we propose an efficient hybrid representation framework, which comprises six learnable feature planes and a tiny MLP decoder. This framework alleviates the inability of previous methods to utilize the spatio-temporal correlations in 4D medical images, enabling it to capture this information more effectively. We also introduce a novel adaptive plane scaling strategy that allocates the number of parameters in each plane based on the resolution of the image. This design allows the model to further enhance reconstruction quality at the same compression ratio. Extensive experiments show that our model achieves better RD performance than traditional and INR-based methods, and it also offers faster encoding speeds than INR-based methods.
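The six-plane decomposition can be sketched as follows: a 4D coordinate (x, y, z, t) has six axis pairs (xy, xz, xt, yz, yt, zt), each indexing one learnable 2D feature plane, and the gathered features feed a small MLP decoder. Nearest-neighbor lookup is used below for brevity (bilinear sampling is the usual choice), and the data layout is an assumption for illustration.

```python
# Sketch: project a 4D coordinate onto six 2D feature planes and gather one
# feature per plane; the concatenation would go to a tiny MLP decoder.

PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]  # axis pair per plane

def sample_planes(planes, coord):
    """planes: six 2D grids; coord: integer (x, y, z, t) for simplicity."""
    feats = []
    for plane, (a, b) in zip(planes, PAIRS):
        u, v = coord[a], coord[b]
        feats.append(plane[u][v])
    return feats
```

Because every plane covers one spatio-temporal axis pair, correlations along any pair of the four dimensions are captured by at least one plane, which is the property the framework relies on.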
ISBN (print): 9798350388978; 9798350388961
In computer vision applications, image enhancement is important for improving image quality and extracting meaningful information. Noise removal is a commonly used technique in image enhancement. In this study, the Batch Renormalization Denoising Network (BRDNet), which performs well in noise removal, is used as the base model, and the Bottleneck Attention Module (BAM) is incorporated to improve its performance. The proposed method is tested on different datasets with different noise levels, and the results are compared. In quantitative experiments, an increase in the PSNR metric was observed, and the visual results were found to be closer to the target images.
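The channel-attention half of a BAM-style module can be sketched as: squeeze each channel to a scalar by global average pooling, pass it through a tiny bottleneck, and gate the channel with a sigmoid. The scalar weights below are a toy stand-in for the module's fully connected layers, and the actual BAM also has a spatial branch; this is a simplified illustration, not the paper's implementation.

```python
# Sketch of channel attention: pool -> bottleneck -> sigmoid gate -> rescale.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, w1, w2):
    """feature_maps: list of 2D channels; w1/w2: toy bottleneck weights."""
    # Global average pool each channel to one scalar.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    hidden = [max(0.0, p * w1) for p in pooled]          # ReLU bottleneck
    gates = [sigmoid(h * w2) for h in hidden]            # per-channel gate in (0, 1)
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]
```

A gate near 1 passes the channel through unchanged, while a gate near 0 suppresses it, letting the denoiser emphasize informative channels.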
ISBN (digital): 9783031585357
ISBN (print): 9783031585340; 9783031585357
Captioning an image is the process of describing it with syntactically and semantically meaningful terms. An image caption generator is developed by integrating computer vision and natural language processing technology. Although numerous techniques for generating image captions have been developed, the results remain inadequate, and further research in this area is still needed. The human process of describing an image (seeing, focusing, and captioning) corresponds to feature representation, visual encoding, and language generation in image captioning systems. This study presents the construction of a simple deep learning-based image captioning model and investigates the efficacy of different visual encoding methods employed in the model. We have analyzed and compared the performance of six different pre-trained CNN visual encoding models using Bilingual Evaluation Understudy (BLEU) scores.
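The BLEU metric used to compare the encoders can be sketched in its simplest form, BLEU-1: unigram precision of the candidate caption against a reference, with the standard brevity penalty. Real evaluations combine n-grams up to order 4, apply clipping per n-gram, and allow multiple references; this is a minimal single-reference illustration.

```python
# Sketch: BLEU-1 = brevity penalty * clipped unigram precision.

import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # Clipped overlap: each candidate word counts at most as often as in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

An identical candidate scores 1.0, and any missing or spurious words pull the score below 1, which is the ordering the study relies on when ranking encoders.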
ISBN (print): 9798350343557
Automod is an artificial intelligence content moderation system that detects similarities and inconsistencies in user-generated visual content (images and videos). With the similarity module installed, labor savings of 15% were achieved, and the nonconformity detection models achieved F1 scores of 90% and higher. More than 100,000 images can be evaluated daily, and the system's load capacity was tested. Similarly, keyframes extracted from the at least 65,000 videos that can be evaluated daily were passed through the nonconformity models, and a load test was applied.
ISBN (print): 9798350343557
In this study, the alignment of video-text and image-text datasets is studied. First, similarities are calculated over the texts in the two datasets. A retrieval setup based on visual similarities is then applied to the subset created from the calculated text similarities. A BERT-based embedding method is applied to both the raw and the cleaned texts. As visual features, object-based and CLIP-based methods are used to represent video frames. According to the results, alignment with CLIP features achieves the best results on the subset created by filtering with raw text.
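The text-side filtering step can be sketched with cosine similarity: embed the captions (BERT in the study; toy vectors below), score every cross-dataset pair, and keep those above a threshold as the subset that visual retrieval then re-ranks. The embeddings, threshold, and function names are illustrative assumptions.

```python
# Sketch: cosine-similarity filtering between two sets of text embeddings.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_matches(video_texts, image_texts, threshold=0.8):
    """Pairs (i, j) whose embeddings are similar enough to enter the subset."""
    return [(i, j)
            for i, u in enumerate(video_texts)
            for j, v in enumerate(image_texts)
            if cosine(u, v) >= threshold]
```

Only the surviving pairs are scored with the (more expensive) visual features, which keeps the retrieval stage tractable on large datasets.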
ISBN (print): 9798350350463; 9798350350456
Perceptual quality metrics derived from deep features have significantly advanced the modelling of how the Human Visual System (HVS) perceives the quality of visual content. In this work, we study the effectiveness of fine-tuning three standard convolutional neural networks (CNNs), viz. ResNet50, VGG16 and MobileNetV2, to predict the quality of stereoscopic images in the no-reference setting. This work also aims to understand the impact of using disparity maps for quality prediction. Interestingly, our experiments demonstrate that disparity maps do not significantly improve perceptual quality estimation in the deep learning framework. To the best of our knowledge, this is the first study that explores the impact of disparity together with the chosen models for stereoscopic image quality assessment. We present a detailed study of our experiments with various architectural configurations on the LIVE Phase I and II datasets. Further, our results demonstrate the innate capability of deep features for quality prediction. Finally, simple fine-tuning of the models yields solutions that compete with state-of-the-art patch-based stereoscopic image quality assessment methods.