Image captioning is the process of producing written descriptions that effectively represent the meaning and context of an image. To integrate visual and textual data, it needs to blend computer vision and n...
Recently, transformer-based and convolution-based methods have achieved significant results in learned image compression. By comparing the designs of convolutional networks (convnets) and transformers, we replace the sel...
ISBN:
(Print) 9781665475921
VCIP 2022 "Tire pattern image classification based on lightweight network challenge" aims to design lightweight networks that correctly classify tire surface tread patterns and indentation images with less overhead. To this end, we present a novel lightweight tire tread classification network. Concretely, we adopt the ShuffleNet-V2-x0.5 network as our backbone. To reduce computational complexity, we introduce the Space-To-Depth and Anti-Alias Downsampling modules to pre-process the input image. Moreover, to enhance the classification ability of our model, we adopt a knowledge distillation strategy with a Vision Transformer as the teacher network. To ensure the robustness of our model, we pre-train it on ImageNet and fine-tune it on the challenge training set. Experiments on the challenge dataset demonstrate that our model achieves superior performance, with 99.00% classification accuracy, 25.51M FLOPs, and 0.20M parameters.
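The Space-To-Depth pre-processing step mentioned above can be sketched in plain Python. This is a minimal single-channel version; the block size of 2 is an assumption for illustration, as the abstract does not specify the configuration used in the paper.

```python
def space_to_depth(image, block=2):
    """Rearrange each `block` x `block` spatial patch into channels,
    reducing spatial resolution without discarding any pixels."""
    h, w = len(image), len(image[0])
    assert h % block == 0 and w % block == 0
    out = []  # shape: (h // block) x (w // block) x (block * block)
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            # Gather the patch's pixels into the channel dimension.
            row.append([image[i + di][j + dj]
                        for di in range(block) for dj in range(block)])
        out.append(row)
    return out

img = [[1, 2, 5, 6],
       [3, 4, 7, 8],
       [9, 10, 13, 14],
       [11, 12, 15, 16]]
print(space_to_depth(img))  # 2 x 2 grid, 4 channels per position
```

Because the rearrangement is lossless, the subsequent convolutions operate on a smaller spatial grid at lower cost, which matches the stated goal of reducing computation.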
Ultra-high resolution image segmentation has attracted increasing attention recently due to its wide applications in various scenarios such as road extraction and urban planning. The ultra-high resolution image facilitates the capture of more detailed information but also poses great challenges to the image understanding system. For memory efficiency, existing methods preprocess the global image and local patches into the same size, which can only exploit local patches of a fixed resolution. In this paper, we empirically analyze the effect of different patch sizes and input resolutions on the segmentation accuracy and propose a multi-scale collective fusion (MSCF) method to exploit information from multiple resolutions, which is end-to-end trainable for more efficient training. Our method achieves very competitive performance on the widely-used DeepGlobe dataset while training on a single GPU.
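The core idea of fusing predictions across input resolutions can be illustrated with a toy sketch. This is not the MSCF architecture itself: the nearest-neighbour resize, the stub scoring function, and the uniform averaging are all placeholder assumptions standing in for learned components.

```python
def resize_nn(img, new_h, new_w):
    """Nearest-neighbour resize for a 2D list (stand-in for bilinear)."""
    h, w = len(img), len(img[0])
    return [[img[i * h // new_h][j * w // new_w] for j in range(new_w)]
            for i in range(new_h)]

def multi_scale_fuse(img, score_fn, scales=(1.0, 0.5)):
    """Score the image at several input resolutions and average the
    per-pixel score maps back at full resolution (collective fusion)."""
    h, w = len(img), len(img[0])
    fused = [[0.0] * w for _ in range(h)]
    for s in scales:
        small = resize_nn(img, max(1, int(h * s)), max(1, int(w * s)))
        scores = score_fn(small)          # per-pixel scores at scale s
        up = resize_nn(scores, h, w)      # back to full resolution
        for i in range(h):
            for j in range(w):
                fused[i][j] += up[i][j] / len(scales)
    return fused

# Toy "segmentation network": score = normalized intensity.
score_fn = lambda im: [[p / 255.0 for p in row] for row in im]
fused = multi_scale_fuse([[0, 255], [255, 0]], score_fn)
print(fused)
```

Because every scale contributes through the same fused output, a real implementation of this scheme can backpropagate one loss through all branches, which is what makes end-to-end training of the multi-resolution pipeline possible.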
Due to their large memory requirements and heavy computation, traditional deep learning networks cannot run on mobile or embedded devices. In this paper, we propose a new mobile architecture combining MobileNetV2 and pruning, which further decreases the FLOPs and number of parameters. The performance of MobileNetV2 has been widely demonstrated, and the pruning operation not only allows further model compression but also helps prevent overfitting. We conducted ablation experiments on the CIIP Tire dataset for different pruning combinations. In addition, we introduced a global hyperparameter to effectively balance accuracy and precision. Experiments show that an accuracy of 98.3% is maintained while the model size is only 804.5 KB, showing better performance than the baseline method.
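The pruning step can be sketched with a simple magnitude-based channel criterion. The abstract does not state which criterion the paper uses, so the L1-norm rule below is an assumed, commonly used choice, and the weight values are illustrative.

```python
def prune_channels(weights, ratio):
    """Magnitude-based channel pruning: drop the `ratio` fraction of
    output channels with the smallest L1 norm (an assumed criterion;
    the paper's exact rule is not given in the abstract)."""
    norms = [(sum(abs(w) for w in ch), idx) for idx, ch in enumerate(weights)]
    n_keep = max(1, round(len(weights) * (1 - ratio)))
    keep = sorted(norms, reverse=True)[:n_keep]
    kept_idx = sorted(idx for _, idx in keep)
    return [weights[i] for i in kept_idx], kept_idx

# One layer's output channels, each a flat list of weights.
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
pruned, kept = prune_channels(layer, ratio=0.5)
print(kept)  # indices of surviving channels
```

A global ratio hyperparameter like `ratio` above is one way to trade model size against accuracy across the whole network, which is the kind of trade-off the abstract's global hyperparameter controls.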
Currently, action recognition is predominantly performed on video data processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions. To this end, we propose the Multimodal Audio-image and Video Action Recognizer (MAiVAR), a CNN-based audio-image-to-video fusion model that accounts for the video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with video representations, achieving better performance than either modality individually on a large-scale action recognition dataset.
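Since the abstract does not detail MAiVAR's fusion mechanism, the following is only a generic weighted late-fusion sketch over per-class scores from the two streams; the logit values and the equal 0.5 weighting are illustrative assumptions, not the paper's learned fusion.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_predictions(video_logits, audio_logits, w_video=0.5):
    """Weighted late fusion: average the per-class probabilities
    produced independently by the video and audio streams."""
    pv, pa = softmax(video_logits), softmax(audio_logits)
    return [w_video * v + (1 - w_video) * a for v, a in zip(pv, pa)]

# Toy per-class logits from each modality (3 action classes).
fused = fuse_predictions([2.0, 0.5, 0.1], [0.2, 2.5, 0.1])
pred = fused.index(max(fused))
print(pred)
```

Even this simple rule shows the appeal of multimodal fusion: a class that neither stream scores perfectly can still win once the two sources of evidence are combined.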
For scientific exploration and research on Mars, transmitting high-quality Martian images from distant Mars to Earth is an indispensable step. Image compression is the key technique given the extremely limited Mars-Earth bandwidth. Recently, deep learning has demonstrated remarkable performance in natural image compression, which opens a possibility for efficient Martian image compression. However, deep learning usually requires large amounts of training data. In this paper, we establish the first large-scale high-resolution Martian image compression (MIC) dataset. Through analyzing this dataset, we observe an important non-local self-similarity prior for Martian images. Benefiting from this prior, we propose a deep Martian image compression network with a non-local block to exploit both local and non-local dependencies among Martian image patches. Experimental results verify the effectiveness of the proposed network in Martian image compression, outperforming both deep-learning-based compression methods and the HEVC codec.
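The non-local operation at the heart of such a block can be sketched as follows. This is a bare-bones version of the standard non-local idea (softmax-weighted aggregation over all positions plus a residual); the learned embedding projections of a full non-local block are omitted as an assumption-laden simplification.

```python
import math

def non_local(features):
    """Simplified non-local operation: each position is updated with a
    softmax-weighted sum of ALL positions (dot-product similarity),
    added back as a residual, so distant self-similar patches can
    reinforce each other."""
    n = len(features)
    out = []
    for i in range(n):
        sims = [sum(a * b for a, b in zip(features[i], features[j]))
                for j in range(n)]
        m = max(sims)
        w = [math.exp(s - m) for s in sims]
        z = sum(w)
        agg = [sum(w[j] * features[j][d] for j in range(n)) / z
               for d in range(len(features[i]))]
        # Residual connection, as in the standard non-local block.
        out.append([x + y for x, y in zip(features[i], agg)])
    return out

# Three patch features: positions 0 and 1 are self-similar.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(non_local(feats))
```

For Martian terrain, where similar textures recur far apart in an image, this kind of aggregation lets each patch borrow statistics from look-alike patches anywhere in the frame, not just its local neighbourhood.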
Spatial frequency analysis and transforms play a central role in most engineered lossy image and video codecs, but are rarely employed in neural network (NN)-based approaches. We propose a novel NN-based image coding framework that utilizes forward wavelet transforms to decompose the input signal by spatial frequency. Our encoder generates separate bitstreams for each latent representation of low and high frequencies. This enables our decoder to selectively decode bitstreams in a quality-scalable manner. Hence, the decoder can produce an enhanced image by using an enhancement bitstream in addition to the base bitstream. Furthermore, our method is able to enhance only a specific region of interest (ROI) by using a corresponding part of the enhancement latent representation. Our experiments demonstrate that the proposed method shows competitive rate-distortion performance compared to several non-scalable image codecs. We also showcase the effectiveness of our two-level quality scalability, as well as its practicality in ROI quality enhancement.
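The quality-scalable decoding described above can be illustrated with a one-level 1-D Haar transform: decoding only the low band yields a coarse signal, and adding the high band restores full detail. The paper operates on 2-D images with learned latents; this 1-D Haar toy is an assumed simplification for illustration.

```python
def haar_forward(signal):
    """One-level 1-D Haar transform: split into a low-frequency
    (pairwise average) band and a high-frequency (difference) band."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Reconstruct the signal from the two bands."""
    out = []
    for a, d in zip(low, high):
        out += [a + d, a - d]
    return out

x = [10, 12, 8, 6]
low, high = haar_forward(x)   # low -> base bitstream, high -> enhancement
base_only = haar_inverse(low, [0.0] * len(low))  # decode base layer only
full = haar_inverse(low, high)                   # base + enhancement
print(base_only, full)
```

Dropping the high band (the enhancement layer) leaves a blurred but valid reconstruction, while ROI enhancement corresponds to restoring the high-band coefficients only for the positions inside the region of interest.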
With the rapid development of multi-sensor fusion technology in various industrial fields, many composite images closely related to human life have been produced. To meet the rapidly growing needs of various image-based applications, we have established the first multi-source composite image (MSCI) database for image quality assessment (IQA). Our MSCI database contains 80 reference images and 1600 distorted images, generated by four advanced compression standards with five distortion levels. In particular, these five distortion levels are determined based on the first five just noticeable difference (JND) levels. Moreover, we verify the IQA performance of some representative methods on our MSCI database. The experimental results show that the performance of the existing methods on the MSCI database needs to be further improved.
Recently, deep learning-based video compression algorithms have achieved competitive performance in Bjontegaard delta (BD) rate, especially those adopting super-resolution networks as post-processing modules in downsampling-based video compression (DBC) frameworks. However, limited by the non-differentiable characteristics of traditional codecs, DBC frameworks mainly focus on improving the performance of super-resolution modules while ignoring the optimization of downscaling modules. In practical application scenarios, it is crucial to improve video compression performance without modifying the decoder client. We propose a context-aware processing network (CPN) compatible with standard codecs that introduces no computational burden on the client, and that preserves critical information and essential structures during downscaling. The proposed CPN works as a precoder cascaded with standard codecs to improve compression performance on the server before encoding and transmission. Besides, a surrogate codec is employed to simulate the degradation process of the standard codecs and backpropagate the gradient to optimize the CPN. Experimental results show that the proposed method outperforms the latest pre-processing networks and achieves considerable performance compared with the latest DBC frameworks.
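The surrogate idea can be illustrated with the simplest case of codec non-differentiability: quantization. The paper's surrogate is a learned network that mimics the whole codec's degradation; the additive-uniform-noise proxy below is a different, commonly used stand-in for hard rounding, shown here only to convey why a differentiable substitute enables training the precoder.

```python
import random

def codec_round(x):
    """Stand-in for the standard codec's non-differentiable quantizer."""
    return [float(round(v)) for v in x]

def surrogate(x, rng):
    """Training-time proxy: additive uniform noise in [-0.5, 0.5]
    approximates rounding error while remaining differentiable with
    respect to x (a common proxy, not the paper's learned surrogate)."""
    return [v + rng.uniform(-0.5, 0.5) for v in x]

rng = random.Random(0)
latent = [0.2, 1.7, -0.4]
print(codec_round(latent))     # what the real codec does at test time
print(surrogate(latent, rng))  # what the precoder sees during training
```

Because the proxy's output is a smooth function of its input, gradients from a reconstruction loss can flow back through it into the precoder, even though the deployed codec itself admits no gradients.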