ISBN: 9781665475921 (print)
This paper applies Time-Frequency Analysis (TFA) techniques from signal processing to computer vision tasks. Our main idea is to build a simple network architecture that does not require two or more convolutional neural networks (CNNs): hidden features are analyzed by the Discrete Wavelet Transform (DWT) and sent into filters as weights via convolutions, transformers, or other methods. The network does not need two or more stages to accomplish this. Instead, we apply TFA directly to a CNN to build a one-stage network. Networks built this way not only keep their strong performance but also consume fewer computing resources. In this paper, we mainly apply the DWT to a CNN to solve image inpainting problems. The results show that our model works stably in the frequency domain and achieves free-form image inpainting.
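As a rough illustration of what a DWT does to a feature map, the sketch below computes a single-level 2D Haar transform in plain Python. The abstract does not say which wavelet the authors use, so the Haar wavelet, the function name `haar_dwt2`, and the sub-band naming are assumptions made only for this example.

```python
def haar_dwt2(x):
    """Single-level 2D orthonormal Haar DWT (illustrative; wavelet choice is an assumption).

    x is a list of rows with even height and width. Returns four
    half-resolution sub-bands: LL (average) and LH, HL, HH (details).
    Sub-band naming conventions differ between libraries.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            # The four values of one 2x2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # low-pass in both directions
            row_lh.append((a + b - c - d) / 2)  # difference between the two rows
            row_hl.append((a - b + c - d) / 2)  # difference between the two columns
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


feature_map = [[1, 2], [3, 4]]
ll, lh, hl, hh = haar_dwt2(feature_map)
```

Because this transform is orthonormal, the sum of squared coefficients across the four sub-bands equals the sum of squared inputs, so no information is lost by moving to the frequency domain.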
ISBN: 9781665475921 (print)
Self-attention based encoder-decoder models achieve dominant performance in image captioning. However, most existing image captioning models (ICMs) focus only on modeling the relations between spatial tokens, while channel-wise attention is neglected when building the visual representation. Since different channels of a visual representation usually denote different visual objects, this neglect may lead to poor performance on object and attribute words in the captions generated by the ICMs. In this paper, we propose a novel dual-stream self-attention module (DSM) to alleviate this issue. Specifically, we propose a parallel self-attention based module that simultaneously encodes visual information along the spatial and channel dimensions. In addition, to obtain channel-wise visual features effectively and efficiently, we introduce a group self-attention block with linear computational complexity. To validate the effectiveness of our model, we conduct extensive experiments on the standard image captioning benchmarks MSCOCO and Flickr30k. Without bells and whistles, the proposed model achieves new state-of-the-art results of 135.4 CIDEr on MSCOCO and 70.8 CIDEr on Flickr30k.
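The difference between the spatial and channel streams can be sketched as follows. This is a heavily simplified illustration, not the DSM itself: it omits the learned query/key/value projections and the group self-attention block, and the fusion by element-wise sum, the scaling factors, and all function names are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def spatial_attention(x):
    """x is N tokens by C channels; the attention map is N by N (token to token)."""
    c = len(x[0])
    scores = matmul(x, transpose(x))
    weights = softmax_rows([[v / math.sqrt(c) for v in row] for row in scores])
    return matmul(weights, x)


def channel_attention(x):
    """The attention map is C by C (channel to channel); output is again N by C."""
    n = len(x)
    xt = transpose(x)
    scores = matmul(xt, x)
    weights = softmax_rows([[v / math.sqrt(n) for v in row] for row in scores])
    return transpose(matmul(weights, xt))


def dual_stream(x):
    """Run both streams in parallel and fuse by element-wise sum (assumed fusion)."""
    s, c = spatial_attention(x), channel_attention(x)
    return [[u + v for u, v in zip(rs, rc)] for rs, rc in zip(s, c)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 spatial tokens, 2 channels
fused = dual_stream(tokens)
```

The spatial stream costs on the order of N squared in the number of tokens, while the channel stream's attention map depends only on the number of channels.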
ISBN: 9781665475921 (print)
Tire pattern image classification is an important computer vision problem in public security, as it can guide police in detecting criminal cases. It remains challenging due to the small diversity within different classes. Generally, a tire pattern image classification system may require two characteristics: high accuracy and low computation. In this paper, we first assume that capturing a rich feature representation benefits tire classification and that learning through a lightweight network improves computing efficiency. We then propose a simple yet efficient two-stage training mechanism: 1) We learn a feature extractor using a Variational Auto-Encoder framework constrained by contrastive learning, projecting images to a latent space with rich feature representations. 2) We train a single-layer linear classification network on the features extracted by the previously trained encoder. The Top-1 and Top-5 accuracies on the tire pattern dataset are 89.8% and 96.6%, respectively, validating the effectiveness of our strategy.
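For readers unfamiliar with the reported metrics, Top-k accuracy counts a sample as correct when its true class is among the k highest-scoring classes. The sketch below shows the computation on made-up scores; the function name and the toy data are illustrative and do not come from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores per sample.
    labels: the true class index of each sample.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)


# Toy scores for three samples over three classes (illustrative data only).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```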
Exposure errors in images, including both underexposure and overexposure, significantly diminish images' contrast and visual appeal. Existing deep learning-based exposure correction methods either require large ne...
ISBN: 9781665475921 (print)
To speed up image classification, which conventionally takes reconstructed images as input, compressed domain methods use the compressed images directly, without decompression. This comes with a certain drop in accuracy. Our goal in this paper is to raise the accuracy of compressed domain classification using the compressed images output by NN-based image compression networks. First, we design a hybrid objective loss function that contains the reconstruction loss of the deep feature map. Second, one image reconstruction layer is integrated into the image classification network to up-sample the compressed representation. These methods greatly increase compressed domain image classification accuracy and add no extra computational complexity. Experimental results on the ImageNet benchmark show that our design outperforms the latest work, ResNet-41, with a large gain of about 4.49% in top-1 classification accuracy. In addition, the accuracy gap to the method using reconstructed images is reduced to 0.47%. Moreover, our classification network has the lowest computational complexity and model complexity.
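One way to picture a hybrid objective of this kind is a classification loss plus a weighted feature-map reconstruction term. The abstract only states that the loss contains a reconstruction loss of the deep feature map, so the cross-entropy term, the mean-squared-error form, the reference feature map, and the `weight` parameter below are all assumptions made for this sketch.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def feature_mse(features, reference_features):
    """Mean squared error between a flattened feature map and a reference one."""
    n = len(features)
    return sum((f - r) ** 2 for f, r in zip(features, reference_features)) / n


def hybrid_loss(logits, label, features, reference_features, weight=1.0):
    """Assumed form: classification loss + weight * feature reconstruction loss."""
    return cross_entropy(logits, label) + weight * feature_mse(features, reference_features)


# Illustrative values only: two classes and a two-element flattened feature map.
loss = hybrid_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 4.0], weight=0.5)
```

With these toy inputs the classification term is ln 2 and the reconstruction term is 0.5 times 2, so the total is ln 2 + 1.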
ISBN: 9781665475921 (print)
In this paper, we propose a rate controllable image compression framework, Rate Controllable Variational Autoencoder (RC-VAE), based on the Rate-Feature-Level (RFL) model established through our exploration of the correlation among target rates, image features and quantization levels. Considering that different images should be quantized at different levels to meet the same target rate, we focus on jointly utilizing the target rate and the extracted features of the image to predict the corresponding quantization level, and propose the RFL model. Combining the proposed RFL model with a Hyperprior Continuously Variable Rate (HCVR) image compression network, we further propose the RC-VAE. By controlling the information loss in the quantization process, the RC-VAE can work at the target rate. Experimental results demonstrate that a single RC-VAE model can adapt to multiple target rates with higher rate control accuracy and better R-D performance than state-of-the-art rate controllable image compression networks.
ISBN: 9781665475921 (print)
Learned image compression (LIC) has shown superior compression ability. Quantization is an inevitable stage that generates the quantized latent for entropy coding. To solve the non-differentiability of quantization in the training phase, many differentiable approximate quantization methods have been proposed. However, in most previous methods the derivative of the quantized latent with respect to the non-quantized latent is set to one. As a result, the quantization error between the non-quantized and quantized latents is not taken into consideration in gradient descent. To address this issue, we exploit a gradient scaling method to scale the gradient of the non-quantized latent in back-propagation. The experimental results show that our method outperforms recent LIC quantization methods.
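The contrast between a unit derivative and a scaled gradient can be sketched as below. The first backward function passes the upstream gradient through unchanged, which is the "derivative set to one" behavior described above. The second rescales it using the quantization error; the abstract does not give the authors' scaling rule, so the formula `1 - alpha * |y - round(y)|`, the parameter `alpha`, and the function names are purely hypothetical.

```python
def quantize(latent):
    """Forward pass: round each latent value to the nearest integer.

    Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [float(round(v)) for v in latent]


def unit_backward(upstream):
    """Backward pass with the derivative set to one: gradients pass through unchanged."""
    return list(upstream)


def scaled_backward(latent, upstream, alpha=0.5):
    """Backward pass with a HYPOTHETICAL scaling rule (not the paper's formula).

    Each gradient is multiplied by 1 - alpha * |y - round(y)|, so values with a
    larger quantization error receive a smaller gradient.
    """
    grads = []
    for v, g in zip(latent, upstream):
        error = abs(v - round(v))
        grads.append(g * (1.0 - alpha * error))
    return grads


latent = [2.0, 0.25, -1.25]    # illustrative non-quantized latent
upstream = [1.0, 1.0, 2.0]     # illustrative gradient arriving from the decoder
quantized = quantize(latent)
plain = unit_backward(upstream)
scaled = scaled_backward(latent, upstream)
```

In this sketch the first latent value has zero quantization error and keeps its gradient, while the other two are scaled down.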
ISBN: 9798400706028 (print)
Multi-label image classification (MLIC) poses a formidable challenge due to the presence of multiple objects in each image, which makes it notably complex to decipher the visual content comprehensively. Discriminating between multiple objects necessitates the establishment of robust visual label dependencies. Previous methods attempt to formulate cross-modal interaction or one-shot co-occurrence relationship guidance. However, these methods not only exhibit limitations when handling occluded or blurry objects but also fail to fully leverage the diverse hierarchical properties for sustainably guiding the learning process of label dependencies. To sustainably establish hierarchical visual label dependencies, this paper introduces a Pyramidal Cross-modal Transformer (PCMT) framework for MLIC tasks. Specifically, the pyramidal visual guidance layer parses the visual features into a multi-resolution pyramid structure, allowing the updated visual-related information to provide sustained guidance for label semantics. This surpasses the conventional pre-processing of co-occurrence relationships. In addition, the hybrid modal interaction layer is proposed to effectively mitigate the semantic disparities between visual and label information with modal-blended indiscriminate attention, replacing vanilla self-attention. Several combination blocks consisting of these two layers are integrated and embedded within the encoder-decoder structure to facilitate the exploration of meticulous visual label dependencies. Extensive experiments on two widely used benchmarks, MS-COCO and PASCAL VOC 2007, consistently demonstrate that PCMT provides state-of-the-art results.
We propose an end-to-end learned image data hiding framework that embeds and extracts secrets in the latent representations of a generic neural compressor. By leveraging a perceptual loss function in conjunction with ...
Recent advancements in learning-based image compression methods have shown promising results. The success of these methods heavily relies on the entropy model, which predicts the probability distribution of the quanti...