ISBN: (Print) 9783031581809; 9783031581816
Recent approaches to image captioning typically follow an encoder-decoder architecture. Feature vectors extracted from the region proposals of an object detector network serve as input to the encoder. Without any explicit spatial information about the visual regions, the caption synthesis model is limited to learning relationships from captions only. However, the structure between the semantic units in images differs from that in sentences. This work introduces a grid-based spatial position encoding scheme to learn relationships from both domains. Furthermore, bilinear pooling is used with attention to exploit spatial and channel-wise attention distributions and capture second-order interactions between multi-modal inputs. These components are integrated within the Transformer architecture, achieving a competitive CIDEr score.
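As a rough illustration of what a grid-based spatial position could look like, the sketch below maps a region's bounding box to the index of the grid cell containing its centre. The abstract does not specify the scheme, so the function name `grid_cell_index`, the use of the box centre, and the default grid size are all assumptions made here for illustration; in a real model such an index would typically select a learned position embedding.

```python
def grid_cell_index(box, image_w, image_h, grid=8):
    """Map a region box (x1, y1, x2, y2), in pixels, to a cell index.

    The image is divided into grid x grid cells, numbered row by row.
    Hypothetical helper: the paper's actual encoding may differ.
    """
    x1, y1, x2, y2 = box
    # Use the box centre as the region's representative position (assumption).
    cx = (x1 + x2) / 2.0
    cy = (y1 + y2) / 2.0
    # Scale to grid coordinates and clamp so border boxes stay in range.
    col = max(0, min(int(cx / image_w * grid), grid - 1))
    row = max(0, min(int(cy / image_h * grid), grid - 1))
    return row * grid + col
```

For example, with a 100 x 100 image and a 4 x 4 grid, a box centred at (50, 70) falls in row 2, column 2.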
The newest video coding standard, Versatile Video Coding (VVC), adopts a quad-tree (QT) plus multi-type tree (QTMT) block partition structure and improves the compression performance by about 30%∼50%, compared with t...
Image captioning models are a type of natural language processing (NLP) model designed to generate textual descriptions of images. These models are trained on large datasets of images and captions...
ISBN: (Print) 9798350367164; 9798350367157
Single-image high dynamic range image reconstruction has received much attention for recovering image details and for showing the possibility of simulating real-world brightness distribution. While most current works focus on recovering overexposed areas, this work focuses more on underexposed regions and the brightness adjustment of the whole image. This paper proposes an additional plug-in module with a histogram-guided image binning method for low-light image high dynamic range restoration. The plug-in module is built mainly around histogram feature extraction and image-binning-based brightness restoration, enhancing recovery in darker regions. Extensive experiments demonstrate the effectiveness of the approach in enhancing the visual quality of low-light images and preserving details in underexposed areas. Under extremely low-light conditions, networks using this plug-in module achieve up to a 0.8227 PSNR improvement and a 0.8278 PU21-PSNR improvement.
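For readers unfamiliar with the reported metric, the sketch below computes standard PSNR, conventionally expressed in decibels, over flat lists of pixel values. It is a generic definition rather than the paper's evaluation code; the function name `psnr` and the default `peak` of 1.0 are assumptions, and the PU21 variant (which applies a perceptual encoding before the comparison) is not covered here.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio between two equally sized pixel lists.

    `peak` is the maximum possible pixel value (1.0 for normalised images,
    255 for 8-bit images). Returns infinity for identical inputs.
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic, an improvement reported as a difference between two PSNR values corresponds to a ratio of mean squared errors, not an absolute error reduction.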
Overfitting is usually regarded as a negative condition since it impairs the generalisation power of a model. Nevertheless, overfitting a Neural Network (NN) on test data may be advantageous to improve the compression...
This paper explores the potential of a learned two-layer B-frame codec, known as TLZMC. TLZMC is one of the few early attempts that deviate from the hybrid-based coding architecture by skipping motion coding. With TLZ...
ISBN: (Print) 9798350343557
Chest X-ray imaging is critically important for effectively diagnosing chest diseases, which are increasing today due to various environmental and hereditary factors. Although chest X-ray is the most commonly used modality for detecting pathological abnormalities, interpretation can be quite challenging for specialists due to misleading locations and sizes of abnormalities, visual similarities, and complex backgrounds. Traditional deep learning (DL) architectures fall short because pathological abnormalities occupy relatively small areas and diseased and healthy areas look similar. In addition, DL structures with standard classification approaches are not ideal for problems involving multiple diseases. To overcome these problems, background-independent feature maps are first created using a conventional convolutional neural network (CNN). Then, the relationships between objects in the feature maps are made suitable for multi-label classification using the focal modulation network (FMA), an innovative attention module that is more effective than the self-attention approach. Experiments on a chest X-ray dataset containing both single and multiple labels for a total of 14 different diseases show that the proposed approach can provide superior performance on multi-label datasets.
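To make the multi-label setting concrete, the sketch below shows a common way to turn per-disease scores into a set of predicted labels: each score passes through an independent sigmoid and is thresholded, so several diseases can be predicted at once. The abstract does not describe the model's output layer, so this is a generic illustration; `predict_labels`, the threshold of 0.5, and the label names in the usage note are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_labels(logits, names, threshold=0.5):
    """Return every label whose independent probability meets the threshold.

    Unlike a softmax over classes, the probabilities here do not compete,
    so zero, one, or many labels can be returned for a single image.
    """
    if len(logits) != len(names):
        raise ValueError("one logit is required per label name")
    return [n for n, z in zip(names, logits) if sigmoid(z) >= threshold]
```

With hypothetical scores `[2.0, -1.0, 0.3]` for labels `["A", "B", "C"]`, the call returns `["A", "C"]`, since only the second probability falls below 0.5.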
Text-to-image person search is challenging due to the cross-scale correspondences and information inequality between modalities. Specifically, images and text are complexly linked at different scales and images are us...
This paper proposes a novel hybrid light field (LF) denoising method based on a convolutional neural network (CNN) designed to reflect the characteristics of LF images in both the pixel and frequency domains. Notin...
ISBN: (Print) 9798350349405; 9798350349399
Unsupervised image retrieval aims to learn the important visual characteristics without any given labels, so as to retrieve images similar to a given query image. Convolutional Neural Network (CNN)-based approaches have been extensively exploited with self-supervised contrastive learning for image hashing. However, the existing approaches suffer from ineffective utilization of global features by CNNs and from the bias created by false negative pairs in contrastive learning. In this paper, we propose a TransClippedCLR model that encodes the global context of an image using a Transformer, which retains local context through patch-based processing; generates the hash codes through product quantization; and avoids potential false negative pairs through clipped contrastive learning. The proposed model achieves superior performance for unsupervised image retrieval on benchmark datasets, including CIFAR10, NUS-Wide and Flickr25K, compared with recent state-of-the-art deep models. The results with the proposed clipped contrastive learning improve greatly on all datasets compared with the same backbone network trained with vanilla contrastive learning.
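The product quantization step mentioned above can be sketched in its textbook form: a feature vector is split into equal sub-vectors, and each sub-vector is replaced by the index of its nearest codeword in a per-segment codebook. This is not the TransClippedCLR implementation; the function name `pq_encode` and the hand-written codebooks in the usage note are assumptions, and in practice the codebooks would be learned.

```python
def pq_encode(vector, codebooks):
    """Encode a vector as one codeword index per segment.

    `codebooks` holds one codebook per segment; each codebook is a list of
    codewords whose length equals the segment length.
    """
    m = len(codebooks)
    if m == 0 or len(vector) % m != 0:
        raise ValueError("vector length must be divisible by the number of codebooks")
    d = len(vector) // m  # length of each sub-vector
    codes = []
    for i, book in enumerate(codebooks):
        sub = vector[i * d:(i + 1) * d]
        # Pick the codeword with the smallest squared Euclidean distance.
        best = min(
            range(len(book)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, book[k])),
        )
        codes.append(best)
    return codes
```

With two codebooks that each contain the codewords `[0, 0]` and `[1, 1]`, the vector `[0.9, 0.8, 0.1, 0.2]` encodes to `[1, 0]`. The resulting list of small integers is what makes the representation compact enough to serve as a hash-like code.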