ISBN: 9781665475921 (print)
Self-attention-based encoder-decoder models achieve dominant performance in image captioning. However, most existing image captioning models (ICMs) focus only on modeling the relations between spatial tokens and neglect channel-wise attention when computing visual representations. Because different channels of a visual representation usually denote different visual objects, this omission may lead to poor performance on object and attribute words in the captions that ICMs generate. In this paper, we propose a novel dual-stream self-attention module (DSM) to alleviate this issue. Specifically, we propose a parallel self-attention-based module that simultaneously encodes visual information along the spatial and channel dimensions. In addition, to obtain channel-wise visual features effectively and efficiently, we introduce a group self-attention block with linear computational complexity. To validate the effectiveness of our model, we conduct extensive experiments on standard image captioning benchmarks, including MSCOCO and Flickr30k. Without bells and whistles, the proposed model sets new state-of-the-art results, with CIDEr scores of 135.4 on MSCOCO and 70.8 on Flickr30k.
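The difference between the spatial and channel streams can be illustrated with a minimal sketch. It assumes plain scaled dot-product self-attention with no learned projections (Q = K = V) and fuses the two streams by summation; both choices, and all function names, are illustrative assumptions rather than the DSM described in the abstract. In particular, the channel stream here is ordinary attention over the transposed feature matrix, not the linear-complexity group self-attention block the authors introduce.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (assumption: no projections)."""
    scale = 1.0 / math.sqrt(len(x[0]))
    scores = [[v * scale for v in row] for row in matmul(x, transpose(x))]
    return matmul(softmax_rows(scores), x)

def dual_stream(x):
    """x holds N spatial tokens with C channels each; returns an N x C fused matrix."""
    spatial = self_attention(x)                        # attention weights are N x N
    channel = transpose(self_attention(transpose(x)))  # attention weights are C x C
    # Assumption: the two streams are fused by element-wise summation.
    return [[s + c for s, c in zip(rs, rc)] for rs, rc in zip(spatial, channel)]

# Four spatial tokens with three channels each (made-up values).
features = [[0.1, 0.5, -0.2], [0.4, -0.3, 0.8], [0.0, 0.2, 0.1], [0.9, -0.1, 0.3]]
fused = dual_stream(features)
print(len(fused), len(fused[0]))  # prints "4 3"
```

The spatial stream relates tokens to one another, while the channel stream relates feature channels to one another; transposing the input is all that separates the two in this simplified form.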
ISBN: 9781538607008 (print)
Automatic image cropping techniques have been developed recently to address the mismatch between the native display and image characteristics, such as resolution, aspect ratio, etc. These techniques usually rely on determining the importance of various regions in the image, or the aesthetic appeal of the final cropped image. In this work, we present a cropping method that combines bottom-up visual saliency and top-down semantic analysis to create a cropped image that best preserves important image content. Experimental results illustrate that the new method outperforms popular saliency-based cropping, which only relies on bottom-up analysis.
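One way to picture how a semantic cue can change a saliency-driven crop is the sketch below. The abstract does not say how the two analyses are combined, so the weighted sum in `combine_maps`, the fixed-size window search, and all names and values are assumptions made purely for illustration.

```python
def combine_maps(saliency, semantic, weight=0.5):
    """Blend two importance maps; `weight` is the share given to saliency (assumption)."""
    return [[weight * s + (1 - weight) * t for s, t in zip(rs, rt)]
            for rs, rt in zip(saliency, semantic)]

def integral_image(m):
    """Summed-area table with one row and one column of zero padding."""
    h, w = len(m), len(m[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += m[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def best_crop(importance, crop_h, crop_w):
    """Return (top, left) of the crop window that keeps the most total importance."""
    ii = integral_image(importance)
    h, w = len(importance), len(importance[0])
    best = None
    for top in range(h - crop_h + 1):
        for left in range(w - crop_w + 1):
            bottom, right = top + crop_h, left + crop_w
            total = ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
            if best is None or total > best[0]:
                best = (total, top, left)
    return best[1], best[2]

# Toy 4 x 4 maps: one bright salient pixel versus a semantically important 2 x 2 region.
saliency = [[0.9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
semantic = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

print(best_crop(combine_maps(saliency, semantic, weight=1.0), 2, 2))  # saliency only: (0, 0)
print(best_crop(combine_maps(saliency, semantic, weight=0.5), 2, 2))  # combined: (2, 2)
```

In this toy case the saliency-only crop locks onto a single bright pixel, while the combined map moves the window to the semantically important region.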
ISBN: 9781665475921 (print)
Tire pattern image classification is an important computer vision problem in public security, as it can guide police officers in detecting criminal cases. It remains challenging due to the small diversity within different classes. Generally, a tire pattern image classification system requires two characteristics: high accuracy and low computational cost. In this paper, we first assume that capturing rich feature representations will benefit tire classification and that learning through a lightweight network will improve computing efficiency. We then propose a simple yet efficient two-stage training mechanism: 1) we learn a feature extractor using a Variational Auto-Encoder framework constrained by contrastive learning, projecting images into a latent space with rich feature representations; 2) we train a single-layer linear classification network on the features extracted by the previously trained encoder. The Top-1 and Top-5 accuracies on the tire pattern dataset are 89.8% and 96.6%, respectively, validating the effectiveness of our strategy.
ISBN: 9781728185514 (print)
Synthetic DNA has received much attention recently as a long-term archival medium alternative due to its high density and durability characteristics. However, most current work has primarily focused on using DNA as a precise storage medium. In this work, we take an alternate view of DNA. Using neural-network-based compression techniques, we transform images into a latent-space representation, which we then store on DNA. By doing so, we transform DNA into an approximate image storage medium, as images generated back from DNA are only approximate representations of the original images. Using several datasets, we investigate the storage benefits of approximation, and study the impact of DNA storage errors (substitutions, indels, bias) on the quality of approximation. In doing so, we demonstrate the feasibility and potential of viewing DNA as an approximate storage medium.
ISBN: 9798331529543; 9798331529550 (print)
Plenoptic cameras are light field capturing devices able to acquire large amounts of angular and spatial information. The lenslet video produced by such cameras presents on each frame a distinctive hexagonal pattern of micro-images. Due to the particular structure of lenslet images, traditional video codecs perform poorly on lenslet video. Previous works have proposed a preprocessing scheme that cuts and realigns the micro-images on each lenslet frame. While effective, this method introduces high frequency components into the processed image. In this paper, we propose an additional step to the aforementioned scheme by applying an invertible smoothing transform. We evaluate the enhanced scheme on lenslet video sequences captured with single-focused and multi-focused plenoptic cameras. On average, the enhanced scheme achieves 9.85% bitrate reduction compared to the existing scheme.
ISBN: 9781728185514 (print)
In many image processing tasks, pixels or blocks of pixels are missing or lost in only some channels. For example, during defective transmissions of RGB images, one or more blocks in one color channel may be lost. Nearly all modern applications in image processing and transmission use at least three color channels, and some employ even more bands, for example in the infrared and ultraviolet ranges of the light spectrum. Typically, only some pixels and blocks in a subset of color channels are distorted. Thus, the other channels can be used to reconstruct the missing pixels, which is called spatio-spectral reconstruction. Current state-of-the-art methods rely purely on the local neighborhood, which works well for homogeneous regions. However, in high-frequency regions such as edges or textures, these methods fail to properly model the relationship between color bands. Hence, this paper introduces non-local filtering for building a linear regression model that describes the inter-band relationship and is used to reconstruct the missing pixels. Our novel method increases the PSNR by 2 dB on average and yields visually much more appealing images in high-frequency regions.
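The core idea of inter-band linear regression can be sketched as follows. This is a deliberately simplified illustration: it fits one unweighted line between an intact reference band and the damaged band using every known pixel, whereas the abstract describes selecting the regression support through non-local filtering. The function names and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A flat reference band carries no slope information; fall back to the mean.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x

def reconstruct(reference, damaged, missing):
    """Fill the `missing` indices of `damaged` from the intact `reference` band.

    Both bands are flat lists of pixel values for the same support region.
    """
    known = [i for i in range(len(damaged)) if i not in missing]
    slope, intercept = fit_line([reference[i] for i in known],
                                [damaged[i] for i in known])
    restored = list(damaged)
    for i in missing:
        restored[i] = slope * reference[i] + intercept
    return restored

# Made-up example: the red band equals 2 * green + 5, and one red pixel was lost.
green = [10, 20, 30, 40, 50]
red = [25, 45, None, 85, 105]
restored = reconstruct(green, red, missing={2})
print(restored[2])  # prints 65.0
```

Because the toy bands are related exactly linearly, the lost value is recovered exactly; on real images the quality depends on how well the chosen support pixels share the same inter-band relationship, which is the part the non-local filtering addresses.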
ISBN: 9798331529543; 9798331529550 (print)
Rapid advances in medical imaging have led to a growing demand for high-performance lossless compression of large 3D medical image datasets. Unlike natural images, medical images typically feature three-dimensional structures and high bit depth, necessitating specialized compression techniques. Based on a decoder-only transformer, we propose a learnable dual-decoder model for lossless compression of 3D medical images. Our approach packs voxels into patches, which are processed by a patch-level decoder to extract the patch feature. The voxels, along with the patch feature, are subsequently fed into a voxel-level decoder to model each voxel. This coarse-to-fine modeling strategy reduces the computational time for each voxel and enables the modeling of long-range dependencies. Experimental results demonstrate that our proposed model achieves state-of-the-art compression performance, with an improvement of approximately 15% over the traditional JP3D benchmark on various datasets.
ISBN: 9798331529543; 9798331529550 (print)
Traffic sign recognition plays a crucial role in self-driving cars, but unfortunately, it is vulnerable to adversarial patches (AP). Although previous studies have shown that AP can efficiently fool DNN-based models, the connection between image forensics and AP detection still needs to be explored. From a high-level point of view, their goals are the same: to find tampered regions while preventing false positives. A natural question arises: "Is achieving application-agnostic anomaly detection possible?" In this paper, we propose Image Forensics Defense Against Adversarial Patch (IDAP), a framework that defends against adversarial patches via generalizable features learned from tampered images. In addition, we incorporate the Hausdorff erosion loss into our network model for joint training to complete the shape of the predicted mask. Extensive experimental comparisons on three datasets (COCO, DFG, and APRICOT) demonstrate that IDAP outperforms state-of-the-art AP detection methods.
ISBN: 9781728180687 (print)
The design of stereo image quality assessment (SIQA) methods is often not well grounded in the biological theory of human vision, so many SIQA methods fail to achieve good consistency with subjective perception. Research on the visual system tends to focus on the dorsal and ventral pathways and ignores the information asymmetry in the early visual pathways. It is worth noting that the ON and OFF receptive fields in retinal ganglion cells (RGCs) respond asymmetrically to the statistical features of images. Inspired by this, we propose an SIQA method based on monocular and binocular visual features, which takes into account the asymmetry between bright and dark local-contrast features in the early visual pathways. First, we extract the response maps of ON and OFF cells in RGCs to the left and right views, respectively. Then, the different information fusion modes of the visual cortex are used to fuse the response maps of the left and right views. Finally, monocular and binocular features are extracted and fed to support vector regression (SVR) for quality regression. Experimental results show that the proposed method is superior to several mainstream SIQA metrics on two publicly available databases.
ISBN: 9780819469946 (print)
This paper proposes a line-segment-based image registration method. Edges are detected and partitioned into line segments. Line fitting is applied to every line segment to rule out segments with high fitting error. For each segment in a reference image, putative matching segments in a test image are picked using constraints obtained by analyzing affine transformations. Putative segment correspondences yield correspondences between segment intersections, which are used as matching points. An affine matrix is derived from those point correspondences and evaluated by the similarity metric. The segment correspondences that end up with higher similarity metrics are used to compute the final transformation. Experimental results show that the proposed method is robust, especially when salient points cannot be detected accurately.
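The step of deriving an affine matrix from point correspondences can be sketched as a small least-squares problem. The abstract does not state which estimator the authors use, so the normal-equation solve below, along with the function names and the sample points, is an assumption for illustration only. The model is x' = a*x + b*y + c and y' = d*x + e*y + f.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, r):
    """Solve the 3 x 3 linear system m * p = r with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("need at least three non-collinear points")
    solution = []
    for k in range(3):
        # Replace column k of m with the right-hand side.
        replaced = [[r[i] if j == k else m[i][j] for j in range(3)] for i in range(3)]
        solution.append(det3(replaced) / d)
    return solution

def estimate_affine(src, dst):
    """Least-squares 2 x 3 affine matrix mapping src points onto dst points."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal matrix A^T A, shared by both output coordinates.
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    affine = []
    for axis in range(2):
        rhs = [sum(r[i] * p[axis] for r, p in zip(rows, dst)) for i in range(3)]
        affine.append(solve3(normal, rhs))
    return affine

# Made-up intersections related by x' = 2x + y + 3 and y' = -x + 2y - 1.
src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(3, -1), (5, -2), (4, 1), (6, 0)]
affine = estimate_affine(src, dst)
print([[round(v, 6) for v in row] for row in affine])
```

With noise-free correspondences the estimate reproduces the generating transform; with noisy matches it gives the best fit in the least-squares sense, which is why a separate similarity check on each candidate matrix, as the abstract describes, is still needed.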