ISBN (print): 9798350353006
Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by inference latency, which stems from their autoregressive way of generating predictions. We propose a parallel-decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting it to conditional distributions as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart but is faster at inference time, moving from the linear complexity associated with sequential token generation to constant-time joint inference.
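The abstract does not spell out the Query-CTC loss, so the following is only a sketch of the standard CTC forward recursion, which illustrates what "marginalizing over multiple inference paths" means: it sums the probability of every alignment that collapses to the target sequence. The function name `ctc_probability`, the blank index, and the toy probabilities are assumptions made for illustration and are not taken from the paper.

```python
def ctc_probability(probs, target, blank=0):
    """Sum the probability of all alignments that collapse to `target`.

    probs[t][k] is the probability of symbol k at output position t
    (positions are predicted independently, as in parallel decoding).
    This is plain CTC, not the paper's Query-CTC variant.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # alpha[s]: total probability of reaching ext[s] at the current position
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end on the last symbol or on the trailing blank
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Two output positions, vocabulary {0: blank, 1: "a"}, uniform probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p_a = ctc_probability(uniform, [1])
```

In this toy case three alignments collapse to the single token `a` (`a a`, `a blank`, `blank a`), each with probability 0.25, so `p_a` is 0.75. A training loss would typically be the negative logarithm of this quantity, computed in log space for numerical stability.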
ISBN (print): 0818672587
The vast majority of corner and edge detectors measure image intensity gradients in order to estimate the positions and strengths of features. However, many of the most popular intensity gradient estimators are inherently and significantly anisotropic. In spite of this, few algorithms take the anisotropy into account, and so the set of features uncovered is typically sensitive to rotations of the image, compromising recognition, matching (e.g. stereo), and tracking. We introduce an effective technique for removing unwanted anisotropies from analytical gradient estimates, by measuring local intensity gradients in four directions rather than the more traditional two. In experiments using real image data, our algorithm reduces the gradient anisotropy associated with conventional analytical gradient estimates by up to 85%, yielding more consistent feature topologies.
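The abstract does not give the estimator itself, so the sketch below is an assumption rather than the authors' technique: it measures central differences along four directions (0°, 45°, 90°, 135°) and combines them into a single gradient vector by least squares. The function name `four_direction_gradient` and the ramp test image are hypothetical.

```python
import math


def four_direction_gradient(img, x, y):
    """Estimate (gx, gy) at interior pixel (x, y) of a 2D list img[y][x].

    Each directional difference d_k approximates g . u_k for a unit
    vector u_k. For the four directions used here, sum(u_k u_k^T) = 2I,
    so the least-squares solution is g = 0.5 * sum(d_k * u_k).
    """
    s = math.sqrt(2.0)
    # Central differences; diagonal neighbours are sqrt(2) pixels away.
    d0 = (img[y][x + 1] - img[y][x - 1]) / 2.0
    d90 = (img[y + 1][x] - img[y - 1][x]) / 2.0
    d45 = (img[y + 1][x + 1] - img[y - 1][x - 1]) / (2.0 * s)
    d135 = (img[y + 1][x - 1] - img[y - 1][x + 1]) / (2.0 * s)

    gx = 0.5 * (d0 + (d45 - d135) / s)
    gy = 0.5 * (d90 + (d45 + d135) / s)
    return gx, gy


# A linear ramp I(x, y) = 3x - 2y has gradient (3, -2) everywhere.
ramp = [[3.0 * x - 2.0 * y for x in range(5)] for y in range(5)]
gx, gy = four_direction_gradient(ramp, 2, 2)
```

On the linear ramp the estimate recovers the true gradient (3, -2) up to floating-point error. Real images are not locally linear, and that is where a two-direction estimator and a four-direction estimator can respond differently to rotation.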
ISBN (print): 9781424469840
State-of-the-art motion estimation algorithms suffer from three major problems: poorly textured regions, occlusions, and small-scale image structures. Based on the Gestalt principles of grouping, we propose to incorporate a low-level image segmentation process in order to tackle these problems. Our new motion estimation algorithm is based on non-local total variation regularization, which allows us to integrate the low-level image segmentation process in a unified variational framework. Numerical results on the Middlebury optical flow benchmark data set demonstrate that we can cope with the aforementioned problems.
ISBN (print): 0780342364
We study occluding contour artifacts in area-based stereo matching: they are false responses of the matching operator to the occlusion boundary and cause objects to extend beyond their true boundaries in disparity maps. Most matching methods suffer from these artifacts; the effect is so strong that it cannot be ignored. We show what gives rise to the artifacts and design a matching criterion that accommodates the presence of occlusions, as opposed to methods that identify and remove the artifacts. This approach leads to the problem of measurement contamination studied in statistics. We show that such a problem is hard given finite computational resources, unless more independent measurements directly related to occluding contours are available. What can be achieved is a substantial reduction of the artifacts, especially for large matching templates. Reduced artifacts allow for easier hierarchical matching and for easy fusion of reconstructions from different viewpoints into a coherent whole.
ISBN (print): 9781424469840
Winder et al. [15, 14] have recently shown the superiority of the DAISY descriptor [12] in comparison to other widely used descriptors such as SIFT [8] and SURF [1]. Motivated by those results, we present a novel algorithm that extracts viewpoint- and illumination-invariant keypoints and describes them with a particular implementation of a DAISY-like layout. We demonstrate how to efficiently compute the scale-space and re-use this information for the descriptor. Comparisons to similar approaches such as SIFT and SURF show higher precision vs. recall performance for the proposed method. Moreover, we dramatically reduce the computational cost, by factors of 6 and 3, respectively. We also demonstrate the use of the proposed method in computer vision applications.
ISBN (print): 9781467388511
Recently, algorithms for object recognition and related tasks have become sufficiently proficient that new vision tasks can now be pursued. In this paper, we build a system capable of answering open-ended text-based questions about images, which is known as Visual Question Answering (VQA). Our approach's key insight is that we can predict the form of the answer from the question. We formulate our solution in a Bayesian framework. When our approach is combined with a discriminative model, the combined model achieves state-of-the-art results on four benchmark datasets for open-ended VQA: DAQUAR, COCO-QA, The VQA Dataset, and Visual7W.
ISBN (print): 9781538604571
We propose the residual expansion (RE) algorithm: a global (or near-global) optimization method for nonconvex least squares problems. Unlike most existing nonconvex optimization techniques, the RE algorithm is not based on either stochastic or multi-point searches; therefore, it can achieve fast global optimization. Moreover, the RE algorithm is easy to implement and successful in high-dimensional optimization. The RE algorithm exhibits excellent empirical performance in k-means clustering, point-set registration, optimized product quantization, and blind image deblurring.
ISBN (print): 9781728171685
Unsupervised representation learning holds the promise of exploiting large amounts of unlabeled data to learn general representations. A promising technique for unsupervised learning is the framework of Variational Auto-encoders (VAEs). However, unsupervised representations learned by VAEs are significantly outperformed by those learned by supervised learning for recognition. Our hypothesis is that, to learn useful representations for recognition, the model needs to be encouraged to learn about repeating and consistent patterns in data. Drawing inspiration from work on mid-level representation discovery, we propose PatchVAE, which reasons about images at the patch level. Our key contribution is a bottleneck formulation that encourages mid-level style representations in the VAE framework. Our experiments demonstrate that representations learned by our method perform much better on recognition tasks than those learned by vanilla VAEs.
ISBN (print): 9781665445092
We propose a novel method to reconstruct volumetric flows from sparse views via a global transport formulation. Instead of obtaining the space-time function of the observations, we reconstruct its motion based on a single initial state. In addition, we introduce a learned self-supervision that constrains observations from unseen angles. These visual constraints are coupled via the transport constraints and a differentiable rendering step to arrive at a robust end-to-end reconstruction algorithm. This makes the reconstruction of highly realistic flow motions possible, even from only a single input view. We show with a variety of synthetic and real flows that the proposed global reconstruction of the transport process yields an improved reconstruction of the fluid motion.
ISBN (print): 0818672587
We present a set of algorithms and a search strategy for the robust content-based retrieval of multispectral satellite images. Since the property of interest in these images is usually the physical characteristics of ground cover, we use representations and methods that are invariant to illumination and atmospheric conditions. The representations and algorithms are derived for this application from a physical model for the formation of multispectral satellite images. The use of several representations and algorithms is necessary to interpret the diversity of physical and geometric structure in these images. Algorithms are used that exploit multispectral distributions, multispectral spatial structure, and labeled classes. The performance of the system is demonstrated on a large set of multispectral satellite images taken over different areas of the United States under different illumination and atmospheric conditions.