ISBN: 9780769549897 (print)
We posit that user behavior during natural viewing of images contains an abundance of information about image content, as well as information related to user intent and user-defined content importance. In this paper, we conduct experiments to better understand the relationship between images, the eye movements people make while viewing images, and how people construct natural language to describe images. We explore these relationships in the context of two commonly used computer vision datasets. We then further relate human cues to the outputs of current visual recognition systems and demonstrate prototype applications for gaze-enabled detection and annotation.
ISBN: 9781538604571 (print)
We develop a unified framework for complex event retrieval, recognition and recounting. The framework is based on a compact video representation that exploits the temporal correlations in image features. Our feature alignment procedure identifies and removes feature redundancies across frames and outputs an intermediate tensor representation we call the video imprint. The video imprint is then fed into a reasoning network, whose attention mechanism parallels that of memory networks used in language modeling. The reasoning network simultaneously recognizes the event category and locates the key pieces of evidence for event recounting. In event retrieval tasks, we show that the compact video representation aggregated from the video imprint achieves significantly better retrieval accuracy than existing methods. We also set new state-of-the-art results in event recognition tasks, with an additional benefit: the latent structure in our reasoning network highlights areas of the video imprint and can be used directly for event recounting. As the video imprint maps back to locations in the video frames, the network identifies not only key frames but also the specific areas inside each frame that are most influential in the decision process.
ISBN: 0780342364 (print)
This paper describes a representation for people and animals, called a body plan, which is adapted to segmentation and to recognition in complex environments. The representation is an organized collection of grouping hints obtained from a combination of constraints on color and texture and constraints on geometric properties such as the structure of individual parts and the relationships between parts. Body plans can be learned from image data, using established statistical learning techniques. The approach is illustrated with two examples of programs that successfully use body plans for recognition: one example involves determining whether a picture contains a scantily clad human, using a body plan built by hand; the other involves determining whether a picture contains a horse, using a body plan learned from image data. In both cases, the system demonstrates excellent performance on large, uncontrolled test sets and very large and diverse control sets.
ISBN: 9781665445092 (print)
Recently, object detection in aerial images has gained much attention in computer vision. Unlike objects in natural images, aerial objects are often distributed with arbitrary orientation. The detector therefore requires more parameters to encode the orientation information, and these parameters are often highly redundant and inefficient. Moreover, as ordinary CNNs do not explicitly model orientation variation, a large amount of rotation-augmented data is needed to train an accurate object detector. In this paper, we propose a Rotation-equivariant Detector (ReDet) to address these issues, which explicitly encodes rotation equivariance and rotation invariance. More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction in model size. Based on the rotation-equivariant features, we also present Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of the RoI. Extensive experiments on several challenging aerial image datasets, DOTA-v1.0, DOTA-v1.5 and HRSC2016, show that our method achieves state-of-the-art performance on the task of aerial object detection. Compared with the previous best results, ReDet gains 1.2, 3.5 and 2.6 mAP on DOTA-v1.0, DOTA-v1.5 and HRSC2016 respectively, while reducing the number of parameters by 60% (313 Mb vs. 121 Mb).
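The distinction between rotation equivariance and rotation invariance in the abstract above can be illustrated with a toy sketch. It is not the ReDet architecture or RiRoI Align: it assumes only the four 90-degree rotations, a single 3x3 patch, and hypothetical helper names (`rot90`, `responses`). Correlating a patch with all four rotated copies of a filter gives a response vector that shifts cyclically when the patch is rotated (equivariance); taking the maximum over that vector gives a value that does not change (invariance).

```python
def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def dot(a, b):
    """Sum of elementwise products of two equally sized grids."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def responses(patch, filt):
    """Correlate the patch with the filter rotated by 0, 90, 180, 270 degrees."""
    out = []
    for _ in range(4):
        out.append(dot(patch, filt))
        filt = rot90(filt)
    return out

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filt = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]   # hypothetical corner-detecting filter

r = responses(patch, filt)
r_rot = responses(rot90(patch), filt)

# Equivariance: rotating the input cyclically shifts the orientation responses.
shifted = r[-1:] + r[:-1]
# Invariance: pooling over the orientation axis removes the shift.
pooled, pooled_rot = max(r), max(r_rot)
```

Here `r_rot` equals `shifted`, and `pooled` equals `pooled_rot`, because rotating both the patch and the filter by the same angle leaves their correlation unchanged.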
ISBN: 9781509014378 (print)
Motivated by the success of CNNs in object recognition on images, researchers are striving to develop CNN equivalents for learning video features. However, learning video features globally has proven to be quite a challenge due to the difficulty of getting enough labels, processing large-scale video data, and representing motion information. Therefore, we propose to leverage effective techniques from both data-driven and data-independent approaches to improve action recognition systems. Our contribution is three-fold. First, we explicitly show that local handcrafted features and CNNs share the same convolution-pooling network structure. Second, we propose to use independent subspace analysis (ISA) to learn descriptors for state-of-the-art handcrafted features. Third, we enhance ISA with two new improvements, which make our learned descriptors significantly outperform the handcrafted ones. Experimental results on standard action recognition benchmarks show competitive performance.
ISBN: 9781509014378 (print)
Human action recognition has emerged as one of the most challenging and active areas of research in the computer vision domain. In addition to pose variation and scale variability, the high complexity of human motions and the variability of object interactions represent significant challenges. In this paper, we present an approach for human-object interaction modeling and classification. Towards that goal, we adopt relevant frame-level features: the inter-joint distances and joint-object distances. These features are insensitive to position and pose variation. The evolution of these distances in time is modeled by trajectories in a high-dimensional space, and a shape analysis framework is used to model and compare the trajectories corresponding to human-object interaction in a Riemannian manifold. Experiments conducted following state-of-the-art settings demonstrate the strength of the proposed method. Using only the skeletal information, we achieve state-of-the-art classification results on the benchmark dataset.
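The frame-level features named in the abstract above can be sketched as pairwise Euclidean distances. This is a minimal illustration rather than the authors' implementation: the function name `frame_features`, the three-joint skeleton, and the representation of the object as a single 3D point are all assumptions. The sketch shows why such distances do not change when the whole scene is translated.

```python
import math
from itertools import combinations

def frame_features(joints, obj):
    """Return inter-joint distances followed by joint-object distances.

    joints: list of (x, y, z) joint positions (hypothetical skeleton layout).
    obj:    (x, y, z) object position (assumed to be a single point).
    """
    inter_joint = [math.dist(a, b) for a, b in combinations(joints, 2)]
    joint_object = [math.dist(j, obj) for j in joints]
    return inter_joint + joint_object

def translate(point, offset):
    """Shift a 3D point by the given offset."""
    return tuple(p + o for p, o in zip(point, offset))

joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
obj = (1.0, 0.0, 0.0)
offset = (5.0, -2.0, 3.0)

original = frame_features(joints, obj)
moved = frame_features([translate(j, offset) for j in joints],
                       translate(obj, offset))
# original and moved agree up to floating-point rounding.
```

Stacking one such vector per frame gives a trajectory over time, which is the kind of object the abstract describes comparing with a shape analysis framework.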
ISBN: 9781479951178 (print)
We introduce a new approach for recognizing and reconstructing 3D objects in images. Our approach is based on an analysis-by-synthesis strategy. A forward synthesis model constructs possible geometric interpretations of the world, and then selects the interpretation that best agrees with the measured visual evidence. The forward model synthesizes visual templates defined on invariant (HOG) features. These visual templates are discriminatively trained to be accurate for inverse estimation. We introduce an efficient "brute-force" approach to inference that searches through a large number of candidate reconstructions, returning the optimal one. One benefit of such an approach is that recognition is inherently (re)constructive. We show state-of-the-art performance for detection and reconstruction on two challenging 3D object recognition datasets of cars and cuboids.
ISBN: 9781665445092 (print)
From a simplified analysis of adaptive methods, we derive AvaGrad, a new optimizer which outperforms SGD on vision tasks when its adaptability is properly tuned. We observe that the power of our method is partially explained by a decoupling of learning rate and adaptability, greatly simplifying hyperparameter search. In light of this observation, we demonstrate that, against conventional wisdom, Adam can also outperform SGD on vision tasks, as long as the coupling between its learning rate and adaptability is taken into account. In practice, AvaGrad matches the best results, as measured by generalization accuracy, delivered by any existing optimizer (SGD or adaptive) across image classification (CIFAR, ImageNet) and character-level language modelling (Penn Treebank) tasks. When training GANs, AvaGrad improves upon existing optimizers.
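The coupling between learning rate and adaptability mentioned in the abstract above can be made concrete with the standard Adam update. This sketch is not AvaGrad, and it assumes that "adaptability" corresponds to Adam's epsilon term, which the abstract does not state. The function name `adam_first_step` is hypothetical. With a tiny epsilon the step has magnitude close to the learning rate regardless of the gradient's size; with a large epsilon the step is close to (lr / eps) times the gradient, so the effective step size depends on both hyperparameters together.

```python
import math

def adam_first_step(grad, lr, eps, beta1=0.9, beta2=0.999):
    """Size of the first Adam update for a scalar gradient (standard Adam)."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Tiny eps: fully adaptive, step is close to lr * sign(grad).
small_eps = adam_first_step(grad=1e-3, lr=0.1, eps=1e-8)
# Large eps: nearly non-adaptive, step is close to (lr / eps) * grad.
large_eps = adam_first_step(grad=1e-3, lr=0.1, eps=1.0)
```

Under this assumption, changing epsilon without rescaling the learning rate changes the effective step size by orders of magnitude, which is one way to read the abstract's remark that the coupling must be taken into account during hyperparameter search.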
ISBN: 9781509014378 (print)
Background subtraction is a basic problem for change detection in videos and also the first step of high-level computer vision applications. Most background subtraction methods rely on color and texture features. However, due to illumination changes across scenes and the effect of noisy pixels, those methods often produce many false positives in complex environments. To solve this problem, we propose an adaptive background subtraction model which uses a novel Local SVD Binary Pattern (LSBP) feature instead of depending only on color intensity. This feature can describe the potential structure of the local regions in a given image, and thus enhances robustness to illumination variation, noise, and shadows. We use a sample consensus model, which is well suited to our LSBP feature. Experimental results on the CDnet 2012 dataset demonstrate that our background subtraction method using the LSBP feature is more effective than many state-of-the-art methods.
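The sample consensus idea mentioned in the abstract above can be sketched generically. This is not the authors' model: it uses a scalar intensity in place of the LSBP feature, and the function name `is_background`, the radius, and the minimum match count are hypothetical values chosen for illustration. A pixel is labeled background when enough of its stored past samples lie close to the current observation.

```python
def is_background(value, samples, radius=20.0, min_matches=2):
    """Generic sample-consensus test for one pixel.

    value:       current observation (scalar intensity in this sketch).
    samples:     stored past observations for the same pixel.
    radius:      maximum distance for a sample to count as a match (assumed).
    min_matches: matches required to call the pixel background (assumed).
    """
    matches = sum(1 for s in samples if abs(value - s) <= radius)
    return matches >= min_matches

samples = [100.0, 104.0, 98.0, 150.0]       # hypothetical pixel history
stable_pixel = is_background(101.0, samples)    # three samples within radius
changed_pixel = is_background(200.0, samples)   # no sample within radius
```

Replacing the scalar distance with a distance between richer local descriptors is where a feature such as LSBP would enter.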
ISBN: 0780342364 (print)
Several vision problems can be reduced to the problem of fitting a linear surface of low dimension to data, including the problems of structure-from-affine-motion and of characterizing the intensity images of a Lambertian scene by constructing the intensity manifold. For these problems, one must deal with a data matrix with some missing elements. In structure-from-motion, missing elements occur if some point features are not visible in some frames. When constructing the intensity manifold, missing matrix elements arise when the surface normals of some scene points do not face the light source in some images. We propose a novel method for fitting a low-rank matrix to a matrix with missing elements. We show experimentally that our method produces good results in the presence of noise. These results can either be used directly or serve as an excellent starting point for an iterative method.
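The problem statement above can be illustrated with a small sketch of an iterative method of the kind the abstract alludes to. It uses alternating least squares restricted to rank 1, which is a generic technique and not the method proposed in the paper. The function name `rank1_fit` and the use of `None` to mark missing entries are assumptions made for this example.

```python
def rank1_fit(M, iters=100):
    """Fit M[i][j] ~ u[i] * v[j] using only observed entries (None = missing)."""
    rows, cols = len(M), len(M[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iters):
        # Update each v[j] by least squares over the rows observed in column j.
        for j in range(cols):
            obs = [i for i in range(rows) if M[i][j] is not None]
            den = sum(u[i] * u[i] for i in obs)
            if den > 0:
                v[j] = sum(u[i] * M[i][j] for i in obs) / den
        # Update each u[i] by least squares over the columns observed in row i.
        for i in range(rows):
            obs = [j for j in range(cols) if M[i][j] is not None]
            den = sum(v[j] * v[j] for j in obs)
            if den > 0:
                u[i] = sum(v[j] * M[i][j] for j in obs) / den
    return u, v

# Outer product of [1, 2, 3] and [1, 2, 4] with two entries hidden.
M = [[1.0, 2.0, None],
     [2.0, None, 8.0],
     [3.0, 6.0, 12.0]]
u, v = rank1_fit(M)
filled = [[u[i] * v[j] for j in range(3)] for i in range(3)]
```

In this noise-free toy case the two hidden entries, `filled[0][2]` and `filled[1][1]`, are both recovered as approximately 4. With noisy data or higher rank, an iteration like this depends on its starting point, which is why a good initial estimate matters.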