ISBN: 9781424439928 (print)
In factorization approaches to nonrigid structure from motion, the 3D shape of a deforming object is usually modeled as a linear combination of a small number of basis shapes. The original approach to simultaneously estimating the shape basis and nonrigid structure exploited orthonormality constraints for metric rectification. Recently it has been asserted that structure recovery through orthonormality constraints alone is inherently ambiguous and cannot result in a unique solution. This assertion has been accepted as conventional wisdom and is the justification for many remedial heuristics in the literature. Our key contribution is to prove that orthonormality constraints are in fact sufficient to recover the 3D structure from image observations alone. We characterize the true nature of the ambiguity in using orthonormality constraints for the shape basis and show that it has no impact on structure reconstruction. We conclude from our experimentation that the primary challenge in using a shape basis for nonrigid structure from motion is the difficulty of the optimization problem rather than the ambiguity in orthonormality constraints.
ISBN: 9781424439928 (print)
This paper exploits the context of natural dynamic scenes for human action recognition in video. Human actions are frequently constrained by the purpose and the physical properties of scenes and demonstrate high correlation with particular scene classes. For example, eating often happens in a kitchen while running is more common outdoors. The contribution of this paper is three-fold: (a) we automatically discover relevant scene classes and their correlation with human actions, (b) we show how to learn selected scene classes from video without manual supervision, and (c) we develop a joint framework for action and scene recognition and demonstrate improved recognition of both in natural video. We use movie scripts as a means of automatic supervision for training. For selected action classes we identify correlated scene classes in text and then retrieve video samples of actions and scenes for training using script-to-video alignment. Our visual models for scenes and actions are formulated within the bag-of-features framework and are combined in a joint scene-action SVM-based classifier. We report experimental results and validate the method on a new large dataset with twelve action classes and ten scene classes acquired from 69 movies.
This paper presents a system aimed to serve as the enabling platform for a wearable assistant. The method observes manipulations from a wearable camera and classifies activities from roughly stabilized low resolution ...
ISBN: 9781424439928 (print)
In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in the scene, but not when and where they are on the screen. We formulate the learning problem in this setting as partially-supervised multiclass classification where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. Motivated by the analysis, we propose a general convex learning formulation based on minimization of a surrogate loss appropriate for the ambiguous label setting. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies. We experiment on a very large dataset consisting of 100 hours of video, and in particular achieve 6% error for character naming on 16 episodes of LOST.
ISBN: 9781424439928 (print)
We introduce a text-based image feature and demonstrate that it consistently improves performance on hard object classification problems. The feature is built using an auxiliary dataset of images annotated with tags, downloaded from the internet. We do not inspect or correct the tags and expect that they are noisy. We obtain the text feature of an unannotated image from the tags of its k-nearest neighbors in this auxiliary collection. A visual classifier presented with an object viewed under novel circumstances (say, a new viewing direction) must rely on its visual examples. Our text feature may not change, because the auxiliary dataset likely contains a similar picture. While the tags associated with images are noisy, they are more stable when appearance changes. We test the performance of this feature using the PASCAL VOC 2006 and 2007 datasets. Our feature performs well, consistently improves the performance of visual object classifiers, and is particularly effective when the training dataset is small.
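The idea of deriving a text feature from the tags of an image's k-nearest neighbors can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean metric, the averaging of binary tag-indicator vectors, and the names `text_feature`, `aux_vis`, and `aux_tags` are all assumptions made for illustration.

```python
import numpy as np

def text_feature(query_vis, aux_vis, aux_tags, k=3):
    """Average the tag-indicator vectors of the k nearest auxiliary images.

    query_vis: (d,) visual descriptor of the unannotated image.
    aux_vis:   (n, d) visual descriptors of the auxiliary images.
    aux_tags:  (n, t) binary tag-indicator matrix (possibly noisy).
    """
    # Distance from the query to every auxiliary image (Euclidean is an assumption).
    dists = np.linalg.norm(aux_vis - query_vis, axis=1)
    # Indices of the k visually closest auxiliary images.
    nearest = np.argsort(dists)[:k]
    # Pool their tags into a single text feature.
    return aux_tags[nearest].mean(axis=0)

# Toy auxiliary collection: three similar images and one outlier.
aux_vis = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
aux_tags = np.array([[1, 0, 1],
                     [1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 1]], dtype=float)
feature = text_feature(np.array([0.05, 0.05]), aux_vis, aux_tags, k=3)
```

In this toy case the first tag is shared by all three neighbors and dominates the feature, while the two tags that each appear on a single neighbor are down-weighted, which is one way the pooling can absorb tag noise.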
ISBN: 9781424439928 (print)
Latent Variable Models (LVMs), like the Shared-GPLVM and the Spectral Latent Variable Model, help mitigate over-fitting when learning discriminative methods from small or moderately sized training sets. Nevertheless, existing methods suffer from several problems: 1) complexity; 2) the lack of explicit mappings to and from the latent space; 3) an inability to cope with multi-modality; and 4) the lack of a well-defined density over the latent space. We propose an LVM called the Shared Kernel Information Embedding (sKIE). It defines a coherent density over a latent space and multiple input/output spaces (e.g., image features and poses), and it is easy to condition on a latent state, or on combinations of the input/output states. Learning is quadratic, and it works well on small datasets. With datasets too large to learn a coherent global model, one can use sKIE to learn local online models. sKIE permits missing data during inference, and partially labelled data during learning. We use sKIE for human pose inference.
ISBN: 9781424439928 (print)
This paper presents a novel self-similarity based approach to the problem of vanishing point estimation in man-made scenes. A vanishing point (VP) is the convergence point of a pencil (a set of concurrent lines) that is the perspective projection of a corresponding set of parallel lines in the scene. Unlike traditional VP detection, which relies on the extraction and grouping of individual straight lines, our approach detects entire pencils based on a property of 1D affine-similarity between parallel cross-sections of a pencil. Our approach is not limited to real pencils. Under some conditions (normally met in man-made scenes), our method can detect pencils made of virtual lines passing through similar image features, and hence can detect VPs from repeating patterns that do not contain straight edges. We demonstrate that detecting entire pencils rather than individual lines makes detection more robust: it improves VP detection in challenging conditions, such as very low resolution or weak edges, and simultaneously reduces the VP false-detection rate when only a small number of lines are detectable.
ISBN: 9781424439928 (print)
We investigate the problem of automatically labelling faces of characters in TV or movie material with their names, using only weak supervision from automatically-aligned subtitle and script text. Our previous work (Everingham et al. [8]) demonstrated promising results on the task, but the coverage of the method (proportion of video labelled) and its generalization were limited by a restriction to frontal faces and nearest neighbour classification. In this paper we build on that method, extending the coverage greatly by the detection and recognition of characters in profile views. In addition, we make the following contributions: (i) seamless tracking, integration and recognition of profile and frontal detections, and (ii) a character-specific multiple kernel classifier which is able to learn the features best able to discriminate between the characters. We report results on seven episodes of the TV series "Buffy the Vampire Slayer", demonstrating significantly increased coverage and performance with respect to previous methods on this material.
ISBN: 9781424439928 (print)
An important research area in computer vision is developing algorithms that can reconstruct the 3D surface of an object represented by a single 2D line drawing. Previous work on 3D reconstruction from single 2D line drawings focuses on objects with planar faces. In this paper, we propose a novel approach to the reconstruction of solid objects that have not only planar but also curved faces. Our approach consists of four steps: (1) identifying the curved faces and planar faces in a line drawing, (2) transforming the line drawing into one with straight edges only, (3) reconstructing the 3D wireframe of the curved object from the transformed line drawing and the original line drawing, and (4) generating the curved faces with Bezier patches and triangular meshes. With a number of experimental results, we demonstrate the ability of our approach to perform curved object reconstruction successfully.
ISBN: 9781424439928 (print)
Detecting image pairs with a common field of view is an important prerequisite for many computer vision tasks. Typically, common local features are used as a criterion for identifying such image pairs. This approach, however, requires a reliable method for matching features, which is generally a very difficult problem, especially in situations with a wide baseline or ambiguities in the scene. We propose two new approaches to the common field of view problem. The first one is still based on feature matching. Instead of requiring a very low false positive rate for the feature matching, however, geometric constraints are used to assess matches which may contain many false positives. The second approach completely avoids hard matching of features by evaluating the entropy of correspondence probabilities. We perform quantitative experiments on three different hand-labeled scenes of varying difficulty. In moderately difficult situations with a medium baseline and few ambiguities in the scene, our proposed methods give results as good as those of the classical matching-based method. On the most challenging scene, which has a wide baseline and many ambiguities, the performance of the classical method deteriorates, while ours are much less affected and still produce good results. Hence, our methods show the best overall performance in a combined evaluation.
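To make "evaluating the entropy of correspondence probabilities" concrete, here is a minimal sketch of one possible reading. It is not the method of the paper: turning descriptor distances into probabilities with a softmax, the `temperature` parameter, and the function name `mean_correspondence_entropy` are all assumptions chosen for illustration.

```python
import numpy as np

def mean_correspondence_entropy(desc_a, desc_b, temperature=1.0):
    """Mean entropy of soft correspondences from features in A to features in B.

    desc_a: (n, d) descriptors from image A; desc_b: (m, d) from image B.
    Low entropy means each feature in A has a confident counterpart in B.
    """
    # Pairwise squared descriptor distances, shape (n, m).
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=2)
    # Softmax over candidates in B (assumed probability model).
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    # Shannon entropy per feature in A, then averaged; no hard matching is done.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return entropy.mean()

desc_a = np.array([[0.0, 0.0], [10.0, 10.0]])
# Image with clear counterparts versus one where every candidate looks alike.
low = mean_correspondence_entropy(desc_a, np.array([[0.0, 0.0], [10.0, 10.0]]))
high = mean_correspondence_entropy(desc_a, np.array([[5.0, 5.0], [5.0, 5.0]]))
```

Under these assumptions a pair with distinctive counterparts yields near-zero entropy, while a pair whose candidates are indistinguishable yields the maximum, log(m), so the score could be thresholded without ever committing to individual matches.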