We assess the applicability of several popular learning methods for the problem of recognizing generic visual categories with invariance to pose, lighting, and surrounding clutter. A large dataset comprising stereo image pairs of 50 uniform-colored toys under 36 azimuths, 9 elevations, and 6 lighting conditions was collected (for a total of 194,400 individual images). The objects were 10 instances of 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training, and the other five for testing. Low-resolution grayscale images of the objects with various amounts of variability and surrounding clutter were used for training and testing. Nearest Neighbor methods, Support Vector Machines, and Convolutional Networks, operating on raw pixels or on PCA-derived features, were tested. Test error rates for unseen object instances placed on uniform backgrounds were around 13% for SVM and 7% for Convolutional Nets. On a segmentation/recognition task with highly cluttered images, SVM proved impractical, while Convolutional Nets yielded 16/7% error. A real-time version of the system was implemented that can detect and classify objects in natural scenes at around 10 frames per second.
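The image count quoted in the abstract can be checked with a short calculation. The sketch below is only an arithmetic illustration; the function name `total_images` and the assumption that each stereo pair contributes two individual images are ours, not taken from the paper.

```python
def total_images(objects, azimuths, elevations, lightings, cameras=2):
    """Count individual images, assuming one image per camera of the
    stereo pair for every (object, azimuth, elevation, lighting) combination."""
    return objects * azimuths * elevations * lightings * cameras


# 50 toys x 36 azimuths x 9 elevations x 6 lighting conditions x 2 cameras
count = total_images(50, 36, 9, 6)
print(count)
```

Under these assumptions the product agrees with the 194,400 images stated in the abstract.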
In this paper, we integrate space carving and eigen detection methods to develop a bottom-up 3D human limb detector. We model the body in terms of its constituent body parts; here we focus on the head, lower arms, uppe...
In this paper, we propose a stochastic algorithm using Markov chain Monte Carlo (MCMC) to automatically reconstruct buildings from a single image of architectural scenes by integrating segmentation and reconstruction....
This paper provides a comprehensive analysis of exactly what visual information about the world is embedded within a single image of an eye. It turns out that the cornea of an eye and a camera viewing the eye form a catadioptric imaging system. We refer to this as a corneal imaging system. Unlike a typical catadioptric system, a corneal one is flexible in that the reflector (cornea) is not rigidly attached to the camera. Using a geometric model of the cornea based on anatomical studies, its 3D location and orientation can be estimated from a single image of the eye. Once this is done, a wide-angle view of the environment of the person can be obtained from the image. In addition, we can compute the projection of the environment onto the retina with its center aligned with the gaze direction. This foveated retinal image reveals what the person is looking at. We present a detailed analysis of the characteristics of the corneal imaging system including field of view, resolution and locus of viewpoints. When both eyes of a person are captured in an image, we have a stereo corneal imaging system. We analyze the epipolar geometry of this stereo system and show how it can be used to compute 3D structure. The framework we present in this paper for interpreting eye images is passive and non-invasive. It has direct implications for several fields including visual recognition, human-machine interfaces, computer graphics and human affect studies.
Video cameras must produce images at a reasonable frame-rate and with a reasonable depth of field. These requirements impose fundamental physical limits on the spatial resolution of the image detector. As a result, current cameras produce videos with a very low resolution. The resolution of videos can be computationally enhanced by moving the camera and applying super-resolution reconstruction algorithms. However, a moving camera introduces motion blur, which limits super-resolution quality. We analyze this effect and derive a theoretical result showing that motion blur has a substantial degrading effect on the performance of super-resolution. The conclusion is that, in order to achieve the highest resolution, motion blur should be avoided. Motion blur can be minimized by sampling the space-time volume of the video in a specific manner. We have developed a novel camera, called the "jitter camera," that achieves this sampling. By applying an adaptive super-resolution algorithm to the video produced by the jitter camera, we show that resolution can be notably enhanced for stationary or slowly moving objects, while it is improved slightly or left unchanged for objects with fast and complex motions. The end result is a video that has a significantly higher resolution than the captured one.
This paper deals with the error analysis of a novel navigation algorithm that uses as input the sequence of images acquired from a moving camera and a Digital Terrain (or Elevation) Map (DTM/DEM). More specifically, it has been shown that the optical flow derived from two consecutive camera frames can be used in combination with a DTM to estimate the position, orientation and ego-motion parameters of the moving camera. As opposed to previous works, the proposed approach does not require an intermediate explicit reconstruction of the 3D world. In the present work the sensitivity of the algorithm outlined above is studied. The main sources of error are identified to be the optical-flow evaluation and computation, the quality of the information about the terrain, the structure of the observed terrain and the trajectory of the camera. By assuming appropriate characterization of these error sources, a closed-form expression for the uncertainty of the pose and motion of the camera is first developed, and then the influence of these factors is confirmed using extensive numerical simulations. The main conclusion of this paper is that the proposed navigation algorithm generates accurate estimates for reasonable scenarios and error sources, and thus can be effectively used as part of a navigation system of autonomous vehicles.
This paper focuses on the problem of 3D object recognition from different viewing angles and positions. In particular, we propose a new approach that integrates Algebraic Functions of Views (AFoVs) with indexing and l...
We pose the problem of 3D human tracking as one of inference in a graphical model. Unlike traditional kinematic tree representations, our model of the body is a collection of loosely-connected limbs. Conditional probabilities relating the 3D pose of connected limbs are learned from motion-captured training data. Similarly, we learn probabilistic models for the temporal evolution of each limb (forward and backward in time). Human pose and motion estimation is then solved with non-parametric belief propagation using a variation of particle filtering that can be applied over a general loopy graph. The loose-limbed model and decentralized graph structure facilitate the use of low-level visual cues. We adopt simple limb and head detectors to provide "bottom-up" information that is incorporated into the inference process at every time-step; these detectors permit automatic initialization and aid recovery from transient tracking failures. We illustrate the method by automatically tracking a walking person in video imagery using four calibrated cameras. Our experimental apparatus includes a marker-based motion capture system aligned with the coordinate frame of the calibrated cameras, with which we quantitatively evaluate the accuracy of our 3D person tracker.
We present a two-stage face recognition method based on infrared imaging and statistical modeling. In the first stage we reduce the search space by finding highly likely candidates before arriving at a singular conclu...
The goal of this work is to use off-the-shelf components for gaze-based interaction, with a focus on eye typing. Avoiding the use of dedicated hardware such as IR light emitters makes eye tracking significantly more difficult and requires robust methods capable of handling large changes in image quality. We employ an active-contour method to obtain robust iris tracking. The main strength of the method is that the contour model avoids explicit feature detection: contours are simply assumed to remove statistical dependencies on opposite sides of the contour. The contour model is utilized in an approach combining particle filtering with the EM algorithm. The method is robust against light changes and camera defocusing. For the purpose of determining where the user is looking, calibration is usually needed. The number of calibration points used in different methods varies from a few to several thousand, depending on the prior knowledge used on the setup and equipment. We examine basic properties of gaze determination when the geometry of the camera, screen and user is unknown. In particular we present a lower bound on the number of calibration points needed for gaze determination on planar objects, and we examine degenerate configurations. Based on this lower bound we apply a simple calibration procedure to facilitate button selections for fast on-screen typing.
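To make the idea of calibrating gaze on a planar screen concrete, the following is a minimal sketch that assumes the mapping from tracked iris positions in the image to screen coordinates can be modeled as a planar homography. The abstract does not state which mapping the authors use or the value of their lower bound, so this is an illustration only; the function names `fit_homography` and `apply_homography` are hypothetical. A homography has eight degrees of freedom, so four correspondences in general position determine it, and the standard direct linear transform (DLT) recovers it from the null space of a linear system.

```python
import numpy as np


def fit_homography(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src via the DLT.

    src, dst: sequences of (x, y) pairs, at least four, no three collinear.
    Assumption (not from the abstract): image-to-screen gaze mapping is planar.
    """
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 3)


def apply_homography(H, pt):
    """Map a 2D point through H, dividing out the homogeneous coordinate."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])


# Hypothetical calibration: iris positions (pixels) observed while the user
# looks at four known screen targets.
iris = [(310.0, 240.0), (350.0, 242.0), (352.0, 268.0), (308.0, 266.0)]
screen = [(0.0, 0.0), (1280.0, 0.0), (1280.0, 1024.0), (0.0, 1024.0)]
H = fit_homography(iris, screen)
print(apply_homography(H, (330.0, 254.0)))
```

With exactly four points the fit is exact but offers no redundancy against tracking noise; degenerate layouts, such as three collinear calibration targets, leave the system underdetermined.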