ISBN (print): 9781467388511
This paper addresses a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations) that finely accords with human perception. We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixelwise object labeling and ii) a recursive neural network (RNN) discovering the hierarchical object structure and the inter-object relations. Rather than relying on elaborate user annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly-supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and use these trees to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RNN are updated accordingly by backpropagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments suggest that our model is capable of producing meaningful and structured scene configurations and achieving more favorable scene labeling performance on PASCAL VOC 2012 than other state-of-the-art weakly-supervised methods.
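The alternation the abstract describes can be sketched as a toy EM-style loop (the names and data here are illustrative, not the authors' code): the E-step selects, for each image, the scene configuration that best matches the current model, and the M-step updates the parameters toward the chosen configurations, standing in for CNN/RNN backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_candidates, dim = 5, 3, 4
# score features for each candidate configuration (parse) of each image
feats = rng.normal(size=(n_images, n_candidates, dim))
params = np.zeros(dim)

for _ in range(10):
    # E-step: latent configuration = highest-scoring candidate per image
    scores = feats @ params                          # (n_images, n_candidates)
    chosen = scores.argmax(axis=1)
    # M-step: move parameters toward the features of the chosen parses
    target = feats[np.arange(n_images), chosen].mean(axis=0)
    params = 0.5 * params + 0.5 * target
```

In the paper the E-step search is guided by the sentence-derived semantic trees and the M-step is a full backpropagation pass; this sketch only shows the alternating structure of the training.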
ISBN (print): 0818607211
An experimental machine vision system is described which outputs the 3-D coordinates of grid points on surface patches of objects illuminated by a projected light grid. Multiple object scenes with occlusion are handled. The major contribution of the work lies in its solution of the line labeling problem; grid line identity is deduced for use in the triangulation procedure. Connected components of illuminated points are extracted from the 2-D image to identify connected surface elements from the scene. Using geometrical constraints from camera and projector calibration, a small set of grid-label possibilities is assigned to each network grid point. For each possible grid label, 3-D surface coordinates can be computed via triangulation. Using the neighborhood constraints in the 2-D network and some assumptions about the 3-D scene, a constraint propagation process is able to cull most of the grid-label possibilities from each set. Grid label assignments in separate patches are then related and ambiguities are reduced further. Examples illustrate processing on both regular and irregular objects, where many of the visible surface patches are assigned unique 3-D solutions.
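The pruning step can be illustrated with a minimal sketch (hypothetical data, not the paper's system): neighboring points along a projected grid line must carry consecutive line labels, so any candidate label with no compatible label at a neighbor can be discarded, and the deletions propagate until a fixed point.

```python
points = [0, 1, 2]                       # three connected points on one grid line
neighbors = {0: [1], 1: [0, 2], 2: [1]}
# candidate label sets, as would come from calibration constraints
cands = {0: {3, 4, 7}, 1: {4, 5}, 2: {5, 6, 9}}

changed = True
while changed:
    changed = False
    for p in points:
        for lab in list(cands[p]):
            # keep 'lab' only if every neighbor has a label exactly one step away
            ok = all(any(abs(lab - m) == 1 for m in cands[q]) for q in neighbors[p])
            if not ok:
                cands[p].discard(lab)
                changed = True

print(cands)  # → {0: {3, 4}, 1: {4, 5}, 2: {5, 6}}
```

The real system uses richer 2-D network and 3-D scene constraints, but the fixed-point structure of the propagation is the same.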
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
This paper introduces a new approach for food image segmentation utilizing the Segment Anything Model (SAM), with additional refinement achieved through fine-tuning with Low-Rank Adaptation (LoRA) layers. The segmentation task involves generating a binary mask for food in RGB images, with pixels categorized as background or food. We conduct various experiments to assess and compare the performance of our proposed method with previous approaches. Our findings indicate that our method consistently outperforms other techniques, achieving an accuracy of 94.14%. The improved accuracy of our approach highlights its potential for various applications in food image analysis, contributing to the advancement of computer vision techniques in the realm of food recognition and segmentation.
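The LoRA mechanism the paper fine-tunes with can be sketched in a few lines of numpy (shapes and scaling are illustrative, not SAM's actual layers): a frozen weight W is augmented with a trainable low-rank update (alpha / r) * B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init -> no change at start

def lora_forward(x):
    # frozen base path plus low-rank adapter path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# with B initialised to zero, the adapted layer exactly matches the frozen one
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialisation of B is the standard LoRA trick that makes fine-tuning start from the pretrained model's behavior.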
In this paper, we address the problem of object class recognition via observations from actively selected views/modalities/features under limited resource budgets. A Partially Observable Markov Decision Process (POMDP) is employed to find optimal sensing and recognition actions with the goal of maximizing long-term classification accuracy. Heterogeneous resource constraints -- such as motion, number of measurements, and bandwidth -- are explicitly modeled in the state variable, and a prohibitively high penalty is used to prevent the violation of any resource constraint. To improve recognition performance, we further incorporate discriminative classification models into the POMDP, and customize the reward function and observation model accordingly. The proposed model is validated on several data sets for multi-view, multi-modal vehicle classification and multi-view face recognition, and demonstrates improvement in both recognition and resource management over greedy methods and previous POMDP formulations.
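At the core of any such POMDP recognizer is a Bayesian belief update over object classes after each sensing action. A hypothetical two-class, two-observation sketch (numbers are illustrative, not from the paper):

```python
import numpy as np

obs_model = np.array([[0.8, 0.2],    # P(obs | class 0)
                      [0.3, 0.7]])   # P(obs | class 1)
belief = np.array([0.5, 0.5])        # uniform prior over the two classes

for obs in [0, 0, 1]:                # observations from the selected views
    belief = belief * obs_model[:, obs]   # Bayes' rule, unnormalized
    belief /= belief.sum()                # renormalize to a distribution

print(belief)  # mass concentrates on class 0 after two class-0-typical obs
```

The POMDP policy then chooses the next view/modality by trading off the expected information gain of this update against the modeled resource costs.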
Researchers in computer vision and pattern recognition have worked on automatic techniques for recognizing human faces for the last 20 years. While some systems, especially template-based ones, have been quite successful on expressionless, frontal views of faces with controlled lighting, not much work has taken face recognizers beyond these narrow imaging conditions. Our goal is to build a face recognizer that works under varying pose, the difficult part of which is to handle face rotations in depth. Building on successful template-based systems, our basic approach is to represent faces with templates from multiple model views that cover different poses from the viewing sphere. To recognize a novel view, the recognizer locates the eyes and nose features, uses these locations to geometrically register the input with model views, and then uses correlation on model templates to find the best match in the database of people. Our system has achieved a recognition rate of 98% on a database of 62 people containing 10 testing and 15 modeling views per person.
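The correlation matching step can be sketched with illustrative arrays (not the paper's face data): after geometric registration, the input window is compared against each person's model templates with normalized correlation, and the best-scoring identity wins.

```python
import numpy as np

def ncc(a, b):
    # normalized cross-correlation of two equally sized patches
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

rng = np.random.default_rng(0)
templates = {"person_a": rng.normal(size=(16, 16)),
             "person_b": rng.normal(size=(16, 16))}
probe = templates["person_b"] + 0.1 * rng.normal(size=(16, 16))  # noisy view

best = max(templates, key=lambda name: ncc(probe, templates[name]))
print(best)  # → person_b
```

In the full system there are multiple templates per person (one per model pose), and the registration from eye/nose locations is what makes a single-template comparison meaningful.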
This paper describes an investigation into the use of parametric 2D models describing the movement of edges for the determination of possible 3D shape, and hence function, of an object. An assumption of this research is that the camera can foveate and track particular features. It is argued that simple 2D analytic descriptions of the movement of edges can infer 3D shape while the camera is moved. This exploits an advantage of foveation, i.e., the problem becomes object-centred. The problem of correspondence for numerous edge points is overcome by the use of a tree-based representation for the competing hypotheses. Numerous hypotheses are maintained simultaneously, so the method does not rely on a single kinematic model that assumes constant velocity or acceleration. The numerous advantages of this strategy are described.
This paper proposes a new approach to upsample depth maps when aligned high-resolution color images are given. Such a task is referred to as guided depth upsampling in our work. We formulate this problem based on the recently developed sparse representation analysis models. More specifically, we exploit the cosparsity of a depth map under analytic analysis operators, together with data fidelity and color-guided smoothness constraints, for upsampling. The formulated problem is solved by the greedy analysis pursuit algorithm. Since our approach relies on analytic operators such as wavelet transforms and finite difference operators, it does not require any training data but a single depth-color image pair. A variety of experiments have been conducted on both synthetic and real data. Experimental results demonstrate that our approach outperforms the specialized state-of-the-art algorithms.
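The cosparsity assumption behind the method can be shown on a synthetic example (not real depth data): a piecewise-constant depth profile becomes sparse under a finite-difference analysis operator, i.e., almost all analysis coefficients are zero.

```python
import numpy as np

# a 1-D piecewise-constant "depth profile" with two discontinuities
depth = np.array([2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0])
n = depth.size

# forward finite-difference analysis operator Omega, shape (n-1, n)
Omega = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)
coeffs = Omega @ depth

# only the two depth discontinuities give nonzero analysis coefficients
print(np.flatnonzero(coeffs))  # → [2 6]
```

Greedy analysis pursuit exploits exactly this: it estimates which rows of the analysis operator annihilate the signal (the cosupport) and reconstructs the depth map consistent with that cosupport and the data constraints.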
This paper introduces a sphere-based representation to model a 3D scene and shows its performance on various tasks, including Structure from Motion (SfM) and 3D scene classification. A significant target application of this work is Mixed Reality, where 3D data can be efficiently represented, and synthetic and real data can be mixed for an immersive experience. Over the past few decades, 3D big data has garnered increased attention in computer vision. Acquiring, representing, reconstructing, querying, classifying, and visualizing 3D models for Mixed Reality has become crucial for many applications, such as medicine, architecture, entertainment, and bioinformatics. With the ever-increasing amount of data that 3D scanners produce, storing, processing, and transmitting the data becomes challenging. Techniques that exploit the shape information need to be developed to model, classify, and visualize the data. Our work offers a novel multi-scale surface representation based on spheres, with the ultimate goal of helping scientists to see and work with 3D data in Mixed Reality more effectively and efficiently.
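A toy sketch of the general idea of summarizing a 3D point set with spheres (a simplified stand-in for the paper's multi-scale representation, with synthetic data): points are grouped and each group is replaced by a center and a radius that covers its points.

```python
import numpy as np

rng = np.random.default_rng(0)
# two synthetic blobs of 3D points
pts = np.vstack([rng.normal(0.0, 0.1, size=(50, 3)),
                 rng.normal(3.0, 0.1, size=(50, 3))])

# one sphere per blob: center = centroid, radius = max distance to center
spheres = []
for blob in (pts[:50], pts[50:]):
    c = blob.mean(axis=0)
    r = np.linalg.norm(blob - c, axis=1).max()
    spheres.append((c, r))

# 100 points compressed to 2 (center, radius) pairs
print(len(spheres))
```

A multi-scale version would recurse: coarse spheres for distant viewing, refined into smaller spheres as the viewer approaches, which is what makes the representation attractive for Mixed Reality streaming.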
Boundary detection in natural images represents an important but also challenging problem in computer vision. Motivated by studies in psychophysics claiming that humans use multiple cues for segmentation, several promising methods have been proposed which perform boundary detection by optimally combining local image measurements such as color, texture, and brightness. Very interesting results have been reported by applying these methods on challenging datasets such as the Berkeley segmentation benchmark. Although combining different cues for boundary detection has been shown to outperform methods using a single cue, results can be further improved by integrating perceptual organization cues with the boundary detection process. The main goal of this study is to investigate how and when perceptual organization cues improve boundary detection in natural images. In this context, we investigate the idea of integrating iterative multi-scale tensor voting (IMSTV) with segmentation; IMSTV is a variant of tensor voting (TV) that performs perceptual grouping by analyzing information at multiple scales and removing background clutter in an iterative fashion, preserving salient, organized structures. The key idea is to use IMSTV to post-process the boundary posterior probability (PB) map produced by segmentation algorithms. Detailed analysis of our experimental results reveals how and when perceptual organization cues are likely to improve or degrade boundary detection. In particular, we show that using perceptual grouping as a post-processing step improves boundary detection in 84% of the grayscale test images in the Berkeley segmentation dataset.
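A much-simplified stand-in for the post-processing idea (not the actual tensor-voting algorithm): per-pixel structure tensors are built from image gradients, their eigenvalue gap (analogous to stick saliency in TV) measures how "organized" the local structure is, and the PB map is down-weighted where saliency is low.

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.zeros((32, 32))
img[:, 16:] = 1.0                         # vertical step edge
img += 0.01 * rng.normal(size=img.shape)  # background clutter

gy, gx = np.gradient(img)
# structure tensor entries (no spatial smoothing, for brevity)
Jxx, Jyy, Jxy = gx * gx, gy * gy, gx * gy
trace = Jxx + Jyy
det = Jxx * Jyy - Jxy * Jxy
gap = np.sqrt(np.maximum(trace ** 2 - 4 * det, 0.0))  # eigenvalue difference

pb = rng.random(img.shape)                # placeholder boundary-probability map
pb_refined = pb * (gap / (gap.max() + 1e-8))

# boundary evidence survives only near the organized edge structure
print(pb_refined[:, 14:18].mean() > pb_refined[:, :4].mean())
```

IMSTV replaces this local eigen-analysis with multi-scale voting between tokens and iterative clutter removal, but the principle, suppressing PB responses that lack perceptual-organization support, is the same.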