Scanned 3D point clouds of real-world scenes often suffer from noise and incompleteness. Observing that prior point cloud shape completion networks overlook local geometric features, we propose ECG, an Edge-aware point cloud Completion network with Graph convolution, which facilitates fine-grained 3D point cloud shape generation with multi-scale edge features. Our ECG consists of two consecutive stages: 1) skeleton generation and 2) detail refinement. Each stage is a generation sub-network conditioned on the input incomplete point cloud. The first stage generates coarse skeletons that facilitate capturing useful edge features despite noisy measurements. Subsequently, we design a deep hierarchical encoder with graph convolution to propagate multi-scale edge features for refining local geometric details. To preserve local geometric details while upsampling, we propose the Edge-aware Feature Expansion (EFE) module, which smoothly expands/upsamples point features by emphasizing their local edges. Extensive experiments show that our ECG significantly outperforms previous state-of-the-art (SOTA) methods for point cloud completion.
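Since the abstract does not detail its graph convolution, the sketch below uses a standard EdgeConv-style operator (k-NN grouping plus an MLP over center/offset edge features) as a plausible stand-in for how edge features could be extracted from point features; the choice of k, layer sizes, and max-aggregation are assumptions, not ECG's exact EFE module.

```python
# Minimal sketch of an edge-aware graph convolution over point features,
# in the spirit of the multi-scale edge features the abstract describes.
import torch
import torch.nn as nn

def knn_indices(x, k):
    # x: (B, N, C) point features; returns (B, N, k) nearest-neighbor indices
    dist = torch.cdist(x, x)                                  # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]   # drop self

class EdgeConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * in_dim, out_dim, 1),
            nn.BatchNorm2d(out_dim), nn.ReLU())

    def forward(self, x):                                     # x: (B, N, C)
        B, N, C = x.shape
        idx = knn_indices(x, self.k)                          # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))        # (B, N, k, C)
        center = x.unsqueeze(2).expand_as(nbrs)
        edge = torch.cat([center, nbrs - center], dim=-1)     # edge features
        edge = edge.permute(0, 3, 1, 2)                       # (B, 2C, N, k)
        # Max over the k neighbors keeps the strongest edge response per point.
        return self.mlp(edge).max(dim=-1).values.permute(0, 2, 1)
```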
Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing works for depth estimation require either massive amounts of annotated training data or some form of hard-coded geometrical constraint. This paper explores a new approach to learning depth perception requiring neither of those. Specifically, we propose a novel global-local network architecture that can be trained with the data observed by a robot exploring an environment: images and extremely sparse depth measurements, down to even a single pixel per image. From a pair of consecutive images, the proposed network outputs a latent representation of the camera's and scene's parameters, and a dense depth map. Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches. We believe that this work, in addition to its scientific interest, lays the foundations to learn depth with extremely sparse supervision, which can be valuable to all robotic systems acting under severe bandwidth or sensing constraints.
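As an illustration of training with such extreme sparsity, here is a minimal masked-loss sketch: the dense prediction is penalized only at pixels where a measurement exists, down to a single pixel per image. The L1 form and normalization are assumptions; the paper's actual objective may differ.

```python
# Hedged sketch of supervising dense depth with extremely sparse ground truth.
import torch

def sparse_depth_loss(pred, gt, valid_mask):
    """pred, gt: (B, 1, H, W) depth maps; valid_mask: (B, 1, H, W) boolean,
    True only where a depth measurement exists (possibly a single pixel)."""
    diff = (pred - gt).abs() * valid_mask
    # Normalize by the number of valid pixels so even one measurement
    # per image yields a usable gradient.
    return diff.sum() / valid_mask.sum().clamp(min=1)
```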
Semantic segmentation is key in autonomous driving. Using deep visual learning architectures is not trivial in this context, because of the challenges in creating suitable large-scale annotated datasets. This issue has traditionally been circumvented through the use of synthetic datasets, which have become a popular resource in this field. They have been released to meet the need for semantic segmentation algorithms able to close the visual domain shift between training and test data. Although exacerbated by the use of artificial data, the problem is extremely relevant in this field even when training on real data. Indeed, weather conditions, viewpoints, and city appearances can vary considerably from car to car, and even at test time for a single, specific vehicle. How to deal with domain adaptation in semantic segmentation, and how to effectively leverage several different data distributions (source domains), are important research questions in this field. To support work in this direction, this letter contributes a new large-scale synthetic dataset for semantic segmentation with more than 100 different source visual domains. The dataset has been created to explicitly address the challenges of domain shift between training and test data under various weather and viewpoint conditions, in seven different city types. Extensive benchmark experiments assess the dataset, showcasing open challenges for the current state of the art. The dataset will be available at: https://***/home/.
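One simple way to exercise a dataset with 100+ source domains is to sample each training batch from a randomly chosen domain, so a segmentation model sees many weather, viewpoint, and city-type conditions per epoch. The loader structure below is a hypothetical illustration, not the released dataset's API.

```python
# Hedged sketch of round-robin-style multi-source-domain batch sampling.
import random

def multi_domain_batches(loaders_by_domain):
    """loaders_by_domain: dict mapping a domain name (e.g. a weather /
    viewpoint / city-type combination) to an iterable of (image, label)
    batches. Yields (domain, batch) pairs until all domains are exhausted."""
    iters = {d: iter(l) for d, l in loaders_by_domain.items()}
    while iters:
        domain = random.choice(list(iters))
        try:
            yield domain, next(iters[domain])
        except StopIteration:
            del iters[domain]   # this domain is exhausted for the epoch
```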
Domestic service robots (DSRs) are a promising solution to the shortage of home care workers. However, one of the main limitations of DSRs is their inability to interact naturally through language. Recently, data-driven approaches have been shown to be effective for tackling this limitation; however, they often require large-scale datasets, which are costly to build. Against this background, we aim to perform automatic sentence generation for fetching instructions: for example, "Bring me a green tea bottle on the table." This is particularly challenging because appropriate expressions depend on the target object as well as its surroundings. In this letter, we propose the attention branch encoder-decoder network (ABEN) to generate sentences from visual inputs. Unlike other approaches, the ABEN has multimodal attention branches that use subword-level attention and generate sentences based on subword embeddings. In experiments, we compared the ABEN with a baseline method using four standard metrics in image captioning. Results show that the ABEN outperformed the baseline in terms of these metrics.
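To make subword-level attention concrete, the sketch below shows a single decoding step that embeds a subword, attends over visual region features, and predicts the next subword. The single attention head, GRU cell, and dimensions are assumptions; the ABEN's multimodal attention branches are more elaborate than this.

```python
# Hedged sketch of one subword-attention decoding step for caption generation.
import torch
import torch.nn as nn

class SubwordAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, d_visual=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # subword embeddings
        self.attn = nn.Linear(d_model + d_visual, 1)
        self.gru = nn.GRUCell(d_model + d_visual, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def step(self, subword_ids, h, visual):
        # visual: (B, R, d_visual) region features; h: (B, d_model) state
        e = self.embed(subword_ids)                      # (B, d_model)
        q = e.unsqueeze(1).expand(-1, visual.size(1), -1)
        scores = self.attn(torch.cat([q, visual], -1)).squeeze(-1)  # (B, R)
        # Attention-weighted visual context conditioned on the current subword.
        ctx = (scores.softmax(-1).unsqueeze(-1) * visual).sum(1)
        h = self.gru(torch.cat([e, ctx], -1), h)
        return self.out(h), h                            # next-subword logits
```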
Autonomous cars need continuously updated depth information. Thus far, depth has mostly been estimated independently for a single frame at a time, even if the method starts from video input. Our method produces a time series of depth maps, which makes it an ideal candidate for online learning approaches. In particular, we put three different types of depth estimation (supervised depth prediction, self-supervised depth prediction, and self-supervised depth completion) into a common framework. We integrate the corresponding networks with a ConvLSTM such that the spatiotemporal structure of depth across frames can be exploited to yield more accurate depth estimation. Our method is flexible: it can be applied to monocular videos only or be combined with different types of sparse depth patterns. We carefully study the architecture of the recurrent network and its training strategy. We are the first to successfully exploit recurrent networks for real-time self-supervised monocular depth estimation and completion. Extensive experiments show that our recurrent method consistently and significantly outperforms its image-based counterpart in both self-supervised scenarios. It also outperforms previous depth estimation methods from the three popular groups. Please refer to our webpage for details.
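For reference, a standard ConvLSTM cell of the kind such a framework could integrate is sketched below: hidden and cell states carry depth features across frames while convolutions preserve spatial structure. Where exactly the cell sits inside the depth networks, and its kernel size, are assumptions here.

```python
# Minimal standard ConvLSTM cell for propagating depth features across frames.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, k=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, k,
                               padding=k // 2)

    def forward(self, x, state=None):
        # x: (B, in_ch, H, W) encoder features for the current frame
        if state is None:
            h = x.new_zeros(x.size(0), self.hidden_ch, *x.shape[2:])
            c = h.clone()
        else:
            h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], 1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # update cell state
        h = o.sigmoid() * c.tanh()                     # emit hidden state
        return h, (h, c)   # h feeds the decoder; (h, c) carries over frames
```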
In this letter, we introduce 6D object pose estimation based on rotational primitive prediction, using a single image as input. We solve for the 6D pose of a known object relative to the camera from a single image with occlusion. Many recent state-of-the-art (SOTA) two-step approaches extract image keypoints and then apply PnP regression for pose estimation. Instead of relying on a bounding box or keypoints on the object, we propose to learn an orientation-induced primitive so as to achieve accurate pose estimation regardless of object size. We leverage a Variational AutoEncoder (VAE) to learn this underlying primitive and its associated keypoints. The keypoints inferred from the reconstructed primitive image are then used to regress the rotation with PnP. Lastly, we compute the translation in a separate localization module to complete the full 6D pose estimation. When evaluated on public datasets, the proposed method yields notable improvements on the LINEMOD, Occlusion LINEMOD, and YCB-Video datasets. We further show that a model trained only on synthetic data performs comparably to existing methods that require real images in the training phase.
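The PnP step can be illustrated with OpenCV: given 3D keypoints on the primitive model and their inferred 2D locations, cv2.solvePnP recovers the rotation. The keypoint and calibration inputs below are placeholders; note that the paper computes translation in a separate localization module, so the tvec returned by PnP here would not be its final translation.

```python
# Hedged sketch of regressing rotation from inferred primitive keypoints.
import cv2
import numpy as np

def rotation_from_keypoints(obj_pts, img_pts, K):
    """obj_pts: (N, 3) 3D keypoints on the primitive model;
    img_pts: (N, 2) their inferred 2D image locations;
    K: (3, 3) camera intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        obj_pts.astype(np.float64), img_pts.astype(np.float64),
        K.astype(np.float64), distCoeffs=None)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)   # axis-angle vector to 3x3 rotation matrix
    return R, tvec               # tvec is PnP's translation, used here
                                 # only for illustration
```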