We present a three-dimensional deep convolutional neural network (3D CNN) approach for grasping unknown objects with soft hands. Soft hands are compliant and capable of handling uncertainty in sensing and actuation, but come at the cost of unpredictable deformation of the soft fingers. Traditional model-driven grasping approaches, which assume known models for objects, robot hands, and stable grasps with expected contacts, are inapplicable to such soft hands, since predicting contact points between objects and soft hands is not straightforward. Our solution adopts a deep CNN approach to find good caging grasps for previously unseen objects by learning effective features and a classifier from point cloud data. Unlike recent CNN models applied to robotic grasping, which have been trained on 2D or 2.5D images and limited to a fixed top grasping direction, we exploit the power of a 3D CNN model to estimate suitable grasp poses from multiple grasping directions (top and side directions) and wrist orientations, which has great potential for geometry-related robotic tasks. Guided by the 3D CNN algorithm, our soft hands achieve an 87% grasp success rate on previously unseen objects. A set of comparative evaluations shows the robustness of our approach with respect to noise and occlusions.
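The classifier described above operates on 3D point cloud data rather than on 2D images. As a rough illustration of how such a pipeline can be structured, the sketch below voxelizes a hand-centered point cloud and scores it with a small 3D CNN; the voxelize() helper, the layer sizes, and the two-class output are illustrative assumptions, not the authors' published architecture.

```python
# Illustrative sketch only: the grid size, network layers, and voxelize()
# helper are assumptions, not the paper's architecture.
import numpy as np
import torch
import torch.nn as nn

def voxelize(points, grid=32, bound=0.15):
    """Map an (N, 3) hand-centered point cloud (meters) to an occupancy grid."""
    vox = np.zeros((1, grid, grid, grid), dtype=np.float32)
    idx = ((points + bound) / (2 * bound) * grid).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid), axis=1)]
    vox[0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox

class GraspNet3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 5, stride=2), nn.ReLU(),   # 32^3 -> 14^3
            nn.Conv3d(16, 32, 3, stride=2), nn.ReLU(),  # 14^3 -> 6^3
            nn.Conv3d(32, 64, 3, stride=2), nn.ReLU())  # 6^3  -> 2^3
        self.classifier = nn.Linear(64 * 2 * 2 * 2, 2)  # good / bad caging grasp

    def forward(self, vox):
        return self.classifier(self.features(vox).flatten(1))

# Score one candidate grasp: voxelize the cloud in the candidate hand frame,
# then classify; repeating over top/side directions and wrist orientations
# yields the multi-direction grasp pose search described in the abstract.
net = GraspNet3D()
cloud = np.random.rand(2048, 3) * 0.2 - 0.1              # stand-in point cloud
vox = torch.from_numpy(voxelize(cloud)).unsqueeze(0)     # (1, 1, 32, 32, 32)
print(net(vox).softmax(-1)[0, 1].item())                 # grasp quality score
```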
Robot navigation using deep neural networks has been drawing a great deal of attention. Although reactive neural networks easily learn expert behaviors and are computationally efficient, they generalize poorly beyond the specific environments in which they were trained. As such, reinforcement learning and value iteration approaches for learning generalized policies have been proposed. However, these approaches are more costly. In this letter, we tackle the problem of learning reactive neural networks that are applicable to general environments. The key concept is to crop, rotate, and rescale an obstacle map according to the goal location and the agent's current location, so that the map representation is better correlated with self-movement in the general navigation task rather than with the layout of the environment. Furthermore, in addition to the obstacle map, we input a map of visited locations that contains the movement history of the agent, in order to avoid failures in which the agent travels back and forth repeatedly over the same location. Experimental results reveal that the proposed network outperforms the state-of-the-art value iteration network in the grid-world navigation task. We also demonstrate that the proposed model generalizes well to unseen obstacles and unknown terrain. Finally, we demonstrate that the proposed system enables a mobile robot to successfully navigate in a real dynamic environment.
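The crop-rotate-rescale preprocessing is the core idea and is straightforward to prototype. The sketch below is a rough approximation under assumed conventions (row/column map indexing, out-of-map cells treated as obstacles, a fixed output window): it recenters the map on the agent, rotates it by the goal bearing, and rescales by goal distance. The function name and parameters are illustrative, not from the paper's code.

```python
# Rough sketch of the ego-centric map transform; names, the output window
# size, the rotation sign, and the obstacle padding are assumptions.
import numpy as np
from scipy import ndimage

def egocentric_map(obstacle_map, agent_rc, goal_rc, out_size=32):
    dr, dc = np.subtract(goal_rc, agent_rc)
    angle = np.degrees(np.arctan2(dr, dc))       # goal bearing from the agent
    dist = max(np.hypot(dr, dc), 1.0)

    # Recenter the map on the agent, then rotate so the goal bearing is canonical.
    shift = np.array(obstacle_map.shape) / 2 - np.asarray(agent_rc)
    centered = ndimage.shift(obstacle_map.astype(float), shift, order=0, cval=1.0)
    rotated = ndimage.rotate(centered, angle + 90, reshape=False, order=0, cval=1.0)

    # Rescale so the goal distance spans a fixed number of cells, pad so the
    # crop never leaves the array, and crop a fixed window around the agent.
    scaled = ndimage.zoom(rotated, out_size / (2 * dist), order=0)
    scaled = np.pad(scaled, out_size, constant_values=1.0)
    c, h = np.array(scaled.shape) // 2, out_size // 2
    return scaled[c[0] - h:c[0] + h, c[1] - h:c[1] + h]

grid = np.zeros((64, 64)); grid[20:24, 30:40] = 1.0      # toy obstacle map
ego = egocentric_map(grid, agent_rc=(50, 10), goal_rc=(10, 50))
print(ego.shape)                                          # (32, 32)
```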
Recognition ability and, more broadly, machine learning techniques enable robots to perform complex tasks and allow them to function in diverse situations. In fact, robots can easily access an abundance of sensor data that are recorded in real time, such as speech, image, and video. Since such data are time sensitive, processing them in real time is a necessity. Moreover, machine learning techniques are known to be computationally intensive and resource hungry. As a result, an individual resource-constrained robot, in terms of computation power and energy supply, is often unable to handle such heavy real-time computations alone. To overcome this obstacle, we propose a framework to harvest the aggregated computational power of several low-power robots for enabling efficient, dynamic, and real-time recognition. Our method adapts to the availability of computing devices at runtime and adjusts to the inherent dynamics of the network. Our framework can be applied to any distributed robot system. To demonstrate, with several Raspberry-Pi3-based robots (up to 12), each equipped with a camera, we implement a state-of-the-art action recognition model for videos and two recognition models for images. Our approach allows a group of multiple low-power robots to obtain performance (in terms of the number of images or video frames processed per second) similar to that of a high-end embedded platform, the Nvidia Tegra TX2.
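The abstract does not detail the framework's internals; as a conceptual stand-in, the sketch below shards frames across a pool of workers through a shared task queue, which is the essential load-distribution pattern. In a real deployment the local processes would be networked robots and the placeholder function a CNN forward pass.

```python
# Conceptual stand-in only: local processes instead of networked robots,
# and a trivial placeholder instead of a recognition model.
import multiprocessing as mp

def worker(task_q, result_q):
    """One low-power robot: pull a frame, run recognition, report back."""
    while True:
        item = task_q.get()
        if item is None:                      # poison pill: no more frames
            break
        frame_id, frame = item
        label = sum(frame) % 10               # placeholder for a CNN forward pass
        result_q.put((frame_id, label))

if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    n_workers = 4                             # would track robots available at runtime
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    frames = [(i, [i, i + 1, i + 2]) for i in range(100)]   # dummy video frames
    for f in frames:
        task_q.put(f)
    for _ in procs:
        task_q.put(None)
    results = [result_q.get() for _ in frames]
    for p in procs:
        p.join()
    print(f"processed {len(results)} frames across {n_workers} workers")
```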
This letter presents a novel semantic mapping approach, Recurrent-OctoMap, learned from long-term three-dimensional (3-D) Lidar data. Most existing semantic mapping approaches focus on improving the semantic understanding of single frames, rather than on 3-D refinement of semantic maps (i.e., fusing semantic observations). The most widely used approach for 3-D semantic map refinement is the "Bayes update," which fuses consecutive predictive probabilities following a Markov-chain model. Instead, we propose a learning approach that fuses semantic features, rather than simply fusing predictions from a classifier. In our approach, we represent and maintain our 3-D map as an OctoMap and model each cell as a recurrent neural network, obtaining a Recurrent-OctoMap. In this case, the semantic mapping process can be formulated as a sequence-to-sequence encoding-decoding problem. Moreover, in order to extend the duration of observations in our Recurrent-OctoMap, we developed a robust 3-D localization and mapping system for successively mapping a dynamic environment over more than two weeks of data, and the system can be trained and deployed with arbitrary memory length. We validate our approach on the ETH long-term 3-D Lidar dataset. The experimental results show that our proposed approach outperforms the conventional "Bayes update" approach.
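The per-cell recurrent fusion can be pictured as one GRU state per occupied OctoMap cell, updated by each frame's semantic feature for that cell, with a linear head decoding the fused label. The sketch below shows this idea; the feature, hidden, and class dimensions are illustrative and not taken from the paper.

```python
# Minimal sketch of per-cell recurrent semantic fusion; dimensions are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class RecurrentCell(nn.Module):
    def __init__(self, feat_dim=16, hidden_dim=32, n_classes=5):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, feat, h):
        h = self.gru(feat, h)          # fuse this frame's semantic feature
        return self.head(h), h         # fused class logits, updated cell state

model = RecurrentCell()
hidden = {}                             # one hidden state per occupied cell key
cell_key = (1, 2, 3)                    # toy OctoMap cell index
for t in range(20):                     # 20 successive observations of one cell
    feat = torch.randn(1, 16)           # stand-in per-frame semantic feature
    h = hidden.get(cell_key, torch.zeros(1, 32))
    logits, hidden[cell_key] = model(feat, h)
print(logits.argmax(-1))                # fused semantic label for the cell
```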
Robot-assisted deployment of fenestrated stent grafts in fenestrated endovascular aortic repair (FEVAR) requires accurate geometrical alignment. Currently, this process is guided by two-dimensional (2-D) fluoroscopy, which is insufficiently informative and error prone. In this letter, a real-time framework is proposed to instantiate the 3-D shape of a fenestrated stent graft using only a single low-dose 2-D fluoroscopic image. First, markers were placed on the fenestrated stent graft. Second, the 3-D pose of each stent segment was instantiated by the robust perspective-n-point method. Third, the 3-D shape of the whole stent graft was instantiated via graft gap interpolation. Focal UNet was proposed to segment the markers from 2-D fluoroscopic images to achieve semiautomatic marker detection. The proposed framework was validated on five patient-specific 3-D printed aortic aneurysm phantoms and three stent grafts with new marker placements, showing an average distance error of 1-3 mm and an average angular error of 4 degrees. The shape instantiation code is available online.
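The second step, per-segment pose instantiation, is a perspective-n-point (PnP) problem: known 3-D marker positions on the stent segment plus their detected 2-D image locations determine the segment pose. Below is a minimal sketch using OpenCV's standard solver as a stand-in for the robust PnP method the letter uses; the planar marker layout, the image detections, and the camera intrinsics are invented for illustration.

```python
# Minimal PnP sketch; cv2.solvePnP stands in for the letter's robust PnP
# method, and all numbers below are made up for illustration.
import numpy as np
import cv2

# 3-D marker coordinates on a stent-segment ring (meters, illustrative).
object_pts = np.array([[0.00, 0.01, 0.0], [0.01, 0.00, 0.0],
                       [0.00, -0.01, 0.0], [-0.01, 0.00, 0.0]], dtype=np.float64)
# 2-D marker detections in the fluoroscopic image (pixels), e.g. from a
# marker segmentation network such as the letter's Focal UNet.
image_pts = np.array([[320.0, 240.0], [350.0, 238.0],
                      [322.0, 210.0], [295.0, 236.0]], dtype=np.float64)
K = np.array([[1000.0, 0, 320.0], [0, 1000.0, 240.0], [0, 0, 1]])  # intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the stent segment
print(ok, tvec.ravel())      # segment position in the camera frame
```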
We present a novel deep neural network architecture for representing robot experiences in an episodic-like memory that facilitates encoding, recalling, and predicting action experiences. Our proposed unsupervised deep episodic memory model works as follows: first, it encodes observed actions in a latent vector space; second, based on this latent encoding, it infers the most similar episodes previously experienced; third, it reconstructs the original episodes; and finally, it predicts future frames, all in an end-to-end fashion. Results show that conceptually similar actions are mapped into the same region of the latent vector space. Based on these results, we introduce an action matching and retrieval mechanism, benchmark its performance on two large-scale action datasets, 20BN-something-something and ActivityNet, and evaluate its generalization capability in a real-world scenario on a humanoid robot.
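The matching-and-retrieval mechanism amounts to nearest-neighbor search in the learned latent space: encode a query episode, then rank stored episode encodings by similarity. A minimal sketch with cosine similarity follows; encode() is a placeholder, not the model's real interface.

```python
# Latent-space episode retrieval sketch; encode() is a stand-in for the
# paper's learned encoder, and the data are random placeholders.
import numpy as np

def encode(episode_frames):
    """Stand-in encoder: maps an episode (frames x features) to a latent vector."""
    return episode_frames.mean(axis=0)        # placeholder for the real model

def retrieve(query_frames, memory, k=3):
    """Return indices of the k stored episodes most similar to the query."""
    q = encode(query_frames)
    sims = [np.dot(q, m) / (np.linalg.norm(q) * np.linalg.norm(m) + 1e-9)
            for m in memory]
    return np.argsort(sims)[::-1][:k]

memory = [encode(np.random.rand(30, 64)) for _ in range(100)]  # stored episodes
best = retrieve(np.random.rand(30, 64), memory)
print(best)     # indices of the most similar past action episodes
```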
Given two consecutive RGB-D images, we propose a model that estimates a dense three-dimensional (3D) motion field, also known as scene flow. We take advantage of the fact that in robot manipulation scenarios, scenes often consist of a set of rigidly moving objects. Our model jointly estimates the following: first, the segmentation of the scene into an unknown but finite number of objects; second, the motion trajectories of these objects; and finally, the object scene flow. We employ an hourglass deep neural network architecture. In the encoding stage, the RGB and depth images undergo spatial compression and correlation. In the decoding stage, the model outputs three images containing a per-pixel estimate of the corresponding object center as well as object translation and rotation. This forms the basis for inferring the object segmentation and the final object scene flow. To evaluate our model, we generated a new and challenging large-scale synthetic dataset that is specifically targeted at robotic manipulation: it contains a large number of scenes with a very diverse set of simultaneously moving 3D objects and is recorded with a simulated, static RGB-D camera. In quantitative experiments, we show that we outperform state-of-the-art scene flow and motion-segmentation methods on this dataset. In qualitative experiments, we show how our learned model transfers to challenging real-world scenes, visually generating better results than existing methods.
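Given the decoder's per-pixel outputs, the scene flow of a point on a rigid object follows in closed form from the predicted object center, rotation, and translation. The sketch below computes it for one back-projected pixel; the variable layout is illustrative, not the paper's exact output encoding.

```python
# Rigid scene flow for one pixel from predicted object motion; the numbers
# and the output encoding are illustrative assumptions.
import numpy as np

def rigid_scene_flow(p, center, R, t):
    """Flow of point p on an object rotating by R about `center`, then translated by t."""
    return R @ (p - center) + center + t - p

theta = np.radians(10.0)                       # example: 10 degree yaw
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
p = np.array([0.30, 0.10, 0.80])               # back-projected pixel (meters)
center = np.array([0.25, 0.10, 0.80])          # predicted object center
t = np.array([0.02, 0.00, 0.00])               # predicted object translation
print(rigid_scene_flow(p, center, R, t))       # 3-D motion vector for this pixel
```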
This work proposes a novel deep network architecture to solve the camera ego-motion estimation problem. A motion estimation network generally learns features similar to optical flow (OF) fields starting from sequences of images. This OF can be described by a lower-dimensional latent space. Previous research has shown how to find linear approximations of this space. We propose to use an autoencoder network to find a nonlinear representation of the OF manifold. In addition, we propose to learn the latent space jointly with the estimation task, so that the learned OF features become a more robust description of the OF input. We call this novel architecture latent space visual odometry (LS-VO). The experiments show that LS-VO achieves a considerable increase in performance with respect to the baselines, while the number of parameters of the estimation network increases only slightly.
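The joint-learning idea can be summarized as a shared flow encoder feeding both an OF reconstruction decoder and an ego-motion regression head, trained with the sum of both losses. The sketch below is a much-reduced stand-in for LS-VO; the layer sizes, the 6-DoF pose output, and the unweighted loss sum are assumptions.

```python
# Reduced sketch of joint OF autoencoding + ego-motion regression; sizes
# and loss weighting are assumptions, not LS-VO's published configuration.
import torch
import torch.nn as nn

class LatentSpaceVO(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1))
        self.pose_head = nn.Linear(latent_dim, 6)   # 3 translation + 3 rotation

    def forward(self, flow):
        z = self.encoder(flow)                      # nonlinear OF latent code
        return self.decoder(z), self.pose_head(z)   # reconstruction, ego-motion

model = LatentSpaceVO()
flow = torch.randn(4, 2, 64, 64)                    # batch of 2-channel OF fields
recon, pose = model(flow)
loss = (nn.functional.mse_loss(recon, flow)          # autoencoding objective
        + nn.functional.mse_loss(pose, torch.zeros(4, 6)))  # dummy pose targets
loss.backward()                                      # one joint training step
```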
In this letter, we present a data-driven method for scene parsing of road scenes that utilizes single-channel near-infrared (NIR) images. To overcome the lack of training data in the non-RGB spectrum, we define a new color space and decompose the task of deep scene parsing into two subtasks, with two separate CNN architectures for chromaticity channels and semantic masks. For chromaticity estimation, we build a spatially aligned RGB-NIR image database (40k urban scenes) to infer color information through an RGB-NIR spectrum learning process, and we leverage existing scene parsing networks trained over already available RGB masks. From our database, we sample key frames and manually annotate them (4k ground truth masks) to fine-tune the network to the proposed color space. The key contribution of this work is to replace multispectral scene parsing methods with a simple yet effective approach using single NIR images. The benefits of using our algorithm and dataset are confirmed in qualitative and quantitative experiments.
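At inference time the two-subtask decomposition can be pictured as: one network estimates chromaticity from the NIR input, the NIR channel itself supplies a luminance-like component, and a second network parses the recomposed image. Both networks below are toy placeholders, and the recomposition into the proposed color space is an assumption about the pipeline.

```python
# Toy sketch of the two-subtask pipeline; both networks and the color
# recomposition are placeholder assumptions, not the paper's architectures.
import torch
import torch.nn as nn

chroma_net = nn.Sequential(                 # subtask 1: NIR (1ch) -> chromaticity (2ch)
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 3, padding=1), nn.Sigmoid())
parse_net = nn.Sequential(                  # subtask 2: recomposed 3ch -> class logits
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 12, 3, padding=1))        # e.g. 12 road-scene classes

nir = torch.rand(1, 1, 128, 256)            # single-channel NIR road image
chroma = chroma_net(nir)                    # estimated chromaticity channels
recomposed = torch.cat([nir, chroma], dim=1)   # NIR as luminance + chromaticity
mask = parse_net(recomposed).argmax(1)      # per-pixel semantic labels
print(mask.shape)                           # (1, 128, 256)
```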
The commercial use of unmanned aerial vehicles (UAVs) would be enhanced by an ability to sense and avoid potential mid-air collision threats. In this letter, we propose a new approach to aircraft detection for long-range vision-based sense and avoid. We first train a deep convolutional neural network to learn aircraft visual features using flight data of mid-air head-on near-collision course encounters between two fixed-wing aircraft. We then propose an approach that fuses these learnt aircraft features with hand-crafted features that are used by the current state of the art. Finally, we evaluate the performance of our proposed approach on real flight data captured from a UAV, where it achieves a mean detection range of 2527 m and a mean detection range improvement of 299 m (or 13.4%) compared to the current state of the art with no additional false alarms.
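The fusion step concatenates the learnt deep features with the hand-crafted features used by the prior state of the art before scoring a candidate region. In the sketch below, both feature extractors and the logistic scorer are illustrative stand-ins with made-up dimensions.

```python
# Feature-fusion sketch; the extractors, dimensions, and scorer are
# illustrative stand-ins, not the paper's detectors.
import numpy as np

def cnn_features(patch):
    """Stand-in for the trained deep network's feature vector (64-D here)."""
    return np.tanh(patch.reshape(-1)[:64])

def handcrafted_features(patch):
    """Stand-in for the hand-crafted features of prior detectors (8-D here)."""
    return np.array([patch.mean(), patch.std(), patch.max(), patch.min(),
                     np.median(patch), patch.sum(), patch.var(), np.ptp(patch)])

def detection_score(patch, w, b):
    """Fuse both feature sets by concatenation, then score the candidate."""
    fused = np.concatenate([cnn_features(patch), handcrafted_features(patch)])
    return 1.0 / (1.0 + np.exp(-(w @ fused + b)))   # logistic detection score

rng = np.random.default_rng(0)
patch = rng.random((16, 16))                 # candidate image region
w, b = rng.normal(size=72), 0.0              # 64 + 8 fused dimensions
print(detection_score(patch, w, b))
```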