The Dex-Net project at large has been a promising venture. It was created by Jeff Mahler and Prof. Ken Goldberg and is currently maintained by the Berkeley AUTOLAB. To summarise, it draws together code, datasets and algorithms for “generating datasets of synthetic point clouds, robot parallel-jaw grasps and metrics of grasp robustness based on physics for thousands of 3D object models to train machine learning-based methods to …develop highly reliable robot grasping across a wide variety of rigid objects.”2
Dex-Net 1.0
The first Dex-Net paper introduced a dataset and associated algorithm “to study the scaling effects of Big Data and Cloud Computation on robust grasp planning.”3 It includes a Multi-Armed Bandit model with rewards and learned grasps, in conjunction with 3D object model sets
1 Releasing the Dexterity Network (Dex-Net) 2.0 Dataset for Deep Grasping 2017, https://arxiv.org/pdf/1703.09312.pdf
2 https://berkeleyautomation.github.io/dexnet/
3Mahler & Goldberg (2016).
(10,000 unique) and 2.5 million parallel-jaw grasps. Each grasp carries an estimate of the probability of force closure under uncertainty in object pose, gripper pose and friction.4
CNNs are introduced as ‘Multi-View Convolutional Neural Networks’ (MV-CNNs). Classification is used as a similarity metric between objects, and the Google Cloud Platform is utilised to process in parallel across 1,500 virtual cores, giving a claimed runtime reduction of up to three orders of magnitude.5
The MAB algorithm is based on Continuous Correlated Beta Processes, an efficient model for predicting a belief distribution over grasp quality from the prior data.6
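The belief-update loop behind such a bandit model can be sketched in standard-library Python. This is a deliberately simplified, uncorrelated Beta-Bernoulli Thompson-sampling illustration (the paper's Continuous Correlated Beta Processes additionally share evidence between similar grasps); the function name and the toy success probabilities below are hypothetical.

```python
import random

def thompson_sampling(success_probs, trials=2000, seed=0):
    """Select grasps to try via Thompson sampling on independent Beta
    posteriors. success_probs holds a hypothetical true success rate for
    each candidate grasp; returns the posterior (alpha, beta) counts."""
    rng = random.Random(seed)
    # Beta(1, 1) prior for every grasp: alpha counts successes + 1,
    # beta counts failures + 1.
    alpha = [1.0] * len(success_probs)
    beta = [1.0] * len(success_probs)
    for _ in range(trials):
        # Sample a plausible quality for each grasp from its posterior ...
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        # ... and execute the grasp whose sampled quality is highest.
        g = samples.index(max(samples))
        if rng.random() < success_probs[g]:
            alpha[g] += 1.0  # observed success
        else:
            beta[g] += 1.0   # observed failure
    return alpha, beta

alpha, beta = thompson_sampling([0.2, 0.5, 0.9])
# After 2000 simulated pulls the belief concentrates on the best grasp.
best = max(range(3), key=lambda g: alpha[g] / (alpha[g] + beta[g]))
```

The appeal over uniform evaluation is that trials concentrate on promising grasps while uncertainty about the others is still represented explicitly.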
Dex-Net 2.0
In the second publication the aim shifts to reducing data-collection time for deep-learning grasp planning. Training uses an enlarged dataset (6.7 million point clouds, grasps and analytic grasp metrics; note the shift in terminology) generated via 3D modelling. The new data (Dex-Net 2.0) is used to train a Grasp Quality Convolutional Neural Network (GQ-CNN), a model that rapidly predicts the probability of success of grasps from depth images. Grasps are specified as the planar position, angle and depth of a gripper relative to an RGB-D sensor.7
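The grasp parameterisation and the planner's predict-then-rank loop can be sketched as follows. This is an illustrative stand-in, not the authors' code: the toy quality function replaces a real GQ-CNN forward pass, and all names and values are hypothetical.

```python
from dataclasses import dataclass
import math

@dataclass
class PlanarGrasp:
    """Dex-Net 2.0-style grasp parameterisation: planar centre (u, v) in
    depth-image pixels, gripper angle in the image plane, and grasp depth
    relative to the camera."""
    u: float
    v: float
    angle: float   # radians, in the image plane
    depth: float   # metres along the camera axis

def plan_grasp(candidates, quality_fn, threshold=0.5):
    """Rank candidate grasps by predicted robustness (stand-in for the
    GQ-CNN) and return the best one if it clears the threshold."""
    scored = sorted(((quality_fn(g), g) for g in candidates),
                    key=lambda t: t[0], reverse=True)
    best_q, best_g = scored[0]
    return best_g if best_q >= threshold else None

# Toy quality function (hypothetical): prefer grasps near the image centre.
quality = lambda g: math.exp(-((g.u - 320) ** 2 + (g.v - 240) ** 2) / 1e5)
grasps = [PlanarGrasp(100, 80, 0.0, 0.65), PlanarGrasp(315, 250, 1.2, 0.70)]
chosen = plan_grasp(grasps, quality)
```

The threshold mirrors the precision-oriented evaluation in the paper: when no candidate is confidently robust, the planner returns nothing rather than a risky grasp.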
Experimental trials were run on an ABB YuMi, comparing grasp plans on objects in isolation. In summary, the results showed that a GQ-CNN trained with synthetic data could ‘plan’ a grasp in 0.8 seconds, achieving a 93% success rate on eight previously unencountered objects with ‘adversarial geometry’.8 This is noted to be three times faster than registering point clouds to a precomputed dataset of objects and indexing grasps. In a further test, 99% precision was achieved on ten objects drawn from a dataset of forty household objects.
Dex-Net 3.0
The third iteration explores vacuum suction end effectors as an alternative to parallel-jaw grasping. The benefits of vacuum suction are described as: simplicity (a single point of contact); the ability to target a planar surface; and the computability of the external gravity wrench in conjunction with the quality of the seal between suction cup and target surface.
Learning Deep Policies for Robot Bin Picking by Simulating Robust Grasping Sequences
Focusing on the task of ‘bin picking’ (picking distinct objects from an unsorted heap and sequentially depositing them in a separate container), this iteration of Dex-Net, with its ancillary
4 Ibid
5 Ibid
6 Ibid
7 Mahler & Goldberg (2017). Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312.
8 Ibid
software and datasets, represents the latest stage in the research project. Following on from the prior techniques (e.g. CNNs), we see the introduction of a discrete-time Partially Observable Markov Decision Process that specifies “states of the heap, point-cloud observations, and rewards.”9
Machine learning is performed via algorithmic supervision using full-state information in order to “optimise for the most robust collision-free grasp in a forward simulator based on pybullet to model dynamic object-object interactions and robust wrench space analysis from the Dexterity Network (Dex-Net) to model quasi-static contact between the gripper and object.”10 A GQ-CNN (denoted 2.1) is trained to classify the supervisor's actions over 10,000 simulated trials with noise injection. Further to this, 2,192 physical trials were performed on the same hardware, giving a 94% success rate (defined as the percentage of grasp attempts that moved an object to the packing box) and 96% precision on heaps of 5-10 objects. Although accurate, it takes upwards of three minutes to clear any more than ten objects. This is experimental research: erudite though it may be, compared with the performance of a human being or a specialised industrial machine it is hampered by computational inefficiency.
3.2 Problem Definitions
The table below provides a basic overview of the general structure of each research phase in the project as a whole.
Table 1
Version
Definitions and Problem Statements Introduced
Dex-Net 1.0
1. Grasp and object parametrisation
2. Sources of uncertainty
3. Contact Model
4. Quality Metric
5. Objective
Dex-Net 2.0
1. Assumptions
2. Definitions: states, grasps, point clouds, robust analytic grasp metrics
3. Objective
Dex-Net 3.0¹¹
Learning Deep Policies for Robot Bin Picking by Simulating Robust Grasping Sequences
1. Goal (learn a policy that takes as input point clouds from an overhead depth camera and outputs a robust grasp, or gripper pose, to remove an object from the heap with a confidence value)
9 Mahler & Goldberg (2017). Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pp. 515-524.
10 Ibid
11 Not relevant (suction grasping unavailable).
2. Assumptions
3. POMDP Model (initial state distribution, states, actions, observations, rewards, next-state distribution, observation distribution)
4. Policy
5. Objective
3.3 Model Performance
Refer to fig. 1 (overleaf) for results on five bin-picking policies (benchmark: N = {5, 10} test objects from the Basic subset, for 20 and 10 trials respectively); performance is measured against heap size and against objects matching simulator assumptions.
Policies compared:
1. Force Closure (random planar grasp satisfying force closure, friction coefficient = 0.8)
2. Dex-Net 2.0: ranks grasps using the GQ-CNN trained on the Dex-Net 2.0 dataset
3. Dex-Net 2.1 (ε = 0.1, 0.5, 0.9): grasps ranked by a fine-tuned classifier for varying levels of noise injection in the training dataset. The Dex-Net 2.1 (ε = 0.9) variant performed best across all metrics, its advantage over other noise levels plausibly due to the training data being skewed negative (it predicts grasp failure when uncertain).
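The noise-injection scheme behind these variants can be sketched in a few lines: with some probability a random action replaces the supervisor's action during demonstration collection, so the learner also sees (and learns to recover from) off-policy states. This is a hedged illustration; the supervisor, action space and parameter values are hypothetical stand-ins for the paper's setup.

```python
import random

def noisy_rollout(supervisor, action_space, eps, steps, seed=0):
    """Collect a demonstration where, with probability eps, a random
    action replaces the algorithmic supervisor's action (noise injection
    for imitation learning)."""
    rng = random.Random(seed)
    actions = []
    for t in range(steps):
        if rng.random() < eps:
            actions.append(rng.choice(action_space))  # exploratory action
        else:
            actions.append(supervisor(t))             # supervisor's robust grasp
    return actions

# Hypothetical supervisor that always picks action 0 (the "most robust grasp");
# eps = 0.9 mirrors the best-performing noise level reported above.
demo = noisy_rollout(lambda t: 0, action_space=[0, 1, 2, 3], eps=0.9, steps=1000)
```

At eps = 0.9 the supervisor's action still appears most often overall (it is chosen directly 10% of the time and also by chance among the random picks), which is consistent with the negatively skewed training data described above.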
Figure 1
3.4 Generalisation
The Dex-Net 2.1 (ε = 0.9) policy was trialled on larger heaps (N = {10, 20}, 50 objects, 5 independent trials each) to evaluate generalisation to piles of novel objects not encountered in simulation; results are detailed in Figure 1. While performance decreases across all categories, the ε = 0.9 policy outperforms Dex-Net 2.0 and the antipodal baseline on all metrics. Heap size was found to inhibit performance more than object deformability. Failure modes included collisions between the gripper and objects, and thin, curved profiles (‘adversarial geometry’).
3.5 Discussion and Future Work
Policies trained using behaviour cloning and high levels of noise injection (90% probability of selecting a random action) have the highest performance across metrics when executed on the physical robot. Conservative policies favouring false negatives over false positives may transfer better from simulation to reality. Hierarchical policies for bin picking are cited as requiring development in future work.12
3.6 Current state of the art
Key features of the latest Dex-Net iteration are CNNs, POMDPs, synthetic demonstrations via algorithmic supervision for imitation learning, optimisation based on dynamic modelling of object interactions, and GQ-CNN learning (a 10,000-entry dataset of simulated rollouts). The paper explicitly contributes uncertainty modelling during dataset generation, an enhanced policy for a single viewpoint, and experimental evaluations. Over 2,192 physical trials on an ABB YuMi with 50 novel objects, a 94% success rate was achieved with 96% precision. The time taken to clear a 5-10 object heap was approximately 3 minutes. The paper notes few false positives during experimentation.[5]
The challenges identified in the current research are:
1. Partially obscured objects or obstacles prevent singular object perception
2. Disorganisation is compounded by sensor noise and by obstruction or occlusion
2.1 A priori inference is impossible via point cloud learning alone
3. Time cost of physical data collection (in acquiring clean or voluminous data)
4. Synthetic training necessitates multiple scene viewpoints to be robust (datasets = grasps and point clouds labelled using geometric conditions, e.g. grasp stability/antipodality)
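One of the geometric labelling conditions named above, antipodality, can be illustrated with a 2D sketch: the line between the two contact points must lie inside both friction cones, i.e. the angle between the grasp axis and each inward surface normal must not exceed atan(mu). This is a simplification for illustration only (Dex-Net's actual labels come from 3D wrench-space analysis), and the contact points, normals and friction coefficient below are made up.

```python
import math

def is_antipodal(p1, n1, p2, n2, mu):
    """2D antipodality check for a two-contact parallel-jaw grasp.
    p1, p2: contact points; n1, n2: inward unit surface normals;
    mu: friction coefficient defining the friction-cone half-angle."""
    axis = (p2[0] - p1[0], p2[1] - p1[1])
    norm = math.hypot(axis[0], axis[1])
    axis = (axis[0] / norm, axis[1] / norm)
    half_angle = math.atan(mu)

    def within_cone(normal, direction):
        dot = normal[0] * direction[0] + normal[1] * direction[1]
        return math.acos(max(-1.0, min(1.0, dot))) <= half_angle

    # Contact 1's inward normal must align with the grasp axis,
    # contact 2's with its negation.
    return within_cone(n1, axis) and within_cone(n2, (-axis[0], -axis[1]))

# Opposite faces of a box: antipodal even with modest friction.
ok = is_antipodal((0, 0), (1, 0), (1, 0), (-1, 0), mu=0.5)
# Normals at right angles to the grasp axis: not antipodal.
bad = is_antipodal((0, 0), (0, 1), (1, 0), (-1, 0), mu=0.5)
```

Labels of this kind are cheap to compute over synthetic point clouds, which is what makes large-scale dataset generation tractable.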
In its current state the simulation in the software package can be summarised in three stages:
12 Mahler & Goldberg (2017). Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312.
1. The initial state distribution ρ0 is sampled via uniform sampling of m 3D CAD models, dropped into randomised poses in pybullet; xt denotes the state (the objects and their poses in the heap)
2. Generation of demonstrations of robotic grasps using the algorithmic supervisor π* from Dex-Net 2.0, indexing the most robust grasps from a pre-planned database using full-state knowledge
3. Aggregation of point cloud observations and rewards
3.1 Labelled dataset for training a policy that classifies supervisor actions on partial observations using imitation learning
3.2 Preprocessing of training data via point cloud transformation/alignment, centre-point calibration and axis-pixel relationships
The POMDP model is specified as a tuple (X, U, Y, R, ρ0, p, q) (paraphrased in Appendix 2).
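Assuming a named-field container is a faithful paraphrase of that tuple, the model can be written out as follows; all concrete values here are toy stand-ins, not the paper's definitions.

```python
from typing import Callable, NamedTuple

class POMDP(NamedTuple):
    """The (X, U, Y, R, rho0, p, q) tuple, held as named fields
    (a paraphrase for illustration, not the authors' code)."""
    states: object        # X: heap states (object identities and poses)
    actions: object       # U: candidate grasps
    observations: object  # Y: point clouds from the overhead depth camera
    reward: Callable      # R: reward, here scored on the state transition
    rho0: Callable        # initial state distribution (sampled heaps)
    p: Callable           # next-state distribution p(x' | x, u)
    q: Callable           # observation distribution q(y | x)

# Toy instantiation: the "state" is just the heap size, and every grasp
# deterministically removes one object.
model = POMDP(
    states="heap configurations",
    actions="planar grasps",
    observations="depth point clouds",
    reward=lambda x, x_next: 1 if x_next < x else 0,  # 1 when an object is cleared
    rho0=lambda: 10,                                   # start with a 10-object heap
    p=lambda x, u: x - 1,
    q=lambda x: f"point cloud of {x} objects",
)
```

The real model is stochastic in both p and q (dynamics noise and sensor noise), which is exactly what the noise-injection training described above is designed to cope with.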
4. Wider Literature Survey
4.1 Thematic Deconstruction
From a survey of over 300 publications directly related to the research area, 50 were selected for thematic prominence across fields and relevance to this particular research problem. From these it was possible to deconstruct the literature into a set of key recurrent thematic areas.
4.2 Recurrent Thematic Areas13
Agent Architecture
Papers focused on robotic control, reasoning and belief using agent architectures
Data Representation
Visualisation and analysis of abstract datasets in highdimensional space
Fuzzy Logic
Robotic control systems implementing fuzzy inference
Grasp Planning
Applied robotics or studies and experiments directly relating to practical robotics in the context of grasping
Imaging
Computer imaging techniques for compression, dimensional reduction and visual correspondence
Inverse Kinematics
Mathematical process of recovering the joint configuration that achieves a desired pose of an end effector or object in the world
Machine Learning
CNNs, NEAT and other domain-specific machine-learning theory
Object Analysis
Object or target specific analysis pertaining specifically to morphological interpretations in geometric or spatial terms
Occlusion
Research considering obscuration of targets or images, quasi-observable scenarios, inferred reconstructive techniques and strategies for adversarial conditions
POMDPs
Partially observable Markov decision processes and action in stochastic domains
Psychology
Clinical studies of human interpretation of patterns, objects and general environmental epistemology
13 [Please note that certain of these sources were omitted from the body of review due to the length constraint of the exercise but would have been included in a longer publication.]
4.3 CCLP Supra Categories
Further to the above thematic reductions the following four topics emerge as overarching categories encompassing the above and transcending their delineation.
Compression, e.g. reduction of data and performance optimisation
Control
Learning
Perception, e.g. of similarity, shape inference, object analysis etc.
CCLP absorbs voluminous interdisciplinary research in a highly active field. Emphasis has been given to those areas providing the greatest contribution to the current state of the art and to experimental performance in robotic grasping. Categorical methods allow the incorporation of disparate areas, somewhat abstract from the main field, that are most likely to provide contributory understanding to an engineered problem response within a structured framework. This observation will be axiomatic in my project planning.
4.4 Research Titles / Alluvial Diagram
Figure 2 (n.b. please see additional file for higher resolution)
In representing the research pool as per fig. 2, we may note where the bulk of contributory information is drawn from, by topic and category: Grasp Planning and Perception currently dominate the literature landscape in terms of contribution. Interestingly, this also reveals deeper connections, such as Perception being comprised mainly of Occlusion and Object Analysis. On a subtler level, Grasp Planning being linked to all fields but Perception could indicate that bridging is required between these supra-categories in order to synthesise progress made in either area of scientific enquiry.
Charting the flow of information enables scalar visualisation via nodes (refer to the key), and hence identifies the substance of each interdisciplinary area and its precedence within the concerted effort of the whole, giving some estimation of suitable time allocation.
5. Agent Architecture
Platt Jr, Robert, et al. “Belief space planning assuming maximum likelihood observations.” (2010).
Control problems in partially observable environments are critical to robotics and mechatronics: even severely reductive dimensional approximations of ‘belief-state distributions’ necessitate planning in dimensions greater than the original state space.
“All robots ultimately perceive the world through limited and imperfect sensors. A number of powerful tools exist for planning and control of high-dimensional nonlinear underactuated systems, in particular, we use linear quadratic regulation (LQR) to calculate belief space policies based on a local linearization of the belief space dynamics.”14
The paper’s contribution is recasting belief-space planning so that conventional control techniques are applicable, enabling policy discovery using linear quadratic regulation and direct transcription of locally optimal belief-space trajectories. Results characterising the effectiveness of a ‘plan/replan’ strategy are given, showing relevant behaviours on grasp problems where prior knowledge acquisition is required.15
The positive outcomes of this study lend credence to these approaches and their suitability for driving robotic control systems, particularly in relation to grasping. The paper provides very detailed definitions of Partially Observable Markov Decision Processes, belief systems and planning, and indicates that underactuated planning and control approaches could be a fruitful line of enquiry for future research.
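The LQR machinery the paper applies in belief space can be illustrated on the simplest possible case, a scalar linear system, by iterating the discrete-time Riccati recursion to a fixed point. This is a sketch of standard LQR under toy parameters, not the paper's belief-space variant (there the "state" is the linearised belief dynamics).

```python
def dlqr_scalar(a, b, q, r, iters=500):
    """Discrete-time LQR gain for the scalar system x' = a*x + b*u with
    cost sum(q*x**2 + r*u**2): iterate the Riccati recursion to a fixed
    point and return the feedback gain k (control law u = -k*x)."""
    p = q  # value-function coefficient, initialised at the stage cost
    for _ in range(iters):
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
    return k

# Unit system a = b = q = r = 1: the Riccati fixed point is the golden
# ratio p = (1 + sqrt(5)) / 2, giving k = p / (1 + p) = 1/p ~= 0.618,
# and the closed loop x' = (a - b*k) * x is stable.
k = dlqr_scalar(a=1.0, b=1.0, q=1.0, r=1.0)
stable = abs(1.0 - 1.0 * k) < 1.0
```

The appeal for belief-space planning is exactly this: once the dynamics are linearised, the policy drops out of a cheap fixed-point computation rather than a search over a high-dimensional belief space.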
6. Data Representation
Maaten, Laurens van der, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research, 9 (Nov 2008): 2579-2605.
A variation of Stochastic Neighbour Embedding (Hinton and Roweis, 2002) offering easier optimisation and clearer visualisation of high-dimensional data. Each datapoint is given a location in a two- or three-dimensional map. Crowding is reduced, so structure is visible at several scales. For visualising the structure of very large data sets, t-SNE is demonstrated using random walks on neighbourhood graphs to “allow the implicit structure of all the data to influence the way in which a subset of the data is displayed.”16 Performance is measured on a range of data sets and compared with several non-parametric visualisation techniques, e.g. Locally Linear Embedding (t-SNE offers the best performance).
Visualization of high-dimensional data is an important problem in many different domains, and deals with data of widely varying dimensionality…Most of these techniques simply provide tools to display more than two data dimensions, and leave the interpretation of the data to the human observer. This severely limits the applicability of these techniques to real-world data sets that contain thousands of high-dimensional datapoints.17
t-SNE has O(n²) computational and memory complexity, but the approach “successfully visualizes large real-world data sets with limited computational demands.”18 The authors' experiments on a variety of data sets show that t-SNE outperforms existing state-of-the-art techniques for visualising real-world data.
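The O(n²) cost comes from the pairwise affinity computation at the heart of (t-)SNE, which can be sketched in standard-library Python. For simplicity this uses one fixed bandwidth sigma, whereas t-SNE actually tunes a per-point bandwidth to match a target perplexity; the example points are made up.

```python
import math

def pairwise_affinities(points, sigma=1.0):
    """Symmetrised Gaussian affinities p_ij over high-dimensional points,
    as used on the high-dimensional side of (t-)SNE."""
    n = len(points)
    dist2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
              for j in range(n)] for i in range(n)]
    cond = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Conditional probabilities p_{j|i}: Gaussian kernel, self excluded.
        weights = [math.exp(-dist2[i][j] / (2 * sigma ** 2)) if j != i else 0.0
                   for j in range(n)]
        z = sum(weights)
        for j in range(n):
            cond[i][j] = weights[j] / z
    # Symmetrise: p_ij = (p_{j|i} + p_{i|j}) / (2n), so the p_ij sum to 1.
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
P = pairwise_affinities(pts)
total = sum(sum(row) for row in P)
```

t-SNE then seeks low-dimensional coordinates whose Student-t affinities match this matrix, and it is the heavier tail of the Student-t kernel that relieves the crowding problem mentioned above.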
14 Platt Jr, Robert, et al (2010). Belief space planning assuming maximum likelihood observations. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
15 Ibid
16 Ibid
17 Ibid
18 Ibid
Figure 4 shows t-SNE, Sammon mapping, Isomap and LLE on the COIL-20 data set. t-SNE accurately represents the one-dimensional manifold of viewpoints as a closed loop. For objects which look similar from the front and the back, t-SNE distorts the loop so that images of the front and back are mapped to nearby points. For the four types of toy car in the COIL-20 data set (the four aligned “sausages” in the bottom-left of the t-SNE map), the four rotation manifolds are aligned by the orientation of the cars, capturing the high similarity between different cars at the same orientation. This prevents t-SNE from keeping the four manifolds clearly separate.