III. PRELIMINARIES

A. Implicit Neural Representations

Implicit neural representations [51] are continuous functions Φ that are parameterized by neural networks and satisfy equations of the form:

    F(p, Φ_θ) = 0,   Φ_θ : p → Φ_θ(p)        (1)

where p ∈ R^m is a spatial or spatio-temporal coordinate, Φ_θ is implicitly defined by the relations encoded in F, and θ refers to the parameters of the neural network. As Φ_θ is defined on a continuous domain of p (e.g., Cartesian coordinates), it serves as a memory-efficient and differentiable representation of high-resolution 3D data compared to explicit, discrete representations.
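For concreteness, the sketch below shows one way such a coordinate-based network could look in PyTorch: a small MLP Φ_θ that maps a continuous 3D coordinate p to a scalar value. The layer widths and activations are illustrative assumptions, not an architecture prescribed by this paper.

    import torch
    import torch.nn as nn

    class CoordinateMLP(nn.Module):
        """A coordinate-based network Phi_theta: R^3 -> R (illustrative)."""
        def __init__(self, in_dim=3, hidden=128, out_dim=1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, p):
            # p: (batch, m) continuous coordinates; here m = 3.
            return self.net(p)

    # The representation can be queried at arbitrary continuous coordinates,
    # and the output is differentiable with respect to both p and theta.
    phi = CoordinateMLP()
    values = phi(torch.rand(1024, 3))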
B. Occupancy Networks

Here we introduce the Occupancy Network (ONet) [34], one of the pioneering works that learn implicit neural representations for 3D reconstruction. Given an observation of a 3D object, ONet fits a continuous occupancy field defined over the bounding 3D volume of the object. The occupancy field is defined as:

    o : R^3 → {0, 1}.        (2)

Following Equation (1), the occupancy function in ONet is defined as:

    Φ_θ(p) − o(p) = 0.        (3)

The resulting occupancy function Φ_θ(p) maps a 3D location p ∈ R^3 to an occupancy probability between 0 and 1. To generalize over shapes, we need to estimate occupancy functions conditioned on a context vector x ∈ X computed from visual observations of the shape. A function that conditions on the context x ∈ X and instantiates an implicit function can thus be written as Φ_θ : R^3 × X → [0, 1].
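As a sketch of this conditioning, the snippet below concatenates a query point with a context vector x and supervises the predicted occupancy probability with binary cross-entropy. The architecture is an illustrative assumption, and the concatenation-based conditioning is one simple option, not necessarily ONet's exact mechanism.

    import torch
    import torch.nn as nn

    class ConditionalOccupancyNet(nn.Module):
        """Phi_theta(p, x): R^3 x X -> [0, 1] (illustrative sketch)."""
        def __init__(self, context_dim=256, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + context_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, p, x):
            # p: (batch, 3) query points; x: (batch, context_dim) shape code.
            return torch.sigmoid(self.net(torch.cat([p, x], dim=-1)))

    # Supervision: binary occupancy labels b at sampled 3D points.
    model = ConditionalOccupancyNet()
    p = torch.rand(64, 3)
    x = torch.randn(64, 256)                  # e.g., from a visual encoder
    b = torch.randint(0, 2, (64, 1)).float()  # ground-truth occupancy
    loss = nn.functional.binary_cross_entropy(model(p, x), b)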
IV. PROBLEM FORMULATION

A. Assumptions

We assume a robot arm equipped with a parallel-jaw gripper in a cubic workspace with a planar tabletop. We initialize the workspace by placing multiple rigid objects on the tabletop. A single-view depth image taken with a fixed side-view depth camera is fused into a Truncated Signed Distance Function (TSDF) and then passed into the model. The model outputs 6-DoF grasp pose predictions and associated grasp qualities. We train our model with grasp trials generated in a self-supervised fashion from a physics engine. In simulated training, we assume the poses and shapes of the objects are known.

B. Notations

Observations: Given the depth image captured by the depth camera with known intrinsics and extrinsics, we fuse it into a TSDF: an N^3 voxel grid V where each cell V_i contains the truncated signed distance to the nearest surface. We believe this additional distance-to-surface information provided by the TSDF can improve grasp detection performance. Therefore, we convert the raw depth image into the TSDF volume V and pass it into the model.
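The fusion step can be sketched directly: project each voxel center into the camera, compare its depth along the ray against the observed depth image, and truncate. The minimal single-view version below is an illustration only; the grid resolution, workspace size, and truncation distance are assumed values, and practical pipelines typically rely on a library implementation instead.

    import numpy as np

    def fuse_tsdf(depth, K, T_cam_world, N=40, size=0.3, trunc=0.012):
        """Fuse one metric depth image (H, W) into an N^3 TSDF grid.

        K: (3, 3) camera intrinsics; T_cam_world: (4, 4) world-to-camera
        extrinsics. All numeric defaults are illustrative assumptions.
        """
        H, W = depth.shape
        # Voxel centers of a cubic workspace with edge length `size`.
        c = (np.arange(N) + 0.5) * (size / N)
        gx, gy, gz = np.meshgrid(c, c, c, indexing="ij")
        pts = np.stack([gx, gy, gz, np.ones_like(gx)], -1).reshape(-1, 4)
        # Transform voxel centers into the camera frame and project.
        cam = (T_cam_world @ pts.T).T[:, :3]
        z = cam[:, 2]
        z_safe = np.maximum(z, 1e-9)  # avoid division by zero behind the camera
        u = np.round(cam[:, 0] * K[0, 0] / z_safe + K[0, 2]).astype(int)
        v = np.round(cam[:, 1] * K[1, 1] / z_safe + K[1, 2]).astype(int)
        ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        V = np.full(N ** 3, np.nan)  # NaN marks unobserved voxels
        d = depth[v[ok], u[ok]]      # observed surface depth per voxel ray
        # Signed distance to the surface along the ray, truncated and
        # normalized to [-1, 1]; real implementations skip voxels far
        # behind the surface instead of clamping them.
        V[ok] = np.clip(d - z[ok], -trunc, trunc) / trunc
        return V.reshape(N, N, N)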
Grasps: We define a 6-DoF grasp g by the grasp center position t ∈ R^3, the orientation r ∈ SO(3) of the gripper, and the opening width w ∈ R between the fingers.

Grasp Quality: A scalar grasp quality q ∈ [0, 1] estimates the probability of grasp success. We learn to predict the quality of a grasp from the binary success label obtained by executing the grasp trial in simulation.

Occupancy: For an arbitrary point p ∈ R^3, the occupancy b ∈ {0, 1} is a binary value indicating whether this point is occupied by any of the objects in the scene.
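Taken together, these notations suggest a simple record for one grasp trial, sketched below; the field names and the quaternion convention for r are our own illustrative choices, not notation fixed by the paper.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GraspTrial:
        """One grasp trial; field names and conventions are illustrative."""
        t: np.ndarray  # grasp center position in R^3, shape (3,)
        r: np.ndarray  # gripper orientation in SO(3), here a unit quaternion (4,)
        w: float       # opening width between the fingers (meters)
        success: bool  # binary outcome of executing the trial in simulation

    # The model predicts a quality q in [0, 1] for a grasp; the binary
    # `success` label of trials like this one supervises that prediction.
    g = GraspTrial(t=np.array([0.15, 0.10, 0.05]),
                   r=np.array([0.0, 0.0, 0.0, 1.0]),
                   w=0.04, success=True)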
[Fig. 2 diagram: Input TSDF → 3D Conv → 3D feature grid → projection and aggregation onto canonical planes → 2D U-Nets → structured feature grids; implicit functions map a query point p to grasp parameters and an occupancy probability.]
Fig. 2: Model architecture of GIGA. The input is a TSDF fused from the depth image. After a 3D convolution layer, the output 3D voxel features are projected onto canonical planes and aggregated into 2D feature grids. After passing each of the three feature planes through an independent U-Net, we query the local feature at the grasp center or occupancy query point with bilinear interpolation. The affordance implicit functions predict grasp parameters from the local feature at the grasp center, and the geometry implicit function predicts the occupancy probability from the local feature at the query point.
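The per-point lookup described in the caption reduces to projecting a 3D query point onto the three canonical planes and bilinearly interpolating each 2D feature grid at the projected location. The sketch below shows one way to express this with torch.nn.functional.grid_sample; the feature size, the concatenation of the three plane features, and the coordinate normalization are assumptions for illustration, not GIGA's exact code.

    import torch
    import torch.nn.functional as F

    def query_plane_features(planes, p):
        """Interpolate local features for 3D query points.

        planes: dict of 2D feature grids, shape (B, C, H, W), keyed by
                the canonical planes 'xy', 'xz', 'yz'.
        p:      (B, M, 3) query points normalized to [-1, 1]^3.
        Returns (B, M, 3*C) concatenated local features.
        """
        proj = {"xy": p[..., [0, 1]], "xz": p[..., [0, 2]], "yz": p[..., [1, 2]]}
        feats = []
        for name, grid in planes.items():
            uv = proj[name].unsqueeze(1)  # (B, 1, M, 2) sampling locations
            f = F.grid_sample(grid, uv, mode="bilinear", align_corners=True)
            feats.append(f.squeeze(2).transpose(1, 2))  # (B, M, C)
        return torch.cat(feats, dim=-1)

    # Query the structured feature grids at a grasp center / occupancy point.
    planes = {k: torch.randn(1, 32, 40, 40) for k in ("xy", "xz", "yz")}
    p = torch.rand(1, 5, 3) * 2 - 1          # points in [-1, 1]^3
    local = query_plane_features(planes, p)  # shape (1, 5, 96)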