
Principles of Robot Autonomy II

Learning-based Approaches to Manipulation & Interactive Perception


Jeannette Bohg
Learning Outcomes for the next four Lectures
• Use Manipulation to Perceive better
• Modeling and Evaluating Grasps
• Apply Learning to Grasping and Manipulation
• Modeling and Executing Manipulation

2/13/23 AA 274B | Lecture 10 2


Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



For a Deeper Dive into Grasping and Manipulation
• CS326 – Topics in Advanced Robotic Manipulation – Fall 2023



Data-Driven Approaches to Grasping
(Figure: taxonomy of data-driven grasp synthesis approaches, annotated with what was covered up till now vs. this lecture.)
"Data-Driven Grasp Synthesis – A Survey" by Bohg et al. Transactions on Robotics 2014.



Slides adapted from CS 326 Fall 2019 -- Tutorial presented by Victor Zhang

Robotic Grasping of Novel Objects using Vision. Saxena et al. IJRR 2008.





Using more sensing modalities and data to learn features and grasp policies
• Dex-Net 1.0 – 4.0 (Berkeley AutoLab)
• Google Arm Farm

"Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection" by Levine et al. IJRR 2017.
"Learning Deep Policies for Robot Bin Picking by Simulating Robust Grasping Sequences" by Mahler and Goldberg. CoRL 2017.
https://berkeleyautomation.github.io/dex-net
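These systems all train a grasp-success predictor from data. As a hedged, minimal sketch (the 32x32 depth crop and layer sizes are illustrative assumptions, not the Dex-Net or Levine et al. architectures), such a predictor over depth-image crops of grasp candidates could look like this:

```python
# Minimal sketch: a small CNN that maps a depth crop centered on a grasp
# candidate to the probability that the grasp succeeds (assumed 32x32 input).
import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 14
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 5
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 5 * 5, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, depth_crop):                       # depth_crop: (B, 1, 32, 32)
        return self.head(self.features(depth_crop))      # grasp-success logit

# One supervised training step on (crop, success-label) pairs.
model = GraspQualityNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
crops = torch.randn(8, 1, 32, 32)
labels = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCEWithLogitsLoss()(model(crops), labels)
loss.backward()
optimizer.step()
```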



Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Controlling through contact
(Figure: a robot finger pushes an object toward a goal.)

Controlling through contact
Better models, better feedback: sensory observations feed a predictive model; the model's predicted effect of a candidate action is used for model-predictive control.
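As a hedged illustration of the model-predictive control loop in this diagram (the names `predict_next_state` and `goal_cost` are hypothetical placeholders for a learned or analytical model and a task cost, not part of the lecture material):

```python
# Minimal random-shooting MPC sketch: sample action sequences, roll each one out
# through the predictive model, execute only the first action of the best sequence.
import numpy as np

def random_shooting_mpc(state, predict_next_state, goal_cost,
                        horizon=10, num_samples=256, action_dim=2):
    actions = np.random.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    costs = np.zeros(num_samples)
    for i in range(num_samples):
        s = state
        for t in range(horizon):
            s = predict_next_state(s, actions[i, t])   # predicted effect of the action
            costs[i] += goal_cost(s)                   # accumulate predicted cost
    return actions[np.argmin(costs), 0]                # first action of the best sequence

# Toy example: a point "object" pushed toward the origin.
predict = lambda s, a: s + 0.1 * a
cost = lambda s: float(np.sum(s ** 2))
first_action = random_shooting_mpc(np.array([1.0, -0.5]), predict, cost)
```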


Modeling Planar Pushing
• Friction limit surface: describes the friction forces that occur when a part slides over its support surface.
• When pushed with a wrench inside the limit surface: no motion.
• For quasi-static pushing: the wrench lies on the limit surface, and the object twist is normal to the limit surface at that point.
(Figure: relation between wrench cone, limit surface and unit twist sphere. Adapted from Chapter 37, Fig. 37.10 in the Springer Handbook of Robotics.)
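To make the limit-surface picture concrete, a commonly used ellipsoidal approximation (an addition beyond the slide; f_max and m_max denote the maximum frictional force and moment) is:

```latex
\[
  \mathbf{w}^{\top} A\, \mathbf{w} = 1,
  \qquad
  A = \operatorname{diag}\!\left(\frac{1}{f_{\max}^{2}},\, \frac{1}{f_{\max}^{2}},\, \frac{1}{m_{\max}^{2}}\right),
  \qquad
  \mathbf{w} = (f_x, f_y, m_z),
\]
\[
  \text{quasi-static pushing:}\quad
  \mathbf{t} \;\propto\; \nabla_{\mathbf{w}}\!\left(\mathbf{w}^{\top} A\, \mathbf{w}\right) = 2 A \mathbf{w}.
\]
```

The twist being normal to the limit surface at the applied wrench is exactly the relation sketched in the figure above.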
Validating Planar Pushing Models
(Demo: predicting the physical effect of a push relative to a goal, then continuing to predict over a sequence of pushes.)
Bias-Variance Tradeoff
(Figure: error vs. model complexity. Bias falls and variance rises as model complexity grows, with the minimum error in between. Model-based approaches sit at the low-complexity end, data-driven approaches at the high-complexity end; learning structure targets the minimum-error region between them.)
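For reference, the standard bias-variance decomposition behind this plot (a textbook result, not printed on the slide) is:

```latex
\[
  \mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^{2}\right]
  = \mathrm{Bias}\!\left[\hat{f}(x)\right]^{2}
  + \mathrm{Var}\!\left[\hat{f}(x)\right]
  + \sigma^{2}.
\]
```

Model-based approaches tend to sit at the high-bias end and purely data-driven models at the high-variance end, which is the slide's message.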


Hypothesis
Physics Models + Learning = Generalization

Hybrid Model for Extrapolation
(Diagram: sensory observations feed a learned model that outputs physical model parameters; the parameters and the action feed a physics-based model that outputs the predicted effect. The whole pipeline is trained end-to-end with a loss on the predicted effect.)
Testing the Hypothesis on a Case Study
More than a Million Ways to Be Pushed: A High-Fidelity Experimental Dataset of Planar Pushing. Yu et al. IROS 2016.

Analytic Model
(Figure: relation between wrench cone, limit surface and unit twist sphere. Adapted from Chapter 37, Fig. 37.10 in the Springer Handbook of Robotics.)
Compared Architectures (raw sensory observations as input)
• Neural network only: sensory observations and action go into a learned model that directly outputs the predicted effect.
• Hybrid model: sensory observations go into a learned model that outputs physics parameters; the parameters and the action go into a physics-based model that outputs the predicted effect.
Training: end-to-end. Loss: error between predicted and ground-truth effect.
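A hedged sketch of the hybrid architecture, assuming for illustration that the learned part predicts a contact point and a friction-related coefficient and that the analytical model is a differentiable placeholder (the real analytical pushing model from the slides is more involved):

```python
# Hybrid model sketch: network -> physical parameters -> analytical model -> effect,
# trained end-to-end on the error between predicted and ground-truth effect only.
import torch
import torch.nn as nn

class ParameterNet(nn.Module):
    """Maps a raw observation vector to [contact_x, contact_y, friction]."""
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, obs):
        return self.net(obs)

def analytical_push_model(params, action):
    # Hypothetical differentiable stand-in for the analytical pushing model.
    contact, friction = params[:, :2], params[:, 2:3]
    return friction * action + 0.1 * contact            # placeholder dynamics

param_net = ParameterNet()
optimizer = torch.optim.Adam(param_net.parameters(), lr=1e-3)
obs, action, true_effect = torch.randn(16, 64), torch.randn(16, 2), torch.randn(16, 2)
pred_effect = analytical_push_model(param_net(obs), action)
loss = nn.functional.mse_loss(pred_effect, true_effect)  # loss on the effect only
loss.backward()
optimizer.step()
```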
Advantages: Data Efficiency and Extrapolation

Testing Data Efficiency
(Training = test distribution.)


Testing Generalization
(Figure: training vs. testing conditions for new pushing angles & contact points, new push velocities, and new object shapes, covering both interpolation and extrapolation.)

Alina Kloss et al, “Combining learned and analytical models for predicting action effects,” IJRR 2020.
Generalization to new push velocities
(Figure: training vs. testing velocities; extrapolation beyond the training = test distribution.)

Don't throw away structure: learn to extract a given state representation from raw data.
A Concrete Suggestion
(Diagram, as above: sensory observations feed a learned model that outputs physical parameters; these and the action feed a physics-based model that outputs the predicted effect, trained end-to-end with a loss on the effect.)
Added benefit: interpretability.
Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Models will never be perfect
(Figure: the robot's sensing modalities include RGB-D, active stereo, tactile, audio, force/torque and joint torque.)
Human multimodal sensor-motor coordination
(Figure: sensory inputs are mapped to examples of task-relevant information such as pose, geometry, force and friction, which a policy π uses to select actions.)
Generalizable representations for multimodal inputs

$\pi(f(o_1, o_2, o_3, \ldots, o_N)) = a$

Learn a representation $f$ of the multimodal sensory inputs (capturing, e.g., pose, geometry, force, friction), then learn a policy $\pi$ on top of it for new task instances.
Experimental setup
• Multimodal sensory inputs: RGB images, force/torque readings, robot states.
• Training and testing differ in peg geometry.
Learning generalizable representation
(Diagram: the inputs, an RGB image, force data and the robot state, pass through modality-specific encoders and a fusion module that produces a single representation; a decoder predicts task-relevant labels such as pose, geometry, force and friction.)
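A minimal sketch of such a fusion encoder (layer sizes, a 64x64 RGB input, a 32-sample force/torque window and a 7-dimensional robot state are illustrative assumptions, not the architecture of Lee et al.):

```python
# Per-modality encoders followed by a fusion layer producing one representation.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.rgb_enc = nn.Sequential(                       # RGB image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, 64),
        )
        self.force_enc = nn.Sequential(nn.Linear(6 * 32, 64), nn.ReLU())  # F/T window
        self.state_enc = nn.Sequential(nn.Linear(7, 64), nn.ReLU())       # proprioception
        self.fusion = nn.Sequential(nn.Linear(64 * 3, latent_dim), nn.ReLU())

    def forward(self, rgb, force, state):
        z = torch.cat([self.rgb_enc(rgb),
                       self.force_enc(force.flatten(1)),
                       self.state_enc(state)], dim=-1)
        return self.fusion(z)                               # fused representation

encoder = MultimodalEncoder()
z = encoder(torch.randn(2, 3, 64, 64), torch.randn(2, 6, 32), torch.randn(2, 7))
```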


Collect labeled data with self-supervision
(Figure: optical flow and the normalized force signal (x, y, z components over time) provide training labels without manual annotation.)
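As one concrete example of a self-supervised label, a binary contact signal can be derived by thresholding the force magnitude; the threshold below is an illustrative assumption, not necessarily the value used in the paper:

```python
# Derive 0/1 contact labels from the normalized force signal.
import numpy as np

def contact_labels(force_xyz, threshold=0.3):
    """force_xyz: (T, 3) normalized force readings -> (T,) boolean contact labels."""
    return np.linalg.norm(force_xyz, axis=1) > threshold

force = np.random.uniform(-1, 1, size=(100, 3))
labels = contact_labels(force)        # used as self-supervised targets
```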
Dynamics prediction from self-supervision
(Diagram: the robot action, RGB image, force data and robot state are encoded into a representation; decoders predict action-conditional optical flow and whether there will be contact in the next step (0/1).)


Concurrency prediction from self-supervision
(Diagram: the same encoder and decoders as above, plus a decoder that predicts whether the input modalities are time-aligned (0/1). Paired inputs should be classified as time-aligned; deliberately unpaired inputs should be classified as not time-aligned.)
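A hedged sketch of how the three self-supervised objectives named on these slides could be combined into one training loss (the decoder shapes, a 128-dimensional representation and equal loss weights are assumptions for illustration):

```python
# Three self-supervised heads: action-conditional optical flow, next-step contact,
# and time alignment of the modalities (unpaired inputs serve as negatives).
import torch
import torch.nn as nn

flow_decoder = nn.Linear(128 + 2, 2 * 32 * 32)   # representation + action -> dense flow
contact_head = nn.Linear(128 + 2, 1)             # contact in the next step? (logit)
align_head = nn.Linear(128, 1)                   # are the modalities time-aligned? (logit)
bce = nn.BCEWithLogitsLoss()

def self_supervised_loss(z, action, flow_gt, contact_gt, aligned_gt):
    za = torch.cat([z, action], dim=-1)
    flow_loss = nn.functional.mse_loss(flow_decoder(za), flow_gt.flatten(1))
    contact_loss = bce(contact_head(za), contact_gt)     # 0/1 contact prediction
    align_loss = bce(align_head(z), aligned_gt)          # 0/1 pairing prediction
    return flow_loss + contact_loss + align_loss

# Time-aligned batch (label 1); an unpaired batch would use aligned_gt = zeros.
z, action = torch.randn(8, 128), torch.randn(8, 2)
loss = self_supervised_loss(z, action,
                            flow_gt=torch.randn(8, 2, 32, 32),
                            contact_gt=torch.randint(0, 2, (8, 1)).float(),
                            aligned_gt=torch.ones(8, 1))
```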
Learning sample-efficient policies

How to efficiently learn a policy?
(Diagram: freeze the pretrained encoder (about 500k parameters) and learn only a small RL policy (about 15k parameters) on top of the representation computed from the RGB image, force data and robot state.)
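A minimal sketch of the "freeze the representation, learn a small policy" recipe; the layer sizes only roughly mimic the ~500k frozen vs. ~15k learned split, and the RL objective itself is omitted:

```python
# Freeze the pretrained encoder; only the small policy head receives gradients.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 128))  # pretrained
for p in encoder.parameters():
    p.requires_grad = False                     # freeze the representation

policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))       # small head
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # only policy params update

obs = torch.randn(16, 256)
with torch.no_grad():
    z = encoder(obs)                            # frozen features
action = policy(z)                              # trained with the RL objective of choice
```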


We evaluate our representation with policy learning
• Episode 0: 0% success rate. Episode 100: 21% success rate.

We efficiently learn policies in 5 hours
• Episode 300: 73%, 71% and 92% success rates.
Our multimodal policy is robust against sensor noise
1. Force perturbation (figure: the normalized force signal (x, y, z) is perturbed over time).
2. Camera occlusion (figure: the agent's camera view is occluded).
How is each modality used?
Simulation results (randomized box location), success rate:
• Force only: 1%; can't find the box.
• Image only: 49%; struggles with peg alignment.
• Force & image: 77%; can learn full task completion.
Does our representation generalize to new geometries and new policies?
• Representation and policy trained and tested on the original peg geometry: 92% success rate.
• Same representation and policy tested on a new peg geometry: 62% success rate; the policy does not transfer.
• Same representation with a newly learned policy on the new geometry: 92% success rate; the representation transfers.
Overview of method
• Self-supervised data collection: $o_{\mathrm{rgb}}, o_{\mathrm{force}}, o_{\mathrm{robot}}$; 100k data points; 90 minutes.
• Representation learning: $f(o_{\mathrm{rgb}}, o_{\mathrm{force}}, o_{\mathrm{robot}})$; 20 epochs on GPU; 24 hours.
• Policy learning: $\pi(f(\cdot)) = a$; deep RL; 5 hours.

Lee et al. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA'19. Best Paper Award.
Lessons Learned
1. Self-supervision gives us rich learning objectives.
2. A representation that captures concurrency and dynamics can generalize across task instances.
3. Our experiments show that the multimodal representation leads to learning efficiency and policy robustness.
State Representation: Physically Meaningful or Learned?
• Explicit representation (diagram: sensory observations feed a learned model that outputs physics parameters; these and the action feed a physics-based model that outputs the predicted effect).
• Learned representation (the multimodal latent representation above).
Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Interactive Perception
(Diagram: sensory data, actions and time, coupled through control.)

Interactive Perception - Leveraging Action in Perception and Perception in Action. Bohg, Hausman, Sankaran, Brock, Kragic, Schaal and Sukhatme. TRO '17.
Exploiting Multi-Modality
(Figure: success rate when objects are explored by looking, rotating or touching.)

J. J. Gibson (1966). The Senses Considered as Perceptual Systems.
Concurrency of Motion and Sensing

Held and Hein (1963). Movement-Produced Stimulation in the Development of Visually-Guided Behaviour.
Accumulation over Time
(Thanks to Octavia Camps at Northeastern University, Boston.)

Interactive Perception
(Diagram: sensory data, actions and time; this part covered self-supervised learning of a multimodal representation.)
Concurrency prediction from self-supervision
(Recap of the diagram above: an encoder over the robot action, RGB image, force data and robot state; decoders for action-conditional optical flow, next-step contact (0/1) and time alignment (0/1).)

Lee et al. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA'19. Best Paper Award.
Interactive Perception
(Diagram: sensory data, actions and time; next, exploiting RGB, depth and motion for instance segmentation.)

Instance segmentation from a robotics viewpoint.
Optical Flow - Scene Flow
A Database and Evaluation Methodology for Optical Flow. Baker et al. IJCV 2011.
A Primal-Dual Framework for Real-Time Dense RGB-D Scene Flow. Jaimez et al. ICRA 2015.

Unknown number of rigid objects.
Motion-based Object Segmentation based on Dense RGB-D Scene Flow. Shao et al. Submitted to RAL + IROS 2018. Pre-print on arXiv.

Pixel-wise Prediction
(Diagram: an encoder-based network structure for pixel-wise prediction.)
Computing Scene Flow
Scene flow = displacement of one 3D point over time.
(Figure: a 3D point moves between frames under a rigid-body transform in SE(3).)
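A minimal sketch of the quantity being computed, assuming the rigid motion T in SE(3) is already known (estimating it from data is the actual problem addressed by the paper); the camera intrinsics here are illustrative values:

```python
# Back-project a depth pixel to 3D and compute its displacement under a known
# 4x4 rigid transform T; that displacement is the scene flow of the point.
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the camera frame."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def scene_flow(point_t0, T):
    """Displacement of a 3D point moved by the rigid transform T in SE(3)."""
    point_t1 = (T @ np.append(point_t0, 1.0))[:3]
    return point_t1 - point_t0

p = backproject(u=320, v=240, depth=0.8, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
T = np.eye(4)
T[:3, 3] = [0.02, 0.0, -0.01]          # small translation of the object
flow = scene_flow(p, T)                # 3D displacement vector
```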
Motion-based Object Segmentation
Pixels that move rigidly together belong to one object.
(Diagram: per-pixel trajectory features, e.g. the projection of the centroid and the radius of the enclosing sphere, are clustered into object instances.)
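A hedged sketch of the grouping step: per-pixel trajectory features are clustered into object instances. DBSCAN is used here only as an illustrative clustering choice, not necessarily the method of Shao et al.:

```python
# Cluster per-pixel trajectory features; pixels in the same cluster are assigned
# to the same object instance, label -1 marks unclustered pixels.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical per-pixel trajectory features, shape (num_pixels, feature_dim).
features = np.random.rand(5000, 4)

labels = DBSCAN(eps=0.05, min_samples=50).fit_predict(features)
num_objects = len(set(labels)) - (1 if -1 in labels else 0)
```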
Data Set of Frame Pairs
(Figure: pairs of consecutive frames over time.)

Qualitative Results - Synthetic

Qualitative Results - Real
Interactive Perception
(Diagram: sensory data, actions and time; covered exploiting RGB, depth and motion for instance segmentation.)

Interactive Perception
(Diagram: sensory data, actions and time; Homework 3, Problem 3: learning intuitive physics.)
Conclusions
• Interaction generates a rich sensory signal that eases perception.
• Action-conditional, multi-modal representation helps contact-rich
manipulation and generalizes.
Suggested Reading
• ”Data-Driven Grasp Synthesis – A survey" by Bohg et al. TRO 2014
• "Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics" by Mahler et al. RSS 2017. https://berkeleyautomation.github.io/dex-net
• ”Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data
Collection" by Levine et al. IJRR 2017.
• “Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-
Rich Tasks.” Lee et al. ICRA’19. Best Paper Award.
• “Combining learned and analytical models for predicting action effects”, Kloss et al, IJRR 2020.
• “Interactive Perception - Leveraging Action in Perception and Perception in Action.”
Bohg, Hausman, Sankaran, Brock, Kragic, Schaal and Sukhatme. TRO ’17.



Next time
Guest Lecture by Karol Hausman

