
Principles of Robot Autonomy II

Learning-based Approaches to Manipulation & Interactive Perception


Jeannette Bohg
Learning Outcomes for the next four Lectures
• Use Manipulation to Perceive better
• Modeling and Evaluating Grasps
• Apply Learning to Grasping and Manipulation
• Modeling and Executing Manipulation

2/13/23 AA 274B | Lecture 10 2


Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



For a Deeper Dive into Grasping and Manipulation
• CS326 – Topics in Advanced Robotic Manipulation – Fall 2023



Data-Driven Approaches to Grasping
(Figure: taxonomy of data-driven grasp synthesis approaches, annotated with what was covered up till now vs. this lecture.)
"Data-Driven Grasp Synthesis – A Survey" by Bohg et al. Transactions on Robotics 2014.



Slides adapted from CS 326 Fall 2019 -- Tutorial presented by Victor Zhang

Robotic Grasping of Novel Objects using Vision. Saxena et al. IJRR 2008.





Using more sensing modalities and data to learn features and grasp policies
• Dex-Net 1.0 – 4.0 (Berkeley AutoLab)
• Google Arm Farm

"Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection" by Levine et al. IJRR 2017.
"Learning Deep Policies for Robot Bin Picking by Simulating Robust Grasping Sequences" by Mahler and Goldberg. CoRL 2017.
https://berkeleyautomation.github.io/dex-net
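These systems all train a grasp-success predictor from data. As a hedged, minimal sketch (the 32x32 depth crop and layer sizes are illustrative assumptions, not the Dex-Net or Levine et al. architectures), such a predictor over depth-image crops of grasp candidates could look like this:

```python
# Minimal sketch: a small CNN that maps a depth crop centered on a grasp
# candidate to the probability that the grasp succeeds (assumed 32x32 input).
import torch
import torch.nn as nn

class GraspQualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 14
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 5
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 5 * 5, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, depth_crop):                       # depth_crop: (B, 1, 32, 32)
        return self.head(self.features(depth_crop))      # grasp-success logit

# One supervised training step on (crop, success-label) pairs.
model = GraspQualityNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
crops = torch.randn(8, 1, 32, 32)
labels = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCEWithLogitsLoss()(model(crops), labels)
loss.backward()
optimizer.step()
```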



Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Controlling through contact
(Figure: a robot finger pushes an object toward a goal.)

Controlling through contact
Better models, better feedback: sensory observations feed a predictive model; the model's predicted effect of a candidate action is used for model-predictive control.
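As a hedged illustration of the model-predictive control loop in this diagram (the names `predict_next_state` and `goal_cost` are hypothetical placeholders for a learned or analytical model and a task cost, not part of the lecture material):

```python
# Minimal random-shooting MPC sketch: sample action sequences, roll each one out
# through the predictive model, execute only the first action of the best sequence.
import numpy as np

def random_shooting_mpc(state, predict_next_state, goal_cost,
                        horizon=10, num_samples=256, action_dim=2):
    actions = np.random.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    costs = np.zeros(num_samples)
    for i in range(num_samples):
        s = state
        for t in range(horizon):
            s = predict_next_state(s, actions[i, t])   # predicted effect of the action
            costs[i] += goal_cost(s)                   # accumulate predicted cost
    return actions[np.argmin(costs), 0]                # first action of the best sequence

# Toy example: a point "object" pushed toward the origin.
predict = lambda s, a: s + 0.1 * a
cost = lambda s: float(np.sum(s ** 2))
first_action = random_shooting_mpc(np.array([1.0, -0.5]), predict, cost)
```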


Modeling Planar Pushing
• Friction limit surface: describes the friction forces that occur when a part slides over its support surface.
• When pushed with a wrench inside the limit surface: no motion.
• For quasi-static pushing: the wrench lies on the limit surface, and the object twist is normal to the limit surface at that point.
(Figure: relation between wrench cone, limit surface and unit twist sphere. Adapted from Chapter 37, Fig. 37.10 in the Springer Handbook of Robotics.)
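To make the limit-surface picture concrete, a commonly used ellipsoidal approximation (an addition beyond the slide; f_max and m_max denote the maximum frictional force and moment) is:

```latex
\[
  \mathbf{w}^{\top} A\, \mathbf{w} = 1,
  \qquad
  A = \operatorname{diag}\!\left(\frac{1}{f_{\max}^{2}},\, \frac{1}{f_{\max}^{2}},\, \frac{1}{m_{\max}^{2}}\right),
  \qquad
  \mathbf{w} = (f_x, f_y, m_z),
\]
\[
  \text{quasi-static pushing:}\quad
  \mathbf{t} \;\propto\; \nabla_{\mathbf{w}}\!\left(\mathbf{w}^{\top} A\, \mathbf{w}\right) = 2 A \mathbf{w}.
\]
```

The twist being normal to the limit surface at the applied wrench is exactly the relation sketched in the figure above.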
Validating Planar Pushing Models
(Demo: predicting the physical effect of a push relative to a goal, then continuing to predict over a sequence of pushes.)
Bias-Variance Tradeoff
(Figure: error vs. model complexity. Bias falls and variance rises as model complexity grows, with the minimum error in between. Model-based approaches sit at the low-complexity end, data-driven approaches at the high-complexity end; learning structure targets the minimum-error region between them.)
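For reference, the standard bias-variance decomposition behind this plot (a textbook result, not printed on the slide) is:

```latex
\[
  \mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^{2}\right]
  = \mathrm{Bias}\!\left[\hat{f}(x)\right]^{2}
  + \mathrm{Var}\!\left[\hat{f}(x)\right]
  + \sigma^{2}.
\]
```

Model-based approaches tend to sit at the high-bias end and purely data-driven models at the high-variance end, which is the slide's message.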


Hypothesis
Physics Models + Learning = Generalization

Hybrid Model for Extrapolation
(Diagram: sensory observations feed a learned model that outputs physical model parameters; the parameters and the action feed a physics-based model that outputs the predicted effect. The whole pipeline is trained end-to-end with a loss on the predicted effect.)
Testing the Hypothesis on a Case Study
More than a Million Ways to Be Pushed: A High-Fidelity Experimental Dataset of Planar Pushing. Yu et al. IROS 2016.

Analytic Model
(Figure: relation between wrench cone, limit surface and unit twist sphere. Adapted from Chapter 37, Fig. 37.10 in the Springer Handbook of Robotics.)
Compared Architectures (raw sensory observations as input)
• Neural network only: sensory observations and action go into a learned model that directly outputs the predicted effect.
• Hybrid model: sensory observations go into a learned model that outputs physics parameters; the parameters and the action go into a physics-based model that outputs the predicted effect.
Training: end-to-end. Loss: error between predicted and ground-truth effect.
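A hedged sketch of the hybrid architecture, assuming for illustration that the learned part predicts a contact point and a friction-related coefficient and that the analytical model is a differentiable placeholder (the real analytical pushing model from the slides is more involved):

```python
# Hybrid model sketch: network -> physical parameters -> analytical model -> effect,
# trained end-to-end on the error between predicted and ground-truth effect only.
import torch
import torch.nn as nn

class ParameterNet(nn.Module):
    """Maps a raw observation vector to [contact_x, contact_y, friction]."""
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, obs):
        return self.net(obs)

def analytical_push_model(params, action):
    # Hypothetical differentiable stand-in for the analytical pushing model.
    contact, friction = params[:, :2], params[:, 2:3]
    return friction * action + 0.1 * contact            # placeholder dynamics

param_net = ParameterNet()
optimizer = torch.optim.Adam(param_net.parameters(), lr=1e-3)
obs, action, true_effect = torch.randn(16, 64), torch.randn(16, 2), torch.randn(16, 2)
pred_effect = analytical_push_model(param_net(obs), action)
loss = nn.functional.mse_loss(pred_effect, true_effect)  # loss on the effect only
loss.backward()
optimizer.step()
```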
Advantages: Data Efficiency and Extrapolation

Testing Data Efficiency
(Training = test distribution.)


Testing Generalization
(Figure: training vs. testing conditions for new pushing angles & contact points, new push velocities, and new object shapes, covering both interpolation and extrapolation.)

Alina Kloss et al, “Combining learned and analytical models for predicting action effects,” IJRR 2020.
Generalization to new push velocities
(Figure: training vs. testing velocities; extrapolation beyond the training = test distribution.)

Don't throw away structure: learn to extract a given state representation from raw data.
A Concrete Suggestion
(Diagram, as above: sensory observations feed a learned model that outputs physical parameters; these and the action feed a physics-based model that outputs the predicted effect, trained end-to-end with a loss on the effect.)
Added benefit: interpretability.
Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Models will never be perfect
(Figure: the robot's sensing modalities include RGB-D, active stereo, tactile, audio, force/torque and joint torque.)
Human multimodal sensor-motor coordination
(Figure: sensory inputs are mapped to examples of task-relevant information such as pose, geometry, force and friction, which a policy π uses to select actions.)
Generalizable representations for multimodal inputs

$\pi(f(o_1, o_2, o_3, \ldots, o_N)) = a$

Learn a representation $f$ of the multimodal sensory inputs (capturing, e.g., pose, geometry, force, friction), then learn a policy $\pi$ on top of it for new task instances.
Experimental setup
• Multimodal sensory inputs: RGB images, force/torque readings, robot states.
• Training and testing differ in peg geometry.
Learning generalizable representation
(Diagram: the inputs, an RGB image, force data and the robot state, pass through modality-specific encoders and a fusion module that produces a single representation; a decoder predicts task-relevant labels such as pose, geometry, force and friction.)
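A minimal sketch of such a fusion encoder (layer sizes, a 64x64 RGB input, a 32-sample force/torque window and a 7-dimensional robot state are illustrative assumptions, not the architecture of Lee et al.):

```python
# Per-modality encoders followed by a fusion layer producing one representation.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.rgb_enc = nn.Sequential(                       # RGB image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, 64),
        )
        self.force_enc = nn.Sequential(nn.Linear(6 * 32, 64), nn.ReLU())  # F/T window
        self.state_enc = nn.Sequential(nn.Linear(7, 64), nn.ReLU())       # proprioception
        self.fusion = nn.Sequential(nn.Linear(64 * 3, latent_dim), nn.ReLU())

    def forward(self, rgb, force, state):
        z = torch.cat([self.rgb_enc(rgb),
                       self.force_enc(force.flatten(1)),
                       self.state_enc(state)], dim=-1)
        return self.fusion(z)                               # fused representation

encoder = MultimodalEncoder()
z = encoder(torch.randn(2, 3, 64, 64), torch.randn(2, 6, 32), torch.randn(2, 7))
```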


Collect labeled data with self-supervision
(Figure: optical flow and the normalized force signal (x, y, z components over time) provide training labels without manual annotation.)
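As one concrete example of a self-supervised label, a binary contact signal can be derived by thresholding the force magnitude; the threshold below is an illustrative assumption, not necessarily the value used in the paper:

```python
# Derive 0/1 contact labels from the normalized force signal.
import numpy as np

def contact_labels(force_xyz, threshold=0.3):
    """force_xyz: (T, 3) normalized force readings -> (T,) boolean contact labels."""
    return np.linalg.norm(force_xyz, axis=1) > threshold

force = np.random.uniform(-1, 1, size=(100, 3))
labels = contact_labels(force)        # used as self-supervised targets
```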
Dynamics prediction from self-supervision
(Diagram: the robot action, RGB image, force data and robot state are encoded into a representation; decoders predict action-conditional optical flow and whether there will be contact in the next step (0/1).)


Concurrency prediction from self-supervision
(Diagram: the same encoder and decoders as above, plus a decoder that predicts whether the input modalities are time-aligned (0/1). Paired inputs should be classified as time-aligned; deliberately unpaired inputs should be classified as not time-aligned.)
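A hedged sketch of how the three self-supervised objectives named on these slides could be combined into one training loss (the decoder shapes, a 128-dimensional representation and equal loss weights are assumptions for illustration):

```python
# Three self-supervised heads: action-conditional optical flow, next-step contact,
# and time alignment of the modalities (unpaired inputs serve as negatives).
import torch
import torch.nn as nn

flow_decoder = nn.Linear(128 + 2, 2 * 32 * 32)   # representation + action -> dense flow
contact_head = nn.Linear(128 + 2, 1)             # contact in the next step? (logit)
align_head = nn.Linear(128, 1)                   # are the modalities time-aligned? (logit)
bce = nn.BCEWithLogitsLoss()

def self_supervised_loss(z, action, flow_gt, contact_gt, aligned_gt):
    za = torch.cat([z, action], dim=-1)
    flow_loss = nn.functional.mse_loss(flow_decoder(za), flow_gt.flatten(1))
    contact_loss = bce(contact_head(za), contact_gt)     # 0/1 contact prediction
    align_loss = bce(align_head(z), aligned_gt)          # 0/1 pairing prediction
    return flow_loss + contact_loss + align_loss

# Time-aligned batch (label 1); an unpaired batch would use aligned_gt = zeros.
z, action = torch.randn(8, 128), torch.randn(8, 2)
loss = self_supervised_loss(z, action,
                            flow_gt=torch.randn(8, 2, 32, 32),
                            contact_gt=torch.randint(0, 2, (8, 1)).float(),
                            aligned_gt=torch.ones(8, 1))
```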
Learning sample-efficient policies

How to efficiently learn a policy?
(Diagram: freeze the pretrained encoder (about 500k parameters) and learn only a small RL policy (about 15k parameters) on top of the representation computed from the RGB image, force data and robot state.)
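A minimal sketch of the "freeze the representation, learn a small policy" recipe; the layer sizes only roughly mimic the ~500k frozen vs. ~15k learned split, and the RL objective itself is omitted:

```python
# Freeze the pretrained encoder; only the small policy head receives gradients.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 128))  # pretrained
for p in encoder.parameters():
    p.requires_grad = False                     # freeze the representation

policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))       # small head
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # only policy params update

obs = torch.randn(16, 256)
with torch.no_grad():
    z = encoder(obs)                            # frozen features
action = policy(z)                              # trained with the RL objective of choice
```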


We evaluate our representation with policy learning
• Episode 0: 0% success rate. Episode 100: 21% success rate.

We efficiently learn policies in 5 hours
• Episode 300: 73%, 71% and 92% success rates.
Our multimodal policy is robust against sensor noise
1. Force perturbation (figure: the normalized force signal (x, y, z) is perturbed over time).
2. Camera occlusion (figure: the agent's camera view is occluded).
How is each modality used?
Simulation results (randomized box location), success rate:
• Force only: 1%; can't find the box.
• Image only: 49%; struggles with peg alignment.
• Force & image: 77%; can learn full task completion.
Does our representation generalize to new geometries and new policies?
• Representation and policy trained and tested on the original peg geometry: 92% success rate.
• Same representation and policy tested on a new peg geometry: 62% success rate; the policy does not transfer.
• Same representation with a newly learned policy on the new geometry: 92% success rate; the representation transfers.
Overview of method
• Self-supervised data collection: $o_{\mathrm{rgb}}, o_{\mathrm{force}}, o_{\mathrm{robot}}$; 100k data points; 90 minutes.
• Representation learning: $f(o_{\mathrm{rgb}}, o_{\mathrm{force}}, o_{\mathrm{robot}})$; 20 epochs on GPU; 24 hours.
• Policy learning: $\pi(f(\cdot)) = a$; deep RL; 5 hours.

Lee et al. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA'19. Best Paper Award.
Lessons Learned
1. Self-supervision gives us rich learning objectives.
2. A representation that captures concurrency and dynamics can generalize across task instances.
3. Our experiments show that the multimodal representation leads to learning efficiency and policy robustness.
State Representation: Physically Meaningful or Learned?
• Explicit representation (diagram: sensory observations feed a learned model that outputs physics parameters; these and the action feed a physics-based model that outputs the predicted effect).
• Learned representation (the multimodal latent representation above).
Today’s itinerary
• Recap
• Learning-based Grasping
• Learning-based Approaches to
• Planar Pushing
• Contact-Rich Manipulation Tasks
• Interactive Perception
• Perception is not passive but active
• Not seeing, but looking



Interactive Perception
(Diagram: sensory data, actions and time, coupled through control.)

Interactive Perception - Leveraging Action in Perception and Perception in Action. Bohg, Hausman, Sankaran, Brock, Kragic, Schaal and Sukhatme. TRO '17.
Exploiting Multi-Modality
(Figure: success rate when objects are explored by looking, rotating or touching.)

J. J. Gibson (1966). The Senses Considered as Perceptual Systems.
Concurrency of Motion and Sensing

Held and Hein (1963). Movement-Produced Stimulation in the Development of Visually-Guided Behaviour.
Accumulation over Time
(Thanks to Octavia Camps at Northeastern University, Boston.)

Interactive Perception
(Diagram: sensory data, actions and time; this part covered self-supervised learning of a multimodal representation.)
Concurrency prediction from self-supervision
(Recap of the diagram above: an encoder over the robot action, RGB image, force data and robot state; decoders for action-conditional optical flow, next-step contact (0/1) and time alignment (0/1).)

Lee et al. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. ICRA'19. Best Paper Award.
Interactive Perception
(Diagram: sensory data, actions and time; next, exploiting RGB, depth and motion for instance segmentation.)

Instance segmentation from a robotics viewpoint.
Optical Flow - Scene Flow
A Database and Evaluation Methodology for Optical Flow. Baker et al. IJCV 2011.
A Primal-Dual Framework for Real-Time Dense RGB-D Scene Flow. Jaimez et al. ICRA 2015.

Unknown number of rigid objects.
Motion-based Object Segmentation based on Dense RGB-D Scene Flow. Shao et al. Submitted to RAL + IROS 2018. Pre-print on arXiv.

Pixel-wise Prediction
(Diagram: an encoder-based network structure for pixel-wise prediction.)
Computing Scene Flow
Scene flow = displacement of one 3D point over time.
(Figure: a 3D point moves between frames under a rigid-body transform in SE(3).)
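A minimal sketch of the quantity being computed, assuming the rigid motion T in SE(3) is already known (estimating it from data is the actual problem addressed by the paper); the camera intrinsics here are illustrative values:

```python
# Back-project a depth pixel to 3D and compute its displacement under a known
# 4x4 rigid transform T; that displacement is the scene flow of the point.
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth -> 3D point in the camera frame."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def scene_flow(point_t0, T):
    """Displacement of a 3D point moved by the rigid transform T in SE(3)."""
    point_t1 = (T @ np.append(point_t0, 1.0))[:3]
    return point_t1 - point_t0

p = backproject(u=320, v=240, depth=0.8, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
T = np.eye(4)
T[:3, 3] = [0.02, 0.0, -0.01]          # small translation of the object
flow = scene_flow(p, T)                # 3D displacement vector
```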
Motion-based Object Segmentation
Pixels that move rigidly together belong to one object.
(Diagram: per-pixel trajectory features, e.g. the projection of the centroid and the radius of the enclosing sphere, are clustered into object instances.)
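A hedged sketch of the grouping step: per-pixel trajectory features are clustered into object instances. DBSCAN is used here only as an illustrative clustering choice, not necessarily the method of Shao et al.:

```python
# Cluster per-pixel trajectory features; pixels in the same cluster are assigned
# to the same object instance, label -1 marks unclustered pixels.
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical per-pixel trajectory features, shape (num_pixels, feature_dim).
features = np.random.rand(5000, 4)

labels = DBSCAN(eps=0.05, min_samples=50).fit_predict(features)
num_objects = len(set(labels)) - (1 if -1 in labels else 0)
```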
Data Set of Frame Pairs
(Figure: pairs of consecutive frames over time.)

Qualitative Results - Synthetic

Qualitative Results - Real
Interactive Perception
(Diagram: sensory data, actions and time; covered exploiting RGB, depth and motion for instance segmentation.)

Interactive Perception
(Diagram: sensory data, actions and time; Homework 3, Problem 3: learning intuitive physics.)
Conclusions
• Interaction generates a rich sensory signal that eases perception.
• Action-conditional, multi-modal representation helps contact-rich
manipulation and generalizes.
Suggested Reading
• ”Data-Driven Grasp Synthesis – A survey" by Bohg et al. TRO 2014
• "Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics" by Mahler et al. RSS 2017. https://berkeleyautomation.github.io/dex-net
• ”Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data
Collection" by Levine et al. IJRR 2017.
• “Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-
Rich Tasks.” Lee et al. ICRA’19. Best Paper Award.
• “Combining learned and analytical models for predicting action effects”, Kloss et al, IJRR 2020.
• “Interactive Perception - Leveraging Action in Perception and Perception in Action.”
Bohg, Hausman, Sankaran, Brock, Kragic, Schaal and Sukhatme. TRO ’17.



Next time
Guest Lecture by Karol Hausman

