
Principles of Robot Autonomy II

Human-Robot Interaction
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


Learning from Humans

Existing research explores how robots adapt to humans


• Imitation learning
• Learning from demonstrations
Influencing Humans

Far less work studies how robots influence humans


Nth order Theory of Mind

[Sadigh, Sastry, Seshia, Dragan, RSS 2016, IROS 2016, AURO 2018]
An autonomous car’s actions will affect the actions of other drivers.
Source: https://twitter.com/nitguptaa/
Interaction as a Dynamical System

direct control over $u_\mathcal{R}$
indirect control over $u_\mathcal{H}$
Interaction as a Dynamical System

$$u_\mathcal{R}^* = \arg\max_{u_\mathcal{R}} R_\mathcal{R}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big)$$

Find optimal actions for the robot while accounting for the human response $u_\mathcal{H}^*$. Model $u_\mathcal{H}$ as optimizing the human reward function $R_\mathcal{H}$:

$$u_\mathcal{H}^*(x, u_\mathcal{R}) \approx \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})$$

Sadigh et al. RSS 2016, AURO 2018


Learning Driver Models
Learn the human’s reward function using Inverse Reinforcement Learning:

$$P(u_\mathcal{H} \mid x, w) = \frac{\exp\big(R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})\big)}{\int \exp\big(R_\mathcal{H}(x, u_\mathcal{R}, \tilde{u}_\mathcal{H})\big)\, d\tilde{u}_\mathcal{H}}$$

$$R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = w^\top \phi(x, u_\mathcal{R}, u_\mathcal{H})$$

(a) Features for staying inside the boundaries of the road. (b) Features for staying inside the lanes. (c) Features for avoiding other vehicles.

[Ziebart ’09] [Levine ’10]
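For concreteness, here is a minimal sketch of fitting the feature weights $w$ from observed human actions, assuming a toy two-feature reward and a discretized action set (the features, data, and hyperparameters below are illustrative, not the paper's continuous trajectory optimization). The log-likelihood gradient is the observed features minus the model's expected features.

```python
import numpy as np

CANDIDATE_UH = np.linspace(-1.0, 1.0, 41)   # discretized human actions

def phi(x, u_R, u_H):
    """Toy feature vector: [lane keeping, distance to the other car]."""
    return np.array([-(u_H ** 2), -((u_H - u_R) ** 2)])

def p_uH(w, x, u_R):
    """p(u_H | x, w) proportional to exp(w^T phi(x, u_R, u_H)) over the discretized actions."""
    logits = np.array([w @ phi(x, u_R, u) for u in CANDIDATE_UH])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def irl_gradient(w, demos):
    """Log-likelihood gradient: observed features minus expected features under the model."""
    g = np.zeros_like(w)
    for x, u_R, u_H in demos:
        probs = p_uH(w, x, u_R)
        expected = sum(p * phi(x, u_R, u) for p, u in zip(probs, CANDIDATE_UH))
        g += phi(x, u_R, u_H) - expected
    return g / len(demos)

# Fit w on synthetic demonstrations generated by a "true" weight vector.
rng = np.random.default_rng(0)
w_true = np.array([1.0, 2.0])
demos = []
for _ in range(100):
    x, u_R = 0.0, rng.uniform(-1, 1)
    u_H = rng.choice(CANDIDATE_UH, p=p_uH(w_true, x, u_R))
    demos.append((x, u_R, u_H))

w = np.zeros(2)
for _ in range(200):
    w += 0.2 * irl_gradient(w, demos)
print("recovered weights:", w)   # should roughly approach w_true
```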


Interaction as a Dynamical System
$$u_\mathcal{R}^* = \arg\max_{u_\mathcal{R}} R_\mathcal{R}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big)$$

Find optimal actions for the robot while accounting for the human response $u_\mathcal{H}^*$. Model $u_\mathcal{H}$ as optimizing the human reward function $R_\mathcal{H}$:

$$u_\mathcal{H}^*(x, u_\mathcal{R}) \approx \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})$$
Approximations for Tractability
− Receding Horizon Control:

Plan for a short time horizon, replan at every step.

− Model the problem as a Stackelberg game.


Give the human full access to 𝑢ℛ for the short time horizon.
Nth order Theory of Mind
Nth order Theory of Mind
Approximations for Tractability
− Receding Horizon Control:

Plan for a short time horizon, replan at every step.

− Model the problem as a Stackelberg game.


Give the human full access to 𝑢ℛ for the short time horizon.


$$u_\mathcal{H}^*(x, u_\mathcal{R}) = \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})$$

− Assume a deterministic human model.


Solution of Nested Optimization
$$u_\mathcal{R}^* = \arg\max_{u_\mathcal{R}} R_\mathcal{R}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big), \qquad R_\mathcal{R}(x, u_\mathcal{R}, u_\mathcal{H}) = \sum_{t=1}^{T} r_\mathcal{R}\big(x^t, u_\mathcal{R}^t, u_\mathcal{H}^t\big)$$

Gradient-Based Method (Quasi-Newton):

$$\frac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \frac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}}\,\frac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \frac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$$

$$u_\mathcal{H}^*(x, u_\mathcal{R}) \approx \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}), \qquad R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = \sum_{t=1}^{T} r_\mathcal{H}\big(x^t, u_\mathcal{R}^t, u_\mathcal{H}^t\big)$$
Solution of Nested Optimization
Quasi-Newton method:

$$\frac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \frac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}} \cdot \frac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \frac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$$

Given that $R_\mathcal{H}$ is smooth and its optimum is attained, for an unconstrained optimization the partial $\frac{\partial R_\mathcal{H}}{\partial u_\mathcal{H}}$ evaluated at the optimum $u_\mathcal{H}^*$ is zero:

$$\frac{\partial R_\mathcal{H}}{\partial u_\mathcal{H}}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big) = 0$$

Differentiating this identity with respect to $u_\mathcal{R}$ gives

$$\frac{\partial^2 R_\mathcal{H}}{\partial u_\mathcal{H}^2} \cdot \frac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \frac{\partial^2 R_\mathcal{H}}{\partial u_\mathcal{H}\,\partial u_\mathcal{R}} \cdot \frac{\partial u_\mathcal{R}}{\partial u_\mathcal{R}} = 0,$$

which can be solved for $\frac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}}$.
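A minimal numerical sketch of this nested optimization, assuming toy scalar rewards (the functions below are hypothetical, not the learned driving rewards): the human best response is computed by an inner gradient ascent, and the robot's gradient is estimated with finite differences, which stands in for the closed-form implicit-differentiation term derived above.

```python
import numpy as np

EPS = 1e-4

def R_H(x, uR, uH):          # toy human reward: stay near 0.5, but also stay close to the robot
    return -(uH - 0.5) ** 2 - (uH - uR) ** 2

def R_R(x, uR, uH):          # toy robot reward: make progress towards 1.0, keep the human close
    return -(uR - 1.0) ** 2 - 0.1 * (uH - uR) ** 2

def uH_star(x, uR, steps=200, lr=0.1):
    """Inner problem: u_H*(x, u_R) = argmax_{u_H} R_H(x, u_R, u_H) via gradient ascent."""
    uH = 0.0
    for _ in range(steps):
        grad = (R_H(x, uR, uH + EPS) - R_H(x, uR, uH - EPS)) / (2 * EPS)
        uH += lr * grad
    return uH

def nested_objective(uR, x=0.0):
    """Outer objective: R_R(x, u_R, u_H*(x, u_R))."""
    return R_R(x, uR, uH_star(x, uR))

# Outer problem: gradient ascent on the robot's control, differentiating through
# the human's best response numerically.
uR = 0.0
for _ in range(100):
    grad = (nested_objective(uR + EPS) - nested_objective(uR - EPS)) / (2 * EPS)
    uR += 0.1 * grad
print("approximately optimal robot action:", round(uR, 3))
```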
Implication: Efficiency (ℛobot vs. ℋuman)

Implication: Coordination
Legible Motion

Using robot motion to better coordinate with the human about the robot’s goal.
[Plot: y position of the Human Vehicle vs. x position of the Autonomous Vehicle, comparing an Ideal Human, an Interaction-Aware model, and a Dynamic Obstacle model; annotations mark the human crossing first vs. the human crossing second.]
$$p(u_\mathcal{H} \mid x) \propto \exp\big(R_\mathcal{H}(x, u_\mathcal{H})\big)$$

We can’t rely on a single driver model. We need to differentiate between different drivers.
$$p(u_\mathcal{H} \mid x, \theta) \propto \exp\big(R_\mathcal{H}(x, u_\mathcal{H}, \theta)\big)$$

$$b_{t+1}(\theta) \propto b_t(\theta) \cdot p(u_\mathcal{H} \mid x_t, \theta)$$

$$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} R_\mathcal{R}$$
Drivers respond to the actions of other cars…

We have an opportunity to actively gather information.
$$p(u_\mathcal{H} \mid x, \theta, u_\mathcal{R}) \propto \exp\big(R_\mathcal{H}(x, u_\mathcal{H}, \theta, u_\mathcal{R})\big)$$

$$b_{t+1}(\theta) \propto b_t(\theta) \cdot p(u_\mathcal{H} \mid x_t, \theta, u_\mathcal{R})$$

Info Gathering:
$$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b_t) - \mathbb{H}(b_{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$$

Goal:
$$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta\,[R_\mathcal{R}]$$
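A minimal sketch of the belief update and the information-gain term, assuming two discrete driver styles and a toy Boltzmann human model (all names and numbers below are illustrative). It shows why a probing action is rewarded: it separates the predicted responses of attentive and distracted drivers, while a passive action yields essentially no entropy reduction.

```python
import numpy as np

THETAS = ["attentive", "distracted"]
CANDIDATE_UH = np.linspace(-1.0, 1.0, 21)   # discretized human actions

def R_H(x, u_H, theta, u_R):
    # Hypothetical reward: attentive drivers react to the robot, distracted ones don't.
    react = 1.0 if theta == "attentive" else 0.0
    return -(u_H - react * u_R) ** 2

def p_uH(x, theta, u_R):
    """Boltzmann model p(u_H | x, theta, u_R) over the discretized action set."""
    logits = np.array([R_H(x, u, theta, u_R) for u in CANDIDATE_UH])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def belief_update(b, x, u_H_observed, u_R):
    """b_{t+1}(theta) is proportional to b_t(theta) * p(u_H | x_t, theta, u_R)."""
    idx = np.argmin(np.abs(CANDIDATE_UH - u_H_observed))
    new_b = np.array([b[i] * p_uH(x, th, u_R)[idx] for i, th in enumerate(THETAS)])
    return new_b / new_b.sum()

def entropy(b):
    return -np.sum(b * np.log(b + 1e-12))

def expected_info_gain(b, x, u_R):
    """Expected entropy reduction H(b_t) - H(b_{t+1}) for a candidate robot action."""
    gain = 0.0
    for i, th in enumerate(THETAS):
        probs = p_uH(x, th, u_R)
        for idx, u_H in enumerate(CANDIDATE_UH):
            b_next = belief_update(b, x, u_H, u_R)
            gain += b[i] * probs[idx] * (entropy(b) - entropy(b_next))
    return gain

b = np.array([0.5, 0.5])
print(expected_info_gain(b, x=0.0, u_R=0.5))   # probing action: positive gain
print(expected_info_gain(b, x=0.0, u_R=0.0))   # passive action: (near) zero gain
```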
Nudging in for Active Info Gathering
[Plot: forward y position of the human over time, comparing a distracted human and an attentive human.]
Robot Active Info Gathering

[Plots: forward x position of the robot over time (Scenarios 2 and 3); with an attentive human the robot keeps inching forward, while with a distracted human it backs up.]
Human Responses

[Plot: forward y position of the human over time, for a distracted human vs. an attentive human.]
Belief over Driving Style: Active vs Passive

[Plot: belief b(θ = attentive) over time, comparing active and passive information gathering.]
Key Idea:

The robot’s actions affect the human’s actions. We want to leverage these effects for better safety, efficiency, and estimation.
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


$$p(u_\mathcal{H} \mid x, \theta, u_\mathcal{R}) \propto \exp\big(R_\mathcal{H}(x, u_\mathcal{H}, \theta, u_\mathcal{R})\big)$$

$$b_{t+1}(\theta) \propto b_t(\theta) \cdot p(u_\mathcal{H} \mid x_t, \theta, u_\mathcal{R})$$

Info Gathering:
$$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b_t) - \mathbb{H}(b_{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$$

Goal:
$$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta\,[R_\mathcal{R}]$$
Modeling Intent Inference using POMDPs

[Javdani et al.]
POMDP Formulation
MDPs have: states $S$, actions $A$, a transition function $P(s' \mid s, a)$, and a reward $R(s, a, s')$.
POMDPs add: observations $O$ and an observation function $P(o \mid s)$.
Tiger Example

Actions 𝑎 = {0: listen, 1: open left, 2: open right}

Reward Function:
- Penalty for wrong opening: -100
- Reward for correct opening: +10
- Cost of listening: -1

Observations:
- Hear the tiger on the left
- Hear the tiger on the right
Tiger Example
Belief update based on observations:

$$b'(s') \propto p(o \mid s', a) \sum_{s \in S} p(s' \mid s, a) \cdot b(s)$$

Value Iteration over Beliefs:

$$V^*(b) = \max_{a \in A}\Big[\underbrace{\sum_{s \in S} b(s) \cdot R(s, a)}_{\text{immediate return}} \;+\; \gamma \underbrace{\sum_{o \in O} P(o \mid b, a) \cdot V^*(b_a^o)}_{\text{discounted future return}}\Big]$$

This is hard to compute: the belief space is a continuous-state MDP, so we need an approximation.


Tiger Example
Value Iteration over Beliefs:

$$V^*(b) = \max_{a \in A}\Big[\sum_{s \in S} b(s) \cdot R(s, a) + \gamma \sum_{o \in O} P(o \mid b, a) \cdot V^*(b_a^o)\Big]$$

This is hard to compute: the belief space is a continuous-state MDP, so we need an approximation.

Q-MDP Approximation:
$$V^*(b) \approx \mathbb{E}_s[V^*(s)] = \sum_{s} b(s) \cdot V^*(s)$$
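A minimal sketch of the tiger POMDP with a Q-MDP-style policy, assuming an 85% listening accuracy and a confidence threshold that are not stated on the slide. It illustrates the belief update above and the known Q-MDP limitation that information gathering must be handled separately (here, with a simple threshold).

```python
import numpy as np

S = [0, 1]                      # tiger behind left door (0) or right door (1)
LISTEN_ACC = 0.85               # assumption, not given on the slide

def observation_prob(o, s, a):
    """p(o | s, a): o is 'hear_left' or 'hear_right'; only listening is informative."""
    if a != "listen":
        return 0.5
    correct = (o == "hear_left") == (s == 0)
    return LISTEN_ACC if correct else 1.0 - LISTEN_ACC

def belief_update(b, a, o):
    """b'(s') proportional to p(o | s', a) * sum_s p(s' | s, a) b(s); the tiger does not move."""
    new_b = np.array([observation_prob(o, s, a) * b[s] for s in S])
    return new_b / new_b.sum()

def reward(s, a):
    if a == "listen":
        return -1.0
    opened_left = (a == "open_left")
    return 10.0 if opened_left == (s == 1) else -100.0   # open the door without the tiger

def choose_action(b, confidence=0.95):
    """Q-MDP-style selection: belief-weighted value of opening, with a listening threshold
    (an assumed heuristic, since plain Q-MDP never values information gathering)."""
    if b.max() < confidence:
        return "listen"
    q = {a: sum(b[s] * reward(s, a) for s in S) for a in ["open_left", "open_right"]}
    return max(q, key=q.get)

b = np.array([0.5, 0.5])
for o in ["hear_left", "hear_left", "hear_left"]:   # simulated observations
    if choose_action(b) != "listen":
        break
    b = belief_update(b, "listen", o)
print("final belief:", b, "-> action:", choose_action(b))
```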
Intent Inference

𝑋 Robot States
𝐴 Robot Actions
𝑇: 𝑋 ×𝐴 → 𝑋 Transition function

𝑢 ∈ 𝑈 Human continuous input


𝐷: 𝑈 → 𝐴 Mapping between human input and robot actions
User’s Policy is Learned from IRL

$$\pi_{usr}(x) = p(u \mid x, g) \qquad \text{(we learn a policy for each goal)}$$

$$p(\xi \mid g) \propto \exp\big(-C_{usr}(\xi)\big)$$

$$p(g \mid \xi) \propto p(\xi \mid g) \cdot p(g) \qquad \text{(Bayes rule)}$$

POMDP Observation Model
Hindsight Optimization (Q-MDP)
Estimate the cost-to-go of the belief by assuming full observability will be obtained at the next time step. You never gather information, but you can plan efficiently in deterministic subproblems.

$$b(s) = b(g) = p(g \mid \xi) \qquad \text{(uncertainty is only over goals)}$$

$$Q(b, a, u) = \sum_g b(g) \cdot Q_g(x, a, u)$$

Here $Q(b, a, u)$ is the action-value function of the POMDP, and $Q_g$ is the cost-to-go of acting optimally while moving towards goal $g$.
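A minimal sketch of this goal-belief arbitration, assuming a 2D point robot, a discretized action set, and a cosine-similarity observation model (all illustrative, not Javdani et al.'s implementation; the value here ignores the user input term for simplicity). The assistive action maximizes the belief-weighted per-goal value, mirroring the $Q(b, a, u)$ decomposition above.

```python
import numpy as np

GOALS = np.array([[1.0, 0.0], [0.0, 1.0]])          # candidate goals g
CANDIDATE_ACTIONS = [np.array(a) for a in
                     [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.0, 0.0)]]

def Q_g(x, a, g):
    """Per-goal value: negative distance to the goal after taking action a."""
    return -np.linalg.norm((x + a) - g)

def goal_likelihood(u, x, g, beta=5.0):
    """Boltzmann-style observation model: inputs pointing towards g are more likely."""
    cos = np.dot(u, g - x) / (np.linalg.norm(u) * np.linalg.norm(g - x) + 1e-9)
    return np.exp(beta * cos)

def update_belief(b, u, x):
    new_b = np.array([b[i] * goal_likelihood(u, x, g) for i, g in enumerate(GOALS)])
    return new_b / new_b.sum()

def assistive_action(b, x):
    """argmax_a sum_g b(g) Q_g(x, a, g): hindsight-optimization-style arbitration."""
    scores = [sum(b[i] * Q_g(x, a, g) for i, g in enumerate(GOALS))
              for a in CANDIDATE_ACTIONS]
    return CANDIDATE_ACTIONS[int(np.argmax(scores))]

x = np.array([0.0, 0.0])
b = np.array([0.5, 0.5])
u = np.array([0.1, 0.02])                            # user input nudging towards goal 0
b = update_belief(b, u, x)
print("belief over goals:", b, "assistive action:", assistive_action(b, x))
```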
Shared Autonomy with Hindsight Optimization
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


• Assistive robotic arms are dexterous
• This dexterity makes it hard for users to control the robot
• How can robots learn low-dimensional representations that make controlling the robot intuitive?
Our Vision

Offline, expert demonstrations of high-dimensional motions


Our Vision

Learn low-dimensional latent representations for online control


We make it easier to control high-dimensional
robots by embedding the robot’s actions into a
low-dimensional latent space.
Model Structure (cVAE)
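A minimal sketch of a conditional VAE for latent actions, assuming a 7-DoF arm and a 2D latent (sizes, losses, and names are illustrative, not the exact lecture model): the encoder compresses (state, action) pairs from demonstrations into a latent $z$, and at test time the decoder maps a low-dimensional user input $z$ plus the current state back to a full robot action.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, LATENT_DIM = 7, 7, 2   # e.g. a 7-DoF arm, a 2-D joystick latent

class LatentActionCVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: (state, high-dim action) -> latent distribution parameters
        self.encoder = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 2 * LATENT_DIM))
        # Decoder: (state, low-dim latent) -> reconstructed high-dim action
        self.decoder = nn.Sequential(
            nn.Linear(STATE_DIM + LATENT_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM))

    def forward(self, state, action):
        stats = self.encoder(torch.cat([state, action], dim=-1))
        mu, log_var = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization
        recon = self.decoder(torch.cat([state, z], dim=-1))
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
        recon_loss = ((recon - action) ** 2).sum(dim=-1).mean()
        return recon_loss + 1e-2 * kl

    def decode(self, state, z):
        """At test time the user controls only the low-dimensional z (e.g. a joystick)."""
        return self.decoder(torch.cat([state, z], dim=-1))

# Training on demonstration (state, action) pairs (placeholder tensors stand in for real demos).
model = LatentActionCVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states, actions = torch.randn(256, STATE_DIM), torch.randn(256, ACTION_DIM)
for _ in range(10):
    loss = model(states, actions)
    opt.zero_grad(); loss.backward(); opt.step()

# Online control: map a 2-D joystick input to a 7-DoF robot action.
joystick = torch.tensor([[0.3, -0.1]])
print(model.decode(torch.randn(1, STATE_DIM), joystick).shape)
```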
User Study

• We trained on less than 7 minutes of kinesthetic demonstrations


• Demonstrations consisted of moving between shelves, pouring, stirring, and reaching motions
• We compared our Latent Actions approach to the current method for assistive robotic arms (End-Effector)

End-Effector

Latent Actions
Results
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


Nth order Theory of Mind

Most interactive tasks are not the same as playing chess!

… low-dimensional shared representation that captures the interaction and can change over time.
Other agents are often non-stationary:
They update their behavior in response to the robot.
Ego Agent and Other Agent, with actions $a \in \mathbb{R}^d$ and latent strategies $z_1, z_2, z_3, \ldots$

$$\tau^i = \{s_1, a_1, r_1, \ldots, s_T, a_T, r_T\}, \qquad z^{i+1} \sim f\big(z \mid z^i, \tau^i\big)$$
Modeling Other Agent’s Behavior

[Diagram: the encoder $\mathcal{E}$ maps the previous interaction $\tau^{i-1}$ to a latent strategy $z^i$ (“I think it will aim right next”); the decoder $\mathcal{D}$ predicts the current interaction $\hat{\tau}^i$.]

Learning objective (over the encoder and decoder parameters):
$$\max \; \sum_{i} \sum_{t} \log p\big(s_{t+1}^i, r_t^i \mid s_t^i, a_t^i, \tau^{i-1}\big)$$
Representation Learning

[Diagram: from the experience buffer, pairs $(\tau^{i-1}, \tau^i)$ are sampled; the encoder $\mathcal{E}$ maps $\tau^{i-1}$ to $z^i$ and the decoder $\mathcal{D}$ reconstructs $\hat{\tau}^i$.]
Learning and Influencing Latent Intent

Maximize expected return within an interaction, to react to the other agent:

$$\max_\pi \; \sum_{i=0}^{\infty} \gamma^{i} \; \mathbb{E}_{\pi(a \mid s,\, \hat{z}^{i})}\Big[\sum_{t=0}^{T} R(s_t, a_t)\Big]$$

This is trained jointly with the representation-learning module (encoder, latent $z$, decoder) over the experience buffer.
[Xie, Losey, Tolsma, Finn, Sadigh, CoRL 2020]
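A minimal sketch of LILI-style representation learning, with illustrative dimensions and architecture (the actual method trains this jointly with an off-policy RL agent such as SAC): a recurrent encoder maps the previous interaction to a latent strategy $z$, and a decoder predicts the next states and rewards of the current interaction, matching the objective above.

```python
import torch
import torch.nn as nn

S_DIM, A_DIM, Z_DIM, T = 4, 2, 8, 10   # state/action/latent dims, interaction length

encoder = nn.GRU(input_size=S_DIM + A_DIM + 1, hidden_size=Z_DIM, batch_first=True)
decoder = nn.Sequential(nn.Linear(S_DIM + A_DIM + Z_DIM, 64), nn.ReLU(),
                        nn.Linear(64, S_DIM + 1))   # predicts (next state, reward)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def encode(prev_tau):
    """Map the previous interaction tau^{i-1} = (s, a, r)_{1..T} to a latent strategy z^i."""
    _, h = encoder(prev_tau)          # prev_tau: (batch, T, S_DIM + A_DIM + 1)
    return h.squeeze(0)               # z^i: (batch, Z_DIM)

def representation_loss(prev_tau, s, a, s_next, r):
    """Surrogate for max log p(s_{t+1}, r_t | s_t, a_t, tau^{i-1}) via squared error."""
    z = encode(prev_tau)                                   # (batch, Z_DIM)
    z = z.unsqueeze(1).expand(-1, s.shape[1], -1)          # broadcast over timesteps
    pred = decoder(torch.cat([s, a, z], dim=-1))
    target = torch.cat([s_next, r], dim=-1)
    return ((pred - target) ** 2).mean()

# One gradient step on a placeholder batch from the experience buffer.
B = 16
prev_tau = torch.randn(B, T, S_DIM + A_DIM + 1)
s, a = torch.randn(B, T, S_DIM), torch.randn(B, T, A_DIM)
s_next, r = torch.randn(B, T, S_DIM), torch.randn(B, T, 1)
loss = representation_loss(prev_tau, s, a, s_next, r)
opt.zero_grad(); loss.backward(); opt.step()

# The ego agent's policy is then conditioned on z: pi(a | s, z), e.g. an SAC actor
# taking torch.cat([s, z], dim=-1) as input.
```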


Point Mass Navigation

SAC LILI
Ego Agent Other Agent

left
middle
right

Air Hockey Results (Ego Agent vs. Other Agent)


2x speed

SAC: initial policy


2x speed

SAC: 2 hours of training


2x speed

SAC: 4 hours of training


Air Hockey Results
2x speed

LILI: 4 hours of training


Air Hockey Results
Reacting to Other Agents

Maximize expected return within an interaction, to react to the other agent (same objective and representation-learning module as above).

Influencing Other Agents

Maximize expected return across interactions, to influence the other agent (same objective and representation-learning module as above).
Point Mass Navigation

SAC LILI LILI (with influence)


2x speed

LILI (with influence): 4 hours of training


Air Hockey Results

[Xie, Losey, Tolsma, Finn, Sadigh, CoRL 2020]


Playing with a
Human Expert

SAC: 45% success


Playing with a
Human Expert

LILI: 73% success


Key Takeaways

Human partners are often non-stationary, which can be represented by low-dimensional latent strategies.

LILI anticipates the partner’s policies using latent strategies to react to and influence the other agent.
Learn from Different Sources of Data

Expert demonstrations
Suboptimal demonstrations, observations
Language instructions, narrations
Pairwise comparisons, rankings, ordinal data
Physical corrections ✓
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


Learning from Interactions
• During interaction, users change the robot’s behavior
• Compliance alone does not provide an intelligent response to these physical changes
• How can robots learn from and intelligently respond to physical interactions?
Robots can learn by recognizing that
interactions are often intentional corrections.
Formalizing Physical Corrections
Value Alignment

MAP Estimate
Value Alignment

Conditionally Independent

$$P(\xi_H \mid \xi_R; \theta) = \frac{\exp\big(\theta^\top \Phi(\xi_H) - \lambda \|\xi_H - \xi_R\|^2\big)}{\int \exp\big(\theta^\top \Phi(\xi) - \lambda \|\xi - \xi_R\|^2\big)\, d\xi} \approx \exp\big(\theta^\top(\Phi(\xi_H) - \Phi(\xi_R)) - \lambda \|\xi_H - \xi_R\|^2\big)$$

Assume a prior $P(\theta) = \exp\big(-\tfrac{1}{2\beta}\|\theta - \theta_0\|^2\big)$.
Value Alignment: Correction

Value Alignment: Gradient Descent

Value Alignment: Learning Rule
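A minimal sketch of the resulting learning rule, assuming a toy trajectory and feature set (the names and numbers below are illustrative): after a physical correction, the reward weights move along the feature difference between the corrected trajectory and the robot's originally planned one, which is the gradient direction implied by the observation model and Gaussian prior above.

```python
import numpy as np

def features(xi):
    """Toy feature vector Phi(xi) for a trajectory xi of shape (T, 2):
    average height and average distance from an obstacle at the origin."""
    return np.array([xi[:, 1].mean(), np.linalg.norm(xi, axis=1).mean()])

def update_theta(theta, xi_R, xi_H, alpha=0.5):
    """Learning rule sketch: theta moves along Phi(xi_H) - Phi(xi_R)."""
    return theta + alpha * (features(xi_H) - features(xi_R))

# Robot's planned trajectory vs. the trajectory after the human pushes it higher.
T = np.linspace(0, 1, 20)
xi_R = np.stack([T, 0.2 * np.ones_like(T)], axis=1)
xi_H = np.stack([T, 0.5 * np.ones_like(T)], axis=1)   # human raised the end-effector

theta = np.zeros(2)
theta = update_theta(theta, xi_R, xi_H)
print("updated reward weights:", theta)   # the weight on the 'height' feature increases
```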
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions

• Partner Modeling: Active Info Gathering over Human’s Intent

• Shared Autonomy and Latent Actions

• Partner Modeling: Learning and Influencing Latent Intent

• Learning from physical feedback

• Learning from play


Learn from Different Sources of Data

Expert demonstrations
Suboptimal demonstrations, play, observations ✓
Language instructions, narrations
Pairwise comparisons, rankings, ordinal data
Physical corrections
Learning from Play Data

Human Play: unstructured, undirected multi-task demonstrations

- Cheap to collect – addresses scalability

- No task specifications
- Reset-free
- Broad state-action-goal coverage – addresses generalizability
- Access to human priors on goals and behaviors
Play Data Covers the State Space at a Faster Rate

Imitation policies learned from play are more robust at test time.

[Lynch et al., CoRL 2020]


How to Learn from Play Data?
Latent Motor Plans (Play-LMP):
Learn plans to represent random sequences from play
[Diagram: a play sequence over time, with the current state $s_t$, a sampled window $\tau$, and a goal observation $o_g$.]

[Lynch et al., CoRL 2020]


Insight: Viewing Play as Object Interactions

Play consists of back-to-back environment interactions

Pre-Interaction Interaction Post-Interaction


time

Next Interaction

Insight: By segmenting play into object interactions, we can...


1. Sample more informative and accurate goals
2. Bias the latent space to model interactions rather than random sequences, compared to prior work
PLATO: Predicting Latent Affordances Through Object-Centric Play
Pre-Interaction Interaction Post-Interaction
time

Next Interaction

Encode the interaction into an affordance $z$, sample an object goal state $o_g$, and learn a policy $\pi(\cdot \mid s_t, z, o_g)$.


Learning Affordances from Interaction Period
Affordance: Property of an object that defines how it can be used

[Diagram: interaction segments (e.g. pushing, lifting) $\tau$ are encoded by $E$ into a latent affordance $z$, biasing the latent space to model affordances; a prior encoder $E'$ predicts the affordance $z'$ from the object start and goal states ($o_s$, $o_g$) and is trained with a KL loss.]
Learning Policy for Affordances
Policy conditioned on the current state ($s_t$), goal ($o_g$), and affordance ($z$).

[Diagram: the encoder $E$ maps $\tau$ to the sampled affordance $z$; the policy $\pi$ takes $s_t$, $z$, and $o_g$ and outputs $\hat{a}_t$, trained against the play action $a_t$; the prior encoder $E'$ predicts $z'$ from $o_s$ and $o_g$ with a KL loss.]
Learning Policy for Affordances
Test Time: the policy is conditioned on the current state ($s_t$), goal ($o_g$), and the affordance ($z'$) sampled from the prior encoder $E'(o_s, o_g)$, and outputs $\hat{a}_t$.
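A minimal sketch of a PLATO-style affordance-conditioned policy, with illustrative sizes and losses (not the paper's exact architecture): a posterior encoder embeds the interaction segment into $z$, a prior predicts $z'$ from the object start and goal states so it can be used at test time, and the policy imitates play actions conditioned on ($s_t$, $z$, $o_g$).

```python
import torch
import torch.nn as nn

OBJ_DIM, S_DIM, A_DIM, Z_DIM, T = 3, 10, 7, 8, 16

posterior = nn.GRU(S_DIM + A_DIM, 2 * Z_DIM, batch_first=True)      # E:  tau -> z
prior = nn.Sequential(nn.Linear(2 * OBJ_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 2 * Z_DIM))                      # E': (o_s, o_g) -> z'
policy = nn.Sequential(nn.Linear(S_DIM + Z_DIM + OBJ_DIM, 128), nn.ReLU(),
                       nn.Linear(128, A_DIM))                        # pi(s_t, z, o_g) -> a_t

def gaussian(stats):
    mu, log_var = stats.chunk(2, dim=-1)
    return mu, log_var

def kl(mu_q, lv_q, mu_p, lv_p):
    """KL between the posterior and prior diagonal Gaussians."""
    return 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1).sum(-1).mean()

def loss(tau_s, tau_a, o_start, o_goal):
    _, h = posterior(torch.cat([tau_s, tau_a], dim=-1))
    mu_q, lv_q = gaussian(h.squeeze(0))
    mu_p, lv_p = gaussian(prior(torch.cat([o_start, o_goal], dim=-1)))
    z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * lv_q)        # sampled affordance
    z_rep = z.unsqueeze(1).expand(-1, T, -1)
    o_g_rep = o_goal.unsqueeze(1).expand(-1, T, -1)
    a_hat = policy(torch.cat([tau_s, z_rep, o_g_rep], dim=-1))
    return ((a_hat - tau_a) ** 2).mean() + 1e-2 * kl(mu_q, lv_q, mu_p, lv_p)

# One training step on placeholder data from segmented play interactions.
B = 8
tau_s, tau_a = torch.randn(B, T, S_DIM), torch.randn(B, T, A_DIM)
o_start, o_goal = torch.randn(B, OBJ_DIM), torch.randn(B, OBJ_DIM)
params = list(posterior.parameters()) + list(prior.parameters()) + list(policy.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
l = loss(tau_s, tau_a, o_start, o_goal)
opt.zero_grad(); l.backward(); opt.step()

# Test time: sample z' from the prior and act on the current state.
mu_p, _ = gaussian(prior(torch.cat([o_start[:1], o_goal[:1]], dim=-1)))
a = policy(torch.cat([torch.randn(1, S_DIM), mu_p, o_goal[:1]], dim=-1))
```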
Experiments: Block2D
Collected scripted & human play

Evaluated on a variety of primitives (randomized block sizes, masses, and positions)

Push Pull Lift Tip Side-Rotate

Baselines: Play-LMP and Play-GCBC


Experiments: Block2D

PLATO outperforms baselines on all primitives


- Both for scripted & human data

Performs well on complex primitives


PLATO’s latent space separates tasks, despite no task labels

Experiments: Block2D Latent Space


Experiments: 3D Manipulation

Block3D-Platforms Mug3D-Platforms Playroom3D


Experiments: Block3D-Platforms

PLATO does substantially better than prior methods on lifting. It also does as well or better on pushing.


Experiments: Mug3D-Platforms
PLATO does substantially better than prior methods on lifting. It also does as well or better on pushing and rotating.
Experiments: Playroom3D

PLATO improves on cabinet and drawer open/close tasks, outperforms in block push/move tasks, and improves on button pressing tasks.


Push Forward
Experiments: BlockReal

Franka Panda, trained on pushing tasks in simulation (no real-world play data).

PLATO generalizes to novel real-world object dynamics.
Push Right
Key Takeaway

PLATO intelligently segments the play data based on object interactions.

… PLATO learns a robust, generalizable, multi-task policy from play data.

PLATO: Predicting Latent Affordances Through Object-Centric Play


Suneel Belkhale, Dorsa Sadigh.
CoRL 2022
