Human-Robot Interaction
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
[Sadigh, Sastry, Seshia, Dragan, RSS 2016, IROS 2016, AURO 2018]
An autonomous car's actions will affect the actions of other drivers.
Source: https://twitter.com/nitguptaa/
Interaction as a Dynamical System
The robot has direct control over $u_\mathcal{R}$ and only indirect control over $u_\mathcal{H}$.
The robot plans its controls anticipating the human's best response:

$u_\mathcal{R}^* = \arg\max_{u_\mathcal{R}} R_\mathcal{R}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big)$

The human reward is a weighted combination of features, accumulated over the horizon:

$R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = w^\top \phi(x, u_\mathcal{R}, u_\mathcal{H}), \qquad R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = \sum_t r_\mathcal{H}(x^t, u_\mathcal{R}^t, u_\mathcal{H}^t)$

and the human is approximated as best-responding to the robot's actions:

$u_\mathcal{H}^*(x, u_\mathcal{R}) \approx \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})$

Differentiating the robot's objective $R_\mathcal{R}(x, u_\mathcal{R}, u_\mathcal{H}^*)$ through the human's best response gives

$\dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}} \dfrac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$
Solution of Nested Optimization
Quasi-Newton method, using the gradient through the human's best response:

$\dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}} \cdot \dfrac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$
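As a concrete illustration, here is a minimal numerical sketch of this nested optimization (not the paper's implementation): the inner best response is re-solved for each candidate robot plan, and a quasi-Newton method (L-BFGS) optimizes the outer objective, with the gradient through the inner argmax approximated by finite differences. All function names and dimensions are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def best_response(u_R, x, R_H, u_dim):
    """Inner problem: the human's (approximate) best response to robot plan u_R."""
    res = minimize(lambda u_H: -R_H(x, u_R, u_H), x0=np.zeros(u_dim))
    return res.x

def outer_objective(u_R, x, R_R, R_H, u_dim):
    """Robot reward evaluated at the human's best response."""
    return R_R(x, u_R, best_response(u_R, x, R_H, u_dim))

def plan(x, R_R, R_H, u_dim):
    # Quasi-Newton (L-BFGS) on the outer objective; the gradient through the
    # inner argmax is approximated here by finite differences rather than the
    # implicit-differentiation formula above.
    res = minimize(lambda u_R: -outer_objective(u_R, x, R_R, R_H, u_dim),
                   x0=np.zeros(u_dim), method="L-BFGS-B")
    return res.x
```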
Notation: ℛ denotes the robot, ℋ the human.
Implication: Efficiency
Implication: Coordination
Legible Motion
[Figure: y-position of the human vehicle vs. x-position of the autonomous vehicle, comparing the Ideal Human, Dynamic Obstacle, and Interaction-Aware models in a human-crossing scenario.]
$p(u_\mathcal{H} \mid x) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}))$
We can't rely on a single driver model; we need to differentiate between different drivers.
$p(u_\mathcal{H} \mid x, \theta) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}, \theta))$
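A minimal sketch of this Boltzmann ("noisy-rational") observation model over a discretized action set; the softmax normalization over candidate actions is an assumption made explicit here:

```python
import numpy as np

def action_distribution(u_candidates, x, theta, R_H):
    """p(u_H | x, theta) ∝ exp(R_H(x, u_H, theta)) over a discrete action set."""
    r = np.array([R_H(x, u, theta) for u in u_candidates])
    p = np.exp(r - r.max())   # subtract max for numerical stability
    return p / p.sum()
```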
$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} R_\mathcal{R}$
Drivers respond to the actions of other cars.
The robot's reward trades off information gain, the reduction in entropy of its belief over the driver type $\theta$, against progress toward its goal:

$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b^t) - \mathbb{H}(b^{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$

The human model is now conditioned on the robot's action,

$p(u_\mathcal{H} \mid x, \theta, u_\mathcal{R}) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}, \theta, u_\mathcal{R}))$

and the robot maximizes expected reward under its belief over $\theta$:

$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta[R_\mathcal{R}]$
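A minimal sketch of this information-gain reward, assuming a discrete set of driver types $\theta$ and the Boltzmann action model above; helper names are illustrative:

```python
import numpy as np

def entropy(b):
    b = b[b > 0]
    return -(b * np.log(b)).sum()

def belief_update(b, thetas, u_obs, u_candidates, x, u_R, R_H):
    """Bayes update: b'(θ) ∝ p(u_obs | x, θ, u_R) · b(θ)."""
    post = np.empty_like(b)
    for i, th in enumerate(thetas):
        r = np.array([R_H(x, u, th, u_R) for u in u_candidates])
        p = np.exp(r - r.max()); p /= p.sum()        # Boltzmann likelihood
        post[i] = p[u_candidates.index(u_obs)] * b[i]
    return post / post.sum()

def robot_reward(b, b_next, x, u_obs, u_R, R_goal, lam):
    """Info gain (entropy reduction) plus λ-weighted progress toward the goal."""
    return entropy(b) - entropy(b_next) + lam * R_goal(x, u_obs, u_R)
```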
Nudging in for Active Info Gathering
[Figure: forward y-position of the human over time, comparing a distracted human and an attentive human.]
Robot Active Info Gathering
[Figure: forward x-position of the robot over time. With an attentive human the robot keeps inching forward; with a distracted human it goes back. Panels (b) Scenario 2 and (c) Scenario 3 show the corresponding human responses over time.]
Belief over Driving Style: Active vs. Passive
[Figure: belief $b(\theta = \text{attentive})$ over time, for active vs. passive information gathering.]
Key Idea:

$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b^t) - \mathbb{H}(b^{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$

$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta[R_\mathcal{R}]$
Modeling Intent Inference using POMDPs
[Javdani et al.]
POMDP Formulation
MDPs have:
- States $S$
- Actions $A$
- Transition Function $P(s' \mid s, a)$
- Reward $R(s, a, s')$
POMDPs add:
- Observations $O$
- Observation Function $P(o \mid s)$
Tiger Example
Reward Function:
- Penalty for wrong opening: -100
- Reward for correct opening: +10
- Cost of listening: -1
Observations:
- Hearing the tiger on the left
- Hearing the tiger on the right
Tiger Example
Belief update based on observations:
$b'(s') \propto p(o \mid s', a) \sum_{s \in S} p(s' \mid s, a) \cdot b(s)$
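A minimal sketch of this belief update for the tiger problem, with two hidden states and an identity transition under "listen"; the 0.85 hearing accuracy is the standard tiger-problem value, an assumption not stated on the slide:

```python
import numpy as np

# States: index 0 = tiger-left, index 1 = tiger-right.
# Listening does not move the tiger, so the transition is the identity.
P_trans = np.eye(2)

# P(o | s', listen): hear the tiger on the correct side 85% of the time.
P_obs = {"hear-left":  np.array([0.85, 0.15]),
         "hear-right": np.array([0.15, 0.85])}

def belief_update(b, o):
    """b'(s') ∝ p(o | s', a) · Σ_s p(s' | s, a) · b(s)."""
    predicted = P_trans.T @ b           # prediction step
    posterior = P_obs[o] * predicted    # correction step
    return posterior / posterior.sum()

b = np.array([0.5, 0.5])
b = belief_update(b, "hear-left")       # -> array([0.85, 0.15])
```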
Q-MDP
Approximation: $V^*(b) = \mathbb{E}_s[V^*(s)] = \sum_s b(s) \cdot V^*(s)$
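A minimal sketch of the QMDP action selection: solve the fully observable MDP (e.g., by value iteration) to get $Q^*(s,a)$, then weight by the belief; names and shapes are illustrative:

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """argmax_a Σ_s b(s) · Q*(s, a); Q_mdp has shape (num_states, num_actions)."""
    return int(np.argmax(b @ Q_mdp))
```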
Intent Inference
- $X$: robot states
- $A$: robot actions
- $T: X \times A \to X$: transition function
- $p(\xi \mid g) \propto \exp(-C_{\text{goal}}(\xi))$: trajectories are exponentially more likely when they incur lower cost toward the goal

With this approximation, you never gather information, but you can plan efficiently in deterministic subproblems.

$Q(b, a, u) = \sum_g b(g) \cdot Q_g(x, a, u)$

The left-hand side is the action-value function of the POMDP; $Q_g$ is the cost-to-go of acting optimally while going toward goal $g$.
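A minimal sketch of this hindsight-optimization style value: per-goal cost-to-go functions are combined under the belief over goals. Data structures here (a dict of goal probabilities and a dict of callables) are assumptions:

```python
def q_value(b, x, a, u, Q_g):
    """Q(b, a, u) = Σ_g b(g) · Q_g(x, a, u); Q_g is the cost-to-go toward goal g."""
    return sum(b[g] * Q_g[g](x, a, u) for g in b)

def assist(b, x, a, candidate_us, Q_g):
    # Since Q_g is a cost-to-go, pick the assistance action with lowest expected cost.
    return min(candidate_us, key=lambda u: q_value(b, x, a, u, Q_g))
```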
Shared Autonomy with Hindsight Optimization
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
Each interaction produces a trajectory

$\tau^i = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$

and the other agent's latent strategy evolves across interactions $(z^1, z^2, z^3, \ldots)$:

$z^{i+1} \sim f(\cdot \mid z^i, \tau^i)$
Modeling Other Agent's Behavior
[Diagram: the encoder $\mathcal{E}$ maps the previous trajectory $\tau^{i-1}$ to a latent strategy $z^i$ ("I think it will aim right next"); the decoder $\mathcal{D}$ predicts the next trajectory $\hat\tau^i$.]
Learning objective: maximize $\sum_i \sum_t \log p\big(s_{t+1}^i, r_t^i \mid s_t^i, a_t^i, \tau^{i-1}\big)$ over the model parameters.
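A minimal PyTorch-style sketch of this objective: an encoder embeds the previous trajectory, and a decoder scores next-state/reward predictions. Module names, layer sizes, and the MSE surrogate for the log-likelihood are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):                     # E: tau^{i-1} -> z^i
    def __init__(self, traj_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU(),
                                 nn.Linear(128, z_dim))

    def forward(self, tau_prev):
        return self.net(tau_prev)

class Decoder(nn.Module):                     # D: (s_t, a_t, z) -> (s_{t+1}, r_t)
    def __init__(self, s_dim, a_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, s_dim + 1))

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

def prediction_loss(enc, dec, tau_prev, s, a, s_next, r):
    """MSE surrogate for max Σ log p(s_{t+1}, r_t | s_t, a_t, tau^{i-1})."""
    z = enc(tau_prev).expand(s.shape[0], -1)  # one latent per interaction
    pred = dec(s, a, z)
    target = torch.cat([s_next, r.unsqueeze(-1)], dim=-1)
    return ((pred - target) ** 2).mean()
```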
Representation Learning
[Diagram: pairs $(\tau^{i-1}, \tau^i)$ are drawn from an experience buffer; $\mathcal{E}$ encodes $\tau^{i-1}$ into $z^i$, and $\mathcal{D}$ decodes $\hat\tau^i$.]
Learning and Influencing Latent Intent
[Diagram: the same encoder-decoder pipeline ($\tau^{i-1} \to \mathcal{E} \to z^i \to \mathcal{D} \to \hat\tau^i$) now also feeds the latent strategy $z^i$ into an SAC policy: the ego agent conditions its actions on the other agent's predicted intent (e.g., aiming left, middle, or right).]
LILI anticipates the partner's policies using latent strategies to react and influence the other agent.
Learn from Different Sources of Data
Expert demonstrations
Suboptimal demonstrations, observations
Language instructions, narrations
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
• How can robots learn from and intelligently respond to physical interactions?
Robots can learn by recognizing that interactions are often intentional corrections.
Formalizing Physical Corrections
Value Alignment
MAP Estimate
Conditionally Independent
$P(\xi_H \mid \xi_R; \theta) = \dfrac{\exp\big(\theta^\top \Phi(\xi_H) - \lambda \|\xi_H - \xi_R\|^2\big)}{\int \exp\big(\theta^\top \Phi(\xi) - \lambda \|\xi - \xi_R\|^2\big)\, d\xi} \approx \exp\big(\theta^\top (\Phi(\xi_H) - \Phi(\xi_R)) - \lambda \|\xi_H - \xi_R\|^2\big)$

Assume a prior $P(\theta) = \exp\big(-\tfrac{1}{2\beta} \|\theta - \theta_0\|^2\big)$
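A minimal sketch of the resulting MAP update: maximizing the approximate log-posterior in $\theta$ (the terms not depending on $\theta$ drop out) and setting the gradient to zero gives a closed-form feature-difference rule; written here as an illustration under the equations above:

```python
import numpy as np

def map_update(theta_0, Phi, xi_H, xi_R, beta):
    """Setting d/dθ [θᵀ(Φ(ξ_H) − Φ(ξ_R)) − ‖θ − θ₀‖²/(2β)] = 0
    yields θ = θ₀ + β (Φ(ξ_H) − Φ(ξ_R))."""
    return theta_0 + beta * (np.asarray(Phi(xi_H)) - np.asarray(Phi(xi_R)))
```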
Value Alignment
From the observed correction, gradient descent on the log-posterior yields the learning rule.
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
Expert demonstrations
Suboptimal demonstrations, play, observations
Language instructions, narrations
Learning from Play Data
- No task specifications
- Reset-free
- Broad state-action-goal coverage, which addresses generalization
- Access to human priors on goals and behaviors
Play Data Covers the State Space at a Faster Rate
Imitation policies learned from play are more robust at test time.
Next Interaction
[Diagram: a play trajectory $\tau^{(i)}$ is encoded by $E$ into an affordance $z$; the policy $\pi$, conditioned on the current state $s_t$, the goal observation $o_g$, and $z$, outputs $\hat a_t$ and is trained against the logged action $a_t$ with a behavior-cloning loss $L_{BC}$; a second encoder $E'$ predicts $z'$ from observations, trained to match $z$. Example task: a pushing policy.]
Learning Policy for Affordances
Test Time:
The policy is conditioned on the current state ($s_t$), the goal ($o_g$), and an affordance ($z'$) sampled from $E'$.
[Diagram: $E' \to z'$, then $\pi(s_t, o_g, z') \to \hat a_t$.]
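A minimal sketch of this test-time rollout, assuming the encoder/policy modules sketched above and a gym-style environment; all names, shapes, and the single-sample affordance are assumptions:

```python
import torch

def rollout(env, policy, proposal_enc, o_goal, horizon=100):
    """Run π conditioned on (s_t, o_g, z'), with z' sampled once from E'."""
    s = env.reset()
    z_prime = proposal_enc(torch.as_tensor(s, dtype=torch.float32),
                           torch.as_tensor(o_goal, dtype=torch.float32))
    for _ in range(horizon):
        inp = torch.cat([torch.as_tensor(s, dtype=torch.float32),
                         torch.as_tensor(o_goal, dtype=torch.float32),
                         z_prime])
        a = policy(inp)                           # π(a | s_t, o_g, z')
        s, _, done, _ = env.step(a.detach().numpy())
        if done:
            break
    return s
```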
Experiments: Block2D
Collected scripted and human play data.