Human-Robot Interaction
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
[Sadigh, Sastry, Seshia, Dragan, RSS 2016, IROS 2016, AURO 2018]
An autonomous car's actions will affect the actions of other drivers.
Source: https://twitter.com/nitguptaa/
Interaction as a Dynamical System
The robot has direct control over $u_\mathcal{R}$ and only indirect control over $u_\mathcal{H}$.
The robot plans its controls anticipating the human's best response:

$u_\mathcal{R}^* = \arg\max_{u_\mathcal{R}} R_\mathcal{R}\big(x, u_\mathcal{R}, u_\mathcal{H}^*(x, u_\mathcal{R})\big)$

The human reward is a weighted combination of features, accumulated over the horizon:

$R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = w^\top \phi(x, u_\mathcal{R}, u_\mathcal{H}), \qquad R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H}) = \sum_t r_\mathcal{H}(x^t, u_\mathcal{R}^t, u_\mathcal{H}^t)$

and the human is approximated as best-responding to the robot's actions:

$u_\mathcal{H}^*(x, u_\mathcal{R}) \approx \arg\max_{u_\mathcal{H}} R_\mathcal{H}(x, u_\mathcal{R}, u_\mathcal{H})$

Differentiating the robot's objective $R_\mathcal{R}(x, u_\mathcal{R}, u_\mathcal{H}^*)$ through the human's best response gives

$\dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}} \dfrac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$
Solution of Nested Optimization
Quasi-Newton method, using the gradient through the human's best response:

$\dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}} = \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{H}} \cdot \dfrac{\partial u_\mathcal{H}^*}{\partial u_\mathcal{R}} + \dfrac{\partial R_\mathcal{R}}{\partial u_\mathcal{R}}$
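As a concrete illustration, here is a minimal numerical sketch of this nested optimization (not the paper's implementation): the inner best response is re-solved for each candidate robot plan, and a quasi-Newton method (L-BFGS) optimizes the outer objective, with the gradient through the inner argmax approximated by finite differences. All function names and dimensions are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def best_response(u_R, x, R_H, u_dim):
    """Inner problem: the human's (approximate) best response to robot plan u_R."""
    res = minimize(lambda u_H: -R_H(x, u_R, u_H), x0=np.zeros(u_dim))
    return res.x

def outer_objective(u_R, x, R_R, R_H, u_dim):
    """Robot reward evaluated at the human's best response."""
    return R_R(x, u_R, best_response(u_R, x, R_H, u_dim))

def plan(x, R_R, R_H, u_dim):
    # Quasi-Newton (L-BFGS) on the outer objective; the gradient through the
    # inner argmax is approximated here by finite differences rather than the
    # implicit-differentiation formula above.
    res = minimize(lambda u_R: -outer_objective(u_R, x, R_R, R_H, u_dim),
                   x0=np.zeros(u_dim), method="L-BFGS-B")
    return res.x
```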
Notation: ℛ denotes the robot, ℋ the human.
Implication: Efficiency
Implication: Coordination
Legible Motion
[Figure: y-position of the human vehicle vs. x-position of the autonomous vehicle, comparing the Ideal Human, Dynamic Obstacle, and Interaction-Aware models in a human-crossing scenario.]
$p(u_\mathcal{H} \mid x) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}))$
We can't rely on a single driver model; we need to differentiate between different drivers.
$p(u_\mathcal{H} \mid x, \theta) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}, \theta))$
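A minimal sketch of this Boltzmann ("noisy-rational") observation model over a discretized action set; the softmax normalization over candidate actions is an assumption made explicit here:

```python
import numpy as np

def action_distribution(u_candidates, x, theta, R_H):
    """p(u_H | x, theta) ∝ exp(R_H(x, u_H, theta)) over a discrete action set."""
    r = np.array([R_H(x, u, theta) for u in u_candidates])
    p = np.exp(r - r.max())   # subtract max for numerical stability
    return p / p.sum()
```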
$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} R_\mathcal{R}$
Drivers respond to the actions of other cars.
The robot's reward trades off information gain, the reduction in entropy of its belief over the driver type $\theta$, against progress toward its goal:

$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b^t) - \mathbb{H}(b^{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$

The human model is now conditioned on the robot's action,

$p(u_\mathcal{H} \mid x, \theta, u_\mathcal{R}) \propto \exp(R_\mathcal{H}(x, u_\mathcal{H}, \theta, u_\mathcal{R}))$

and the robot maximizes expected reward under its belief over $\theta$:

$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta[R_\mathcal{R}]$
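A minimal sketch of this information-gain reward, assuming a discrete set of driver types $\theta$ and the Boltzmann action model above; helper names are illustrative:

```python
import numpy as np

def entropy(b):
    b = b[b > 0]
    return -(b * np.log(b)).sum()

def belief_update(b, thetas, u_obs, u_candidates, x, u_R, R_H):
    """Bayes update: b'(θ) ∝ p(u_obs | x, θ, u_R) · b(θ)."""
    post = np.empty_like(b)
    for i, th in enumerate(thetas):
        r = np.array([R_H(x, u, th, u_R) for u in u_candidates])
        p = np.exp(r - r.max()); p /= p.sum()        # Boltzmann likelihood
        post[i] = p[u_candidates.index(u_obs)] * b[i]
    return post / post.sum()

def robot_reward(b, b_next, x, u_obs, u_R, R_goal, lam):
    """Info gain (entropy reduction) plus λ-weighted progress toward the goal."""
    return entropy(b) - entropy(b_next) + lam * R_goal(x, u_obs, u_R)
```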
Nudging in for Active Info Gathering
[Figure: forward y-position of the human over time, comparing a distracted human and an attentive human.]
Robot Active Info Gathering
[Figure: forward x-position of the robot over time. With an attentive human the robot keeps inching forward; with a distracted human it goes back. Panels (b) Scenario 2 and (c) Scenario 3 show the corresponding human responses over time.]
Belief over Driving Style: Active vs. Passive
[Figure: belief $b(\theta = \text{attentive})$ over time, for active vs. passive information gathering.]
Key Idea:

$R_\mathcal{R}(x, u_\mathcal{H}, \theta, u_\mathcal{R}) = \mathbb{H}(b^t) - \mathbb{H}(b^{t+1}) + \lambda \cdot R_{\text{goal}}(x, u_\mathcal{H}, \theta, u_\mathcal{R})$

$u_\mathcal{R} = \arg\max_{u_\mathcal{R}} \mathbb{E}_\theta[R_\mathcal{R}]$
Modeling Intent Inference using POMDPs
[Javdani et al.]
POMDP Formulation
MDPs have:
- States $S$
- Actions $A$
- Transition Function $P(s' \mid s, a)$
- Reward $R(s, a, s')$
POMDPs add:
- Observations $O$
- Observation Function $P(o \mid s)$
Tiger Example
Reward Function:
- Penalty for wrong opening: -100
- Reward for correct opening: +10
- Cost of listening: -1
Observations:
- Hearing the tiger on the left
- Hearing the tiger on the right
Tiger Example
Belief update based on observations:
$b'(s') \propto p(o \mid s', a) \sum_{s \in S} p(s' \mid s, a) \cdot b(s)$
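A minimal sketch of this belief update for the tiger problem, with two hidden states and an identity transition under "listen"; the 0.85 hearing accuracy is the standard tiger-problem value, an assumption not stated on the slide:

```python
import numpy as np

# States: index 0 = tiger-left, index 1 = tiger-right.
# Listening does not move the tiger, so the transition is the identity.
P_trans = np.eye(2)

# P(o | s', listen): hear the tiger on the correct side 85% of the time.
P_obs = {"hear-left":  np.array([0.85, 0.15]),
         "hear-right": np.array([0.15, 0.85])}

def belief_update(b, o):
    """b'(s') ∝ p(o | s', a) · Σ_s p(s' | s, a) · b(s)."""
    predicted = P_trans.T @ b           # prediction step
    posterior = P_obs[o] * predicted    # correction step
    return posterior / posterior.sum()

b = np.array([0.5, 0.5])
b = belief_update(b, "hear-left")       # -> array([0.85, 0.15])
```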
Q-MDP
Approximation: $V^*(b) = \mathbb{E}_s[V^*(s)] = \sum_s b(s) \cdot V^*(s)$
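A minimal sketch of the QMDP action selection: solve the fully observable MDP (e.g., by value iteration) to get $Q^*(s,a)$, then weight by the belief; names and shapes are illustrative:

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """argmax_a Σ_s b(s) · Q*(s, a); Q_mdp has shape (num_states, num_actions)."""
    return int(np.argmax(b @ Q_mdp))
```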
Intent Inference
- $X$: robot states
- $A$: robot actions
- $T: X \times A \to X$: transition function
- $p(\xi \mid g) \propto \exp(-C_{\text{goal}}(\xi))$: trajectories are exponentially more likely when they incur lower cost toward the goal

With this approximation, you never gather information, but you can plan efficiently in deterministic subproblems.

$Q(b, a, u) = \sum_g b(g) \cdot Q_g(x, a, u)$

The left-hand side is the action-value function of the POMDP; $Q_g$ is the cost-to-go of acting optimally while going toward goal $g$.
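A minimal sketch of this hindsight-optimization style value: per-goal cost-to-go functions are combined under the belief over goals. Data structures here (a dict of goal probabilities and a dict of callables) are assumptions:

```python
def q_value(b, x, a, u, Q_g):
    """Q(b, a, u) = Σ_g b(g) · Q_g(x, a, u); Q_g is the cost-to-go toward goal g."""
    return sum(b[g] * Q_g[g](x, a, u) for g in b)

def assist(b, x, a, candidate_us, Q_g):
    # Since Q_g is a cost-to-go, pick the assistance action with lowest expected cost.
    return min(candidate_us, key=lambda u: q_value(b, x, a, u, Q_g))
```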
Shared Autonomy with Hindsight Optimization
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
Each interaction produces a trajectory

$\tau^i = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$

and the other agent's latent strategy evolves across interactions $(z^1, z^2, z^3, \ldots)$:

$z^{i+1} \sim f(\cdot \mid z^i, \tau^i)$
Modeling Other Agent's Behavior
[Diagram: the encoder $\mathcal{E}$ maps the previous trajectory $\tau^{i-1}$ to a latent strategy $z^i$ ("I think it will aim right next"); the decoder $\mathcal{D}$ predicts the next trajectory $\hat\tau^i$.]
Learning objective: maximize $\sum_i \sum_t \log p\big(s_{t+1}^i, r_t^i \mid s_t^i, a_t^i, \tau^{i-1}\big)$ over the model parameters.
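A minimal PyTorch-style sketch of this objective: an encoder embeds the previous trajectory, and a decoder scores next-state/reward predictions. Module names, layer sizes, and the MSE surrogate for the log-likelihood are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):                     # E: tau^{i-1} -> z^i
    def __init__(self, traj_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU(),
                                 nn.Linear(128, z_dim))

    def forward(self, tau_prev):
        return self.net(tau_prev)

class Decoder(nn.Module):                     # D: (s_t, a_t, z) -> (s_{t+1}, r_t)
    def __init__(self, s_dim, a_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, s_dim + 1))

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

def prediction_loss(enc, dec, tau_prev, s, a, s_next, r):
    """MSE surrogate for max Σ log p(s_{t+1}, r_t | s_t, a_t, tau^{i-1})."""
    z = enc(tau_prev).expand(s.shape[0], -1)  # one latent per interaction
    pred = dec(s, a, z)
    target = torch.cat([s_next, r.unsqueeze(-1)], dim=-1)
    return ((pred - target) ** 2).mean()
```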
Representation Learning
[Diagram: pairs $(\tau^{i-1}, \tau^i)$ are drawn from an experience buffer; $\mathcal{E}$ encodes $\tau^{i-1}$ into $z^i$, and $\mathcal{D}$ decodes $\hat\tau^i$.]
Learning and Influencing Latent Intent
[Diagram: the same encoder-decoder pipeline ($\tau^{i-1} \to \mathcal{E} \to z^i \to \mathcal{D} \to \hat\tau^i$) now also feeds the latent strategy $z^i$ into an SAC policy: the ego agent conditions its actions on the other agent's predicted intent (e.g., aiming left, middle, or right).]
LILI anticipates the partner's policies using latent strategies to react and influence the other agent.
Learn from Different Sources of Data
Expert demonstrations
Suboptimal demonstrations, observations
Language instructions, narrations
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
• How can robots learn from and intelligently respond to physical interactions?
Robots can learn by recognizing that interactions are often intentional corrections.
Formalizing Physical Corrections
Value Alignment
MAP Estimate
Conditionally Independent
$P(\xi_H \mid \xi_R; \theta) = \dfrac{\exp\big(\theta^\top \Phi(\xi_H) - \lambda \|\xi_H - \xi_R\|^2\big)}{\int \exp\big(\theta^\top \Phi(\xi) - \lambda \|\xi - \xi_R\|^2\big)\, d\xi} \approx \exp\big(\theta^\top (\Phi(\xi_H) - \Phi(\xi_R)) - \lambda \|\xi_H - \xi_R\|^2\big)$

Assume a prior $P(\theta) = \exp\big(-\tfrac{1}{2\beta} \|\theta - \theta_0\|^2\big)$
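A minimal sketch of the resulting MAP update: maximizing the approximate log-posterior in $\theta$ (the terms not depending on $\theta$ drop out) and setting the gradient to zero gives a closed-form feature-difference rule; written here as an illustration under the equations above:

```python
import numpy as np

def map_update(theta_0, Phi, xi_H, xi_R, beta):
    """Setting d/dθ [θᵀ(Φ(ξ_H) − Φ(ξ_R)) − ‖θ − θ₀‖²/(2β)] = 0
    yields θ = θ₀ + β (Φ(ξ_H) − Φ(ξ_R))."""
    return theta_0 + beta * (np.asarray(Phi(xi_H)) - np.asarray(Phi(xi_R)))
```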
Value Alignment
From the observed correction, gradient descent on the log-posterior yields the learning rule.
Today’s itinerary
• Game-Theoretic Views on Multi-Agent Interactions
Expert demonstrations
Suboptimal demonstrations, play, observations
Language instructions, narrations
Learning from Play Data
- No task specifications
- Reset-free
- Broad state-action-goal coverage, which addresses generalization
- Access to human priors on goals and behaviors
Play Data Covers the State Space at a Faster Rate
Imitation policies learned from play are more robust at test time.
Next Interaction
[Diagram: a play trajectory $\tau^{(i)}$ is encoded by $E$ into an affordance $z$; the policy $\pi$, conditioned on the current state $s_t$, the goal observation $o_g$, and $z$, outputs $\hat a_t$ and is trained against the logged action $a_t$ with a behavior-cloning loss $L_{BC}$; a second encoder $E'$ predicts $z'$ from observations, trained to match $z$. Example task: a pushing policy.]
Learning Policy for Affordances
Test Time:
The policy is conditioned on the current state ($s_t$), the goal ($o_g$), and an affordance ($z'$) sampled from $E'$.
[Diagram: $E' \to z'$, then $\pi(s_t, o_g, z') \to \hat a_t$.]
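A minimal sketch of this test-time rollout, assuming the encoder/policy modules sketched above and a gym-style environment; all names, shapes, and the single-sample affordance are assumptions:

```python
import torch

def rollout(env, policy, proposal_enc, o_goal, horizon=100):
    """Run π conditioned on (s_t, o_g, z'), with z' sampled once from E'."""
    s = env.reset()
    z_prime = proposal_enc(torch.as_tensor(s, dtype=torch.float32),
                           torch.as_tensor(o_goal, dtype=torch.float32))
    for _ in range(horizon):
        inp = torch.cat([torch.as_tensor(s, dtype=torch.float32),
                         torch.as_tensor(o_goal, dtype=torch.float32),
                         z_prime])
        a = policy(inp)                           # π(a | s_t, o_g, z')
        s, _, done, _ = env.step(a.detach().numpy())
        if done:
            break
    return s
```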
Experiments: Block2D
Collected scripted and human play data.