Professional Documents
Culture Documents
demos
Expert demonstrations
Learn from Different Sources of Data
Expert demonstrations
Suboptimal demonstrations, play, observations
Language instructions, narrations
X ✓
Learn from Different Sources of Data
Expert demonstrations
Suboptimal demonstrations, observations
Language instructions, narrations
X ✓
Leverage comparisons
as useful observations about the
desired robot reward function.
Which grasping trajectory do you prefer?
𝜉! 𝜉"
𝜉! or 𝜉" ?
𝜉! 𝜉"
𝑅! or 𝑅" ?
𝑅 =𝒘⋅𝜙
𝜉! 𝜉"
𝑅! or 𝑅" ?
𝒘 ⋅ 𝜙! > 𝒘 ⋅ 𝜙"
or 𝒘 ⋅ 𝜙! < 𝒘 ⋅ 𝜙"
𝜉! 𝜉"
𝜑 = 𝜙! − 𝜙"
𝑅! or 𝑅" ?
𝒘⋅𝜑 >0
or 𝒘 ⋅ 𝜑 < 0
𝒘𝟏
𝒘𝟐
𝒘𝟑
𝒘𝟏
𝒘𝟐
𝒘𝟑
𝒘𝟏
𝒘𝟐
𝒘𝟑
0}
=
. φ 𝒘𝟏
: 𝒘
{𝒘
𝒘𝟐
a
𝒘𝟑 re s
d to
pon rplan
e
e
s cor hyp
e
ueri ating
q ar
sep
0}
=
. φ 𝒘𝟏
: 𝒘
{𝒘
X ✓
φ
𝒘𝟐
a
𝒘𝟑 re s
d to
pon rplan
e
e
s cor hyp
e
ueri ating
q ar
sep
can be noisy!
𝒘𝟏 𝒘𝟏
𝒘𝟐 𝒘𝟐
𝒘𝟑 𝒘𝟑
1
𝑓! 𝒘 = 𝑝 𝐼" 𝒘 =
1 + exp(−𝐼" 𝒘# 𝜑)
φ
𝒘𝟐
a
𝒘𝟑 e s p
d
rp
to
on lane
corr hype
es
u eri rating
q pa
se
0} 𝜑 determines how
=
. φ 𝒘𝟏
: 𝒘 informative the query is
{𝒘 (in expectation)
φ
𝒘𝟐
a
𝒘𝟑 es p
d
rp
to
on lane
corr hype
es
u eri ating
q ar
sep
0} 𝜑 determines how
=
. φ 𝒘𝟏
: 𝒘 informative the query is
{𝒘 (in expectation)
φ
𝒘𝟐
a
𝒘𝟑 es p
d
rp
to
on lane
corr hype
es
u eri ating
q ar
sep
Queries should be actively synthesized.
Actively synthesizing queries
update function 𝑓! 𝒘 = min(1, exp(𝐼" 𝒘# 𝜑))
Subject to 𝜑 ∈ 𝔽
𝔽 = {𝜑: 𝜑 = Φ 𝜉$ − Φ 𝜉% , 𝜉$ , 𝜉% ∈ Ξ}
Features: max altitude, final distance to the closest basket, max horizontal range, total
angular displacement [Bıyık, Sadigh. CoRL 2018]
𝑅 𝜉 =𝜃⋅𝜙 𝜉
𝑅 𝜉! > 𝑅(𝜉" )
𝑅 𝜉! = 𝜃 ⋅ 𝜙 𝜉!
Trajectory
Designing features is hard.
Features
[Sadigh’17]
[Basu’18]
[Levine’10] Feature generation?
𝑅 𝜉! > 𝑅(𝜉" )
[Biyik’18,’19]
[Cobo’11]
[Katz’19]
[Singliar’12] Deep learning?
[Palan’19]
[Choi’13]
[Wilde’20]
[Arenz’14]
Trajectory Features: Shot Speed, Shot Angle
𝑅 𝜉! > 𝑅(𝜉" )
[Biyik, Huynh, Kochenderfer, Sadigh. RSS20]
Training: An Optimal Query with GP Reward
𝜉* 𝜉+
𝑤 ⋅ 𝜙 𝑠, 𝑎 𝑟! 𝑠, 𝑎
Few-Shot Preference Learning for Human-in-the-Loop RL
Pre-training
Prior Tasks
Pre-training
Prior Tasks
…
𝑟!
…
Segments 𝜎! , 𝜎"
…
𝑟!
…
Segments 𝜎! , 𝜎"
New Task
…
𝑟#! 𝑠, 𝑎 𝑎
𝑟! 𝑟! !
…
𝜋! (𝑎|𝑠) Drawer Open
Segments 𝜎! , 𝜎"
𝜉" 𝜉#
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Preferences:
Easier and more accurate to use – but gives one bit of information.
X ✓
Demonstrations:
Rich and informative – but noisy and inaccurate.
Actively synthesizing queries
Inverse Reinforcement Learning Policy DemPref Policy
Nonlinear Rewards for Exoskeletons
ROIAL: Region of Interest Active Learning for Characterizing Exoskeleton Gait Preference Landscapes
K. Li, et al. ICRA’21.
Learn Human Preferences
Lewis, Mike, et al. "Deal or no deal? end-to-end learning for negotiation dialogues."
Negotiation Domain
Negotiation Domain
Negotiation Domain
What is a fair,
polite, human-like
negotiation?
Negotiation Domain
Construct
prompt (𝜌)
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
objective (versatile behavior) ------------------------------------------------------------------- Construct
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
objective (versatile behavior) ------------------------------------------------------------------- Construct
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Episode outcome described as Alice : 5 points
string using parse 𝑓 (𝜌# ) Bob : 5 points
--------------------------------------------------------------------
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
objective (versatile behavior) ------------------------------------------------------------------- Construct
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Episode outcome described as Alice : 5 points
string using parse 𝑓 (𝜌# ) Bob : 5 points
--------------------------------------------------------------------
Question (𝜌$ ) Is Alice a versatile negotiator?
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
objective (versatile behavior) ------------------------------------------------------------------- Construct
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Episode outcome described as Alice : 5 points
string using parse 𝑓 (𝜌# ) Bob : 5 points
--------------------------------------------------------------------
Question (𝜌$ ) Is Alice a versatile negotiator?
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
Construct “No”
objective (versatile behavior) -------------------------------------------------------------------
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Episode outcome described as Alice : 5 points
string using parse 𝑓 (𝜌# ) Bob : 5 points
--------------------------------------------------------------------
Question (𝜌$ ) Is Alice a versatile negotiator?
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
Construct “No”
objective (versatile behavior) -------------------------------------------------------------------
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
(3)
Bob : propose: book=0 hat=1 ball=0 Convert to int “0”
Alice : propose: book=1 hat=1 ball=0 using parse 𝑔
and use as
Agreement!
Episode outcome described as Alice : 5 points
reward signal
string using parse 𝑓 (𝜌# ) Bob : 5 points
--------------------------------------------------------------------
Question (𝜌$ ) Is Alice a versatile negotiator?
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
Construct “No”
objective (versatile behavior) -------------------------------------------------------------------
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
(3)
Bob : propose: book=0 hat=1 ball=0 Convert to int “0”
Alice : propose: book=1 hat=1 ball=0 using parse 𝑔
and use as
Agreement!
Episode outcome described as Alice : 5 points
reward signal
string using parse 𝑓 (𝜌# ) Bob : 5 points
-------------------------------------------------------------------- (4) Update agent (Alice)
Question (𝜌$ ) Is Alice a versatile negotiator? weights and run an
episode
Prompt (𝜌)
Agreement!
Alice : 4 points
Example from user describing Bob : 5 points
Construct “No”
objective (versatile behavior) -------------------------------------------------------------------
(𝜌" ) Is Alice a versatile negotiator? prompt (𝜌)
Yes, because she suggested different proposals.
--------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
(3)
Bob : propose: book=0 hat=1 ball=0 Convert to int “0”
Alice : propose: book=1 hat=1 ball=0 (5) using parse 𝑔
Summarize and use as
Agreement!
Episode outcome described as Alice : 5 points episode outcome reward signal
string using parse 𝑓 (𝜌# ) Bob : 5 points as string (𝜌! )
-------------------------------------------------------------------- using parser 𝑓 (4) Update agent (Alice)
Question (𝜌$ ) Is Alice a versatile negotiator? weights and run an
episode
DEALORNODEAL Negotiation Task
…
end
Bob Alice
Labeling Accuracy
Versatile Push-Over Competitive Stubborn
RL Agent Accuracy
Versatile Push-Over Competitive Stubborn