
Principles of Robot Autonomy II

Learning from Diverse Sources of Data (2)


Today’s itinerary
• Recap of imitation learning, inverse RL, learning from suboptimal demos

• Learning from preference queries

• Learning from physical feedback

• Learning from play

• Intro to human-robot interaction


Learn from Different Sources of Data

• Expert demonstrations
• Suboptimal demonstrations, play, observations
• Language instructions, narrations
• Pairwise comparisons, rankings, ordinal data
• Physical corrections
Leverage comparisons as useful observations about the desired robot reward function.

Which grasping trajectory do you prefer: ξ_A or ξ_B?

Asking "ξ_A or ξ_B?" is really asking "R_A or R_B?", where R = w · φ.

Preferring ξ_A means w · φ_A > w · φ_B; preferring ξ_B means w · φ_A < w · φ_B.

With φ = φ_A − φ_B, the answer to "R_A or R_B?" tells us whether w · φ > 0 or w · φ < 0.
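To make this concrete, here is a minimal numpy sketch (with made-up feature values) of how a single comparison constrains the reward weights:

```python
import numpy as np

# Hypothetical feature vectors for the two trajectories being compared.
phi_A = np.array([0.8, 0.1, 0.5])   # phi(xi_A)
phi_B = np.array([0.3, 0.6, 0.4])   # phi(xi_B)
phi = phi_A - phi_B                 # query direction phi = phi_A - phi_B

w = np.array([1.0, -0.5, 0.2])      # one hypothesis for the reward weights

# If the human prefers xi_A, any consistent hypothesis must satisfy w . phi > 0.
print(np.dot(w, phi) > 0)
```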
[Figure: candidate reward weights w_1, w_2, w_3 in weight space. Each query φ corresponds to a separating hyperplane {w : w · φ = 0}, and the human's answer indicates which side of that hyperplane the true w lies on. Human answers can be noisy!]

f_φ(w) = p(I_φ | w) = 1 / (1 + exp(−I_φ (w · φ)))

where I_φ ∈ {+1, −1} is the human's (possibly noisy) answer to query φ.

Restricted Boltzmann machines modeling human choice. T. Osogami, M. Otsuka. NeurIPS’14.


Two candidate models of the noisy human response:

f_φ¹(w) = p(I_φ | w) = 1 / (1 + exp(−I_φ (w · φ)))

f_φ²(w) = min(1, exp(I_φ (w · φ)))
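A minimal sketch (assuming a sampled prior over unit-norm weight vectors and the logistic response model above) of how one noisy answer updates the robot's belief over w:

```python
import numpy as np

rng = np.random.default_rng(0)

def response_likelihood(W, phi, I):
    """p(I | w) = 1 / (1 + exp(-I * w . phi)) for each sampled w (rows of W)."""
    return 1.0 / (1.0 + np.exp(-I * (W @ phi)))

# Prior: reward-weight hypotheses sampled uniformly on the unit sphere.
W = rng.normal(size=(10_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# One query phi = phi_A - phi_B and the human's answer I = +1 ("I prefer xi_A").
phi = np.array([0.5, -0.5, 0.1])
I = +1

# Reweight the samples by the likelihood of the observed answer (Bayes rule).
weights = response_likelihood(W, phi, I)
weights /= weights.sum()

w_hat = weights @ W   # posterior-mean estimate of w after one noisy query
print(w_hat)
```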


[Figure: the choice of φ determines how informative the query is (in expectation); each query corresponds to a separating hyperplane {w : w · φ = 0} that cuts the hypothesis space of reward weights.]

Queries should be actively synthesized.
Actively synthesizing queries

Update function: f_φ(w) = min(1, exp(I_φ (w · φ)))

Choose the query that maximizes the minimum expected volume removed from the hypothesis space:

max_φ  min{ 𝔼[1 − f_φ(w)], 𝔼[1 − f_−φ(w)] }

subject to φ ∈ 𝔽, where 𝔽 = {φ : φ = Φ(ξ_A) − Φ(ξ_B), ξ_A, ξ_B ∈ Ξ}

Active Preference-Based Learning of Reward Functions. Sadigh, et al. RSS 2017
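A small sketch of this max-min query selection over a finite set of candidate feature differences (random stand-ins here, rather than an optimization over the full trajectory set Ξ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Current belief: samples over w (a placeholder uniform prior on the sphere).
W = rng.normal(size=(5_000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Candidate queries phi = Phi(xi_A) - Phi(xi_B); random placeholders rather
# than differences computed from an actual trajectory set Xi.
candidates = rng.normal(size=(200, 4))

def f(phi, I):
    """Update function f_phi(w) = min(1, exp(I * w . phi)), per sample of w."""
    return np.minimum(1.0, np.exp(I * (W @ phi)))

def min_expected_volume_removed(phi):
    # Expected volume removed if the human answers "A" (I=+1) or "B" (I=-1);
    # the query is scored by the worse of the two outcomes.
    return min(np.mean(1.0 - f(phi, +1)), np.mean(1.0 - f(phi, -1)))

scores = [min_expected_volume_removed(phi) for phi in candidates]
best_query = candidates[int(np.argmax(scores))]
```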


[Videos: no prior preference → learns heading preferences → learns collision-avoidance preferences.]

[Videos: no prior preference → learns to prefer the green basket over the red one.
Features: max altitude, final distance to the closest basket, max horizontal range, total angular displacement.]
[Bıyık, Sadigh. CoRL 2018]
R(ξ) = θ · φ(ξ),  with R(ξ_A) = θ · φ(ξ_A) and R(ξ_A) > R(ξ_B).

Designing features is hard. How do we map a trajectory to features — hand-crafted feature generation, or deep learning?
[Sadigh'17] [Basu'18] [Levine'10] [Biyik'18,'19] [Cobo'11] [Katz'19] [Singliar'12] [Palan'19] [Choi'13] [Wilde'20] [Arenz'14]

Example hand-designed trajectory features: shot speed, shot angle.

Alternative: model the reward directly as a function of the trajectory, R(ξ_A) = f(ξ_A), still learned from comparisons R(ξ_A) > R(ξ_B).
[Bıyık, Huynh, Kochenderfer, Sadigh. RSS 2020]
Training: an optimal query (ξ_A vs ξ_B) with a GP reward.

The GP reward enables the exploration of different trajectories (not just the boundaries).
Online: final policies — learned policy based on the learned reward (linear reward vs GP reward).


Linear models, w · φ(s, a):          Neural models, r_ψ(s, a):
− Inexpressive                       − Thousands of queries
+ Feedback efficient                 + Highly expressive
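To make the trade-off concrete, a minimal PyTorch sketch of the two reward parameterizations (all dimensions are arbitrary placeholders):

```python
import torch.nn as nn

feature_dim, state_dim, action_dim = 4, 17, 6   # placeholder dimensions

# Linear reward r(s, a) = w . phi(s, a): few parameters, feedback-efficient,
# but only as expressive as the hand-designed features phi.
linear_reward = nn.Linear(feature_dim, 1, bias=False)

# Neural reward r_psi(s, a) acting on raw state-action inputs: highly
# expressive, but typically needs thousands of preference queries to train.
neural_reward = nn.Sequential(
    nn.Linear(state_dim + action_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
```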
Few-Shot Preference Learning for Human-in-the-Loop RL

Pre-training: on prior tasks (Door Open, Window Open, Button Press), a reward model r_ψ is pre-trained from human-labeled comparisons of trajectory segments σ_1, σ_2.

Online adaptation: on a new task (Drawer Open), the policy π(a|s) collects transitions (s, a, s′) into a replay buffer; new segment comparisons σ_1, σ_2 are queried to quickly adapt the reward estimate r̂_ψ(s, a), which in turn provides the reward signal for reinforcement learning on the new task.

Joey Hejna, Dorsa Sadigh. CoRL'22.
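Reward models in this style of pipeline are typically trained with a Bradley-Terry preference loss over predicted segment returns; a minimal sketch with illustrative network sizes and synthetic data (not the paper's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reward model r_psi over (state, action) pairs; sizes are illustrative only.
reward_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def preference_loss(seg_1, seg_2, label):
    """Bradley-Terry loss.

    seg_i: (batch, horizon, 10) tensors of concatenated states and actions.
    label: 0 if the human preferred sigma_1, 1 if they preferred sigma_2.
    """
    ret_1 = reward_net(seg_1).sum(dim=1).squeeze(-1)   # predicted return of sigma_1
    ret_2 = reward_net(seg_2).sum(dim=1).squeeze(-1)   # predicted return of sigma_2
    logits = torch.stack([ret_1, ret_2], dim=1)
    return F.cross_entropy(logits, label)

# One gradient step on a synthetic batch of labeled segment pairs.
seg_1, seg_2 = torch.randn(32, 50, 10), torch.randn(32, 50, 10)
label = torch.randint(0, 2, (32,))
loss = preference_loss(seg_1, seg_2, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```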
Actively synthesizing queries

• Going beyond linear reward functions and using Gaussian Processes?


• Queries that are easy to answer for the human? (CoRL’19)
max I(response; ω) = max [ H(response) − H(response | ω) ]

where H(response) captures the robot's uncertainty and H(response | ω) the human's uncertainty (query: ξ_A vs ξ_B).
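A minimal sketch of this information-gain criterion, estimated from posterior samples over ω under the logistic response model (all quantities below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior samples over omega (the reward weights); placeholders here.
W = rng.normal(size=(5_000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def information_gain(phi):
    # Probability (per sample of omega) that the human answers "xi_A".
    p = 1.0 / (1.0 + np.exp(-(W @ phi)))
    H_response = binary_entropy(p.mean())        # robot's uncertainty
    H_given_omega = binary_entropy(p).mean()     # expected human uncertainty
    return H_response - H_given_omega            # I(response; omega)

candidates = rng.normal(size=(200, 4))           # candidate queries phi
scores = [information_gain(phi) for phi in candidates]
best_query = candidates[int(np.argmax(scores))]
```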
Preferences: easier and more accurate to use – but each query gives only one bit of information.

Demonstrations: rich and informative – but noisy and inaccurate.
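One natural way to get the best of both (in the spirit of the DemPref comparison shown next) is to let demonstrations shape a prior over w and preference queries refine it. A rough sketch, assuming a Boltzmann demonstration model and ignoring its normalization over trajectories; all feature values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypotheses over reward weights (uniform on the unit sphere).
W = rng.normal(size=(5_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Demonstration: rich but noisy evidence, modeled (very roughly) as
# p(demo | w) proportional to exp(beta * w . phi_demo).
phi_demo = np.array([0.9, 0.2, 0.4])   # features of the demonstrated trajectory
beta = 5.0                             # assumed demonstrator rationality
belief = np.exp(beta * (W @ phi_demo))
belief /= belief.sum()

# Preference query: one clean bit, phi = phi_A - phi_B with answer "A".
phi_query = np.array([0.5, -0.3, 0.1])
belief *= 1.0 / (1.0 + np.exp(-(W @ phi_query)))
belief /= belief.sum()

w_hat = belief @ W   # estimate refined by demo-based prior + one preference
```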
Actively synthesizing queries
[Video: Inverse Reinforcement Learning policy vs DemPref policy.]
Nonlinear Rewards for Exoskeletons

ROIAL: Region of Interest Active Learning for Characterizing Exoskeleton Gait Preference Landscapes
K. Li, et al. ICRA’21.
Learn Human Preferences

Ask informative pairwise comparisons


Negotiation Domain

Lewis, Mike, et al. "Deal or no deal? End-to-end learning for negotiation dialogues."

What is a fair, polite, human-like negotiation?

Targeted Data Acquisition for Evolving Negotiation Agents


Kwon, Karamcheti, Cuéllar, Sadigh
ICML 2021
Learn Human Preferences

Ask informative pairwise comparisons


Learn Human Preferences

Query LLMs to capture preferences

We use LLMs as a proxy reward function to train RL agents from user inputs.
Prompt (ρ)

Task description (ρ_1):
Alice and Bob are negotiating how to split a set of books, hats, and balls.

Example from the user describing the objective, e.g. versatile behavior (ρ_2):
------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob   : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=0 ball=1
Agreement!
Alice : 4 points
Bob   : 5 points
------------------------------------------------------------------
Is Alice a versatile negotiator?
Yes, because she suggested different proposals.

Episode outcome described as a string using parser f (ρ_3):
------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob   : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Alice : 5 points
Bob   : 5 points
------------------------------------------------------------------

Question (ρ_4): Is Alice a versatile negotiator?

Training loop:
(1) Construct the prompt ρ and feed it to the LLM.
(2) The LLM provides a textual output, e.g. "No".
(3) Convert the output to an int, e.g. "0", using parser g, and use it as the reward signal.
(4) Update the agent's (Alice's) weights and run an episode.
(5) Summarize the episode outcome as a string (ρ_3) using parser f, and repeat.
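A minimal sketch of this loop in code. The prompt pieces and the `query_llm` callable are placeholders for whatever LLM interface is available, and the parsers are simplified illustrations of f and g rather than the exact ones used in the paper:

```python
# Placeholder prompt pieces following the slide: rho_1 (task description),
# rho_2 (user example of the objective), rho_4 (question); rho_3 is the
# episode outcome produced by parser f each iteration.
RHO_1 = "Alice and Bob are negotiating how to split a set of books, hats, and balls."
RHO_2 = ("Is Alice a versatile negotiator?\n"
         "Yes, because she suggested different proposals.")
RHO_4 = "Is Alice a versatile negotiator?"

def build_prompt(rho_3: str) -> str:
    """Construct the full prompt rho from its four pieces."""
    return "\n\n".join([RHO_1, RHO_2, rho_3, RHO_4])

def parser_g(llm_output: str) -> int:
    """Simplified parser g: map the LLM's textual answer to a 0/1 reward."""
    return 1 if llm_output.strip().lower().startswith("yes") else 0

def proxy_reward(rho_3: str, query_llm) -> int:
    """(1) feed the prompt to the LLM, (2) get text, (3) convert to an int reward."""
    answer = query_llm(build_prompt(rho_3))
    return parser_g(answer)

# Steps (4) and (5) -- updating Alice's weights with this reward and summarizing
# the next episode with parser f -- depend on the RL training code and are
# omitted here.
```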
DEALORNODEAL Negotiation Task

[Figure: Alice and Bob negotiating; example actions include propose(0 books, 2 hats, 1 ball) and end.]
[Figure: Labeling accuracy on the Versatile, Push-Over, Competitive, and Stubborn objectives, comparing SL vs. Ours.]

[Figure: RL agent accuracy on the same objectives, comparing SL, Ours, and the True Reward.]

We outperform SL by an average of 46%.

We underperform the True Reward by an average of 4%.

We can use an LLM as a proxy reward to train objective-aligned agents


Key Takeaways

We can learn representations (reward functions) by:
1) pre-training and actively querying for informative human feedback, and
2) leveraging the knowledge of large language models.
