
Principles of Robot Autonomy II

Learning from Diverse Sources of Data (2)


Today’s itinerary
• Recap of imitation learning, inverse RL, learning from suboptimal demos

• Learning from preference queries

• Learning from physical feedback

• Learning from play

• Intro to human-robot interaction


Learn from Different Sources of Data

• Expert demonstrations
• Suboptimal demonstrations, play, observations
• Language instructions, narrations
• Pairwise comparisons, rankings, ordinal data
• Physical corrections
Leverage comparisons as useful observations about the desired robot reward function.

Which grasping trajectory do you prefer: ξ_A or ξ_B?

Asking "ξ_A or ξ_B?" is really asking "R_A or R_B?", where R = w · φ.

Preferring ξ_A means w · φ_A > w · φ_B; preferring ξ_B means w · φ_A < w · φ_B.

With φ = φ_A − φ_B, the answer to "R_A or R_B?" tells us whether w · φ > 0 or w · φ < 0.
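To make this concrete, here is a minimal numpy sketch (with made-up feature values) of how a single comparison constrains the reward weights:

```python
import numpy as np

# Hypothetical feature vectors for the two trajectories being compared.
phi_A = np.array([0.8, 0.1, 0.5])   # phi(xi_A)
phi_B = np.array([0.3, 0.6, 0.4])   # phi(xi_B)
phi = phi_A - phi_B                 # query direction phi = phi_A - phi_B

w = np.array([1.0, -0.5, 0.2])      # one hypothesis for the reward weights

# If the human prefers xi_A, any consistent hypothesis must satisfy w . phi > 0.
print(np.dot(w, phi) > 0)
```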
[Figure: candidate reward weights w_1, w_2, w_3 in weight space. Each query φ corresponds to a separating hyperplane {w : w · φ = 0}, and the human's answer indicates which side of that hyperplane the true w lies on. Human answers can be noisy!]

f_φ(w) = p(I_φ | w) = 1 / (1 + exp(−I_φ (w · φ)))

where I_φ ∈ {+1, −1} is the human's (possibly noisy) answer to query φ.

Restricted Boltzmann machines modeling human choice. T. Osogami, M. Otsuka. NeurIPS’14.


Two candidate models of the noisy human response:

f_φ¹(w) = p(I_φ | w) = 1 / (1 + exp(−I_φ (w · φ)))

f_φ²(w) = min(1, exp(I_φ (w · φ)))
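A minimal sketch (assuming a sampled prior over unit-norm weight vectors and the logistic response model above) of how one noisy answer updates the robot's belief over w:

```python
import numpy as np

rng = np.random.default_rng(0)

def response_likelihood(W, phi, I):
    """p(I | w) = 1 / (1 + exp(-I * w . phi)) for each sampled w (rows of W)."""
    return 1.0 / (1.0 + np.exp(-I * (W @ phi)))

# Prior: reward-weight hypotheses sampled uniformly on the unit sphere.
W = rng.normal(size=(10_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# One query phi = phi_A - phi_B and the human's answer I = +1 ("I prefer xi_A").
phi = np.array([0.5, -0.5, 0.1])
I = +1

# Reweight the samples by the likelihood of the observed answer (Bayes rule).
weights = response_likelihood(W, phi, I)
weights /= weights.sum()

w_hat = weights @ W   # posterior-mean estimate of w after one noisy query
print(w_hat)
```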


[Figure: the choice of φ determines how informative the query is (in expectation); each query corresponds to a separating hyperplane {w : w · φ = 0} that cuts the hypothesis space of reward weights.]

Queries should be actively synthesized.
Actively synthesizing queries

Update function: f_φ(w) = min(1, exp(I_φ (w · φ)))

Choose the query that maximizes the minimum expected volume removed from the hypothesis space:

max_φ  min{ 𝔼[1 − f_φ(w)], 𝔼[1 − f_−φ(w)] }

subject to φ ∈ 𝔽, where 𝔽 = {φ : φ = Φ(ξ_A) − Φ(ξ_B), ξ_A, ξ_B ∈ Ξ}

Active Preference-Based Learning of Reward Functions. Sadigh, et al. RSS 2017
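A small sketch of this max-min query selection over a finite set of candidate feature differences (random stand-ins here, rather than an optimization over the full trajectory set Ξ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Current belief: samples over w (a placeholder uniform prior on the sphere).
W = rng.normal(size=(5_000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Candidate queries phi = Phi(xi_A) - Phi(xi_B); random placeholders rather
# than differences computed from an actual trajectory set Xi.
candidates = rng.normal(size=(200, 4))

def f(phi, I):
    """Update function f_phi(w) = min(1, exp(I * w . phi)), per sample of w."""
    return np.minimum(1.0, np.exp(I * (W @ phi)))

def min_expected_volume_removed(phi):
    # Expected volume removed if the human answers "A" (I=+1) or "B" (I=-1);
    # the query is scored by the worse of the two outcomes.
    return min(np.mean(1.0 - f(phi, +1)), np.mean(1.0 - f(phi, -1)))

scores = [min_expected_volume_removed(phi) for phi in candidates]
best_query = candidates[int(np.argmax(scores))]
```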


[Videos: no prior preference → learns heading preferences → learns collision-avoidance preferences.]

[Videos: no prior preference → learns to prefer the green basket over the red one.
Features: max altitude, final distance to the closest basket, max horizontal range, total angular displacement.]
[Bıyık, Sadigh. CoRL 2018]
R(ξ) = θ · φ(ξ),  with R(ξ_A) = θ · φ(ξ_A) and R(ξ_A) > R(ξ_B).

Designing features is hard. How do we map a trajectory to features — hand-crafted feature generation, or deep learning?
[Sadigh'17] [Basu'18] [Levine'10] [Biyik'18,'19] [Cobo'11] [Katz'19] [Singliar'12] [Palan'19] [Choi'13] [Wilde'20] [Arenz'14]

Example hand-designed trajectory features: shot speed, shot angle.

Alternative: model the reward directly as a function of the trajectory, R(ξ_A) = f(ξ_A), still learned from comparisons R(ξ_A) > R(ξ_B).
[Bıyık, Huynh, Kochenderfer, Sadigh. RSS 2020]
Training: an optimal query (ξ_A vs ξ_B) with a GP reward.

The GP reward enables the exploration of different trajectories (not just the boundaries).
Online: final policies — learned policy based on the learned reward (linear reward vs GP reward).


Linear models, w · φ(s, a):          Neural models, r_ψ(s, a):
− Inexpressive                       − Thousands of queries
+ Feedback efficient                 + Highly expressive
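To make the trade-off concrete, a minimal PyTorch sketch of the two reward parameterizations (all dimensions are arbitrary placeholders):

```python
import torch.nn as nn

feature_dim, state_dim, action_dim = 4, 17, 6   # placeholder dimensions

# Linear reward r(s, a) = w . phi(s, a): few parameters, feedback-efficient,
# but only as expressive as the hand-designed features phi.
linear_reward = nn.Linear(feature_dim, 1, bias=False)

# Neural reward r_psi(s, a) acting on raw state-action inputs: highly
# expressive, but typically needs thousands of preference queries to train.
neural_reward = nn.Sequential(
    nn.Linear(state_dim + action_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
```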
Few-Shot Preference Learning for Human-in-the-Loop RL

Pre-training: on prior tasks (Door Open, Window Open, Button Press), a reward model r_ψ is pre-trained from human-labeled comparisons of trajectory segments σ_1, σ_2.

Online adaptation: on a new task (Drawer Open), the policy π(a|s) collects transitions (s, a, s′) into a replay buffer; new segment comparisons σ_1, σ_2 are queried to quickly adapt the reward estimate r̂_ψ(s, a), which in turn provides the reward signal for reinforcement learning on the new task.

Joey Hejna, Dorsa Sadigh. CoRL'22.
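Reward models in this style of pipeline are typically trained with a Bradley-Terry preference loss over predicted segment returns; a minimal sketch with illustrative network sizes and synthetic data (not the paper's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reward model r_psi over (state, action) pairs; sizes are illustrative only.
reward_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def preference_loss(seg_1, seg_2, label):
    """Bradley-Terry loss.

    seg_i: (batch, horizon, 10) tensors of concatenated states and actions.
    label: 0 if the human preferred sigma_1, 1 if they preferred sigma_2.
    """
    ret_1 = reward_net(seg_1).sum(dim=1).squeeze(-1)   # predicted return of sigma_1
    ret_2 = reward_net(seg_2).sum(dim=1).squeeze(-1)   # predicted return of sigma_2
    logits = torch.stack([ret_1, ret_2], dim=1)
    return F.cross_entropy(logits, label)

# One gradient step on a synthetic batch of labeled segment pairs.
seg_1, seg_2 = torch.randn(32, 50, 10), torch.randn(32, 50, 10)
label = torch.randint(0, 2, (32,))
loss = preference_loss(seg_1, seg_2, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```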
Actively synthesizing queries

• Going beyond linear reward functions and using Gaussian Processes?


• Queries that are easy to answer for the human? (CoRL’19)
max I(response; ω) = max [ H(response) − H(response | ω) ]

where H(response) captures the robot's uncertainty and H(response | ω) the human's uncertainty (query: ξ_A vs ξ_B).
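A minimal sketch of this information-gain criterion, estimated from posterior samples over ω under the logistic response model (all quantities below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Posterior samples over omega (the reward weights); placeholders here.
W = rng.normal(size=(5_000, 4))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def information_gain(phi):
    # Probability (per sample of omega) that the human answers "xi_A".
    p = 1.0 / (1.0 + np.exp(-(W @ phi)))
    H_response = binary_entropy(p.mean())        # robot's uncertainty
    H_given_omega = binary_entropy(p).mean()     # expected human uncertainty
    return H_response - H_given_omega            # I(response; omega)

candidates = rng.normal(size=(200, 4))           # candidate queries phi
scores = [information_gain(phi) for phi in candidates]
best_query = candidates[int(np.argmax(scores))]
```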
Preferences: easier and more accurate to use – but each query gives only one bit of information.

Demonstrations: rich and informative – but noisy and inaccurate.
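One natural way to get the best of both (in the spirit of the DemPref comparison shown next) is to let demonstrations shape a prior over w and preference queries refine it. A rough sketch, assuming a Boltzmann demonstration model and ignoring its normalization over trajectories; all feature values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypotheses over reward weights (uniform on the unit sphere).
W = rng.normal(size=(5_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Demonstration: rich but noisy evidence, modeled (very roughly) as
# p(demo | w) proportional to exp(beta * w . phi_demo).
phi_demo = np.array([0.9, 0.2, 0.4])   # features of the demonstrated trajectory
beta = 5.0                             # assumed demonstrator rationality
belief = np.exp(beta * (W @ phi_demo))
belief /= belief.sum()

# Preference query: one clean bit, phi = phi_A - phi_B with answer "A".
phi_query = np.array([0.5, -0.3, 0.1])
belief *= 1.0 / (1.0 + np.exp(-(W @ phi_query)))
belief /= belief.sum()

w_hat = belief @ W   # estimate refined by demo-based prior + one preference
```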
Actively synthesizing queries
[Video: Inverse Reinforcement Learning policy vs DemPref policy.]
Nonlinear Rewards for Exoskeletons

ROIAL: Region of Interest Active Learning for Characterizing Exoskeleton Gait Preference Landscapes
K. Li, et al. ICRA’21.
Learn Human Preferences

Ask informative pairwise comparisons


Negotiation Domain

Lewis, Mike, et al. "Deal or no deal? End-to-end learning for negotiation dialogues."

What is a fair, polite, human-like negotiation?

Targeted Data Acquisition for Evolving Negotiation Agents


Kwon, Karamcheti, Cuéllar, Sadigh
ICML 2021
Learn Human Preferences

Ask informative pairwise comparisons


Learn Human Preferences

Query LLMs to capture preferences

We use LLMs as a proxy reward function to train RL agents from user inputs.
Prompt (ρ)

Task description (ρ_1):
Alice and Bob are negotiating how to split a set of books, hats, and balls.

Example from the user describing the objective, e.g. versatile behavior (ρ_2):
------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob   : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=0 ball=1
Agreement!
Alice : 4 points
Bob   : 5 points
------------------------------------------------------------------
Is Alice a versatile negotiator?
Yes, because she suggested different proposals.

Episode outcome described as a string using parser f (ρ_3):
------------------------------------------------------------------
Alice : propose: book=1 hat=1 ball=0
Bob   : propose: book=0 hat=1 ball=0
Alice : propose: book=1 hat=1 ball=0
Agreement!
Alice : 5 points
Bob   : 5 points
------------------------------------------------------------------

Question (ρ_4): Is Alice a versatile negotiator?

Training loop:
(1) Construct the prompt ρ and feed it to the LLM.
(2) The LLM provides a textual output, e.g. "No".
(3) Convert the output to an int, e.g. "0", using parser g, and use it as the reward signal.
(4) Update the agent's (Alice's) weights and run an episode.
(5) Summarize the episode outcome as a string (ρ_3) using parser f, and repeat.
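A minimal sketch of this loop in code. The prompt pieces and the `query_llm` callable are placeholders for whatever LLM interface is available, and the parsers are simplified illustrations of f and g rather than the exact ones used in the paper:

```python
# Placeholder prompt pieces following the slide: rho_1 (task description),
# rho_2 (user example of the objective), rho_4 (question); rho_3 is the
# episode outcome produced by parser f each iteration.
RHO_1 = "Alice and Bob are negotiating how to split a set of books, hats, and balls."
RHO_2 = ("Is Alice a versatile negotiator?\n"
         "Yes, because she suggested different proposals.")
RHO_4 = "Is Alice a versatile negotiator?"

def build_prompt(rho_3: str) -> str:
    """Construct the full prompt rho from its four pieces."""
    return "\n\n".join([RHO_1, RHO_2, rho_3, RHO_4])

def parser_g(llm_output: str) -> int:
    """Simplified parser g: map the LLM's textual answer to a 0/1 reward."""
    return 1 if llm_output.strip().lower().startswith("yes") else 0

def proxy_reward(rho_3: str, query_llm) -> int:
    """(1) feed the prompt to the LLM, (2) get text, (3) convert to an int reward."""
    answer = query_llm(build_prompt(rho_3))
    return parser_g(answer)

# Steps (4) and (5) -- updating Alice's weights with this reward and summarizing
# the next episode with parser f -- depend on the RL training code and are
# omitted here.
```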
DEALORNODEAL Negotiation Task

[Figure: Alice and Bob negotiating; example actions include propose(0 books, 2 hats, 1 ball) and end.]
[Figure: Labeling accuracy on the Versatile, Push-Over, Competitive, and Stubborn objectives, comparing SL vs. Ours.]

[Figure: RL agent accuracy on the same objectives, comparing SL, Ours, and the True Reward.]

We outperform SL by an average of 46%.

We underperform the True Reward by an average of 4%.

We can use an LLM as a proxy reward to train objective-aligned agents


Key Takeaways

We can learn representations (reward functions) by:
1) pre-training and actively querying for informative human feedback, and
2) leveraging the knowledge of large language models.
