Tor Lattimore
DeepMind, London
Program
• There are exercises. If you have time, you will benefit from attempting them. I will update the slides with solutions at the end
[Figure: the agent-environment loop; the agent sends an Action, the environment returns a Reward/Observation]
• An MDP is a tuple M = (S , A , P , R )
• S is a finite or countable set of states
• A is a finite set of actions
• P is a probability kernel from S × A to S
• R is a probability kernel from S × A to [0, 1]
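A minimal sketch, not from the slides, of one convenient representation of such an MDP: P stored as an (S, A, S) array of transition kernels and r as an (S, A) array of mean rewards. The function name and layout are my own illustrative assumptions.

```python
# Hypothetical tabular representation of M = (S, A, P, R).
import numpy as np

def random_mdp(S, A, rng):
    P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
    r = rng.uniform(0, 1, size=(S, A))          # mean rewards in [0, 1]
    return P, r
```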
Notation
Finite horizon Learner starts in an initial state. Interacts with the MDP
for H rounds and is reset to the initial state
Discounted Learner interacts with the MDP without resets. Rewards are
geometrically discounted.
Average reward Learner interacts with the MDP without resets. We care
about some kind of average reward
[Figure: layered state space S_1, S_2, S_3, ..., S_H, S_{H+1}; an episode starts in s_1 and ends in s_{H+1}]
Finite horizon MDPs
• A stationary deterministic policy is a function π : S → A
• Policy and MDP induce a probability measure on
state/action/reward sequences
• The probability that π and the MDP produce interaction sequence
s1 , a1 , r1 , . . . , sH , aH , rH is
∏_{h=1}^H 1_{π(s_h)=a_h} R(r_h | s_h, a_h) P(s_{h+1} | s_h, a_h)
v^π(s_1) = E_π[ Σ_{h=1}^H r_h ]
v^π(s) = E_π[ Σ_{u=h}^H r_u | s_h = s ] for all s ∈ S_h
q-values
Value of policy π when starting in state s, taking action a and following π
subsequently
q^π(s, a) = r(s, a) + Σ_{s'∈S} P(s'|s, a) v^π(s')
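A short sketch of computing q^π and v^π by backward induction over the horizon, assuming the tabular (P, r) arrays from earlier; names are illustrative.

```python
# Evaluate a stationary deterministic policy pi over horizon H:
# q[h] holds q^pi at stage h, v holds the corresponding value-to-go.
import numpy as np

def evaluate_policy(P, r, pi, H):
    S, A = r.shape
    v = np.zeros(S)                    # no reward is collected after stage H
    q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        q[h] = r + P @ v               # q(s, a) = r(s, a) + sum_s' P(s'|s, a) v(s')
        v = q[h][np.arange(S), pi]     # v(s) = q(s, pi(s))
    return q, v                        # v[s] = v^pi(s) at the initial stage
```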
Bellman operator
Optimal policies
The optimal policy maximises v^π over all states
v⋆(s) = max_π v^π(s)
Proposition 1 There exists a stationary policy π⋆ such that
v^{π⋆}(s) = v⋆(s) for all s ∈ S
Optimal q-value is q⋆(s, a) = q^{π⋆}(s, a)
Bellman optimality operator
(T q)(s, a) = r(s, a) + Σ_{s'∈S} P(s'|s, a) max_{a'∈A} q(s', a')
(T v)(s) = max_{a∈A} [ r(s, a) + Σ_{s'∈S} P(s'|s, a) v(s') ]
[Whiteboard: backward induction, q_h(s, a) = r(s, a) + Σ_{s'} P(s'|s, a) max_{a'} q_{h+1}(s', a') = (T q_{h+1})(s, a)]
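A sketch of the recursion in the note above: apply the Bellman optimality operator backwards from the horizon, under the same tabular (P, r) assumptions as before.

```python
# Backward induction: q_h = T q_{h+1}, v_h(s) = max_a q_h(s, a).
import numpy as np

def optimal_q(P, r, H):
    S, A = r.shape
    v = np.zeros(S)
    q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        q[h] = r + P @ v               # (T q_{h+1})(s, a)
        v = q[h].max(axis=1)           # v_h(s) = max_a q_h(s, a)
    greedy = q.argmax(axis=2)          # greedy[h, s] is an optimal action at stage h
    return q, greedy
```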
Three Learning Models
Online RL the learner interacts with the environment as if it were in the
real world
Local planning the learner can ‘query’ the environment at any
state/action pair it has seen before
Generative model the learner can ‘query’ the environment at any
state/action pair
[Figure: a tree-structured MDP illustrating the three access models: Online, Local Access, and Generative model]
Learning with a Generative Model
s_1^k, a_1^k, r_1^k, s_2^k, a_2^k, ..., r_H^k, s_H^k
• a_t^k is the action played by the learner in state s_t^k
• s_t^k is sampled from P(· | s_{t−1}^k, a_{t−1}^k)
• r_t^k is sampled from R(s_t^k, a_t^k)
• Unknown MDP (S , A , P , r)
• Access to a generative model
P̂(s'|s, a) = (1/m) Σ_{t=1}^n 1_{(s,a,s')=(s_t,a_t,s'_t)}
r̂(s, a) = (1/m) Σ_{t=1}^n 1_{(s,a)=(s_t,a_t)} r_t
where m is the number of queries to each pair (s, a)
• Output the optimal policy π for the empirical MDP (S , A , Pˆ, r̂)
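A sketch of the plug-in approach: query each state-action pair m times, build (P̂, r̂), then plan in the empirical MDP. The `sample` interface is an assumption for illustration.

```python
# sample(s, a) -> (reward, next_state) is one call to the generative model.
import numpy as np

def empirical_mdp(sample, S, A, m):
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(m):
                rew, s_next = sample(s, a)
                P_hat[s, a, s_next] += 1.0 / m   # empirical transition frequencies
                r_hat[s, a] += rew / m           # empirical mean reward
    return P_hat, r_hat
```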
Necessary aside
Concentration
P( | (1/m) Σ_{i=1}^m X_i − µ | > cnst B √( log(1/δ) / m ) ) ≤ δ
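A quick numerical sanity check of the bound for Bernoulli samples (so B = 1), with the constant made explicit via Hoeffding's inequality; parameters are arbitrary.

```python
# Empirical failure rate of the deviation bound vs the nominal delta.
import numpy as np

rng = np.random.default_rng(0)
m, delta, trials, mu = 1000, 0.05, 10000, 0.3
means = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
bound = np.sqrt(np.log(2 / delta) / (2 * m))      # two-sided Hoeffding radius
print((np.abs(means - mu) > bound).mean(), "<=", delta)
```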
v^{π⋆}(s_1) − v^π(s_1) = (A) + (B) + (C)   (error decomposition), where
(A) = v^{π⋆}(s_1) − v̂^{π⋆}(s_1),  (B) = v̂^{π⋆}(s_1) − v̂^π(s_1),  (C) = v̂^π(s_1) − v^π(s_1)
v^π(s_1) − v̂^π(s_1) = E_π[ Σ_{h=1}^H (r − r̂)(s_h, a_h) + ⟨P(s_h, a_h) − P̂(s_h, a_h), v̂^π⟩ ]
Corollary 3 If
Then
⟨P̂(s, a) − P(s, a), v̂^π⟩ = Σ_{s'} ( P̂(s'|s, a) − P(s'|s, a) ) v̂^π(s')
= (1/m) Σ_{i=1}^m Σ_{s'} ( 1_{s'_i = s'} − P(s'|s, a) ) v̂^π(s')   (call the i-th summand Δ_i)
≤ cnst H √( log(1/δ) / m )   (whp)
since (Δ_i)_{i=1}^m are independent with |Δ_i| ≤ H and E[Δ_i] = 0
Dependence on |S |
Theorem 4 If
V_π[ Σ_{h=1}^H r_h ] = E_π[ Σ_{h=1}^H σ²_π(s_h, a_h) ]
Naive bounds
V_π[ Σ_{h=1}^H r_h ] ≤ H²   and   E_π[ Σ_{h=1}^H σ²_π(s_h, a_h) ] ≤ H³
Dependence on H
Repeating the previous analysis and assuming known rewards (again...)
v^π(s_1) − v̂^π(s_1) = E_π[ Σ_{h=1}^H ⟨P(s_h, a_h) − P̂(s_h, a_h), v̂^π⟩ ]
≲ E_π[ Σ_{h=1}^H √( σ²_π(s_h, a_h) log(|S||A|/δ) / m ) ]
≤ √( (H/m) E_π[ Σ_{h=1}^H σ²_π(s_h, a_h) ] log(|S||A|/δ) )
= √( (H/m) V_π[ Σ_{h=1}^H r_h ] log(|S||A|/δ) )
≤ √( H³|S||A| log(|S||A|/δ) / #queries )
Final result
• Online model
• Learner starts at initial state and interacts with the MDP for an
episode
• Cannot explore arbitrary states as usual
• Our learner will choose stationary policies π_1, ..., π_n over n episodes
• Assume the reward function is known in advance for simplicity
Regret
Reg_n = Σ_{t=1}^n ( v⋆(s_1) − v^{π_t}(s_1) )
Exploration/exploitation dilemma
Learner wants to play π_t with v^{π_t} close to v⋆ but needs to gain information as well
Optimism
Standard tool for acting in the face of uncertainty
since Lai [1987] and Auer et al. [2002]
Intuition
Act as if the world is as rewarding as plausibly possible
Mathematically
π_t = argmax_π max_{Q∈C_{t−1}} v_Q^π(s_1)
where C_{t−1} is a confidence set containing the true MDP with high probability, constructed using data from the first t − 1 episodes
Confidence set
C_t = { Q : ‖Q(s, a) − P̂_t(s, a)‖_1 ≤ cnst √( |S_s| log(n|S||A|) / (1 + N_t(s, a)) ) for all s, a }
where
• N_t(s, a) is the number of times the algorithm played action a in state s in the first t episodes
• |S_s| is the number of states in the layer after state s
E[Reg_n] = E[ Σ_{t=1}^n v⋆(s_1) − v^{π_t}(s_1) ]
≲ E[ Σ_{t=1}^n v_{Q_t}^{π_t}(s_1) − v^{π_t}(s_1) ]   (Optimism)
= E[ Σ_{t=1}^n Σ_{h=1}^H ⟨Q_t(s_h^t, a_h^t) − P(s_h^t, a_h^t), v_{Q_t}^{π_t}⟩ ]   (Lemma 1)
≤ E[ Σ_{t=1}^n Σ_{h=1}^H ‖Q_t(s_h^t, a_h^t) − P(s_h^t, a_h^t)‖_1 ‖v_{Q_t}^{π_t}‖_∞ ]   (Hölder's inequality)
≤ cnst H E[ Σ_{t=1}^n Σ_{h=1}^H √( |S_{h+1}| log(1/δ) / (1 + N_{t−1}(s_h^t, a_h^t)) ) ]   (def. of conf. set)
Analysis (cont)
E[Reg_n] ≤ cnst H E[ Σ_{t=1}^n Σ_{h=1}^H √( |S_{h+1}| log(1/δ) / (1 + N_{t−1}(s_h^t, a_h^t)) ) ]
= cnst H E[ Σ_{h=1}^H Σ_{s∈S_h} Σ_{a∈A} Σ_{t=1}^n 1_{s_h^t=s, a_h^t=a} √( |S_{h+1}| log(n|A||S|/δ) / (1 + N_{t−1}(s, a)) ) ]
= cnst H E[ Σ_{h=1}^H Σ_{s∈S_h} Σ_{a∈A} Σ_{u=0}^{N_{n−1}(s,a)} √( |S_{h+1}| log(n|A||S|/δ) / (1 + u) ) ]
≤ cnst H E[ Σ_{h=1}^H Σ_{s∈S_h} Σ_{a∈A} √( N_{n−1}(s, a) |S_{h+1}| log(n|S||A|/δ) ) ]
• Average regret
(1/n) E[Reg_n] ≤ cnst H √( |S||A| log(n|A||S|) / n ) ≤ ε
⟺ n ≥ H²|S||A| log(n|A||S|) / ε²
• H queries per episode
• Sample complexity with a generative model:
P( Σ_{t=1}^n 1( v⋆(s_1) − v^{π_t}(s_1) > ε ) > S(ε, δ) ) ≤ δ
π_t = argmin_π Regret(π)² / Inf. gain(π)
[Figure: example environment with a single informative action]
A note on conservative algorithms
Structure needs to be
- restrictive enough that learning is possible
- flexible enough that the true environment is
(approximately) in the class
Linear function approximation
Slides available at
https://tor-lattimore.com/downloads/RLTheory.pdf
Function approximation
[Figure: the same tree-structured MDP, again illustrating Online, Local Access, and Generative model access]
Linear function approximation
y_t = ⟨a_t, θ⋆⟩ + η_t
θ⋆ ∈ R^d is unknown, (η_t)_{t=1}^n is noise, and y_t is bounded in [−H, H]
Least squares
y_t = ⟨a_t, θ⋆⟩ + η_t
θ⋆ ∈ R^d is unknown, (η_t)_{t=1}^n is noise, and y_t is bounded in [−H, H]
• Estimate θ⋆ with least squares
θ̂ = argmin_θ Σ_{t=1}^n (⟨a_t, θ⟩ − y_t)² = G⁻¹ Σ_{t=1}^n a_t y_t
with G = Σ_{t=1}^n a_t a_tᵀ the design matrix
• Concentration: ‖θ̂ − θ⋆‖_G := ‖G^{1/2}(θ̂ − θ⋆)‖ ≲ H √( log(1/δ) + d ) =: β
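A sketch of the estimator: stack the rows a_t into a matrix and solve the normal equations. The small ridge term is my own addition for numerical stability, not part of the slides.

```python
import numpy as np

def least_squares(A_mat, y, reg=1e-8):
    """A_mat: (n, d) with rows a_t; y: (n,) responses.
    Returns theta_hat = G^{-1} sum_t a_t y_t, G = sum_t a_t a_t^T (+ reg * I)."""
    G = A_mat.T @ A_mat + reg * np.eye(A_mat.shape[1])
    return np.linalg.solve(G, A_mat.T @ y)
```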
[Whiteboard: the confidence ellipsoid {θ : ‖θ − θ̂‖_G ≤ β}; it is small in directions where many actions have been played]
Experimental design
The support of the Kiefer-Wolfowitz distribution may have size d(d + 1)/2
‖a‖²_{G(π)⁻¹} ≤ 2d
Assumption 1 For all π there exists a θ such that q^π(s, a) = ⟨θ, φ(s, a)⟩
[Whiteboard: layers S_h and S_{h+1}; π_{h+1} is near-optimal on later layers; estimate q^{π_{h+1}} on S_h and set π_h(s) = argmax_a q̂^{π_{h+1}}(s, a)]
Policy evaluation (rollouts)
• Given policy π and state-action pair (s, a), sample a rollout starting in state s ∈ S_h and taking action a and subsequently taking actions using π
• Collect cumulative rewards r_h, ..., r_H and q = Σ_{u=h}^H r_u
• Then E[q] = E[ Σ_{u=h}^H r_u ] = q^π(s, a)
• |q| = | Σ_{u=h}^H r_u | ≤ H
Policy evaluation (extrapolation)
• Perform m rollouts
• Start from state-action pairs (s, a) drawn in proportion to the optimal design ρ
• Collect the data: (s_1, a_1, q_1), ..., (s_m, a_m, q_m)
θ̂ = argmin_{θ∈R^d} (1/2) Σ_{t=1}^m (⟨θ, φ(s_t, a_t)⟩ − q_t)²,   q̂^π(s, a) = ⟨θ̂, φ(s, a)⟩
With
n ≥ cnst d²H³ log(1/δ) / ε²
queries to the generative model we have an estimator q̂^π of q^π such that ‖q̂^π − q^π‖_∞ ≤ ε, and
n = cnst d²H⁵ log(H/δ) / ε²
queries to find ‖q̂^{π_{h+1}} − q^{π_{h+1}}‖_∞ ≤ ε/H
– Update policy: π_h(s) = π_{h+1}(s) if s ∉ S_h, and π_h(s) = argmax_{a∈A} q̂^{π_{h+1}}(s, a) otherwise
v^{π_h}(s) ≥ v⋆(s) − 2ε(H + 1 − h)/H for all s ∈ ∪_{u≥h} S_u
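A sketch of the rollout-plus-extrapolation step under the stated assumptions. Here `phi`, `pairs`, `rho`, and `rollout` are assumed helpers (features, design support, design weights, and a routine returning the observed return from (s, a) under π), not objects defined on the slides.

```python
import numpy as np

def fit_q(phi, pairs, rho, rollout, m, rng):
    """Sample m start pairs from the design rho, roll out with pi, fit by LS."""
    idx = rng.choice(len(pairs), size=m, p=rho)
    X = np.array([phi(*pairs[i]) for i in idx])       # (m, d) feature matrix
    q = np.array([rollout(*pairs[i]) for i in idx])   # (m,) observed returns
    theta = np.linalg.solve(X.T @ X + 1e-8 * np.eye(X.shape[1]), X.T @ q)
    return lambda s, a: phi(s, a) @ theta             # extrapolated q-hat(s, a)
```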
Analysis
For s ∈ S_h
π_h(s) = argmax_{a∈A} q̂^{π_{h+1}}(s, a)
Hence
q^{π_{h+1}}(s, π_h(s)) ≥ q̂^{π_{h+1}}(s, π_h(s)) − ε/H   (concentration)
= max_a q̂^{π_{h+1}}(s, a) − ε/H   (def of π_h)
≥ max_a q^{π_{h+1}}(s, a) − 2ε/H   (concentration)
= max_a [ r(s, a) + Σ_{s'∈S_{h+1}} P(s'|s, a) v^{π_{h+1}}(s') ] − 2ε/H
≥ max_a [ r(s, a) + Σ_{s'∈S_{h+1}} P(s'|s, a) v⋆(s') ] − 2ε/H − 2ε(H − h)/H   (induction hypothesis)
= v⋆(s) − 2ε(H + 1 − h)/H
Misspecification
[Whiteboard: for a, b drawn uniformly from the unit sphere in R^d, E[⟨a, b⟩²] = 1/d, so typically |⟨a, b⟩| ≈ 1/√d]
Johnson-Lindenstrauss Lemma
Lemma 11 There exists a set A ⊂ R^d of size k such that
1. ‖a‖ = 1 for all a ∈ A
2. |⟨a, b⟩| ≤ √( 8 log(k)/(d − 1) ) =: γ for all a, b ∈ A with a ≠ b
Alternative
Assumption 3 There is a given function φ : S → R^d such that for all π there exists a θ such that v^π(s) = ⟨φ(s), θ⟩
Linear MDPs
(S, A, P, r) with feature map φ : S × A → R^d is linear if
• There exists a θ such that r(s, a) = ⟨φ(s, a), θ⟩
• There exists a signed measure µ : S → R^d such that P(s'|s, a) = ⟨φ(s, a), µ(s')⟩
Learning µ is hopeless
Why are linear MDPs learnable?
D = (s_u, a_u, r_u, s'_u)_{u=1}^m
Let v : S → [0, H]
Want to estimate from data:
r(s, a) + Σ_{s'∈S} P(s'|s, a) v(s') = ⟨φ(s, a), θ⟩ + φ(s, a)ᵀ Σ_{s'∈S} µ(s') v(s') =: ⟨φ(s, a), w_v⟩
ŵ_v = argmin_{w∈R^d} Σ_{u=1}^m (⟨φ(s_u, a_u), w⟩ − r_u − v(s'_u))² = G⁻¹ Σ_{u=1}^m φ(s_u, a_u)[ r_u + v(s'_u) ]
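A sketch of this regression, assuming the features of the observed transitions are stacked into a matrix; a small ridge term is added for stability, which the slides do not use.

```python
import numpy as np

def fit_w(Phi, rewards, v_next, reg=1e-8):
    """Phi: (m, d) rows phi(s_u, a_u); rewards: (m,); v_next: (m,) values v(s'_u).
    Returns w_hat = G^{-1} sum_u phi_u [r_u + v(s'_u)]."""
    G = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(G, Phi.T @ (rewards + v_next))
```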
q̃_h(s, a) = φ(s, a)ᵀ G⁻¹ Σ_{u=1}^m φ(s_u, a_u)[ r_u + ṽ_{h+1}(s'_u) ] + β‖φ(s, a)‖_{G⁻¹}   (the last term is the bonus)
ṽ_{h+1}(s) = max_a q̃_{h+1}(s, a)
V = { s ↦ max_a [ ⟨φ(s, a), w⟩ + √( φ(s, a)ᵀ W φ(s, a) ) ] : w ∈ R^d, W ∈ R^{d×d} }
q̃_h(s, a) = φ(s, a)ᵀ G⁻¹ Σ_{u=1}^m φ(s_u, a_u)[ r_u + ṽ_{h+1}(s'_u) ] + β‖φ(s, a)‖_{G⁻¹}
= φ(s, a)ᵀ ŵ_{ṽ_{h+1}} + β‖φ(s, a)‖_{G⁻¹}
≥ φ(s, a)ᵀ w_{ṽ_{h+1}}   (concentration)
= r(s, a) + Σ_{s'∈S} P(s'|s, a) ṽ_{h+1}(s')
≥ r(s, a) + Σ_{s'∈S} P(s'|s, a) v⋆(s') = q⋆(s, a)   (induction: ṽ_{h+1} ≥ v⋆)
Bellman operator on q-values
• Let q : S × A → R
• Abbreviate v(s) = max_{a∈A} q(s, a)
• Define T : R^{S×A} → R^{S×A} by
(T q)(s, a) = r(s, a) + Σ_{s'∈S} P(s'|s, a) v(s')
v(s_1) − v^π(s_1) = E_π[ Σ_{h=1}^H (q − T q)(s_h, a_h) ]
Proof
Reg_n = E[ Σ_{t=1}^n v⋆(s_1) − v^{π_t}(s_1) ]
≲ E[ Σ_{t=1}^n ṽ_t(s_1) − v^{π_t}(s_1) ]   (Optimism)
= E[ Σ_{t=1}^n Σ_{h=1}^H (q̃_t − T q̃_t)(s_h^t, a_h^t) ]   (Prop 3)
Bellman error
Remember
q̃_t(s, a) = φ(s, a)ᵀ G_{t−1}⁻¹ Σ_{u=1}^m φ(s_u, a_u)[ r_u + ṽ_t(s'_u) ] + β‖φ(s, a)‖_{G_{t−1}⁻¹}
(T q̃_t)(s, a) = r(s, a) + Σ_{s'∈S} P(s'|s, a) ṽ_t(s')
Concentration
Reg_n ≤ E[ Σ_{t=1}^n Σ_{h=1}^H (q̃_t − T q̃_t)(s_h^t, a_h^t) ]
≤ 2β E[ Σ_{t=1}^n Σ_{h=1}^H ‖φ(s_h^t, a_h^t)‖_{G_{t−1}⁻¹} ]
≤ 2β √( nH E[ Σ_{t=1}^n Σ_{h=1}^H ‖φ(s_h^t, a_h^t)‖²_{G_{t−1}⁻¹} ] )
Back to the regret
Reg_n ≤ E[ Σ_{t=1}^n Σ_{h=1}^H (q̃_t − T q̃_t)(s_h^t, a_h^t) ]
≤ 2β E[ Σ_{t=1}^n Σ_{h=1}^H ‖φ(s_h^t, a_h^t)‖_{G_{t−1}⁻¹} ]
≤ 2β √( nH E[ Σ_{t=1}^n Σ_{h=1}^H ‖φ(s_h^t, a_h^t)‖²_{G_{t−1}⁻¹} ] ) ≲ √( d³H⁴ n log(n) )
where
Σ_{t=1}^n Σ_{h=1}^H ‖φ(s_h^t, a_h^t)‖²_{G_{t−1}⁻¹} ≲ dH log(n)
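A small simulation, illustrative only, of the elliptical potential bound used in the last line, with random unit-norm features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000
G = np.eye(d)                              # regularised design matrix G_0 = I
total = 0.0
for t in range(n):
    a = rng.normal(size=d)
    a /= np.linalg.norm(a)                 # ||a_t|| = 1
    total += a @ np.linalg.solve(G, a)     # ||a_t||^2 in the G_{t-1}^{-1} norm
    G += np.outer(a, a)
print(total, "vs", 2 * d * np.log(1 + n / d))   # grows like d log(n)
```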
Misspecification
Slides available at
https://tor-lattimore.com/downloads/RLTheory.pdf
Nonlinear function approximation
Remember bandits
Reg_n = max_{a⋆∈A} E[ Σ_{t=1}^n f(a⋆) − f(a_t) ]
Eluder dimension (intuition)
• We are going to play optimistically
• Somehow construct confidence set F_t based on data collected
• Optimistic value function is f_t = argmax_{f∈F_{t−1}} max_{a∈A} f(a)
• Play a_t = argmax_{a∈A} f_t(a)
• Regret
Reg_n = E[ Σ_{t=1}^n f(a⋆) − f(a_t) ]   (regret def)
≲ E[ Σ_{t=1}^n (f_t(a_t) − f(a_t)) ]   (optimism principle)
Eluder dimension measures how often f_t(a_t) − f(a_t) can be large
Confidence bounds for LSE
f_t = argmin_{f∈F} Σ_{s=1}^t (r_s − f(a_s))²
0 ≥ Σ_{s=1}^t (r_s − f_t(a_s))² − Σ_{s=1}^t (r_s − f⋆(a_s))²
= Σ_{s=1}^t (f⋆(a_s) + η_s − f_t(a_s))² − Σ_{s=1}^t η_s²
= Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))²   [want this small]   + 2 Σ_{s=1}^t η_s (f⋆(a_s) − f_t(a_s))   [noise]
CLT things
Last slide:
Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))² ≤ 2 Σ_{s=1}^t η_s (f⋆(a_s) − f_t(a_s))   [sum of zero-mean random variables]
2 Σ_{s=1}^t V_{s−1}[ η_s (f⋆(a_s) − f(a_s)) ] = 2 Σ_{s=1}^t (f⋆(a_s) − f(a_s))²
Martingale CLT:
2 Σ_{s=1}^t η_s (f⋆(a_s) − f(a_s)) ≲ √( Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))² log(1/δ) )   (whp)
Confidence bound
Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))² ≤ 2 Σ_{s=1}^t η_s (f⋆(a_s) − f_t(a_s)) ≲ √( Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))² log(|F|/δ) )   (whp)
Rearranging:
‖f⋆ − f_t‖²_t := Σ_{s=1}^t (f⋆(a_s) − f_t(a_s))² ≲ log(|F|/δ)   (whp)
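A sketch for a finite class: represent F as a (|F|, |A|) table of function values, fit by least squares, and keep the functions within squared empirical distance β² of the fit. The table representation is an illustrative assumption.

```python
import numpy as np

def lse_confidence_set(F, actions, rewards, delta):
    """F: (k, num_actions) table; actions: (t,) played arms; rewards: (t,)."""
    losses = ((rewards - F[:, actions]) ** 2).sum(axis=1)     # sum_s (r_s - f(a_s))^2
    f_t = F[losses.argmin()]                                  # least-squares fit
    dist = ((F[:, actions] - f_t[actions]) ** 2).sum(axis=1)  # ||f - f_t||_t^2
    beta_sq = np.log(len(F) / delta)                          # cnst omitted
    return f_t, F[dist <= beta_sq]
```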
Eluder dimension
∀f, g ∈ F with Σ_{t=1}^n (f(a_t) − g(a_t))² ≤ ε²:   f(a) − g(a) ≤ ε
[Whiteboard: proof sketch; whenever f, g ∈ F_t with f(a_t) − g(a_t) > ε, add a_t to a bucket of ε-independent elements; with ⌈4β²/ε²⌉ buckets each a_t gets added somewhere, and each bucket holds at most dim_E(F, ε) items]
Eluder dimension
F_t = F_{t−1} ∩ { f ∈ F : ‖f − f_t‖²_t ≤ β² }
Σ_{t=1}^n w_{t−1}(a_t) ≤ nε + cnst √( n β² dim_E(F, ε) )
Eluder dimension for bandits
r_t = f(a_t) + η_t
• Regret is Reg_n = max_{a∈A} E[ Σ_{t=1}^n f(a) − f(a_t) ]
• Given a confidence set F_t after round t, the algorithm plays a_t = argmax_{a∈A} max_{g∈F_{t−1}} g(a)
Theorem 16 For EluderUCB: Reg_n ≲ √( nβ² dim_E(F, 1/n) )
Proof Let g_{t−1} = argmax_{g∈F_{t−1}} max_{a∈A} g(a)
Reg_n = E[ Σ_{t=1}^n f(a⋆) − f(a_t) ]
≲ E[ Σ_{t=1}^n ( g_{t−1}(a_t) − f(a_t) ) ]   (Optimism)
≤ E[ Σ_{t=1}^n w_{t−1}(a_t) ]
≲ √( n dim_E(F, 1/n) log(|F|n) )   (Corollary 15)
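A sketch of the optimistic loop for a finite class, combining the confidence set above with the optimistic action choice; all names are illustrative and `f_star_idx` indexes the true function.

```python
import numpy as np

def eluder_ucb(F, f_star_idx, n, beta_sq, noise, rng):
    """F: (k, num_actions) table of candidate functions."""
    active = np.ones(len(F), dtype=bool)              # confidence set F_t
    hist_a, hist_r = [], []
    for t in range(n):
        a_t = int(F[active].max(axis=0).argmax())     # argmax_a max_{f in F_t} f(a)
        r_t = F[f_star_idx, a_t] + noise * rng.normal()
        hist_a.append(a_t); hist_r.append(r_t)
        A, R = np.array(hist_a), np.array(hist_r)
        f_hat = F[((R - F[:, A]) ** 2).sum(axis=1).argmin()]          # LSE
        active &= ((F[:, A] - f_hat[A]) ** 2).sum(axis=1) <= beta_sq  # shrink F_t
    return hist_a
```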
Bounds on the Eluder dimension
Proposition 4 dim_E(F, ε) ≤ |A| for all ε > 0 and all F
Proof Suppose that a ∈ {a_1, ..., a_n}. We claim that a is ε-dependent on {a_1, ..., a_n}
Def. of independence Given an ε > 0 and sequence a_1, ..., a_n in A, we say that a ∈ A is ε-dependent with respect to (a_t)_{t=1}^n if
∀f, g ∈ F with Σ_{t=1}^n (f(a_t) − g(a_t))² ≤ ε²:   f(a) − g(a) ≤ ε
Since a ∈ {a_1, ..., a_n},
[ Σ_{t=1}^n (f(a_t) − g(a_t))² ≤ ε² ] ⟹ [ f(a) − g(a) ≤ ε ]
(the term (f(a) − g(a))² is one of the summands)
Proposition 5 When A ⊂ R^d and F = { f : f(a) = ⟨a, θ⟩, θ ∈ R^d }, then dim_E(F, ε) = O(d log(1/ε)) for all ε > 0
Proof
⟨f_t − g_t, a_t⟩ > ε   and   Σ_{s=1}^{t−1} ⟨f_t − g_t, a_s⟩² ≤ ε²
Hence, with G_t = ε²I + Σ_{s=1}^t a_s a_sᵀ,
ε² (1 − ‖a_t‖²_{G_t⁻¹}) ≤ ⟨f_t − g_t, a_t⟩² (1 − ‖a_t‖²_{G_t⁻¹})
Summing: dim · ε² ≤ 3ε² Σ_{t=1}^{dim} ‖a_t‖²_{G_t⁻¹} ≤ 3dε² log( 1 + dim/(dε²) )   [Abbasi-Yadkori et al., 2011]
Consequences
|F | = ∞ in both cases???
Covering
[Whiteboard: covering; if ‖f − g‖_∞ ≤ 1/n then Reg_n(f) ≈ Reg_n(g), so replace F by a finite set G with ∀f ∈ F ∃g ∈ G : ‖f − g‖_∞ ≤ 1/n; the size of G is the covering number of F]
Consequences
• Online RL setting
• Nonlinear function approximation
• Subsumes many existing frameworks on function approximation
• Statistically efficient
• Not computationally efficient
Assumptions
Assumption 4 (Realisability) q⋆ ∈ F
• Let f : S × A → R
• Abbreviate f(s) = max_{a∈A} f(s, a)
• Define T : R^{S×A} → R^{S×A} by
(T f)(s, a) = r(s, a) + Σ_{s'∈S} P(s'|s, a) f(s')
f(s_1) − v^π(s_1) = E_π[ Σ_{h=1}^H (f − T f)(s_h, a_h) ]
GOLF Algorithm Intuition
Need a way to eliminate functions f in the confidence set for which the
Bellman error is large
If T f − f is large:
Σ_{(s,a,r,s')∈D} ( f(s, a) − r − f(s') )² ≫ Σ_{(s,a,r,s')∈D} ( (T f)(s, a) − r − f(s') )²,   with T f ∈ F
If T f = f:
Σ_{(s,a,r,s')∈D} ( f(s, a) − r − f(s') )² ≲ min_{g∈F} Σ_{(s,a,r,s')∈D} ( g(s, a) − r − f(s') )² + β²   (whp)
GOLF Algorithm
1. Set D = ∅ and C_0 = F
2. For episode t = 1, 2, ...
3. Choose policy π_t = π_{f_t} where f_t = argmax_{f∈C_{t−1}} max_{a∈A} f(s_1, a), run π_t and add the observed transitions to D
4. Update C_t = { f ∈ F : L_D(f) ≤ inf_{g∈F} L_D(g, f) + β² }
where L_D(g, f) = Σ_{(s,a,r,s')∈D} ( g(s, a) − r − f(s') )²
Concentration analysis
Given D ⊂ S × A × [0, 1] × S collected by some policy and
C_D = { f ∈ F : L_D(f) ≤ inf_{g∈F} L_D(g, f) + β² },   β² = cnst log(|F|/δ)
where L_D(g, f) = Σ_{(s,a,r,s')∈D} [ g(s, a) − r − f(s') ]² and L_D(f) = L_D(f, f)
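A direct transcription of this definition for a finite class of q-tables (an illustrative assumption): keep every f whose own squared Bellman-residual loss is within β² of the best one-step fit.

```python
import numpy as np

def L(g, f, data):
    """L_D(g, f) = sum over D of (g(s, a) - r - max_a' f(s', a'))^2."""
    return sum((g[s, a] - r - f[s2].max()) ** 2 for (s, a, r, s2) in data)

def golf_confidence_set(F, data, beta_sq):
    """F: list of (S, A) q-tables; data: list of (s, a, r, s') tuples."""
    keep = []
    for f in F:
        best = min(L(g, f, data) for g in F)   # inf_g L_D(g, f)
        if L(f, f, data) <= best + beta_sq:    # L_D(f) <= inf_g L_D(g, f) + beta^2
            keep.append(f)
    return keep
```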
L_D(f) − L_D(g, f) = Σ_{i=1}^m [ f(s_i, a_i) − r_i − f(s'_i) ]² − [ g(s_i, a_i) − r_i − f(s'_i) ]²   (call the i-th term X_i)
For f ∈ C_t:
L_D(f) − L_D(T f, f) ≥ Σ_{(s,a,r,s')∈D} ( (f − T f)(s, a) )² − √( Σ_{(s,a,r,s')∈D} ( (f − T f)(s, a) )² log(1/δ) ) − log(1/δ)
so L_{D_t}(f) − L_{D_t}(T f, f) > β² if
Σ_{u=1}^t Σ_{h=1}^H E_{π_u}[ ( (f − T f)(s_h^u, a_h^u) )² ] ≥ cnst β²
Regret analysis
1. By optimism
Reg_n = Σ_{t=1}^n v⋆(s_1) − v^{π_t}(s_1)
≤ Σ_{t=1}^n f_t(s_1, π_{f_t}(s_1)) − v^{π_t}(s_1)   (Optimism, whp)
= Σ_{t=1}^n Σ_{h=1}^H E_{π_t}[ (f_t − T f_t)(s_h, a_h) ]   (Prop. 3)
2. Since f_t ∈ C_{t−1}
Σ_{u=1}^{t−1} Σ_{h=1}^H E_{π_u}[ ( (f_t − T f_t)(s_h^u, a_h^u) )² ] ≤ cnst β²
Bellman Eluder dimension
Reg_n ≤ Σ_{t=1}^n Σ_{h=1}^H E_{π_t}[ (f_t − T f_t)(s_h, a_h) ]
For all t
Σ_{u=1}^{t−1} Σ_{h=1}^H E_{π_u}[ ( (f_t − T f_t)(s_h^u, a_h^u) )² ] ≤ cnst β²
Theorem 17 Reg_n = Õ( H √( dim_BE(F, ε) n β² ) )
• Linear MDPs
• Eluder dimension [Osband and Van Roy, 2014]
• Bellman rank [Jiang et al., 2017]
• Witness rank [Sun et al., 2019]
• Bilinear rank [Du et al., 2021]
Negative results
• Last lecture we showed that you can learn with a generative model when for all π there exists a θ such that q^π(s, a) = ⟨φ(s, a), θ⟩
• Linear features φ : S → R^d
poly( (dH/ε)^{|A|} )
Other (not covered) topics