
REINFORCEMENT LEARNING

UNIT - I

Basics of probability and linear algebra, Definition of a stochastic multi-armed bandit, Definition of regret, Achieving sublinear regret, UCB algorithm, KL-UCB, Thompson Sampling.

UNIT - II

Markov Decision Problem, policy, and value function, Reward models (infinite discounted, total, finite horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, and Value iteration & policy iteration.

UNIT - III

The Reinforcement Learning problem, prediction and control problems, Model-based algorithm, Monte Carlo methods for prediction, and Online implementation of Monte Carlo policy evaluation.

UNIT - IV

Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free control: Q-learning, Sarsa, Expected Sarsa.

UNIT - V

n-step returns; TD(λ) algorithm; Need for generalization in practice; Linear function approximation and geometric view; Linear TD(λ). Tile coding; Control with function approximation; Policy search; Policy gradient methods; Experience replay; Fitted Q Iteration; Case studies.

prepared by- P Srinivas Rao (ns lectures youtube channel)


KL-UCB Algorithm
KL: Kullback-Leibler

The KL-UCB algorithm is designed to solve multi-armed bandit problems, where an agent faces a set of actions (arms) and needs to choose (select) actions over time to maximize the cumulative reward. KL-UCB aims to balance exploration of uncertain actions (actions that are selected less frequently, or that give less reward) with exploitation of actions that seem promising (actions that are selected a greater number of times and that give more reward), based on the available information.

The algorithm uses the KL divergence to measure the uncertainty between the empirical (observed) distribution of rewards and the unknown true reward distribution for each action:

KL(observed reward of action a, actual (true) reward of action a)

It then adjusts the upper confidence bounds for actions based on this measure.
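Kullback-Leibler (KL) divergence (difference) is a mathematical measure that calculates the difference between two probability distributions.

Given two probability distributions P and Q over the same sample space (over the same experiment), the KL divergence from P to Q is defined as follows.

For discrete distributions:
$$D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$
where the sum runs over each x in the sample space X, and X = {x1, x2, ..., xn} is the set of outcomes.

For continuous distributions:
$$D_{KL}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)} \, dx$$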
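The discrete definition above can be checked numerically. The following is a minimal sketch, assuming both distributions are given as lists of probabilities over the same outcomes; the function name `kl_divergence` and the example distributions are illustrative and not part of the original notes.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q), measured in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same outcomes")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Illustrative distributions over two outcomes (assumed values).
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # identical distributions give zero divergence
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: KL is not symmetric
```

The last two lines print different values, which shows that the KL divergence is not a symmetric distance between P and Q.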


Pseudo code of KL-UCB algorithm

Initialize, for each arm (action) a in {a1, a2, a3, a4, ...}:
    t_a = 1                      (time step for arm a)
    x̄_a = Sample_reward(a)       (initial estimate of the mean reward)
    N_a = 1                      (number of times arm a has been pulled, i.e. selected or executed)

Repeat for each time step t:
    For each arm a:
        - calculate the KL divergence KL_a between the current empirical (observed) distribution and its updated version (after t_a steps)
        - calculate the upper confidence bound
              U_a = solve_KL(KL_a, log(t) / N_a)
    - choose the arm A_t with the highest upper confidence bound:
          A_t = argmax_a U_a
    - observe the reward R_t after selecting (performing or executing) action A_t
    - update the estimates for the selected arm a = A_t:
          t_a = t_a + 1
          N_a = N_a + 1
          x̄_a = ((N_a - 1) × x̄_a + R_t) / N_a
      where the x̄_a on the right-hand side is the previous mean reward over the earlier (N_a - 1) selections, R_t is the reward received at time t, and N_a is the new number of selections of arm a.
KL-UCB balances exploration and exploitation by adjusting the upper confidence bounds based on the KL divergence. The choice of the function solve_KL is important and depends on the specific properties of the RL problem and the reward distribution. KL-UCB is more advanced than UCB in handling uncertain reward distributions, but it has a higher computational cost because it has to solve for the KL divergences.
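The pseudo code above can be sketched in runnable form. This sketch assumes Bernoulli (0 or 1) rewards, and it plays the role of solve_KL with a bisection search for the largest value q such that N_a × KL(x̄_a, q) ≤ log(t), which is the usual form of the KL-UCB index. The names `bernoulli_kl`, `kl_ucb_index`, and `run_kl_ucb`, and the arm means in the example, are assumptions made for illustration and do not come from the notes.

```python
import math
import random

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=30):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= log(t), found by bisection."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid
    return lo

def run_kl_ucb(true_means, horizon, seed=0):
    """Simulate KL-UCB on Bernoulli arms; returns pull counts and estimated means."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(k), key=lambda a: kl_ucb_index(means[a], pulls[a], t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental form of ((N_a - 1) * mean + R_t) / N_a
        means[arm] += (reward - means[arm]) / pulls[arm]
    return pulls, means

pulls, means = run_kl_ucb([0.2, 0.5, 0.8], horizon=2000)
print(pulls)  # the arm with the highest true mean should receive most pulls
```

The bisection in `kl_ucb_index` is where the extra computational cost of KL-UCB comes from: plain UCB has a closed-form bonus, while this index needs an iterative search for every arm at every step.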
THOMPSON SAMPLING

The same action can give different rewards in different selections (executions) due to the changing nature of the environment.
Ex: Reward value distributions (samples) of two selected actions:
action a1: P(a1) = {5, 6, 4, 10, ..., 8, 12}
action a2: P(a2) = {20, 15, 8, 15, 2, 25}
In Thompson sampling, the agent maintains a probability distribution (usually a Bayesian posterior distribution) for each arm (action), representing its beliefs about the true reward distribution of that action. At each time step, the agent samples (draws some reward sample values) a reward estimate from each action's distribution and then selects the action with the highest sampled estimate value.
Thompson sampling is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief (a randomly drawn sample).
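The sample-then-select loop described above can be sketched as follows. This is a minimal illustration that assumes Bernoulli (0 or 1) rewards and a uniform Beta(1, 1) prior for every arm, so that each posterior stays a Beta distribution; the function name `run_thompson` and the arm means are hypothetical.

```python
import random

def run_thompson(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling; returns pull counts and posterior parameters."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # 1 + number of observed rewards equal to 1
    beta = [1.0] * k   # 1 + number of observed rewards equal to 0
    pulls = [0] * k
    for _ in range(horizon):
        # Draw one reward estimate from each arm's current belief.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act greedily with respect to the sampled estimates.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Bayesian update of the selected arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, alpha, beta

pulls, alpha, beta = run_thompson([0.2, 0.5, 0.8], horizon=2000)
print(pulls)  # the arm with the highest true mean should receive most pulls
```

Arms with few pulls keep wide posteriors, so their samples are occasionally the largest and they get explored; as evidence accumulates the posteriors narrow and the best arm is selected almost every time.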


Consider a set of contexts (states of the agent) X, a set of actions A, and rewards in R. The aim of the player is to play actions under the various contexts, such as to maximize the cumulative reward. In each round, the player obtains a context (state) x ∈ X, plays an action a ∈ A, and receives a reward r ∈ R following a distribution that depends on the context and the issued action.
The elements of Thompson sampling are as follows:
1) A likelihood function $P(r \mid \theta, a, x)$.
2) A set $\Theta$ of parameters $\theta$ of the distribution of $r$.
3) A prior distribution $P(\theta)$ on these parameters.
4) Past observations (triplets) $D = \{(x, a, r)\}$.
5) A posterior distribution $P(\theta \mid D) \propto P(D \mid \theta) P(\theta)$, where $P(D \mid \theta)$ is the likelihood function.

Thompson sampling consists in playing the action $a^* \in A$ according to the probability that it maximizes the expected reward. Action $a^*$ is chosen with probability
$$\int \mathbb{I}\left[ E(r \mid a^*, x, \theta) = \max_{a'} E(r \mid a', x, \theta) \right] P(\theta \mid D) \, d\theta,$$
where $\mathbb{I}$ is the indicator function.

The rule is implemented by sampling. In each round (selection or time step), parameters $\theta^*$ are sampled from the posterior $P(\theta \mid D)$, and an action $a^*$ is chosen that maximizes $E[r \mid \theta^*, a^*, x]$, that is, the expected reward given the sampled parameters ($\theta^*$), the action, and the current context $x$.
This means that the player (agent) instantiates their beliefs randomly in each round according to the posterior distribution and then acts optimally according to them.
Prior probability represents what is originally believed before new evidence is introduced.
P(A) = probability of event A occurring (prior probability)
P(B) = probability of event B occurring (evidence or marginal likelihood)
The posterior probability takes this new information into account.
P(A|B) = the probability of event A occurring provided the evidence B
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
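A small numeric check of the posterior formula may help. The sketch below assumes the evidence P(B) is expanded with the law of total probability, P(B) = P(B|A) P(A) + P(B|not A) P(not A); the function name `posterior` and the input probabilities are made-up values for illustration only.

```python
def posterior(prior_a, b_given_a, b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' rule."""
    # Evidence (marginal likelihood) P(B) by the law of total probability.
    evidence = b_given_a * prior_a + b_given_not_a * (1 - prior_a)
    return b_given_a * prior_a / evidence

# Assumed example values: P(A) = 0.01, P(B|A) = 0.9, P(B|not A) = 0.05.
print(posterior(0.01, 0.9, 0.05))
```

With these assumed values the posterior is about 0.154, far above the prior of 0.01 but still well below 0.9, because event A was unlikely to begin with.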