
Chapter 14
Reinforcement Learning

EEE 485/585 Statistical Learning and Data Analytics

Outline:
  Reinforcement Learning
  Markov Decision Process
  Value Iteration
  Q Learning

Cem Tekin
Bilkent University

Cannot be distributed outside this class without the permission of the instructor.

Reinforcement learning (RL)

How should an agent interact with its environment in order to maximize its cumulative reward?

Example: Robot in a gridworld


S_0: initial state (position)

In each round t:
  Take action A_t in state S_t (move 1 step left, right, up or down)
  Observe the next state S_{t+1}
  Collect reward R_{t+1} (-100 if bomb hit, 1 if power found, 100 if end reached, -1 otherwise)

Figure by Akshay Lambda from
https://medium.com/free-code-camp/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Reinforcement learning (RL)

Goal
Given discount rate 0 ≤ γ ≤ 1, select actions to maximize the total return

  G_1 = R_1 + γ R_2 + γ² R_3 + ... = Σ_{k=0}^{∞} γ^k R_{k+1}

The discount rate represents how much the agent cares about immediate rewards vs. future rewards.

Policy π (method to select actions)
  History H_t = {S_0, A_0, R_1, ..., S_{t-1}, A_{t-1}, R_t, S_t, A_t}: everything that happened by the end of round t
  A policy π maps past information to distributions over actions: A_t is sampled from π(·|H_{t-1}, R_t, S_t)
  π is deterministic if it puts all its probability on a single action

What is a good model of the environment?

Markov property:

  Pr(R_{t+1} = r, S_{t+1} = s' | H_t) = Pr(R_{t+1} = r, S_{t+1} = s' | S_t, A_t)

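A quick numerical sanity check of the total-return formula; this is a minimal sketch, and the reward sequence below is made up for illustration, using the gridworld reward values given earlier.

  # Minimal sketch: discounted total return for a finite, made-up reward sequence.
  # rewards[k] plays the role of R_{k+1}, so the sum below is G_1 = sum_k gamma^k R_{k+1}.

  def discounted_return(rewards, gamma):
      return sum(gamma ** k * r for k, r in enumerate(rewards))

  # Example: collect a power-up (+1), take two ordinary steps (-1 each),
  # then reach the end (+100), with gamma = 0.9:
  print(discounted_return([1, -1, -1, 100], gamma=0.9))   # 1 - 0.9 - 0.81 + 72.9 = 72.19
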
General RL model


Some real-world applications
  Autonomous driving
  Personalized medicine
  Web advertising
  News, video, movie recommendation

Figure 3.1 from "Reinforcement Learning: An Introduction" by Sutton and Barto

Markov Decision Process (MDP)

A mathematical framework for modeling the interaction between agent and environment under the Markov assumption.

Finite set of states: S
Finite set of actions: A
State transition probabilities:

  p(s'|s, a) := Pr(S_{t+1} = s' | S_t = s, A_t = a)

Expected reward:

  r(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']

Recycling robot example


S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
r_search > r_wait [expected number of cans collected by the robot]
State transitions are random

Figure 3.3 from "Reinforcement Learning: An Introduction" by Sutton and Barto

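To make the MDP ingredients concrete, here is a minimal sketch of how the recycling robot could be encoded in Python, with the state set, per-state action sets, p(s'|s, a), and r(s, a, s') stored as dictionaries. The numerical probabilities and reward values are illustrative placeholders only (they are not the entries of Table 3.1); the same layout is reused by the value-iteration sketches later in the chapter.

  # Minimal sketch of the recycling-robot MDP as plain Python dictionaries.
  # All numbers below are illustrative placeholders, not the values of Table 3.1.

  states = ["high", "low"]
  actions = {"high": ["search", "wait"],
             "low": ["search", "wait", "recharge"]}

  R_SEARCH, R_WAIT = 2.0, 1.0        # assumed values satisfying r_search > r_wait

  # p[(s, a)] maps each next state s' to the probability p(s'|s, a)
  p = {
      ("high", "search"):  {"high": 0.7, "low": 0.3},
      ("high", "wait"):    {"high": 1.0},
      ("low", "search"):   {"low": 0.6, "high": 0.4},   # battery may run out; robot is rescued
      ("low", "wait"):     {"low": 1.0},
      ("low", "recharge"): {"high": 1.0},
  }

  # r[(s, a, s')] = expected reward r(s, a, s') for the transition
  r = {
      ("high", "search", "high"): R_SEARCH,
      ("high", "search", "low"):  R_SEARCH,
      ("high", "wait", "high"):   R_WAIT,
      ("low", "search", "low"):   R_SEARCH,
      ("low", "search", "high"):  -3.0,                 # assumed penalty when rescued
      ("low", "wait", "low"):     R_WAIT,
      ("low", "recharge", "high"): 0.0,
  }
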
Recycling robot example


Table 3.1 from "Reinforcement Learning: An Introduction" by Sutton and Barto

Markov policies and the value function

General policy
  A_t sampled from π(·|H_{t-1}, R_t, S_t)
Stationary Markov policy
  A_t sampled from π(·|S_t)

Total return after time t:

  G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

State-value function for π:

  v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]

Action-value (Q) function for π:

  q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]

Optimal policy

π* is optimal iff v_{π*}(s) ≥ v_π(s) for all s ∈ S and all policies π.

Theorem [Puterman, 1994]
For an infinite-horizon discounted MDP there exists a deterministic stationary Markov policy that is optimal.

Optimal state-value function: v*(s) = max_π v_π(s)
Optimal action-value (Q) function: q*(s, a) = max_π q_π(s, a)

Bellman optimality equations:

  v*(s) = max_{a ∈ A(s)} E[R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a]
          (the quantity inside the max is q*(s, a))

  q*(s, a) = E[R_{t+1} + γ max_{a'} q*(S_{t+1}, a') | S_t = s, A_t = a]
          (the inner max equals v*(S_{t+1}))

Optimal policy:

  π*(s) = arg max_a q*(s, a) for all states s

Computing the optimal policy (when state transition probabilities are known)

Value Iteration
(1) Start with an initial guess of the value functions v_0(s), s ∈ S (e.g., set to zero)
(2) Compute the new value functions (at iteration k+1) by updating the value functions found at iteration k:

  v_{k+1}(s) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
             = max_a Σ_{s'} p(s'|s, a) [r(s, a, s') + γ v_k(s')]

(3) Repeat the above procedure until convergence, i.e., until ||v_{k*} − v_{k*−1}|| ≤ ε
(4) The final policy is

  π(s) = arg max_a Σ_{s'} p(s'|s, a) [r(s, a, s') + γ v_{k*}(s')]

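As a concrete illustration of steps (1)-(4), here is a minimal Python sketch of value iteration. It assumes the dictionary-based MDP layout (states, actions, p, r) from the recycling-robot sketch above; that representation is an illustration, not something defined on the slides.

  # Minimal value-iteration sketch, assuming the dictionary-based MDP
  # representation (states, actions, p, r) from the recycling-robot sketch.

  def value_iteration(states, actions, p, r, gamma=0.9, eps=1e-6):
      v = {s: 0.0 for s in states}                     # (1) initial guess v_0(s) = 0
      while True:
          # (2) Bellman backup: v_{k+1}(s) = max_a sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')]
          v_new = {
              s: max(
                  sum(prob * (r[(s, a, s2)] + gamma * v[s2])
                      for s2, prob in p[(s, a)].items())
                  for a in actions[s]
              )
              for s in states
          }
          delta = max(abs(v_new[s] - v[s]) for s in states)
          v = v_new
          if delta <= eps:                             # (3) stop at convergence
              break
      # (4) greedy policy with respect to the converged value function
      policy = {
          s: max(actions[s],
                 key=lambda a: sum(prob * (r[(s, a, s2)] + gamma * v[s2])
                                   for s2, prob in p[(s, a)].items()))
          for s in states
      }
      return v, policy

Calling value_iteration(states, actions, p, r) on the placeholder recycling-robot dictionaries returns the converged value function and the corresponding greedy deterministic policy.
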
Computing the optimal policy (when state transition probabilities are known)

Value Iteration (with Q function)
(1) Start with an initial guess of the Q functions q_0(s, a), s ∈ S, a ∈ A (e.g., set to zero)
(2) Compute the new Q functions (at iteration k+1) by updating the Q functions found at iteration k:

  q_{k+1}(s, a) = Σ_{s'} p(s'|s, a) [r(s, a, s') + γ max_{a'} q_k(s', a')]

(3) Repeat the above procedure until convergence
(4) We have v_{k*}(s) = max_a q_{k*}(s, a)
(5) π(s) = arg max_a q_{k*}(s, a)

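The same sketch adapted to the Q-function form of the update, again assuming the dictionary-based p and r from the recycling-robot sketch.

  # Minimal Q-function value-iteration sketch, assuming the same dictionary-based
  # MDP representation (states, actions, p, r) as above.

  def q_value_iteration(states, actions, p, r, gamma=0.9, eps=1e-6):
      q = {(s, a): 0.0 for s in states for a in actions[s]}        # (1) q_0 = 0
      while True:
          # (2) q_{k+1}(s,a) = sum_s' p(s'|s,a) [r(s,a,s') + gamma max_a' q_k(s',a')]
          q_new = {
              (s, a): sum(
                  prob * (r[(s, a, s2)] + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                  for s2, prob in p[(s, a)].items()
              )
              for s in states for a in actions[s]
          }
          delta = max(abs(q_new[k] - q[k]) for k in q)
          q = q_new
          if delta <= eps:                                         # (3) convergence
              break
      v = {s: max(q[(s, a)] for a in actions[s]) for s in states}  # (4) v(s) = max_a q(s, a)
      policy = {s: max(actions[s], key=lambda a: q[(s, a)]) for s in states}  # (5)
      return q, v, policy
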
Robot grid-world example for value iteration

https://youtu.be/gThGerajccM

Goal location: high reward
Freespace: small penalty
Obstacles: very large penalty

Types of robots:
  Deterministic: always moves in the direction of the dictated action
  Stochastic: can also move in other directions with a positive probability

Learning the optimal policy (when state transition probabilities are unknown)

Estimate q*(s, a) in a data-driven manner. Recall that

  q*(s, a) = E[R_{t+1} + γ max_{a'} q*(S_{t+1}, a') | S_t = s, A_t = a]
             (the inner max equals v*(S_{t+1}))

Q learning
  Keep a table of Q-value estimates: Q(s, a) for s ∈ S, a ∈ A
  In round t: observe S_t → take action A_t (how to choose it is discussed on the next slide) → observe (S_{t+1}, R_{t+1})
  Form the sample estimate:

    Q̂(S_t, A_t) = R_{t+1} + γ max_{a'} Q(S_{t+1}, a')

  Update the Q-value of (S_t, A_t):

    Q(S_t, A_t) ← (1 − α) Q(S_t, A_t) + α Q̂(S_t, A_t),   where α is the learning rate

Convergence: if all (s, a) pairs are selected infinitely many times, then Q(s, a) → q*(s, a) with probability 1.

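A minimal sketch of the tabular update above. The environment interface (env.reset(), env.step(a) returning the next state, reward, and a done flag) and the actions(s) helper are hypothetical names introduced for illustration; they are not defined on the slides.

  import random
  from collections import defaultdict

  # Minimal tabular Q-learning sketch with an assumed, hypothetical env interface.

  def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
      Q = defaultdict(float)                              # table of Q(s, a), initialized to 0
      for _ in range(episodes):
          s, done = env.reset(), False
          while not done:
              # epsilon-greedy choice of A_t (see the options on the next slide)
              if random.random() < epsilon:
                  a = random.choice(actions(s))
              else:
                  a = max(actions(s), key=lambda act: Q[(s, act)])
              s_next, reward, done = env.step(a)
              # sample estimate Q_hat(S_t, A_t) = R_{t+1} + gamma * max_a' Q(S_{t+1}, a')
              # (no bootstrap term once a terminal state is reached)
              target = reward if done else reward + gamma * max(
                  Q[(s_next, act)] for act in actions(s_next))
              # Q(S_t, A_t) <- (1 - alpha) Q(S_t, A_t) + alpha Q_hat(S_t, A_t)
              Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
              s = s_next
      return Q
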
How to choose A_t given S_t?

Option 1: Greedy

  A_t = arg max_a Q(S_t, a)

  Always exploits, never explores. Might get stuck in a suboptimal policy.

Option 2: ε-greedy
  Toss a coin C_t with Pr(C_t = H) = ε
  If C_t = H, then sample A_t uniformly at random from the action set (explore)
  If C_t = T, then A_t = arg max_a Q(S_t, a) (exploit)

Option 3: Boltzmann exploration

  A_t ∼ Pr(·|S_t) such that Pr(A_t = a | S_t) = e^{Q(S_t, a)} / Σ_{a'} e^{Q(S_t, a')}

  Explores implicitly.

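A minimal sketch of Options 2 and 3 as standalone functions over a Q-table stored as a dict Q[(s, a)]; this table layout is an assumption carried over from the earlier sketches.

  import math
  import random

  # Epsilon-greedy and Boltzmann action selection over an assumed Q-table dict.

  def epsilon_greedy(Q, s, actions, epsilon=0.1):
      if random.random() < epsilon:                    # coin comes up heads: explore
          return random.choice(actions)
      return max(actions, key=lambda a: Q[(s, a)])     # otherwise exploit (greedy)

  def boltzmann(Q, s, actions):
      weights = [math.exp(Q[(s, a)]) for a in actions]   # Pr(a|s) proportional to e^{Q(s,a)}
      return random.choices(actions, weights=weights)[0] # sampling explores implicitly

In practice a temperature parameter is often added to the Boltzmann weights, but the plain form above matches the expression on the slide.
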
Deep Q Network Learning to Play Atari Game

https://youtu.be/cjpEIotvwFY

