EEE 485/585 Statistical Learning and Data Analytics

Chapter 14
Reinforcement Learning

Cem Tekin
Bilkent University

Outline: Reinforcement Learning, Markov Decision Process, Value Iteration, Q Learning
Goal

Given discount rate 0 ≤ γ ≤ 1, select actions to maximize the total return

    G_1 = R_1 + γ R_2 + γ² R_3 + … = Σ_{k=0}^∞ γ^k R_{k+1}

The discount rate represents how much the agent cares about immediate rewards vs. future rewards.

Policy π: the method used to select actions.
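To see the effect of γ numerically, here is a minimal sketch computing the discounted return of a finite reward sequence; the reward values and γ below are illustrative, not from the slides:

```python
def discounted_return(rewards, gamma):
    """Total return G_1 = sum_{k>=0} gamma^k * R_{k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Illustrative values: with gamma = 0.9, later rewards count for less.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```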
[Figure 3.1 from "Reinforcement Learning: An Introduction" by Sutton and Barto: the agent-environment interaction loop]
Markov Decision Process (MDP)

Finite set of states: S
Finite set of actions: A
State transition probabilities: p(s′|s, a)
Expected reward: r(s, a, s′) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s′]
Recycling robot example

S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
r_search > r_wait [expected number of cans collected by the robot]
State transitions are random.

[Figure 3.3 from "Reinforcement Learning: An Introduction" by Sutton and Barto]
Recycling robot example (continued)

[Table 3.1 from "Reinforcement Learning: An Introduction" by Sutton and Barto: transition probabilities and expected rewards for each (state, action) pair]
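One way to make the example concrete in code is a table mapping each (state, action) pair to a list of (probability, next state, reward) triples. The sketch below follows the structure of the recycling robot; the numeric values of α, β, the rewards, and the rescue penalty are illustrative placeholders, not the actual entries of Table 3.1:

```python
# Illustrative stand-ins for the entries of Table 3.1 (not the actual values)
alpha, beta = 0.8, 0.6           # Pr(battery stays high / stays low) while searching
r_search, r_wait = 2.0, 1.0      # expected cans collected, with r_search > r_wait

# P[(s, a)] = list of (probability, next_state, reward) triples
P = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # depleted: rescue penalty
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
states = ["high", "low"]
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}
```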
Markov policies and the value function

General policy: A_t sampled from π(·|H_{t−1}, R_t, S_t)
Stationary Markov policy: A_t sampled from π(·|S_t)

Total return after time t:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}

Value function of policy π:

    v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]

Optimal policy: a policy π* satisfying v_{π*}(s) ≥ v_π(s) for every state s and every policy π.
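Since v_π(s) is an expectation of the discounted return, it can be approximated by averaging sampled returns. A minimal Monte Carlo sketch; env_step(s, a) and policy(s) are hypothetical sampling functions, not part of the slides:

```python
def estimate_value(env_step, policy, s0, gamma=0.9, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of v_pi(s0): average truncated discounted return."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, w = s0, 0.0, 1.0
        for _ in range(horizon):       # truncate the infinite sum; gamma**horizon is negligible
            a = policy(s)              # A_t ~ pi(.|S_t)
            s, r = env_step(s, a)      # (S_{t+1}, R_{t+1})
            g += w * r
            w *= gamma
        total += g
    return total / n_episodes
```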
Computing the optimal policy (when state transition probabilities are known): value iteration

(1) Initialize q_0(s, a) arbitrarily (e.g., to zero).
(2) Compute the new Q functions (at iteration k + 1) by

    q_{k+1}(s, a) = Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ max_{a′} q_k(s′, a′) ]
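A minimal sketch of this iteration in code, reusing the transition-table format (P, states, actions) assumed in the recycling robot sketch above:

```python
def value_iteration(P, states, actions, gamma=0.9, n_iters=100):
    """Iterate q_{k+1}(s,a) = sum_{s'} p(s'|s,a) [r(s,a,s') + gamma * max_{a'} q_k(s',a')]."""
    q = {(s, a): 0.0 for s in states for a in actions[s]}   # q_0 = 0
    for _ in range(n_iters):
        q_new = {}
        for s in states:
            for a in actions[s]:
                q_new[(s, a)] = sum(
                    p * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                    for p, s2, r in P[(s, a)]
                )
        q = q_new
    # Read off the greedy policy w.r.t. the (approximately) converged Q function
    policy = {s: max(actions[s], key=lambda a: q[(s, a)]) for s in states}
    return q, policy
```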
Robot grid-world example for value iteration

https://youtu.be/gThGerajccM

Goal location: high reward
Free space: small penalty
Obstacles: very large penalty
Learning the optimal policy (when state transition probabilities are unknown): Q-learning

Keep a table of Q-value estimates: Q(s, a) for s ∈ S, a ∈ A

In round t: S_t → A_t → (S_{t+1}, R_{t+1})

How to update the table?

Form the sample estimate:

    Q̂(S_t, A_t) = R_{t+1} + γ max_{a′} Q(S_{t+1}, a′)

Update the Q-value of (S_t, A_t):

    Q(S_t, A_t) ← (1 − α) Q(S_t, A_t) + α Q̂(S_t, A_t),   where α is the learning rate

How to select actions?

Option 1: Greedy: A_t = arg max_a Q(S_t, a)

Option 2: ε-greedy: flip a biased coin C_t; if C_t = H, select A_t uniformly at random from the action set (explore); if C_t = T, then A_t = arg max_a Q(S_t, a) (exploit)

Option 3: Boltzmann exploration:

    A_t ∼ Pr(·|S_t) such that Pr(A_t = a|S_t) = e^{Q(S_t,a)} / Σ_{a′} e^{Q(S_t,a′)}

This explores implicitly: every action keeps positive probability, with higher-valued actions selected more often.
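A minimal tabular Q-learning sketch combining the update rule above with ε-greedy exploration, plus a Boltzmann alternative; env_step and all parameter values are illustrative assumptions:

```python
import math
import random

def q_learning(env_step, states, actions, s0, gamma=0.9, alpha=0.1,
               epsilon=0.1, n_rounds=10_000):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative parameters)."""
    Q = {(s, a): 0.0 for s in states for a in actions[s]}
    s = s0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            a = random.choice(actions[s])                   # explore
        else:
            a = max(actions[s], key=lambda b: Q[(s, b)])    # exploit
        s_next, r = env_step(s, a)                          # (S_{t+1}, R_{t+1})
        # Sample estimate: Q_hat = R_{t+1} + gamma * max_{a'} Q(S_{t+1}, a')
        q_hat = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])
        # Learning-rate update: Q <- (1 - alpha) Q + alpha Q_hat
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * q_hat
        s = s_next
    return Q

def boltzmann_action(Q, actions, s):
    """Sample A_t with Pr(a|s) proportional to exp(Q(s, a))."""
    m = max(Q[(s, a)] for a in actions[s])                  # subtract max for numerical stability
    weights = [math.exp(Q[(s, a)] - m) for a in actions[s]]
    return random.choices(actions[s], weights=weights, k=1)[0]
```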
Deep Q Network Learning to Play an Atari Game

https://youtu.be/cjpEIotvwFY