[Figure: Agent-environment interaction loop — at each time step t the agent observes state s_t and reward r_t and takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}.]
• Types of RL algorithms:
– Value-based – learn the optimal action-value function Q*(s, a)
• Derive the policy from Q*(s, a) – Q-Learning
– Policy-based – search directly for the optimal policy π*
• Q-Learning update rule (a runnable sketch follows below):
Q(s, a) ← Q(s, a) + α Δ(s, a, r, s'),  where  Δ(s, a, r, s') = r + γ max_a' Q(s', a') − Q(s, a)
• Agent interacts with the environment and obtains samples:
< current state, action, reward, next state > – < s, a, r, s' >
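As a concrete illustration, here is a minimal tabular Q-learning step in Python. The `env.reset`/`env.step` interface is an assumed Gym-style convention and the sizes are placeholders; none of these names come from the slides:

```python
import numpy as np

n_states, n_actions = 16, 4          # placeholder sizes for a small discrete task
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))  # tabular Q(s, a)

def q_learning_step(env, s):
    # epsilon-greedy: explore with probability epsilon, else act greedily
    if np.random.rand() < epsilon:
        a = np.random.randint(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    s_next, r, done, _ = env.step(a)  # obtain a sample <s, a, r, s'>
    # TD error: delta(s, a, r, s') = r + gamma * max_a' Q(s', a') - Q(s, a)
    delta = r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a]
    Q[s, a] += alpha * delta          # Q(s, a) <- Q(s, a) + alpha * delta
    return s_next, done
```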
[Figure: Function approximator with parameter vector w — inputs s, a (and samples r, s' for training); output Q(s, a, w).]
• Use a neural network with stochastic gradient descent and back propagation, where:

error(s, a, r, s', w) = r + γ max_a' Q(s', a', w) − Q(s, a, w)

Mean Squared Error (MSE) = E[(r + γ max_a' Q(s', a', w) − Q(s, a, w))²]

• Weight update (the gradient ∇_w Q(s, a, w) is obtained with back propagation; a sketch follows):

w ← w + α [r + γ max_a' Q(s', a', w) − Q(s, a, w)] ∇_w Q(s, a, w)
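A minimal sketch of this semi-gradient update, assuming a linear approximator Q(s, a, w) = w[a]·s for brevity (with a neural network, ∇_w Q would come from back propagation instead; all names and sizes here are illustrative):

```python
import numpy as np

gamma, alpha = 0.99, 0.01
n_features, n_actions = 8, 4              # illustrative sizes
w = np.zeros((n_actions, n_features))     # parameter vector w, one row per action

def q(s, a):
    # Q(s, a, w) for a linear approximator: dot product of weights and features
    return w[a] @ s

def sgd_update(s, a, r, s_next):
    # error(s, a, r, s', w) = r + gamma * max_a' Q(s', a', w) - Q(s, a, w)
    target = r + gamma * max(q(s_next, a2) for a2 in range(n_actions))
    error = target - q(s, a)
    # For a linear Q, grad_w Q(s, a, w) is just the feature vector s (in row a);
    # a neural network would obtain this gradient via back propagation.
    w[a] += alpha * error * s             # w <- w + alpha * error * grad_w Q
```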
[Figure: Q-Learning with a neural network — the diagram's numbered stages 1-7 correspond to the steps below; the network used in stage 1 and in stage 2 is the same, and the weight delta Δw is applied through gradient descent.]

Initialize network weights
Repeat for each episode:
  Initialize s
  Repeat for each step of episode:
    Choose a from s using ε-greedy policy (*) – 1
    Take action a, observe r, s'
    Obtain current Q(s, a, w) and gradient ∇_w Q(s, a, w) – 2
    Calculate max next-state Q value, max_a' Q(s', a', w) – 3
    Calculate target – 4
    Calculate error – 5
    Calculate weights delta Δw – 6
    Update weights – retrain network – 7
    s ← s'
  Until s is terminal

(*) ε-greedy policy: with probability ε, select a random a; otherwise select argmax_a Q(s, a, w)
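A compact sketch of one episode of this loop in Python with PyTorch (a framework choice made here for illustration; the slides do not prescribe one). The Gym-style `env` interface, the network shape, and the hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

n_obs, n_actions = 4, 2                 # illustrative sizes
gamma, alpha, epsilon = 0.99, 1e-3, 0.1

# A single Q-network computes both Q(s, a, w) and max_a' Q(s', a', w)
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(q_net.parameters(), lr=alpha)

def run_episode(env):
    s, done = env.reset(), False          # assumed Gym-style interface
    while not done:
        # 1: choose a from s using the epsilon-greedy policy
        if torch.rand(1).item() < epsilon:
            a = torch.randint(n_actions, (1,)).item()
        else:
            with torch.no_grad():
                a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s_next, r, done, _ = env.step(a)  # take action a, observe r, s'
        # 2: current Q(s, a, w), with the gradient tracked for back propagation
        q_sa = q_net(torch.as_tensor(s, dtype=torch.float32))[a]
        # 3: max next-state Q value, max_a' Q(s', a', w), same network
        with torch.no_grad():
            q_next = q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
        # 4: target;  5: error (squared, so its gradient matches the update rule)
        target = r + gamma * q_next * (0.0 if done else 1.0)
        loss = (target - q_sa) ** 2
        # 6-7: weight delta via back propagation, then a gradient-descent update
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = s_next
```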
[Figure: DQN — online network Q(s, a, w), separate target network Q̂(s', a', ŵ), and experience replay memory D; the target weights are reset ŵ ← w every C updates.]

Initialize replay memory D, network weights w, and target network weights ŵ = w
For episode = 1, M do
  For t = 1, T do
    Select action a using an ε-greedy policy (*) – 1
    Execute action a and observe next state s' and reward r
    Store transition (s, a, r, s') in D
    Sample random mini-batch of transitions (s, a, r, s') from D
    Obtain Q(s, a, w) and gradient ∇_w Q(s, a, w) – 2
    Calculate max_a' Q̂(s', a', ŵ) – 3
    Calculate mini-batch targets, y – 4
    Calculate error – 5
    Calculate weights delta Δw – 6
    Update weights w for network Q – retrain network, performing gradient descent – 7
    Every C updates reset Q̂ = Q (ŵ ← w) – 8
  End For
End For

(*) ε-greedy policy: with probability ε, select a random a; otherwise select argmax_a Q(s, a, w)
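A sketch of the DQN mini-batch update in the same PyTorch style, assuming a replay memory `buffer` that is being filled elsewhere with (s, a, r, s', done) tuples; all names and hyperparameters are illustrative:

```python
import copy
import random
import torch
import torch.nn as nn

n_obs, n_actions = 4, 2                  # illustrative sizes
gamma, batch_size, C = 0.99, 32, 1000

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)        # Q-hat with target weights w-hat
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer, step = [], 0                     # replay memory D, update counter

def dqn_update():
    global step
    # Sample a random mini-batch of transitions (s, a, r, s', done) from D
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x) for x in zip(*batch))
    # 2: Q(s, a, w) for the actions actually taken
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s2.float()).max(dim=1).values    # 3: max_a' Q-hat
        y = r.float() + gamma * q_next * (1 - done.float())  # 4: targets y
    loss = nn.functional.mse_loss(q_sa, y)                   # 5: error
    opt.zero_grad(); loss.backward(); opt.step()             # 6-7: update w
    step += 1
    if step % C == 0:                                        # 8: reset Q-hat = Q
        target_net.load_state_dict(q_net.state_dict())
```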
• Double DQN
– Addresses the DQN overestimation problem caused by the max_a' Q(s', a') operation in the updates
– Uses two separate networks: one to determine the maximizing action and the other to estimate its Q-value; the two roles alternate on each step (see the sketch below)
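The difference is easiest to see in the target computation. Here is a sketch contrasting the two targets; it follows the common deep variant in which the online network selects the action and the target network evaluates it (variable names and tensor shapes are illustrative assumptions):

```python
import torch

def dqn_target(r, s2, done, target_net, gamma=0.99):
    # Standard DQN: one network both selects and evaluates the maximizing
    # action, which biases the target upward (overestimation).
    with torch.no_grad():
        return r + gamma * target_net(s2).max(dim=1).values * (1 - done)

def double_dqn_target(r, s2, done, q_net, target_net, gamma=0.99):
    # Double DQN: one network picks the maximizing action, the other
    # estimates that action's Q-value.
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)        # action selection
        q_eval = target_net(s2).gather(1, a_star).squeeze(1)  # action evaluation
        return r + gamma * q_eval * (1 - done)
```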