It would be much easier if robots and agents could just teach themselves. Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search.
Supervised learning:
- a teacher (i.e., the desired output) gives detailed feedback on the best move in each situation
- infeasible for complex games, e.g., which chess move is best from a given board position?

Reinforcement learning:
- no teacher; the only feedback is whether you won or lost
- the computer can simulate many games to explore the space
Value Iteration for solving Markov Systems

- Compute $J^1(S_i)$ for each $i$
- Compute $J^2(S_i)$ for each $i$
- ...
- Compute $J^k(S_i)$ for each $i$

As $k \to \infty$, $J^k(S_i) \to J^*(S_i)$. Why?

When to stop? When
$$\max_i \left| J^{k+1}(S_i) - J^k(S_i) \right| < \epsilon$$

This is faster than matrix inversion ($N^3$ style) if the transition matrix is sparse.
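As a concrete illustration, here is a minimal sketch of this iteration in Python. The transition matrix, rewards, discount, and tolerance below are hypothetical placeholders, not values from the slides.

```python
import numpy as np

def markov_system_values(P, R, gamma=0.9, eps=1e-6):
    """Iterate J_{k+1} = R + gamma * P @ J_k until max_i |J_{k+1}(S_i) - J_k(S_i)| < eps."""
    J = np.zeros(len(R))
    while True:
        J_next = R + gamma * P @ J              # one synchronous backup of every state
        if np.max(np.abs(J_next - J)) < eps:    # the stopping rule from the slide
            return J_next
        J = J_next

# Hypothetical 3-state Markov system (placeholder numbers):
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
R = np.array([0.0, 1.0, 10.0])
print(markov_system_values(P, R))
```

For comparison, the exact solution solves $(I - \gamma P)J = R$ by matrix inversion, which is the $N^3$-style cost the slide refers to; iteration wins when $P$ is sparse.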
A Markov Decision Process

$\gamma = 0.9$. You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).

States and rewards:
- Poor & Unknown: +0
- Rich & Unknown: +10
- Rich & Famous: +10
- Poor & Famous: +0

[Transition diagram: from each state, action S or A leads to successor states with probabilities 1 or 1/2.]
Recall Markov Decision Processes

- states: $\{s_1, \ldots, s_N\}$
- initial state: $s_0$
- actions: $\{a_1, \ldots, a_M\}$
- rewards: $R(s) \in \{r_1, \ldots, r_N\}$
- transitions: $T(s, a, s') = P(s_{t+1} = s' \mid a_t = a, s_t = s)$
- discount: $\gamma$

Value iteration:
$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V_i(s'), \qquad V(s) = \lim_{i \to \infty} V_i(s)$$

Optimal policy:
$$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, V(s')$$

The reward function R(s) and the transition model T(s, a, s') are also known. How does the agent maximize reward? Value (or policy) iteration.
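A minimal sketch of this value iteration in Python, using a hypothetical two-state, two-action MDP (all numbers are placeholders, not from the slides):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, eps=1e-6):
    """T[s, a, s2] = P(s2 | s, a); R[s] = reward for being in state s."""
    V = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = gamma * sum_s' T(s, a, s') V(s')
        Q = gamma * np.einsum('sap,p->sa', T, V)
        V_next = R + Q.max(axis=1)            # V_{i+1}(s) = R(s) + max_a Q[s, a]
        if np.max(np.abs(V_next - V)) < eps:
            return V_next, Q.argmax(axis=1)   # values and greedy (optimal) policy
        V = V_next

# Hypothetical MDP (placeholder numbers):
T = np.array([[[0.9, 0.1], [0.5, 0.5]],    # transitions from state 0 under a0, a1
              [[0.2, 0.8], [1.0, 0.0]]])   # transitions from state 1 under a0, a1
R = np.array([0.0, 10.0])
V, policy = value_iteration(T, R)
```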
The optimal policy

[Figure: a 4x3 grid world with terminal states +1 (row 3) and -1 (row 2); each action moves in the intended direction with probability 0.8 and to each perpendicular direction with probability 0.1.]

$$T(s, a, s') = P(s_{t+1} \mid a_t, s_t), \qquad R(s) = -0.04$$

With a weak negative reward, the best policy is to avoid -1 when possible and seek +1.
Another optimal policy

[Figure: the same 4x3 grid world and transition model (0.8 intended direction, 0.1 each perpendicular).]

$$T(s, a, s') = P(s_{t+1} \mid a_t, s_t), \qquad R(s) < -1.6285$$

With a strong negative reward, the best policy is to always seek the end states, even the -1 exit.
Reinforcement learning: learning from experience

[Figure: the same 4x3 grid world, but every cell is marked "?": both the transition model T(s, a, s') and the rewards R(s) are unknown.]

Estimate the transition model from counts:
$$\hat{T}(s, a, s') = \frac{\#\,\text{transitions}\ s \to s'\ \text{for action}\ a}{\#\,\text{times}\ a\ \text{selected at state}\ s}$$
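A minimal sketch of this count-based estimate in Python (the data structures are assumptions for illustration):

```python
from collections import defaultdict

trans_counts = defaultdict(int)    # (s, a, s2) -> # transitions s --a--> s2
action_counts = defaultdict(int)   # (s, a)     -> # times a selected at s

def record(s, a, s2):
    """Record one observed transition s --a--> s2."""
    trans_counts[(s, a, s2)] += 1
    action_counts[(s, a)] += 1

def T_hat(s, a, s2):
    """Empirical T(s, a, s') = #(s --a--> s') / #(a selected at s)."""
    n = action_counts[(s, a)]
    return trans_counts[(s, a, s2)] / n if n > 0 else 0.0
```

R(s) can be estimated the same way, by averaging the rewards observed in each state.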
Model estimation

With estimates of T(s, a, s') and R(s), we can just treat the problem like a known MDP and solve it with value iteration:
$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} \hat{T}(s, a, s')\, V_i(s'), \qquad V(s) = \lim_{i \to \infty} V_i(s)$$

[Figure: a chain of states 1, 2, 3, ..., N.]
More efficient updating

Rather than re-running full value iteration after every observation, perform a single backup of the current state $s_i$: one-backup certainty-equivalence (CE) learning.

$$V_{i+1}(s_i) = R(s_i) + \gamma \max_a \sum_{s'} T(s_i, a, s')\, V_i(s')$$

[Figure: state s_i, with action a leading to successor states s_j with probabilities T(s_i, a, s_j).]
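A sketch of a single backup in Python, reusing the count-based `T_hat` estimate from the sketch above (names are assumptions):

```python
def one_backup(s, V, R, states, actions, gamma=0.9):
    """Back up only state s: V(s) = R(s) + gamma * max_a sum_s' T(s, a, s') V(s')."""
    best = max(sum(T_hat(s, a, s2) * V[s2] for s2 in states)
               for a in actions)
    V[s] = R[s] + gamma * best
    return V[s]
```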
Example: prioritized sweeping

When the value of a state changes by a large amount
$$\Delta = |V_{\text{prev}}(s) - V(s)|,$$
the values of its predecessor states $s_p$ are likely to change too, so update them next. Each predecessor is queued with priority proportional to $\Delta \cdot T(s_p, a_p, s)$.
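A minimal sketch of prioritized sweeping under these assumptions (Python; the queue discipline, the `predecessors` map, and reuse of `T_hat` from above are simplified placeholders):

```python
import heapq

def prioritized_sweep(V, R, states, actions, predecessors, start_state,
                      gamma=0.9, theta=1e-4, max_updates=1000):
    """Propagate value changes backward through the model, largest changes first."""
    pq = [(-1.0, start_state)]                 # max-heap via negated priorities
    for _ in range(max_updates):
        if not pq:
            break
        _, s = heapq.heappop(pq)
        v_prev = V[s]
        V[s] = R[s] + gamma * max(
            sum(T_hat(s, a, s2) * V[s2] for s2 in states) for a in actions)
        delta = abs(v_prev - V[s])             # Delta = |V_prev(s) - V(s)|
        for (s_p, a_p) in predecessors[s]:     # states that can reach s
            pri = delta * T_hat(s_p, a_p, s)   # priority ~ Delta * T(s_p, a_p, s)
            if pri > theta:
                heapq.heappush(pq, (-pri, s_p))
    return V
```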
How do we choose the actions?

Should we always take the action that looks best under the current estimates, i.e., the one maximizing $R(s) + \gamma \sum_{s'} T(s, a, s')\, V(s')$?

[Figure: a three-state example with rewards R = 10, R = -5, and R = 50; actions a and b move between the states with probabilities 0.9, 0.1, and 1.0. Purely greedy action selection can settle on the R = 10 state and never discover the R = 50 state.]
Exploration strategy

How do you maximize your payoff when you don't know R(s_i) or P(s_i)?

If you did know the expected reward R(s_i)P(s_i), it would be trivial: just pull the lever with the highest expected reward.

Possible strategies (see the sketch after this list):
- ε-greedy (exploit, but once in a while do something random):
  choose the machine with the current best expected reward with probability 1 - ε;
  choose another machine (action) randomly with probability ε / (N - 1)
- Boltzmann exploration (choose probabilistically relative to expected rewards):
  choose machine k with probability proportional to exp(R(s_k)P(s_k) / T);
  decrease T (temperature) over time
  - high T: nearly equiprobable (exploration)
  - low T: increasingly biased toward the best action (exploitation)
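A minimal sketch of both strategies in Python (the expected-reward estimates are assumed to be precomputed):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(expected_rewards, eps=0.1):
    """With probability 1-eps pick the best machine; otherwise pick another at random."""
    best = int(np.argmax(expected_rewards))
    if rng.random() < 1 - eps:
        return best
    others = [k for k in range(len(expected_rewards)) if k != best]
    return int(rng.choice(others))          # each chosen with probability eps / (N-1)

def boltzmann(expected_rewards, T=1.0):
    """Pick machine k with probability proportional to exp(ER_k / T)."""
    z = np.exp(np.asarray(expected_rewards) / T)
    return int(rng.choice(len(z), p=z / z.sum()))
```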
Some real intelligence

Consider the summation in the Bellman equations: the expected value of each state s equals its own reward plus the (discounted) reward of its successor states.

For state s, we receive reward R(s) and take action a = π(s) to transition from s to s'.

For any successor state s', we can adjust V(s) so it better agrees with the constraint equations, i.e., moves closer to the expected future rewards.
Temporal difference learning

In value iteration, the current value on the left is set to the expected future reward on the right:
$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V_i(s')$$

In TD learning, the new estimate is a weighted sum of the current value and the value after transitioning to s':
$$V(s) \leftarrow (1 - \alpha)\,V(s) + \alpha\,\bigl(R(s) + \gamma V(s')\bigr) = V(s) + \alpha\,\bigl(R(s) + \gamma V(s') - V(s)\bigr)$$

The term $R(s) + \gamma V(s') - V(s)$ is the discrepancy between the current value and the new estimate after transitioning to s'. This is the TD (or temporal difference) equation.
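A minimal sketch of the TD update in Python (the value table and transition format are assumptions):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V[s]
```

Each observed transition (s, r, s') updates V(s) directly; no transition model T(s, a, s') is needed.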
The temporal difference equation

$$V(s) \leftarrow V(s) + \alpha\,\bigl(R(s) + \gamma V(s') - V(s)\bigr)$$

The learning rate α controls the trade-off:
- Too small: V(s) converges slowly, biased toward the current estimate of V.
- Too large: V(s) changes quickly, biased toward the new estimate.

Decaying learning-rate strategy: start with a large α when we're not sure, then reduce it as we become more sure of our estimates.

[Plot: α decreasing over iterations.]
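One common decay schedule, as a hedged sketch (the 1/n form is an assumption; the slide only says to start large and then reduce):

```python
def decaying_alpha(n, alpha0=1.0):
    """Learning rate after n updates: large at first, shrinking as estimates firm up."""
    return alpha0 / (1 + n)
```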
Summary

- Relationship to MDPs: reinforcement learning is an MDP in which the world's rewards and transitions are unknown.
- TD learning: update V(s) directly from observed transitions, with no model required.