Unit 4: Reinforcement Learning
Model-Based Learning
Supervised Learning: Example → Class
Reinforcement Learning: Situation → Reward, Situation → Reward, …
Examples
• Playing chess: reward comes at the end of the game
• Ping-pong: reward on each point scored
• Animals: hunger and pain are negative rewards; food intake is a positive reward
Framework: Agent in State Space
Example: XYZ-World (remark: no terminal states)
[State diagram: ten states connected by moves e, n, s, w, ne, nw, sw; rewards R=+5 in state 3, R=+3 in state 5, R=-9 in state 6, R=+4 in state 8, R=-6 in state 9; one action x is stochastic, reaching its two successors with probabilities 0.3 and 0.7.]
Problem: What actions should an agent choose to maximize its rewards?
XYZ-World: Discussion Problem 12 (Bellman vs. TD for Policy P)
[Same state diagram; three states, apparently 3, 8, and 10, are annotated with value pairs comparing the Bellman and TD estimates: (3.3, 0.5), (3.2, -0.5), and (0.6, -0.2).]
I tried hard, but: any better explanations?
Explanation of the discrepancies between TD for P and Bellman:
• The most significant discrepancies are in states 3 and 8; minor in state 10.
• P chooses the worst successor of 8; it should apply operator x instead.
• P should apply w in state 6, but does so in only 2/3 of the cases, which affects the utility of state 3.
• The low utility value of state 8 in TD seems to lower the utility value of state 10; only a minor discrepancy.
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.
XYZ-World: Discussion Problem 12 (Bellman Update, g=0.2)
Utilities computed by the Bellman update:
state 1: 0.145, state 2: 0.72, state 3: 0.58 (R=+5), state 4: 0.03, state 5: 3.63 (R=+3),
state 6: -8.27 (R=-9), state 7: 0.001, state 8: 3.17 (R=+4), state 9: -5.98 (R=-6), state 10: 0.63
Discussion on using the Bellman update for Problem 12:
• No convergence for g=1.0; the utility values seem to run away!
• State 3 has utility 0.58 although it gives a reward of +5, due to the immediate penalty that follows; we were able to detect that.
• Did anybody run the algorithm for other values of g, e.g. 0.4 or 0.6? If yes, did it converge to the same values?
• The speed of convergence seems to depend on the value of g.
XYZ-World: Discussion Problem 12 (TD vs. TD with Inverted Rewards)
[Same state diagram; several states are annotated with pairs of TD utilities obtained with the original and with the inverted rewards: (0.57, -0.65), (2.98, -2.99), (-0.50, 0.47), (-0.18, -0.12).]
Other observations:
• The Bellman update did not converge for g=1.
• The Bellman update converged very fast for g=0.2.
• Did anybody try other values for g (e.g. 0.6)?
• The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is 1-2-5-7-4-1 optimal?
• TD reversed the utility values quite neatly when the rewards were inverted: x becomes -x+u with u in [-0.08, 0.08].
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.
XYZ-World: Other Considerations
• R(s) might be known in advance or may have to be learned.
• R(s) might be probabilistic or deterministic.
• R(s) might change over time; the agent has to adapt.
• The results of actions might be known in advance or may have to be learned; the results of actions can be fixed, or may change over time.
To be used in Assignment 2:
Passive Learning
Expected Utility
Direct Utility Estimation
Learn to map states to utilities, e.g. U(1,1) = 0.72, U(2,1) = 0.68, …
Bellman Equation
[Grid-world figure: adjacent states (1,3) and (2,3) with U(1,3) = 0.84 and U(2,3) = 0.92.]
Updating Estimations Based on Observations:
New_Estimation = Old_Estimation*(1-α) + Observed_Value*α
New_Estimation = Old_Estimation + Observed_Difference*α
Example: Measure the utility of a state s whose current value is 2, where the observed values are 3 and 3 and the learning rate α is 0.2.
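A minimal sketch (the function name is hypothetical) of the second update rule applied to the example; both rules are algebraically identical, so either form gives the same result:

    def update(old_estimation, observed_value, alpha):
        # New_Estimation = Old_Estimation + (Observed_Value - Old_Estimation) * alpha
        return old_estimation + alpha * (observed_value - old_estimation)

    u = 2.0                          # current utility estimate of state s
    for observation in [3.0, 3.0]:   # the two observed values
        u = update(u, observation, alpha=0.2)
        print(u)                     # 2.2, then 2.36 (up to floating-point rounding)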
Markov Decision Processes - Recitation
• M = (S,A,T,R)
– S: States
– A: Actions
– T: Transition Probability
– R: Reward
• Goal of the MDP?
– Finding the Policy
• Policy is the mapping from state to action
– Two well known objectives
• Average Reward
• Discounted Reward
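As a concrete illustration of the M = (S, A, T, R) tuple above, a tiny Python sketch; the two-state example and all names are assumptions, not from the recitation:

    S = ["s1", "s2"]                      # states
    A = ["stay", "go"]                    # actions
    T = {                                 # T[(s, a)] -> {s': probability}
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s2": 0.9, "s1": 0.1},
        ("s2", "stay"): {"s2": 1.0},
        ("s2", "go"):   {"s1": 1.0},
    }
    R = {"s1": 0.0, "s2": 1.0}            # reward per state

    policy = {"s1": "go", "s2": "stay"}   # a policy maps each state to an action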
Reinforcement Learning
• No knowledge of the environment
– Can only act in the world and observe states and rewards
• Many factors make RL difficult:
– Actions have non-deterministic effects
• Which are initially unknown
– Rewards / punishments are infrequent
• Often at the end of long sequences of actions
• How do we determine which action(s) were really responsible for a reward or punishment? (credit assignment)
– World is large and complex
• Nevertheless, the learner must decide what actions to take
– We will assume the world behaves as an MDP
Delayed Reward Problem
Two Key Aspects in RL
• How do we update the value function or policy?
– How do we form training data?
– Sequences of (s, a, r), …
• How do we explore?
– Exploitation vs. exploration
Categories of Reinforcement Learning
• Model-based RL
– Constructs a transition model of the domain, an MDP
• E3 (Kearns and Singh)
• Model-free RL
– Only concerned with the policy
• Q-Learning (Watkins)
• Active Learning (Off-Policy Learning)
– Q-Learning
• Passive Learning (On-Policy Learning)
– Sarsa (Sutton)
Passive RL
“If you study hard, you will be blessed.”
“What is the value of studying hard?”
“You will learn from RL.”
Policy Evaluation
• Remember the formula?
V(s) = R(s) + β · Σ_{s'} T(s, π(s), s') · V(s')
• Can we do that with RL?
– What is missing?
– What needs to be done?
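For reference, a minimal sketch of evaluating a fixed policy by iterating the formula above when T and R are given (exactly the ingredients that are missing in the RL setting); the dictionary layout follows the hypothetical MDP sketch from the recitation slide:

    def evaluate_policy(S, T, R, policy, beta=0.9, sweeps=100):
        # Repeatedly apply V(s) = R(s) + beta * sum_{s'} T(s, policy(s), s') * V(s')
        V = {s: 0.0 for s in S}
        for _ in range(sweeps):
            V = {s: R[s] + beta * sum(p * V[s2]
                                      for s2, p in T[(s, policy[s])].items())
                 for s in S}
        return V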
Objective: Value Function
Passive RL
• Given policy π,
– estimate V(s)
• Not given
– the transition matrix, nor
– the reward function!
• Simply follow the policy for many epochs
• Epochs: training sequences
(1,1)(1,2)(1,3)(1,2)(1,3)(2,3)(3,3)(3,4) +1
(1,1)(1,2)(1,3)(2,3)(3,3)(3,2)(3,3)(3,4) +1
(1,1)(2,1)(3,1)(3,2)(4,2) -1
Approach 1: Direct Estimation
• Direct estimation (model-free)
– Estimate V(s) as the average total reward of epochs containing s (calculated from s to the end of the epoch)
• Reward-to-go of a state s:
the sum of the (discounted) rewards from that state until a terminal state is reached
• Key: use the observed reward-to-go of the state as direct evidence of the actual expected utility of that state
• Averaging the reward-to-go samples will converge to the true value of the state
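A minimal sketch of direct estimation; the data layout is an assumption (each epoch is a list of (state, reward) pairs), and the function averages the observed reward-to-go over every visit to a state:

    def direct_estimate(epochs, gamma=1.0):
        # epochs: list of trajectories, each a list of (state, reward) pairs
        totals, counts = {}, {}
        for epoch in epochs:
            for i, (s, _) in enumerate(epoch):
                # reward-to-go: (discounted) sum of rewards from this visit onward
                rtg = sum((gamma ** k) * r for k, (_, r) in enumerate(epoch[i:]))
                totals[s] = totals.get(s, 0.0) + rtg
                counts[s] = counts.get(s, 0) + 1
        return {s: totals[s] / counts[s] for s in totals}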
Passive RL
• Given policy π,
– estimate V(s)
• Not given
– the transition matrix, nor
– the reward function!
• Simply follow the policy for many epochs
• Epochs: training sequences, here annotated with the observed reward-to-go values along the first epoch:
(1,1)(1,2)(1,3)(1,2)(1,3)(2,3)(3,3)(3,4) +1
0.57 0.64 0.72 0.81 0.9
(1,1)(1,2)(1,3)(2,3)(3,3)(3,2)(3,3)(3,4) +1
(1,1)(2,1)(3,1)(3,2)(4,2) -1
Direct Estimation
• Converges very slowly to the correct utility values (requires a lot of sequences)
• It ignores the dependencies among state utilities captured by the Bellman equation:
V(s) = R(s) + γ · Σ_{s'} T(s, a, s') · V(s')
Approach 2: Adaptive Dynamic Programming (ADP)
V(s) = R(s) + γ · Σ_{s'} T(s, a, s') · V(s')   (T is the learned transition model)
• How can we estimate the transition model T(s,a,s')?
– Simply the fraction of times we see s' after taking a in state s.
– NOTE: We can bound the error with Chernoff bounds if we want (we will see Chernoff bounds later in the course).
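A minimal sketch of that estimation step; the names and the flat list of observed (s, a, s') triples are assumptions:

    def estimate_transition_model(transitions):
        # transitions: list of observed (s, a, s_next) triples
        counts, totals = {}, {}
        for s, a, s_next in transitions:
            counts[(s, a, s_next)] = counts.get((s, a, s_next), 0) + 1
            totals[(s, a)] = totals.get((s, a), 0) + 1
        # T_hat[(s, a, s')] = fraction of times s' followed taking a in s
        return {key: n / totals[key[:2]] for key, n in counts.items()}

The learned model can then be plugged into the policy-evaluation sweep shown earlier.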
ADP learning curves
[Plot of the utility estimates for states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), and (4,2) as learning proceeds.]
Approach 3: Temporal Difference Learning (TD)
Aside: Online Mean Estimation
X̂_{n+1} = X̂_n + (1/(n+1)) · (x_{n+1} − X̂_n)
(X̂_{n+1}: average of n+1 samples; x_{n+1}: sample n+1; 1/(n+1) plays the role of the learning rate)
• Given a new sample x_{n+1}, the new mean is the old estimate (for n samples) plus the weighted difference between the new sample and the old estimate
Approach 3: Temporal Difference Learning (TD)
• Bellman equation: V(s) = R(s) + γ · Σ_{s'} T(s, a, s') · V(s')
• TD never builds a transition model; after each observed transition from s to s' it nudges V(s) toward R(s) + γ·V(s'), in the spirit of the online mean estimation above:
V(s) ← V(s) + α · (R(s) + γ·V(s') − V(s))
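A minimal sketch of running this TD(0) update over observed transitions; the data layout, a flat list of (s, reward of s, s') triples collected while following the policy, is an assumption:

    def td0(transitions, alpha=0.1, gamma=0.9):
        # transitions: list of (s, r, s_next) observed along trajectories
        V = {}
        for s, r, s_next in transitions:
            v_s, v_next = V.get(s, 0.0), V.get(s_next, 0.0)
            # V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))
            V[s] = v_s + alpha * (r + gamma * v_next - v_s)
        return V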
The TD learning curve
Active RL
“If you study hard, you will be blessed.”
“What is the value of studying hard?”
“You will learn from RL.”
“Well, I will not always study hard; I will try a music career and see if I can be an American Idol.” - Sanjaya
Active Reinforcement Learning
Naïve Approach
1. Act Randomly for a (long) time
– Or systematically explore all possible actions
2. Learn
– Transition function
– Reward function
3. Use value iteration, policy iteration, …
4. Follow resulting policy thereafter.
Will this work? Yes (if we do step 1 long enough)
Any problems? We will act randomly for a long time
before exploiting what we know.
Revision of the Naïve Approach
1. Start with an initial utility/value function and an initial model
2. Take the greedy action according to the value function
(this requires using our estimated model to do “lookahead”)
3. Update the estimated model
4. Perform a step of ADP (update the value function)
5. Goto 2
This is just ADP, but we follow the greedy policy suggested by the current value estimate
Exploration versus Exploitation
• Two reasons to take an action in RL
– Exploitation: To try to get reward. We exploit our
current knowledge to get a payoff.
– Exploration: Get more information about the world. How do we know there is not a pot of gold around the corner?
• To explore we typically need to take actions that
do not seem best according to our current model.
• Managing the trade-off between exploration and
exploitation is a critical issue in RL
• Basic intuition behind most approaches:
– Explore more when knowledge is weak
– Exploit more as we gain knowledge
ADP-based RL
1. Start with initial utility/value function
2. Take action according to an explore/exploit policy
(explores more early on and gradually becomes greedy)
3. Update estimated model
4. Perform step of ADP
5. Goto 2
This is just ADP but we follow the explore/exploit policy
Explore/Exploit Policies
• Boltzmann (softmax) exploration:
Pr(a | s) = exp(Q(s, a) / T) / Σ_{a' ∈ A} exp(Q(s, a') / T)
– T is the temperature. A large T means that each action has about the same probability; a small T leads to more greedy behavior.
– Typically start with a large T and decrease it over time.
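A minimal sketch of sampling an action from that distribution; Q is assumed to be a dict keyed by (state, action) pairs:

    import math, random

    def boltzmann_action(Q, s, actions, temperature):
        # Pr(a|s) is proportional to exp(Q(s,a)/T)
        prefs = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
        total = sum(prefs)
        return random.choices(actions, weights=[p / total for p in prefs])[0]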
Alternative Approach:
Exploration Functions
Exploration Functions
• Recall that value iteration iteratively performs the following update at all states:
V(s) ← R(s) + γ · max_a Σ_{s'} T(s, a, s') · V(s')
– We want the update to make actions that lead to unexplored regions look good
• Implemented by an exploration function f(u, n):
– assigns a higher utility estimate to relatively unexplored action-state pairs
– changes the updating rule of the value function to
V(s) ← R(s) + γ · max_a f( Σ_{s'} T(s, a, s') · V(s'), N(a, s) )
• Properties of f(u, n)?
– If n > Ne: u, i.e. the normal utility
– Else: R+, i.e. the maximum possible value of a state
• The agent will initially behave as if there were wonderful rewards scattered all around (optimistic).
• But after actions have been tried enough times, we will perform standard “non-optimistic” value iteration.
• Note that all of these exploration approaches assume that exploration will not lead to unrecoverable disasters (falling off a cliff).
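A minimal sketch of the exploration function and the optimistic backup; R_PLUS, N_E, and the expected_value callable (computing Σ_{s'} T(s,a,s')·V(s')) are assumptions for illustration:

    R_PLUS = 2.0    # optimistic estimate: best possible value of any state
    N_E = 5         # try each (action, state) pair at least this many times

    def f(u, n):
        # trust the utility estimate u once (a, s) has been tried more than N_E times
        return u if n > N_E else R_PLUS

    def optimistic_backup(s, R, actions, expected_value, N, gamma=0.9):
        # V(s) <- R(s) + gamma * max_a f( sum_{s'} T(s,a,s')V(s'), N(a,s) )
        return R[s] + gamma * max(f(expected_value(s, a), N.get((a, s), 0))
                                  for a in actions)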
TD-based Active RL
1. Start with an initial utility/value function
2. Take an action according to an explore/exploit policy
(should converge to the greedy policy, i.e. GLIE)
3. Update the estimated model
4. Perform the TD update
V(s) ← V(s) + α · (R(s) + γ·V(s') − V(s))
V(s) is the new estimate of the optimal value function at state s.
5. Goto 2
Just like TD for passive RL, but we follow the explore/exploit policy.
Q-Learning: Model-Free RL
• Instead of learning the optimal value function V, directly learn the optimal Q-function.
– Recall that Q(s, a) is the expected value of taking action a in state s and then following the optimal policy thereafter.
• The optimal Q-function satisfies V(s) = max_{a'} Q(s, a'), which gives:
Q(s, a) = R(s) + γ · Σ_{s'} T(s, a, s') · V(s')
Q-Learning
1. Start with an initial Q-function (e.g. all zeros)
2. Take an action according to an explore/exploit policy
(should converge to the greedy policy, i.e. GLIE)
3. Perform the TD update
Q(s, a) ← Q(s, a) + α · (R(s) + γ · max_{a'} Q(s', a') − Q(s, a))
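A minimal sketch of tabular Q-learning; the environment interface (reset/step), the use of a simple epsilon-greedy explore/exploit policy, and all names are assumptions for illustration:

    import random

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = {}                                   # Q[(s, a)], missing entries read as 0
        for _ in range(episodes):
            s, done = env.reset(), False         # assumed environment interface
            while not done:
                # explore with probability epsilon, otherwise act greedily
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q.get((s, x), 0.0))
                s_next, r, done = env.step(a)    # assumed to return (s', reward, done)
                best_next = max(Q.get((s_next, x), 0.0) for x in actions)
                target = r + (0.0 if done else gamma * best_next)
                # Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))
                Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
                s = s_next
        return Q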
Direct Policy RL
• Why?
• Again, ironically, policy-gradient-based approaches have been successful in many real applications.
• Actually, we do this ourselves:
– From the Ten Commandments
– “You Shall Not Murder”
– “Do not have any other gods before me”
• How do we design a policy with parameters?
– Multinomial distribution over actions in a state
– Binary classification for each action in a state
Policy Gradient Algorithm
• Objective: J(w) = E_w[ Σ_t c_t ], the expected accumulated cost (failure probability, makespan, …)
• Minimise J by
– computing its gradient with respect to the policy parameters w
• Game playing
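One standard way to estimate that gradient from sampled trajectories is the likelihood-ratio (REINFORCE) estimator; this is a common textbook form shown for context, not necessarily the exact estimator behind these slides:

\nabla_w J(w) = \mathbb{E}_w\Big[\Big(\sum_t \nabla_w \log \pi_w(a_t \mid s_t)\Big)\Big(\sum_t c_t\Big)\Big]
\approx \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_t \nabla_w \log \pi_w(a_t^i \mid s_t^i)\Big)\Big(\sum_t c_t^i\Big)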
Summary
• Goal is to learn utility values of states and
an optimal mapping from states to actions.
• Direct Utility Estimation ignores the dependencies among states; we must follow the Bellman equations.
• Temporal difference updates values to
match those of successor states.
• Active reinforcement learning learns the
optimal mapping from states to actions.