
Unit 4

REINFORCEMENT LEARNING
Reinforcement Learning

• Elements of Reinforcement Learning
• Model-Based Learning
• Temporal Difference Learning
• Generalization
• Design and Analysis of Machine Learning Experiments
Unit 4 2
Introduction

Supervised Learning:
  Example → Class

Reinforcement Learning:
  Situation → Reward → Situation → Reward → …

Unit 4 3
Examples

Playing chess:
  Reward comes at the end of the game

Ping-pong:
  Reward on each point scored

Animals:
  Hunger and pain – negative reward
  Food intake – positive reward
Unit 4 4
Framework: Agent in State Space
Example: XYZ-World (remark: no terminal states)

[Figure: state diagram of 10 states connected by actions e, n, s, w, ne, nw, sw, and a
stochastic action x that reaches one of two successors with probabilities 0.3 and 0.7;
rewards R=+5 in state 3, R=+3 in state 5, R=-9 in state 6, R=+4 in state 8, R=-6 in state 9.]

Problem: What actions should an agent choose to maximize its rewards?
Unit 4 5
XYZ-World: Discussion Problem 12 — Bellman vs. TD for P

[Figure: the XYZ-World diagram annotated with value pairs (3.3, 0.5), (3.2, -0.5), and
(0.6, -0.2) near states 3, 8, and 10.]

I tried hard, but: any better explanations?
Explanation of the discrepancies between TD for P and Bellman:
• Most significant discrepancies are in states 3 and 8; a minor one in state 10
• P chooses the worst successor of 8; it should apply operator x instead
• P should apply w in state 6, but does so only in 2/3 of the cases, which
affects the utility of state 3
• The low utility value of state 8 in TD seems to lower the utility value
of state 10 → only a minor discrepancy
Unit 4 6
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.
XYZ-World: Discussion Problem 12 — Bellman Update, γ=0.2

Utility values computed by the Bellman update (γ=0.2):
State:    1      2     3     4     5     6      7      8     9      10
Utility:  0.145  0.72  0.58  0.03  3.63  -8.27  0.001  3.17  -5.98  0.63

Discussion on using the Bellman update for Problem 12:
• No convergence for γ=1.0; the utility values seem to run away!
• State 3 has utility 0.58 although it gives a reward of +5, due to the
immediate penalty that follows; we were able to detect that.
• Did anybody run the algorithm for other values of γ, e.g. 0.4 or 0.6? If
yes, did it converge to the same values?
• The speed of convergence seems to depend on the value of γ.
Unit 4 7
XYZ-World: Discussion Problem 12 — TD vs. TD with inverse rewards

[Figure: the XYZ-World diagram annotated with (TD, TD-with-inverse-R) value pairs:
(0.57, -0.65), (2.98, -2.99), (-0.50, 0.47), (-0.18, -0.12).]
Other observations:
• The Bellman update did not converge for γ=1
• The Bellman update converged very fast for γ=0.2
• Did anybody try other values for γ (e.g. 0.6)?
• The Bellman update suggests a utility value of 3.6 for state 5; what
does this tell us about the optimal policy? E.g., is 1-2-5-7-4-1 optimal?
• TD reversed the utility values quite neatly when the rewards were inverted;
x became -x+u with u ∈ [-0.08, 0.08].
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.
Unit 4 8
XYZ-World --- Other Considerations
• R(s) might be known in advance or may have to be learnt.
• R(s) might be probabilistic or deterministic.
• R(s) might change over time --- the agent has to adapt.
• The results of actions might be known in advance or may have to be
learnt; the results of actions can be fixed, or may change over time.

Unit 4 9
To be used in Assignment 2:

Example: The ABC-World (remark: no terminal states)

[Figure: state diagram of 10 states connected by actions e, n, s, w, ne, nw, sw, and a
stochastic action x that reaches one of two successors with probabilities 0.1 and 0.9;
rewards R=+5 in state 3, R=-1 in state 4, R=-4 in state 5, R=-9 in state 6, R=-3 in
state 8, R=+8 in state 9, R=+9 in state 10.]

Problem: What actions should an agent choose to maximize its rewards?
Unit 4 10
Basic Notations
• T(s,a,s') denotes the probability of reaching s' when using action a in
state s; it describes the transition model
• A policy π specifies what action to take in every possible state s ∈ S
• R(s) denotes the reward an agent receives in state s
• Utility-based agents learn a utility function over states and use it
to select actions that maximize the expected utility of the outcome
• Q-learning, on the other hand, learns the expected utility of
taking a particular action a in a particular state s (the Q-value of
the pair (s,a))
• Finally, reflex agents learn a policy that maps directly from
states to actions
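As an illustration of these notations (a minimal Python sketch, not part of the original slides), the transition model, rewards, and a policy can be stored as plain dictionaries; the example transitions and rewards below are taken from the XYZ-World figure.

# Sketch of the notation above; dictionary-based representation (illustrative only).
# T[(s, a)] maps each successor s' to the probability T(s, a, s').
# R[s] is the reward received in state s; policy[s] is the action pi(s).
T = {
    (1, "e"): {2: 1.0},            # deterministic move east from state 1
    (8, "x"): {7: 0.3, 9: 0.7},    # stochastic action x from state 8
}
R = {1: 0, 3: +5, 5: +3, 6: -9, 8: +4, 9: -6}
policy = {1: "e", 8: "x"}

def transition_prob(s, a, s_next):
    """Return T(s, a, s'), defaulting to 0 for unlisted transitions."""
    return T.get((s, a), {}).get(s_next, 0.0)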

Unit 4 11
Passive Learning

• We assume the policy Π is fixed.

• In state s we always execute action Π(s)

• Rewards are given.

Unit 4 12
Expected Utility

UΠ(s) = E[ Σ_{t=0..∞} γ^t R(s_t) | Π, s_0 = s ]

Expected sum of rewards when the


policy is followed.
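As a small, hedged illustration (not from the slides), the expected sum above can be evaluated for a single observed trajectory directly from the definition; the per-step reward of -0.04 matches the grid-world example used later in this unit.

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * R(s_t) along one observed trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Three steps costing -0.04 each, then a terminal reward of +1, with gamma = 1:
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=1.0))  # 0.88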

Unit 4 13
Direct Utility Estimation

Convert the problem to a supervised


learning problem:

(1,1) → U = 0.72
(2,1) → U = 0.68

Learn to map states to utilities.

Problem: utilities are not independent of each other!


Unit 4 14
Incorrect formula replaced on March 10, 2006

Bellman Equation

Utility values obey the following equations:


Assume γ =1, for this lecture!

U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')

Can be solved using dynamic programming.


Assumes knowledge of transition model T
and reward R; the result is policy independent!
Unit 4 15
Example

U(1,3) = 0.84
(1,3) (2,3)
U(2,3) = 0.92

We hope to see that:


U(1,3) = -0.04 + U(2,3)

The value is 0.88. Current value


is a bit low and we must increase it.
Unit 4 16
Bellman Update (Section 17.2 of textbook)

If we apply the Bellman update indefinitely


often, we obtain the utility values that are the
solution for the Bellman equation!!
Bellman Update:
U_{i+1}(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s')

Some Equations for the XYZ World:


U_{i+1}(1) = 0 + γ U_i(2)
U_{i+1}(5) = 3 + γ max(U_i(7), U_i(8))
U_{i+1}(8) = 4 + γ max(U_i(6), 0.3 U_i(7) + 0.7 U_i(9))
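A minimal sketch (assuming the dictionary-based T and R representation shown in the earlier sketch; not from the slides) of repeating the Bellman update until the values stop changing:

def bellman_update(U, states, actions, T, R, gamma):
    """One synchronous sweep: U_{i+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') * U_i(s')."""
    U_new = {}
    for s in states:
        # Assumes every state has at least one action listed in T and that
        # every listed successor is itself a member of `states`.
        best = max(sum(p * U[s2] for s2, p in T[(s, a)].items())
                   for a in actions if (s, a) in T)
        U_new[s] = R.get(s, 0.0) + gamma * best
    return U_new

def value_iteration(states, actions, T, R, gamma=0.2, tol=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        U_new = bellman_update(U, states, actions, T, R, gamma)
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

With γ=0.2 this should converge quickly, consistent with the convergence observations on the earlier discussion slide.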

Unit 4 17
Updating Estimations Based on Observations:
New_Estimation = Old_Estimation*(1-α) + Observed_Value*α
New_Estimation = Old_Estimation + Observed_Difference*α
Example: Measure the utility of a state s whose current value is 2; the
observed values are 3 and 3, and the learning rate α is 0.2:

Initial utility value: 2
Utility value after observing 3: 2x0.8 + 3x0.2 = 2.2
Utility value after observing 3, 3: 2.2x0.8 + 3x0.2 = 2.36
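The same computation as a tiny Python sketch (not from the slides), reproducing the numbers above:

def update(old_estimate, observed, alpha=0.2):
    """New_Estimation = Old_Estimation*(1 - alpha) + Observed_Value*alpha"""
    return old_estimate * (1 - alpha) + observed * alpha

u = 2.0
u = update(u, 3)   # 2.2
u = update(u, 3)   # 2.36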

Unit 4 18
Markov Decision Processes - Recitation
• M = (S,A,T,R)
– S: States
– A: Actions
– T: Transition Probability
– R: Reward
• Goal of the MDP?
– Finding the Policy
• Policy is the mapping from state to action
– Two well known objectives
• Average Reward
• Discounted Reward

Unit 4 19
Reinforcement Learning
• No knowledge of environment
– Can only act in the world and observe states and reward
• Many factors make RL difficult:
– Actions have non-deterministic effects
• Which are initially unknown
– Rewards / punishments are infrequent
• Often at the end of long sequences of actions
• How do we determine what action(s) were really responsible for
reward or punishment?
(credit assignment)
– World is large and complex
• Nevertheless learner must decide what actions to take
– We will assume the world behaves as an MDP

Unit 4 20
Delayed Reward Problem

• Delayed rewards make it hard to learn

• The choice in state S was important, but it seems that the action in S'
led to the big reward in S''
• How do you deal with this problem?

  S → S' → S'' (big reward)

• How about the performance comparison of VI and
PI in this example?
Unit 4 21
Reinforcement Learning
• Something is unknown
• Learning and planning at the same time
• The ultimate learning and planning paradigm
• Scalability is a big issue; very challenging!
• Ironically, RL has been more successful in real applications than
STRIPS planning!

– Zhang, W., Dietterich, T. G., (1995). A Reinforcement Learning


Approach to Job-shop Scheduling
– G. Tesauro (1994). "TD-Gammon, A Self-Teaching Backgammon
Program Achieves Master-level Play" in Neural Computation
– Reinforcement Learning for Vulnerability Assessment in Peer-to-
Peer Networks, IAAI 2008
• Policy Gradient Update

Unit 4 22
Two Key Aspects in RL
• How do we update the value function or policy?
– How do we form the training data?
– Sequences of (s,a,r)…

• How do we explore?
– Exploitation vs. exploration

Unit 4 23
Category of Reinforcement Learning
• Model-based RL
– Constructs domain transition model, MDP
• E3 – Kearns and Singh
• Model-free RL
– Only concerns policy
• Q-Learning - Watkins
• Active Learning (Off-Policy Learning)
– Q-Learning
• Passive Learning (On-Policy learning)
– Sarsa - Sutton
Unit 4 24
Passive RL
”If you study hard, you will be blessed”
“what is the value of studying hard?”
“You will learn from RL”

Unit 4

25
Policy Evaluation
• Remember the formula?
V ( s )  R( s )  β  T ( s,  ( s ), s ' ) V ( s ' )
s'
• Can we do that with RL?
– What is missing?
– What needs to be done?

• What do we do after policy evaluation?


– Policy Update
Unit 4 26
Example: Passive RL

• Suppose given policy


• Want to determine how good it is

Unit 4 27
Objective: Value Function

Unit 4 28
Passive RL
• Given a policy π,
– estimate Vπ(s)
• Not given
– the transition matrix, nor
– the reward function!
• Simply follow the policy for many
epochs
• Epochs: training sequences
(1,1)→(1,2)→(1,3)→(1,2)→(1,3)→(2,3)→(3,3)→(3,4) +1
(1,1)→(1,2)→(1,3)→(2,3)→(3,3)→(3,2)→(3,3)→(3,4) +1
(1,1)→(2,1)→(3,1)→(3,2)→(4,2) -1

29
Approach 1: Direct Estimation
• Direct estimation (model free)
– Estimate V(s) as average total reward of epochs
containing s (calculating from s to end of epoch)
• Reward to go of a state s
the sum of the (discounted) rewards from
that state until a terminal state is reached
• Key: use observed reward to go of the state
as the direct evidence of the actual expected
utility of that state
• Averaging the reward-to-go samples will
converge to the true value of the state
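A hedged sketch (not from the slides) of direct utility estimation: average the observed reward-to-go over every visit to each state. Epochs are assumed to be given as lists of (state, reward) pairs.

from collections import defaultdict

def direct_utility_estimates(epochs, gamma=1.0):
    """Average the observed (discounted) reward-to-go of every visit to each state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for epoch in epochs:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go accumulates incrementally.
        for state, reward in reversed(epoch):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}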
Unit 4 30
Passive RL
• Given a policy π,
– estimate Vπ(s)
• Not given
– the transition matrix, nor
– the reward function!
• Simply follow the policy for many
epochs
• Epochs: training sequences
(1,1)→(1,2)→(1,3)→(1,2)→(1,3)→(2,3)→(3,3)→(3,4) +1
0.57 0.64 0.72 0.81 0.9
(1,1)→(1,2)→(1,3)→(2,3)→(3,3)→(3,2)→(3,3)→(3,4) +1
(1,1)→(2,1)→(3,1)→(3,2)→(4,2) -1
31
Direct Estimation
• Converges very slowly to the correct utility values (requires a
lot of sequences)

• Doesn't exploit the Bellman constraints on policy values:

Vπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Vπ(s')

How can we incorporate these constraints?

Unit 4 32
Approach 2: Adaptive Dynamic Programming (ADP)

• ADP is a model-based approach
– Follow the policy for a while
– Estimate the transition model based on observations
– Learn the reward function
– Use the estimated model to compute the utility of the policy

Vπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Vπ(s')   (with learned T and R)
• How can we estimate transition model T(s,a,s’)?
– Simply the fraction of times we see s’ after taking a in state s.
– NOTE: Can bound error with Chernoff bounds if we want
(will see Chernoff bound later in course)
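A minimal sketch of ADP's two ingredients (not from the slides; names and the fixed number of sweeps are illustrative): estimate T(s,a,s') as observed frequencies, then evaluate the policy with the learned model.

from collections import defaultdict

def estimate_model(observed_transitions):
    """T(s,a,s') = fraction of times s' followed action a in state s.
    `observed_transitions` is a list of (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s2 in observed_transitions:
        counts[(s, a)][s2] += 1
    return {(s, a): {s2: c / sum(succ.values()) for s2, c in succ.items()}
            for (s, a), succ in counts.items()}

def evaluate_policy(policy, T, R, gamma, states, sweeps=100):
    """Iterate V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') * V(s').
    Assumes every observed successor is a member of `states`."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R.get(s, 0.0) +
                gamma * sum(p * V[s2] for s2, p in T.get((s, policy[s]), {}).items())
             for s in states}
    return V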

Unit 4 33
ADP learning curves
[Figure: ADP learning curves — utility estimates for states (4,3), (3,3), (2,3), (1,1),
(3,1), (4,1), and (4,2), plotted over the course of training.]
Unit 4 34
Approach 3: Temporal Difference Learning (TD)

• Can we avoid the computational expense of full DP policy


evaluation?
• Temporal Difference Learning
– Do local updates of utility/value function on a per-action basis
– Don’t try to estimate entire transition function!
– For each transition from s to s', we perform the following update:

Vπ(s) ← Vπ(s) + α (R(s) + γ Vπ(s') − Vπ(s))
(α: learning rate, γ: discount factor)
• Intuitively, this moves us closer to satisfying the Bellman constraint:

Vπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Vπ(s')
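A one-line version of this update in Python (a sketch, not from the slides), applied to a single observed transition with illustrative values:

def td_update(V, s, reward, s_next, alpha, gamma):
    """V(s) <- V(s) + alpha * (R(s) + gamma*V(s') - V(s))"""
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

V = {"A": 0.0, "B": 0.5}
td_update(V, "A", reward=-0.04, s_next="B", alpha=0.1, gamma=1.0)
print(V["A"])  # 0.046: moved 10% of the way toward -0.04 + V("B") = 0.46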

Unit 4 35
Aside: Online Mean Estimation

• Suppose that we want to incrementally compute the


mean of a sequence of numbers
– E.g. to estimate the expected value of a random variable from a
sequence of samples.
1 n 1 1 n 1  1 n 
Xˆ n 1  
n  1 i 1
xi   xi 
n i 1
 xn 1   xi 
n 1 n i 1 

 Xˆ n 
1
n 1

xn 1  Xˆ n 
average of n+1 samples sample n+1
learning rate
• Given a new sample x(n+1), the new mean is the old
estimate (for n samples) plus the weighted difference
between the new sample and old estimate
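A tiny Python sketch of this online mean (illustrative, not from the slides); with a learning rate of 1/(n+1) it reproduces the ordinary batch mean exactly:

def online_mean(samples):
    """X_{n+1} = X_n + (x_{n+1} - X_n) / (n+1), starting from X_0 = 0."""
    mean = 0.0
    for n, x in enumerate(samples):
        mean += (x - mean) / (n + 1)
    return mean

print(online_mean([3, 3, 6]))  # 4.0, the same as the batch mean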

Unit 4 36
Approach 3: Temporal Difference Learning (TD)

• TD update for a transition from s to s':

Vπ(s) ← Vπ(s) + α (R(s) + γ Vπ(s') − Vπ(s))

(α: learning rate; R(s) + γ Vπ(s') is a (noisy) sample of utility based on the next state)

• So the update is maintaining a “mean” of the


(noisy) utility samples
• If the learning rate decreases with the number
of samples (e.g. 1/n) then the utility estimates
will converge to true values!
Vπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Vπ(s')
Unit 4 37
Approach 3: Temporal Difference Learning (TD)

• TD update for a transition from s to s':

Vπ(s) ← Vπ(s) + α (R(s) + γ Vπ(s') − Vπ(s))

(α: learning rate; R(s) + γ Vπ(s') is a (noisy) sample of utility based on the next state)

• When V satisfies Bellman constraints then


expected update is 0.

Vπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Vπ(s')

Unit 4 38
The TD learning curve

• Tradeoff: requires more training experience (epochs) than


ADP but much less computation per epoch
• Choice depends on relative cost of experience vs. computation
39
Unit 4
Comparisons
• Direct Estimation (model free)
– Simple to implement
– Each update is fast
– Does not exploit Bellman constraints
– Converges slowly
• Adaptive Dynamic Programming (model based)
– Harder to implement
– Each update is a full policy evaluation (expensive)
– Fully exploits Bellman constraints
– Fast convergence (in terms of updates)
• Temporal Difference Learning (model free)
– Update speed and implementation similar to direct estimation
– Partially exploits Bellman constraints---adjusts state to ‘agree’ with observed
successor
• Not all possible successors
– Convergence in between direct estimation and ADP

40
Active RL
”If you study hard, you will be blessed”
“what is the value of studying hard?”
“You will learn from RL”

“Well, I will not study hard always, I will try some music playing career and see if I
can be an American Idol” - Sanjaya

Unit 4

41
Active Reinforcement Learning

• So far, we've assumed the agent has a fixed policy

– We just try to learn how good it is
• Now, suppose the agent must learn a good policy
(ideally optimal)
– While acting in an uncertain world

Unit 4 42
Naïve Approach
1. Act Randomly for a (long) time
– Or systematically explore all possible actions
2. Learn
– Transition function
– Reward function
3. Use value iteration, policy iteration, …
4. Follow resulting policy thereafter.
Will this work? Yes (if we do step 1 long enough)
Any problems? We will act randomly for a long time
before exploiting what we know.
Unit 4 43
Revision of Naïve Approach
1. Start with an initial utility/value function and an initial model
2. Take the greedy action according to the value function
(this requires using our estimated model to do "lookahead")
3. Update the estimated model
4. Perform a step of ADP ;; update the value function
5. Goto 2
This is just ADP, but we follow the greedy policy suggested by the current
value estimate

Will this work? No. Gets stuck in local minima.


What can be done?

Unit 4 44
Exploration versus Exploitation
• Two reasons to take an action in RL
– Exploitation: To try to get reward. We exploit our
current knowledge to get a payoff.
– Exploration: Get more information about the world.
How do we know there is not a pot of gold around
the corner?
• To explore we typically need to take actions that
do not seem best according to our current model.
• Managing the trade-off between exploration and
exploitation is a critical issue in RL
• Basic intuition behind most approaches:
– Explore more when knowledge is weak
– Exploit more as we gain knowledge
Unit 4 45
ADP-based RL
1. Start with initial utility/value function
2. Take action according to an explore/exploit policy
(explores more early on and gradually becomes greedy)
3. Update estimated model
4. Perform step of ADP
5. Goto 2
This is just ADP, but we follow the explore/exploit policy

Will this work? It depends on the explore/exploit policy.
Any ideas?

Unit 4 46
Explore/Exploit Policies

• Greedy action is action maximizing estimated Q-value


Q(s,a) = R(s) + γ Σ_{s'} T(s,a,s') V(s')
– where V is current value function estimate, and R, T are current estimates of
model
– Q(s,a) is the expected value of taking action a in state s and then getting the
estimated value V(s’) of the next state s’
• Want an exploration policy that is greedy in the limit of infinite
exploration (GLIE)
– Guarantees convergence
• Solution 1
– On time step t select random action with probability p(t) and greedy action
with probability 1-p(t)
– p(t) = 1/t will lead to convergence, but is slow
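A sketch of Solution 1 (not from the slides; the Q-table lookup and default value are illustrative):

import random

def glie_action(t, state, actions, Q):
    """With probability p(t) = 1/t pick a random action, otherwise the greedy one."""
    if random.random() < 1.0 / max(t, 1):
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))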

Unit 4 47
Explore/Exploit Policies

• Greedy action is action maximizing estimated Q-value


Q(s,a) = R(s) + γ Σ_{s'} T(s,a,s') V(s')
– where V is current value function estimate, and R, T are current
estimates of model
• Solution 2: Boltzmann Exploration
– Select action a with probability

Pr(a|s) = exp(Q(s,a)/T) / Σ_{a'∈A} exp(Q(s,a')/T)

– T is the temperature. Large T means that each action has about the
same probability. Small T leads to more greedy behavior.
– Typically start with large T and decrease with time
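A sketch of Boltzmann exploration (not from the slides; no numerical safeguards for very small temperatures):

import math, random

def boltzmann_action(state, actions, Q, temperature):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    prefs = [math.exp(Q.get((state, a), 0.0) / temperature) for a in actions]
    r, cumulative = random.random() * sum(prefs), 0.0
    for a, p in zip(actions, prefs):
        cumulative += p
        if r <= cumulative:
            return a
    return actions[-1]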
Unit 4 48
Alternative Approach:
Exploration Functions

1. Start with initial utility/value function


2. Take greedy action
3. Update estimated model
4. Perform step of optimistic ADP
(uses exploration function in value iteration)
(inflates value of actions leading to unexplored regions)
5. Goto 2

What do we mean by exploration function in VI?

Unit 4 49
Exploration Functions
• Recall that value iteration iteratively performs the following update at all
states:

V(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') V(s')

– We want the update to make actions that lead to unexplored regions
look good
• Implemented by an exploration function f(u,n):
– assigns a higher utility estimate to relatively unexplored action-
state pairs
– change the value-function update rule to

V+(s) ← R(s) + γ max_a f( Σ_{s'} T(s,a,s') V+(s'), N(a,s) )

– V+ denotes the optimistic estimate of the utility
– N(s,a) = number of times that action a was taken from state s

What properties should f(u,n) have?
Unit 4 50
Exploration Functions

 
V+(s) ← R(s) + γ max_a f( Σ_{s'} T(s,a,s') V+(s'), N(a,s) )

• Properties of f(u,n)?
– If n > Ne, return u, i.e. the normal utility
– Else, return R+, i.e. the maximum possible value of a state
• The agent will initially behave as if there were wonderful rewards
scattered all around – optimistic.
• But after actions have been tried enough times, we will perform standard
"non-optimistic" value iteration
• Note that all of these exploration approaches assume that
exploration will not lead to unrecoverable disasters (falling off a
cliff).
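A sketch of one possible f(u,n) with these properties (Ne and R+ below are illustrative constants, not values given in the lecture):

def exploration_f(u, n, N_e=5, R_plus=10.0):
    """Return the optimistic value R+ until (s,a) has been tried more than N_e times,
    then the real estimate u."""
    return u if n > N_e else R_plus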

Unit 4 51
TD-based Active RL
1. Start with initial utility/value function
2. Take action according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update

V(s) ← V(s) + α (R(s) + γ V(s') − V(s))
V(s) is new estimate of optimal value function at state s.
5. Goto 2
Just like TD for passive RL, but we follow explore/exploit policy

Given the usual assumptions about learning rate and GLIE,


TD will converge to an optimal value function!
Unit 4 52
TD-based Active RL
1. Start with initial utility/value function
2. Take action according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE)
3. Update estimated model
4. Perform TD update

V(s) ← V(s) + α (R(s) + γ V(s') − V(s))
V(s) is new estimate of optimal value function at state s.
5. Goto 2

Requires an estimated model. Why?


To compute Q(s,a) for greedy policy execution
Can we construct a model-free variant?

Unit 4 53
Q-Learning: Model-Free RL
• Instead of learning the optimal value function V, directly learn the
optimal Q function.
– Recall Q(s,a) is expected value of taking action a in state s and then
following the optimal policy thereafter
• The optimal Q-function satisfies V(s) = max_{a'} Q(s,a'), which gives:

Q(s,a) = R(s) + γ Σ_{s'} T(s,a,s') V(s')
       = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')

• Given the Q-function we can act optimally by selecting actions greedily
according to Q(s,a)

How can we learn the Q-function directly?


Unit 4 54
Q-Learning: Model-Free RL
Bellman constraints on optimal Q-function:
Q(s,a) = R(s) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')

• We can perform updates after each action just


like in TD.
– After taking action a in state s and reaching state s’
do:
(note that we directly observe reward R(s))
Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))

(R(s) + γ max_{a'} Q(s',a') is a (noisy) sample of the Q-value based on the next state)

Unit 4 55
Q-Learning
1. Start with initial Q-function (e.g. all zeros)
2. Take action according to an explore/exploit policy
(should converge to greedy policy, i.e. GLIE)
3. Perform the TD update

Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))
Q(s,a) is current estimate of optimal Q-function.
4. Goto 2

• Does not require a model, since we learn Q directly!
• Uses an explicit |S|x|A| table to represent Q
• The explore/exploit policy directly uses Q-values
– E.g. use Boltzmann exploration.
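A tabular Q-learning sketch (not from the slides). It assumes a hypothetical env_step(s, a) function that acts in the environment and returns (reward, next_state); here epsilon-greedy exploration stands in for the GLIE policy.

import random
from collections import defaultdict

def q_learning_episode(env_step, start_state, actions, Q,
                       alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100):
    """Run one episode, updating Q(s,a) after every transition."""
    s = start_state
    for _ in range(max_steps):
        if random.random() < epsilon:                       # explore
            a = random.choice(actions)
        else:                                               # exploit
            a = max(actions, key=lambda act: Q[(s, act)])
        reward, s_next = env_step(s, a)                     # hypothetical environment call
        best_next = max(Q[(s_next, act)] for act in actions)
        # Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

Q = defaultdict(float)   # start with all zeros, as step 1 suggests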

Unit 4 56
Direct Policy RL
• Why?
• Again, ironically, policy-gradient-based approaches have been
successful in many real applications
• Actually, we do this ourselves:
– From the Ten Commandments:
– "You shall not murder"
– "Do not have any other gods before me"
• How do we design a policy with parameters?
– A multinomial distribution over the actions in a state
– A binary classification for each action in a state

Unit 4 57

Policy Gradient Algorithm
J(w) = E_w[ Σ_t γ^t c_t ]   (failure prob., makespan, …)

• Minimise J by
– computing the gradient ∇_w J(w)
– stepping the parameters: w_{t+1} = w_t − α ∇_w J(w)
• until convergence

• Gradient estimate [Sutton et al. '99, Baxter & Bartlett '01]

• Monte Carlo estimate from a trace s_1, a_1, c_1, …, s_T, a_T, c_T
– e_{t+1} = e_t + ∇_w log Pr(a_{t+1} | s_t, w_t)
– w_{t+1} = w_t − α_t c_t e_{t+1}

• Successfully used in many applications!
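A REINFORCE-style sketch of these two update rules for a tabular softmax policy (an illustrative simplification, not the exact algorithm of the cited papers; the step size and cost convention are assumptions):

import math

def softmax_probs(theta, s, actions):
    prefs = [math.exp(theta.get((s, a), 0.0)) for a in actions]
    z = sum(prefs)
    return [p / z for p in prefs]

def policy_gradient_update(theta, trace, actions, step_size=0.01):
    """One Monte Carlo update from a trace of (s, a, c) triples, minimising total cost."""
    total_cost = sum(c for _, _, c in trace)
    for s, a, _ in trace:
        probs = softmax_probs(theta, s, actions)
        for act, p in zip(actions, probs):
            grad = (1.0 if act == a else 0.0) - p          # d log Pr(a|s,w) / d theta(s,act)
            theta[(s, act)] = theta.get((s, act), 0.0) - step_size * total_cost * grad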


Unit 4 58
Applications

• Game Playing

• Checker playing program by Arthur Samuel


(IBM)

• Update rule: change the weights by the difference
between the value of the current state and the backed-up value
obtained by generating a full look-ahead tree

Unit 4 59
Summary
• Goal is to learn utility values of states and
an optimal mapping from states to actions.
• Direct utility estimation ignores
dependencies among states → we must
follow the Bellman equations.
• Temporal difference updates values to
match those of successor states.
• Active reinforcement learning learns the
optimal mapping from states to actions.
Unit 4 60
