ORF 544 Week 10: Approximate Dynamic Programming and Q-Learning
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
X_t^*(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + \mathbb{E}\Big\{ \max_{\pi} \mathbb{E}\Big[ \textstyle\sum_{t'=t+1}^{T} C(S_{t'}, X_{t'}^{\pi}(S_{t'})) \,\Big|\, S_{t+1} \Big] \,\Big|\, S_t, x_t \Big\} \Big)
X_t^*(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t, x_t \big] \Big)
State variables: E_t^{wind} (wind energy), D_t (demand), p_t^{grid} (grid price), R_t^{battery} (energy stored in the battery).
E_{t+1}^{wind} = E_t^{wind} + \hat{E}_{t+1}^{wind}
R_{t+1}^{battery} = R_t^{battery} + A x_t
x_t = controllable inputs
V(S_t) = \min_{x_t \in \mathcal{X}} \Big( C(S_t, x_t) + \mathbb{E}\big[ V(S_{t+1}(S_t, x_t, W_{t+1})) \mid S_t \big] \Big)
End step 4;
End Step 3;
Find V_t^*(R_t) = \max_{x_t} Q(R_t, x_t)
Store X_t^*(R_t) = \arg\max_{x_t} Q(R_t, x_t). (This is our policy.)
End Step 2;
End Step 1;
Curse of dimensionality
Dynamic programming in multiple dimensions
Step 0: Initialize V_{T+1}(S_{T+1}) = 0 for all states.
Step 1: Step backward t = T, T-1, T-2, ...
Step 2: Loop over S_t = (R_t, D_t, p_t, E_t) (four loops)
Step 3: Loop over all decisions x_t (all dimensions)
Step 4: Take the expectation over each random dimension \hat{D}_t, \hat{p}_t, \hat{E}_t:
Compute Q(S_t, x_t) = C(S_t, x_t) + \sum_{w_1=0}^{100} \sum_{w_2=0}^{100} \sum_{w_3=0}^{100} V_{t+1}\big(S^M(S_t, x_t, W_{t+1}(w_1, w_2, w_3))\big)\, \mathbb{P}^W(w_1, w_2, w_3)
End step 4;
End Step 3;
Find V_t^*(S_t) = \max_{x_t} Q(S_t, x_t)
Store X_t^*(S_t) = \arg\max_{x_t} Q(S_t, x_t). (This is our policy.)
End Step 2;
End Step 1;
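The four steps above map directly onto nested loops. Below is a minimal Python sketch of the backward pass, written under the assumption that states, decisions, and outcomes have already been discretized into lists; contribution(), transition(), and prob_w() are hypothetical placeholders for C(S_t, x_t), S^M(S_t, x_t, W_{t+1}), and P^W(w).

```python
# Minimal sketch of the backward dynamic programming loop above.  The lists
# `states`, `decisions`, and `outcomes` are assumed to be finite discretizations,
# and contribution(), transition(), prob_w() are hypothetical problem-specific
# stand-ins for C(S_t, x_t), S^M(S_t, x_t, W_{t+1}), and P^W(w).

def backward_dp(T, states, decisions, outcomes, contribution, transition, prob_w):
    V = {(T + 1, s): 0.0 for s in states}            # Step 0: terminal values
    policy = {}
    for t in range(T, -1, -1):                       # Step 1: step backward in time
        for s in states:                             # Step 2: loop over all states
            best_q, best_x = float("-inf"), None
            for x in decisions:                      # Step 3: loop over all decisions
                q = contribution(t, s, x)
                for w in outcomes:                   # Step 4: expectation over W_{t+1}
                    q += prob_w(w) * V[(t + 1, transition(t, s, x, w))]
                if q > best_q:
                    best_q, best_x = q, x
            V[(t, s)] = best_q                       # V_t*(s) = max_x Q(s, x)
            policy[(t, s)] = best_x                  # X_t*(s) = argmax_x Q(s, x)
    return V, policy
```

The three nested loops over states, decisions, and outcomes are exactly the three curses of dimensionality discussed next.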
Curse of dimensionality
Notes:
» There are potentially three “curses of dimensionality”
when using backward dynamic programming:
• The state variable – We need to enumerate all states. If the
state variable is a vector with more than two dimensions, the
state space gets very big, very quickly.
• The random information – We have to sum over all possible
realizations of the random variable. If this has more than two
dimensions, this gets very big, very quickly.
• The decisions – Again, decisions may be vectors (our energy
example has five dimensions). Same problem as the other
two.
» Some problems fit this framework, but not very many.
However, when we can use this framework, we obtain
something quite rare: an optimal policy.
Curse of dimensionality
Strategies for approximating value functions:
» Backward dynamic programming
• Exact using lookup tables
• Backward approximate dynamic programming:
– Linear regression
– Low rank approximations
» Forward approximate dynamic programming
• Approximation architectures
– Lookup tables
» Correlated beliefs
» Hierarchical
– Linear models
– Convex/concave
• Updating schemes
– Pure forward pass TD(0)
– Double pass TD(1)
Backward ADP (Chapter 16)
Compute Q(\hat{s}_t, x_t) = C(\hat{s}_t, x_t) + \sum_{w_1=0}^{100} \sum_{w_2=0}^{100} \sum_{w_3=0}^{100} \bar{V}_{t+1}\big(S^M(\hat{s}_t, x_t, W_{t+1}(w_1, w_2, w_3))\big)\, \mathbb{P}^W(w_1, w_2, w_3)
End step 4;
End Step 3;
Find \hat{v}_t(\hat{s}_t) = \max_{x_t} Q(\hat{s}_t, x_t)
End Step 2;
Use the sampled \hat{v}_t(\hat{s}_t)'s to fit an approximate \bar{V}_t(s).
End Step 1;
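A rough sketch of this idea in Python, assuming a linear architecture (one of the approximation strategies listed earlier); sample_states(), features(), and sampled_value() are hypothetical placeholders for the problem-specific pieces.

```python
import numpy as np

# Sketch of backward approximate dynamic programming with a linear architecture:
# at each time t we visit only a sample of states, compute v_hat_t(s_hat) using
# the downstream approximation, and regress v_hat on the features phi(s).
# sample_states(), features(), and sampled_value() are hypothetical placeholders.

def backward_adp(T, sample_states, features, sampled_value, n_samples=100):
    theta = {T + 1: None}                 # terminal approximation: V_{T+1} = 0
    for t in range(T, -1, -1):
        X, y = [], []
        for s_hat in sample_states(t, n_samples):
            # v_hat_t(s_hat) = max_x Q(s_hat, x), where Q uses theta[t+1]
            # in place of the exact downstream value function V_{t+1}
            y.append(sampled_value(t, s_hat, theta[t + 1]))
            X.append(features(s_hat))
        theta[t], *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return theta                          # theta[t] defines V_bar_t(s) = theta[t] . phi(s)
```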
Backward ADP
Backward ADP with the post-decision state
» Computing the embedded expectation can be a pain.
» Instead of sampling over (pre-decision) states, sample post-decision states.
» Then draw a sample of the exogenous information and simulate our way to the next pre-decision state.
» From there, compute the sampled value.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University.
Histories of ADP/reinforcement learning
History of Q-learning
» Began ~1980 with Andy Barto (supervisor) and Rich Sutton
(student) studying behavior of mice in mazes.
» Heuristically developed basic feedback
mechanism:
\hat{q}^n(s^n, a^n) = C(s^n, a^n) + \gamma \max_{a'} \bar{Q}^{n-1}(s', a')
\bar{Q}^n(s^n, a^n) = (1 - \alpha_{n-1})\,\bar{Q}^{n-1}(s^n, a^n) + \alpha_{n-1}\,\hat{q}^n(s^n, a^n)
(A small code sketch of this update follows these notes.)
» Late 1980’s the link to Markov
decision processes was made, and the
community adopted the basic notation
of Markov decision processes.
» Bizarrely, the RL community adopted popular test problems from
the controls community, which are primarily deterministic:
• Inverted pendulum problem.
• Hill-climbing
• Truck backer-upper
• Robotics applications.
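As promised above, here is a minimal sketch of the q̂/Q̄ update, assuming a tabular representation; C(s, a), sample_next_state(s, a), and the values of gamma and alpha are hypothetical placeholders.

```python
# One step of tabular Q-learning, matching the two update equations earlier on
# this slide.  Q is a dictionary keyed by (state, action); C(s, a),
# sample_next_state(s, a), gamma, and alpha are hypothetical placeholders.

def q_learning_step(Q, s, a, actions, C, sample_next_state, gamma=0.95, alpha=0.1):
    s_next = sample_next_state(s, a)                            # s' = S^M(s, a, W)
    q_hat = C(s, a) + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * q_hat  # smoothing update
    return s_next
```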
Histories of ADP/reinforcement learning
Reinforcement learning
» TD-learning came to be known as "reinforcement learning."
» Receive reward = 1 when the goal is reached; the next state is s' = S^M(s^n, a^n, W^{n+1}).
» Exploration
• Pick a state at random
» Hybrid
• Use trajectory following with randomization, e.g.
\bar{F}^{\pi,n} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T-1} C\big(S_t, X^{\pi}(S_t), W_{t+1}(\omega^n)\big), \quad \text{where } S_{t+1} = S^M\big(S_t, X^{\pi}(S_t), W_{t+1}(\omega^n)\big)
» Off-policy learning:
• Sample actions according to a learning policy (called “behavior
policy” in the RL literature). This is the policy used for learning the
implementation policy (called the “target policy” in the RL literature).
• Needs to be combined with a state sampling policy.
Hint: Optimal Q = 0!
Algorithms
Relationship to TD-learning
» Approximate value iteration uses TD(0) updates.
» Approximate policy iteration uses TD(1) updates.
Approximate value iteration
Step 1: Start with a pre-decision state S_t^n.
Step 2: Solve the deterministic optimization using an approximate value function (a linear model for the post-decision state):
\hat{v}^m = \min_x \Big( C(S^m, x) + \sum_f \bar{\theta}_f^{n-1} \phi_f\big(S^M(S^m, x)\big) \Big)
to obtain x^m.
Step 3: Update the value function approximation (recursive statistics):
\bar{V}_{t-1}^n(S_{t-1}^{x,n}) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat{v}_t^n
Step 4: Obtain a Monte Carlo sample of W_{t+1}(\omega^n) and compute the next pre-decision state (simulation).
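As a rough illustration of these four steps, here is a forward-pass sketch that uses a simple lookup table on the post-decision state (the slide uses a linear model; a table keeps the sketch short). decisions(), contribution(), post_state(), next_state(), and sample_W() are hypothetical placeholders, and the sketch is written as a maximization.

```python
from collections import defaultdict

# Sketch of approximate value iteration (TD(0)): repeated forward passes where
# each decision is made against the current approximation V_bar, and the sampled
# value v_hat is smoothed into the value of the *previous* post-decision state.
# decisions(), contribution(), post_state(), next_state(), sample_W() are
# hypothetical placeholders; the slide's linear model is replaced by a lookup table.

def approximate_value_iteration(T, N, S0, decisions, contribution,
                                post_state, next_state, sample_W, alpha=0.1):
    V_bar = defaultdict(float)                      # V_bar(post-decision state)
    for n in range(N):                              # forward passes (iterations)
        S, Sx_prev = S0, None
        for t in range(T):
            # Step 2: deterministic optimization against the current approximation
            scored = [(contribution(t, S, x) + V_bar[post_state(t, S, x)], x)
                      for x in decisions(t, S)]
            v_hat, x_best = max(scored, key=lambda c: c[0])
            # Step 3: smooth v_hat into the previous post-decision state's value
            if Sx_prev is not None:
                V_bar[Sx_prev] = (1 - alpha) * V_bar[Sx_prev] + alpha * v_hat
            Sx_prev = post_state(t, S, x_best)
            # Step 4: sample W_{t+1} and simulate to the next pre-decision state
            S = next_state(t, S, x_best, sample_W(t))
    return V_bar
```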
Approximate policy iteration
Step 1: Start with a pre-decision state S_t^n.
Step 2: Inner loop: do for m = 1, ..., M:
Step 2a: Solve the deterministic optimization using an approximate value function:
\hat{v}^m = \min_x \Big( C(S^m, x) + \bar{V}^{n-1}\big(S^{M,x}(S^m, x)\big) \Big)
to obtain x^m.
Step 2b: Update the value function approximation:
\bar{V}^{n-1,m}(S^{x,m}) = (1 - \alpha_{m-1})\,\bar{V}^{n-1,m-1}(S^{x,m}) + \alpha_{m-1}\,\hat{v}^m
Step 2c: Obtain a Monte Carlo sample of W(\omega^m) and compute the next pre-decision state:
S^{m+1} = S^M(S^m, x^m, W(\omega^m))
Step 3: Update \bar{V}^n(S) using \bar{V}^{n-1,M}(S) and return to Step 1.
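For contrast with the value-iteration sketch above, here is a similarly hedged sketch of approximate policy iteration: the value function used inside the optimization is frozen for M inner transitions, and the smoothed update is applied to the previous post-decision state, as in the value-iteration sketch. The helper functions are again hypothetical placeholders.

```python
from collections import defaultdict

# Sketch of approximate policy iteration.  V_prev (the policy's value function)
# is held fixed while M sampled transitions are smoothed into a working copy;
# the working copy then becomes the new V_bar^n.  decisions(), contribution(),
# post_state(), next_state(), and sample_W() are hypothetical placeholders.

def approximate_policy_iteration(N, M, S0, decisions, contribution,
                                 post_state, next_state, sample_W):
    V_prev = defaultdict(float)                    # V_bar^{n-1}, fixed in the inner loop
    for n in range(N):
        V_work, S, Sx_prev = V_prev.copy(), S0, None
        for m in range(1, M + 1):
            alpha = 1.0 / m                        # a simple declining stepsize
            # Step 2a: optimize against the frozen approximation V_prev
            scored = [(contribution(S, x) + V_prev[post_state(S, x)], x)
                      for x in decisions(S)]
            v_hat, x_best = max(scored, key=lambda c: c[0])
            # Step 2b: smooth v_hat into the working copy (previous post-decision state)
            if Sx_prev is not None:
                V_work[Sx_prev] = (1 - alpha) * V_work[Sx_prev] + alpha * v_hat
            Sx_prev = post_state(S, x_best)
            # Step 2c: simulate to the next pre-decision state
            S = next_state(S, x_best, sample_W())
        V_prev = V_work                            # Step 3: V_bar^n <- V_bar^{n-1,M}
    return V_prev
```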
Approximate policy iteration
Step 1: Start with a pre-decision state S_t^n.
Step 2: Inner loop: do for m = 1, ..., M:
Step 2a: Solve the deterministic optimization using an approximate value function:
\hat{v}^m = \min_x \Big( C(S^m, x) + \sum_f \bar{\theta}_f^{n-1} \phi_f\big(S^M(S^m, x)\big) \Big)
to obtain x^m.
[Figure: the nomadic trucker in Texas; the state is S_t = (TX, \hat{D}_t), and the available loads pay $350, $150, $450, and $300.]
Approximate dynamic programming
We use initial value function approximations…
[Figure: initial value function approximations \bar{V}^0(MN) = 0, \bar{V}^0(NY) = 0, \bar{V}^0(CO) = 0, \bar{V}^0(CA) = 0; from state S_t = (TX, \hat{D}_t), the available loads pay $350, $150, $450, and $300.]
Approximate dynamic programming
… and make our first choice: x
[Figure: the trucker takes the load to NY; the post-decision state is S_t^x = (NY), and all value estimates \bar{V}^0 are still 0.]
Approximate dynamic programming
Update the value of being in Texas.
[Figure: the value of being in Texas is updated to \bar{V}^1(TX) = 450; the post-decision state is S_t^x = (NY).]
Approximate dynamic programming
Now move to the next state, sample new demands and make a new
decision
[Figure: from the state S_{t+1} = (NY, \hat{D}_{t+1}), the new loads pay $180, $400, $600, and $125; \bar{V}^1(TX) = 450 and the other estimates are still 0.]
Approximate dynamic programming
Update value of being in NY
[Figure: the value of being in NY is updated to \bar{V}(NY) = 600; the trucker heads to California, so the post-decision state is S_{t+1}^x = (CA).]
Approximate dynamic programming
Move to California.
[Figure: from the state S_{t+2} = (CA, \hat{D}_{t+2}), the loads pay $400, $200, $150, and $350; \bar{V}(NY) = 600 and \bar{V}^1(TX) = 450.]
Approximate dynamic programming
Make decision to return to TX and update value of being in CA
[Figure: the trucker decides to return to TX; the value of being in CA is updated to \bar{V}(CA) = 800, the $350 load to TX plus the downstream value \bar{V}^1(TX) = 450.]
Approximate dynamic programming
An updated value of being in TX
[Figure: back in the state S_{t+3} = (TX, \hat{D}_{t+3}); the new loads pay $385, $125, and $275, the downstream value of CA is now \bar{V}(CA) = 800, and the current estimate is \bar{V}^1(TX) = 450.]
Approximate dynamic programming
Updating the value function:
Old value: \bar{V}^1(TX) = $450
New estimate: \hat{v}^2(TX) = $800
How do we combine the old value with the new observation?
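As a worked step (the stepsize \alpha = 0.1 is an assumption, chosen because it reproduces the updated value of 485 shown in the figure below):
\bar{V}^2(TX) = (1 - \alpha)\,\bar{V}^1(TX) + \alpha\,\hat{v}^2(TX) = (0.9)(450) + (0.1)(800) = 485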
[Figure: the value of being in Texas is smoothed up to \bar{V}(TX) = 485; the other estimates are unchanged.]
Approximate dynamic programming
Hierarchical learning
Resource attribute: a = the "state" that the trucker is currently in.
[Figure: decisions d and d' take the trucker from attribute a to downstream attributes a_d and a_{d'}; we need estimates \bar{v}(a_d) = ? and \bar{v}(a_{d'}) = ? of the values of these downstream attribute vectors a_1, a_2, ..., a_n.]
\bar{V}_t(S_t) = \max_{x} \Big( C_t(S_t, x_t) + \sum_{a \in \mathcal{A}} \bar{v}_{ta} R_{ta}^{x} \Big)
Estimates at different levels of aggregation are each smoothed toward the same disaggregate observation \hat{v}:
\bar{v}^n(\text{Location}) = (1-\alpha)\,\bar{v}^{n-1}(\text{Location}) + \alpha\,\hat{v}(\text{Location, Fleet, Domicile, DOThrs, DaysFromHome})
\bar{v}^n(\text{Location, Fleet}) = (1-\alpha)\,\bar{v}^{n-1}(\text{Location, Fleet}) + \alpha\,\hat{v}(\text{Location, Fleet, Domicile, DOThrs, DaysFromHome})
\bar{v}^n(\text{Location, Fleet, Domicile}) = (1-\alpha)\,\bar{v}^{n-1}(\text{Location, Fleet, Domicile}) + \alpha\,\hat{v}(\text{Location, Fleet, Domicile, DOThrs, DaysFromHome})
where the weight on the estimate at aggregation level g is inversely proportional to its variance plus its squared bias:
w_a^{(g)} \propto \Big( \text{Var}\big[\bar{v}_a^{(g)}\big] + \big(\beta_a^{(g)}\big)^2 \Big)^{-1}
George, A., W.B. Powell and S. Kulkarni, "Value Function Approximation Using Hierarchical Aggregation for Multiattribute Resource Management," Journal of Machine Learning Research, Vol. 9, pp. 2079-2111 (2008).
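A small sketch of the weighting formula above; the arrays of estimates, variances, and biases for a single attribute are hypothetical example inputs.

```python
import numpy as np

# Sketch of the hierarchical weighting above: the estimate of v(a) is a weighted
# combination of the estimates at each aggregation level g, with weights
# inversely proportional to (variance + squared bias), normalized to sum to one.
# The input arrays (one entry per aggregation level) are hypothetical examples.

def hierarchical_estimate(v_bar, var, bias):
    v_bar, var, bias = map(np.asarray, (v_bar, var, bias))
    w = 1.0 / (var + bias ** 2)         # w_a^(g)  proportional to  1 / (Var + bias^2)
    w /= w.sum()                        # normalize across aggregation levels
    return float(w @ v_bar)             # weighted estimate of v(a)

# Example: three levels, most disaggregate first.  Early on the disaggregate
# estimate is noisy (large variance), so most of the weight falls on the more
# aggregate but more biased levels; the weight shifts back as the variance shrinks.
print(hierarchical_estimate(v_bar=[450.0, 480.0, 520.0],
                            var=[900.0, 100.0, 25.0],
                            bias=[0.0, 10.0, 40.0]))
```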
Hierarchical learning
Hierarchical aggregation
[Figure: approximating a nonlinear original function v(a) with an aggregated function; at the selected attribute a the aggregated approximation carries a bias relative to the original function.]
Hierarchical learning
Hierarchical aggregation
[Figure: hierarchical aggregation of locations, e.g. estimating the value of PA using the Northeast-region estimate \bar{v}_{NE}.]
[Figure: average weight placed on each of seven aggregation levels as a function of the iteration count (0 to 1200); the most aggregate levels receive the most weight early on, and weight shifts toward the most disaggregate level as observations accumulate.]
Hierarchical learning
Notes:
» In the early iterations, we do not have enough data to
provide estimates at the detail level.
» So, we put more weight on the most aggregate
estimates.
» As the algorithm progresses and we gain more
information, we can put more weight on the more
disaggregate estimates.
» But the weights depend on how much data we have in
different regions.
» This type of adaptive learning, from coarse-grained to
fine-grained, is common across all learning problems in
stochastic optimization.
Hierarchical learning
Hierarchical aggregation
[Figure: objective function (roughly 1.40 to 1.90 million) vs. iterations (0 to 1000); the aggregate approximation shows faster initial convergence, while the disaggregate approximation shows better asymptotic performance.]
Hierarchical learning
Hierarchical aggregation
[Figure: the same comparison with a weighted combination added; the adaptive weighting outperforms both the aggregate and the disaggregate approximations, which hints at a strategy for adaptive learning.]
The exploration-exploitation problem
• Exploitation
• Exploration
[Figure: pure exploitation in the nomadic trucker example; candidate downstream attributes a_{11}, a_{12}, a_{13}, ... are compared using a weighted combination of value estimates across aggregation levels, w_a^0\,\bar{V}^0(a) + w_a^1\,\bar{V}^1(a) + w_a^2\,\bar{V}^2(a).]
Optimizing fleets
From one truck to many trucks
» If there is one truck and N “states” (locations), then our
dynamic program (post-decision state) has N states.
» But what if there is more than one truck? Then we
have to capture the state of the fleet.
Number of resources   Attribute space   State space
 1                        1             1
 1                      100             100
 1                    1,000             1,000
 5                       10             2,002
 5                      100             91,962,520
 5                    1,000             8,416,958,750,200
50                       10             12,565,671,261
50                      100             13,419,107,273,154,600,000,000,000,000,000,000,000,000
50                    1,000             109,740,941,767,311,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
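The state-space column is consistent with counting the ways to spread R identical resources over A attribute values, the multiset coefficient C(R + A - 1, R); a quick check in Python reproduces the smaller entries:

```python
from math import comb

# The state-space column above counts the ways to distribute R identical
# resources (trucks) over A attribute values: C(R + A - 1, R).

def fleet_state_space(R, A):
    return comb(R + A - 1, R)

print(fleet_state_space(5, 10))    # 2,002
print(fleet_state_space(5, 100))   # 91,962,520
print(fleet_state_space(50, 10))   # 12,565,671,261
```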
S_t^x = the post-decision state
S_{t+1} = S^{M,W}(S_t^x, W_{t+1})
W_{t+1} = (\hat{R}_{t+1}, \hat{D}_{t+1})
Optimizing fleets
The next pre-decision state S_{t+1}
» Add this value to the assignment arc.
[Figure: applying decisions d_3, d_4, d_5 to a truck with attribute a_3 produces downstream attributes a^M(a_3, d_3), a^M(a_3, d_4), a^M(a_3, d_5), whose values are added to the corresponding assignment arcs.]
[Figure: objective function (roughly 1.2 to 1.8 million) vs. iterations (0 to 1000).]
Approximate dynamic programming
\bar{V}_{t-1}^n(S_{t-1}^{x,n}) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat{v}_t^n
The stepsize \alpha_{n-1} is also called the "learning rate" or "smoothing factor."
V = \sum_{n=0}^{\infty} \gamma^n c = \frac{c}{1-\gamma}
\hat{v}^{n+1} = c + \gamma\,\bar{v}^n
\bar{v}^{n+1} = (1 - \alpha_n)\,\bar{v}^n + \alpha_n\,\hat{v}^{n+1}
The choice of stepsize has to balance the noise \sigma^2 in the observations against the bias in \hat{v}^{n+1}, which persists as long as \bar{v}^n is still far from the true value.
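A tiny numeric illustration of the three equations above, using hypothetical values c = 10 and gamma = 0.9: smoothing \hat{v} = c + \gamma\bar{v} drives \bar{v} toward c/(1-\gamma) = 100, and the stepsize controls how quickly the bias is worked off.

```python
# Numeric illustration of the equations above with hypothetical values
# c = 10 and gamma = 0.9, so the true value is c / (1 - gamma) = 100.
# The observations v_hat = c + gamma * v_bar are biased low while v_bar is
# still too small, and the classic 1/n stepsize works that bias off very slowly.

c, gamma = 10.0, 0.9
true_value = c / (1.0 - gamma)                 # V = sum_n gamma^n c = 100
v_bar = 0.0
for n in range(1, 201):
    alpha = 1.0 / n                            # "1/n" stepsize
    v_hat = c + gamma * v_bar                  # biased observation of V
    v_bar = (1.0 - alpha) * v_bar + alpha * v_hat
print(true_value, round(v_bar, 2))             # after 200 iterations v_bar is still well below 100
```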
[Figures: observed values over roughly 100 iterations together with the estimates produced by the BAKF stepsize rule, shown for two different signals.]
[Figures: calibration of the fleet model; simulated utilization by capacity category (US_SOLO, US_IC, US_TEAM) compared against historical minimum and maximum, and length of haul (LOH, in miles) over 50 iterations for the vanilla simulator versus the run using approximate dynamic programming, compared against an acceptable region.]
[Figure: objective function (1.80 to 1.90 million) as a function of the number of drivers with attribute a, shown for ten sample realizations (s1-s10), their average, and the prediction based on the value function slope \bar{v}_a.]
This shows that the value functions provide reasonable approximations.
Case study: truckload trucking