REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

Peter L. Bartlett
Computer Science Division, and Department of Statistics
University of California at Berkeley
Berkeley, CA 94720

Ambuj Tewari
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL 60615
Abstract

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with $S$ states and $A$ actions whose optimal bias vector has span bounded by $H$, we show a regret bound of $\tilde{O}(H S \sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
1 INTRODUCTION

In reinforcement learning, an agent interacts with an environment while trying to maximize the total reward it accumulates. Markov Decision Processes (MDPs) are the most commonly used model for the environment. To every MDP is associated a state space $\mathcal{S}$ and an action space $\mathcal{A}$. Suppose there are $S$ states and $A$ actions. The parameters of the MDP then consist of $S \cdot A$ state transition distributions $P_{s,a}$ and $S \cdot A$ rewards $r(s,a)$. When the agent takes action $a$ in state $s$, it receives reward $r(s,a)$ and the probability that it moves to state $s'$ is $P_{s,a}(s')$. The agent does not know the parameters $P_{s,a}$ and $r(s,a)$ of the MDP in advance but has to learn them by directly interacting with the environment. In doing so, it faces the exploration vs. exploitation trade-off that Kearns and Singh [2002] succinctly describe as,

"... should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, with the hope of gaining information or experience that could lead to higher future payoffs? Too little exploration can prevent the agent from ever converging to the optimal behavior, while too much exploration can prevent the agent from gaining near-optimal payoff in a timely fashion."

Suppose the agent uses an algorithm $G$ to choose its actions based on the history of its interactions with the MDP starting from some initial state $s_1$. Denoting the (random) reward obtained at time $t$ by $r_t$, the algorithm's expected reward until time $T$ is

$$R_G(s_1, T) = \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right].$$
Suppose $\lambda^\star$ is the optimal per-step reward. An important quantity used to measure how well $G$ is handling the exploration vs. exploitation trade-off is the regret,

$$\Delta_G(s_1, T) = \lambda^\star T - R_G(s_1, T).$$

If $\Delta_G(s_1, T)$ is $o(T)$ then $G$ is indeed learning something useful about the MDP since its expected average reward then converges to the optimal value $\lambda^\star$ (which can only be computed with the knowledge of the MDP parameters) in the limit.
However, asymptotic results are of limited value and therefore it is desirable to have finite time bounds on the regret. To obtain such results, one has to work with a restricted class of MDPs. In the theory of MDPs, four fundamental subclasses have been studied. Unless otherwise specified, by a policy, we mean a deterministic stationary policy, i.e. simply a map $\pi : \mathcal{S} \to \mathcal{A}$.

Ergodic: Every policy induces a single recurrent class, i.e. it is possible to reach any state from any other state.

Unichain: Every policy induces a single recurrent class plus a possibly empty set of transient states.

Communicating: For every $s_1, s_2$, there is some policy that takes one from $s_1$ to $s_2$.

Weakly Communicating: The state space $\mathcal{S}$ decomposes into two sets: in the first, each state is reachable from every other state in the set under some policy; in the second, all states are transient under all policies.

Unless we modify our criterion, it is clear that regret guarantees cannot be given for general MDPs since the agent might be placed in a state from which reaching the optimally rewarding states is impossible. So, we consider the most general subclass: weakly communicating MDPs. For such MDPs, the optimal gain $\lambda^\star$ is state independent and is obtained by solving the following optimality equations,
$$h^\star + \lambda^\star e = \max_{a \in \mathcal{A}} \left[ r(s,a) + P_{s,a}^\top h^\star \right]. \qquad (1)$$

Here, $h^\star$ is known as the bias vector. Note that if $h^\star$ is a bias vector then so is $h^\star + c e$ where $e$ is the all 1's vector. If we want to make the dependence of $h^\star(s)$ and $\lambda^\star$ on the underlying MDP $M$ explicit, we will write them as $h^\star(s; M)$ and $\lambda^\star(M)$ respectively.
In this paper, we give an algorithm, Regal, that achieves

$$\tilde{O}\!\left( \mathrm{sp}(h^\star(M^\star)) \, S \sqrt{AT} \right)$$

regret with probability at least $1 - \delta$ when started in any state of a weakly communicating MDP $M^\star$. Here $\mathrm{sp}(h)$ is the span of $h$ defined as,

$$\mathrm{sp}(h) := \max_{s \in \mathcal{S}} h(s) - \min_{s \in \mathcal{S}} h(s).$$

The $\tilde{O}(\cdot)$ notation hides factors that are logarithmic in $S, A, T$ and $1/\delta$.
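For concreteness, here is a minimal Python sketch, not taken from the paper, that solves the optimality equations (1) by standard relative value iteration on a hand-made three-state MDP and reports the gain $\lambda^\star$ and the span $\mathrm{sp}(h^\star)$. The MDP, its rewards, and all constants below are illustrative assumptions.

import numpy as np

# A toy 3-state, 2-action weakly communicating MDP: transition tensor P[s, a, s']
# and reward table r[s, a]; all numbers are made up for illustration.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.8, 0.2], [0.5, 0.5, 0.0]],
    [[0.2, 0.0, 0.8], [0.0, 0.3, 0.7]],
])
r = np.array([
    [0.0, 0.1],
    [0.2, 0.0],
    [1.0, 0.5],
])

def relative_value_iteration(P, r, iters=10_000, tol=1e-10):
    """Approximately solve h + lambda*e = max_a [r(s,a) + P_{s,a}^T h]; return (lambda, h)."""
    h = np.zeros(P.shape[0])
    gain = 0.0
    for _ in range(iters):
        Th = (r + P @ h).max(axis=1)   # the dynamic programming update
        gain = Th[0] - h[0]            # gain estimate read off at a reference state
        h_new = Th - Th[0]             # re-center so the bias stays bounded (h[0] = 0)
        if np.max(np.abs(h_new - h)) < tol:
            return gain, h_new
        h = h_new
    return gain, h

gain, h = relative_value_iteration(P, r)
span = h.max() - h.min()               # sp(h) = max_s h(s) - min_s h(s)
print(f"lambda* ~ {gain:.4f}   bias h = {np.round(h, 3)}   sp(h*) ~ {span:.4f}")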
Regal is based on the idea of regularization that has been successfully applied to obtain low regret algorithms in other settings involving decision making under uncertainty. The main idea is simple to describe. Using its experience so far, the agent can build a set $\mathcal{M}$ such that, with high probability, the true MDP lies in $\mathcal{M}$. If the agent assumes that it is in the best of all possible worlds, it would choose an MDP $M \in \mathcal{M}$ to maximize $\lambda^\star(M)$ and follow the optimal policy for $M$. Instead, Regal picks $M \in \mathcal{M}$ to trade off high gain for low span, by maximizing the following regularized objective,

$$\lambda^\star(M) - C \, \mathrm{sp}(h^\star(M)).$$
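The following Python sketch is illustrative only and is not the paper's algorithm: instead of optimizing over a confidence set of MDPs, it simply scores a small hand-made family of candidate MDPs by $\lambda^\star(M) - C\,\mathrm{sp}(h^\star(M))$ with a hypothetical weight $C$. The point is that a candidate with slightly lower gain but much smaller span can beat the purely optimistic (highest-gain) choice.

import numpy as np

def gain_and_span(P, r, iters=5000):
    """Gain lambda* and span sp(h*) of a small MDP via relative value iteration."""
    h = np.zeros(P.shape[0])
    for _ in range(iters):
        Th = (r + P @ h).max(axis=1)   # dynamic programming operator applied to h
        h = Th - Th[0]                 # pin h[0] = 0 so the iterates stay bounded
    return Th[0], h.max() - h.min()    # at convergence, lambda* = Th[0] - h[0]

def two_state_mdp(p_stay, top_reward):
    """A 2-state, 2-action candidate: in state 0, action 1 tries to move to state 1."""
    P = np.array([[[1.0, 0.0], [p_stay, 1.0 - p_stay]],
                  [[0.0, 1.0], [1.0 - p_stay, p_stay]]])
    r = np.array([[0.1, 0.0],                        # state 0 pays little
                  [top_reward, 0.9 * top_reward]])   # state 1 pays well
    return P, r

C = 0.05                                  # hypothetical regularization weight
candidates = [two_state_mdp(0.95, 1.0),   # high gain, but large span (slow to reach state 1)
              two_state_mdp(0.20, 0.9)]   # slightly lower gain, much smaller span

scores = []
for P, r in candidates:
    lam, span = gain_and_span(P, r)
    scores.append(lam - C * span)
    print(f"lambda* = {lam:.3f}   sp(h*) = {span:.3f}   regularized = {lam - C * span:.3f}")

print("optimistic choice:", int(np.argmax([gain_and_span(P, r)[0] for P, r in candidates])))
print("regularized choice:", int(np.argmax(scores)))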
1.1 RELATED WORK

The problem of simultaneous estimation and control of MDPs has been studied in the control theory [Kumar and Varaiya, 1986], operations research [Burnetas and Katehakis, 1997] and machine learning communities. In the machine learning literature, finite time bounds for undiscounted MDPs were pioneered by Kearns and Singh [2002]. Their seminal work spawned a long thread of research [Brafman and Tennenholtz, 2002, Kakade, 2003, Strehl and Littman, 2005, Strehl et al., 2006, Auer and Ortner, 2007, Tewari and Bartlett, 2008, Auer et al., 2009a]. These efforts improved the $S$ and $A$ dependence of their bounds, investigated lower bounds and explored other ways to study the exploration vs. exploitation trade-off.

The results of Auer and Ortner [2007] and Tewari and Bartlett [2008] applied to unichain and ergodic MDPs respectively. Recently, Auer et al. [2009a] gave an algorithm for communicating MDPs whose expected regret after $T$ steps is, with high probability,

$$\tilde{O}\!\left( D(M^\star) \, S \sqrt{AT} \right).$$
Here, $D(M)$ is the diameter of the MDP $M$ defined as

$$D(M) := \max_{s_1 \neq s_2} \min_{\pi} T^{\pi}_{s_1 s_2},$$

where $T^{\pi}_{s_1 s_2}$ is the expected number of steps it takes policy $\pi$ to get to $s_2$ from $s_1$. By definition, any communicating MDP has finite diameter. Let $\mathcal{O}$ denote the set of average-reward optimal policies. Previous work had considered various notions of "diameter", such as,

$$D_{\mathrm{worst}}(M) := \max_{\pi} \max_{s_1 \neq s_2} T^{\pi}_{s_1 s_2},$$
$$D_{\mathrm{opt}}(M) := \min_{\pi \in \mathcal{O}} \max_{s_1 \neq s_2} T^{\pi}_{s_1 s_2}.$$
Clearly $D \le D_{\mathrm{opt}} \le D_{\mathrm{worst}}$, and so, everything else being the same, an upper bound on regret in terms of $D$ is stronger.

We propose another diameter that we call the one-way diameter,

$$D_{\mathrm{ow}}(M) := \max_{s_1} \min_{\pi} T^{\pi}_{s_1 \bar{s}}, \qquad (2)$$

where $\bar{s} = \operatorname{argmax}_s h^\star(s)$. We study the relationship between $\mathrm{sp}(h^\star)$, $D_{\mathrm{ow}}$ and $D$ in Section 4 below and prove that $\mathrm{sp}(h^\star) \le D_{\mathrm{ow}} \le D$ whenever the rewards are in $[0,1]$. So, we not only extend the result of Auer et al. [2009a] to weakly communicating MDPs but also replace $D$ by a smaller quantity, $\mathrm{sp}(h^\star)$.
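All of these diameter-like quantities are built from minimum expected hitting times, which can be computed by value iteration on the stochastic-shortest-path equations $T(s) = 1 + \min_a \sum_{s'} P_{s,a}(s') T(s')$ with $T(\text{target}) = 0$. The sketch below (a toy example, not from the paper) does this for a hand-made three-state MDP; the target state $\bar{s}$ for the one-way diameter is hard-coded here, whereas in the paper it is $\operatorname{argmax}_s h^\star(s)$.

import numpy as np

# Toy 3-state, 2-action communicating MDP (P[s, a, s']); numbers are illustrative.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.8, 0.2], [0.5, 0.5, 0.0]],
    [[0.2, 0.0, 0.8], [0.0, 0.3, 0.7]],
])

def min_hitting_times(P, target, iters=50_000, tol=1e-10):
    """T[s] = minimum over policies of the expected number of steps to reach `target` from s."""
    T = np.zeros(P.shape[0])
    for _ in range(iters):
        T_new = 1.0 + (P @ T).min(axis=1)   # pay one step, then act greedily
        T_new[target] = 0.0                 # the target itself costs nothing
        if np.max(np.abs(T_new - T)) < tol:
            return T_new
        T = T_new
    return T

hit = np.array([min_hitting_times(P, t) for t in range(P.shape[0])])
# hit[t, s] = min over pi of the expected time from s to t; the diameter is the worst pair.
print("D(M)    ~", round(float(hit.max()), 2))

s_bar = 2   # hypothetical choice of s_bar; in the paper s_bar = argmax_s h*(s)
print("D_ow(M) ~", round(float(hit[s_bar].max()), 2))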
2 PRELIMINARIES

We make the simplifying assumption that the rewards $r(s,a) \in [0,1]$ are known and only the transition probabilities $P_{s,a}$ are unknown. This is not a restrictive assumption as the rewards can also be estimated at the cost of increasing the regret bound by some constant factor. The heart of the problem lies in not knowing the transition probabilities.

Recall the definition of the dynamic programming operator $T$,

$$(T h)(s) := \max_{a \in \mathcal{A}} \left[ r(s,a) + P_{s,a}^\top h \right].$$

Let $V_n(s)$ denote the maximum (over all policies, stationary or non-stationary) possible expected reward obtainable in $n$ steps starting from $s$. This can be computed via the simple recursion,

$$V_0(s) = 0, \qquad V_{n+1} = T V_n.$$
Note that the optimality equations (1) can be written succinctly in terms of the operator $T$,

$$h^\star + \lambda^\star e = T h^\star. \qquad (3)$$
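A short numerical sketch of this recursion (on a made-up two-state MDP, not from the paper): iterating $V_{n+1} = T V_n$ from $V_0 = 0$ and checking that $V_n(s)/n$ approaches the optimal gain $\lambda^\star$ for every starting state.

import numpy as np

P = np.array([                 # P[s, a, s'] for a made-up 2-state, 2-action MDP
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.1, 0.9], [0.6, 0.4]],
])
r = np.array([[0.0, 0.2],      # r[s, a]
              [1.0, 0.3]])

def T_op(h):
    """(T h)(s) = max_a [ r(s, a) + P_{s,a} . h ]."""
    return (r + P @ h).max(axis=1)

V = np.zeros(P.shape[0])       # V_0 = 0
n_steps = 10_000
for _ in range(n_steps):
    V = T_op(V)                # V_{n+1} = T V_n
print("V_n / n:", np.round(V / n_steps, 4))   # both coordinates approach lambda*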
3 REGAL

In this section, we present Regal, an algorithm (see Algorithm 1) inspired by the Ucrl2 algorithm of Auer et al. [2009a]. Before we describe the algorithm, let us set up some notation. Let $N(s,a,s'; t)$ be the number of times the state-action-state triple $(s,a,s')$ has been visited up to time $t$. Let $N(s,a; t)$ be defined similarly. Then, an estimate of the state transition probability at time $t$ is

$$\hat{P}^{t}_{s,a}(s') = \frac{N(s,a,s'; t)}{\max\{N(s,a; t), 1\}}. \qquad (4)$$

Using these estimates, we can define a set $\mathcal{M}(t)$ of MDPs such that the true MDP lies in $\mathcal{M}(t)$ with high probability. The set $\mathcal{M}(t)$ consists of all those MDPs whose transition probabilities satisfy,

$$\left\| P_{s,a} - \hat{P}^{t}_{s,a} \right\|_1 \le \sqrt{\frac{12\, S \log(2At/\delta)}{\max\{N(s,a; t), 1\}}}. \qquad (5)$$
Like Ucrl2, Regal works in episodes. Let $t_k$ denote the start of episode $k$. We will abbreviate $N(s,a,s'; t_k)$ and $N(s,a; t_k)$ to $N_k(s,a,s')$ and $N_k(s,a)$ respectively. Also, let $v_k(s,a) = N_{k+1}(s,a) - N_k(s,a)$ be the number of times $(s,a)$ is visited during episode $k$.

For choosing a policy to follow in episode $k$, Regal first finds an MDP $M_k \in \mathcal{M}_k := \mathcal{M}(t_k)$ that maximizes a "regularized" average optimal reward,

$$\lambda^\star(M) - C_k \, \mathrm{sp}(h^\star(M)). \qquad (6)$$

Here $C_k$ is a regularization parameter that is to be set appropriately at the start of episode $k$. Regal then follows the average reward optimal policy $\pi_k$ for $M_k$ until some state-action pair is visited $N_k(s,a)$ times. When this happens, Regal enters the next episode.

Algorithm 1: REGularization based Regret Minimizing ALgorithm (Regal)
  for episodes $k = 1, 2, \ldots$ do
    $t_k \leftarrow$ current time
    $\mathcal{M}_k$ is the set of MDPs whose transition function satisfies (5) with $t = t_k$
    Choose $M_k \in \mathcal{M}_k$ to maximize the following criterion over $\mathcal{M}_k$: $\lambda^\star(M) - C_k \, \mathrm{sp}(h^\star(M))$
    $\pi_k \leftarrow$ average reward optimal policy for $M_k$
    Follow $\pi_k$ until some $(s,a)$ pair gets visited $N_k(s,a)$ times
  end for
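The episodic structure can be sketched as follows (an illustrative skeleton only, not the paper's implementation). The model-selection and planning step, where the regularized objective (6) enters, is left as a stub; the sketch only spells out the count bookkeeping and the episode-ending rule. Treating never-visited pairs as having count 1 is an assumption made here to keep the stopping rule well defined.

import numpy as np

S, A, T_horizon = 4, 2, 2_000
rng = np.random.default_rng(0)

N = np.zeros((S, A))                        # running counts N(s, a; t)

def choose_policy(N_k):
    """Stub for Regal's model selection + planning; here just a random policy."""
    return rng.integers(A, size=S)          # placeholder pi_k : S -> A

def env_step(s, a):
    """Stub environment; a real run would sample from the true (unknown) MDP."""
    return int(rng.integers(S))

s, t, k = 0, 0, 0
while t < T_horizon:
    k += 1                                  # start of episode k at time t_k = t
    N_k = N.copy()                          # counts N_k(s, a) frozen at time t_k
    v_k = np.zeros((S, A))                  # within-episode visit counts v_k(s, a)
    pi_k = choose_policy(N_k)
    while t < T_horizon:
        a = pi_k[s]
        v_k[s, a] += 1
        N[s, a] += 1
        s = env_step(s, a)
        t += 1
        if (v_k >= np.maximum(N_k, 1)).any():
            break                           # some (s, a) reached its pre-episode count
print("episodes:", k, "total steps:", t)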
Theorem 1. Suppose Algorithm 1 is run for $T$ steps in a weakly communicating MDP $M^\star$ starting from some state $s_1$. Then, with probability at least $1 - \delta$, there exist choices for the regularization parameters $C_k$ such that,

$$\Delta(s_1, T) = O\!\left( \mathrm{sp}(h^\star(M^\star)) \, S \sqrt{AT \log(AT/\delta)} \right).$$

The proof of this theorem is given in Section 6. Unfortunately, as the proof reveals, the choice of the parameter $C_k$ in the theorem requires knowledge of the counts $v_k(s,a)$ before episode $k$ begins. Therefore, we cannot run Regal with these choices of $C_k$. However, if an upper bound $H$ on the span $\mathrm{sp}(h^\star(M^\star))$ of the true MDP $M^\star$ is known, solving the constrained version of the optimization problem (6) gives Algorithm 2 that enjoys the following guarantee.
Theorem 2. Suppose Algorithm 2 is run for $T$ steps in a weakly communicating MDP $M^\star$ starting in some state $s_1$. Let the input parameter $H$ be such that $\mathrm{sp}(h^\star(M^\star)) \le H$. Then, with probability at least $1 - \delta$,

$$\Delta(s_1, T) = O\!\left( H S \sqrt{AT \log(AT/\delta)} \right).$$
Finally, we also present a modification of Regal that does not require knowledge of an upper bound on $\mathrm{sp}(h^\star(M^\star))$. The downside is that the regret guarantee we can prove becomes $\tilde{O}\!\left( \mathrm{sp}(h^\star(M^\star)) \sqrt{S^3 AT} \right)$.
The modification, Regal.D, that is given as Algorithm 3, uses the so-called doubling trick to guess the length of an episode. If the guess turns out to be incorrect, the guess is doubled.
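Algorithm 3 itself is not shown in this excerpt; the few lines below only illustrate the doubling trick being referred to: run with a guessed length and double the guess whenever it proves too short. The total of all guesses is then at most about twice the length that was actually needed.

def run_with_doubling(steps_needed, initial_guess=1):
    """Return the sequence of guessed lengths used until `steps_needed` steps are covered."""
    guesses, guess, done = [], initial_guess, 0
    while done < steps_needed:
        guesses.append(guess)
        done += guess      # spend (up to) `guess` steps under the current guess
        guess *= 2         # the guess was too short, so double it for the next attempt
    return guesses

print(run_with_doubling(100))   # [1, 2, 4, 8, 16, 32, 64]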
Theorem 3. Suppose Algorithm 3 is run for $T$ steps in a weakly communicating MDP $M^\star$ starting from