
REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

Peter L. Bartlett
Computer Science Division, and Department of Statistics
University of California at Berkeley
Berkeley, CA 94720

Ambuj Tewari
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL 60615

Abstract

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√(AT)). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.

1 INTRODUCTION

In reinforcement learning, an agent interacts with an environment while trying to maximize the total reward it accumulates. Markov Decision Processes (MDPs) are the most commonly used model for the environment. To every MDP is associated a state space S and an action space A. Suppose there are S states and A actions. The parameters of the MDP then consist of S·A state transition distributions P_{s,a} and S·A rewards r(s,a). When the agent takes action a in state s, it receives reward r(s,a) and the probability that it moves to state s' is P_{s,a}(s'). The agent does not know the parameters P_{s,a} and r(s,a) of the MDP in advance but has to learn them by directly interacting with the environment. In doing so, it faces the exploration vs. exploitation trade-off that Kearns and Singh [2002] succinctly describe as:

"... should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, with the hope of gaining information or experience that could lead to higher future payoffs? Too little exploration can prevent the agent from ever converging to the optimal behavior, while too much exploration can prevent the agent from gaining near-optimal payoff in a timely fashion."

Suppose the agent uses an algorithm G to choose its actions based on the history of its interactions with the MDP starting from some initial state s_1. Denoting the (random) reward obtained at time t by r_t, the algorithm's expected reward until time T is

    R_G(s_1, T) = E[ Σ_{t=1}^T r_t ].

Suppose λ is the optimal per-step reward. An important quantity used to measure how well G is handling the exploration vs. exploitation trade-off is the regret,

    ∆_G(s_1, T) = λT − R_G(s_1, T).
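As a toy illustration with made-up rewards (not an example from the paper), a policy that never learns incurs regret growing linearly in T:

```python
import random

# Toy single-state MDP with two actions (made-up numbers): action 0 pays
# 0.2 and action 1 pays 0.9 deterministically, so the optimal per-step
# reward is lambda = 0.9. A uniformly random policy never learns, and its
# regret Delta_G(s_1, T) = lambda*T - R_G(s_1, T) grows linearly in T.
random.seed(0)
rewards = {0: 0.2, 1: 0.9}
lam = max(rewards.values())

T = 10_000
total = sum(rewards[random.randrange(2)] for _ in range(T))
regret = lam * T - total
print(regret / T)  # per-step regret stays near 0.9 - 0.55 = 0.35
```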

If ∆_G(s_1, T) is o(T), then G is indeed learning something useful about the MDP, since its expected average reward then converges to the optimal value λ (which can only be computed with knowledge of the MDP parameters) in the limit T → ∞.

However, asymptotic results are of limited value, and it is therefore desirable to have finite time bounds on the regret. To obtain such results, one has to work with a restricted class of MDPs. In the theory of MDPs, four fundamental subclasses have been studied. Unless otherwise specified, by a policy we mean a deterministic stationary policy, i.e. simply a map π : S → A.

Ergodic: Every policy induces a single recurrent class, i.e. it is possible to reach any state from any other state.

Unichain: Every policy induces a single recurrent class plus a possibly empty set of transient states.

Communicating: For every s_1, s_2 ∈ S, there is some policy that takes one from s_1 to s_2.

Weakly Communicating: The state space S decomposes into two sets: in the first, each state is reachable from every other state in the set under some policy; in the second, all states are transient under all policies.

Unless we modify our criterion, it is clear that regret guarantees cannot be given for general MDPs, since the agent might be placed in a state from which reaching the optimally rewarding states is impossible. So, we consider the most general subclass: weakly communicating MDPs. For such MDPs, the optimal gain λ is state independent and is obtained by solving the following optimality equations,

    h(s) + λ = max_{a∈A} [ r(s,a) + P_{s,a} h ],  s ∈ S.    (1)

Here, h is known as the bias vector. Note that if h is a bias vector then so is h + c·e, where e is the all-1's vector. If we want to make the dependence of h(s) and λ on the underlying MDP M explicit, we will write them as h(s; M) and λ(M) respectively.
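As a quick numerical sanity check (a made-up two-state example, not from the paper), the optimality equations (1) and the shift invariance of the bias can be verified directly:

```python
import numpy as np

# A deterministic 2-state cycle (toy example): s0 -> s1 -> s0, with
# rewards r(s0)=1, r(s1)=0 and a single action per state, so the max
# over actions in (1) is trivial. The gain is the cycle average 0.5.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # P[s, s'] under the only action
r = np.array([1.0, 0.0])

lam = 0.5                    # optimal gain
h = np.array([0.25, -0.25])  # one bias vector solving h + lam*e = r + P h

assert np.allclose(h + lam, r + P @ h)
# Bias is only defined up to an additive constant: h + c*e also works.
assert np.allclose((h + 3.0) + lam, r + P @ (h + 3.0))
```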

In this paper, we give an algorithm, Regal, that achieves

    Õ( sp(h(M)) S √(AT) )

regret with probability at least 1 − δ when started in any state of a weakly communicating MDP M. Here sp(h) is the span of h, defined as

    sp(h) := max_{s∈S} h(s) − min_{s∈S} h(s).

The Õ(·) notation hides factors that are logarithmic in S, A, T and 1/δ.

Regal is based on the idea of regularization, which has been successfully applied to obtain low-regret algorithms in other settings involving decision making under uncertainty. The main idea is simple to describe. Using its experience so far, the agent can build a set M of MDPs such that, with high probability, the true MDP lies in M. If the agent assumed that it is in the best of all possible worlds, it would choose an MDP M ∈ M to maximize λ(M) and follow the optimal policy for M. Instead, Regal picks M to trade off high gain for low span, by maximizing the following regularized objective,

    λ(M) − C sp(h(M)).
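The trade-off can be illustrated with made-up numbers: among candidate MDPs summarized by their gain and span, the regularized objective can prefer a slightly lower gain with a much smaller span over the purely optimistic choice:

```python
# Toy illustration of the regularized selection (all values made up):
# each candidate MDP in the confidence set is summarized here by its
# optimal gain lambda(M) and bias span sp(h(M)); a REGAL-style choice
# maximizes lambda(M) - C * sp(h(M)) instead of lambda(M) alone.
candidates = [
    {"name": "M1", "gain": 0.95, "span": 8.0},  # optimistic, large span
    {"name": "M2", "gain": 0.90, "span": 2.0},
    {"name": "M3", "gain": 0.80, "span": 0.5},
]
C = 0.02
best = max(candidates, key=lambda m: m["gain"] - C * m["span"])
print(best["name"])  # M2: 0.90 - 0.04 = 0.86 beats M1's 0.95 - 0.16 = 0.79
```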

1.1 RELATED WORK

The problem of simultaneous estimation and control of MDPs has been studied in the control theory [Kumar and Varaiya, 1986], operations research [Burnetas and Katehakis, 1997] and machine learning communities. In the machine learning literature, finite time bounds for undiscounted MDPs were pioneered by Kearns and Singh [2002]. Their seminal work spawned a long thread of research [Brafman and Tennenholtz, 2002, Kakade, 2003, Strehl and Littman, 2005, Strehl et al., 2006, Auer and Ortner, 2007, Tewari and Bartlett, 2008, Auer et al., 2009a]. These efforts improved the S and A dependence of their bounds, investigated lower bounds and explored other ways to study the exploration vs. exploitation trade-off.

The results of Auer and Ortner [2007] and Tewari and Bartlett [2008] applied to unichain and ergodic MDPs respectively. Recently, Auer et al. [2009a] gave an algorithm for communicating MDPs whose regret after T steps is, with high probability,

    Õ( D(M) S √(AT) ).

Here, D(M) is the diameter of the MDP M, defined as

    D(M) := max_{s_1 ≠ s_2} min_π T^π_{s_1→s_2},

where T^π_{s_1→s_2} is the expected number of steps it takes policy π to get to s_2 from s_1. By definition, any communicating MDP has finite diameter. Let O denote the set of average-reward optimal policies. Previous work had considered various notions of "diameter", such as

    D_worst(M) := max_π max_{s_1 ≠ s_2} T^π_{s_1→s_2},
    D_opt(M) := min_{π∈O} max_{s_1 ≠ s_2} T^π_{s_1→s_2}.
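For a fixed policy π, the expected hitting times T^π_{s→g} can be computed by solving a linear system; a sketch on a made-up chain (the helper below is ours, not from the paper):

```python
import numpy as np

# Expected hitting times under a fixed policy: with transition matrix P
# (row s = distribution over next states), T^pi_{s -> g} solves the
# linear system  t(s) = 1 + sum_{s' != g} P[s, s'] * t(s'),  t(g) = 0.
def hitting_times(P, g):
    n = P.shape[0]
    idx = [s for s in range(n) if s != g]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    t = np.linalg.solve(A, np.ones(n - 1))
    out = np.zeros(n)
    out[idx] = t
    return out  # out[s] = expected steps from s to g

# Toy 2-state chain: from either state, move to the other w.p. 0.5.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
print(hitting_times(P, 1)[0])  # geometric with p = 0.5 -> 2.0
```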

Clearly D ≤ D_opt ≤ D_worst, and so, everything else being the same, an upper bound on regret in terms of D is stronger.

We propose another diameter that we call the one-way diameter,

    D_ow(M) := max_s min_π T^π_{s→s̄},    (2)

where s̄ = argmax_s h(s). We study the relationship between sp(h), D_ow and D in Section 4 below and prove that sp(h) ≤ D_ow ≤ D whenever the rewards are in [0, 1]. So, we not only extend the result of Auer et al. [2009a] to weakly communicating MDPs but also replace D by a smaller quantity, sp(h).

2 PRELIMINARIES

We make the simplifying assumption that the rewards r(s,a) ∈ [0, 1] are known and only the transition probabilities P_{s,a} are unknown. This is not a restrictive assumption, as the rewards can also be estimated at the cost of increasing the regret bound by some constant factor. The heart of the problem lies in not knowing the transition probabilities.

Recall the definition of the dynamic programming operator T,

    (T V)(s) := max_{a∈A} [ r(s,a) + P_{s,a} V ].

Let V_n(s) denote the maximum (over all policies, stationary or non-stationary) possible expected reward obtainable in n steps starting from s. This can be computed via the simple recursion

    V_0(s) = 0,
    V_{n+1} = T V_n.

Note that the optimality equations (1) can be written succinctly in terms of the operator T,

    h + λe = T h.    (3)
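A minimal sketch of the recursion, on a made-up 2-state, 2-action MDP (all numbers ours), which also checks the vector form (3) numerically; it relies on T V_n − V_n converging to λe, which holds for this aperiodic example:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] transition probabilities, r[s, a] rewards in [0, 1].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.1, 0.6]])

def T_op(V):
    # (T V)(s) = max_a [ r(s,a) + P_{s,a} . V ]
    Q = r + np.einsum('asq,q->sa', P, V)
    return Q.max(axis=1)

V = np.zeros(2)             # V_0 = 0
for _ in range(2000):       # V_{n+1} = T V_n
    V = T_op(V)

lam = (T_op(V) - V).mean()  # T V_n - V_n converges to lam * e here
h = V - V.min()             # a bias vector, shifted so that min h = 0
assert np.allclose(T_op(h), h + lam, atol=1e-6)  # h + lam*e = T h, i.e. (3)
```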

3 REGAL

In this section, we present Regal, an algorithm (see Algorithm 1) inspired by the Ucrl2 algorithm of Auer et al. [2009a]. Before we describe the algorithm, let us set up some notation. Let N(s,a,s'; t) be the number of times the state-action-state triple (s,a,s') has been visited up to time t. Let N(s,a; t) be defined similarly. Then, an estimate of the state transition probability at time t is

    P̂^t_{s,a}(s') = N(s,a,s'; t) / max{ N(s,a; t), 1 }.    (4)

Using these estimates, we can define a set M(t) of MDPs such that the true MDP lies in M(t) with high probability. The set M(t) consists of all those MDPs whose transition probabilities satisfy

    ‖ P_{s,a} − P̂^t_{s,a} ‖_1 ≤ √( 12 S log(2At/δ) / max{ N(s,a; t), 1 } ).    (5)
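The estimates (4) and the confidence radius on the right-hand side of (5) are straightforward to compute from visit counts; a sketch on a made-up trajectory (the helper names and the square-root form of the radius follow our reconstruction above):

```python
import math
from collections import defaultdict

# Made-up trajectory of (s, a, s') triples in an MDP with S=2, A=2.
trajectory = [(0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1)]
S, A, delta = 2, 2, 0.05

N_sas = defaultdict(int)   # N(s, a, s'; t)
N_sa = defaultdict(int)    # N(s, a; t)
for s, a, s2 in trajectory:
    N_sas[(s, a, s2)] += 1
    N_sa[(s, a)] += 1

t = len(trajectory)

def P_hat(s, a, s2):       # empirical estimate, equation (4)
    return N_sas[(s, a, s2)] / max(N_sa[(s, a)], 1)

def l1_radius(s, a):       # L1 confidence radius from (5)
    return math.sqrt(12 * S * math.log(2 * A * t / delta) / max(N_sa[(s, a)], 1))

print(P_hat(0, 1, 1))      # 2 of the 3 visits to (s=0, a=1) led to s'=1
```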

Like Ucrl2, Regal works in episodes. Let t_k denote the start of episode k. We will abbreviate N(s,a,s'; t_k) and N(s,a; t_k) to N_k(s,a,s') and N_k(s,a) respectively. Also, let v_k(s,a) = N_{k+1}(s,a) − N_k(s,a) be the number of times (s,a) is visited during episode k.

To choose a policy to follow in episode k, Regal first finds an MDP M_k ∈ M_k := M(t_k) that maximizes a "regularized" average optimal reward,

    λ(M) − C_k sp(h(M)).    (6)

Here C_k is a regularization parameter that is to be set appropriately at the start of episode k.

Algorithm 1: REGularization based Regret Minimizing ALgorithm (Regal)

    for episodes k = 1, 2, ... do
        t_k ← current time
        M_k is the set of MDPs whose transition function satisfies (5) with t = t_k
        Choose M_k ∈ M_k to maximize the following criterion over M_k:
            λ(M) − C_k sp(h(M))
        π_k ← average reward optimal policy for M_k
        Follow π_k until some (s,a) pair gets visited N_k(s,a) times
    end for

Regal then follows the average reward optimal policy π_k for M_k until some state-action pair is visited N_k(s,a) times. When this happens, Regal enters the next episode.
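The stopping rule in Algorithm 1 is the visit-doubling criterion of Ucrl2; a sketch (helper names are ours, and the max(·, 1) guard for never-visited pairs follows the Ucrl2 convention):

```python
# Sketch of the episode stopping rule: episode k ends once some pair
# (s, a) has been visited v_k(s,a) >= max(N_k(s,a), 1) times within the
# episode, i.e. its total visit count has (at least) doubled.
def episode_over(N_k, v_k):
    return any(v >= max(N_k.get(sa, 0), 1) for sa, v in v_k.items())

N_k = {("s0", "a0"): 4, ("s1", "a1"): 2}        # counts at episode start
assert not episode_over(N_k, {("s0", "a0"): 3})  # 3 < 4: keep going
assert episode_over(N_k, {("s1", "a1"): 2})      # 2 >= 2: new episode
```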

Theorem 1. Suppose Algorithm 1 is run for T steps in a weakly communicating MDP M starting from some state s_1. Then, with probability at least 1 − δ, there exist choices for the regularization parameters C_k such that

    ∆(s_1, T) = O( sp(h(M)) S √( AT log(AT/δ) ) ).

The proof of this theorem is given in Section 6. Unfortunately, as the proof reveals, the choice of the parameter C_k in the theorem requires knowledge of the counts v_k(s,a) before episode k begins. Therefore, we cannot run Regal with these choices of C_k. However, if an upper bound H on the span sp(h(M)) of the true MDP M is known, solving the constrained version of the optimization problem (6) gives Algorithm 2, which enjoys the following guarantee.

Theorem 2. Suppose Algorithm 2 is run for T steps in a weakly communicating MDP M starting in some state s_1. Let the input parameter H be such that H ≥ sp(h(M)). Then, with probability at least 1 − δ,

    ∆(s_1, T) = O( HS √( AT log(AT/δ) ) ).

Finally, we also present a modification of Regal that does not require knowledge of an upper bound on sp(h(M)). The downside is that the regret guarantee we can prove becomes Õ( sp(h) √(S³AT) ). The modification, Regal.D, given as Algorithm 3, uses the so-called doubling trick to guess the length of an episode. If the guess turns out to be incorrect, the guess is doubled.

Theorem 3. Suppose Algorithm 3 is run for T steps in a weakly communicating MDP M starting from
