Relating Reinforcement Learning Performance to
Classification Performance

John Langford jl@hunch.net


TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637 USA
Bianca Zadrozny zadrozny@us.ibm.com
IBM T.J. Watson, 1101 Kitchawan Road, Yorktown Heights, NY 10598 USA
(Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).)

Abstract

We prove a quantitative connection between the expected sum of rewards of a policy and binary classification performance on created subproblems. This connection holds without any unobservable assumptions (no assumption of independence, small mixing time, fully observable states, or even hidden states) and the resulting statement is independent of the number of states or actions. The statement is critically dependent on the size of the rewards and prediction performance of the created classifiers.

We also provide some general guidelines for obtaining good classification performance on the created subproblems. In particular, we discuss possible methods for generating training examples for a classifier learning algorithm.

1. Introduction

Motivation. Reinforcement learning on real-world problems requires coping with large (possibly infinite) state, observation, and action spaces. However, many reinforcement learning algorithms (such as Q-Learning (Watkins, 1989) and other algorithms (Sutton & Barto, 1998)) are only tractable for small environments with relatively few states and actions. This disparity is commonly handled by adding a function approximation step to the learning algorithms developed for small problems.

This approach is problematic. On the theoretical side, strong and unrealistic assumptions about function approximation accuracy (such as assuming a small estimation error for the entire space) are required to make weak statements such as “the performance of an iteration does not degrade much” for approximate policy iteration (Bertsekas & Tsitsiklis, 1996). The practical symptom of these weak performance guarantees is that the application of reinforcement learning to real-world problems typically requires the attention of experts, many samples of experience, a great deal of computational time, or a combination of all three for success.

In order to eliminate this reliance on unrealistic function approximation, reinforcement learning algorithms must be explicitly designed to be robust against prediction errors. The method used here relates reinforcement learning performance to classification performance through a reduction from reinforcement learning to classification. Note that we are not claiming that “reinforcement learning is classification” because knowing what to predict and predicting it well does not solve the temporal credit assignment problem: how are past actions in an interactive environment related to future reward?

A Simple Relationship. The goal of reinforcement learning is to find a policy

π : (O × A × R)* × O → A

which maps a history of (observation, action, reward) sequences and a new observation to an action so as to maximize the expected sum of rewards in an unknown environment. The goal of (binary) classifier learning is to produce a classifier

c : X → {0, 1}

which maps features to a binary label so as to minimize the error rate with respect to some unknown distribution.

A simple way to connect reinforcement learning to classifier learning is to notice that a policy can be represented by a binary classifier which, given observations as input, predicts the correct action. Since for any 2-action environment there is a sequence of correct actions that maximizes the rewards, choosing the correct actions would be equivalent to choosing the correct labels and, consequently, minimizing the error rate of the classifier.
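To make the correspondence concrete, the following minimal Python sketch shows how a learned binary classifier over observation features induces a policy for a 2-action environment. The class and method names are our own illustrative assumptions, not constructs from the paper.

```python
# A binary classifier over observations induces a policy for a 2-action
# environment by interpreting the predicted label directly as the action.

class ThresholdClassifier:
    """Stand-in for any learned binary classifier c : X -> {0, 1}."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return 1 if features[0] > self.threshold else 0

class ClassifierPolicy:
    """Wraps a classifier so it can be used as a policy over observations."""
    def __init__(self, classifier):
        self.classifier = classifier

    def act(self, observation_features):
        return self.classifier.predict(observation_features)

policy = ClassifierPolicy(ThresholdClassifier(threshold=0.5))
print(policy.act([0.2]))   # -> action 0
print(policy.act([0.8]))   # -> action 1
```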
However, some important issues are left open:

1. What if the number of actions is greater than 2?

2. What if the classifier is sometimes wrong? Shouldn't later errors alter the correctness of earlier decisions?

3. If a classifier learning algorithm optimizes classification performance to some small error rate, does that imply a good policy? What if the environment is fundamentally noisy?

Results. In section 2, we address each of these issues and quantify the relationship between classification error rate and the expected sum of rewards performance of an induced policy in noisy environments with infinite states, multiple actions, and varying stochastic rewards. The analysis holds without any of the common assumptions used in past reinforcement learning work such as independence, Markovian state transitions and observable states. This is possible because we condition the performance of a policy on the performance of a binary classifier.

Note, however, that the existence of a relationship between policy performance and classifier error rate does not provide any guidance on how to obtain a classifier with a small error rate. In section 3 we provide some general guidelines for obtaining such a classifier. In particular, we discuss possible methods for generating representative examples from which to train a classifier learning algorithm.

1.1. Past Work

The RLGen analysis (Langford & Zadrozny, 2003) and algorithm relates the error rate of a classifier on certain examples created with a generative model to the performance of a policy. The work presented here is significantly more general in scope. In particular, it does not necessarily assume a generative model.

The Policy Search by Dynamic Programming (Bagnell et al., 2004) analysis is similar to the one presented here, so we make pointwise comparisons throughout the paper. A very rough summary is that we require fewer assumptions, achieve tighter bounds without a dependence on the distance between two measures, and the analysis motivates different design choices.

The Conservative Policy Iteration algorithm (Kakade & Langford, 2002) modifies approximate policy iteration so that policy updates result in improvement whenever a certain regression problem is solved well. This technique is complementary to the analysis presented here.

The algorithms of Lagoudakis and Parr (2003) and Fern et al. (2004) rely upon approximate policy iteration in order to construct a policy formed from a classifier. Thus, they inherit the non-robust guarantees of the approximate policy iteration framework. Some earlier work (see theorem 3.1 of Khardon (1999)) can be understood as relating policy learning to classification.

2. A Reduction of Reinforcement Learning to Classification

In this section we connect the prediction ability of a classifier learning algorithm on certain problems to the performance of a reinforcement learning algorithm. This form of analysis¹, known as a reduction, has some subtle implications. In particular, assuming that the classifier learning algorithm succeeds implies that:

1. We do not need to consider exploration explicitly.

2. We do not need to generate training sets for classifier learning.

3. We do not need to make assumptions (such as smoothness, Markovian dynamics, observable states and fast mixing) about the environment which are typically introduced in order to create predictability.

What is the analysis good for if it hides crucial issues in reinforcement learning? It gives us insight into what are the critical prediction problems necessary for solving reinforcement learning and the relative difficulty of these problems. This may lead to algorithms which attempt to more directly answer these prediction problems, yielding better performance.

¹ One way to think about this form of analysis is “20 questions with faults”: I promise to answer any questions about the location of a hidden object but lie some number of times. Your goal (as the designer) is to come up with a good set of questions such that if I lie rarely, you still have good performance.

2.1. Definitions

We define the reinforcement learning problem in a very general manner that can capture most standard definitions as special or equivalent cases. (See the Appendix for a discussion.) The reason for doing this is that the analysis becomes both more general and simpler.
Inspired by predictive state representations (Singh et al., 2004) we avoid the use of “state” in the definition of reinforcement learning. The following definition is for a decision process (DP).

Definition 2.1 (Reinforcement Learning Problem) A reinforcement learning problem D is defined as a conditional probability table D(o′, r | (o, a, r)*, o, a) on a set of observations O and rewards r ∈ [0, ∞) given any (possibly empty) history (o, a, r)* of past observations, actions (from action set A), and rewards.

For simplicity, we avoid constraining a policy to be stationary and define the goal of reinforcement learning as optimizing the T-step sum of rewards.

Definition 2.2 (Reinforcement Learning Goal) Given some horizon T, find a policy π : (o, a)* → a, optimizing the expected sum of rewards:

η(D, π) = E_{(o,a,r)^T ∼ D,π} [ Σ_{t=1}^T r_t ]

where r_t is the t-th observed reward. Here, the expectation is over the process which generates a history using D and choosing actions from π.

2.2. The Reduction

Preparation. Given the above definition of a reinforcement learning problem, there exists some optimal policy π* which maximizes the expected sum of rewards for any valid history. We use π* in defining a sequence of classification problems in the reduction.

The reduction initially assumes that we can solve multiclass reward-sensitive classification problems². In reward-sensitive classification, the rewards for predicting a label correctly or incorrectly for an example are not necessarily fixed at 1 and 0 respectively (as in usual accuracy-maximizing classification) but can be any positive real number. We rely on other work (Langford & Beygelzimer, 2005) which states that any cost-sensitive (or equivalently, reward-sensitive) classification problem can be broken down into multiple binary classification problems.

² These problems can also be thought of as “do regression and pick the best”.

Algorithm. The reduction consists of sequentially building reward-sensitive classification problems for each time step, always assuming that the optimal actions will be executed in the remaining steps. More concretely (a code sketch follows the list):

1. Given the empty history, predict the action which optimizes the expected sum of rewards assuming that π* is followed for the next T − 1 steps.

2. Given that we use the first predictor, predict the action which optimizes the expected sum of rewards assuming that π* is followed for the next T − 2 steps.

3. Given that we use the first and second predictors, predict the action which optimizes the expected sum of rewards assuming π* is followed for the next T − 3 steps. And so forth.
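The following Python-style sketch mirrors the construction above in its simplest accuracy-style form (the paper keeps the full vector of action rewards for reward-sensitive learning). The helper names, and in particular the oracle q_star for the reward-to-go under π*, are our own assumptions; section 3 discusses how such information can actually be obtained.

```python
def build_problems(roll_in, actions, q_star, T, m, learn):
    """Sequentially build the T classification problems of the reduction.

    roll_in(classifiers) -> (history, obs): sample a step-t situation by running
        the already-learned classifiers c_1..c_{t-1} from the start of an episode.
    q_star(history, obs, a) -> float: expected sum of remaining rewards if action a
        is taken now and the optimal policy pi* is followed afterwards (assumed oracle).
    learn(examples) -> classifier: any learner mapping (history, obs) to an action.
    """
    classifiers = []
    for t in range(1, T + 1):
        examples = []
        for _ in range(m):
            history, obs = roll_in(classifiers)       # conditioned on earlier predictors
            label = max(actions, key=lambda a: q_star(history, obs, a))
            examples.append(((history, obs), label))
        classifiers.append(learn(examples))           # the step-t classifier c_t
    return classifiers
```

The induced policy simply acts with c_t at step t; this is the composite policy analyzed below.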
Discussion. These problems are not “predict according to the measure over states induced by π* at time step T” as in Bagnell et al. (2004). Instead, we take into account the policies that were constructed in previous steps. We chose to construct the problems in this manner because this allows stronger performance guarantees. Similarly, the prediction problems are ordered from first-to-last rather than last-to-first as in Bagnell et al. (2004). With the last-to-first approach a wrong action choice made in the first step renders later prediction performance irrelevant. With the first-to-last approach, a wrong action choice made in the last step given the first steps merely degrades performance. This observation allows us to eliminate a dependence on the “variation distance” between the measures induced by π* and π. This dependence is inherent to the algorithm and setting analyzed. From the reduction's viewpoint, the dependence is difficult to accept because it can be large even for π of equivalent performance to π*. (In experiments not reported here the ordering also appears to be important.)

Note that this way of constructing the prediction problems is not optimal in general. (It is, however, the best for which we know how to prove a performance guarantee.) The first classifier is learned assuming a perfect policy for the later steps. But because the later policy may be imperfect, the first classifier should be adjusted to take later mistakes into account. This can be done safely (without harming performance) using either an iteration-over-T classifier update mechanism (as in Bagnell et al. (2004)) or a mixture update mechanism (as in Langford and Zadrozny (2003)) similar to the conservative policy iteration algorithm (Kakade & Langford, 2002).

2.3. Analysis

Definitions. The reduction algorithm gives us a sequence of reward-sensitive classifiers c_1, ..., c_T, one for each time step. We can consider them together as one policy π defined by:

π((o, a, r)^{t−1}, o_t) = c_t((o, a, r)^{t−1}, o_t)

where o_t is the t-th observation and (o, a, r)^{t−1} is the first t − 1 observed (observation, action, reward) triples. Each classifier has a reward rate:

l_t(D, π, c_t) = E_{(a,o,r)^T ∼ D,(π,c_t,π*)} [ Σ_{t′=t}^T r_{t′} ]

where the composite policy (π, c_t, π*) uses π for t − 1 steps, then c_t for one step, and π* for the remaining T − t steps.

The regret of a policy is defined as the difference between the expected reward of the policy and the expected reward of the optimal policy:

ρ(D, π) = η(D, π*) − η(D, π)

Each classifier has a regret:

ρ_t(D, π, c_t) = l_t(D, π, π*_t) − l_t(D, π, c_t)

Lemma 2.3 (Relative RL Regret) For every reinforcement learning problem D and every policy π = c_1, ..., c_T, we have:

ρ(D, π) = Σ_{t=1}^T ρ_t(D, π, c_t)

This is the core technical result. The most similar previous result was the PSDP analysis (Bagnell et al., 2004). The PSDP analysis makes an explicit assumption that the distribution over states of a near optimal policy is known and characterizes the performance in terms of the distance between the distribution over states of the learned policy and the near optimal policy. The analysis here is more basic. In particular, the same assumption could be added here, yielding a first-to-last PSDP-like analysis and explicit algorithm.

Proof.

ρ(D, π) = η(D, π*) − η(D, π)
        = E_{(a,o,r)^T ∼ D,π*} [ Σ_{t=1}^T r_t ] − E_{(a,o,r)^T ∼ D,π} [ Σ_{t=1}^T r_t ]
        = Σ_{t=1}^T ( E_{(a,o,r)^T ∼ D,(π,π*_t,π*)} [ Σ_{t′=1}^T r_{t′} ] − E_{(a,o,r)^T ∼ D,(π,c_t,π*)} [ Σ_{t′=1}^T r_{t′} ] )
        = Σ_{t=1}^T ( E_{(a,o,r)^T ∼ D,(π,π*_t,π*)} [ Σ_{t′=t}^T r_{t′} ] − E_{(a,o,r)^T ∼ D,(π,c_t,π*)} [ Σ_{t′=t}^T r_{t′} ] )
        = Σ_{t=1}^T [ l_t(D, π, π*_t) − l_t(D, π, c_t) ]
        = Σ_{t=1}^T ρ_t(D, π, c_t)

The first, second, fifth and sixth equalities are definitions. The third equality follows from the “telescoping property” of sums (Σ_{i=1}^T (a_{i−1} − a_i) = a_0 − a_T). The fourth equality follows because Σ_{t′=1}^{t−1} r_{t′} has an identical expectation for both processes.
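As a concrete sanity check on Lemma 2.3, the following self-contained Python script constructs a tiny deterministic decision process with random rewards, computes the policy regret ρ(D, π) for a fixed suboptimal policy, and verifies that it equals the sum of the per-step regrets ρ_t(D, π, c_t). The toy construction and the names in it are ours, not the paper's.

```python
import itertools
import random

T = 3
ACTIONS = (0, 1)
random.seed(0)

# A tiny deterministic decision process: the reward received at step t depends on
# the full sequence of actions taken so far (observations play no role here).
reward = {t: {p: random.random() for p in itertools.product(ACTIONS, repeat=t)}
          for t in range(1, T + 1)}

def future_value(prefix):
    """Optimal sum of rewards from step len(prefix)+1 through T."""
    t = len(prefix) + 1
    if t > T:
        return 0.0
    return max(reward[t][prefix + (a,)] + future_value(prefix + (a,)) for a in ACTIONS)

def optimal_action(prefix):
    t = len(prefix) + 1
    return max(ACTIONS, key=lambda a: reward[t][prefix + (a,)] + future_value(prefix + (a,)))

def run(policy_fns):
    """Per-step rewards obtained by following policy_fns[t-1] at step t."""
    prefix, rewards = (), []
    for t in range(1, T + 1):
        prefix += (policy_fns[t - 1](prefix),)
        rewards.append(reward[t][prefix])
    return rewards

c = [lambda prefix: 0] * T            # a (suboptimal) learned policy: always action 0
pi_star = [optimal_action] * T

rho = sum(run(pi_star)) - sum(run(c))                       # policy regret rho(D, pi)

def l(t, step_policy):
    """Reward rate l_t: roll in with c for t-1 steps, use step_policy at step t,
    then pi* afterwards; sum the rewards from step t onward."""
    return sum(run(c[:t - 1] + [step_policy] + pi_star[t:])[t - 1:])

rho_t = [l(t, optimal_action) - l(t, c[t - 1]) for t in range(1, T + 1)]
print(rho, sum(rho_t))                # the two sides of Lemma 2.3 agree
assert abs(rho - sum(rho_t)) < 1e-9
```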
This lemma shows that the regret of the composite policy created by the reduction can be decomposed into the regrets of the component classifiers. This has the following implications:

1. Small regret relative to the optimal policy can be achieved by solving only these T reward-sensitive classification problems. This is in contrast to methods like Q-learning (Watkins, 1989) which attempt to predict the value function of states and actions at many locations. Similarly, the algorithm E³ (Kearns & Singh, 1998) can be thought of as attempting to learn a model sufficiently powerful to approximately predict the value of all policies.

2. In the presence of errors, the optimal reward-sensitive classification problems that need to be solved are partially associated with the optimal policy and partially associated with earlier suboptimal choices.

Note again that this result does not tell us how to solve these prediction problems because it does not tell us how to construct training sets for reward-sensitive classifier learning at each step. The next section discusses some methods for solving these prediction problems.

To Binary Classification. This reduction can be easily combined with the SECOC (Sensitive Error Correcting Output Code) reduction (Langford & Beygelzimer, 2005) which reduces (multi-class) cost-sensitive classification to binary (accuracy-maximizing) classification. The relevant details of this reduction are that it maps cost-sensitive examples to binary classification examples and (hence) measures D on cost-sensitive examples to measures SECOC(D) over binary classification examples. For any binary classifier distribution D′, the error rate of a binary classifier c is defined as:

e(D′, c) = Pr_{(x,y) ∼ D′} (c(x) ≠ y)

And the binary classification regret is defined as:

ρ_b(D′, c) = e(D′, c) − min_{c′} e(D′, c′)

The above lemma and the SECOC reduction give the following result.

Theorem 2.4 (Reinforcement Learning To Binary Classification) For every reinforcement learning problem D and policy π defined by binary classifiers c_t obtained through the SECOC reduction, we have:

ρ(D, π) ≤ 4 Σ_{t=1}^T √( ρ_b(SECOC(D_t), c_t) Σ_a ρ_t(D_t, π, c_a) )

where c_a is a classifier that always takes action a and D_t is the distribution over sequences (o, a, r)^{t−1}, o_t.

Proof. The proof is a conjunction of an earlier theorem (Langford & Beygelzimer, 2005) that reduces cost-sensitive classification to binary classification and the Relative RL regret lemma 2.3.

Cost-sensitive classification is defined by a distribution D over examples (x, l_1, ..., l_k) where l_i ∈ [0, ∞) is the cost of predicting class i (without loss of generality, we assume that there exists an i with l_i = 0). The cost-sensitive loss is defined as:

l_cs(D, h) = E_{(x,l_1,...,l_k) ∼ D} [ l_{h(x)} ]

which is the expected cost of the choices made. Similarly the cost-sensitive regret is defined as:

ρ_cs(D, h) = l_cs(D, h) − min_{h′} l_cs(D, h′)

which is the expected cost of the choices made minus the expected cost of the optimal choices.

The SECOC reduction naturally maps D to a new distribution SECOC(D) over binary examples (x′, 0) or (x′, 1) and has the following guarantee for a multiclass classifier h_c formed out of the binary classifier c:

ρ_cs(D, h_c) ≤ 4 √( ρ_b(SECOC(D), c) E_{(x,l_1,...,l_k) ∼ D} Σ_i l_i )

For our reinforcement learning reduction, the number of classes is equal to the number of available actions (k = |A|) and the loss l_a is the difference in reward between acting according to the optimal action (then following π*) and acting according to a (then following π*). The expected value of this difference is the regret ρ_t(D_t, π, c_a).

Applying this analysis to each time step t and combining the results with lemma 2.3 gives the theorem. Explicitly,

ρ(D, π) = Σ_{t=1}^T ρ_t(D, π, c_t) = Σ_{t=1}^T ρ_cs(D_t, h_{c_t}) ≤ 4 Σ_{t=1}^T √( ρ_b(SECOC(D_t), c_t) Σ_a ρ_t(D_t, π, c_a) )

This theorem shows that the regret of the policy (on the left-hand side) is bounded by the regret of the binary classifiers ρ_b(SECOC(D_t), c_t) and a particular sum. The sum is over actions (implying that a problem with many actions is inherently more difficult) and the term is the regret of following action a and then acting according to the optimal policy rather than acting according to the optimal policy at timestep t. This last term is a time-dependent form of advantage (Baird, 1993) which implies that achieving small policy regret is harder when there is a large difference in the value of actions.

3. Obtaining Good Prediction Performance

Good classifier prediction performance is usually achieved through a combination of several factors including expert knowledge, the use of learning algorithms that are known to perform well for similar tasks and, perhaps most importantly, the availability of training examples drawn from the same distribution as the examples about which the classifier is expected to make predictions. In this section, we explain each of these factors and discuss how they affect the prediction performance needed for solving reinforcement learning problems.

3.1. Expert knowledge

In many practical machine learning situations, a human expert knows (at least implicitly) how to make correct predictions and learning is used to transfer this knowledge to a machine. This is true, for example, in text classification or face recognition. In these situations, it is often possible to facilitate learning by embedding the expert's knowledge into the feature representation. In other words, we can represent the examples using features that are known by the expert to be predictive for the particular problem in hand.

Another way of using expert knowledge to improve predictive performance is to use a prior (in a Bayesian setting) that favors solutions which seem more reasonable to the expert. This makes the task of learning easier since it reduces the space of possible solutions and, consequently, the amount of data that is needed for achieving a certain performance level.

3.2. Past performance in other problems

Traditionally, in applied machine learning research, we test classifier learning algorithms on many different problems and prefer algorithms that exhibit better performance across a wide variety of problems. This methodology implies that learning algorithms that are considered successful in practice have over time implicitly incorporated information about the characteristics of typical prediction problems.

One significant outstanding question in the context of reductions to classification is whether or not the prediction problems created through reductions share the characteristics of typical classification problems. If this is the case, classifier learners that have been developed using natural problems as benchmarks will tend to also be successful for problems created through reductions. Experimental results for some simple reductions (Zadrozny et al., 2003; Beygelzimer et al.,
2004; Langford & Zadrozny, 2005) suggest that this is true, but more work needs to be done.

3.3. Training examples

The main source of information for a classifier learning algorithm is a set of labeled training examples that inform the learner what are the correct labels for a set of observations. Most classifier learning algorithms are designed with the assumption that the training examples are independently drawn from the same distribution as the examples for which the classifier is expected to make predictions. This assumption cannot always be fulfilled for reinforcement learning problems. For example, suppose that we have an agent in the world who travels to a conference in Bonn, Germany. If this is the first trip abroad for the agent, it is difficult to think of any past experiences as being drawn from the same distribution as the agent is being tested on.

Nevertheless, there are several mechanisms available for providing examples from the (approximate) test distribution that rely on some piece of extra information (such as a human expert or a generative model). A code sketch of the first mechanism appears after the list.

1. RL as Information Transfer. There are a number of applications of reinforcement learning where a robot can be trained by a human (for example assembly line manufacture, game playing, etc.). If we think of the human as an oracle to the optimal policy, then a small set of questions can be directly asked and answered. This requires an amount of work proportional to O(mT²|A|) where m is the number of examples per time step, |A| is the number of actions and T is the number of time steps. The complexity is proportional to O(T²) because the human must guide the policy for O(T) timesteps for each of the T prediction problems.

2. Homing. A “homing” policy is a policy which allows an inexact reset to an initial state (or initial state distribution). This mechanism has been recently analyzed for reinforcement learning (Evan-Dar et al., 2005) with the result that it is possible to learn to act near optimally by analyzing the history of a process of intermittent homing, exploration and exploitation.

The internal representation of this algorithm (which, intuitively, consists of probability tables over observations given belief states and actions) can be used to choose near-optimal actions given past history. Consequently, the method here could be used to find a more compact collection-of-classifiers representation for the policy.

3. Resetting. Many reinforcement learning algorithms are applied in a setting where a “reset to the start” action is available. This might naturally apply, for example, to a vacuuming robot that always starts from the same location, and is easy to implement in simulators. The homing analysis applies here as well, and can be tightened by the removal of the need for homing.

4. Generative Model. A generative model answers questions of the form “Given the state s and action a, what is a draw for the next reward, observation and state?” This form of information is typically only available in the form of a simulator for the actual reinforcement learning problem. With this simulator, it is possible to behave according to a near optimal policy by repeatedly using a sparse sampling tree (Kearns et al., 1999). However, the computational complexity of this process is extreme, so solving the sequence of prediction problems discussed here may compile the information about what is the correct policy.

Somewhat surprisingly, it is possible to prove that small error rate classification performance with respect to samples extracted from an (exponentially less expensive) trajectory tree (Kearns et al., 2000) implies a near optimal policy (Langford & Zadrozny, 2003). This approach works in stochastic environments where it is nevertheless possible to predict a near optimal action and fails (via an inherently large error rate) in other stochastic environments.
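To make the first mechanism concrete, here is a minimal Python sketch of collecting training examples for the step-t problem: roll in with the already-learned classifiers, then let a human/expert oracle for the optimal policy guide the remaining steps, once per candidate action. The environment and oracle interfaces (env_reset, env_step, expert_action) are hypothetical names introduced only for this illustration.

```python
def rollout_value(env_reset, env_step, expert_action, classifiers, action, T):
    """One episode: roll in with the already-learned classifiers for t-1 steps
    (t = len(classifiers) + 1), take `action` at step t, then let the expert
    guide the remaining steps. Returns the step-t features and the sum of
    rewards observed from step t onward."""
    t = len(classifiers) + 1
    history, obs = [], env_reset()
    for c in classifiers:                      # steps 1 .. t-1
        a = c(history, obs)
        next_obs, r = env_step(a)
        history.append((obs, a, r))
        obs = next_obs
    features = (tuple(history), obs)
    future_reward, a = 0.0, action
    for step in range(t, T + 1):               # steps t .. T
        next_obs, r = env_step(a)
        history.append((obs, a, r))
        future_reward += r
        obs = next_obs
        if step < T:
            a = expert_action(history, obs)    # the human/expert oracle guides
    return features, future_reward

def collect_step_t_examples(env_reset, env_step, expert_action, actions, T, m, classifiers):
    """Collect m * |A| training examples (features, action, future reward) for the
    step-t problem. Across all T steps this costs O(m T^2 |A|) expert-guided
    transitions, matching the estimate in the text."""
    examples = []
    for _ in range(m):
        for a in actions:                      # one fresh expert-guided rollout per action
            features, future_reward = rollout_value(
                env_reset, env_step, expert_action, classifiers, a, T)
            examples.append((features, a, future_reward))
    return examples
```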
4. Discussion

A common complaint about learning theory in general is that it provides analysis for worst case scenarios. This can lead to unreasonably weak statements based on unreasonably strong assumptions. However, in the context of reinforcement learning, it is very difficult to justify the assumptions behind an average case analysis because we cannot easily assert that reinforcement learning problems are drawn from some fixed known probability distribution.

There is an alternative to the worst case/average case debate. We can do analysis subject to a new assumption. The assumption of prediction ability is a natural choice because we know simultaneously that in the worst case we cannot predict, yet for the cases that we care about we can often make successful predictions.

The result presented here can be thought about either narrowly or broadly. The narrow view is that we have made a quantitative connection between the ability of a classifier to perform well on certain problems and the performance of a policy. This statement provides intuition relating the minimum³ difficulty of solving reinforcement learning to the difficulty of solving classification.

³ It is only a minimum of course because we did not fully address how to gain the information necessary to learn these predictions. In other words we did not solve the temporal credit assignment problem.

The broader view is that we have created a new form of analysis, even in the context of machine learning reductions. Other machine learning reductions relate the performance of one task to the performance of another by a mapping from one task to another inherent in the learning process (as discussed in Beygelzimer et al. (2004)). The essential difference here is that we can analyze and discuss the quantitative relationship between learning problems without explicit reference to a training set. This change is undesirable because the analysis does not lead directly to a reinforcement learning algorithm. The change may also be a necessity in reinforcement learning because of the sometimes unavoidable need to predict without prior representative experience, as discussed in section 3.3.

References

Bagnell, D. (2004). Personal communication.

Bagnell, D., Kakade, S., Ng, A., & Schneider, J. (2004). Policy search by dynamic programming. Advances in Neural Information Processing Systems 16 (NIPS*2003).

Baird, L. C. (1993). Advantage updating (Technical Report). Wright Laboratory.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.

Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2004). Reductions between classification tasks. Preprint at http://hunch.net/~jl/projects/reductions/mc_to_c/mc_submission.ps.

Evan-Dar, E., Kakade, S., & Mansour, Y. (2005). Reinforcement learning in POMDPs without resets. Preprint at http://www.cis.upenn.edu/~skakade/papers/rl/learn_pomdp.ps.

Fern, A., Yoon, S., & Givan, R. (2004). Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems 16 (NIPS*2003).

Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002) (pp. 267–274).

Kearns, M., Ng, A., & Mansour, Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-1999) (pp. 1324–1331).

Kearns, M., Ng, A., & Mansour, Y. (2000). Approximate planning in large POMDPs via reusable trajectories. Advances in Neural Information Processing Systems 12 (NIPS*1999).

Kearns, M., & Singh, S. (1998). Near optimal reinforcement learning in polynomial time. Proceedings of the Fifteenth International Conference on Machine Learning (ICML-1998) (pp. 260–268).

Khardon, R. (1999). Learning to take actions. Machine Learning, 35, 57–90.

Lagoudakis, M., & Parr, R. (2003). Reinforcement learning as classification: Leveraging modern classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003) (pp. 424–431).

Langford, J., & Beygelzimer, A. (2005). Sensitive error correcting output codes. Proceedings of the Eighteenth Annual Conference on Learning Theory (COLT-2005). To appear.

Langford, J., & Zadrozny, B. (2003). Reducing T-step reinforcement learning to classification. Preprint at http://hunch.net/~jl/projects/reductions/RL_to_class/colt_submission.ps.

Langford, J., & Zadrozny, B. (2005). Estimating class membership probabilities using classifier learners. Proceedings of the Tenth International Workshop on AI and Statistics (AISTATS-2005) (pp. 198–205).

Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. Proceedings of the Twentieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-2004) (pp. 512–519).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Watkins, C. (1989). Learning from delayed rewards. Doctoral dissertation, Cambridge University.
Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive classification by cost-proportionate example weighting. Proceedings of the International Conference on Data Mining (ICDM-2003) (pp. 435–442).

Appendix

A. Relationships between different reinforcement learning problem and goal definitions

The definition of reinforcement learning on a decision process (DP) used in this paper is mathematically elegant (both general and simple). It is important, however, to understand how this definition relates to the definitions that have been used in most previous reinforcement learning work. There are two standard definitions: the MDP and the POMDP definitions.

Definition A.1 (MDP Reinforcement Learning) The reinforcement learning problem is defined by a set of states S, a set of actions A, an initial state distribution p(s), and two conditional probability tables: p(s′|s, a), giving the probability of state s′ when action a is executed in state s, and p(r|s′), giving the reward distribution for each state s′.

An MDP is a DP with the Markov assumption (the next state and reward distributions only depend on the current state and action). Thus, an MDP can be encoded by a DP through the mappings: s → o, p(o′|a, o), p(r|o′) → D(o′, r | (o, a, r)*, o, a) = D(o′, r | (o, a)). The last mapping is always feasible because D can express an arbitrary conditional distribution between r and o′, as is necessary. Note that because a history can be empty, the initial state distribution is p(o) = D(o′, r).

Another commonly used definition is the POMDP definition.

Definition A.2 (POMDP Reinforcement Learning) The reinforcement learning problem is defined by a set of states S, a set of actions A, a set of observations O, and two conditional probability tables: p(o, r|s), giving the joint probability of observation o and reward r for each state s, and p(s′|s, a), giving the probability of state s′ when action a is executed in state s.

An unconstrained POMDP and a DP are equally general. In particular, for any sequence of (observation, action, reward) triples, there exists a DP and a POMDP which predict the sequence with probability 1. This property means that POMDPs and DPs are “fully general” and assume no structure about the world. The advantage of DP is simplicity: there is no need to mention or use states.

For reinforcement learning goals, one common alternative is to consider the γ discounted sum of rewards.

Definition A.3 (Reinforcement Learning Goal) For some γ ∈ (0, 1), find a policy π : (o, a)* → a optimizing the expected sum of rewards:

η_γ(D, π) = E_{(o,a,r)* ∼ π,D} [ Σ_{t=1}^∞ γ^t r_t ]

where r_t is the t-th observed reward. Here, the expectation is over the process which generates a history using D and choosing actions from π.

One reason why this definition is convenient is that there exists a stationary (rather than nonstationary) policy that is optimal with respect to this goal for any MDP.

A basic statement can be made (Bagnell, 2004) relating this goal to the T-step goal used in this paper whenever rewards are bounded. In particular, for every reinforcement learning problem D, every γ, and every tolerance ε > 0, there exists a hard horizon T and a modified reinforcement learning problem D′(D) such that for all policies π we have |η_γ(D, π) − η(D′(D), π)| ≤ ε. (Here, we add the specification of the reinforcement learning problem to the performance measure.)

To do this, we first define⁴ T = (1/(1−γ)) ln(1/((1−γ)ε)). Then, we let D′(D) be the reinforcement learning problem which for any valid history of D with probability γ uses D to draw the next observation and reward and with probability 1−γ draws a special “end world” observation and 0 reward. Any history with an “end world” always results in an “end world” observation and 0 reward. The proof that this works is simple: the accumulated “end world” probability decreases the expected value of rewards at timestep t by γ^t and truncating the sum at T changes the expectation by at most ε.

⁴ This is for the case where r ∈ [0, 1]. The size of the reward interval rescales ε in general.

A similar transformation from T-step reward to γ-discounted reward can be defined by choosing γ very near to 1, then altering the reinforcement learning problem so all rewards after T timesteps are 0.

Together these transformations imply that there is little mathematical difference between the expressiveness of these two performance goals. We prefer the T-step formulation because of the simplicity attained by eliminating γ and avoiding infinite sums.