Machine Learning, 22, 59-94 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Feature-Based Methods for Large Scale Dynamic Programming

JOHN N. TSITSIKLIS AND BENJAMIN VAN ROY
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract. We develop a methodological framework and present a few different ways in which dynamic programming and compact representations can be combined to solve large scale stochastic control problems. In particular, we develop algorithms that employ two types of feature-based compact representations, that is, representations that involve feature extraction and a relatively simple approximation architecture. We prove the convergence of these algorithms and provide bounds on the approximation error. As an example, one of these algorithms is used to generate a strategy for the game of Tetris. Furthermore, we provide a counter-example illustrating the difficulties of integrating compact representations with dynamic programming, which exemplifies the shortcomings of certain simple approaches.

Keywords: compact representation, curse of dimensionality, dynamic programming, features, function approximation, neuro-dynamic programming, reinforcement learning

1. Introduction

Problems of sequential decision making under uncertainty (stochastic control) have been studied extensively in the operations research and control theory literature for a long time, using the methodology of dynamic programming (Bertsekas, 1995). The "planning problems" studied by the artificial intelligence community are of a related nature although, until recently, this was mostly done in a deterministic setting leading to search or shortest path problems in graphs (Korf, 1987). In either context, realistic problems have usually proved to be very difficult, mostly due to the large size of the underlying state space or of the graph to be searched. In artificial intelligence, this issue is usually addressed by using heuristic position evaluation functions; chess playing programs are a prime example (Korf, 1987). Such functions provide a rough evaluation of the quality of a given state (or board configuration in the context of chess) and are used in order to rank alternative actions.

In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost to be incurred, as a function of the current state. Similarly with the artificial intelligence context, cost-to-go functions are used to assess the consequences of any given action at any particular state. Dynamic programming provides a variety of methods for computing cost-to-go functions. Due to the curse of dimensionality, however, the practical applications of dynamic programming are somewhat limited; they involve certain problems in which the cost-to-go function has a simple analytical form (e.g., controlling a linear system subject to a quadratic cost) or problems with a manageably small state space. In most of the stochastic control problems that arise in practice (control of nonlinear systems, queueing and scheduling, logistics, etc.), the state space is huge. For example, every possible configuration of a queueing system is a different state, and the number of states increases exponentially with the number of queues involved. For this reason, it is essentially impossible to compute (or even store) the value of the cost-to-go function at every possible state.
The most sensible way of dealing with this difficulty is to generate a compact parametric representation (compact representation, for brevity), such as an artificial neural network, that approximates the cost-to-go function and can guide future actions, much the same as the position evaluators are used in chess. Since a compact representation with a relatively small number of parameters may approximate a cost-to-go function, we are required to compute only a few parameter values rather than as many values as there are states.

There are two important preconditions for the development of an effective approximation. First, we need to choose a compact representation that can closely approximate the desired cost-to-go function. In this respect, the choice of a suitable compact representation requires some practical experience or theoretical analysis that provides some rough information on the shape of the function to be approximated. Second, we need effective algorithms for tuning the parameters of the compact representation. These two objectives are often conflicting. Having a compact representation that can approximate a rich set of functions usually means that there is a large number of parameters to be tuned and/or that the dependence on the parameters is nonlinear, and in either case, there is an increase in the computational complexity involved.

It is important to note that methods of selecting suitable parameters for standard function approximation are inadequate for approximation of cost-to-go functions. In function approximation, we are given training data pairs {(x_1, y_1), ..., (x_m, y_m)} and must construct a function y = f(x) that "explains" these data pairs. In dynamic programming, we are interested in approximating a cost-to-go function y = V(x) mapping states to optimal expected future costs. An ideal set of training data would consist of pairs {(x_1, y_1), ..., (x_m, y_m)}, where each x_i is a state and each y_i is a sample of the future cost incurred starting at state x_i when the system is optimally controlled. However, since we do not know how to control the system at the outset (in fact, our objective is to figure out how to control the system), we have no way of obtaining such data pairs. An alternative way of making the same point is to note that the desirability of a particular state depends on how the system is controlled, so observing a poorly controlled system does not help us tell how desirable a state will be when the system is well controlled. To approximate a cost-to-go function, we need variations of the algorithms of dynamic programming that work with compact representations.

The concept of approximating cost-to-go functions with compact representations is not new. Bellman and Dreyfus (1959) explored the use of polynomials as compact representations for accelerating dynamic programming. Whitt (1978) and Reetz (1977) analyzed approaches of reducing state space sizes, which lead to compact representations. Schweitzer and Seidmann (1985) developed several techniques for approximating cost-to-go functions using linear combinations of fixed sets of basis functions. More recently, reinforcement learning researchers have developed a number of approaches, including temporal-difference learning (Sutton, 1988) and Q-learning (Watkins and Dayan, 1992), which have been used for dynamic programming with many types of compact representation, especially artificial neural networks. Aside from the work of Whitt (1978) and Reetz (1977), the techniques that have been developed largely rely on heuristics.
In particular, there is a lack of formal proofs guaranteeing sound results. As one might expect from this, the methods have generated a mixture of success stories and failures. Nevertheless, the success stories, most notably the world-class backgammon player of Tesauro (1992), inspire great expectations in the potential of compact representations and dynamic programming.

The main aim of this paper is to provide a methodological foundation and a rigorous assessment of a few different ways that dynamic programming and compact representations can be combined to form the basis of a rational approach to difficult stochastic control problems. Although heuristics have to be involved at some point, especially in the selection of a particular compact representation, it is desirable to retain as much as possible of the non-heuristic aspects of the dynamic programming methodology. A related objective is to provide results that can help us assess the efficacy of alternative compact representations.

Cost-to-go functions are generally nonlinear, but often demonstrate regularities similar to those found in the problems tackled by traditional function approximation. There are several types of compact representations that one can use to approximate a cost-to-go function. (a) Artificial neural networks (e.g., multi-layer perceptrons) present one possibility. This approach has led to some successes, such as Tesauro's backgammon player, which was mentioned earlier. Unfortunately, it is very hard to quantify or analyze the performance of neural-network-based techniques. (b) A second form of compact representation is based on the use of feature extraction to map the set of states onto a much smaller set of feature vectors. By storing a value of the cost-to-go function for each possible feature vector, the number of values that need to be computed and stored can be drastically reduced and, if meaningful features are chosen, there is a chance of obtaining a good approximation of the true cost-to-go function. (c) A third approach is to choose a parametric form that maps the feature space to cost-to-go values and then try to compute suitable values for the parameters. If the chosen parametric representation is simple and structured, this approach may be amenable to mathematical analysis. One such approach, employing linear approximations, will be studied here.

In this paper, we focus on dynamic programming methods that employ the latter two types of compact representations, i.e., feature-based compact representations. We provide a general framework within which one can reason about such methods. We also suggest variants of the value iteration algorithm of dynamic programming that can be used in conjunction with the representations we propose. We prove convergence results for our algorithms and then proceed to derive bounds on the difference between optimal performance and the performance obtained using our methods. As an example, one of the techniques presented is used to generate a strategy for Tetris, the arcade game.

This paper is organized as follows. In Section 2, we introduce the Markov decision problem (MDP), which provides a mathematical setting for stochastic control problems, and we also summarize the value iteration algorithm and its properties. In Section 3, we propose a conceptual framework according to which different approximation methodologies can be studied.
To illustrate some of the difficulties involved with employing compact representations for dynamic programming, in Section 4, we describe a "natural" approach for dynamic programming with compact representations and then present a counter-example demonstrating the shortcomings of such an approach. In Section 5, we propose a variant of the value iteration algorithm that employs a look-up table in feature space rather than in state space. We also discuss a theorem that ensures its convergence and provides bounds on the accuracy of resulting approximations. Section 6 discusses an application of the algorithm from Section 5 to the game of Tetris. In Section 7, we present our second approximation methodology, which employs feature extraction and linear approximations. Again, we provide a convergence theorem as well as bounds on the performance it delivers. This general methodology encompasses many types of compact representations, and in Sections 8 and 9 we provide two subclasses: interpolative representations and localized basis function architectures. Two technical results that are central to our convergence theorems are presented in Appendices A and B. In particular, Appendix A proves a theorem involving transformations that preserve contraction properties of an operator, and Appendix B reviews a result on stochastic approximation algorithms involving maximum norm contractions. Appendices C and D provide proofs of the convergence theorems of Sections 5 and 7, respectively.

2. Markov Decision Problems

In this section, we introduce Markov decision problems, which provide a model for sequential decision making problems under uncertainty (Bertsekas, 1995).

We consider infinite horizon, discounted Markov decision problems defined on a finite state space S. Throughout the paper, we let n denote the cardinality of S and, for simplicity, assume that S = {1, ..., n}. For every state i ∈ S, there is a finite set U(i) of possible control actions and a set of nonnegative scalars p_ij(u), u ∈ U(i), j ∈ S, such that Σ_{j∈S} p_ij(u) = 1 for all u ∈ U(i). The scalar p_ij(u) is interpreted as the probability of a transition to state j, given that the current state is i and the control u is applied. Furthermore, for every state i and control u, there is a random variable c_iu, which represents the one-stage cost if action u is applied at state i. We assume that the variance of c_iu is finite for every i ∈ S and u ∈ U(i).

In this paper, we treat only Markov decision problems for which the transition probabilities p_ij(u) and expected immediate costs E[c_iu] are known. However, the ideas presented generalize to the context of algorithms such as Q-learning, which assume no knowledge of transition probabilities and costs.

A stationary policy is a function π defined on S such that π(i) ∈ U(i) for all i ∈ S. Given a stationary policy, we obtain a discrete-time Markov chain s^π(t) with transition probabilities

    P(s^π(t+1) = j | s^π(t) = i) = p_ij(π(i)).

Let β ∈ [0, 1) be a discount factor. For any stationary policy π and initial state i, the cost-to-go V_i^π is defined by

    V_i^π = lim_{T→∞} E[ Σ_{t=0}^{T} β^t c(t) | s^π(0) = i ],

where c(t) = c_{s^π(t), π(s^π(t))}. In much of the dynamic programming literature, the mapping from states to cost-to-go values is referred to as the cost-to-go function. However, since the state spaces we consider in this paper are finite, we choose to think of the mapping in terms of a cost-to-go vector whose components are the cost-to-go values of various states. Hence, given the cost-to-go vector V^π of policy π, the cost-to-go value of policy π at state i is the ith component of V^π.
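To make the definition concrete, the following sketch (ours, not from the paper) evaluates V^π for a small tabular problem by repeatedly backing up the expected one-stage cost plus the discounted cost-to-go of the successor states; the encoding of P and c as Python containers is a hypothetical choice.

```python
import numpy as np

# Hypothetical encoding: P[u][i] is the row of transition probabilities
# out of state i under action u, and c[i][u] is the expected one-stage
# cost E[c_iu].
def policy_cost_to_go(P, c, policy, beta, tol=1e-10):
    """Evaluate the cost-to-go vector V^pi of a stationary policy."""
    n = len(policy)
    V = np.zeros(n)
    while True:
        # Back up: expected cost under pi plus discounted continuation.
        V_new = np.array([c[i][policy[i]] + beta * (P[policy[i]][i] @ V)
                          for i in range(n)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Since β < 1, each backup is a contraction (a fact formalized below), so the loop terminates.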
The optimal cost-to-go vector V* is defined by

    V_i* = min_π V_i^π, i ∈ S.

It is well known that the optimal cost-to-go vector V* is the unique solution to Bellman's equation:

    V_i* = min_{u∈U(i)} ( E[c_iu] + β Σ_{j∈S} p_ij(u) V_j* ), ∀ i ∈ S. (1)

This equation simply states that the optimal cost-to-go starting from a state i is equal to the minimum, over all actions u that can be taken, of the immediate expected cost E[c_iu] plus the suitably discounted expected cost-to-go V_j* from the next state j, assuming that an optimal policy will be followed in the future.

The Markov decision problem is to find a policy π* such that V_i^{π*} = V_i* for all i ∈ S. This is usually done by computing V* and then choosing π* as a function which satisfies

    π*(i) = arg min_{u∈U(i)} ( E[c_iu] + β Σ_{j∈S} p_ij(u) V_j* ), ∀ i ∈ S.

More generally, given any cost-to-go vector V, we can generate a policy π_V satisfying

    π_V(i) = arg min_{u∈U(i)} ( E[c_iu] + β Σ_{j∈S} p_ij(u) V_j ), ∀ i ∈ S.

Intuitively, this policy considers actual immediate costs and uses V to judge future consequences of control actions. Such a policy is sometimes called a greedy policy with respect to the cost-to-go vector V, and as V approaches V*, the performance of a greedy policy π_V approaches that of an optimal policy π*.

There are several algorithms for computing V*, but we only discuss the value iteration algorithm, which forms the basis of the algorithms to be considered later on. We start with some notation. We define T_i : R^n → R by

    T_i(V) = min_{u∈U(i)} ( E[c_iu] + β Σ_{j∈S} p_ij(u) V_j ), ∀ i ∈ S. (2)

We then define the dynamic programming operator T : R^n → R^n by

    T(V) = (T_1(V), ..., T_n(V)).

In terms of this notation, Bellman's equation simply asserts that V* = T(V*), and V* is the unique fixed point of T. The value iteration algorithm is described by

    V(t+1) = T(V(t)),

where V(0) is an arbitrary vector in R^n used to initialize the algorithm. Intuitively, each V(t) is an estimate (though not necessarily a good one) of the true cost-to-go vector V*, which gets replaced by the hopefully better estimate T(V(t)).

Let ‖·‖_∞ be the maximum norm defined for every vector x = (x_1, ..., x_n) ∈ R^n by ‖x‖_∞ = max_i |x_i|. It is well known (Bertsekas, 1995) and easy to check that T is a contraction with respect to the maximum norm, that is, for all V, V' ∈ R^n,

    ‖T(V) − T(V')‖_∞ ≤ β ‖V − V'‖_∞.

For this reason, the sequence V(t) produced by the value iteration algorithm converges to V*, at the rate of a geometric progression. Unfortunately, this algorithm requires that we maintain and update a vector V of dimension n, and this is essentially impossible when n is extremely large.

For notational convenience, it is useful to define for each policy π the operator T^π : R^n → R^n by

    T_i^π(V) = E[c_{iπ(i)}] + β Σ_{j∈S} p_ij(π(i)) V_j,

for each i ∈ S. It is well known that T^π is also a contraction with respect to the maximum norm and that V^π is its unique fixed point (Bertsekas, 1995). Note that, for any vector V ∈ R^n, we have

    T(V) = T^{π_V}(V),

since the cost-minimizing control action in Equation (2) is given by the greedy policy.
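The value iteration recursion and the extraction of a greedy policy can be sketched as follows. This is a minimal illustration under the same hypothetical tabular encoding as the previous sketch, not the authors' implementation.

```python
import numpy as np

def value_iteration(P, c, beta, tol=1e-8):
    """Compute an approximation of V* by iterating V <- T(V), where T is
    the dynamic programming operator of Equation (2), then return a
    greedy policy with respect to the result."""
    n = len(c)
    V = np.zeros(n)
    while True:
        # T_i(V) = min_u ( E[c_iu] + beta * sum_j p_ij(u) V_j )
        TV = np.array([min(c[i][u] + beta * (P[u][i] @ V) for u in c[i])
                       for i in range(n)])
        if np.max(np.abs(TV - V)) < tol:   # maximum norm stopping rule
            break
        V = TV
    # Greedy policy pi_V: pick the cost-minimizing action at each state.
    policy = [min(c[i], key=lambda u: c[i][u] + beta * (P[u][i] @ V))
              for i in range(n)]
    return V, policy
```

The loop stores and updates all n components of V, which is exactly the limitation that motivates the compact representations of the next section.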
3. Compact Representations and Features

As mentioned in the introduction, the size of a state space typically grows exponentially with the number of variables involved. Because of this, it is often impractical to compute and store every component of a cost-to-go vector. We set out to overcome this limitation by using compact representations to approximate cost-to-go vectors. In this section, we develop a formal framework for reasoning about compact representations and features as groundwork for subsequent sections, where we will discuss ways of using compact representations for dynamic programming. The framework presented here is similar to that in (Schweitzer and Seidmann, 1985).

A compact representation can be thought of as a scheme for recording a high-dimensional cost-to-go vector V ∈ R^n using a lower-dimensional parameter vector W ∈ R^m (m ≪ n). Such a scheme can be described by a mapping Ṽ : R^m → R^n which, to any given parameter vector W, associates a cost-to-go vector Ṽ(W).

A feature is a function mapping the state space S into some finite set of feature values. For example, in the context of a queueing problem, a feature might focus on whether a queue is empty or not. Given a Markov decision problem, one may wish to use several features f_1, ..., f_K, each one being a function from the state space S into a finite set Q_k, k = 1, ..., K. Then, to each state i ∈ S, we associate the feature vector F(i) = (f_1(i), ..., f_K(i)). Such a feature vector is meant to represent the most salient properties of a given state. Note that the resulting set of all possible feature vectors is the Cartesian product of the sets Q_k, and its cardinality increases exponentially with the number of features.

In a feature-based compact representation, each component Ṽ_i of the mapping Ṽ is a function of the corresponding feature vector F(i) and of the parameter vector W (but is not an explicit function of the state value i). Hence, for some function g : (Π_{k=1}^K Q_k) × R^m → R,

    Ṽ_i(W) = g(F(i), W). (3)

If each feature takes real values, so that Q_k ⊂ R for all k, it is natural to define the function g over all possible real feature values, g : R^K × R^m → R, even though g will only ever be evaluated over a finite domain. Figure 1 illustrates the structure of a feature-based compact representation.

Figure 1. Block structure of a feature-based compact representation: a state i is mapped by the feature extraction F to a feature vector, which the approximation architecture g combines with the parameter vector W to produce a cost-to-go estimate.

In most problems of interest, V* is a highly complicated function of i. A representation like the one in Equation (3) attempts to break the complexity of V* into less complicated mappings g and F. There is usually a tradeoff between the complexity of g and F, and different choices lead to drastically different structures. As a general principle, the feature extraction function F is usually hand crafted and relies on whatever human experience or intelligence is available. The function g represents the choice of an architecture used for approximation, and the vector W contains the free parameters (or weights) of the chosen architecture. Once a compact representation is chosen, the values for the parameters are selected using some optimization algorithm, which (in the context of static function approximation) could range from linear regression to backpropagation in neural networks. In this paper, however, we will develop parameter selection techniques for dynamic programming rather than static function approximation. Let us first discuss some alternative architectures.

Look-Up Tables

One possible compact representation can be obtained by employing a look-up table in feature space, that is, by assigning one value to each point in the feature space. In this case, the parameter vector W contains one component for each possible feature vector. The function g acts as a hashing function, selecting the component of W corresponding to a given feature vector. In one extreme case, each feature vector corresponds to a single state, there are as many parameters as states, and Ṽ becomes the identity function. On the other hand, effective feature extraction may associate many states with each feature vector so that the optimal cost-to-go values of states associated with any particular feature vector are close. In this scenario, the feature space may be much smaller than the state space, reducing the number of required parameters. Note, however, that the number of possible feature vectors increases exponentially with the number of features. For this reason, look-up tables are only practical when there are very few features.
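As an illustration of the look-up table architecture, here is a minimal sketch (our own toy example, not from the paper) in which g acts as a hashing function selecting the component of W associated with a given feature vector. The two queue-emptiness features are hypothetical.

```python
import itertools

# Two hypothetical features of a queueing state (q1, q2): whether each
# queue is nonempty (values in {0, 1}).
def F(state):
    q1, q2 = state
    return (int(q1 > 0), int(q2 > 0))

# One parameter per point of the feature space Q_1 x Q_2.
feature_space = list(itertools.product([0, 1], repeat=2))
index = {fv: k for k, fv in enumerate(feature_space)}
W = [0.0] * len(feature_space)

def V_tilde(state, W):
    """Equation (3): V_i(W) = g(F(i), W), with g a table look-up."""
    return W[index[F(state)]]
```

Here four parameters stand in for an arbitrarily large number of queue configurations, which is precisely the point of the architecture.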
Using a look-up table in feature space is equivalent to partitioning the state space and then using a common value for the cost-to-go from all the states in any given partition. In this context, the set of states which map to a particular feature vector forms one partition. By identifying one such partition per possible feature vector, the feature extraction mapping F defines a partitioning of the state space. The function g assigns each component of the parameter vector to a partition. For conceptual purposes, we choose to view this type of representation in terms of state aggregation rather than feature-based look-up tables. As we will see in our formulation for Tetris, however, the feature-based look-up table interpretation is often more natural in applications.

We now develop a mathematical description of state aggregation. Suppose that the state space S = {1, ..., n} has been partitioned into m disjoint subsets S_1, ..., S_m, where m is the same as the dimension of the parameter vector W. The compact representations we consider take on the following form:

    Ṽ_i(W) = W_j, for i ∈ S_j.

There are no inherent limitations to the representational capability of such an architecture. Whatever limitations this approach may have are actually connected with the availability of useful features. To amplify this point, let us fix some ε > 0 and define, for all j,

    S_j = { i | jε ≤ V_i* < (j+1)ε }.

Using this particular partition, the function V* can be approximated with an accuracy of ε. The catch is, of course, that since V* is unknown, we are unable to form the sets S_j. A different way of making the same point is to note that the most useful feature of a state is its optimal cost-to-go but, unfortunately, this is what we are trying to compute in the first place.

Linear Architectures

With a look-up table, we need to store one parameter for every possible value of the feature vector F(i), and, as already noted, the number of possible values increases exponentially with the number of features. As more features are deemed important, look-up tables must be abandoned at some point and a different kind of parametric representation is called for. For instance, a representation of the following form can be used:

    Ṽ_i(W) = Σ_{k=1}^K W_k f_k(i). (4)

This representation approximates a cost-to-go function using a linear combination of features. This simplicity makes it amenable to rigorous analysis, and we will develop an algorithm for dynamic programming with such a representation. Note that the number of parameters only grows linearly with the number of features. Hence, unlike the case of look-up tables, the number of features need not be small. However, it is important to choose features that facilitate the linear approximation.

Many popular function approximation architectures fall in the class captured by Equation (4). Among these are radial basis functions, wavelet networks, polynomials, and more generally all approximation methods that involve a fixed set of basis functions. In this paper, we will discuss two types of these compact representations that are compatible with our algorithm: a method based on linear interpolation and localized basis functions.

Nonlinear Architectures

The architecture, as described by g, could be a nonlinear mapping such as a feedforward neural network (multi-layer perceptron) with parameters W. The feature extraction mapping F could be either entirely absent or it could be included to facilitate the job of the neural network.
Both of these options were used in the backgammon player of Tesauro (1992) and, as expected, the inclusion of features led to improved performance. Unfortunately, as was mentioned in the introduction, there is not much that can be said analytically in this context.

4. Least-Squares Value Iteration: A Counter-Example

Given a set of K samples {(i_1, V*_{i_1}), ..., (i_K, V*_{i_K})} of an optimal cost-to-go vector V*, we could approximate the vector with a compact representation Ṽ by choosing parameters W to minimize an error function such as

    Σ_{k=1}^K ( Ṽ_{i_k}(W) − V*_{i_k} )²,

i.e., by finding the "least-squares fit." Such an approximation conforms to the spirit of traditional function approximation. However, as discussed in the introduction, we do not have access to such samples of the optimal cost-to-go vector. To approximate an optimal cost-to-go vector, we must adapt dynamic programming algorithms such as the value iteration algorithm so that they manipulate parameters of compact representations.

For instance, we could start with a parameter vector W(0) corresponding to an initial cost-to-go vector Ṽ(W(0)), and then generate a sequence {W(t) | t = 1, 2, ...} of parameter vectors such that Ṽ(W(t+1)) approximates T(Ṽ(W(t))). Hence, each iteration approximates a traditional value iteration. The hope is that, by approximating individual value iterations in such a way, the sequence of approximations converges to an accurate approximation of the optimal cost-to-go vector.

It may seem as though any reasonable approximation scheme could be used to generate each approximate value iteration. For instance, the "least-squares fit" is an obvious candidate. This involves selecting W(t+1) by setting

    W(t+1) = arg min_W Σ_{i∈S} ( Ṽ_i(W) − T_i(Ṽ(W(t))) )². (5)

However, in this section we will identify subtleties that make the choice of criterion for parameter selection crucial. Furthermore, an approximation method that is compatible with one type of compact representation may generate poor results when a different compact representation is employed.
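To make the procedure concrete, here is a minimal sketch of least-squares value iteration, assuming a linear representation Ṽ(W) = FW so that each fit in Equation (5) has a closed form; the names and data structures are ours.

```python
import numpy as np

def least_squares_value_iteration(F, T, W0, num_iters):
    """F : (n, K) feature matrix; T : function mapping a cost-to-go
    vector in R^n to T(V) in R^n; W0 : initial parameter vector."""
    W = W0
    for _ in range(num_iters):
        target = T(F @ W)          # T(V(W(t))), computed exactly
        # Equation (5): least-squares fit of the representation to the
        # target vector.
        W, *_ = np.linalg.lstsq(F, target, rcond=None)
    return W
```

As the counter-example below shows, even this seemingly natural scheme can fail badly.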
We will now develop a counter-example that illustrates the shortcomings of such a combination of value iteration and least-squares approximation. This analysis is particularly interesting, since the algorithm is closely related to Q-learning (Watkins and Dayan, 1992) and temporal-difference learning (TD(λ)) (Sutton, 1988), with λ set to 0. The counter-example discussed demonstrates the shortcomings of some (but not all) variants of Q-learning and temporal-difference learning that are employed in practice.

Bertsekas (1994) described a counter-example to methods like the one defined by Equation (5). His counter-example involves a Markov decision problem and a compact representation that could generate a close approximation (in terms of Euclidean distance) of the optimal cost-to-go vector, but fails to do so when algorithms like the one we have described are used. In particular, the parameter vector does converge to some W*, but, unfortunately, this parameter vector results in a poor estimate of the optimal cost-to-go vector (in terms of Euclidean distance), that is,

    ‖Ṽ(W*) − V*‖_2 ≫ min_W ‖Ṽ(W) − V*‖_2, (6)

where ‖·‖_2 denotes the Euclidean norm. With our upcoming counter-example, we show that much worse behavior is possible: even when the compact representation allows a perfect approximation of the optimal cost-to-go function (i.e., min_W ‖Ṽ(W) − V*‖ = 0), the algorithm may diverge.

Figure 2. A counter-example.

Consider the simple Markov decision problem depicted in Figure 2. The state space consists of two states, x_1 and x_2, and at state x_1 a transition is always made to x_2, which is an absorbing state. There are no control decisions involved. All transitions incur zero cost. Hence, the optimal cost-to-go function assigns 0 to both states. Suppose a feature f is defined over the state space so that f(x_1) = 1 and f(x_2) = 2, and a compact representation of the form

    Ṽ_x(w) = w f(x)

is employed, where w is a scalar. When we set w = 0, we obtain Ṽ(0) = V*, so an exact representation of the optimal cost-to-go vector is possible.

Let us investigate the behavior of the least-squares value iteration algorithm with the Markov decision problem and compact representation we have described. The parameter w evolves as follows:

    w(t+1) = arg min_w Σ_{x∈{x_1,x_2}} ( Ṽ_x(w) − T_x(Ṽ(w(t))) )²
           = arg min_w [ (w − 2βw(t))² + (2w − 2βw(t))² ],

and, setting the derivative with respect to w to zero, we obtain

    w(t+1) = (6β/5) w(t).

Hence, if β > 5/6 and w(0) ≠ 0, the sequence {w(t)} diverges. Counter-examples involving Markov decision problems that allow several control actions at each state can also be produced. In that case, the least-squares approach to value iteration can generate poor control strategies even when the optimal cost-to-go vector can be represented exactly.
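The divergence is easy to reproduce numerically; the following fragment (ours) simply iterates the closed-form update derived above.

```python
# Numerical check of the counter-example: two states, f(x1) = 1,
# f(x2) = 2, all costs zero, discount factor beta > 5/6.

def lsq_iterate(w, beta):
    # Targets: T_x(V(w)) = beta * V_{x2}(w) = 2*beta*w for both states.
    # The least-squares fit over both states yields w <- (6*beta/5) * w.
    return 6 * beta / 5 * w

w, beta = 1.0, 0.9
for t in range(50):
    w = lsq_iterate(w, beta)
print(w)  # grows geometrically, since 6 * 0.9 / 5 = 1.08 > 1
```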
The shortcomings of straightforward procedures such as least-squares value iteration characterize the challenges involved with combining compact representations and dynamic programming. The remainder of this paper is dedicated to the development of approaches that guarantee more graceful behavior.

5. Value Iteration with Look-Up Tables

As a starting point, let us consider what is perhaps the simplest possible type of compact representation, that is, the feature-based look-up table representation described in Section 3. In this section, we discuss a variant of the value iteration algorithm that has sound convergence properties when used in conjunction with such representations. We provide a convergence theorem, which we formally prove in Appendix C. We also point out relationships between the presented algorithm and previous work in the fields of dynamic programming and reinforcement learning.

5.1. Algorithmic Model

As mentioned earlier, the use of a look-up table in feature space is equivalent to state aggregation. We choose this latter viewpoint in our analysis. We consider a partition of the state space S = {1, ..., n} into subsets S_1, S_2, ..., S_m; in particular, S = S_1 ∪ S_2 ∪ ... ∪ S_m and S_i ∩ S_j = ∅ for i ≠ j. Let Ṽ : R^m → R^n, the function which maps a parameter vector W to a cost-to-go vector, be defined by

    Ṽ_i(W) = W_j, ∀ i ∈ S_j.

Let N be the set of nonnegative integers. We employ a discrete variable t, taking on values in N, which is used to index successive updates of the parameter vector W. Let W(t) be the parameter vector at time t. Let T^j be an infinite subset of N indicating the set of times at which an update of the jth component of the parameter vector is performed.

For each set S_j, j = 1, ..., m, let p^j(·) be a probability distribution over the set S_j. In particular, for every i ∈ S_j, p^j(i) is the probability that a random sample from S_j is equal to i. Naturally, we have p^j(i) ≥ 0 and Σ_{i∈S_j} p^j(i) = 1. For each t ∈ T^j, let X_j(t) be a random representative of the set S_j, sampled according to the probability distribution p^j(·). We assume that each such sample is generated independently from everything else that takes place in the course of the algorithm.

The value iteration algorithm applied at state X_j(t) would update the value V_{X_j(t)}, which is represented by W_j, by setting it equal to T_{X_j(t)}(V). Given the compact representation, and given the current parameter vector W(t), we actually need to set W_j to T_{X_j(t)}(Ṽ(W(t))). However, in order to reduce the sensitivity of the algorithm to the randomness caused by the random sampling, W_j is updated in that direction with a small stepsize. We therefore end up with the following update formula:

    W_j(t+1) = (1 − α_j(t)) W_j(t) + α_j(t) T_{X_j(t)}(Ṽ(W(t))), t ∈ T^j, (7)
    W_j(t+1) = W_j(t), t ∉ T^j. (8)

Here, α_j(t) is a stepsize parameter between 0 and 1. In order to bring Equations (7) and (8) into a common format, it is convenient to assume that α_j(t) is defined for every j and t, but that α_j(t) = 0 for t ∉ T^j.

In a simpler version of this algorithm, we could define a single probability distribution p(·) over the entire state space S such that for each subset S_j we have Σ_{i∈S_j} p(i) > 0. Then, defining x(t) as a state sampled according to p(·), updates of the form

    W_j(t+1) = (1 − α_j(t)) W_j(t) + α_j(t) T_{x(t)}(Ṽ(W(t))), if x(t) ∈ S_j, (9)
    W_j(t+1) = W_j(t), if x(t) ∉ S_j, (10)

can be used. The simplicity of this version, primarily the fact that samples are taken from only one distribution rather than many, makes it attractive for implementation. This version has a potential shortcoming, though. It does not involve any adaptive exploration of the feature space; that is, the choice of the subset S_j to be sampled does not depend on past observations. This rules out the possibility of adapting the distribution to concentrate on a region of the feature space that appears increasingly significant as approximation of the cost-to-go function ensues. Regardless, this simple version is the one chosen for application to the Tetris playing problem, which is reported in Section 6.
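A minimal sketch of this simpler sampled variant (Equations (9) and (10)) appears below, under the hypothetical tabular encoding of the earlier sketches; the uniform sampling distribution and the stepsize rule are illustrative choices on our part.

```python
import numpy as np

def aggregated_value_iteration(P, c, beta, partition, m, num_updates,
                               seed=0):
    """Sampled value iteration with a state-aggregation representation.
    partition[i] gives the index j of the subset S_j containing state i;
    m is the number of subsets (the dimension of W)."""
    rng = np.random.default_rng(seed)
    partition = np.asarray(partition)
    n = len(partition)
    W = np.zeros(m)
    counts = np.zeros(m)                  # per-component update counts
    for _ in range(num_updates):
        x = int(rng.integers(n))          # sample x(t) from a uniform p(.)
        j = partition[x]
        V = W[partition]                  # V_i(W) = W_{partition[i]}
        Tx = min(c[x][u] + beta * (P[u][x] @ V) for u in c[x])
        counts[j] += 1
        alpha = 1.0 / counts[j]           # decaying stepsize (Assumption 1)
        W[j] = (1 - alpha) * W[j] + alpha * Tx   # Equation (9)
    return W
```

The stepsize 1/(number of updates of component j) is one simple choice satisfying the summability conditions introduced next.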
We view all of the variables introduced so far, namely α_j(t), X_j(t), and W(t), as random variables defined on a common probability space. The reason for α_j(t) being a random variable is that the decision whether W_j will be updated at time t (and, hence, whether α_j(t) will be zero or not) may depend on past observations. Let F(t) be the set of all random variables that have been realized up to and including the point at which the stepsize α_j(t) is fixed but just before X_j(t) is generated.

5.2. Convergence Theorem

Before stating our convergence theorem, we must introduce the following assumption concerning the stepsize sequence:

Assumption 1: (a) For all j, the stepsize sequence satisfies Σ_{t=0}^∞ α_j(t) = ∞, with probability 1. (b) For all j, Σ_{t=0}^∞ α_j²(t) < ∞, with probability 1.

THEOREM 1 Let Assumption 1 hold.

(a) With probability 1, the sequence W(t) converges to W*, the unique vector whose components solve the following system of equations:

    W_j = Σ_{i∈S_j} p^j(i) T_i(Ṽ(W)), ∀ j. (11)

Define V* as the optimal cost-to-go vector and e ∈ R^m by

    e_j = max_{i,i'∈S_j} |V_i* − V_{i'}*|, ∀ j ∈ {1, ..., m}.

Recall that π_{Ṽ(W*)} denotes a greedy policy with respect to the cost-to-go vector Ṽ(W*), i.e.,

    π_{Ṽ(W*)}(i) = arg min_{u∈U(i)} ( E[c_iu] + β Σ_{j∈S} p_ij(u) Ṽ_j(W*) ).

The following hold:

(b) ‖Ṽ(W*) − V*‖_∞ ≤ ‖e‖_∞ / (1 − β);

(c) ‖V^{π_{Ṽ(W*)}} − V*‖_∞ ≤ 2β‖e‖_∞ / (1 − β)²;

(d) there exists an example in which the bounds in (b) and (c) both hold with equality.

A proof of Theorem 1 is provided in Appendix C. We prove the theorem by showing that the algorithm corresponds to a stochastic approximation involving a maximum norm contraction, and then appeal to a theorem concerning asynchronous stochastic approximation due to Tsitsiklis (1994) (see also (Jaakkola, Jordan, and Singh, 1994)), which is discussed in Appendix B, and a theorem concerning multi-representation contractions, presented and proven in Appendix A.

5.3. The Quality of Approximations

Theorem 1 establishes that the quality of approximations is determined by the quality of the chosen features. If the optimal cost-to-go vector happens to be of the form Ṽ(W), so that ‖e‖_∞ = 0, then the computed parameter values deliver near optimal performance. This is a desirable property.

The distressing impact of Theorem 1 is the wide margin allowed by the error bound. As the discount factor approaches unity, the 1/(1−β) term explodes. Since discount factors close to one are most common in practice, this is a severe weakness. However, attainment of the worst-case bound in real world applications may be a rare event. These weak bounds are to be viewed as the minimal properties desired for a method to be sound. As we have seen in Section 4, even this much is not guaranteed by some seemingly natural approaches.

5.4. Role of the Sampling Distributions

The worst-case bounds provided by Theorem 1 are satisfied for any set of state-sampling distributions. The distribution of probability among states within a particular partition may be arbitrary. Sampling only a single state per partition constitutes a special case which satisfies this requirement. For this special case, a decaying stepsize is unnecessary. If a constant stepsize of one is used in such a setting, the algorithm becomes an asynchronous version of the standard value iteration algorithm applied to a reduced Markov decision problem that has one state per partition of the original state space; the convergence of such an algorithm is well known (Bertsekas, 1982; Bertsekas and Tsitsiklis, 1989). Such a state space reduction is analogous to that brought about by state space discretization, which is commonly applied to problems with continuous state spaces. Whitt (1978) considered this method of discretization and derived the bounds of Theorem 1 for the case where a single state is sampled in each partition. Our result can be viewed as a generalization of Whitt's, allowing the use of arbitrary sampling distributions.

When the state aggregation is perfect, in that the true optimal cost-to-go values of states in any particular partition are equal, the choice of sampling function is insignificant. This is because, independent of the distribution, the error bound is zero when there is no fluctuation of optimal cost-to-go values within any partition. In contrast, when V* fluctuates within partitions, the error achieved by a feature-based approximation can depend on the sampling distribution.

Though the derived bound limits the error achieved using any set of state distributions, the choice of distributions may play an important role in attaining errors significantly lower than this worst case bound. It often appears desirable to distribute the probability among many representative states in each partition. If only a few states are sampled, the error can be magnified if these states do not happen to be representative of the whole partition. On the other hand, if many states are chosen, and their cost-to-go values are in some sense averaged, a cost-to-go value representative of the entire partition may be generated. It is possible to develop heuristics to aid in choosing suitable distributions, but the relationship between sampling distributions and approximation error is not yet clearly understood or quantified.

5.5. Related Work

As was mentioned earlier, Theorem 1 can be viewed as an extension of the work of Whitt (1978). However, our philosophy is much different. Whitt was concerned with discretizing a continuous state space.
Our concern here is to exploit human intuition, concerning useful features and heuristic state sampling distributions, to drastically reduce the dimensionality of a dynamic programming problem.

Several other researchers have considered ways of aggregating states to tackle dynamic programming. For instance, Bertsekas and Castañon (1989) developed an adaptive aggregation scheme for use with the policy iteration algorithm. Rather than relying on feature extraction, this approach automatically and adaptively aggregates states during the course of an algorithm based on probability transition matrices under greedy policies.

The algorithm we have presented in this section is closely related to Q-learning and temporal-difference learning (TD(λ)) in the case where λ is set to 0. In fact, Theorem 1 can easily be extended so that it applies to TD(0) or Q-learning when used in conjunction with feature-based look-up tables. Since the convergence and efficacy of TD(0) and Q-learning in this setting have not been theoretically established in the past, our theorem sheds some light on these algorithms.

In considering what happens when applying the Q-learning algorithm to partially observable Markov decision problems, Jaakkola, Singh and Jordan (1995) prove a convergence result similar to part (a) of Theorem 1. Their analysis involves a scenario where the state aggregation is inherent because of incomplete state information (a policy must choose the same action within a group of states because there is no way a controller can distinguish between states in the group) and is thus not geared towards accelerating dynamic programming in general.

6. Example: Playing Tetris

As an example, we used the algorithm from the previous section to generate a strategy for the game of Tetris. In this section we discuss the process of formulating Tetris as a Markov decision problem, choosing features, and finally, generating and assessing a game strategy. The objective of this exercise was to verify that feature-based value iteration can deliver reasonable performance for a rather complicated problem. Our objective was not to construct the best possible Tetris player, and for this reason, no effort was made to construct and use sophisticated features.

6.1. Problem Formulation

We formulated the game of Tetris as a Markov decision problem, much in the same spirit as the Tetris playing programs of Lippmann, Kukolich and Singer (1993). Each state of the Markov decision problem is recorded using a two-hundred-dimensional binary vector (the wall vector), which represents the configuration of the current wall of bricks, and a seven-dimensional binary vector, which identifies the current falling piece. The screen is twenty squares high and ten squares wide, and each square is associated with a component of the wall vector. The component corresponding to a particular square is assigned 1 if the square is occupied by a brick and 0 otherwise. All components of the seven-dimensional vector are assigned 0 except for the one associated with the piece which is currently falling (there are seven types of pieces).

At any time, the set of possible decisions includes the locations and orientations at which we can place the falling piece on the current wall of bricks. The subsequent state is determined by the resulting wall configuration and the next random piece that appears. Since the resulting wall configuration is deterministic and there are seven possible pieces, there are seven potential subsequent states for any action, each of which occurs with equal probability.
An exception is when the wall is higher than sixteen rows. In this circumstance, the game ends, and the state is absorbing.

Each time an entire row of squares is filled with bricks, the row vanishes, and the portion of the wall previously supported falls by one row. The goal in our version of Tetris is to maximize the expected number of rows eliminated during the course of a game. Though we generally formulate Markov decision problems in terms of minimizing costs, we can think of Tetris as a problem of maximizing rewards, where rewards are negative costs. The reward of a transition is the number of rows immediately eliminated. To ensure that the optimal cost-to-go from each state is finite, we chose a discount factor of β = 0.9990.

In the vast majority of states, there is no scoring opportunity. In other words, given a random wall configuration and piece, chances are that no decision will lead to an immediate reward. When a human being plays Tetris, it is crucial that she makes decisions in anticipation of long-term rewards. Because of this, simple policies that play Tetris, such as those that make random decisions or even those that make greedy decisions (i.e., decisions that maximize immediate rewards with no concern for the future), rarely score any points in the course of a game. Decisions that deliver reasonable performance must reflect a degree of "foresight."

6.2. Some Simple Features

Since each combination of wall configuration and current piece constitutes a separate state, the state space of Tetris is huge. As a result, classical dynamic programming algorithms are inapplicable. Feature-based value iteration, on the other hand, can be used. In order to demonstrate this, we chose some simple features and applied the algorithm.

The two features employed in our experiments were the height of the current wall and the number of holes (empty squares with bricks both above and below) in the wall. Let us denote the set of possible heights by H = {0, ..., 20}, and the set of possible numbers of holes by L = {0, ..., 200}. We can then think of the feature extraction process as the application of a function F : S → H × L.

Note that the chosen features do not take into account the shape of the current falling piece. This may initially seem odd, since the decision of where to place a piece relies on knowledge of its shape. However, the cost-to-go function actually only needs to enable the assessment of alternative decisions. This would entail assigning a value to each possible placement of the current piece on the current wall. The cost-to-go function thus needs only to evaluate the desirability of each resulting wall configuration. Hence, features that capture salient characteristics of a wall configuration are sufficient.
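A minimal sketch of the two features follows, for a board stored as 20 rows of 10 cells, with 1 for brick and 0 for empty and row 0 at the top of the screen; the encoding is ours, not from the paper.

```python
def extract_features(board):
    """Return the feature vector (wall height, number of holes)."""
    num_rows, num_cols = len(board), len(board[0])
    # Wall height: number of rows from the topmost brick to the floor.
    occupied_rows = [r for r in range(num_rows) if any(board[r])]
    height = num_rows - occupied_rows[0] if occupied_rows else 0
    # Holes: empty squares with bricks both above and below them in the
    # same column.
    holes = 0
    for col in range(num_cols):
        column = [board[r][col] for r in range(num_rows)]
        for r in range(num_rows):
            if not column[r] and any(column[:r]) and any(column[r + 1:]):
                holes += 1
    return (height, holes)

# Example: an empty board maps to the feature vector (0, 0).
print(extract_features([[0] * 10 for _ in range(20)]))
```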
6.3. A Heuristic Evaluation Function

As a baseline Tetris-playing program, we produced a simple Tetris player that bases state assessments on the two features. The player consists of a quadratic function g : H × L → R which incorporates some heuristics developed by the authors. Then, although the composition of the feature extraction and the evaluation function, g ∘ F, is not necessarily an estimate of the optimal cost-to-go vector, the heuristic player follows a greedy policy based on the composite function. The average score of this Tetris player over a hundred games was 31 (rows eliminated). This may seem low, since arcade versions of Tetris drastically inflate scores. To gain perspective, though, we should take into account the fact that an experienced human Tetris player would take about three minutes to eliminate thirty rows.

6.4. Value Iteration with a Feature-Based Look-Up Table

We synthesized two Tetris playing programs by applying the feature-based value iteration algorithm. These two players differed in that each relied on a different state-sampling distribution.

The first Tetris player used the states visited by the heuristic player as sample states for value iterations. After convergence, the average score of this player on a hundred games was 32. The fact that this player does not do much better than the heuristic player is not surprising, given the simplicity of the features on which both players base position evaluations. This example reassures us, nevertheless, that feature-based value iteration converges to a reasonable solution.

We may consider the way in which the first player was constructed unrealistic, since it relies on an existing heuristic player for state sampling. The second Tetris player eliminates this requirement by using an ad hoc state sampling algorithm. In sampling a state, the sampling algorithm begins by sampling a maximum height for the wall of bricks from a uniform distribution. Then, for each square below this height, a brick is placed in the square with probability 1/2. Each unsupported row of bricks is then allowed to fall until every row is supported. The player based on this sampling function gave an average score of 11 (equivalent to a human game lasting about one and a half minutes).

The experiments performed with Tetris provide some assurance that feature-based value iteration produces reasonable control policies. In some sense, Tetris is a worst-case scenario for the evaluation of automatic control algorithms, since humans excel at Tetris. The goal of algorithms that approximate dynamic programming is to deliver reasonable control policies for large scale stochastic control problems that we have no other reasonable way of addressing. Such problems would not be natural to a human, and any reasonable policy generated by feature-based value iteration would be valuable. Furthermore, the features chosen for this study were very crude; perhaps with the introduction of more sophisticated features, feature-based value iteration would excel in Tetris.

As a parting note, an additional lesson can be drawn from the fact that the two strategies generated by feature-based value iteration were of such disparate quality. This is that the sampling distribution plays an important role.

7. Value Iteration with Linear Architectures

We have discussed the use of feature-based look-up tables with value iteration and how they can accelerate dynamic programming. However, employing a look-up table with one entry per feature vector is viable only when the number of feature vectors is reasonably small. Unfortunately, the number of possible feature vectors grows exponentially with the dimension of the feature space. When the number of features is fairly large, alternative compact representations, requiring fewer parameters, must be used. In this section, we explore one possibility, which involves a linear approximation architecture. More formally, we consider compact representations of the form

    Ṽ_i(W) = F(i)^T W, ∀ i ∈ S, (14)

where W ∈ R^K is the parameter vector, F(i) = (f_1(i), ..., f_K(i)) ∈ R^K is the feature vector associated with state i, and the superscript T denotes transpose.
This type of compact representation is very attractive since the number of parameters is equal to the dimension of F(i), rather than the number of elements in the feature space. We will describe a variant of the value iteration algorithm that, under certain assumptions on the feature mapping, is compatible with compact representations of this form, and we will provide a convergence result and bounds on the quality of approximations. Formal proofs are presented in Appendix D.

7.1. Algorithmic Model

Consider the following variant of the standard value iteration algorithm. At the outset, K representative states i_1, ..., i_K are chosen, where K is the dimension of the parameter vector. Each iteration generates an improved parameter vector W(t+1) from a parameter vector W(t) by evaluating T_i(Ṽ(W(t))) at states i_1, ..., i_K and then computing W(t+1) so that

    Ṽ_{i_k}(W(t+1)) = T_{i_k}(Ṽ(W(t))), for k = 1, ..., K.

In other words, the new cost-to-go estimate is constructed by fitting the compact representation to T(Ṽ(W(t))), where Ṽ(W(t)) is the previous cost-to-go estimate, by fixing the compact representation at i_1, ..., i_K. If suitable features and representative states are chosen, Ṽ(W(t)) may converge to a reasonable approximation of the optimal cost-to-go vector V*.

Such an algorithm has been considered in the literature (Bellman and Dreyfus (1959), Reetz (1977), Morin (1979)). Of these references, only Reetz (1977) establishes convergence and error bounds. However, Reetz's analysis is very different from what we will present and is limited to problems with a one-dimensional state space.

If we apply an algorithm of this type to the counter-example of Section 4, with K = 1 and i_1 = x_1, we obtain w(t+1) = 2βw(t), and if β > 1/2, the algorithm diverges. Thus, an algorithm of this type is only guaranteed to converge for a subclass of the linear representations described by Equation (14). To characterize this subclass, we introduce the following assumption, which restricts the types of features that may be employed:

Assumption 2: Let i_1, ..., i_K ∈ S be the pre-selected states used by the algorithm.
(a) The vectors F(i_1), ..., F(i_K) are linearly independent.
(b) There exists a scalar β' ∈ [β, 1) such that, for every state i ∈ S, there exist scalars θ_1(i), ..., θ_K(i) ∈ R with Σ_{k=1}^K |θ_k(i)| ≤ β'/β, satisfying

    F(i) = Σ_{k=1}^K θ_k(i) F(i_k).

In order to understand the meaning of this condition, it is useful to think about the feature space defined by {F(i) | i ∈ S} and its convex hull. In the special case where β' = β, and under the additional restrictions Σ_{k=1}^K θ_k(i) = 1 for all i and θ_k(i) ≥ 0, the feature space is contained in the (K−1)-dimensional simplex with vertices F(i_1), ..., F(i_K). Allowing β' to be strictly greater than β introduces some slack and allows the feature space to extend a bit beyond that simplex. Finally, if we only have the condition Σ_{k=1}^K |θ_k(i)| ≤ β'/β, the feature space is contained in the convex hull of the vectors ±(β'/β)F(i_1), ..., ±(β'/β)F(i_K).

The significance of the geometric interpretation lies in the fact that the extrema of a linear function within a convex polyhedron must be located at the corners. Formally, Assumption 2 ensures that, for every W ∈ R^K,

    max_{i∈S} |Ṽ_i(W)| ≤ (β'/β) max_{k∈{1,...,K}} |Ṽ_{i_k}(W)|.

The upcoming convergence proof capitalizes on this property.

To formally define our algorithm, we need a few preliminary notions. First, the representation described by Equation (14) can be rewritten as

    Ṽ(W) = MW, (15)

where M ∈ R^{n×K} is a matrix with the ith row equal to F(i)^T. Let L ∈ R^{K×K} be the matrix whose kth row is F(i_k)^T. Since the rows of L are linearly independent, there exists a unique matrix inverse L^{-1} ∈ R^{K×K}. We define M† ∈ R^{K×n} as follows.
For k ∈ {1, ..., K}, the i_k-th column of M† is the same as the kth column of L^{-1}; all other entries are zero. Assuming, without loss of generality, that i_1 = 1, ..., i_K = K, we have

    M† = [ L^{-1}  0 ], (16)

where 0 represents the K × (n−K) matrix that is identically zero. Hence, M† M = L^{-1} L = I, and M† is a left inverse of M.

Our algorithm proceeds as follows. We start by selecting a set of K states, i_1, ..., i_K, satisfying Assumption 2, as well as an initial parameter vector W(0). Then, successive parameter vectors are generated using the following update rule:

    W(t+1) = T'(W(t)) = M† T(M W(t)). (17)

7.2. Computational Considerations

We will prove shortly that the operation T' applied during each iteration of our algorithm is a contraction in the parameter space. Thus, the difference between an intermediate parameter vector W(t) and the limit W* decays exponentially with the time index t. Hence, in practice, the number of iterations required should be reasonable.

The reason for using a compact representation is to alleviate the computational time and space requirements of dynamic programming, which traditionally employs an exhaustive look-up table, storing one value per state. Even when the parameter vector is small and the approximate value iteration algorithm requires few iterations, the algorithm would be impractical if the computation of T' required time or memory proportional to the number of states. Let us determine the conditions under which T' can be computed in time polynomial in the number of parameters K rather than the number of states n.

The operator T' is defined by T'(W) = M† T(MW). Since M† only has K nonzero columns, only K components of T(MW) must be computed: we only need to compute T_i(MW) for i = i_1, ..., i_K. Each iteration of our algorithm thus takes time O(K² + K t_c), where t_c is the time taken to compute T_i(MW) for a given state i. For any state i, T_i(MW) takes on the form

    T_i(MW) = min_{u∈U(i)} ( E[c_iu] + β Σ_j p_ij(u) F(j)^T W ).

The amount of time required to compute Σ_j p_ij(u) F(j)^T W is O(N_s K), where N_s is the maximum number of possible successor states under any control action (i.e., states j such that p_ij(u) > 0). By considering all possible actions u ∈ U(i) in order to perform the required minimization, T_i(MW) can be computed in time O(N_s N_u K), where N_u is the maximum number of control actions allowed at any state. The computation of T' thus takes time O(N_u N_s K²).

Note that for many control problems of practical interest the number of control actions allowed at a state and the number of possible successor states grow exponentially with the number of state variables. For problems in which the number of possible successor states grows exponentially, methods involving Monte-Carlo simulation may be coupled with our algorithm to reduce the computational complexity to a manageable level. We do not discuss such methods in this paper since we choose to concentrate on the issue of compact representations. For problems in which the number of control actions grows exponentially, on the other hand, there is no satisfactory solution, except to limit choices to a small subset of allowed actions (perhaps by disregarding actions that seem "bad" a priori).

In summary, our algorithm is suitable for problems with large state spaces and can be modified to handle cases where an action taken at a state can potentially lead to any of a large number of successor states, but the algorithm is not geared to solve problems where an extremely large number of control actions is allowed.
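To make the iteration concrete, here is a minimal sketch (ours, under the hypothetical tabular encoding of the earlier sketches) of the update W(t+1) = M†T(MW(t)). It exploits the fact that only the K components T_{i_k}(MW) are needed; for brevity it forms the full product F @ W, whereas an implementation concerned with very large n would evaluate features only at successor states of i_1, ..., i_K.

```python
import numpy as np

def linear_value_iteration(P, c, beta, F, rep_states, num_iters):
    """F : (n, K) array of feature vectors (the matrix M of Eq. (15));
    rep_states : list of the K pre-selected states i_1, ..., i_K."""
    L = F[rep_states]              # K x K matrix with rows F(i_k)^T
    L_inv = np.linalg.inv(L)       # exists under Assumption 2(a)
    W = np.zeros(F.shape[1])
    for _ in range(num_iters):
        V = F @ W                  # V(W) = M W
        # Only the K components T_{i_k}(MW) are required by M_dagger.
        TV = np.array([min(c[i][u] + beta * (P[u][i] @ V) for u in c[i])
                       for i in rep_states])
        W = L_inv @ TV             # fit the representation exactly at the i_k
    return W
```

Solving L W(t+1) = T(MW(t)) restricted to the representative states is precisely the multiplication by M†, since the nonzero columns of M† select those K components.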
7.3. Convergence Theorem

Let us now proceed with our convergence result for value iteration with linear architectures.

THEOREM 2 Let Assumption 2 hold.

(a) There exists a vector W* such that W(t) converges to W*.

(b) T' is a contraction, with contraction coefficient β', with respect to the norm ‖·‖ on R^K defined by ‖W‖ = ‖MW‖_∞.

Let V* be the optimal cost-to-go vector, and define ε by

    ε = min_{W∈R^K} ‖V* − Ṽ(W)‖_∞.

Recall that π_{Ṽ(W*)} denotes a greedy policy with respect to the cost-to-go vector Ṽ(W*), i.e.,

    π_{Ṽ(W*)}(i) = arg min_{u∈U(i)} ( E[c_iu] + β Σ_j p_ij(u) Ṽ_j(W*) ).

The following hold:

(c) ‖V* − Ṽ(W*)‖_∞ ≤ 2ε / (1 − β');

(d) ‖V^{π_{Ṽ(W*)}} − V*‖_∞ ≤ 4βε / ((1 − β)(1 − β'));

(e) there exists an example for which the bounds of (c) and (d) hold with equality.

This result is analogous to Theorem 1. The algorithm is guaranteed to converge and, when the compact representation can perfectly represent the optimal cost-to-go vector, the algorithm converges to it. Furthermore, the accuracy of approximations generated by the algorithm decays gracefully as the propriety of the compact representation diminishes. The proof of this theorem involves a straightforward application of Theorem 3, concerning multi-representation contractions, and is presented in Appendix D.

Theorem 2 provides some assurance of reasonable behavior when feature-based linear architectures are used for dynamic programming. However, the theorem requires that the chosen representation satisfies Assumption 2, which seems very restrictive. In the next two sections, we discuss two types of compact representations that satisfy Assumption 2 and may be of practical use.

8. Interpolative Representations

One possible compact representation can be produced by specifying cost-to-go values at K states in the state space, and taking weighted averages of these values to generate cost-to-go values at other states. This approach is most natural when the state space is a grid of points in a Euclidean space. Then, if cost-to-go values at states sparsely distributed in the grid are specified, values at intermediate points can be generated via interpolation. Even in the case where the states do not occupy a Euclidean space, interpolation-based representations can be used in settings where there seems to be a small number of "prototypical" states that capture key features. Then, if cost-to-go values are specified at these states, cost-to-go values at other states can be generated as weighted averages of the cost-to-go values at the "prototypical" states.

For a more formal presentation of interpolation-based representations, let S = {1, ..., n} be the states in the original state space and let i_1, ..., i_K ∈ S be the states for which values are specified. The kth component of the parameter vector W ∈ R^K is the cost-to-go value assigned to state i_k, and the cost-to-go value at any other state is formed as a weighted average of these components.
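As an illustration of the interpolative architecture, the following sketch (our own toy example) stores values at two anchor states and linearly interpolates in between. Weights that are nonnegative and sum to one make F(i) = θ(i) satisfy Assumption 2 with β' = β, since then Σ_k |θ_k(i)| = 1 ≤ β'/β and the F(i_k) are the standard basis vectors.

```python
import numpy as np

def interpolated_values(theta, W):
    """V_i(W) = sum_k theta[i, k] * W_k: each state's value is a
    weighted average of the values stored at the K anchor states."""
    return theta @ W

# Example: 5 states on a line, values specified at states 0 and 4,
# with linear interpolation in between (rows sum to one).
theta = np.array([[1.00, 0.00],
                  [0.75, 0.25],
                  [0.50, 0.50],
                  [0.25, 0.75],
                  [0.00, 1.00]])
W = np.array([10.0, 2.0])   # values assigned to the two anchor states
print(interpolated_values(theta, W))
```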
Appendix A: Multi-Representation Contractions

In this appendix we remain entirely general, considering arbitrary finite-dimensional vector spaces. We are given mappings T: R^n → R^n, V: R^K → R^n, and V†: R^n → R^K. We employ a vector norm ||·|| on R^K and a vector norm ||·|| on R^n; although we use the same notation for both, the norm meant in a particular expression is determined by the dimension of the argument. Define the mapping T': R^K → R^K by T' = V† ∘ T ∘ V.

The hope is that V(W*) is close to a fixed point of T if W* is a fixed point of T'. Ideally, if the compact representation can exactly represent a fixed point V* ∈ R^n of T (that is, if there exists a W* such that V(W*) = V*), then W* should be a fixed point of T'. Furthermore, if the compact representation cannot exactly, but can closely, represent the fixed point V* of T, then W* should be such that V(W*) is close to V*. Clearly, choosing a mapping V for which V(W) can closely approximate fixed points of T requires some intuition about where fixed points should generally be found in R^n.

We make two sets of assumptions. The first concerns the mapping T.

Assumption 4 The mapping T is a contraction with contraction coefficient β ∈ (0,1) with respect to ||·||. Hence, for all V, V' ∈ R^n,

||T(V) − T(V')|| ≤ β ||V − V'||.

The second set of assumptions concerns the interplay of the mappings V and V†.

Assumption 5 For all W, W' ∈ R^K and all V, V' ∈ R^n:
(a) V†(V(W)) = W;
(b) ||V(W) − V(W')|| ≤ ||W − W'||;
(c) ||V†(V) − V†(V')|| ≤ ||V − V'||.

We then have the following result.

THEOREM 3 Let Assumptions 4 and 5 hold. Then:
(a) T' is a contraction with contraction coefficient β with respect to ||·||; in particular, T' has a unique fixed point W* ∈ R^K.
(b) Let V* be the unique fixed point of T and let ε = inf_W ||V* − V(W)||. Then,

||V* − V(W*)|| ≤ 2ε/(1−β).

Proof: For part (a), note that for any W, W' ∈ R^K,

||T'(W) − T'(W')|| = ||V†(T(V(W))) − V†(T(V(W')))|| ≤ ||T(V(W)) − T(V(W'))|| ≤ β||V(W) − V(W')|| ≤ β||W − W'||,

where the three inequalities follow from Assumptions 5(c), 4, and 5(b), respectively.

For part (b), let δ > 0. Choose W_opt ∈ R^K such that ||V* − V(W_opt)|| ≤ ε + δ. Then,

||W_opt − T'(W_opt)|| = ||V†(V(W_opt)) − V†(T(V(W_opt)))||
                      ≤ ||V(W_opt) − T(V(W_opt))||
                      ≤ ||V(W_opt) − V*|| + ||T(V*) − T(V(W_opt))||
                      ≤ (ε + δ) + β(ε + δ)
                      = (1 + β)(ε + δ).

Now we can place a bound on ||W* − W_opt||:

||W* − W_opt|| ≤ ||W* − T'(W_opt)|| + ||T'(W_opt) − W_opt|| ≤ β||W* − W_opt|| + (1 + β)(ε + δ),

and it follows that ||W* − W_opt|| ≤ (1 + β)(ε + δ)/(1 − β). Next, a bound can be placed on ||V* − V(W*)||:

||V* − V(W*)|| ≤ ||V* − V(W_opt)|| + ||V(W_opt) − V(W*)|| ≤ (ε + δ) + ||W_opt − W*|| ≤ 2(ε + δ)/(1 − β).

Since δ can be arbitrarily small, the proof is complete. □

Appendix B: Asynchronous Stochastic Approximation

Consider an algorithm that performs noisy updates of a vector V ∈ R^n, for the purpose of solving a system of equations of the form T(V) = V. Here, T is a mapping from R^n into itself. Let T_1, ..., T_n: R^n → R be the corresponding component mappings; that is, T(V) = (T_1(V), ..., T_n(V)) for all V ∈ R^n. Let N be the set of nonnegative integers, let V(t) be the value of the vector V at time t, and let V_i(t) denote its ith component. Let T^i be an infinite subset of N indicating the set of times at which an update of V_i is performed. We assume that

V_i(t+1) = V_i(t),  t ∉ T^i,   (B.1)

and

V_i(t+1) = V_i(t) + α_i(t)( T_i(V(t)) − V_i(t) + η_i(t) ),  t ∈ T^i.   (B.2)

Here, α_i(t) is a stepsize parameter between 0 and 1, and η_i(t) is a noise term. In order to bring Equations (B.1) and (B.2) into a unified form, it is convenient to assume that α_i(t) and η_i(t) are defined for every i and t, but that α_i(t) = 0 for t ∉ T^i.

Let F(t) be the set of all random variables that have been realized up to and including the point at which the stepsizes α_i(t) for the tth iteration are selected, but just before the noise term η_i(t) is generated. As in Section 2, ||·||∞ denotes the maximum norm.

The following assumption concerns the statistics of the noise.

Assumption 6
(a) For every i and t, we have E[η_i(t) | F(t)] = 0.
(b) There exist (deterministic) constants A and B such that

E[η_i²(t) | F(t)] ≤ A + B ||V(t)||∞²,  for all i, t.

We then have the following result (Tsitsiklis, 1994); related results are obtained in (Jaakkola, Jordan, and Singh, 1994).

THEOREM 4 Let Assumption 6 and Assumption 1 of Section 5 on the stepsizes α_i(t) hold, and suppose that the mapping T is a contraction with respect to the maximum norm. Then, V(t) converges to the unique fixed point V* of T, with probability 1.
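To make the update (B.2) concrete in the setting of Theorem 1, here is a small simulation sketch of the aggregated-state algorithm: one parameter per partition, refreshed from a state sampled within that partition. The data layout (g and P as dictionaries), the uniform within-partition sampling, and the harmonic stepsizes are our own illustrative choices; the analysis allows any fixed sampling distributions and any stepsizes satisfying Assumption 1.

```python
import numpy as np

def aggregated_value_iteration(g, P, beta, partitions, steps=20000, seed=0):
    """Simulate the noisy aggregated-state updates: one parameter W_j per
    partition S_j, refreshed from a state X_j sampled within the partition.

    g[i][u] : immediate cost of action u at state i
    P[i][u] : dict mapping successor state j to p_ij(u)
    Update (cf. (B.2)): W_j <- (1 - a) W_j + a * T_{X_j}(V(W)),
    where V(W)_i = W_j for every state i in S_j.
    """
    rng = np.random.default_rng(seed)
    part_of = {i: j for j, S in enumerate(partitions) for i in S}
    W = np.zeros(len(partitions))
    for t in range(1, steps + 1):
        a = 1.0 / t                        # one standard stepsize choice
        for j, S in enumerate(partitions):
            i = S[rng.integers(len(S))]    # X_j(t), sampled uniformly here
            bellman = min(
                g[i][u] + beta * sum(p * W[part_of[s]]
                                     for s, p in P[i][u].items())
                for u in P[i]
            )
            W[j] = (1 - a) * W[j] + a * bellman
    return W
```

Each W_j performs exactly the noisy update (B.2): the sampled Bellman value equals its conditional expectation plus a zero-mean noise term, which is the decomposition used in the proof that follows.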
Appendix C: Proof of Theorem 1

(a) To prove this result, we will bring the aggregated state value iteration algorithm into a form to which Theorems 3 and 4 (from Appendices A and B) can be applied, and we will then verify that the assumptions of these theorems are satisfied.

Let us begin by defining a function T': R^K → R^K which, in some sense, is a noise-free version of our update procedure on W(t). In particular, the jth component T'_j of T' is defined by

T'_j(W) = E[ T_{X_j}(V(W)) ].   (C.1)

The update equations (7) and (8) can then be rewritten as

W_j(t+1) = W_j(t)(1 − α_j(t)) + α_j(t)( T'_j(W(t)) + η_j(t) ),   (C.2)

where the random variable η_j(t) is defined by

η_j(t) = T_{X_j(t)}(V(W(t))) − E[ T_{X_j(t)}(V(W(t))) | F(t) ].

Given that each X_j(t) is a random sample from S_j whose distribution is independent of F(t), we obtain E[η_j(t) | F(t)] = E[η_j(t)] = 0.

Our proof consists of two parts. First, we use Theorem 3 to establish that T' is a maximum norm contraction. Once this is done, the desired convergence result follows from Theorem 4.

Let us verify the assumptions required by Theorem 3. First, let us define a function V†: R^n → R^K with components

V†_j(V) = E[ V_{X_j} ].

This function is a pseudo-inverse of V since, for any W ∈ R^K,

V†_j(V(W)) = E[ (V(W))_{X_j} ] = W_j,

because (V(W))_i = W_j for every state i ∈ S_j. We can express T' as T' = V† ∘ T ∘ V, to bring it into the form of Theorem 3. In this context, the vector norm we have in mind for both R^K and R^n is the maximum norm.

Assumption 4 is satisfied since T is a contraction with respect to the maximum norm. We will now show that T, V, and V† satisfy Assumption 5. Assumption 5(a) is satisfied since V† is a pseudo-inverse of V. Assumption 5(b) is satisfied since

||V(W) − V(W')||∞ = max_i |(V(W))_i − (V(W'))_i| = max_j |W_j − W'_j| = ||W − W'||∞.

Assumption 5(c) is satisfied because

||V†(V) − V†(V')||∞ = max_j |E[ V_{X_j} − V'_{X_j} ]| ≤ max_i |V_i − V'_i| = ||V − V'||∞.

Hence, Theorem 3 applies, and T' must be a maximum norm contraction with contraction coefficient β.

Since T' is a maximum norm contraction, Theorem 4 now applies as long as its assumptions hold. We have already shown that E[η_j(t) | F(t)] = 0, so Assumption 6(a) is satisfied. As for Assumption 6(b) on the variance of η_j(t), the conditional variance of our noise term satisfies

E[η_j²(t) | F(t)] ≤ E[ (T_{X_j(t)}(V(W(t))))² | F(t) ] ≤ ( max_i |T_i(V(W(t)))| )² ≤ ( max_{i,u} |g(i,u)| + β||W(t)||∞ )² ≤ A + B ||W(t)||∞²,

for some constants A and B. Hence, Theorem 4 applies and our proof is complete.

(b) If the maximum fluctuation of V* within a particular partition is ε, then the minimum error that can be attained using a single constant to approximate the cost-to-go values of every state within the partition is ε/2. This implies that min_W ||V(W) − V*||∞, the minimum error that can be attained by an aggregated state representation, is ε/2. Hence, applying the bound of Theorem 3(b) with ε/2 in place of ε, the result follows.

(c) Now that we have a bound on the maximum norm distance between the final cost-to-go estimate V(W*) and the true optimal cost-to-go vector, we can place a bound on the maximum norm distance between the cost-to-go vector of the greedy policy π(V(W*)) and the optimal cost-to-go vector, as follows. Let T̄ denote the dynamic programming operator under the policy π(V(W*)), and let V^{π(V(W*))} denote the cost-to-go vector of that policy, so that T̄(V^{π(V(W*))}) = V^{π(V(W*))}; since π(V(W*)) attains the minimum in the definition of T at V(W*), we also have T̄(V(W*)) = T(V(W*)). We then have

||V^{π(V(W*))} − V*||∞ ≤ ||V^{π(V(W*))} − T̄(V(W*))||∞ + ||T̄(V(W*)) − V*||∞
                      ≤ β||V^{π(V(W*))} − V(W*)||∞ + ||T(V(W*)) − T(V*)||∞
                      ≤ β||V^{π(V(W*))} − V*||∞ + β||V* − V(W*)||∞ + β||V(W*) − V*||∞.

It follows that

||V^{π(V(W*))} − V*||∞ ≤ (2β/(1−β)) ||V(W*) − V*||∞,

and combining this with the bound of part (b) completes the proof of part (c).
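The pseudo-inverse and nonexpansiveness properties verified above are easy to confirm numerically. The following randomized check of Assumption 5 for the aggregation mappings V and V† is our own sanity test, not part of the original argument:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 12, 3
partitions = np.array_split(np.arange(n), K)               # S_1, ..., S_K
Q = [rng.dirichlet(np.ones(len(S))) for S in partitions]   # sampling dists

def V(W):        # V(W)_i = W_j for every i in S_j
    out = np.empty(n)
    for j, S in enumerate(partitions):
        out[S] = W[j]
    return out

def V_dagger(v):  # V†_j(v) = E[v_{X_j}]
    return np.array([Q[j] @ v[S] for j, S in enumerate(partitions)])

for _ in range(1000):
    W1, W2 = rng.normal(size=K), rng.normal(size=K)
    v1, v2 = rng.normal(size=n), rng.normal(size=n)
    assert np.allclose(V_dagger(V(W1)), W1)                              # 5(a)
    assert np.max(np.abs(V(W1) - V(W2))) <= np.max(np.abs(W1 - W2)) + 1e-12   # 5(b)
    assert np.max(np.abs(V_dagger(v1) - V_dagger(v2))) <= np.max(np.abs(v1 - v2)) + 1e-12  # 5(c)
```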
(d) Consider the four-state Markov decision problem shown in Figure C.1. The states are x1, x2, x3, and x4, and we form two partitions, the first consisting of x1 and x2 and the second containing the remaining two states. All transition probabilities are one. No control decisions are made at states x1, x3, or x4. State x1 is a zero-cost absorbing state. In state x3, a transition to state x4 is inevitable and, likewise, when in state x4, a transition to state x3 always occurs. In state x2, two actions are allowed: move and stay. The transition cost for each state-action pair is deterministic, and the arc labels in Figure C.1 represent these values. Let ε be an arbitrary positive constant, and let b be defined as b = 2βε/(1−β) − δ, with δ ∈ [0, 2βε/(1−β)]. The optimal cost-to-go values at x1, x2, x3, and x4 are then 0, b, ε, and −ε, respectively.

Figure C.1. An example for which the bounds hold with equality.

Now, let us define the sampling distributions, within each partition, that will be used with the algorithm. In the first partition, we always sample x1, and in the second partition, we always sample x4. Consequently, the algorithm will converge to partition values w*_1 and w*_2 satisfying the following equations:

w*_1 = βw*_1,
w*_2 = −ε(1+β) + βw*_2.

It is not hard to see that the unique solution is w*_1 = 0 and w*_2 = −ε(1+β)/(1−β), so the approximation error at x3 is ε − w*_2 = 2ε/(1−β). The bound of part (b) is therefore satisfied with equality.

Consider the greedy policy with respect to w*. For δ > 0, the stay action is chosen at state x2, and the total discounted cost incurred starting at state x2 exceeds the optimal cost b. When δ = 0, both actions, stay and move, are legitimate choices; if stay is chosen, the bound of part (c) holds with equality. □

Appendix D: Proof of Theorem 2

(a) By defining V†(V) = M'V, we have T' = V† ∘ T ∘ V, which fits into the framework of multi-representation contractions. Our proof consists of a straightforward application of Theorem 3 from Appendix A (on multi-representation contractions); we must show that its technical assumptions are satisfied. To complete the multi-representation contraction framework, we must define a norm on the space of cost-to-go vectors and a norm on the parameter space. As a metric for parameter vectors, let us define a norm ||·|| by

||W|| = ||MW||∞.

Since M has full column rank, ||·|| has the standard properties of a vector norm. For cost-to-go vectors, we employ the maximum norm on R^n as our metric.

We know that T is a maximum norm contraction, so Assumption 4 is satisfied. Assumption 5(a) is satisfied since, for all W ∈ R^K,

V†(V(W)) = M'MW = W.

Assumption 5(b) follows from our definition of ||·||:

||V(W) − V(W')||∞ = ||MW − MW'||∞ = ||W − W'||.

Showing that Assumption 2 implies Assumption 5(c) is the heart of this proof. To do this, we must show that, for arbitrary cost-to-go vectors V and V',

||V†(V) − V†(V')|| ≤ ||V − V'||∞.   (D.1)

Define D = M(V†(V) − V†(V')) = MM'(V − V'). Under Assumption 2, for each state i ∈ S there exist nonnegative constants θ_1(i), ..., θ_K(i), with Σ_{k=1}^K θ_k(i) ≤ 1, such that m(i) = Σ_{k=1}^K θ_k(i) m(i_k), where m(i) denotes the ith row of M. Since m(i_k)L^{-1} is the kth unit vector, it follows that, for each i ∈ S,

D_i = Σ_{k=1}^K θ_k(i) (V_{i_k} − V'_{i_k}),

and therefore

|D_i| ≤ Σ_{k=1}^K θ_k(i) |V_{i_k} − V'_{i_k}| ≤ max_k |V_{i_k} − V'_{i_k}| ≤ ||V − V'||∞.

Since ||V†(V) − V†(V')|| = ||D||∞, Inequality (D.1) follows. Hence, Theorem 3 applies, implying parts (a), (b), and (c).

Part (d) can be proven using the same argument as in the proof of Theorem 1(c), and for part (e), we can use the same example as that used to prove part (d) of Theorem 1. □
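The step from Assumption 2 to inequality (D.1) can likewise be checked numerically. In the sketch below we synthesize a feature matrix M whose rows are convex combinations of K linearly independent anchor rows and confirm that MM' is nonexpansive in the maximum norm; the construction is our own illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 20, 4
anchors = [0, 1, 2, 3]
L = rng.normal(size=(K, K)) + 5 * np.eye(K)     # invertible anchor rows
theta = rng.dirichlet(np.ones(K), size=n)       # theta_k(i) >= 0, sum to 1
M = theta @ L                                   # every row a convex combination
M[anchors] = L                                  # anchor rows themselves
M_prime = np.zeros((K, n))
M_prime[:, anchors] = np.linalg.inv(L)          # left inverse of M

for _ in range(1000):
    V1, V2 = rng.normal(size=n), rng.normal(size=n)
    D = M @ M_prime @ (V1 - V2)                 # D as in Appendix D
    assert np.max(np.abs(D)) <= np.max(np.abs(V1 - V2)) + 1e-9   # (D.1)
```

The assertion holds because each component of MM'(V − V') is a convex combination of the differences at the anchor states, exactly as in the proof above.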
Notes

1. To those familiar with Q-learning or temporal-difference learning: the counterexample applies to cases where temporal-difference or Q-learning updates are performed at states that are sampled uniformly from the entire state space at all times. However, temporal-difference methods usually assume that the states are generated by following a randomly produced complete trajectory. In our example, this would correspond to starting at state x1, moving to state x2, and then taking an infinite number of transitions from state x2 to itself. If this mechanism is used, our example is no longer divergent, in agreement with the results of Dayan (1992).

2. We take the point of view that each of these samples is independently generated from the given probability distribution. If the samples were generated by a simulation experiment or Monte-Carlo simulation under some fixed policy, independence would be lost. This would complicate somewhat the convergence analysis.

3. The way in which a tie is resolved is inconsequential, so we have made a choice that eases the counting of the vector components required.

4. To really ensure a reasonable order of growth for the number of iterations required, we would have to characterize a probability distribution for the difference between the initial parameter vector W(0) and the goal W*, as well as how close to the goal W* the parameter vector W(t) must be in order for the error bounds to hold.

References

Bakshi, B.R., & Stephanopoulos, G. (1993). "Wave-Net: A Multiresolution, Hierarchical Neural Network with Localized Learning." AIChE Journal, vol. 39, no. 1, pp. 57-81.

Barto, A.G., Bradtke, S.J., & Singh, S.P. (1995). "Learning to Act Using Real-Time Dynamic Programming." Artificial Intelligence, vol. 72, pp. 81-138.

Bellman, R., & Dreyfus, S. (1959). "Functional Approximations and Dynamic Programming." Mathematical Tables and Other Aids to Computation, vol. 13, pp. 247-251.

Bertsekas, D.P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.

Bertsekas, D.P. (1995). "A Counterexample to Temporal Differences Learning." Neural Computation, vol. 7, pp. 270-279.

Bertsekas, D.P., & Castañon, D.A. (1989). "Adaptive Aggregation for Infinite Horizon Dynamic Programming." IEEE Transactions on Automatic Control, vol. 34, no. 6, pp. 589-598.

Bertsekas, D.P., & Tsitsiklis, J.N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ.

Dayan, P. (1992). "The Convergence of TD(λ) for General λ." Machine Learning, vol. 8, pp. 341-362.

Gordon, G.J. (1995). "Stable Function Approximation in Dynamic Programming." Technical Report CMU-CS-95-103, Carnegie Mellon University.

Jaakkola, T., Jordan, M.I., & Singh, S.P. (1994). "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms." Neural Computation, vol. 6, no. 6, pp. 1185-1201.

Jaakkola, T., Singh, S.P., & Jordan, M.I. (1995). "Reinforcement Learning Algorithm for Partially Observable Markovian Decision Processes." In Advances in Neural Information Processing Systems 7, J. Cowan, G. Tesauro, & D. Touretzky, editors. Morgan Kaufmann.

Korf, R.E. (1987). "Planning as Search: A Quantitative Approach." Artificial Intelligence, vol. 33, pp. 65-88.

Lippmann, R.P., Kukolich, L., & Singer, E. (1993). "LNKnet: Neural Network, Machine Learning, and Statistical Software for Pattern Classification." The Lincoln Laboratory Journal, vol. 6, no. 2, pp. 249-268.

Morin, T.L. (1978). "Computational Advances in Dynamic Programming." In Dynamic Programming and Its Applications, edited by Puterman, M.L. Academic Press.

Poggio, T., & Girosi, F. (1990). "Networks for Approximation and Learning." Proceedings of the IEEE, vol. 78, no. 9, pp. 1481-1497.

Reetz, D. (1977). "Approximate Solutions of a Discounted Markovian Decision Process." Bonner Mathematische Schriften.

Schweitzer, P.J., & Seidmann, A. (1985). "Generalized Polynomial Approximations in Markovian Decision Processes." Journal of Mathematical Analysis and Applications, vol. 110, pp. 568-582.

Sutton, R.S. (1988). "Learning to Predict by the Methods of Temporal Differences." Machine Learning, vol. 3, pp. 9-44.

Tesauro, G. (1992). "Practical Issues in Temporal Difference Learning." Machine Learning, vol. 8, pp. 257-277.

Tsitsiklis, J.N. (1994). "Asynchronous Stochastic Approximation and Q-Learning." Machine Learning, vol. 16, pp. 185-202.

Watkins, C.J.C.H., & Dayan, P. (1992). "Q-learning." Machine Learning, vol. 8, pp. 279-292.

Whitt, W. (1978). "Approximations of Dynamic Programs, I." Mathematics of Operations Research, vol. 3, pp. 231-243.

Received December 2, 1994
Accepted March 28, 1995
Final Manuscript October 13, 1995
