THE STRUCTURE OF INTELLIGENCE
Consider an algorithm y=A(f,x) which takes in a guess x at the solution to a certain problem f and outputs a (hopefully better) guess y at the solution. Assume that it is easy to compute and compare the quality Q(x) of guess x and the quality Q(y) of guess y. Assume also that A contains some parameter p (which may be a numerical value, a vector of numerical values, etc.), so that we may write y=A(f,x,p). Then, for a given set S of problems f whose solutions all lie in some set R, there may be some value p which maximizes the average over all f in S of the average over all x in R of Q(A(f,x,p)) - Q(x). Such a value of p will be called optimal
for S.

The determination of the optimal value of p for a given S can be a formidable optimization problem, even in the case where S has only one element. In practice, since one rarely possesses a priori information as to the performance of an algorithm under different parameter values, one is required to assess the performance of an algorithm with respect to different parameter values in a real-time fashion, as the algorithm operates. For instance, a common technique in numerical analysis is to try p=a for (say) fifty passes of A, then p=b for fifty passes of A, and then adopt the value that seems to be more effective on a semi-permanent basis. Our goal here is a more general approach.

Assume that A has been applied to various members of S from various guesses x, with various values of p. Let U denote the n x 2 matrix whose i'th row is (f_i, x_i), and let P denote the n x 1 vector whose i'th entry is p_i, where f_i, x_i and p_i are the values of f, x and p to which the i'th pass of A was applied. Let I denote the n x 1 vector whose i'th entry is Q(A(f_i,x_i,p_i)) - Q(x_i). The crux of adaptation is finding a connection between parameter values and performance; in terms of these matrices this implies that what one seeks is a function C(X,Y) such that ||C(U,P) - I|| is small, for some norm || ||.

So: once one has by some means determined C which thus relates U and I, then what? The overall object of the adaptation (and of A itself) is to maximize the size of I (specifically, the most relevant measure of size would seem to be the l_1 norm, according to which the norm of a vector is the sum of the absolute values of its entries). Thus one seeks to maximize the function C(X,Y) with respect to Y.
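
As a rough illustration of this setup, the following Python sketch records a number of passes of a hypothetical algorithm A, collects U, P and I, fits a model C by least squares, and then maximizes the fitted C over p. Everything specific in it is an assumption made for the example and not part of the text: the toy problem (minimize x^2, with Q(x) = -x^2), the choice of A as a gradient step with step size p, the quadratic form of C, and the simplification that C depends on p alone, so that it estimates the improvement averaged over the guesses x.

```python
import random

def quality(x):
    # Q(x) for the toy problem "minimize x^2": higher is better (assumed).
    return -x * x

def A(x, p):
    # One pass of a hypothetical algorithm: a gradient step with step size p.
    return x - p * 2.0 * x

def record_passes(n, seed=0):
    # Collect U, P and I from n passes with random guesses and parameters.
    # With a single fixed problem f, each row (f_i, x_i) of U reduces to x_i.
    rng = random.Random(seed)
    U, P, I = [], [], []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        p = rng.uniform(0.0, 1.0)
        U.append(x)
        P.append(p)
        I.append(quality(A(x, p)) - quality(x))
    return U, P, I

def solve3(M, v):
    # Gaussian elimination with partial pivoting for a 3x3 system M c = v.
    a = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    c = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = sum(a[r][k] * c[k] for k in range(r + 1, 3))
        c[r] = (a[r][3] - s) / a[r][r]
    return c

def fit_C(P, I):
    # Least-squares fit of C(p) = c0 + c1*p + c2*p^2 via the normal equations,
    # i.e. the coefficients minimizing the Euclidean norm of C(P) - I.
    powers = [[p ** k for k in range(3)] for p in P]
    M = [[sum(row[i] * row[j] for row in powers) for j in range(3)]
         for i in range(3)]
    v = [sum(row[i] * y for row, y in zip(powers, I)) for i in range(3)]
    return solve3(M, v)

def best_p(coeffs, grid=101):
    # Maximize the fitted C over a grid of candidate p values in [0, 1].
    c0, c1, c2 = coeffs
    candidates = [k / (grid - 1) for k in range(grid)]
    return max(candidates, key=lambda p: c0 + c1 * p + c2 * p * p)

U, P, I = record_passes(2000)
coeffs = fit_C(P, I)
p_star = best_p(coeffs)
print("fitted coefficients:", coeffs)
print("estimated optimal p:", p_star)
```

For this toy problem the improvement works out to Q(A(x,p)) - Q(x) = 4x^2 p(1-p), so the true optimum is p = 0.5 and the estimate should land near it. The sketch gathers all its data first and only then maximizes; it does not address how to interleave experimentation with use of the best p found so far.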
PARAMETER ADAPTATION AS A BANDIT PROBLEM
The problem here is that one must balance three tasks: experimenting with p so as to locate an accurate C, experimenting with P so as to locate a maximum of C with respect to Y, and at each stage implementing what seems, on the basis of current knowledge, to be the most appropriate p, so as to get the best answer out of A. This sort of predicament, in which one must balance experimental variation with use of the best results found through past experimentation, is known as a "bandit problem" (Gittins, 1989). The reason for the name is the following question: given a "two-armed bandit", a slot machine with two handles such that pulling each handle gives a possibly different payoff, according to what strategy should one distribute pulls among the two handles? If after a hundred pulls, the first handle seems to pay off twice as well, how much more should one pull the second handle just in case this observation is a fluke?

To be more precise, the bandit problem associated with adaptation of parameters is as follows. In practice, one would seek to optimize C(X,Y) with respect to Y by varying Y about the current