Neural Networks and the Bias/Variance Dilemma

Stuart Geman
Division of Applied Mathematics, Brown University, Providence, RI 02912 USA

Elie Bienenstock
René Doursat
ESPCI, 10 rue Vauquelin, 75005 Paris, France

Feedforward neural networks trained by error backpropagation are examples of nonparametric regression estimators. We present a tutorial on nonparametric inference and its relation to neural networks, and we use the statistical viewpoint to highlight strengths and weaknesses of neural models. We illustrate the main points with some recognition experiments involving artificial data as well as handwritten numerals. By way of conclusion, we suggest that current-generation feedforward neural networks are largely inadequate for difficult problems in machine perception and machine learning, regardless of parallel-versus-serial hardware or other implementation issues. Furthermore, we suggest that the fundamental challenges in neural modeling are about representation rather than learning per se. This last point is supported by additional experiments with handwritten numerals.

1 Introduction

Much of the recent work on feedforward artificial neural networks brings to mind research in nonparametric statistical inference. This is a branch of statistics concerned with model-free estimation, or, from the biological viewpoint, tabula rasa learning. A typical nonparametric inference problem is the learning (or "estimating," in statistical jargon) of arbitrary decision boundaries for a classification task, based on a collection of labeled (pre-classified) training samples. The boundaries are arbitrary in the sense that no particular structure, or class of boundaries, is assumed a priori. In particular, there is no parametric model, as there would be with a presumption of, say, linear or quadratic decision surfaces. A similar point of view is implicit in many recent neural network formulations, suggesting a close analogy to nonparametric inference.

Of course, statisticians who work on nonparametric inference rarely concern themselves with the plausibility of their inference algorithms as brain models, much less with the prospects for implementation in "neural-like" parallel hardware, but nevertheless certain generic issues are unavoidable and therefore of common interest to both communities. What sorts of tasks, for instance, can be learned, given unlimited time and training data? Also, can we identify "speed limits," that is, bounds on how fast, in terms of the number of training samples used, something can be learned?

Nonparametric inference has matured in the past 10 years. There have been new theoretical and practical developments, and there is now a large literature from which some themes emerge that bear on neural modeling. In Section 2 we will show that learning, as it is represented in some current neural networks, can be formulated as a (nonlinear) regression problem, thereby making the connection to the statistical framework. Concerning nonparametric inference, we will draw some general conclusions and briefly discuss some examples to illustrate the evident utility of nonparametric methods in practical problems. But mainly we will focus on the limitations of these methods, at least as they apply to nontrivial problems in pattern recognition, speech recognition, and other areas of machine perception.
These limitations are well known, and well understood in terms of what we will call the bias/variance dilemma. The essence of the dilemma lies in the fact that estimation error can be decomposed into two components, known as bias and variance; whereas incorrect models lead to high bias, truly model-free inference suffers from high variance. Thus, model-free (tabula rasa) approaches to complex inference tasks are slow to "converge," in the sense that large training samples are required to achieve acceptable performance. This is the effect of high variance, and is a consequence of the large number of parameters, indeed an infinite number in truly model-free inference, that need to be estimated. Prohibitively large training sets are then required to reduce the variance contribution to estimation error. Parallel architectures and fast hardware do not help here: this "convergence problem" has to do with training set size rather than implementation. The only way to control the variance in complex inference problems is to use model-based estimation. However, and this is the other face of the dilemma, model-based inference is bias-prone: proper models are hard to identify for these more complex (and interesting) inference problems, and any model-based scheme is likely to be incorrect for the task at hand, that is, highly biased.

The issues of bias and variance will be laid out in Section 3, and the "dilemma" will be illustrated by experiments with artificial data, as well as on a task of handwritten numeral recognition. Efforts by statisticians to control the tradeoff between bias and variance will be reviewed in Section 4. Also in Section 4, we will briefly discuss the technical issue of consistency, which has to do with the asymptotic (infinite-training-sample) correctness of an inference algorithm. This is of some recent interest in the neural network literature.

In Section 5, we will discuss further the bias/variance dilemma, and relate it to the more familiar notions of interpolation and extrapolation. We will then argue that the dilemma and the limitations it implies are relevant to the performance of neural network models, especially as concerns difficult machine learning tasks. Such tasks, due to the high dimension of the "input space," are problems of extrapolation rather than interpolation, and nonparametric schemes yield essentially unpredictable results when asked to extrapolate. We shall argue that consistency does not mitigate the dilemma, as it concerns asymptotic as opposed to finite-sample performance. These discussions will lead us to conclude, in Section 6, that learning complex tasks is essentially impossible without the a priori introduction of carefully designed biases into the machine's architecture. Furthermore, we will argue that, despite a long-standing preoccupation with learning per se, the identification and exploitation of the "right" biases are the more fundamental and difficult research issues in neural modeling. We will suggest that some of these important biases can be achieved through proper data representations, and we will illustrate this point by some further experiments with handwritten numeral recognition.
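Before the formal treatment of Section 3, the tradeoff can be made concrete with a small simulation. The following minimal Python sketch is an illustration of our own: the target function, noise level, sample size, and the two estimators (a rigid linear fit versus a flexible nearest-neighbor fit) are assumptions chosen for exposition, not the experiments reported later in this paper. It repeatedly draws small training sets from a known model and measures, at a single test point, the squared bias and the variance of each estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # Known target E[y|x]; illustrative choice only.
    return np.sin(2 * np.pi * x)

def draw_training_set(n):
    x = rng.uniform(0.0, 1.0, n)
    y = true_regression(x) + rng.normal(0.0, 0.3, n)   # noisy responses
    return x, y

def fit_linear(x, y):
    # Rigid parametric model: y ~ a + b*x (high bias, low variance).
    b, a = np.polyfit(x, y, 1)
    return lambda t: a + b * t

def fit_knn(x, y, k=3):
    # Flexible nonparametric model: k-nearest-neighbor average
    # (low bias, high variance for small samples).
    def f(t):
        idx = np.argsort(np.abs(x - t))[:k]
        return y[idx].mean()
    return f

x0, n, trials = 0.25, 20, 500
preds = {"linear": [], "knn": []}
for _ in range(trials):
    x, y = draw_training_set(n)
    preds["linear"].append(fit_linear(x, y)(x0))
    preds["knn"].append(fit_knn(x, y)(x0))

for name, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - true_regression(x0)) ** 2   # squared bias at x0
    var = p.var()                                   # variance at x0
    print(f"{name:6s}  bias^2 = {bias2:.4f}   variance = {var:.4f}")
```

With these choices, the linear fit shows a large squared bias and a small variance, while the nearest-neighbor fit shows the reverse; Section 3 makes this decomposition precise.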
2 Neural Models and Nonparametric Inference

2.1 Least-Squares Learning and Regression. A typical learning problem might involve a feature or input vector x, a response vector y, and the goal of learning to predict y from x, where the pair (x, y) obeys some unknown joint probability distribution, P. A training set (x1, y1), ..., (xN, yN) is a collection of observed (x, y) pairs containing the desired response y for each input x. Usually these samples are independently drawn from P, though many variations are possible. In a simple binary classification problem, y is actually a scalar, y ∈ {0, 1}, which may, for example, represent the parity of a binary input string coded by x, or the classification of a phoneme suitably coded by x, as a second example. The former is "degenerate" in the sense that y is uniquely determined by x, whereas the classification of a phoneme might be ambiguous. For clearer exposition, we will take the response y to be one-dimensional, although our remarks apply more generally.

The learning problem is to construct a function (or "machine") f(x), based on the data (x1, y1), ..., (xN, yN), so that f(x) approximates the desired response y. Typically, f is chosen to minimize some cost functional. For example, in feedforward networks (Rumelhart et al. 1986a,b), one usually forms the sum of observed squared errors,

    \sum_{i=1}^{N} [y_i - f(x_i)]^2                                            (2.1)

and f is chosen to make this sum as small as possible. Of course, f is really parameterized, usually by idealized "synaptic weights," and the minimization of equation 2.1 is not over all possible functions f, but over the class generated by all allowed values of these parameters. Such minimizations are much studied in statistics, since, as we shall later see, they are one way to estimate a regression. The regression of y on x is E[y | x], that is, that (deterministic) function of x that gives the mean value of y conditioned on x. In the degenerate case, that is, if the probability distribution P allows only one value of y for each x (as in the parity problem, for instance), E[y | x] is not really an average: it is just the allowed value. Yet the situation is often ambiguous, as in the phoneme classification problem.

Consider the classification example with just two classes: "Class A" and its complement. Let y be 1 if a sample x is in Class A, and 0 otherwise. The regression is then simply

    E[y | x] = P(y = 1 | x) = P(Class A | x)

the probability of being in Class A as a function of the feature vector x. It may or may not be the case that x unambiguously determines class membership, y. If it does, then for each x, E[y | x] is either 0 or 1: the regression is a binary-valued function. Binary classification will be illustrated numerically in Section 3, in a degenerate as well as in an ambiguous case.

More generally, we want to "fit the data," or, more accurately, the ensemble from which the data were drawn. The regression is an excellent solution, by the following reasoning. For any function f(x), and any fixed x,¹

    E[(y - f(x))^2 | x]
      = E[((y - E[y|x]) + (E[y|x] - f(x)))^2 | x]                              (2.2)
      = E[(y - E[y|x])^2 | x] + (E[y|x] - f(x))^2
          + 2 E[(y - E[y|x]) | x] (E[y|x] - f(x))
      = E[(y - E[y|x])^2 | x] + (E[y|x] - f(x))^2
          + 2 (E[y|x] - E[y|x]) (E[y|x] - f(x))
      = E[(y - E[y|x])^2 | x] + (E[y|x] - f(x))^2
      \ge E[(y - E[y|x])^2 | x]

Thus, at any given x, the mean squared error is minimized by taking f(x) = E[y | x]: among all functions of x, the regression is the best predictor of y given x in the mean-squared-error sense.

¹Here, and in what follows, E[· | x] denotes the conditional expectation given x, that is, the average of {·} taken with respect to the conditional probability distribution P(y | x).
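As a concrete illustration of least-squares learning in the sense of equation 2.1, the following minimal sketch fits a one-hidden-layer feedforward network, with gradients computed as in error backpropagation, to noisy samples of a known conditional mean. The architecture, learning rate, and data-generating model are illustrative assumptions only; they are not the networks or data used in the experiments reported in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic ensemble: E[y|x] = cos(3x); observations add independent noise.
N = 200
x = rng.uniform(-1.0, 1.0, (N, 1))
y = np.cos(3.0 * x) + rng.normal(0.0, 0.1, (N, 1))

# One-hidden-layer network f(x) = W2 tanh(W1 x + b1) + b2.
H = 10
W1 = rng.normal(0.0, 1.0, (1, H))
b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, 1))
b2 = np.zeros(1)

lr = 0.05
for step in range(5000):
    h = np.tanh(x @ W1 + b1)          # hidden activations, shape (N, H)
    pred = h @ W2 + b2                # network outputs, shape (N, 1)
    err = pred - y                    # residuals
    # Gradients of the mean squared error (equation 2.1 divided by N),
    # up to a constant factor of 2, obtained by backpropagation.
    gW2 = h.T @ err / N
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = x.T @ dh / N
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# The fitted f approximates the regression E[y|x] = cos(3x) on the test inputs.
xt = np.linspace(-1, 1, 5).reshape(-1, 1)
print(np.hstack([xt, np.tanh(xt @ W1 + b1) @ W2 + b2, np.cos(3 * xt)]))
```

After training, the fitted f approximates (up to estimation error) the regression of the synthetic ensemble, which is exactly the sense in which least-squares learning is regression estimation.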
Similar remarks apply to likelihood-based (instead of least-squares-based) approaches, such as the Boltzmann Machine (Ackley et al. 1985; Hinton and Sejnowski 1986). Instead of decreasing squared error, the Boltzmann Machine implements a Monte Carlo computational algorithm for increasing likelihood. This leads to the maximum-likelihood estimator of a probability distribution, at least if we disregard local maxima and other confounding computational issues. The maximum-likelihood estimator of a distribution is certainly well studied in statistics, primarily because of its many optimality properties. Of course, there are many other examples of neural networks that realize well-defined statistical estimators (see Section 5.1).

The most extensively studied neural network in recent years is probably the backpropagation network, that is, a multilayer feedforward network with the associated error backpropagation algorithm for minimizing the observed sum of squared errors (Rumelhart et al. 1986a). With this in mind, we will focus our discussion by addressing least-squares estimators almost exclusively. But the issues that we will raise are ubiquitous in the theory of estimation, and our main conclusions apply to a broader class of neural networks.

2.2 Nonparametric Estimation and Consistency. If the response variable is binary, y ∈ {0, 1}, and if y = 1 indicates membership in "Class A," then the regression is just P(Class A | x), as we have already observed. A decision rule, such as "choose Class A if P(Class A | x) > 1/2," then generates a partition of the range of x (call this range H) into H_A = {x : P(Class A | x) > 1/2} and its complement, H_A^c = H - H_A. Thus, x ∈ H_A is classified as "A," and x ∈ H_A^c is classified as "not A." It may be the case that H_A and H_A^c are separated by a regular surface (or "decision boundary"), planar or quadratic for example, or the separation may be highly irregular.

Given a sequence of observations (x1, y1), (x2, y2), ..., we can proceed to estimate P(Class A | x) (= E[y | x]), and hence the decision boundary, from two rather different philosophies. On the one hand, we can assume a priori that H_A is known up to a finite, and preferably small, number of parameters, as would be the case if H_A and H_A^c were linearly or quadratically separated; or, on the other hand, we can forgo such assumptions and "let the data speak for itself." The chief advantage of the former, parametric, approach is of course efficiency: if the separation really is planar or quadratic, then many fewer data are needed for accurate estimation than if we were to proceed without parametric specifications. But if the true separation departs substantially from the assumed form, then the parametric approach is destined to converge to an incorrect, and hence suboptimal, solution, typically (but depending on details of the estimation algorithm) to a "best" approximation within the allowed class of decision boundaries. The latter, nonparametric, approach makes no such a priori commitments.

The asymptotic (large-sample) convergence of an estimator to the object of estimation is called consistency. Most nonparametric regression algorithms are consistent, for essentially any regression function E[y | x].² This is indeed a reassuring property, but it comes with a high price: depending on the particular algorithm and the particular regression, nonparametric methods can be extremely slow to converge. That is, they may require very large numbers of examples to make relatively crude approximations of the target regression function. Indeed, with small samples the estimator may be too dependent on the particular samples observed, that is, on the particular realizations of (x, y) (we say that the variance of the estimator is high). Thus, for a fixed and finite training set, a parametric estimator may actually outperform a nonparametric estimator, even when the true regression is outside of the parameterized class.

²One has to specify the mode of convergence: the estimator is itself a function, and furthermore depends on the realization of a random training set (see Section 4.2). One also needs certain technical conditions, such as measurability of the regression.
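The decision rule discussed above is easy to state in code. The sketch below uses a k-nearest-neighbor average, one of the nonparametric estimators listed below, to estimate P(Class A | x) on a synthetic two-class problem, and classifies a test point as "A" whenever the estimate exceeds 1/2. The class-conditional model and all parameter settings are illustrative assumptions of our own, not data or experiments from this paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-class problem in the plane; the true P(Class A | x)
# below is an illustrative choice, not taken from the paper.
def p_class_a(x):
    return 1.0 / (1.0 + np.exp(-4.0 * (x[:, 0] ** 2 + x[:, 1] - 0.5)))

def sample(n):
    x = rng.uniform(-1.0, 1.0, (n, 2))
    y = (rng.uniform(size=n) < p_class_a(x)).astype(float)
    return x, y

def knn_estimate(x_train, y_train, x_query, k=15):
    # Nonparametric estimate of P(Class A | x): the fraction of the
    # k nearest training points that belong to Class A.
    d = np.linalg.norm(x_train[None, :, :] - x_query[:, None, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

x_train, y_train = sample(200)
x_test, y_test = sample(2000)

# Decision rule: choose Class A wherever the estimated probability exceeds 1/2.
p_hat = knn_estimate(x_train, y_train, x_test)
decision = (p_hat > 0.5).astype(float)
print("k-NN error rate:  ", np.mean(decision != y_test))

# Optimal (Bayes) rule: thresholds the true probabilities, unavailable in practice.
bayes = (p_class_a(x_test) > 0.5).astype(float)
print("Bayes error rate: ", np.mean(bayes != y_test))
```

Rerunning the sketch with larger training sets gives a feel for how the nonparametric rule approaches the optimal one, and with smaller training sets, for how its variance degrades performance.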
These issues of bias and variance will be further discussed in Section 3. For now, the important point is that there exist many consistent nonparametric estimators, for regressions as well as probability distributions. This means that, given enough training samples, optimal decision rules can be arbitrarily well approximated. These estimators are extensively studied in the modern statistics literature. Parzen windows and nearest-neighbor rules (see, e.g., Duda and Hart 1973; Härdle 1990), regularization methods (see, e.g., Wahba 1982) and the closely related method of sieves (Grenander 1981; Geman and Hwang 1982), projection pursuit (Friedman and Stuetzle 1981; Huber 1985), recursive partitioning methods such as "CART," which stands for "Classification and Regression Trees" (Breiman et al. 1984), Alternating Conditional Expectations, or "ACE" (Breiman and Friedman 1985), and Multivariate Adaptive Regression Splines, or "MARS" (Friedman 1991), as well as feedforward neural networks (Rumelhart et al. 1986) and Boltzmann Machines (Ackley et al. 1985; Hinton and Sejnowski 1986), are a few examples of techniques that can be used to construct consistent nonparametric estimators.

2.3 Some Applications of Nonparametric Inference. In this paper, we shall be mostly concerned with limitations of nonparametric methods, and with the relevance of these limitations to neural network models. But there is also much practical promise in these methods, and there have been some important successes.

An interesting and difficult problem in industrial "process specification" was recently solved at the General Motors Research Labs (Lorenzen 1988) with the help of the already mentioned CART method (Breiman et al. 1984). The essence of CART is the following. Suppose that there are m classes, y ∈ {1, 2, ..., m}, and a feature vector x. Based on a training sample (x1, y1), ..., (xN, yN), the CART algorithm constructs a partitioning of the (usually high-dimensional) domain of x into rectangular cells, and estimates the class probabilities {P(y = k | x) : k = 1, ..., m} within each cell. Criteria are defined that promote cells in which the estimated class probabilities are well-peaked around a single class, and at the same time discourage partitions into large numbers of cells relative to N. CART provides a family of recursive partitioning algorithms for approximately optimizing a combination of these competing criteria.
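The recursive-partitioning idea can be conveyed by a short sketch. The code below is not the CART procedure of Breiman et al. (1984), which relies on carefully constructed splitting criteria and cost-complexity pruning; it is a stripped-down illustrative variant of our own, with a Gini impurity splitting rule and a crude depth/cell-size stopping rule standing in for CART's penalty on the number of cells.

```python
import numpy as np

def gini(y):
    # Cell impurity: 1 - sum_k p_k^2, where p_k are the class proportions.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def leaf(y):
    # A terminal cell stores its estimated class probabilities.
    classes, counts = np.unique(y, return_counts=True)
    return {"probs": dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))}

def grow(x, y, depth=0, max_depth=4, min_cell=10):
    # Stop splitting when the cell is deep, small, or pure.
    if depth == max_depth or len(y) < min_cell or len(np.unique(y)) == 1:
        return leaf(y)
    best = None
    for j in range(x.shape[1]):                 # candidate split variable
        for t in np.unique(x[:, j])[:-1]:       # candidate split point
            left = x[:, j] <= t
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:                            # no admissible split
        return leaf(y)
    _, j, t = best
    left = x[:, j] <= t
    return {"var": j, "thr": t,
            "left": grow(x[left], y[left], depth + 1, max_depth, min_cell),
            "right": grow(x[~left], y[~left], depth + 1, max_depth, min_cell)}

def cell_probs(tree, xq):
    # Descend through the splits to the rectangular cell containing xq.
    while "var" in tree:
        tree = tree["left"] if xq[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["probs"]

# Tiny usage example on synthetic two-class data ("accept" vs. "scrap", say).
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, (300, 2))
y = (x[:, 0] + x[:, 1] > 1.0).astype(int)
tree = grow(x, y)
print(cell_probs(tree, np.array([0.9, 0.8])))
```

Each terminal cell is a rectangle in the domain of x and stores its estimated class probabilities, which are the objects used by the decision criteria described above.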
The GM problem solved by CART concerned the casting of certain engine-block components. A new technology known as lost-foam casting promises to alleviate the high scrap rate associated with conventional casting methods. A styrofoam "model" of the desired part is made, and then surrounded by packed sand. Molten metal is poured onto the styrofoam, which vaporizes and escapes through the sand. The metal then solidifies into a replica of the styrofoam model.

Many "process variables" enter into the procedure, involving the settings of various temperatures, pressures, and other parameters, as well as the detailed composition of the various materials, such as sand. Engineers identified 80 such variables that were expected to be of particular importance, and data were collected to study the relationship between these variables and the likelihood of success of the lost-foam casting procedure. (These variables are proprietary.) Straightforward data analysis on a training set of 470 examples revealed no good "first-order" predictors of the success of casts (a binary variable) among the 80 process variables. Figure 1 (from Lorenzen 1988) shows a histogram comparison for the variable that was judged to have the most visually disparate histograms among the 80 variables: the left histogram is from a population of scrapped casts, and the right is from a population of accepted casts. Evi-