Capacity, Overfitting, Underfitting, Generalization

Generalization: the ability of our algorithm to perform well on new, previously unseen inputs.

Generalization error = test error.

In machine learning we want to achieve both a low training error and a low test error. We assume that the training and test data are generated by a probability distribution over datasets, called the data-generating process. We also make the i.i.d. assumptions:
- independent: the examples in each dataset (training and test) are independent of one another;
- identically distributed: the training set and the test set are generated by the same distribution, the data-generating distribution.

In ML we use the training set to reduce the training error, and then use the test set to compute the test error. The expected test error is greater than or equal to the expected training error.

The performance of an ML algorithm is therefore measured by its ability to:
1. make the training error small;
2. make the gap between training error and test error small.

Large training error → underfitting. Large gap between training and test error → overfitting.

We can control a model's tendency to overfit or underfit by changing its capacity.

Capacity: a model's ability to fit a wide variety of functions.
- Low capacity → can underfit.
- High capacity → can overfit.

[Figure 5.2: three fits to the same training set; panels labeled "Underfitting", "Appropriate capacity", "Overfitting".]

Figure 5.2: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.

[Figure 5.3: training error and generalization (test) error plotted against capacity; the two curves are separated by the generalization gap, with an underfitting zone to the left of the optimal capacity and an overfitting zone to the right.]

Figure 5.3: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
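To make the capacity discussion concrete, here is a minimal numpy sketch (an illustration added to these notes, not part of the source) that fits polynomials of degree 1, 2, and 9 to a small sample drawn from a quadratic, mirroring the three panels of Figure 5.2; the printed train/test errors trace the behavior summarized in Figure 5.3. The ground-truth function, noise level, and sample sizes are assumptions chosen only for illustration.

```python
# Sketch: polynomial degree as a stand-in for model capacity (assumed setup).
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return 1.5 * x**2 - x + 0.5                 # underlying quadratic

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_fn(x) + rng.normal(0.0, 0.1, size=n)
    return x, y

def fit_poly(x, y, degree):
    # Design matrix [1, x, ..., x^degree]; the pseudoinverse also covers the
    # exactly/underdetermined case, as in the degree-9 panel of Figure 5.2.
    X = np.vander(x, degree + 1, increasing=True)
    return np.linalg.pinv(X) @ y

def mse(w, x, y):
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

x_tr, y_tr = make_data(10)                      # small training set
x_te, y_te = make_data(200)                     # held-out test set

for degree in (1, 2, 9):                        # low / appropriate / high capacity
    w = fit_poly(x_tr, y_tr, degree)
    print(f"degree {degree}: train MSE = {mse(w, x_tr, y_tr):.4f}, "
          f"test MSE = {mse(w, x_te, y_te):.4f}")
```

With settings like these one typically sees the degree-1 fit with high training and test error (underfitting), the degree-2 fit with both errors low, and the degree-9 fit with near-zero training error but a much larger test error, i.e. a large generalization gap (overfitting).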
Bias and Variance

Let f(x) = E[y | x] denote the regression of y on x. For the purpose of approximating y at future observations of x, the estimator is constructed from a training set D = {(x_1, y_1), ..., (x_N, y_N)}, so we write f(x; D) instead of simply f(x). Given D, and given a particular x, a natural measure of the effectiveness of f as a predictor of y is the mean-squared error (where E[· | x, D] means expectation with respect to the probability distribution P):

    E[(y - f(x; D))^2 | x, D]
      = E[(y - E[y | x])^2 | x, D] + (f(x; D) - E[y | x])^2.

The first term, E[(y - E[y | x])^2 | x, D], does not depend on the data D or on the estimator f; it is simply the variance of y given x. Hence the squared distance to the regression function,

    (f(x; D) - E[y | x])^2,

measures, in a natural way, the effectiveness of f as a predictor of y. The mean-squared error of f as an estimator of the regression E[y | x] is

    E_D[(f(x; D) - E[y | x])^2],

where E_D represents expectation with respect to the training set D, that is, the average over the ensemble of possible D (for fixed sample size N). It may be that, for a particular training set D, f(x; D) is an excellent approximation of E[y | x], hence a near-optimal predictor of y. At the same time, however, f(x; D) may vary substantially with the particular realization of D, or the average (over all possible D) of f(x; D) may be rather far from the regression E[y | x]. Either effect makes f(x; D) an unreliable predictor of y. A useful way to assess these sources of estimation error is via the bias/variance decomposition:

    E_D[(f(x; D) - E[y | x])^2]
      = E_D[(f(x; D) - E_D[f(x; D)] + E_D[f(x; D)] - E[y | x])^2]
      = E_D[(f(x; D) - E_D[f(x; D)])^2]
        + (E_D[f(x; D)] - E[y | x])^2
        + 2 (E_D[f(x; D)] - E[y | x]) E_D[f(x; D) - E_D[f(x; D)]]
      = (E_D[f(x; D)] - E[y | x])^2            <- squared "bias"
        + E_D[(f(x; D) - E_D[f(x; D)])^2]      <- "variance"

(the cross term vanishes because E_D[f(x; D) - E_D[f(x; D)]] = 0).

If E_D[f(x; D)] is different from E[y | x], then f(x; D) is said to be biased as an estimator of E[y | x]. In general this depends on P; the same f may be biased in some cases and unbiased in others. As argued above, an unbiased estimator may still have a large mean-squared error if the variance is large: even with E_D[f(x; D)] = E[y | x], f(x; D) may be highly sensitive to the data and, typically, far from the regression E[y | x]. Thus either bias or variance can contribute to poor performance.

There is often a tradeoff between the bias and variance contributions to the estimation error, which makes for a kind of "uncertainty principle" (Grenander 1951). Typically, variance is reduced through "smoothing," for example by combining the influences of samples that are nearby in the input (x) space. This, however, will introduce bias, as details of the regression function will be lost; for example, sharp peaks and valleys will be blurred.

Examples. The issue of balancing bias and variance is much studied in estimation theory. The tradeoff is already well illustrated in the one-dimensional regression problem, x ∈ [0, 1]. In an elementary version of this problem, y is related to x by

    y = g(x) + η,    (3.2)

where g is an unknown function and η is zero-mean "noise" with distribution independent of x. The regression is then g(x), and this is the best (mean-squared-error) predictor of y. To make our points more clearly, we will suppose, for this example, that only y is random; x can be chosen as we please. If we are to collect N observations, then a natural "design" for the inputs is x_i = i/N, 1 ≤ i ≤ N.

[Figure 2: One hundred observations (squares) generated according to equation 3.2 with g(x) = 4.26(e^{-x} - 4e^{-2x} + 3e^{-3x}). The noise is zero-mean gaussian with standard error 0.2. In each panel, the broken curve is g and the solid curve is a spline fit. (a) Smoothing parameter chosen to control variance. (b) Smoothing parameter chosen to control bias. (c) A compromise value of the smoothing parameter, chosen automatically by cross-validation. (From Wahba and Wold 1975.)]

The solid curves in Figure 2 are estimates of the regression g(x), as will be explained shortly. The object is to make a guess at g(x) using the noisy observations y_i = g(x_i) + η_i, 1 ≤ i ≤ N.
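The one-dimensional example lends itself to a direct numerical check of the decomposition above. The sketch below (added for illustration, not part of the source) repeatedly regenerates the training set D via y_i = g(x_i) + η_i and estimates the bias² and variance of the fit at a single point x_0; a least-squares polynomial fit stands in for the spline smoother of Figure 2, with the degree playing the role of the smoothing parameter. The choice of estimator, the degrees, the evaluation point, and the number of replicates are all assumptions.

```python
# Sketch: Monte Carlo estimate of bias^2 and variance at a fixed x_0,
# using the example regression g(x) = 4.26(e^-x - 4e^-2x + 3e^-3x)
# and the fixed design x_i = i/N from the text. Estimator choice is assumed.
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    return 4.26 * (np.exp(-x) - 4 * np.exp(-2 * x) + 3 * np.exp(-3 * x))

N = 100                                   # observations per training set
x_design = np.arange(1, N + 1) / N        # x_i = i/N, 1 <= i <= N
sigma = 0.2                               # noise standard deviation
x0 = 0.5                                  # point at which we evaluate f(x0; D)

def fit_and_predict(y, degree):
    # Least-squares polynomial fit f(.; D); the degree acts as the smoothing
    # parameter: low degree -> more bias, high degree -> more variance.
    coeffs = np.polyfit(x_design, y, degree)
    return np.polyval(coeffs, x0)

n_reps = 2000                             # number of simulated training sets D
for degree in (1, 3, 10):
    preds = np.empty(n_reps)
    for r in range(n_reps):
        y = g(x_design) + rng.normal(0.0, sigma, size=N)   # one draw of D
        preds[r] = fit_and_predict(y, degree)
    bias2 = (preds.mean() - g(x0)) ** 2   # (E_D[f(x0;D)] - E[y|x0])^2
    var = preds.var()                     # E_D[(f(x0;D) - E_D[f(x0;D)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.5f}, variance = {var:.5f}, "
          f"sum = {bias2 + var:.5f}")
```

In line with the "uncertainty principle" described above, one typically finds that bias² falls and variance rises as the degree grows, with their sum minimized at some intermediate degree.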