
The Bias-Variance Tradeoff

The bias-variance tradeoff expresses the effect of various factors on the final error between the hypothesis chosen by the learning machine (LM) and the one it ideally should have chosen, the target function.

According to the general model of learning from examples, the LM receives from the environment a sample of data {x1 ,..., xm } where xi ∈ X . In the absence of additional information about their source, and for simplicity of modeling and mathematical analysis, we assume that these objects are drawn randomly and independently of one another according to a probability distribution DX (this is the assumption of independently and identically distributed data). Together with each datum xi , the LM also receives a label, or supervision, ui , produced according to a functional dependence between x and u .

We denote by S = {z1 = ( x1 , u1 ) ,..., zm = ( xm , um )} the learning sample, made up here of supervised examples. To simplify, we assume that the functional dependence between an input x and its label u takes the form of a function f belonging to a family of functions F . Without loss of generality, we also allow erroneous labeling, in the form of noise, i.e., a measurable deviation between the proposed label and the true label according to f. The LM seeks a hypothesis function h , in the space of functions H , as close as possible to the target function f. The notion of proximity used to evaluate the distance between f and h will be specified later.

Figure 1 illustrates the various sources of error between the target function f and the hypothesis function h. We call total error the error resulting from the conjunction of these various errors between f and h. Let us detail them.

• The first source of error comes from the following fact: nothing a priori allows us to postulate equality between the target function space F of Nature and the hypothesis function space H realizable by the LM. Consequently, even if the LM produces an optimal hypothesis h* (in the sense of the proximity measure mentioned above), h* is necessarily taken from H and can thus differ from the target function f. This is the approximation error, often called the inductive bias (or simply the bias), due to the difference between F and H .

• Next, the LM does not in general produce the optimal hypothesis h* in H , but a hypothesis ĥ based on the learning sample S . Depending on this sample, the learned hypothesis ĥ can vary inside a set of functions that we denote {ĥ}S to underline the dependence of each of its elements on the random sample S . The distance between h* and the estimated hypothesis ĥ , which depends on the particularities of S , is the estimation error. One can show formally that it is the variance, related to the sensitivity of the computation of the hypothesis ĥ to the sample S . The richer the hypothesis space H , the larger, in general, this variance.

• Finally, there is the noise on the labels: because of transmission errors, the label u associated with x can be inaccurate with respect to f. Hence the LM receives a sample of data drawn from the disturbed function fb = f + noise . This is the intrinsic error, which generally complicates the search for the optimal hypothesis h* .

Figure 1: The various types of errors arising in the estimation of a target function from a learning sample. With a more restricted hypothesis space, one can reduce the variance, but generally at the price of a greater approximation error.

Given these circumstances, the bias-variance tradeoff can be defined in the following way: to reduce the bias due to the poor fit of H to F , it is necessary to increase the richness of H . Unfortunately, this enrichment is generally paid for with an increase in the variance. As a result, the total error, which is the sum of the approximation error and the estimation error, cannot be significantly decreased.

The bias-variance tradeoff should thus rather be called the approximation error/estimation error tradeoff. The important point, however, is that a compromise is indeed required, since we are dealing with a sum of terms that vary in opposite directions. The noise, or intrinsic error, on the other hand, can only make things worse as it increases. The ideal would be to have zero noise and a restricted hypothesis space H to reduce the variance, but one that is well informed, i.e., containing only functions close to the target function, which would obviously amount to having a priori knowledge of Nature.
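The decomposition above can be illustrated with a small simulation. The sketch below is ours, not the text's: it assumes, for illustration, the target f(x) = sin(2πx), Gaussian label noise, and least-squares polynomial fitting. It repeatedly draws learning samples S and estimates the squared bias and the variance of the learned hypothesis for polynomial hypothesis spaces of increasing richness:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(2 * np.pi * x)   # illustrative target function (assumed)
sigma = 0.3                           # label-noise standard deviation (assumed)
m = 10                                # size of each learning sample S
n_trials = 200                        # number of independent samples S

x_test = np.linspace(0.05, 0.95, 50)  # evaluation grid

def bias_variance(degree):
    """Estimate squared bias and variance of degree-d least-squares fits."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, m)
        u = f(x) + rng.normal(0, sigma, m)   # noisy labels u_i = f(x_i) + noise
        w = np.polyfit(x, u, degree)         # ERM (least-squares) fit in H_degree
        preds[t] = np.polyval(w, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for d in (1, 3, 9):
    b2, v = bias_variance(d)
    print(f"degree {d}: bias^2 ~ {b2:.3f}, variance ~ {v:.3f}")
```

Richer spaces (higher degree) trade a smaller bias for a much larger variance, which is exactly the compromise described above.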

Regularization Methods

The examination of the bias-variance tradeoff and Vapnik's analysis of the ERM principle have clearly shown that the expected risk (the real risk) depends both on the empirical risk measured on the learning sample and on the "capacity" of the space of hypothesis functions. The larger this capacity, the greater the chance of finding a hypothesis close to the target function (small approximation error), but also the more the hypothesis minimizing the empirical risk depends on the particular learning sample provided (large estimation error), which makes it impossible to transfer with certainty the performance measured by the empirical risk to the real risk.

In other words, supervised induction must always face the risk of over-fitting. If the hypothesis space H is too rich, there is a strong chance that the selected hypothesis, although its empirical risk is small, has a high real risk. This is because several hypotheses can have a small empirical risk on a learning sample while having very different real risks. It is thus not possible, on the basis of the measured empirical risk alone, to distinguish good hypotheses from bad ones. It is therefore necessary to restrict the richness of the hypothesis space as much as possible, while preserving a sufficient approximation capacity.

Tuning the Hypotheses Class

Since only the empirical risk can be measured, the idea is to estimate the real risk by correcting the empirical risk, which is necessarily optimistic, with a penalization term corresponding to a measure of the capacity of H , the hypothesis space used. This is the essence of all induction approaches that revise the ERM principle (adaptation to the data) with a regularization term (depending on the hypothesis class). This fundamental idea lies at the heart of a whole family of methods such as regularization theory, the Minimum Description Length principle (MDLP), the Akaike information criterion (AIC), and other methods based on complexity measures.

The problem thus defined has been known, at least empirically, for a long time, and many techniques have been developed to solve it. They can be arranged into three principal categories: model selection methods, regularization techniques, and averaging methods.

• In model selection methods, the approach consists in considering a hypothesis space H , decomposing it into a discrete collection of nested subspaces H1 ⊆ H 2 ⊆ ... ⊆ H d ⊆ ... , and then, given a learning sample, trying to identify the optimal subspace in which to choose the final hypothesis. Several methods have been proposed within this framework, which can be gathered into two types:

− complexity penalization methods, among which are the structural risk minimization (SRM) principle of Vapnik, the Minimum Description Length principle of Rissanen (1978), and various statistical selection criteria;

− methods of validation by multiple learning, among which are cross-validation and bootstrapping.

• Regularization methods act in the same spirit as model selection methods, except that they do not impose a discrete decomposition on the hypothesis class. A penalization criterion is associated with each hypothesis, measuring either the complexity of its parametric structure or global "regularity" properties related, for example, to the differentiability of the hypothesis functions or to their dynamics (for example, high-frequency functions, i.e., functions changing value quickly, are penalized more than low-frequency functions).

• Averaging methods do not select a single hypothesis in the space H , but choose a weighted combination of hypotheses to form a prediction function. Such a weighted combination can have the effect of "smoothing" erratic hypotheses (as in Bayesian averaging and bagging methods), or of increasing the representation capacity of the hypothesis class if it is not convex (as in boosting methods).
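As an illustration of validation by multiple learning, here is a minimal k-fold cross-validation sketch. The data-generating setup and all names are assumptions made for illustration; the hypothesis space index, here the polynomial degree, is scored by the average validation error over k train/validation splits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (assumed): noisy samples of sin(2*pi*x).
m = 30
x = rng.uniform(0, 1, m)
u = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, m)

def kfold_cv_error(degree, k=5):
    """Mean squared validation error of a degree-d polynomial under k-fold CV."""
    idx = rng.permutation(m)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], u[train], degree)   # fit on k-1 folds
        pred = np.polyval(w, x[val])                 # score on the held-out fold
        errors.append(np.mean((u[val] - pred) ** 2))
    return np.mean(errors)

for d in range(0, 10):
    print(f"degree {d}: CV error = {kfold_cv_error(d):.3f}")
```

The degree with the smallest cross-validation error is then retained, which estimates the real risk without any explicit capacity penalty.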

All these methods have generally led to notable improvements in performance compared to "naive" methods. However, they must be used carefully. On the one hand, they sometimes correspond to an increase in the richness of the hypothesis space, and hence to an increased risk of over-fitting. On the other hand, they frequently require expertise to be applied, in particular because additional parameters have to be tuned. For these reasons, some recent work tries to determine automatically the suitable complexity of the candidate hypotheses in order to adapt to the learning data.

Model Selection

Let us define more formally the model selection problem, which is the objective of all these methods.

Let H1 ⊆ H 2 ⊆ ... ⊆ H d ⊆ ... be a nested sequence of spaces or classes of hypotheses (or models), where the spaces H d are of increasing capacity. The target function f may or may not belong to one of these classes. Let hd* be the optimal hypothesis in the hypothesis class H d , and R ( d ) = Rreal ( hd* ) the associated real risk. Note that the sequence { R ( d )}1≤ d ≤∞ is decreasing, since the hypothesis classes H d are nested, and their capacity to approximate the target function f can therefore only increase.

Using these notations, the model selection problem can be defined as follows.

Definition 1.2 The model selection problem consists in choosing, on the basis of a learning sample S of size m, a class of hypotheses Hd* and a hypothesis hd* ∈ Hd* such that the associated real risk Rreal ( hd* ) is minimal.

The underlying conjecture is that the real risk associated with the hypothesis hd selected in each class H d presents a global minimum for a nontrivial value of d (i.e., different from zero and m), corresponding to the "ideal" hypothesis space Hd* (see Figure 2).

The problem is thus to find the ideal hypothesis space Hd* , and in addition to select the best hypothesis hd in Hd* . The definition says nothing about this last problem. It is generally solved using the ERM principle, which dictates seeking the hypothesis that minimizes the empirical risk.

For the selection of Hd* , one uses an estimate of the optimal real risk in each H d , obtained by choosing the best hypothesis according to the empirical risk (the ERM method) and correcting the associated empirical risk with a penalization term related to the characteristics of the space H d . The model selection problem then consists in solving an equation of the type:

d* = ArgMin_d { Rreal^estimated ( hd ) }
   = ArgMin_d { Remp ( hd ) + penalization term }

where hd ∈ H d denotes the hypothesis minimizing the empirical risk in H d .

Note that the choice of the best hypothesis space depends on the size m of the data sample. The larger the sample, the more it is possible, if necessary, to choose without risk (i.e., with a small variance or confidence interval) a rich hypothesis space making it possible to approach the target function f as closely as possible.
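A minimal sketch of this penalized selection, assuming an AIC-style penalty proportional to the number of parameters and a known noise variance (both choices are our assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data (assumed): noisy samples of sin(2*pi*x).
m = 30
x = rng.uniform(0, 1, m)
u = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, m)

def empirical_risk(degree):
    """Mean squared training error of the ERM (least-squares) fit in H_d."""
    w = np.polyfit(x, u, degree)
    return np.mean((u - np.polyval(w, x)) ** 2)

def penalized_risk(degree, sigma2=0.09):
    """Corrected risk: empirical risk + AIC-style penalty growing with d.
    sigma2 is an assumed known noise variance (0.3**2)."""
    n_params = degree + 1
    return empirical_risk(degree) + 2 * sigma2 * n_params / m

# d* = ArgMin_d { Remp(h_d) + penalization term }
d_star = min(range(10), key=penalized_risk)
print("selected degree d* =", d_star)
```

The penalty prevents the selection from always drifting to the richest class, since past some degree the drop in empirical risk no longer compensates the growth of the penalization term.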

Figure 2: The bound on the real risk results from the sum of the empirical risk and a confidence interval depending on the capacity of the associated hypothesis space. Assuming a nested sequence of hypothesis spaces of increasing capacity, indexed by d, the accessible optimal empirical risk decreases for increasing d (corresponding to the bias), while the confidence interval (corresponding to the variance) increases. The minimal bound on the real risk is reached for a suitable hypothesis space Hd* .

Finding the Optimal Polynomial

S = {(x1 , u1 ), . . . , (xm , um )} is the training set, here with m = 10 examples. The labels are

ui = f (xi ) + εi

where f is the target function (the one we want to learn) and εi is random noise (which corrupts the data), with εi ∼ N (µ, σ).

The regression problem: based on S, find (learn) the function h establishing the correspondence u = h(x), then use h for prediction on new values of x.
The Target Function

The target function is f (x) = sin(2πx). Its Taylor series is

sin(x) = Σ_{n=0}^{∞} (−1)^n x^(2n+1) / (2n+1)! = x − x³/3! + x⁵/5! − ...

so that

sin(2πx) ≈ 6.28x − 41.34x³ + 81.60x⁵ − ...
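The stated coefficients can be checked directly from the Taylor expansion, since the coefficient of x^(2n+1) in sin(2πx) is (−1)^n (2π)^(2n+1) / (2n+1)!:

```python
import math

# Coefficient of x^(2n+1) in the Taylor series of sin(2*pi*x):
# (-1)^n * (2*pi)^(2n+1) / (2n+1)!
for n in range(3):
    k = 2 * n + 1
    coeff = (-1) ** n * (2 * math.pi) ** k / math.factorial(k)
    print(f"x^{k}: {coeff:.4f}")
```

This reproduces 6.2832, −41.3417 and 81.6052, i.e., the rounded values quoted above.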


The Function Space

Hi = {h(x, w) = w0 + w1 x + w2 x² + . . . + wi x^i}

is the space of polynomial functions (curves) of degree i. Here w is the parameter vector; h is linear in w and nonlinear in x. These spaces are nested:

H0 ⊆ H1 ⊆ H2 ⊆ . . .

(constant functions, straight lines, parabolas, ...).

The Empirical Risk of a Hypothesis h

Remp (h) = (1/|S|) Σ_{i=1}^{|S|} l(ui , h(xi , w))

where l is the loss function, which measures the cost incurred by deciding h(xi) instead of ui.

Examples of loss functions:

squared loss: l(ui , h(xi , w)) = (ui − h(xi , w))²

absolute loss: l(ui , h(xi , w)) = |ui − h(xi , w)|
The ERM Principle

Find the hypothesis h* that minimizes the empirical risk (the training error):

hS,H = arg min_{h ∈ H} Remp (h),   Remp (h) = (1/|S|) Σ_{i=1}^{|S|} l(ui , h(xi , w))

i.e., find the parameters w that minimize the empirical risk for the chosen loss function, here the squared loss l(ui , h(xi , w)) = (ui − h(xi , w))².
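With the squared loss, the ERM fit in Hi is an ordinary least-squares problem. A minimal sketch, where the data and all settings are assumed for illustration (np.polyfit solves the least-squares system for the polynomial coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)

# Training set (assumed setup, as in the slides): m = 10 noisy samples.
m = 10
x = rng.uniform(0, 1, m)
u = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, m)

# ERM with squared loss in H_i: np.polyfit computes
# w* = arg min_w sum_i (u_i - h(x_i, w))^2 for a degree-i polynomial.
i = 3
w = np.polyfit(x, u, i)

h = lambda x_new: np.polyval(w, x_new)   # learned hypothesis h_{S,H_i}
print("training error:", np.mean((u - h(x)) ** 2))
```

The fitted h can then be used for prediction on new values of x, as stated in the regression problem above.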

Choosing the Model Hi and Finding hS,Hi

[Figure: the fits hS,Hi for degrees i = 0, 1, 3, 9; the i = 9 fit exhibits over-fitting.]
Evolution of the Empirical Risk

We evaluate the hypotheses on a test set of 100 examples. Over-fitting manifests itself as a small error on the training set combined with a large error on the test set.

The Coefficients of hS,Hi

Hi = {h(x, w) = w0 + w1 x + w2 x² + . . . + wi x^i}

[Table: coefficients of the fitted polynomials for i = 0, 1, 6, 9; their magnitudes blow up as i grows, far beyond those of sin(2πx) ≈ 6.28x − 41.34x³ + 81.60x⁵ − ...]


Behavior of a Model as a Function of the Size of S

The over-fitting problem is gradually eliminated as the number of training examples grows.
Regularization Methods

The squared loss is augmented with a penalty term:

Σ_{i=1}^{|S|} (ui − h(xi , w))² + λ‖w‖²

where ‖w‖² = w0² + w1² + . . . + wi² and λ controls the importance of the regularization/penalty term.
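A minimal sketch of this penalized least-squares fit (data and settings are assumptions for illustration). The penalized minimizer is computed via an augmented least-squares system, which is algebraically equivalent to solving the normal equations (XᵀX + λI)w = Xᵀu but numerically safer:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup, as in the slides: m = 10 noisy samples, degree i = 9.
m, i = 10, 9
x = rng.uniform(0, 1, m)
u = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, m)

def ridge_polyfit(x, u, degree, lam):
    """Minimize sum (u - h(x, w))^2 + lam * ||w||^2 for a polynomial h."""
    X = np.vander(x, degree + 1, increasing=True)    # columns [1, x, ..., x^i]
    # Augmented system: stacking sqrt(lam)*I below X encodes the penalty term.
    A = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
    b = np.concatenate([u, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

w_unreg = ridge_polyfit(x, u, i, lam=0.0)
w_reg = ridge_polyfit(x, u, i, lam=1e-3)

# The penalty shrinks the coefficients, taming the wild oscillations
# of the degree-9 interpolant.
print("||w|| without penalty:", np.linalg.norm(w_unreg))
print("||w|| with penalty:   ", np.linalg.norm(w_reg))
```

Even a small λ drastically reduces the coefficient magnitudes, which is exactly the effect shown in the coefficient tables of the slides.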
The Impact of Including a Regularization Term

Σ_{i=1}^{|S|} (ui − h(xi , w))² + λ‖w‖²

[Figures: the degree-9 fits h*S,H9 without and with the penalty term; the penalized fit is much smoother, and its coefficients remain of moderate magnitude, close to those of sin(2πx) ≈ 6.28x − 41.34x³ + 81.60x⁵ − ...]
Choosing the Model

We split the initial data into two sets: the training set and the validation set. We then choose i or λ based on the error on the validation set.
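A minimal sketch of this selection procedure (the data, the split sizes, and the λ grid are assumptions for illustration): fit on the training set for each candidate λ, then keep the λ with the smallest validation error:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed setup: 40 noisy samples of sin(2*pi*x), degree fixed at i = 9.
x = rng.uniform(0, 1, 40)
u = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)
x_tr, u_tr = x[:30], u[:30]          # training set
x_va, u_va = x[30:], u[30:]          # validation set

def fit(lam, degree=9):
    """Penalized least-squares fit on the training set only."""
    X = np.vander(x_tr, degree + 1, increasing=True)
    A = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
    b = np.concatenate([u_tr, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def val_error(lam, degree=9):
    """Mean squared error of the fitted hypothesis on the validation set."""
    w = fit(lam, degree)
    pred = np.vander(x_va, degree + 1, increasing=True) @ w
    return np.mean((u_va - pred) ** 2)

lambdas = [0.0, 1e-6, 1e-4, 1e-2, 1.0]
best = min(lambdas, key=val_error)
print("chosen lambda:", best)
```

The same loop over i instead of λ selects the polynomial degree; in both cases the held-out validation error plays the role of the estimate of the real risk.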
