Lecture 3 - ATM PDF
The bias-variance tradeoff expresses the effect of various possible factors on the final error between the hypothesis chosen by the learning machine (LM) and the one it should have chosen, the ideal target function.
According to the general model of learning from examples, the LM receives from the environment a sample of data {x_1, ..., x_m} where x_i ∈ X. In the absence of additional information on their source, and for simplicity of modeling and mathematical analysis, we will suppose that these objects are drawn randomly and independently of one another according to a probability distribution D_X (the so-called independently and identically distributed, or i.i.d., assumption). Together with each datum x_i, the LM also receives a label, or supervision, u_i, produced according to a functional dependence between x and u. To simplify, we will suppose that this functional dependence between an input x and its label u takes the form of a function f belonging to a family of functions F. Without loss of generality, we also suppose that the labeling may be erroneous, in the form of noise, i.e. a measurable deviation between the proposed label and the true label according to f. The LM seeks a hypothesis function h, in the space of functions H, as near as possible to f, the target function. We will specify later the notion of proximity used to evaluate the distance between f and h.
Figure 1 illustrates the various sources of error between the target function f and the hypothesis function h. We call total error the error resulting from the conjunction of these various errors between f and h. Let us detail them.
• The first source of error comes from the following fact: nothing allows us a priori to postulate equality between the space F of target functions in Nature and the space H of hypothesis functions realizable by the LM. Consequently, even if the LM produces an optimal hypothesis h* (in the sense of the proximity measure mentioned above), h* is necessarily taken in H and can thus differ from the target function f. This is the approximation error, often called inductive bias (or simply bias), due to the difference between F and H.
• Then, the LM does not in general produce h*, the optimal hypothesis in H, but a hypothesis ĥ based on the learning sample S. Depending on this sample, the learned hypothesis ĥ may vary inside a set of functions that we denote by {ĥ_S} to underline the dependence of each of its elements on the random sample S. The distance between h* and the estimated hypothesis ĥ, which depends on the particularities of S, is the estimation error. One can show formally that it is the variance related to the sensitivity of the computation of the hypothesis ĥ as a function of the sample S. The richer the hypothesis space H, the larger, in general, this variance.
• Finally, there is the noise on labeling: because of transmission errors, the label u associated with x can be inaccurate with respect to f. Hence the LM receives a sample of data relative to a disturbed function f_b = f + noise. This is the intrinsic error, which generally complicates the learning task.
Figure 1: The various types of errors arising in the estimation of a target function from a learning sample. With a more restricted hypothesis space, one can reduce the variance, but generally at the price of a greater approximation error.
Given these circumstances, the bias-variance tradeoff can be stated as follows: to reduce the bias due to the poor fit of H to F, it is necessary to increase the richness of H. Unfortunately, this enrichment is generally paid for by an increase in the variance. Consequently, the total error, which is the sum of the approximation error and the estimation error, cannot be decreased significantly.
The bias-variance tradeoff should thus rather be called the approximation error/estimation error tradeoff. The important point, however, is that a genuine compromise is involved, since one deals with a sum of terms that vary in opposite directions. The noise, or intrinsic error, on the other hand, can only make things worse as it increases. The ideal would be to have zero noise and a restricted hypothesis space H, to reduce the variance, that is at the same time well informed, i.e. containing only functions close to the target function, which would obviously be equivalent to having a priori knowledge of Nature.
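To make the additivity of these terms explicit, the decomposition can be sketched as follows; this is a standard formulation under the notations above, with R denoting the real risk (the exact form depends on the proximity measure chosen):

```latex
% Sketch of the total-error decomposition for the learned hypothesis
% \hat{h}, the optimal hypothesis h^* in H, and the target function f.
\begin{align*}
\underbrace{R(\hat{h}) - R(f)}_{\text{total error}}
  \;=\; \underbrace{R(h^{*}) - R(f)}_{\text{approximation error (bias)}}
  \;+\; \underbrace{R(\hat{h}) - R(h^{*})}_{\text{estimation error (variance)}}
\end{align*}
% The intrinsic error adds a further, irreducible term when the labels
% follow the disturbed function f_b = f + noise rather than f itself.
```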
Regularization Methods
The examination of the bias-variance tradeoff and Vapnik's analysis of the ERM principle have clearly shown that the expected risk (the real risk) depends both on the empirical risk measured on the learning sample and on the "capacity" of the space of hypothesis functions. The larger this capacity, the greater the chance of finding a hypothesis close to the target function (small approximation error), but also the more the hypothesis minimizing the empirical risk depends on the particular learning sample provided (large estimation error), which forbids extrapolating with certainty from the performance measured by the empirical risk to the real risk.
In other words, supervised induction must always face the risk of over-fitting. If the hypothesis space H is too rich, there is a strong chance that the selected hypothesis, although its empirical risk is small, presents a high real risk. This is because several hypotheses can have a small empirical risk on a learning sample while having very different real risks. It is thus not possible, on the basis of the measured empirical risk alone, to distinguish the good hypotheses from the bad ones. It is thus necessary to restrict the richness of the hypothesis space as much as possible, while seeking to preserve a sufficient approximation capacity.
Since one can only measure the empirical risk, the idea is thus to try to estimate the real risk by correcting the empirical risk, which is necessarily optimistic, with a penalization term corresponding to a measure of the capacity of H, the hypothesis space used. This is the essence of all induction approaches that revise the ERM principle (the fit to the data) with a regularization term (depending on the hypothesis class). This fundamental idea lies at the heart of a whole family of methods, such as regularization theory, the Minimum Description Length principle (MDLP), the Akaike information criterion (AIC), and other methods based on complexity measures.
The problem thus defined has been known, at least empirically, for a long time, and many techniques have been developed to solve it. They can be arranged into three principal categories: model selection methods, regularization techniques, and averaging methods.
• In model selection methods, the approach consists in considering a hypothesis space H, decomposing it into a discrete collection of nested subspaces H_1 ⊆ H_2 ⊆ ... ⊆ H_d ⊆ ..., and then, given a learning sample, trying to identify the optimal subspace in which to choose the final hypothesis. Several methods have been proposed within this framework, which can be gathered into two types:
− validation methods by multiple learning, among which appear cross-validation and bootstrapping (see the sketch after this list);
− penalization methods, which correct the empirical risk in each subspace with a term measuring its capacity.
• Regularization methods act in the same spirit as model selection methods, except that they do not impose a discrete decomposition on the hypothesis class. A penalization criterion is associated with each hypothesis, which measures either the complexity of its parametric structure or global "regularity" properties related, for example, to the differentiability of the hypothesis functions or to their dynamics (for example, high-frequency functions, i.e. those changing value quickly, will be penalized more than low-frequency functions).
• Averaging methods do not select a single hypothesis in the space H, but choose a weighted combination of hypotheses to form a single prediction function. Such a weighted combination can have the effect of "smoothing" erratic hypotheses (as in Bayesian averaging and bagging methods), or of increasing the representational capacity of the hypothesis class if it is not convex (as in boosting methods).
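Here is the cross-validation sketch announced above: a minimal illustration of model selection over nested polynomial spaces H_d. The data-generating function, noise level, and fold count are assumptions made for the example, not prescriptions of the text.

```python
import numpy as np

def cv_risk(x, u, degree, folds):
    """Estimate the real risk of the degree-d polynomial class H_d by
    cross-validation: mean squared error averaged over held-out folds."""
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        # ERM inside H_d: least-squares polynomial fit on the training part
        w = np.polyfit(x[train], u[train], degree)
        errors.append(np.mean((np.polyval(w, x[fold]) - u[fold]) ** 2))
    return np.mean(errors)

# Hypothetical i.i.d. sample with u_i = f(x_i) + noise, f = sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
u = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 30)

# Fixed 5-fold split, then choose the subspace H_d with minimal CV risk
folds = np.array_split(rng.permutation(len(x)), 5)
best_d = min(range(10), key=lambda d: cv_risk(x, u, d, folds))
print("selected degree d* =", best_d)
```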
All these methods generally lead to notable performance improvements over "naive" methods. However, they must be used carefully. On the one hand, they sometimes correspond to an increase in the richness of the hypothesis space, and hence to an increased risk of over-fitting. On the other hand, they frequently require expertise to be applied, in particular because additional parameters must be tuned. For these reasons, some recent work tries to determine automatically the suitable complexity of the candidate hypotheses in order to adapt to the learning data.
Let us define more formally the problem of model selection, which is the objective of all these methods. Suppose we have a nested sequence of hypothesis classes H_1 ⊆ H_2 ⊆ ... ⊆ H_d ⊆ ... and that the learned hypothesis is sought in one of these classes. Let h_d* be the optimal hypothesis in the class H_d and R(d) = R_real(h_d*) the associated real risk. We note that the sequence {R(d)}_{1 ≤ d ≤ ∞} is decreasing, since the classes H_d are nested and thus their capacity to approximate the target function f can only increase.
Using these notations, the model selection problem can be defined as follows.
Definition 1.2 The model selection problem consists in choosing, on the basis of a learning sample S of length m, a class of hypotheses H_{d*} and a hypothesis h_{d*} ∈ H_{d*} such that the associated real risk is minimal.
The underlying conjecture is that the real risk associated with the hypothesis h_d selected in each class H_d presents a global minimum for a nontrivial value of d (i.e. different from zero and m) corresponding to the "ideal" hypothesis space H_{d*} (see Figure 2).
It is thus a question of finding the ideal hypothesis space H_{d*}, and then of selecting the best hypothesis h_d in H_{d*}. The definition says nothing about this second problem; it is generally solved by applying the ERM principle, which dictates seeking the hypothesis that minimizes the empirical risk.
For the selection of H_{d*}, one estimates the optimal real risk in each H_d by choosing the best hypothesis according to the empirical risk (the ERM method) and by correcting the associated empirical risk with a penalization term related to the characteristics of the space H_d. The model selection problem then consists in solving an equation of the type:

d* = argmin_{d ≥ 1} [ R_emp(ĥ_d) + pen(d, m) ]

where ĥ_d denotes the hypothesis minimizing the empirical risk in H_d and pen(d, m) is the penalization term.
Let us note that the choice of the best hypothesis space depends on the size m of the data sample. The larger m is, the more it is possible, if necessary, to choose without risk (i.e. with a small variance or confidence interval) a rich hypothesis space making it possible to approach the target function f as closely as possible.
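As a minimal sketch of this penalized selection, one can correct the empirical risk of the ERM hypothesis in each H_d with a simple capacity penalty; the AIC-like penalty pen(d, m) = λ(d + 1)/m used here is an illustrative assumption, not the text's prescription:

```python
import numpy as np

def penalized_risk(x, u, degree, lam=0.5):
    """Empirical risk of the ERM hypothesis in H_d, corrected by an
    assumed capacity penalty pen(d, m) = lam * (d + 1) / m."""
    m = len(x)
    w = np.polyfit(x, u, degree)             # ERM in H_d (least squares)
    r_emp = np.mean((np.polyval(w, x) - u) ** 2)
    return r_emp + lam * (degree + 1) / m    # corrected risk estimate

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 20)
u = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 20)

# Solve d* = argmin_d [ R_emp(h_d) + pen(d, m) ]
best_d = min(range(10), key=lambda d: penalized_risk(x, u, d))
print("selected degree d* =", best_d)
```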
Figure 2: The bound on the real risk results from the sum of the empirical risk and a confidence interval depending on the capacity of the associated hypothesis space. Supposing a nested sequence of hypothesis spaces of increasing capacity indexed by d, the accessible optimal empirical risk decreases with increasing d (corresponding to the bias), while the confidence interval (corresponding to the variance) increases. The minimal bound on the real risk is reached for a suitable hypothesis space H_{d*}.
Finding the optimal polynomial

S = {(x_1, u_1), ..., (x_m, u_m)} is the set of training examples, with m = 10 and labels u_i = f(x_i) + ε_i.
H_i = {h(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_i x^i} is the space of polynomial functions (curves) of degree i; w is the parameter vector, h is linear in w and nonlinear in x. These spaces are nested: H_0 ⊆ H_1 ⊆ H_2 ⊆ ...
[Figure: constant, straight-line, and parabola hypotheses h(x_n) plotted against the labels u.]
The cost function:

R_emp(h) = (1/|S|) Σ_{i=1}^{|S|} l(u_i, h(x_i, w))

where the loss function l measures the cost incurred by deciding h(x_i) instead of u_i.
– Find the hypothesis h* that minimizes the empirical risk (the training error):

h*_{S,H} = argmin_{h ∈ H} R_emp(h),  with  R_emp(h) = (1/|S|) Σ_{i=1}^{|S|} l(u_i, h(x_i, w))
– Find the parameters w that minimize the empirical risk using the quadratic loss:

Σ_{i=1}^{|S|} l(u_i, h(x_i, w)) = Σ_{i=1}^{|S|} (u_i − h(x_i, w))^2
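A minimal sketch of this least-squares ERM step for the polynomial classes H_i; the target f = sin(2πx) and the noise level are illustrative assumptions, since the slides do not specify f:

```python
import numpy as np

# Hypothetical training set S with m = 10 examples, u_i = f(x_i) + eps_i
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 10)
u = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

# ERM in H_i with the quadratic loss: least-squares degree-i polynomial fit
for i in (0, 1, 3, 9):
    w = np.polyfit(x, u, i)                       # parameter vector w
    r_emp = np.mean((np.polyval(w, x) - u) ** 2)  # empirical risk on S
    print(f"degree i = {i}: R_emp = {r_emp:.4f}")
```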
Choosing the model H_i and finding h*_{S,Hi}

[Figure: the fitted curves h*_{S,Hi} for i = 0, 1, 3, 9; the degree-9 curve over-fits the sample.]

Evolution of the empirical risk: we evaluate the hypotheses on a test set of 100 examples.
Over-fitting:
– small error on the training set;
– large error on the test set.
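A sketch of this diagnosis under the same illustrative assumptions as above, evaluating each fitted hypothesis on a fresh test set of 100 examples:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(m):
    """Draw m i.i.d. examples with u = sin(2*pi*x) + Gaussian noise
    (an assumed data source, matching the earlier sketches)."""
    x = rng.uniform(0.0, 1.0, m)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, m)

x_train, u_train = sample(10)   # training set, m = 10
x_test, u_test = sample(100)    # test set of 100 examples

for i in (0, 1, 3, 9):
    w = np.polyfit(x_train, u_train, i)
    err_train = np.mean((np.polyval(w, x_train) - u_train) ** 2)
    err_test = np.mean((np.polyval(w, x_test) - u_test) ** 2)
    # Over-fitting shows up as a small training error with a large test error
    print(f"i = {i}: train = {err_train:.4f}, test = {err_test:.4f}")
```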
The coefficients of h*_{S,Hi}, for H_i = {h(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_i x^i}:

[Table: coefficient values of h*_{S,Hi} for i = 0, 1, 6, 9.]
The over-fitting problem is gradually eliminated as the number of training examples increases.
Regularization methods

Σ_{i=1}^{|S|} l(u_i, h(x_i, w)) = Σ_{i=1}^{|S|} (u_i − h(x_i, w))^2 + λ||w||^2

where the penalty term is ||w||^2 = w_0^2 + w_1^2 + ... + w_i^2 and λ controls the importance of the regularization/penalty term.
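A minimal sketch of this regularized fit, solving the penalized least-squares problem in closed form; the value of λ and the data source are illustrative assumptions:

```python
import numpy as np

def ridge_polyfit(x, u, degree, lam):
    """Minimize sum_i (u_i - h(x_i, w))^2 + lam * ||w||^2 for the
    polynomial h(x, w) = w_0 + w_1 x + ... + w_degree x^degree."""
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    # Closed-form regularized normal equations: (X^T X + lam I) w = X^T u
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ u)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 10)
u = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

# Degree-9 fit: lam = 0 reproduces the over-fitted ERM solution, while a
# small lam > 0 shrinks the coefficients and smooths the curve
for lam in (0.0, 1e-3):
    w = ridge_polyfit(x, u, 9, lam)
    print(f"lambda = {lam}: max |w_j| = {np.abs(w).max():.2f}")
```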
The impact of including a regularization term

Σ_{i=1}^{|S|} l(u_i, h(x_i, w)) = Σ_{i=1}^{|S|} (u_i − h(x_i, w))^2 + λ||w||^2

[Figure: h*_{S,H9} fitted without and with the regularization penalty, alongside the fits for i = 0, 1, 6, 9.]