Tomaso Poggio
9.520 Class 02
September 2015
In addition, one can consider the data to be generated in a deterministic, stochastic, or even adversarial way.
Regularization
Regularization provides a rigorous framework to solve learning problems and to design learning algorithms.
In this course we will present a set of ideas and tools which are at the core of several developments in supervised learning and beyond.
In the last classes we will see the close connection between kernel machines and deep networks.
The training set is
$$S_n = (x_1, y_1), \dots, (x_n, y_n),$$
with the examples drawn from $X \times Y$.
The conditional distribution $p(y|x)$ reflects the fact that even in a noise-free case we have to deal with sampling.
The marginal distribution $p(x)$ might model errors in the location of the input points, discretization error for a given grid, or the presence or absence of certain input instances.
Predictivity or Generalization
Given the data, the goal is to learn how to make decisions/predictions about future data, i.e. data not belonging to the training set. Generalization is the key requirement emphasized in Learning Theory: generalization is a measure of predictivity. This emphasis makes it different from Bayesian or traditional statistics (especially explanatory statistics).
Examples of loss functions:
• square loss: $V(f(x), y) = (f(x) - y)^2$
• absolute value loss: $V(f(x), y) = |f(x) - y|$
• hinge loss: $V(f(x), y) = (1 - yf(x))_+$
• square loss for binary labels $y \in \{-1, +1\}$: $V(f(x), y) = (1 - yf(x))^2$
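As a quick sketch (my own illustration, not from the slides), these losses are easy to write in code; here f_x denotes the prediction f(x):

```python
import numpy as np

# Minimal sketch of the loss functions above; f_x is the prediction f(x), y the label.
def square_loss(f_x, y):
    return (f_x - y) ** 2

def absolute_loss(f_x, y):
    return np.abs(f_x - y)

def hinge_loss(f_x, y):
    # assumes binary labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * f_x)

def squared_hinge_loss(f_x, y):
    # for y in {-1, +1} this coincides with the square loss (f_x - y)^2
    return (1.0 - y * f_x) ** 2
```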
$S_n \to f_n$

Convergence in Probability
Let $\{X_n\}$ be a sequence of random variables. Then
$$\lim_{n\to\infty} X_n = X \quad \text{in probability}$$
if
$$\forall \varepsilon > 0, \quad \lim_{n\to\infty} P\{|X_n - X| \ge \varepsilon\} = 0.$$

Convergence in Expectation
Let $\{X_n\}$ be a sequence of bounded random variables. Then
$$\lim_{n\to\infty} X_n = X \quad \text{in expectation}$$
if
$$\lim_{n\to\infty} E(|X_n - X|) = 0.$$
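A small simulation (my own illustration, not from the slides) makes convergence in probability concrete: the empirical mean of i.i.d. Uniform(0,1) samples concentrates around 1/2, and the probability of an $\varepsilon$-deviation shrinks as $n$ grows.

```python
import numpy as np

# Illustrative sketch: the empirical mean of i.i.d. samples converges in
# probability to the true mean (law of large numbers).
rng = np.random.default_rng(0)
eps = 0.1          # the epsilon in the definition
true_mean = 0.5    # X ~ Uniform(0, 1)

for n in (10, 100, 1000, 10000):
    # estimate P{|mean_n - true_mean| >= eps} over many repetitions
    samples = rng.uniform(0.0, 1.0, size=(2000, n))
    deviations = np.abs(samples.mean(axis=1) - true_mean)
    print(n, (deviations >= eps).mean())  # this probability shrinks toward 0
```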
Universal Consistency
We say that an algorithm is universally consistent if, for all probability distributions $p$,
$$\lim_{n\to\infty} P\{ I[f_n] - I[f_*] \ge \varepsilon \} = 0 \quad \text{for all } \varepsilon > 0,$$
where $I[f] = \mathbb{E}\, V(f(x), y)$ is the expected risk and $I[f_*] = \inf_f I[f]$.
Sample Complexity
Or equivalently: 'how many points do we need to achieve an error $\varepsilon$ with a prescribed probability $\delta$?'
This can be expressed as
$$P\{ I[f_n] - I[f_*] \le \varepsilon \} \ge 1 - \delta \quad \text{for } n \ge n(\varepsilon, \delta).$$
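As a back-of-the-envelope illustration (my own example, assuming a loss bounded in $[0,1]$ and asking only how many points are needed to estimate the expected loss of one fixed $f$), Hoeffding's inequality gives
$$P\left\{ \left| \frac{1}{n}\sum_{i=1}^{n} V(f(x_i), y_i) - \mathbb{E}\,V(f(x), y) \right| \ge \varepsilon \right\} \le 2 e^{-2n\varepsilon^2},
\qquad \text{so } n(\varepsilon, \delta) = \left\lceil \frac{1}{2\varepsilon^2} \log\frac{2}{\delta} \right\rceil$$
points suffice; e.g. $\varepsilon = 0.05$ and $\delta = 0.01$ give $n \approx 1060$. Bounding the excess risk $I[f_n] - I[f_*]$ of a learned $f_n$ additionally requires control over the whole hypothesis space, which is the subject of the following slides.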
Empirical Risk
The empirical risk is a natural proxy (how good?) for the expected risk:
$$I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i).$$

Generalization Error
How good a proxy it is, is captured by the generalization error:
$$P\{ |I[f_n] - I_n[f_n]| \le \varepsilon \} \ge 1 - \delta.$$
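A minimal numerical sketch (my own toy example, not from the slides): for a fixed $f$, the empirical risk $I_n[f]$ fluctuates around the expected risk $I[f]$, and the fluctuation shrinks as $n$ grows.

```python
import numpy as np

# Toy check of how well the empirical risk I_n[f] tracks the expected risk I[f]
# for a fixed f(x) = 2x, data y = 2x + noise, square loss. Here I[f] = Var(noise) = 0.25.
rng = np.random.default_rng(0)
f = lambda x: 2.0 * x
expected_risk = 0.25                            # E[(f(x) - y)^2] = E[noise^2]

for n in (10, 100, 1000, 10000):
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)
    empirical_risk = np.mean((f(x) - y) ** 2)   # I_n[f]
    print(n, abs(empirical_risk - expected_risk))
```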
$A(S_n) = f_n \in H.$
ERM
A prototype algorithm in statistical learning theory is Empirical Risk Minimization:
$$\min_{f \in H} I_n[f].$$
Writing $z_i = (x_i, y_i)$,
$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i), \qquad f_S = \arg\min_{f \in H} I_S[f].$$
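A minimal sketch of ERM (my own illustration, not the course code): hypothesis space of linear functions, square loss, with the minimizer computed in closed form by least squares.

```python
import numpy as np

# ERM sketch: H = linear functions f(x) = w.x, square loss, minimizer found
# in closed form (least squares) on a synthetic training set S_n.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)        # noisy training data S_n

w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)    # argmin_w (1/n) sum_i (w.x_i - y_i)^2
empirical_risk = np.mean((X @ w_erm - y) ** 2)   # I_n[f_S]
print(w_erm, empirical_risk)
```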
The theorem (Vapnik et al.) says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization and consistency are equivalent). Other results characterize uGC (uniform Glivenko-Cantelli) classes in terms of measures of complexity or capacity of H (such as the VC dimension).
A separate theorem (Niyogi, Mukherjee, Rifkin, Poggio) says that stability (defined in a specific way) of (supervised) ERM is necessary and sufficient for generalization of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM; stability and H being uGC are also equivalent.
ERM minimizes the empirical risk
$$\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i),$$
while regularization adds a penalty term:
$$\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f).$$
R(f) is the regularizer, a penalization on f. In this course we will mainly discuss the case $R(f) = \|f\|_K^2$, where $\|\cdot\|_K$ is the norm in the Reproducing Kernel Hilbert Space (RKHS) $H$ defined by the kernel $K$.
• Polynomial kernel: $K(x, x') = (x \cdot x' + 1)^d$, $d \in \mathbb{N}$
• Kernel from a dictionary of features
$$D = \{\phi_j,\ j = 1, \dots, p \mid \phi_j : X \to \mathbb{R},\ \forall j\},$$
setting
$$K(x, x') = \sum_{j=1}^{p} \phi_j(x)\,\phi_j(x').$$
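As a hedged sketch (the specific features $\phi_j$ below are my own toy choice, not from the slides), the dictionary construction can be checked numerically: for these features the dictionary kernel coincides with the polynomial kernel of degree 2.

```python
import numpy as np

# Toy dictionary of features phi_j: R^2 -> R (an illustrative choice).
def features(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def dictionary_kernel(x, xp):
    # K(x, x') = sum_j phi_j(x) phi_j(x')
    return features(x) @ features(xp)

def polynomial_kernel(x, xp, d=2):
    return (np.dot(x, xp) + 1.0) ** d

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(dictionary_kernel(x, xp), polynomial_kernel(x, xp))  # the two values coincide
```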
subject to
$$\|f\|_H^2 \le R.$$
An important result
The minimizer over the RKHS $H$, $f_S$, of the regularized empirical functional
$$I_S[f] + \lambda \|f\|_H^2$$
can be represented by the expression
$$f_n(x) = \sum_{i=1}^{n} c_i K(x_i, x).$$
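A hedged sketch of how this result is used in practice (the square-loss case, a standard fact not stated on this slide): with the square loss, the coefficients $c_i$ solve the linear system $(K + \lambda n I)c = y$, giving kernel ridge regression.

```python
import numpy as np

# Sketch: with the square loss, the representer theorem reduces Tikhonov
# regularization to a linear system for the coefficients c: (K + lambda*n*I) c = y.
rng = np.random.default_rng(0)
n = 30
X = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(X) + 0.1 * rng.normal(size=n)

def gaussian_kernel(a, b, sigma=1.0):
    # K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)); any positive-definite kernel would do
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

lam = 0.01
K = gaussian_kernel(X, X)
c = np.linalg.solve(K + lam * n * np.eye(n), y)   # coefficients of f_S
x_test = np.linspace(-3, 3, 5)
f_test = gaussian_kernel(x_test, X) @ c           # f_S(x) = sum_i c_i K(x_i, x)
print(f_test)
```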
$$I_n(f) + \lambda R(f)$$
Notation: $z = (x, y)$, $S = z_1, \dots, z_n$, and $S^i = z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n$ (the training set with the $i$-th example removed).
CV Stability
A learning algorithm $A$ is CV$_{loo}$ stable if for each $n$ there exist $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ such that, for all $p$,
$$P\left\{ \left| V(f_{S^i}, z_i) - V(f_S, z_i) \right| \le \beta_{CV}^{(n)} \right\} \ge 1 - \delta_{CV}^{(n)},$$
with $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ going to zero for $n \to \infty$.
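A hedged numerical sketch (my own toy experiment, not from the slides): CV$_{loo}$ stability can be probed empirically by retraining with one point left out and comparing the losses at that point; kernel ridge regression is used here purely as an example algorithm.

```python
import numpy as np

# For each i, compare the square loss at z_i of f_S (trained on all points) and of
# f_{S^i} (trained with z_i removed); CV_loo stability asks that this gap is small.
rng = np.random.default_rng(1)
n = 40
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + 0.1 * rng.normal(size=n)

def kernel(a, b, sigma=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def krr_predict(X_tr, y_tr, x, lam=0.05):
    c = np.linalg.solve(kernel(X_tr, X_tr) + lam * len(X_tr) * np.eye(len(X_tr)), y_tr)
    return kernel(np.atleast_1d(x), X_tr) @ c

diffs = []
for i in range(n):
    full = (krr_predict(X, y, X[i])[0] - y[i]) ** 2              # V(f_S, z_i)
    keep = np.arange(n) != i
    loo = (krr_predict(X[keep], y[keep], X[i])[0] - y[i]) ** 2   # V(f_{S^i}, z_i)
    diffs.append(abs(loo - full))
print(max(diffs))   # a proxy for beta_CV at this n; it should shrink as n grows
```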