# Statistical Learning Theory

Olivier Bousquet
Department of Empirical Inference
Max Planck Institute of Biological Cybernetics
olivier.bousquet@tuebingen.mpg.de
Machine Learning Summer School, August 2003
• Lecture 1: Introduction
• Part I: Binary Classiﬁcation
Lecture 2: Basic bounds
Lecture 3: VC theory
Lecture 4: Capacity measures
O. Bousquet – Statistical Learning Theory 1
• Part II: Real-Valued Classiﬁcation
Lecture 6: Margin and loss functions
Lecture 7: Regularization
Lecture 8: SVM
Lecture 1
The Learning Problem
• Context
• Formalization
• Algorithms and Bounds
Learning and Inference
The inductive inference process:
1. Observe a phenomenon
2. Construct a model of the phenomenon
3. Make predictions
⇒ This is more or less the definition of natural sciences!
⇒ The goal of Machine Learning is to automate this process
⇒ The goal of Learning Theory is to formalize it.
Pattern recognition
We consider here the supervised learning framework for pattern
recognition:
• Data consists of pairs (instance, label)
• Label is +1 or −1
• Algorithm constructs a function (instance → label)
• Goal: make few mistakes on future unseen instances
Approximation/Interpolation
It is always possible to build a function that fits the data exactly.

[Figure: a curve interpolating all the data points]

But is it reasonable?
Occam’s Razor
Idea: look for regularities in the observed phenomenon
These can be generalized from the observed past to the future
⇒ choose the simplest consistent model
How to measure simplicity?
• Physics: number of constants
• Description length
• Number of parameters
• ...
No Free Lunch
• No Free Lunch
- if there is no assumption on how the past is related to the future, prediction is impossible
- if there is no restriction on the possible phenomena, generalization is impossible
• We need to make assumptions
• Simplicity is not absolute
• Data will never replace knowledge
• Generalization = data + knowledge
Assumptions
Two types of assumptions
• Future observations related to past ones
→ Stationarity of the phenomenon
• Constraints on the phenomenon
→ Notion of simplicity
Goals
⇒ How can we make predictions from the past? What are the assumptions?
• Give a formal deﬁnition of learning, generalization, overﬁtting
• Characterize the performance of learning algorithms
• Design better algorithms
Probabilistic Model
Relationship between past and future observations
⇒ Sampled independently from the same distribution
• Independence: each new observation yields maximum information
• Identical distribution: the observations give information about the
underlying phenomenon (here a probability distribution)
Probabilistic Model
We consider an input space X and an output space Y.
Here: classification case Y = {−1, 1}.

Assumption: The pairs (X, Y) ∈ X × Y are distributed according to P (unknown).

Data: We observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P.

Goal: construct a function g : X → Y which predicts Y from X.
Probabilistic Model
Criterion to choose our function: low probability of error P(g(X) ≠ Y).

Risk
R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}]

• P is unknown, so we cannot directly measure the risk
• We can only measure the agreement on the data
• Empirical risk
R_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}
Target function
• P can be decomposed as P_X × P(Y|X)
• η(x) = E[Y|X = x] = 2P[Y = 1|X = x] − 1 is the regression function
• t(x) = sgn η(x) is the target function
• in the deterministic case Y = t(X) (P[Y = 1|X] ∈ {0, 1})
• in general, n(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 is the noise level
Indeed, if t(x) is totally chaotic, there is no possible generalization
from ﬁnite data.
Assumptions can be
• Preference (e.g. a priori probability distribution on possible functions)
• Restriction (set of possible functions)
Treating lack of knowledge
• Bayesian approach: uniform distribution
• Learning Theory approach: worst case analysis
Approximation/Interpolation (again)
How to trade off knowledge and data?

[Figure: fits to the data, trading off simplicity against exact interpolation]
Overﬁtting/Underﬁtting
• Underﬁtting
model too small to ﬁt the data
• Overﬁtting
artiﬁcially good agreement with the data
No way to detect them from the data! Need extra validation data.
Empirical Risk Minimization
• Choose a model G (set of possible functions)
• Minimize the empirical risk in the model:
min_{g ∈ G} R_n(g)

What if the Bayes classifier is not in the model?
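As a small aside, empirical risk minimization over a finite model can be sketched in a few lines. The data distribution and the grid of thresholds below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X uniform on [0, 1], label +1 iff X > 0.3, with 10% label noise.
n = 200
X = rng.uniform(0, 1, n)
Y = np.where(X > 0.3, 1, -1)
flip = rng.uniform(size=n) < 0.1
Y[flip] = -Y[flip]

# Finite model G: threshold classifiers g_t(x) = sign(x - t) on a grid of t.
thresholds = np.linspace(0, 1, 51)

def empirical_risk(t):
    """R_n(g_t): fraction of training mistakes of the threshold classifier g_t."""
    return np.mean(np.where(X > t, 1, -1) != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]  # the empirical risk minimizer in G
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```

Note that the minimum is over the model G only: if the Bayes classifier is not a threshold, no element of the grid can reach the Bayes risk.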
Approximation/Estimation
• Bayes risk
R* = inf_g R(g).
Best risk a deterministic function can have (risk of the target function, or Bayes classifier).
• Decomposition: let R(g*) = inf_{g ∈ G} R(g). Then
R(g_n) − R* = (R(g*) − R*) + (R(g_n) − R(g*))
            = approximation error + estimation error
• Only the estimation error is random (i.e. depends on the data).
Structural Risk Minimization
• Choose a collection of models {G_d : d = 1, 2, . . .}
• Minimize the empirical risk in each model
• Minimize the penalized empirical risk
min_d min_{g ∈ G_d} R_n(g) + pen(d, n)

pen(d, n) gives preference to models where the estimation error is small
pen(d, n) measures the size or capacity of the model
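A minimal sketch of the SRM recipe, under assumptions of my own: nested finite models G_d (threshold grids of size N_d = 2^d) and the finite-class penalty pen(d, n) = √((log N_d + log(1/δ))/(2n)) as the capacity term.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy threshold labels on [0, 1] (illustrative assumption).
n = 200
X = rng.uniform(0, 1, n)
Y = np.where(X > 0.3, 1, -1)
Y[rng.uniform(size=n) < 0.1] *= -1

def emp_risk(t):
    return np.mean(np.where(X > t, 1, -1) != Y)

# Nested models G_d: threshold grids of size N_d = 2^d, penalized by
# pen(d, n) = sqrt((log N_d + log(1/delta)) / (2n)).
delta = 0.05
best = None
for d in range(1, 8):
    grid = np.linspace(0, 1, 2 ** d)
    risks = [emp_risk(t) for t in grid]
    pen = np.sqrt((d * np.log(2) + np.log(1 / delta)) / (2 * n))
    score = min(risks) + pen
    if best is None or score < best[0]:
        best = (score, d, grid[int(np.argmin(risks))])

score, d_star, t_star = best
print(f"selected model d={d_star}, threshold={t_star:.3f}, penalized risk={score:.3f}")
```

The penalty grows with d while the best empirical risk shrinks, so the minimization over d trades capacity against fit.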
Regularization
• Choose a large model G (possibly dense)
• Choose a regularizer, e.g. a norm ‖g‖
• Minimize the regularized empirical risk
min_{g ∈ G} R_n(g) + λ‖g‖²
• Choose an optimal trade-off λ (regularization parameter).

Most methods can be thought of as regularization methods.
Bounds (1)
A learning algorithm
• Takes as input the data (X_1, Y_1), . . . , (X_n, Y_n)
• Produces a function g_n

Can we estimate the risk of g_n?
⇒ it is a random quantity (depends on the data)
⇒ we need probabilistic bounds
Bounds (2)
• Error bounds
R(g_n) ≤ R_n(g_n) + B
⇒ Estimation from an empirical quantity
• Relative error bounds
Best in a class: R(g_n) ≤ R(g*) + B
Bayes risk: R(g_n) ≤ R* + B
⇒ Theoretical guarantees
Lecture 2
Basic Bounds
• Probability tools
• Relationship with empirical processes
• Law of large numbers
• Union bound
• Relative error bounds
Probability Tools (1)
Basic facts
• Union: P[A or B] ≤ P[A] + P[B]
• Inclusion: If A ⇒ B, then P[A] ≤ P[B].
• Inversion: If P[X ≥ t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^{−1}(δ).
• Expectation: If X ≥ 0, E[X] = ∫_0^∞ P[X ≥ t] dt.
Probability Tools (2)
Basic inequalities
• Jensen: for f convex, f(E[X]) ≤ E[f(X)]
• Markov: If X ≥ 0, then for all t > 0, P[X ≥ t] ≤ E[X]/t
• Chebyshev: for t > 0, P[|X − E[X]| ≥ t] ≤ Var[X]/t²
• Chernoff: for all t ∈ R, P[X ≥ t] ≤ inf_{λ ≥ 0} E[e^{λ(X−t)}]
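These inequalities are easy to check empirically. A quick sketch on an assumed exponential sample (E[X] = 1, Var[X] = 1), comparing the observed tail frequencies with the Markov and Chebyshev bounds:

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical check of Markov's and Chebyshev's inequalities on an
# exponential sample (E[X] = 1, Var[X] = 1).
X = rng.exponential(scale=1.0, size=100_000)

for t in (2.0, 3.0, 5.0):
    markov = X.mean() / t                       # bound on P[X >= t]
    emp = np.mean(X >= t)
    cheby = X.var() / t ** 2                    # bound on P[|X - E[X]| >= t]
    emp_dev = np.mean(np.abs(X - X.mean()) >= t)
    print(f"t={t}: P[X>=t]~{emp:.4f} <= {markov:.4f} (Markov); "
          f"P[|X-EX|>=t]~{emp_dev:.4f} <= {cheby:.4f} (Chebyshev)")
```

The bounds are loose but universal: they use only the mean (Markov) or the variance (Chebyshev), not the shape of the distribution.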
Error bounds
Recall that we want to bound R(g_n) = E[1_{g_n(X) ≠ Y}] where g_n has been constructed from (X_1, Y_1), . . . , (X_n, Y_n).
• Cannot be observed (P is unknown)
• Random (depends on the data)
⇒ we want to bound
P[R(g_n) − R_n(g_n) > ε]
Loss class
For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class
F = {f : (x, y) ↦ 1_{g(x) ≠ y} : g ∈ G}
Denote Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i).
Quantity of interest:
Pf − P_n f
We will go back and forth between F and G (bijection)
Empirical process
Empirical process:
{Pf − P_n f}_{f ∈ F}
• Process = collection of random variables (here indexed by functions in F)
• Empirical = each random variable is built from the empirical measure of the data
⇒ Many techniques exist to control the supremum
sup_{f ∈ F} (Pf − P_n f)
The Law of Large Numbers
R(g) − R_n(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Z_i)
→ difference between the expectation and the empirical average of the r.v. f(Z)

Law of large numbers:
P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)] = 0 ] = 1.

⇒ can we quantify it?
Hoeﬀding’s Inequality
Quantitative version of the law of large numbers.
Assumes bounded random variables.

Theorem 1. Let Z_1, . . . , Z_n be n i.i.d. random variables. If f(Z) ∈ [a, b], then for all ε > 0, we have
P[ |(1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)]| > ε ] ≤ 2 exp( −2nε² / (b − a)² ).

⇒ Let's rewrite it to better understand
Hoeﬀding’s Inequality
Write
δ = 2 exp( −2nε² / (b − a)² )
Then
P[ |P_n f − Pf| > (b − a) √(log(2/δ) / (2n)) ] ≤ δ
or [Inversion] with probability at least 1 − δ,
|P_n f − Pf| ≤ (b − a) √(log(2/δ) / (2n))
Hoeﬀding’s inequality
Let's apply it to f(Z) = 1_{g(X) ≠ Y}.
For any g, and any δ > 0, with probability at least 1 − δ,
R(g) ≤ R_n(g) + √(log(2/δ) / (2n)). (1)

Notice that one has to consider a fixed function f; the probability is with respect to the sampling of the data.
If the function depends on the data, this does not apply!
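Bound (1) can be checked by simulation for a fixed classifier. The setup below is an illustrative assumption: each point is misclassified independently with probability R(g) = 0.2, so R_n(g) is a binomial average and the bound should fail on at most a δ-fraction of samples.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of bound (1) for a FIXED classifier g: the event
# R(g) <= R_n(g) + sqrt(log(2/delta) / (2n)) should fail on at most a
# delta-fraction of samples.
n, delta, trials = 100, 0.05, 2000
true_risk = 0.2
slack = np.sqrt(np.log(2 / delta) / (2 * n))

failures = 0
for _ in range(trials):
    emp_risk = np.mean(rng.uniform(size=n) < true_risk)  # R_n(g) on a fresh sample
    if true_risk > emp_risk + slack:
        failures += 1
fail_rate = failures / trials
print(f"bound violated on {fail_rate:.3%} of samples (allowed: {delta:.0%})")
```

In practice the observed failure rate is far below δ: Hoeffding uses only boundedness, so it is conservative here.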
Limitations
• For each fixed function f ∈ F, there is a set S of samples for which
Pf − P_n f ≤ √(log(2/δ) / (2n))   (P[S] ≥ 1 − δ)
• These sets may be different for different functions
• The function chosen by the algorithm depends on the sample
⇒ For the observed sample, only some of the functions in F will satisfy this inequality!
Limitations
What we need to bound is
Pf_n − P_n f_n
where f_n is the function chosen by the algorithm based on the data.

For any fixed sample, there exists a function f such that
Pf − P_n f = 1
Take the function which is f(X_i) = Y_i on the data and f(X) = −Y everywhere else.

This does not contradict Hoeffding, but shows it is not enough.
Limitations
[Figure: risk R[f] and empirical risk R_emp[f] plotted over the function class, with their respective minimizers f_opt and f_m]

Hoeffding's inequality quantifies differences for a fixed function
Uniform Deviations
Before seeing the data, we do not know which function the algorithm will choose.
The trick is to consider uniform deviations:
R(f_n) − R_n(f_n) ≤ sup_{f ∈ F} (R(f) − R_n(f))
We need a bound which holds simultaneously for all functions in the class.
Union Bound
Consider two functions f_1, f_2 and define
C_i = {(x_1, y_1), . . . , (x_n, y_n) : Pf_i − P_n f_i > ε}
From Hoeffding's inequality, for each i,
P[C_i] ≤ δ
We want to bound the probability of being 'bad' for i = 1 or i = 2:
P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2]
Finite Case
More generally,
P[C_1 ∪ · · · ∪ C_N] ≤ Σ_{i=1}^N P[C_i]
We have
P[∃f ∈ {f_1, . . . , f_N} : Pf − P_n f > ε]
≤ Σ_{i=1}^N P[Pf_i − P_n f_i > ε]
≤ N exp(−2nε²)
Finite Case
We obtain: for G = {g_1, . . . , g_N} and any δ > 0, with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ R_n(g) + √( (log N + log(1/δ)) / (2n) )

This is a generalization bound!

Coding interpretation: log N is the number of bits needed to specify a function in F.
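The slack term of this finite-class bound is easy to tabulate, which makes its two behaviors visible: logarithmic growth in the class size N, and 1/√n decay in the sample size.

```python
import math

def finite_class_bound(n, N, delta):
    """Slack of the finite-class bound: with probability at least 1 - delta,
    R(g) <= R_n(g) + sqrt((log N + log(1/delta)) / (2n)) for ALL g in G."""
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

# The slack grows only logarithmically with the class size N...
for N in (10, 1000, 10**6):
    print(f"n=1000, N={N}: slack = {finite_class_bound(1000, N, 0.05):.4f}")
# ...and shrinks like 1/sqrt(n) with the sample size.
for n in (100, 1000, 10000):
    print(f"n={n}, N=1000: slack = {finite_class_bound(n, 1000, 0.05):.4f}")
```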
Approximation/Estimation
Let
g* = arg min_{g ∈ G} R(g)
If g_n minimizes the empirical risk in G,
R_n(g*) − R_n(g_n) ≥ 0
Thus
R(g_n) = R(g_n) − R(g*) + R(g*)
       ≤ R_n(g*) − R_n(g_n) + R(g_n) − R(g*) + R(g*)
       ≤ 2 sup_{g ∈ G} |R(g) − R_n(g)| + R(g*)
Approximation/Estimation
We obtain: with probability at least 1 − δ,
R(g_n) ≤ R(g*) + 2 √( (log N + log(2/δ)) / (2n) )

The first term decreases as N increases;
the second term increases.
The size of G controls the trade-off.
Summary (1)
• Inference requires assumptions
- Data sampled i.i.d. from P
- Restrict the possible functions to G
- Choose a sequence of models G_m to have more flexibility/control
Summary (2)
• Bounds are valid w.r.t. repeated sampling
- For a fixed function g, for most of the samples,
R(g) − R_n(g) ≈ 1/√n
- For most of the samples, if |G| = N,
sup_{g ∈ G} (R(g) − R_n(g)) ≈ √(log N / n)
⇒ Extra variability because the chosen g_n changes with the data
Improvements
We obtained
sup_{g ∈ G} (R(g) − R_n(g)) ≤ √( (log N + log(2/δ)) / (2n) )
To be improved:
• Hoeffding only uses boundedness, not the variance
• The union bound is as bad as if the events were independent
• The supremum is not what the algorithm chooses

Next we improve the union bound and extend it to the infinite case.
Reﬁned union bound (1)
For each f ∈ F,
P[ Pf − P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ δ(f)
hence
P[ ∃f ∈ F : Pf − P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f ∈ F} δ(f)
Choose δ(f) = δ p(f) with Σ_{f ∈ F} p(f) = 1
Reﬁned union bound (2)
With probability at least 1 − δ,
∀f ∈ F, Pf ≤ P_n f + √( (log(1/p(f)) + log(1/δ)) / (2n) )
• Applies to countably infinite F
• Can put knowledge about the algorithm into p(f)
• But p must be chosen before seeing the data
Reﬁned union bound (3)
• A good p means a good bound. The bound can be improved if you know ahead of time which function will be chosen (knowledge improves the bound)
• In the infinite case, how to choose p (it implies an ordering of F)?
• The trick is to look at F through the data
Lecture 3
Inﬁnite Case: Vapnik-Chervonenkis Theory
• Growth function
• Vapnik-Chervonenkis dimension
• Proof of the VC bound
• VC entropy
• SRM
Inﬁnite Case
Measure of the size of an infinite class?
• Consider the projection of F on the sample:
F_{z_1,...,z_n} = {(f(z_1), . . . , f(z_n)) : f ∈ F}
The size of this set is the number of possible ways in which the data (z_1, . . . , z_n) can be classified.
• Growth function
S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|
• Note that S_F(n) = S_G(n)
Inﬁnite Case
• Result (Vapnik-Chervonenkis)
With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ R_n(g) + √( 8 (log S_G(2n) + log(4/δ)) / n )
• Always better than N in the finite case
• How to compute S_G(n) in general?
⇒ use the VC dimension
VC Dimension
Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2^n.
If S_G(n) = 2^n, the class of functions can generate any classification of the n points (shattering).

Definition 2. The VC-dimension of G is the largest n such that S_G(n) = 2^n.
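Shattering can be checked by brute force on small examples. The class below, one-sided thresholds on the real line, is an assumed example (not from the lecture); a finite grid of thresholds stands in for all t ∈ R, which is enough since only the ordering of the points matters.

```python
# Brute-force shattering check illustrating Definition 2, on an assumed
# example class: one-sided thresholds g_t(x) = sign(x - t) on the real line.
def labelings(points, thresholds):
    """Projection of the class on `points`: the set of achievable labelings."""
    return {tuple(1 if x > t else -1 for x in points) for t in thresholds}

def shattered(points, thresholds):
    return len(labelings(points, thresholds)) == 2 ** len(points)

grid = [i / 100 for i in range(-100, 201)]  # fine grid standing in for all t

print(shattered([0.5], grid))       # one point can be labeled both ways
print(shattered([0.3, 0.7], grid))  # (+1, -1) is unachievable
# So thresholds shatter 1 point but no pair of points: their VC dimension is 1.
```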
VC Dimension
Hyperplanes
In R^d, VC(hyperplanes) = d + 1
VC Dimension
Number of Parameters
Is the VC dimension equal to the number of parameters?
• One parameter:
{sgn(sin(tx)) : t ∈ R}
• Infinite VC dimension!
VC Dimension
• We want to know S_G(n), but we only know that
S_G(n) = 2^n for n ≤ h
What happens for n ≥ h?

[Figure: log S(m) as a function of m, growing linearly up to m = h]
Vapnik-Chervonenkis-Sauer-Shelah Lemma
Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
S_G(n) ≤ Σ_{i=0}^h (n choose i)
and for all n ≥ h,
S_G(n) ≤ (en/h)^h
⇒ phase transition at n = h: exponential growth before, polynomial after
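The phase transition in Lemma 3 shows up clearly in a small table; the VC dimension h = 5 below is an assumed value for illustration.

```python
from math import comb, e

# Numeric look at Lemma 3 for an assumed VC dimension h = 5.
h = 5

def sauer_bound(n):
    """Binomial-sum bound on the growth function S(n)."""
    return sum(comb(n, i) for i in range(h + 1))

for n in (3, 5, 10, 50, 200):
    line = f"n={n}: 2^n={2**n}, sum bound={sauer_bound(n)}"
    if n >= h:
        line += f", (en/h)^h={round((e * n / h) ** h)}"
    print(line)
# For n <= h the sum equals 2^n (no restriction yet); beyond h it grows
# only polynomially, like n^h: the phase transition of the lemma.
```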
VC Bound
Let G be a class with VC dimension h.
With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ R_n(g) + √( 8 (h log(2en/h) + log(4/δ)) / n )
So the error is of order
√( (h log n) / n )
Interpretation
VC dimension: measure of eﬀective dimension
• Depends on geometry of the class
• Gives a natural deﬁnition of simplicity (by quantifying the potential
overﬁtting)
• Not related to the number of parameters
• Finiteness guarantees learnability under any distribution
Symmetrization (lemma)
Key ingredient in VC bounds: symmetrization.
Let Z'_1, . . . , Z'_n be an independent (ghost) sample and P'_n the corresponding empirical measure.

Lemma 4. For any t > 0 such that nt² ≥ 2,
P[ sup_{f ∈ F} (P − P_n)f ≥ t ] ≤ 2 P[ sup_{f ∈ F} (P'_n − P_n)f ≥ t/2 ]
Symmetrization (proof – 1)
Let f_n be the function achieving the supremum (it depends on Z_1, . . . , Z_n). Then
1_{(P−P_n)f_n > t} 1_{(P−P'_n)f_n < t/2} = 1_{(P−P_n)f_n > t ∧ (P−P'_n)f_n < t/2} ≤ 1_{(P'_n−P_n)f_n > t/2}
Taking expectations with respect to the second sample gives
1_{(P−P_n)f_n > t} P'[ (P − P'_n)f_n < t/2 ] ≤ P'[ (P'_n − P_n)f_n > t/2 ]
Symmetrization (proof – 2)
• By Chebyshev's inequality,
P'[ (P − P'_n)f_n ≥ t/2 ] ≤ 4 Var[f_n] / (nt²) ≤ 1/(nt²)
• Hence
1_{(P−P_n)f_n > t} (1 − 1/(nt²)) ≤ P'[ (P'_n − P_n)f_n > t/2 ]
Take the expectation with respect to the first sample and use nt² ≥ 2 (so that 1 − 1/(nt²) ≥ 1/2).
Proof of VC bound (1)
• Symmetrization allows us to replace the expectation by an average on the ghost sample
• Project the function class on the double sample:
F_{Z_1,...,Z_n,Z'_1,...,Z'_n}
• Union bound on F_{Z_1,...,Z_n,Z'_1,...,Z'_n}
• Variant of Hoeffding's inequality:
P[ P_n f − P'_n f > t ] ≤ 2 e^{−nt²/2}
Proof of VC bound (2)
P[ sup_{f ∈ F} (P − P_n)f ≥ t ]
≤ 2 P[ sup_{f ∈ F} (P'_n − P_n)f ≥ t/2 ]
= 2 P[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (P'_n − P_n)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P'_n − P_n)f ≥ t/2 ]
≤ 4 S_F(2n) e^{−nt²/8}
VC Entropy (1)
• VC dimension is distribution independent
⇒ The same bound holds for any distribution
⇒ It is loose for most distributions
• A similar proof can give a distribution-dependent result
VC Entropy (2)
• Denote the size of the projection by N(F, z_1, . . . , z_n) := #F_{z_1,...,z_n}
• The VC entropy is defined as
H_F(n) = log E[N(F, Z_1, . . . , Z_n)]
• VC entropy bound: with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ R_n(g) + √( 8 (H_G(2n) + log(2/δ)) / n )
VC Entropy (proof)
Introduce Rademacher variables σ_i ∈ {−1, 1} (each value with probability 1/2). Then
2 P[ sup_{f ∈ F_{Z,Z'}} (P'_n − P_n)f ≥ t/2 ]
≤ 2 E[ P_σ[ sup_{f ∈ F_{Z,Z'}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z') P_σ[ (1/n) Σ_{i=1}^n σ_i ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z') ] e^{−nt²/8}
From Bounds to Algorithms
• For any distribution, H_G(n)/n → 0 ensures consistency of the empirical risk minimizer (i.e. convergence to the best risk in the class)
• Does it mean we can learn anything?
• No, because of the approximation error of the class
• Need to trade off approximation and estimation error (assessed by the bound)
⇒ Use the bound to control the trade-off
Structural risk minimization
• Structural risk minimization (SRM) (Vapnik, 1979): minimize the right-hand side of
R(g) ≤ R_n(g) + B(h, n).
• To this end, introduce a structure on G.
• Learning machine ≡ a set of functions and an induction principle
SRM: The Picture
[Figure: error vs. capacity h over a structure S_{n−1} ⊂ S_n ⊂ S_{n+1}: the training error decreases with capacity, the capacity term increases, and the bound on the test error is minimized at an intermediate model, close to R(f*)]
Lecture 4
Capacity Measures
• Covering numbers
• Relationships
Covering numbers
• Define a (random) distance d between functions, e.g.
d(f, f') = (1/n) #{f(Z_i) ≠ f'(Z_i) : i = 1, . . . , n}
the normalized Hamming distance of the 'projections' on the sample.
• A set f_1, . . . , f_N covers F at radius ε if
F ⊂ ∪_{i=1}^N B(f_i, ε)
• The covering number N(F, ε, n) is the minimum size of a cover of F at radius ε.
Note that N(F, ε, n) = N(G, ε, n).
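A greedy procedure gives a quick upper bound on such covering numbers. The random {0,1}-valued projected class below is an illustrative assumption; only the normalized Hamming distance comes from the definition above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Greedy upper bound on a covering number under the normalized Hamming
# distance, for a class already projected on a sample of n points
# (rows = label vectors on the sample).
n, n_funcs = 30, 200
proj = rng.integers(0, 2, size=(n_funcs, n))

def greedy_cover_size(proj, eps):
    """Pick centers until every row is within eps of some center.
    Greedy covering gives an upper bound on N(F, eps, n)."""
    centers = 0
    uncovered = list(range(len(proj)))
    while uncovered:
        c = uncovered[0]  # any uncovered function becomes a new center
        centers += 1
        uncovered = [i for i in uncovered
                     if np.mean(proj[i] != proj[c]) > eps]
    return centers

for eps in (0.1, 0.25, 0.5):
    print(f"eps={eps}: greedy cover size {greedy_cover_size(proj, eps)}")
```

A larger radius needs fewer centers; at ε ≥ 1 a single center covers everything, since the normalized Hamming distance never exceeds 1.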
Bound with covering numbers
• When the covering numbers are finite, one can approximate the class G by a finite set of functions
• Result:
P[ ∃g ∈ G : R(g) − R_n(g) ≥ t ] ≤ 8 E[N(G, t, n)] e^{−nt²/128}
Covering numbers and VC dimension
• Notice that for all t, N(G, t, n) ≤ #G_Z = N(G, Z)
• Hence log N(G, t, n) ≤ h log(en/h)
• Haussler:
N(G, t, n) ≤ C h (4e)^h (1/t)^h
• This is independent of n
Reﬁnement
• VC entropy corresponds to log covering numbers at the minimal scale
• The covering number bound is a generalization where the scale is adapted to the error
• Is this the right scale?
• It turns out that results can be improved by considering all scales (→ chaining)
Rademacher averages
Let σ_1, . . . , σ_n be independent random variables with
P[σ_i = 1] = P[σ_i = −1] = 1/2
• Notation (randomized empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i)
• Rademacher average: ℛ(F) = E[ sup_{f ∈ F} R_n f ]
• Conditional Rademacher average: ℛ_n(F) = E_σ[ sup_{f ∈ F} R_n f ]
Result
• Distribution dependent:
∀f ∈ F, Pf ≤ P_n f + 2ℛ(F) + √( log(1/δ) / (2n) )
• Data dependent:
∀f ∈ F, Pf ≤ P_n f + 2ℛ_n(F) + √( 2 log(2/δ) / n )
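The conditional Rademacher average ℛ_n(F) is directly computable from data by sampling the σ_i. The small random {0,1}-valued class below is an illustrative assumption standing in for a projected loss class.

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo estimate of the conditional Rademacher average R_n(F) for a
# small finite class with the sample held fixed.
n, n_funcs = 100, 20
F_proj = rng.integers(0, 2, size=(n_funcs, n)).astype(float)  # f(Z_i) values

def conditional_rademacher(F_proj, draws, rng):
    """E_sigma[ sup_f (1/n) sum_i sigma_i f(Z_i) ], estimated over `draws`
    samples of the sigma_i (expectation over sigma only: data is fixed)."""
    n = F_proj.shape[1]
    sups = [np.max(F_proj @ rng.choice([-1.0, 1.0], size=n)) / n
            for _ in range(draws)]
    return float(np.mean(sups))

r_hat = conditional_rademacher(F_proj, 500, rng)
print(f"estimated conditional Rademacher average = {r_hat:.4f}")
# For a finite class this is at most about sqrt(2 log(n_funcs) / n),
# so it is small whenever n >> log(n_funcs).
```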
Concentration
• Hoeffding's inequality is a concentration inequality
• When n increases, the average concentrates around the expectation
• Generalizations to functions that depend on i.i.d. random variables exist
McDiarmid’s Inequality
Assume that for all i = 1, . . . , n,
sup_{z_1,...,z_n,z'_i} |F(z_1, . . . , z_i, . . . , z_n) − F(z_1, . . . , z'_i, . . . , z_n)| ≤ c
then for all ε > 0,
P[ |F − E[F]| > ε ] ≤ 2 exp( −2ε² / (nc²) )
• Use concentration to relate sup_{f ∈ F} (Pf − P_n f) to its expectation
• Use symmetrization to relate the expectation to the Rademacher average
• Use concentration again to relate the Rademacher average to the conditional one
Application (1)
Since
sup_{f ∈ F} (A(f) + B(f)) ≤ sup_{f ∈ F} A(f) + sup_{f ∈ F} B(f)
we have
| sup_{f ∈ F} C(f) − sup_{f ∈ F} A(f) | ≤ sup_{f ∈ F} (C(f) − A(f))
which gives
| sup_{f ∈ F} (Pf − P_n f) − sup_{f ∈ F} (Pf − P'_n f) | ≤ sup_{f ∈ F} (P'_n f − P_n f)
Application (2)
Since f ∈ {0, 1}, replacing Z_i by Z'_i changes P_n f by
P'_n f − P_n f = (1/n) (f(Z'_i) − f(Z_i)) ≤ 1/n
thus
| sup_{f ∈ F} (Pf − P_n f) − sup_{f ∈ F} (Pf − P'_n f) | ≤ 1/n
McDiarmid's inequality can be applied with c = 1/n
Symmetrization (1)
• Upper bound:
E[ sup_{f ∈ F} (Pf − P_n f) ] ≤ 2 E[ sup_{f ∈ F} R_n f ]
• Lower bound:
E[ sup_{f ∈ F} |Pf − P_n f| ] ≥ (1/2) E[ sup_{f ∈ F} R_n f ] − 1/(2√n)
Symmetrization (2)
E[ sup_{f ∈ F} (Pf − P_n f) ]
= E[ sup_{f ∈ F} (E[P'_n f] − P_n f) ]
≤ E_{Z,Z'}[ sup_{f ∈ F} (P'_n f − P_n f) ]
= E_{σ,Z,Z'}[ sup_{f ∈ F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ]
≤ 2 E[ sup_{f ∈ F} R_n f ]
Loss class and initial class
ℛ(F) = E[ sup_{g ∈ G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i) ≠ Y_i} ]
= E[ sup_{g ∈ G} (1/n) Σ_{i=1}^n σ_i (1 − Y_i g(X_i))/2 ]
= (1/2) E[ sup_{g ∈ G} (1/n) Σ_{i=1}^n σ_i Y_i g(X_i) ]
= (1/2) ℛ(G)
(1/2) E[ sup_{g ∈ G} (1/n) Σ_{i=1}^n σ_i g(X_i) ]
= 1/2 + E[ sup_{g ∈ G} ( −(1/n) Σ_{i=1}^n (1 − σ_i g(X_i))/2 ) ]
= 1/2 − E[ inf_{g ∈ G} (1/n) Σ_{i=1}^n (1 − σ_i g(X_i))/2 ]
= 1/2 − E[ inf_{g ∈ G} R_n(g, σ) ]
where R_n(g, σ) is the empirical risk of g with respect to the random labels σ_i.
• Not harder than computing the empirical risk minimizer:
pick the σ_i at random and minimize the error with respect to the labels σ_i
• Intuition: measures how much the class can fit random noise
• Large class ⇒ ℛ(G) = 1/2
Concentration again
• Let
Φ = E_σ[ sup_{f ∈ F} R_n f ]
the expectation being over the σ_i only, with (X_i, Y_i) fixed.
• As a function of the sample, Φ satisfies McDiarmid's assumptions with c = 1/n
⇒ E[Φ] = ℛ(F) can be estimated by Φ = ℛ_n(F)
Relationship with VC dimension
• For a finite set F = {f_1, . . . , f_N},
ℛ(F) ≤ 2 √( log N / n )
• Consequence for a VC class F with dimension h:
ℛ(F) ≤ 2 √( (h log(en/h)) / n ).
⇒ Recovers the VC bound with a concentration proof
Chaining
• Using covering numbers at all scales, the geometry of the class is better captured
• Dudley:
ℛ_n(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt
• Consequence:
ℛ(F) ≤ C √( h/n )
• Removes the unnecessary log n factor!
Lecture 5
• Relative error bounds
• Noise conditions
• PAC-Bayesian bounds
Binomial tails
• n P_n f ∼ B(n, p), a binomial distribution, with p = Pf
• P[Pf − P_n f ≥ t] = Σ_{k=0}^{⌊n(p−t)⌋} (n choose k) p^k (1 − p)^{n−k}
• It can be upper bounded by:
Exponential: ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}
Bennett: exp( −(np/(1−p)) ((1 − t/p) log(1 − t/p) + t/p) )
Bernstein: exp( −nt² / (2p(1−p) + 2t/3) )
Hoeffding: exp(−2nt²)
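The last three bounds can be compared numerically against the exact binomial tail; the regime below (p = 0.1, n = 100) is an illustrative choice of the small-variance case.

```python
import math

# Numeric comparison of the binomial tail bounds above, in the
# small-variance regime p = 0.1, n = 100.
n, p = 100, 0.1

def exact_tail(t):
    """P[Pf - P_n f >= t] where n * P_n f ~ B(n, p)."""
    kmax = math.floor(n * (p - t))
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(kmax + 1))

def bernstein(t):
    return math.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))

def hoeffding(t):
    return math.exp(-2 * n * t**2)

for t in (0.02, 0.05, 0.08):
    print(f"t={t}: exact={exact_tail(t):.4f}  "
          f"Bernstein={bernstein(t):.4f}  Hoeffding={hoeffding(t):.4f}")
# When the variance p(1-p) is far below the worst case 1/4, Bernstein is
# tighter than Hoeffding, which only uses boundedness.
```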
Tail behavior
• For small deviations, Gaussian behavior ≈ exp( −nt² / (2p(1 − p)) )
⇒ Gaussian with variance p(1 − p)
• For large deviations, Poisson behavior ≈ exp(−3nt/2)
⇒ Tails heavier than Gaussian
• Upper bounding with a Gaussian of largest (maximum) variance gives exp(−2nt²)
Illustration (1)
Maximum variance (p = 0.5)

[Figure: deviation bound vs. x, comparing the binomial CDF, Bernstein, Bennett, the exact binomial tail, and the best Gaussian bound]
Illustration (2)
Small variance (p = 0.1)

[Figure: the same comparison at p = 0.1, including a sub-Gaussian bound, with a zoomed view of the tail region]
Taking the variance into account (1)
• Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf.
• For each f ∈ F, by Bernstein's inequality,
Pf ≤ P_n f + √( (2 Pf log(1/δ)) / n ) + (2 log(1/δ)) / (3n)
• The Gaussian part dominates (for Pf not too small, or n large enough); it depends on Pf
Taking the variance into account (2)
• Central Limit Theorem:
√n (Pf − P_n f) / √( Pf(1 − Pf) ) → N(0, 1)
⇒ The idea is to consider the ratio
(Pf − P_n f) / √(Pf)
Normalization
• Here (f ∈ {0, 1}), Var[f] ≤ E[f²] = Pf
• Large variance ⇒ large risk.
• After normalization, the fluctuations are more "uniform":
sup_{f ∈ F} (Pf − P_n f) / √(Pf)
is not necessarily attained at functions with large variance.
• The focus of learning is functions with small error Pf (hence small variance).
⇒ The normalized supremum takes this into account.
Relative deviations
Vapnik-Chervonenkis, 1974:
For δ > 0, with probability at least 1 − δ,
∀f ∈ F, (Pf − P_n f) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n )
and
∀f ∈ F, (P_n f − Pf) / √(P_n f) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n )
Proof sketch
1. Symmetrization:
P[ sup_{f ∈ F} (Pf − P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f ∈ F} (P'_n f − P_n f)/√((P_n f + P'_n f)/2) ≥ t ]
2. Randomization:
= 2 E[ P_σ[ sup_{f ∈ F} ((1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)))/√((P_n f + P'_n f)/2) ≥ t ] ]
3. Tail bound
Consequences
From the fact that
A ≤ B + C√A ⇒ A ≤ B + C² + √(BC)
we get
∀f ∈ F, Pf ≤ P_n f + 2 √( P_n f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
Zero noise
Ideal situation:
• g_n is the empirical risk minimizer
• t ∈ G
• R* = 0 (no noise: n(X) = 0 a.s.)
In that case:
• R_n(g_n) = 0
⇒ R(g_n) = O( (d log n) / n ).
Interpolating between rates ?
• The rates are not correctly estimated by this inequality
• Consequence of the relative error bounds:
Pf_n ≤ Pf* + 2 √( Pf* (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
• The quantity which is small is not Pf* but Pf_n − Pf*
• But relative error bounds do not apply to differences
Deﬁnitions
• P = P_X × P(Y|X)
• regression function: η(x) = E[Y|X = x]
• target function: t(x) = sgn η(x)
• noise level: n(x) = (1 − |η(x)|)/2
• Bayes risk: R* = E[n(X)]
• R(g) = E[(1 − η(X))/2] + E[ η(X) 1_{g(X) ≤ 0} ]
• R(g) − R* = E[ |η(X)| 1_{g(X)η(X) ≤ 0} ]
Intermediate noise
Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0), the deterministic case, one can assume that n is well behaved.
Two kinds of assumptions:
• n not too close to 1/2
• n not often too close to 1/2
Massart Condition
• For some c > 0, assume
|η(X)| > 1/c almost surely
• There is no region where the decision is completely random
• The noise is bounded away from 1/2
Tsybakov Condition
Let α ∈ [0, 1]. The following conditions are equivalent:
(1) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α
(2) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α
(3) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}
Equivalence
• (1) ⇔ (2): Recall R(g) − R* = E[ |η(X)| 1_{gη ≤ 0} ]. For each function g, there exists a set A such that 1_A = 1_{gη ≤ 0}.
• (2) ⇒ (3): Let A = {x : |η(x)| ≤ t}. Then
P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α
⇒ P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}
• (3) ⇒ (1):
R(g) − R* = E[ |η(X)| 1_{gη ≤ 0} ]
≥ t E[ 1_{gη ≤ 0} 1_{|η| > t} ]
= t P[|η| > t] − t E[ 1_{gη > 0} 1_{|η| > t} ]
≥ t (1 − B t^{α/(1−α)}) − t P[gη > 0] = t (P[gη ≤ 0] − B t^{α/(1−α)})
Take t = ( (1 − α) P[gη ≤ 0] / B )^{(1−α)/α}
⇒ P[gη ≤ 0] ≤ B^{1−α} / ((1 − α)^{1−α} α^α) (R(g) − R*)^α
Remarks
• α is in [0, 1] because
R(g) − R* = E[ |η(X)| 1_{gη ≤ 0} ] ≤ E[ 1_{gη ≤ 0} ]
• α = 0: no condition
• α = 1: gives Massart's condition
Consequences
• Under Massart's condition,
E[ (1_{g(X) ≠ Y} − 1_{t(X) ≠ Y})² ] ≤ c (R(g) − R*)
• Under Tsybakov's condition,
E[ (1_{g(X) ≠ Y} − 1_{t(X) ≠ Y})² ] ≤ c (R(g) − R*)^α
Relative loss class
• F is the loss class associated to G
• The relative loss class is defined as
F̃ = {f − f* : f ∈ F}
• It satisfies, for f ∈ F̃,
Pf² ≤ c (Pf)^α
Finite case
• A union bound on F̃ with Bernstein’s inequality would give
Pf_n − Pf* ≤ P_n f_n − P_n f* + √( 8c (Pf_n − Pf*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n)
• Consequence when f* ∈ F (but R* > 0):
Pf_n − Pf* ≤ C ( log(N/δ) / n )^(1/(2−α))
always better than n^(−1/2) for α > 0
O. Bousquet – Statistical Learning Theory – Lecture 5 112
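To get a feel for the interpolation in this rate, it can be evaluated numerically (N, δ, and n below are arbitrary illustrative values, not from the slides):

```python
import math

def low_noise_rate(n, alpha, N=100, delta=0.05):
    """Rate (log(N/delta)/n)^(1/(2-alpha)): alpha = 0 recovers the n^(-1/2)
    regime, alpha = 1 gives the fast 1/n regime."""
    return (math.log(N / delta) / n) ** (1 / (2 - alpha))

n = 10_000
rates = {a: low_noise_rate(n, a) for a in (0.0, 0.5, 1.0)}
print(rates)  # the rate improves monotonically as alpha grows
```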
Local Rademacher average
• Deﬁnition
R(F, r) = E[ sup_{f ∈ F : Pf² ≤ r} R_n f ]
• Allows one to generalize the previous result
• Computes the capacity of a small ball in F (functions with small variance)
• Under noise conditions, small variance implies small error
O. Bousquet – Statistical Learning Theory – Lecture 5 113
Sub-root functions
Deﬁnition
A function ψ : R → R is sub-root if
• ψ is non-decreasing
• ψ is non-negative
• ψ(r)/√r is non-increasing
O. Bousquet – Statistical Learning Theory – Lecture 5 114
Sub-root functions
Properties
A sub-root function
• is continuous
• has a unique ﬁxed point: ψ(r*) = r*
[Plot: a sub-root function ψ(x) crossing the diagonal y = x at its unique ﬁxed point]
O. Bousquet – Statistical Learning Theory – Lecture 5 115
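The fixed point r* can be found numerically by simple iteration of ψ, since a sub-root function pulls any positive starting point toward r*. A small sketch with a hypothetical sub-root ψ:

```python
import math

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(r); for a sub-root psi this converges to the
    unique positive fixed point r* = psi(r*)."""
    r = r0
    for _ in range(max_iter):
        r_new = psi(r)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

# Example sub-root function: psi(r) = sqrt(c*r); its fixed point is r* = c.
c = 0.25
r_star = fixed_point(lambda r: math.sqrt(c * r))
print(r_star)  # ≈ 0.25
```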
Star hull
• Deﬁnition
★F = {αf : f ∈ F, α ∈ [0, 1]}
• Properties
R_n(★F, r) is sub-root
• The entropy of ★F is not much bigger than the entropy of F
O. Bousquet – Statistical Learning Theory – Lecture 5 116
Result
• r* is the ﬁxed point of R(F, r)
• Bounded functions:
Pf − P_n f ≤ C ( √(r* Var[f]) + (log(1/δ) + log log n) / n )
• Consequence when the variance is related to the expectation (Var[f] ≤ c (Pf)^β):
Pf ≤ C ( P_n f + (r*)^(1/(2−β)) + (log(1/δ) + log log n) / n )
O. Bousquet – Statistical Learning Theory – Lecture 5 117
Consequences
• For VC classes, R(F, r) ≤ C √(rh/n), hence r* ≤ C h/n
• Rate of convergence of P_n f to Pf is in O(1/√n)
• But the rate of convergence of Pf_n to Pf* is O(1/n^(1/(2−α)))
• The only condition is t ∈ G, but it can be removed by SRM/model selection
O. Bousquet – Statistical Learning Theory – Lecture 5 118
Proof sketch (1)
• Talagrand’s inequality
sup_{f∈F} Pf − P_n f ≤ E[ sup_{f∈F} Pf − P_n f ] + c √( sup_{f∈F} Var[f] / n ) + c′/n
• Peeling of the class
F_k = {f : Var[f] ∈ [x^k, x^(k+1))}
O. Bousquet – Statistical Learning Theory – Lecture 5 119
Proof sketch (2)
• Application
sup_{f∈F_k} Pf − P_n f ≤ E[ sup_{f∈F_k} Pf − P_n f ] + c √( x Var[f] / n ) + c′/n
• Symmetrization
∀f ∈ F,  Pf − P_n f ≤ 2 R(F, x Var[f]) + c √( x Var[f] / n ) + c′/n
O. Bousquet – Statistical Learning Theory – Lecture 5 120
Proof sketch (3)
• We need to ’solve’ this inequality. Things are simple if R behaves like
a square root, hence the sub-root property:
Pf − P_n f ≤ 2 √( r* Var[f] ) + c √( x Var[f] / n ) + c′/n
• Variance-expectation
Var[f] ≤ c (Pf)^α
Solve in Pf
O. Bousquet – Statistical Learning Theory – Lecture 5 121
Data-dependent version
• As in the global case, one can use data-dependent local Rademacher
averages
R_n(F, r) = E_σ[ sup_{f ∈ F : P_n f² ≤ r} R_n f ]
• Using concentration one can also get
Pf ≤ C ( P_n f + (r*_n)^(1/(2−α)) + (log(1/δ) + log log n) / n )
where r*_n is the ﬁxed point of a sub-root upper bound of R_n(F, r)
O. Bousquet – Statistical Learning Theory – Lecture 5 122
Discussion
• Improved rates under low noise conditions
• Interpolation in the rates
• The capacity measure seems ’local’,
• but it depends on all the functions,
• after appropriate rescaling: each f ∈ F is considered at scale r/Pf²
O. Bousquet – Statistical Learning Theory – Lecture 5 123
Randomized Classiﬁers
Given G a class of functions
• Deterministic: pick a function g_n and always use it to predict
• Randomized:
construct a distribution ρ_n over G
for each instance to classify, pick g ∼ ρ_n
• Error is averaged over ρ_n
R(ρ_n) = ρ_n Pf
R_n(ρ_n) = ρ_n P_n f
O. Bousquet – Statistical Learning Theory – Lecture 5 124
Union Bound (1)
Let π be a (ﬁxed) distribution over F.
• Recall the reﬁned union bound
∀f ∈ F,  Pf − P_n f ≤ √( (log(1/π(f)) + log(1/δ)) / (2n) )
• Take expectations with respect to ρ_n
ρ_n Pf − ρ_n P_n f ≤ ρ_n √( (log(1/π(f)) + log(1/δ)) / (2n) )
O. Bousquet – Statistical Learning Theory – Lecture 5 125
Union Bound (2)
ρ_n Pf − ρ_n P_n f ≤ ρ_n √( (−log π(f) + log(1/δ)) / (2n) )
≤ √( (−ρ_n log π(f) + log(1/δ)) / (2n) )   (Jensen)
= √( (K(ρ_n, π) + H(ρ_n) + log(1/δ)) / (2n) )
• K(ρ_n, π) = ∫ ρ_n(f) log( ρ_n(f)/π(f) ) df : Kullback-Leibler divergence
• H(ρ_n) = −∫ ρ_n(f) log ρ_n(f) df : entropy
O. Bousquet – Statistical Learning Theory – Lecture 5 126
PAC-Bayesian Reﬁnement
• It is possible to improve the previous bound.
• With probability at least 1 − δ,
ρ_n Pf − ρ_n P_n f ≤ √( (K(ρ_n, π) + log 4n + log(1/δ)) / (2n − 1) )
• Good if ρ_n is spread out (the entropy term H(ρ_n) is gone)
• Not interesting if ρ_n = δ_{f_n}
O. Bousquet – Statistical Learning Theory – Lecture 5 127
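For a finite class the bound is easy to evaluate; a minimal sketch (the class size, prior, posterior, and n are made-up illustrative values):

```python
import math

def kl_div(rho, pi):
    """KL divergence between two discrete distributions."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def pac_bayes_gap(rho, pi, n, delta=0.05):
    """PAC-Bayesian bound on rho Pf - rho P_n f:
       sqrt((K(rho, pi) + log 4n + log 1/delta) / (2n - 1))."""
    return math.sqrt((kl_div(rho, pi) + math.log(4 * n) + math.log(1 / delta))
                     / (2 * n - 1))

pi = [0.25] * 4                  # prior over 4 classifiers
rho = [0.85, 0.05, 0.05, 0.05]   # posterior concentrated on one of them
print(pac_bayes_gap(pi, pi, n=1000))   # KL = 0: smallest gap
print(pac_bayes_gap(rho, pi, n=1000))  # concentrating rho costs KL > 0
```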
Proof (1)
• Variational formulation of entropy: for any T,
ρ T(f) ≤ log π e^(T(f)) + K(ρ, π)
• Apply it to λ(Pf − P_n f)²:
λ ρ_n (Pf − P_n f)² ≤ log π e^(λ(Pf−P_n f)²) + K(ρ_n, π)
• Markov’s inequality: with probability 1 − δ,
λ ρ_n (Pf − P_n f)² ≤ log E[ π e^(λ(Pf−P_n f)²) ] + K(ρ_n, π) + log(1/δ)
O. Bousquet – Statistical Learning Theory – Lecture 5 128
Proof (2)
• Fubini
E[ π e^(λ(Pf−P_n f)²) ] = π E[ e^(λ(Pf−P_n f)²) ]
• Modiﬁed Chernoﬀ bound
E[ e^((2n−1)(Pf−P_n f)²) ] ≤ 4n
• Putting things together (λ = 2n − 1):
(2n − 1) ρ_n (Pf − P_n f)² ≤ K(ρ_n, π) + log 4n + log(1/δ)
• Jensen: (2n − 1)(ρ_n (Pf − P_n f))² ≤ (2n − 1) ρ_n (Pf − P_n f)²
O. Bousquet – Statistical Learning Theory – Lecture 5 129
Lecture 6
Loss Functions
• Properties
• Consistency
• Examples
• Losses and noise
O. Bousquet – Statistical Learning Theory – Lecture 6 130
Motivation (1)
• ERM: minimize Σ_{i=1}^n 1_{g(X_i)≠Y_i} in a set G
⇒ Computationally hard
⇒ Smoothing
Replace binary functions by real-valued functions
Introduce a smooth loss function Σ_{i=1}^n ℓ(g(X_i), Y_i)
O. Bousquet – Statistical Learning Theory – Lecture 6 131
Motivation (2)
• Hyperplanes in inﬁnite dimension have inﬁnite VC-dimension but ﬁnite scale-sensitive dimension (to be deﬁned later)
⇒ It is good to have a scale
⇒ This scale can be used to give a conﬁdence (i.e. estimate the density)
• However, losses do not need to be related to densities
• Can get bounds in terms of margin error instead of empirical error
(smoother → easier to optimize for model selection)
O. Bousquet – Statistical Learning Theory – Lecture 6 132
Margin
• It is convenient to work with (by symmetry of +1 and −1)
ℓ(g(x), y) = φ(y g(x))
• y g(x) is the margin of g at (x, y)
• Loss
L(g) = E[ φ(Y g(X)) ],  L_n(g) = (1/n) Σ_{i=1}^n φ(Y_i g(X_i))
• Loss class F = {f : (x, y) ↦ φ(y g(x)) : g ∈ G}
O. Bousquet – Statistical Learning Theory – Lecture 6 133
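The empirical φ-risk L_n(g) is straightforward to compute; a small sketch with a toy real-valued classifier and the hinge loss (all values hypothetical):

```python
def empirical_phi_risk(g, phi, data):
    """L_n(g) = (1/n) * sum_i phi(y_i * g(x_i)) over a sample of (x, y) pairs."""
    return sum(phi(y * g(x)) for x, y in data) / len(data)

hinge = lambda m: max(0.0, 1 - m)   # phi as a function of the margin
g = lambda x: 2 * x - 1             # toy real-valued classifier
data = [(0.0, -1), (1.0, 1), (0.2, -1), (0.9, 1)]
print(empirical_phi_risk(g, hinge, data))  # (0 + 0 + 0.4 + 0.2) / 4 = 0.15
```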
Minimizing the loss
• Decomposition of L(g):
L(g) = (1/2) E[ (1 + η(X)) φ(g(X)) + (1 − η(X)) φ(−g(X)) ]
• Minimization for each x:
H(η) = inf_{α∈R} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )
• L* := inf_g L(g) = E[ H(η(X)) ]
O. Bousquet – Statistical Learning Theory – Lecture 6 134
Classiﬁcation-calibrated
• A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t, i.e. that of η).
• Deﬁnition
φ is classiﬁcation-calibrated if, for any η ≠ 0,
inf_{α : αη ≤ 0} (1 + η) φ(α) + (1 − η) φ(−α) > inf_{α∈R} (1 + η) φ(α) + (1 − η) φ(−α)
• This means the inﬁmum is achieved for an α of the correct sign (and not for an α of the wrong sign, except possibly for η = 0).
O. Bousquet – Statistical Learning Theory – Lecture 6 135
Consequences (1)
Results due to (Jordan, Bartlett and McAuliﬀe 2003)
• φ is classiﬁcation-calibrated iﬀ for all sequences g_i and every probability distribution P,
L(g_i) → L*  ⇒  R(g_i) → R*
• When φ is convex (convenient for optimization), φ is classiﬁcation-calibrated iﬀ it is diﬀerentiable at 0 and φ′(0) < 0
O. Bousquet – Statistical Learning Theory – Lecture 6 136
Consequences (2)
• Let H⁻(η) = inf_{α : αη ≤ 0} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )
• Let ψ(η) be the largest convex function below H⁻(η) − H(η)
• One has
ψ(R(g) − R*) ≤ L(g) − L*
O. Bousquet – Statistical Learning Theory – Lecture 6 137
Examples (1)
[Plot: the 0–1 loss together with the hinge, squared hinge, square, and exponential surrogate losses, as functions of the margin]
O. Bousquet – Statistical Learning Theory – Lecture 6 138
Examples (2)
• Hinge loss
φ(x) = max(0, 1 − x),  ψ(x) = x
• Squared hinge loss
φ(x) = max(0, 1 − x)²,  ψ(x) = x²
• Square loss
φ(x) = (1 − x)²,  ψ(x) = x²
• Exponential
φ(x) = exp(−x),  ψ(x) = 1 − √(1 − x²)
O. Bousquet – Statistical Learning Theory – Lecture 6 139
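Each of these surrogates upper-bounds the 0–1 loss, which is what makes the φ-risk L(g) a usable proxy for the risk R(g). A quick numerical check:

```python
import math

# The four surrogate losses above, as functions of the margin x = y*g(x)
def hinge(x):        return max(0.0, 1 - x)
def sq_hinge(x):     return max(0.0, 1 - x) ** 2
def square(x):       return (1 - x) ** 2
def exponential(x):  return math.exp(-x)

def zero_one(x):     return 1.0 if x <= 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    for phi in (hinge, sq_hinge, square, exponential):
        assert phi(x) >= zero_one(x)
print("all surrogates dominate the 0-1 loss")
```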
Low noise conditions
• The relationship can be improved under low noise conditions
• Under Tsybakov’s condition with exponent α and constant c,
c (R(g) − R*)^α ψ( (R(g) − R*)^(1−α) / (2c) ) ≤ L(g) − L*
• Hinge loss (no improvement)
R(g) − R* ≤ L(g) − L*
• Square loss or squared hinge loss
R(g) − R* ≤ ( 4c (L(g) − L*) )^(1/(2−α))
O. Bousquet – Statistical Learning Theory – Lecture 6 140
Estimation error
• Recall that Tsybakov’s condition implies Pf² ≤ c (Pf)^α for the relative loss class (with 0−1 loss)
• What happens for the relative loss class associated to φ ?
• Two possibilities
Strictly convex loss (can modify the metric on R)
Piecewise-linear loss
O. Bousquet – Statistical Learning Theory – Lecture 6 141
Strictly convex losses
• Noise behavior controlled by the modulus of convexity
• Result
δ( √(Pf²) / K ) ≤ Pf / 2
with K the Lipschitz constant of φ and δ the modulus of convexity of L(g) with respect to ‖f − g‖_{L₂(P)}
• Not related to the noise exponent
O. Bousquet – Statistical Learning Theory – Lecture 6 142
Piecewise linear losses
• Noise behavior related to the noise exponent
• Result for the hinge loss
Pf² ≤ C (Pf)^α
if the initial class G is uniformly bounded
O. Bousquet – Statistical Learning Theory – Lecture 6 143
Estimation error
• With a bounded and Lipschitz loss with convexity exponent γ, for a convex class G,
L(g) − L(g*) ≤ C ( (r*)^(2/γ) + (log(1/δ) + log log n) / n )
• Under Tsybakov’s condition, for the hinge loss (and general G),
Pf² ≤ C (Pf)^α
L(g) − L(g*) ≤ C ( (r*)^(1/(2−α)) + (log(1/δ) + log log n) / n )
O. Bousquet – Statistical Learning Theory – Lecture 6 144
Examples
Under Tsybakov’s condition
• Hinge loss
R(g) − R* ≤ L(g*) − L* + C ( (r*)^(1/(2−α)) + (log(1/δ) + log log n) / n )
• Squared hinge loss or square loss (δ(x) = cx², Pf² ≤ C Pf)
R(g) − R* ≤ C ( L(g*) − L* + C′ ( r* + (log(1/δ) + log log n) / n ) )^(1/(2−α))
O. Bousquet – Statistical Learning Theory – Lecture 6 145
Classiﬁcation vs Regression losses
• Consider a classiﬁcation-calibrated function φ
• It is a classiﬁcation loss if L(t) = L

• otherwise it is a regression loss
O. Bousquet – Statistical Learning Theory – Lecture 6 146
Classiﬁcation vs Regression losses
• Square, squared hinge, exponential losses
Noise enters relationship between risk and loss
Modulus of convexity enters in estimation error
• Hinge loss
Direct relationship between risk and loss
Noise enters in estimation error
⇒ Approximation term not aﬀected by noise in second case
⇒ Real value does not bring probability information in second case
O. Bousquet – Statistical Learning Theory – Lecture 6 147
Lecture 7
Regularization
• Formulation
• Capacity measures
• Applications
O. Bousquet – Statistical Learning Theory – Lecture 7 148
Equivalent problems
Up to the choice of the regularization parameters, the following
problems are equivalent
min_{f∈F} L_n(f) + λ Ω(f)
min_{f∈F : L_n(f) ≤ e} Ω(f)
min_{f∈F : Ω(f) ≤ R} L_n(f)
The solution sets are the same
O. Bousquet – Statistical Learning Theory – Lecture 7 149
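A one-dimensional sketch of this equivalence, with Ω(f) = f² and a toy quadratic empirical loss (the loss, λ, and the resulting radius R are illustrative, not from the slides):

```python
# Toy empirical loss L_n(f) = (f - 2)^2, regularizer Omega(f) = f^2.
lam = 1.0

# Penalized form: minimize (f-2)^2 + lam*f^2  =>  f = 2/(1+lam)
f_pen = 2.0 / (1.0 + lam)

# Constrained form with R = Omega(f_pen): minimize (f-2)^2 s.t. f^2 <= R.
# The unconstrained minimum f = 2 violates the constraint, so the
# solution sits on the boundary: f = sqrt(R).
R = f_pen ** 2
f_con = R ** 0.5

print(f_pen, f_con)  # both equal 1.0: same solution, different parametrization
```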
• Computationally, a variant of SRM
• A variant of model selection by penalization
⇒ one has to choose a regularizer which makes sense
• Need a class that is large enough (for universal consistency)
• but has small balls
O. Bousquet – Statistical Learning Theory – Lecture 7 150
Rates
• To obtain bounds, consider ERM on balls
• Relevant capacity is that of balls
• Real-valued functions, need a generalization of VC dimension, entropy
or covering numbers
• Involve scale sensitive capacity (takes into account the value and not
only the sign)
O. Bousquet – Statistical Learning Theory – Lecture 7 151
Scale-sensitive capacity
• Generalization of VC entropy and VC dimension to real-valued func-
tions
• Deﬁnition: a set x_1, . . . , x_n is shattered by F (at scale ε) if there exists a function s such that for all choices of α_i ∈ {−1, 1}, there exists f ∈ F with
α_i (f(x_i) − s(x_i)) ≥ ε for all i
• The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is the maximum cardinality of a shattered set
O. Bousquet – Statistical Learning Theory – Lecture 7 152
• Like VC dimension, fat-shattering dimension can be used to upper
bound covering numbers
• Result
N(F, t, n) ≤ ( C₁/t )^( C₂ vc(F, C₃ t) )
• Note that one can also deﬁne data-dependent versions (restriction on
the sample)
O. Bousquet – Statistical Learning Theory – Lecture 7 153
• Consequence of the covering number estimates
R_n(F) ≤ (C₁/√n) ∫₀^∞ √( vc(F, t) log(C₂/t) ) dt
• Gaussian averages (replace the Rademacher variables σ_i by independent N(0,1) variables g_i):
G_n(F) = E_g[ sup_{f∈F} (1/n) Σ_{i=1}^n g_i f(Z_i) ]
O. Bousquet – Statistical Learning Theory – Lecture 7 154
• Worst case average
G̃_n(F) = sup_{x_1,...,x_n} E_g[ sup_{f∈F} G_n f ]
• Associated ”dimension”: t(F, ε) = sup{ n ∈ N : G̃_n(F) ≥ ε }
• Result (Mendelson & Vershynin 2003)
vc(F, c′ε) ≤ t(F, ε) ≤ (K/ε²) vc(F, cε)
O. Bousquet – Statistical Learning Theory – Lecture 7 155
• What matters is the capacity of J (loss class)
• If φ is Lipschitz with constant M, then
R_n(F) ≤ M R_n(G)
• Relates to Rademacher average of the initial class (easier to compute)
O. Bousquet – Statistical Learning Theory – Lecture 7 156
Dualization
• Consider the problem min_{‖g‖ ≤ R} L_n(g); the relevant capacity is
E_σ[ sup_{‖g‖≤R} R_n g ]
• Duality
E_σ[ sup_{‖f‖≤R} R_n f ] = (R/n) E_σ[ ‖ Σ_{i=1}^n σ_i δ_{X_i} ‖* ]
with ‖·‖* the dual norm and δ_{X_i} the evaluation at X_i (an element of the dual under appropriate conditions)
O. Bousquet – Statistical Learning Theory – Lecture 7 157
RKHS
Given a positive deﬁnite kernel k
• Space of functions: the reproducing kernel Hilbert space associated to k
• Regularizer: the RKHS norm ‖·‖_k
• Properties: representer theorem
g_n = Σ_{i=1}^n α_i k(X_i, ·)
O. Bousquet – Statistical Learning Theory – Lecture 7 158
Shattering dimension of hyperplanes
• Set of functions
G = {g(x) = ⟨w, x⟩ : ‖w‖ = 1}
• Assume ‖x‖ ≤ R
• Result
vc(G, ρ) ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 7 159
Proof Strategy (Gurvits, 1997)
Assume that x_1, . . . , x_r are ρ-shattered by hyperplanes with ‖w‖ = 1,
i.e., for all y_1, . . . , y_r ∈ {±1}, there exists a w such that
y_i ⟨w, x_i⟩ ≥ ρ for all i = 1, . . . , r. (2)
Two steps:
• prove that the more points we want to shatter (2), the larger ‖Σ_{i=1}^r y_i x_i‖ must be
• upper bound the size of ‖Σ_{i=1}^r y_i x_i‖ in terms of R
Combining the two tells us how many points we can shatter at most
O. Bousquet – Statistical Learning Theory – Lecture 7 160
Part I
• Summing (2) yields ⟨w, Σ_{i=1}^r y_i x_i⟩ ≥ rρ
• By the Cauchy-Schwarz inequality,
⟨w, Σ_{i=1}^r y_i x_i⟩ ≤ ‖w‖ ‖Σ_{i=1}^r y_i x_i‖ = ‖Σ_{i=1}^r y_i x_i‖
• Combine both:
rρ ≤ ‖Σ_{i=1}^r y_i x_i‖. (3)
O. Bousquet – Statistical Learning Theory – Lecture 7 161
Part II
Consider random labels y_i, independent and uniform over {±1}. Since E[y_i y_j] = 0 for i ≠ j, the cross terms vanish in expectation:
E[ ‖Σ_{i=1}^r y_i x_i‖² ] = E[ Σ_{i,j=1}^r y_i y_j ⟨x_i, x_j⟩ ]
= Σ_{i=1}^r E[⟨x_i, x_i⟩] + E[ Σ_{i≠j} y_i y_j ⟨x_i, x_j⟩ ]
= Σ_{i=1}^r ‖x_i‖²
O. Bousquet – Statistical Learning Theory – Lecture 7 162
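The identity E‖Σ y_i x_i‖² = Σ‖x_i‖² can be verified by brute force over all 2^r sign patterns (toy 2-D vectors chosen for illustration):

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

xs = [(1.0, 0.0), (0.5, 0.5), (-0.3, 0.8)]
r = len(xs)

# Average ||sum_i y_i x_i||^2 over all sign patterns: cross terms cancel.
total = 0.0
for ys in product([-1, 1], repeat=r):
    s = [sum(y * x[k] for y, x in zip(ys, xs)) for k in range(2)]
    total += dot(s, s)
average = total / 2 ** r

assert abs(average - sum(dot(x, x) for x in xs)) < 1e-12
print(average)  # equals sum of squared norms of the x_i
```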
Part II, ctd.
• Since ‖x_i‖ ≤ R, we get E[ ‖Σ_{i=1}^r y_i x_i‖² ] ≤ rR².
• This holds for the expectation over the random choices of the labels,
hence there must be at least one set of labels for which it also holds
true. Use this set.
• Hence
‖Σ_{i=1}^r y_i x_i‖² ≤ rR².
O. Bousquet – Statistical Learning Theory – Lecture 7 163
Part I and II Combined
• Part I: (rρ)² ≤ ‖Σ_{i=1}^r y_i x_i‖²
• Part II: ‖Σ_{i=1}^r y_i x_i‖² ≤ rR²
• Hence
r²ρ² ≤ rR²,  i.e.,  r ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 7 164
Boosting
Given a class H of functions
• Space of functions: linear span of H
• Regularizer: 1-norm of the weights, ‖g‖₁ = inf{ Σ_i |α_i| : g = Σ_i α_i h_i }
• Properties: weight concentrated on the (weighted) margin maximizers
g_n = Σ_h w_h h,  where each such h satisﬁes Σ_i d_i Y_i h(X_i) = min_{h′∈H} Σ_i d_i Y_i h′(X_i)
O. Bousquet – Statistical Learning Theory – Lecture 7 165
• Function class of interest
G_R = {g ∈ span H : ‖g‖₁ ≤ R}
• Result
R_n(G_R) = R · R_n(H)
⇒ Capacity (as measured by global Rademacher averages) is not aﬀected by taking linear combinations !
O. Bousquet – Statistical Learning Theory – Lecture 7 166
Lecture 8
SVM
• Computational aspects
• Capacity Control
• Universality
• Special case of RBF kernel
O. Bousquet – Statistical Learning Theory – Lecture 8 167
Formulation (1)
• Soft margin
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
s.t.  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Convex objective function and convex constraints
• Unique solution
• Eﬃcient procedures to ﬁnd it
→ Is it the right criterion ?
O. Bousquet – Statistical Learning Theory – Lecture 8 168
Formulation (2)
• Soft margin
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
s.t.  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
• Optimal value of ξ_i:
ξ*_i = max(0, 1 − y_i (⟨w, x_i⟩ + b))
• Substitute above to get
min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − y_i (⟨w, x_i⟩ + b))
O. Bousquet – Statistical Learning Theory – Lecture 8 169
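The unconstrained form can be minimized directly by subgradient descent. A minimal sketch on toy separable data (the data, step size, and C are made up; this is not a production SVM solver):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm(X, Y, C=1.0, lr=0.05, epochs=500):
    """Subgradient descent on the unconstrained soft-margin objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(<w, x_i> + b))."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0            # gradient of the 0.5*||w||^2 term
        for x, y in zip(X, Y):
            if y * (dot(w, x) + b) < 1:  # hinge active: margin violated
                gw = [g - C * y * xj for g, xj in zip(gw, x)]
                gb -= C * y
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy separable data, for illustration only
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
Y = [1, 1, -1, -1]
w, b = train_svm(X, Y)
preds = [1 if dot(w, x) + b > 0 else -1 for x in X]
print(preds)  # [1, 1, -1, -1]
```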
Regularization
General form of regularization problem
min_{f∈F} (1/m) Σ_{i=1}^m c(y_i f(x_i)) + λ‖f‖²
→ Capacity control by regularization with a convex cost
[Plot: a convex cost c as a function of the margin, upper-bounding the 0–1 loss]
O. Bousquet – Statistical Learning Theory – Lecture 8 170
Loss Function
φ(Y f(X)) = max(0, 1 −Y f(X))
• Convex, non-increasing, upper bounds 1_{Y f(X) ≤ 0}
• Classiﬁcation-calibrated
• Classiﬁcation type (L* = L(t))
R(g) − R* ≤ L(g) − L*
O. Bousquet – Statistical Learning Theory – Lecture 8 171
Regularization
Choosing a kernel corresponds to
• Choose a sequence (a_k)
• Set
‖f‖² := Σ_{k≥0} a_k ∫ |f^(k)|² dx
⇒ penalization of high-order derivatives (high frequencies)
⇒ enforces smoothness of the solution
O. Bousquet – Statistical Learning Theory – Lecture 8 172
Capacity: VC dimension
• The VC dimension of the set of hyperplanes in R^d is d + 1.
Dimension of the feature space ? ∞ for the RBF kernel
• w is chosen in the span of the data (w = Σ α_i y_i x_i)
The span of the data has dimension m for the RBF kernel (the k(·, x_i) are linearly independent)
• The VC bound √(h/m) = 1 does not give any information
⇒ Need to take the margin into account
O. Bousquet – Statistical Learning Theory – Lecture 8 173
Capacity: Shattering dimension
Hyperplanes with Margin
If ‖x‖ ≤ R,
vc(hyperplanes with margin ρ, 1) ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 8 174
Margin
• The shattering dimension is related to the margin
• Maximizing the margin means minimizing the shattering dimension
• Small shattering dimension ⇒ good control of the risk
⇒ this control is automatic (no need to choose the margin beforehand)
⇒ but requires tuning of regularization parameter
O. Bousquet – Statistical Learning Theory – Lecture 8 175
• Consider hyperplanes with ‖w‖ ≤ M:
(M/(√2 n)) √( Σ_{i=1}^n k(x_i, x_i) ) ≤ R_n ≤ (M/n) √( Σ_{i=1}^n k(x_i, x_i) )
• Involves the trace of the Gram matrix
• Notice that with k(x_i, x_i) ≤ R² and M = 1/ρ this gives R_n ≤ √( R²/(n ρ²) )
O. Bousquet – Statistical Learning Theory – Lecture 8 176
E_σ[ sup_{‖w‖≤M} (1/n) Σ_{i=1}^n σ_i ⟨w, δ_{x_i}⟩ ]
= E_σ[ sup_{‖w‖≤M} ⟨w, (1/n) Σ_{i=1}^n σ_i δ_{x_i}⟩ ]
≤ E_σ[ sup_{‖w‖≤M} ‖w‖ ‖ (1/n) Σ_{i=1}^n σ_i δ_{x_i} ‖ ]
= (M/n) E_σ[ √( ⟨ Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i} ⟩ ) ]
O. Bousquet – Statistical Learning Theory – Lecture 8 177
(M/n) E_σ[ √( ⟨ Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i} ⟩ ) ]
≤ (M/n) √( E_σ[ ⟨ Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i} ⟩ ] )
= (M/n) √( E_σ[ Σ_{i,j} σ_i σ_j ⟨δ_{x_i}, δ_{x_j}⟩ ] )
= (M/n) √( Σ_{i=1}^n k(x_i, x_i) )
O. Bousquet – Statistical Learning Theory – Lecture 8 178
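The Jensen step above can be checked numerically: a Monte Carlo estimate of (M/n) E_σ √(σᵀKσ) stays below the trace bound (M/n) √(tr K). Random toy data and an RBF kernel are assumed here for illustration:

```python
import math, random

def rbf(x, y, sigma=2.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

random.seed(0)
n, M = 30, 1.0
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
K = [[rbf(u, v) for v in xs] for u in xs]

trace_bound = (M / n) * math.sqrt(sum(K[i][i] for i in range(n)))

# Monte Carlo estimate of (M/n) E_sigma sqrt(sigma' K sigma)
trials, total = 500, 0.0
for _ in range(trials):
    s = [random.choice([-1, 1]) for _ in range(n)]
    q = sum(s[i] * s[j] * K[i][j] for i in range(n) for j in range(n))
    total += math.sqrt(max(q, 0.0))
estimate = (M / n) * total / trials

# Jensen: E sqrt(Q) <= sqrt(E Q), so the estimate sits below the trace bound
print(round(estimate, 3), round(trace_bound, 3))
```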
Improved rates – Noise condition
• Under Massart’s condition (|η| > η₀), with ‖g‖_∞ ≤ M,
E[ (φ(Y g(X)) − φ(Y t(X)))² ] ≤ (M − 1 + 2/η₀)(L(g) − L*) .
→ If the noise is nice, the variance is linearly related to the expectation
→ Estimation error of order r* (of the class G)
O. Bousquet – Statistical Learning Theory – Lecture 8 179
Improved rates – Capacity (1)
• r*_n is related to the decay of the eigenvalues of the Gram matrix:
r*_n ≤ (c/n) min_{d∈N} ( d + √( Σ_{j>d} λ_j ) )
• Note that d = 0 gives the trace bound
• r*_n is always better than the trace bound (with equality when the λ_i are constant)
O. Bousquet – Statistical Learning Theory – Lecture 8 180
Improved rates – Capacity (2)
Example: exponential decay of the eigenvalues
• λ_i = e^(−αi): the trace bound only gives a rate of 1/√n
• but r*_n is of order (log n)/n
O. Bousquet – Statistical Learning Theory – Lecture 8 181
Exponent of the margin
• The estimation error analysis shows that in G_M = {g : ‖g‖ ≤ M},
R(g_n) − R(g*) ≤ M ...
• The wrong power of M (an M² penalty) is used in the algorithm
• Computationally easier
• But this does not give λ a dimension-free status
• Using M could improve the cutoﬀ detection
O. Bousquet – Statistical Learning Theory – Lecture 8 182
Kernel
Why is it good to use kernels ?
• Gaussian kernel (RBF)
k(x, y) = e^( −‖x−y‖² / (2σ²) )
• σ is the width of the kernel
→ What is the geometry of the feature space ?
O. Bousquet – Statistical Learning Theory – Lecture 8 183
RBF
Geometry
• Norms
‖Φ(x)‖² = ⟨Φ(x), Φ(x)⟩ = e⁰ = 1
• Angles
cos(∠(Φ(x), Φ(y))) = ⟨ Φ(x)/‖Φ(x)‖, Φ(y)/‖Φ(y)‖ ⟩ = e^( −‖x−y‖²/(2σ²) ) ≥ 0
→ Angles are less than 90 degrees
• Φ(x) = k(x, ·) ≥ 0
O. Bousquet – Statistical Learning Theory – Lecture 8 184
RBF
O. Bousquet – Statistical Learning Theory – Lecture 8 185
RBF
Diﬀerential Geometry
• Flat Riemannian metric
→ ’distance’ along the sphere is equal to distance in input space
• Distances are contracted
→ ’shortcuts’ by getting outside the sphere
O. Bousquet – Statistical Learning Theory – Lecture 8 186
RBF
Geometry of the span
[Figure: ellipsoid x²/a² + y²/b² = 1 with semi-axes a and b]
• K = (k(x_i, x_j)) Gram matrix
• Eigenvalues λ_1, . . . , λ_m
• Data points are mapped to an ellipsoid with axis lengths √λ_1, . . . , √λ_m
O. Bousquet – Statistical Learning Theory – Lecture 8 187
RBF
Universality
• Consider the set of functions
H = span{ k(x, ·) : x ∈ X }
• H is dense in C(X)
→ Any continuous function can be approximated (in the ‖·‖_∞ norm) by functions in H
⇒ with enough data one can construct any function
O. Bousquet – Statistical Learning Theory – Lecture 8 188
RBF
Eigenvalues
• Exponentially decreasing
• Fourier domain: exponential penalization of derivatives
• Enforces smoothness with respect to the Lebesgue measure in input
space
O. Bousquet – Statistical Learning Theory – Lecture 8 189
RBF
Induced Distance and Flexibility
• σ → 0
1-nearest neighbor in input space
Each point in a separate dimension, everything orthogonal
• σ → ∞
linear classiﬁer in input space
All points very close on the sphere, initial geometry
• Tuning σ allows one to try all possible intermediate combinations
O. Bousquet – Statistical Learning Theory – Lecture 8 190
RBF
Ideas
• Works well if the Euclidean distance is good
• Works well if decision boundary is smooth
• Universal
O. Bousquet – Statistical Learning Theory – Lecture 8 191
Choosing the Kernel
• Major issue of current research
• Prior knowledge (e.g. invariances, distance)
• Cross-validation (limited to 1-2 parameters)
• Bound (better with convex class)
⇒ Lots of open questions...
O. Bousquet – Statistical Learning Theory – Lecture 8 192
Learning Theory: some informal thoughts
• Need assumptions/restrictions to learn
• Data cannot replace knowledge
• No universal learning (simplicity measure)
• SVM work because of capacity control
• Choice of kernel = choice of prior/ regularizer
• RBF works well if Euclidean distance meaningful
• Knowledge improves performance (e.g. invariances)
O. Bousquet – Statistical Learning Theory – Lecture 8 193
• Lecture 1: Introduction • Part I: Binary Classiﬁcation
Lecture 2: Basic bounds Lecture 3: VC theory Lecture 4: Capacity measures Lecture 5: Advanced topics

O. Bousquet – Statistical Learning Theory

1

• Part II: Real-Valued Classiﬁcation

Lecture 6: Margin and loss functions Lecture 7: Regularization Lecture 8: SVM
O. Bousquet – Statistical Learning Theory 2

Lecture 1 The Learning Problem • Context • Formalization • Approximation/Estimation trade-oﬀ • Algorithms and Bounds O. Bousquet – Statistical Learning Theory – Lecture 1 3 .

Bousquet – Statistical Learning Theory – Lecture 1 4 . Construct a model of the phenomenon 3. Observe a phenomenon 2. Make predictions ⇒ This is more or less the deﬁnition of natural sciences ! ⇒ The goal of Machine Learning is to automate this process ⇒ The goal of Learning Theory is to formalize it. O.Learning and Inference The inductive inference process: 1.

Bousquet – Statistical Learning Theory – Lecture 1 5 . label) • Label is +1 or −1 • Algorithm constructs a function (instance → label) • Goal: make few mistakes on future unseen instances O.Pattern recognition We consider here the supervised learning framework for pattern recognition: • Data consists of pairs (instance.

Bousquet – Statistical Learning Theory – Lecture 1 6 .Approximation/Interpolation It is always possible to build a function that ﬁts exactly the data.5 0 0 0.5 1 1.5 But is it reasonable ? O. 1.5 1 0.

Bousquet – Statistical Learning Theory – Lecture 1 7 ...Occam’s Razor Idea: look for regularities in the observed phenomenon These can ge generalized from the observed past to the future ⇒ choose the simplest consistent model How to measure simplicity ? • • • • Physics: number of constants Description length Number of parameters . O.

No Free Lunch
• No Free Lunch if there is no assumption on how the past is related to the future, prediction is impossible if there is no restriction on the possible phenomena, generalization is impossible • • • •
We need to make assumptions Simplicity is not absolute Data will never replace knowledge Generalization = data + knowledge

O. Bousquet – Statistical Learning Theory – Lecture 1

8

Assumptions
Two types of assumptions

• Future observations related to past ones → Stationarity of the phenomenon

• Constraints on the phenomenon → Notion of simplicity

O. Bousquet – Statistical Learning Theory – Lecture 1

9

Goals
⇒ How can we make predictions from the past ? what are the assumptions ? • Give a formal deﬁnition of learning, generalization, overﬁtting • Characterize the performance of learning algorithms • Design better algorithms

O. Bousquet – Statistical Learning Theory – Lecture 1

10

Bousquet – Statistical Learning Theory – Lecture 1 11 .Probabilistic Model Relationship between past and future observations ⇒ Sampled independently from the same distribution • Independence: each new observation yields maximum information • Identical distribution: the observations give information about the underlying phenomenon (here a probability distribution) O.

1}. Here: classiﬁcation case Y = {−1.i. Bousquet – Statistical Learning Theory – Lecture 1 12 . Y ) ∈ X × Y are distributed according to P (unknown). Goal: construct a function g : X → Y which predicts Y from X . Data: We observe a sequence of n i. O. Yi) sampled according to P .Probabilistic Model We consider an input space X and output space Y . pairs (Xi. Assumption: The pairs (X.d.

Probabilistic Model Criterion to choose our function: Low probability of error P (g(X) = Y ). Bousquet – Statistical Learning Theory – Lecture 1 13 . Risk R(g) = P (g(X) = Y ) = E 1[g(X)=Y ] • P is unknown so that we cannot directly measure the risk • Can only measure the agreement on the data • Empirical Risk n 1 Rn(g) = 1[g(Xi)=Yi] n i=1 O.

Bousquet – Statistical Learning Theory – Lecture 1 14 . 1}) • in general. 1−P [Y = 1|X = x]) = (1 − η(x))/2 is the noise level O.Target function • P can be decomposed as PX × P (Y |X) • η(x) = E [Y |X = x] = 2P [Y = 1|X = x] − 1 is the regression function • t(x) = sgn η(x) is the target function • in the deterministic case Y = t(X) (P [Y = 1|X] ∈ {0. n(x) = min(P [Y = 1|X = x] .

Assumptions about P Need assumptions about P . a priori probability distribution on possible functions) • Restriction (set of possible functions) Treating lack of knowledge • Bayesian approach: uniform distribution • Learning Theory approach: worst case analysis O. if t(x) is totally chaotic.g. there is no possible generalization from ﬁnite data. Assumptions can be • Preference (e. Indeed. Bousquet – Statistical Learning Theory – Lecture 1 15 .

5 0 0 0.Approximation/Interpolation (again) How to trade-oﬀ knowledge and data ? 1.5 1 1. Bousquet – Statistical Learning Theory – Lecture 1 16 .5 O.5 1 0.

Bousquet – Statistical Learning Theory – Lecture 1 17 . O.Overﬁtting/Underﬁtting The data can mislead you. • Underﬁtting model too small to ﬁt the data • Overﬁtting artiﬁcially good agreement with the data No way to detect them from the data ! Need extra validation data.

Bousquet – Statistical Learning Theory – Lecture 1 18 .Empirical Risk Minimization • Choose a model G (set of possible functions) • Minimize the empirical risk in the model min Rn(g) g∈G What if the Bayes classiﬁer is not in the model ? O.

Approximation/Estimation • Bayes risk R = inf R(g) . or Bayes classiﬁer). depends on the data).e. Bousquet – Statistical Learning Theory – Lecture 1 19 . O. g ∗ Best risk a deterministic function can have (risk of the target function. • Decomposition: R(g ∗) = inf g∈G R(g) R(gn) − R = R(g) − R ∗ ∗ + R(gn) − R(g ) Estimation ∗ Approximation • Only the estimation error is random (i.

Structural Risk Minimization • Choose a collection of models {Gd : d = 1. . . n) d g∈Gd pen(d. .} • Minimize the empirical risk in each model • Minimize the penalized empirical risk min min Rn(g) + pen(d. 2. n) measures the size or capacity of the model O. Bousquet – Statistical Learning Theory – Lecture 1 20 . n) gives preference to models where estimation error is small pen(d.

Bousquet – Statistical Learning Theory – Lecture 1 21 . Most methods can be thought of as regularization methods. O.Regularization • Choose a large model G (possibly dense) • Choose a regularizer g • Minimize the regularized empirical risk min Rn(g) + λ g g∈G 2 • Choose an optimal trade-oﬀ λ (regularization parameter).

Bousquet – Statistical Learning Theory – Lecture 1 22 . ⇒ need probabilistic bounds O. .Bounds (1) A learning algorithm • Takes as input the data (X1. Yn) • Produces a function gn Can we estimate the risk of gn ? ⇒ random quantity (depends on the data). Y1). . . . (Xn.

Bousquet – Statistical Learning Theory – Lecture 1 23 .Bounds (2) • Error bounds R(gn) ≤ Rn(gn) + B ⇒ Estimation from an empirical quantity • Relative error bounds Best in a class R(gn) ≤ R(g ) + B Bayes risk ∗ R(gn) ≤ R + B ⇒ Theoretical guarantees ∗ O.

Lecture 2 Basic Bounds • Probability tools • Relationship with empirical processes • Law of large numbers • Union bound • Relative error bounds O. Bousquet – Statistical Learning Theory – Lecture 2 24 .

Bousquet – Statistical Learning Theory – Lecture 2 25 .Probability Tools (1) Basic facts • Union: P [A or B] ≤ P [A] + P [B] • Inclusion: If A ⇒ B . • Inversion: If P [X ≥ t] ≤ F (t) then with probability at least 1 − δ . E [X] = ∞ 0 P [X ≥ t] dt. • Expectation: If X ≥ 0. O. X ≤ F −1(δ). then P [A] ≤ P [B].

P [|X − E [X] | ≥ t] ≤ E[X] t Var[X] t2 • Chernoﬀ: for all t ∈ R. Bousquet – Statistical Learning Theory – Lecture 2 26 . f (E [X]) ≤ E [f (X)] • Markov: If X ≥ 0 then for all t > 0. P [X ≥ t] ≤ inf λ≥0 E eλ(X−t) O. P [X ≥ t] ≤ • Chebyshev: for t > 0.Probability Tools (2) Basic inequalities • Jensen: for f convex.

. • Cannot be observed (P is unknown) • Random (depends on the data) ⇒ we want to bound P [R(gn) − Rn(gn) > ε] O.Error bounds Recall that we want to bound R(gn) = E 1[gn(X)=Y ] where gn has been constructed from (X1. . Yn). Bousquet – Statistical Learning Theory – Lecture 2 27 . . (Xn. Y1). .

Given G deﬁne the loss class F = {f : (x. Y )] and Pnf = Quantity of interest: 1 n n i=1 f (Xi. y) → 1[g(x)=y] : g ∈ G} Denote P f = E [f (X. Bousquet – Statistical Learning Theory – Lecture 2 28 . Y ). Yi) and Z = (X. let Zi = (Xi. Yi) P f − Pn f We will go back and forth between F and G (bijection) O.Loss class For convenience.

Empirical process Empirical process: {P f − Pnf }f ∈F • Process = collection of random variables (here indexed by functions in F ) • Empirical = distribution of each random variable ⇒ Many techniques exist to control the supremum sup P f − Pnf f ∈F O. Bousquet – Statistical Learning Theory – Lecture 2 29 .

Bousquet – Statistical Learning Theory – Lecture 2 30 . f (Z) Law of large numbers n 1 P lim n→∞ n ⇒ can we quantify it ? n f (Zi) − E [f (Z)] = 0 i=1 = 1. O.v.The Law of Large Numbers 1 R(g) − Rn(g) = E [f (Z)] − f (Zi) n i=1 → diﬀerence between the expectation and the empirical average of the r.

⇒ Let’s rewrite it to better understand O. Zn be n i.i.d. Assumes bounded random variables Theorem 1. . . b]. Then for all ε > 0. If f (Z) ∈ [a. Let Z1. . we have P 1 n n f (Zi) − E [f (Z)] > ε i=1 ≤ 2 exp 2nε2 − (b − a)2 . random variables.Hoeﬀding’s Inequality Quantitative version of law of large numbers. . Bousquet – Statistical Learning Theory – Lecture 2 31 .

Hoeffding's Inequality

Write δ = 2 exp( −2nε² / (b − a)² ). Then
P[ |Pnf − Pf| > (b − a) √( log(2/δ) / (2n) ) ] ≤ δ

or [Inversion], with probability at least 1 − δ,
|Pnf − Pf| ≤ (b − a) √( log(2/δ) / (2n) )
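The inversion step is a mechanical computation; this sketch (helper name hypothetical) turns a confidence level δ into the deviation ε and verifies that plugging ε back into Hoeffding's bound recovers δ exactly:

```python
import math

# Inversion: from confidence delta to deviation eps, for f(Z) in [a, b].
def hoeffding_eps(n, delta, a=0.0, b=1.0):
    return (b - a) * math.sqrt(math.log(2 / delta) / (2 * n))

n, delta = 1000, 0.05
eps = hoeffding_eps(n, delta)
# Substituting eps back into 2 exp(-2 n eps^2) must give delta again.
recovered = 2 * math.exp(-2 * n * eps ** 2)

assert abs(recovered - delta) < 1e-12
```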

Hoeffding's Inequality

Let's apply it to f(Z) = 1[g(X)≠Y]. For any g and any δ > 0, with probability at least 1 − δ,
R(g) ≤ Rn(g) + √( log(2/δ) / (2n) )     (1)

Notice that one has to consider a fixed function f: the probability is with respect to the sampling of the data.

If the function depends on the data, this does not apply!

Limitations

• For each fixed function f ∈ F, there is a set S of samples, with P[S] ≥ 1 − δ, for which
Pf − Pnf ≤ √( log(2/δ) / (2n) )
• These sets may be different for different functions
• The function chosen by the algorithm depends on the sample
⇒ For the observed sample, only some of the functions in F will satisfy this inequality!

Limitations

What we need to bound is Pfn − Pnfn, where fn is the function chosen by the algorithm based on the data.

For any fixed sample, there exists a function f such that Pf − Pnf = 1: take the classifier that predicts Yi on each Xi and the wrong label everywhere else; then Pnf = 0 while Pf = 1.

This does not contradict Hoeffding, but shows it is not enough.

Limitations

[Figure: risk R[f] and empirical risk Remp[f] plotted over the function class; the empirical minimizer fm differs from the optimum fopt, and it is the gap between R and Remp at fm that matters.]

Hoeffding's inequality quantifies deviations for a fixed function.

Uniform Deviations

Before seeing the data, we do not know which function the algorithm will choose. The trick is to consider uniform deviations:
R(fn) − Rn(fn) ≤ sup_{f∈F} (R(f) − Rn(f))

We need a bound which holds simultaneously for all functions in a class.

Union Bound

Consider two functions f1, f2 and define
Ci = {(x1, y1), ..., (xn, yn) : Pfi − Pnfi > ε}

From Hoeffding's inequality, for each i,
P[Ci] ≤ δ

We want to bound the probability of being 'bad' for i = 1 or i = 2:
P[C1 ∪ C2] ≤ P[C1] + P[C2]

Finite Case

More generally,
P[C1 ∪ ... ∪ CN] ≤ Σ_{i=1}^N P[Ci]

We have
P[∃f ∈ {f1, ..., fN} : Pf − Pnf > ε] ≤ Σ_{i=1}^N P[Pfi − Pnfi > ε] ≤ N exp(−2nε²)

Finite Case

We obtain, for G = {g1, ..., gN}: for all δ > 0, with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √( (log N + log(1/δ)) / (2n) )

This is a generalization bound!

Coding interpretation: log N is the number of bits needed to specify a function in F.
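The shape of this bound is easy to explore numerically. The sketch below (not from the lectures) shows the two behaviors of the capacity term √((log N + log(1/δ))/(2n)): logarithmic in the class size N, and 1/√n in the sample size:

```python
import math

# Capacity term of the finite-class generalization bound.
def finite_class_term(n, N, delta):
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

# Doubling the class size barely moves the bound ...
small = finite_class_term(n=1000, N=100, delta=0.05)
large = finite_class_term(n=1000, N=200, delta=0.05)
# ... while quadrupling the sample exactly halves it.
more_data = finite_class_term(n=4000, N=100, delta=0.05)

assert large - small < 0.01
assert abs(more_data - small / 2) < 1e-12
```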

Approximation/Estimation

Let g* = arg min_{g∈G} R(g)

If gn minimizes the empirical risk in G, then Rn(g*) − Rn(gn) ≥ 0. Thus
R(gn) = R(gn) − R(g*) + R(g*)
      ≤ Rn(g*) − Rn(gn) + R(gn) − R(g*) + R(g*)
      ≤ 2 sup_{g∈G} |R(g) − Rn(g)| + R(g*)

Approximation/Estimation

We obtain: with probability at least 1 − δ,
R(gn) ≤ R(g*) + 2 √( (log N + log(2/δ)) / (2n) )

The first term decreases if N increases; the second term increases.

The size of G controls the trade-off.

Summary (1)

• Inference requires assumptions
  - Data sampled i.i.d. from P
  - Restrict the possible functions to G

Choose a sequence of models Gm to have more flexibility/control

Summary (2)

• Bounds are valid w.r.t. repeated sampling
  - For a fixed function g, for most of the samples,
    R(g) − Rn(g) ≈ 1/√n
  - For most of the samples, if |G| = N,
    sup_{g∈G} R(g) − Rn(g) ≈ √(log N / n)

⇒ Extra variability because the chosen gn changes with the data

Improvements

We obtained
sup_{g∈G} R(g) − Rn(g) ≤ √( (log N + log(2/δ)) / (2n) )

To be improved:
• Hoeffding only uses boundedness, not the variance
• The union bound is as bad as if the functions were independent
• The supremum is not what the algorithm chooses

Next we improve the union bound and extend it to the infinite case.

Refined union bound (1)

For each f ∈ F,
P[ Pf − Pnf > √( log(1/δ(f)) / (2n) ) ] ≤ δ(f)

P[ ∃f ∈ F : Pf − Pnf > √( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f∈F} δ(f)

Choose δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1

Refined union bound (2)

With probability at least 1 − δ,
∀f ∈ F, Pf ≤ Pnf + √( (log(1/p(f)) + log(1/δ)) / (2n) )

• Applies to countably infinite F
• Can put knowledge about the algorithm into p(f)
• But p must be chosen before seeing the data

Refined union bound (3)

• Good p means good bound. The bound can be improved if you know ahead of time the chosen function (knowledge improves the bound)
• In the infinite case, how to choose the p (since it implies an ordering)?
• The trick is to look at F through the data

Lecture 3

Infinite Case: Vapnik-Chervonenkis Theory

• Growth function
• Vapnik-Chervonenkis dimension
• Proof of the VC bound
• VC entropy
• SRM

Infinite Case

Measure of the size of an infinite class?
• Consider
F_{z1,...,zn} = {(f(z1), ..., f(zn)) : f ∈ F}
The size of this set is the number of possible ways in which the data (z1, ..., zn) can be classified.
• Growth function:
S_F(n) = sup_{(z1,...,zn)} |F_{z1,...,zn}|
• Note that S_F(n) = S_G(n)

Infinite Case

• Result (Vapnik-Chervonenkis): with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √( 8 (log S_G(2n) + log(4/δ)) / n )
• Always better than N in the finite case
• How to compute S_G(n) in general?
⇒ use the VC dimension

VC Dimension

Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2ⁿ.

If S_G(n) = 2ⁿ, the class of functions can generate any classification on n points (shattering).

Definition 2. The VC-dimension of G is the largest n such that S_G(n) = 2ⁿ.
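Shattering can be checked by brute force on small examples. The sketch below (an illustration, not from the lectures) does this for threshold classifiers g_t(x) = sign(x − t) on the real line, whose VC-dimension is 1:

```python
from itertools import product

# A point set is shattered if every dichotomy in {-1,+1}^n is realised
# by some threshold classifier g_t(x) = sign(x - t).
def shattered(points, thresholds):
    realised = {
        tuple(1 if x > t else -1 for x in points) for t in thresholds
    }
    return all(lab in realised for lab in product([-1, 1], repeat=len(points)))

# Enough cut points to realise every pattern thresholds can produce on {0..4}.
thresholds = [x + 0.5 for x in range(-1, 5)]

assert shattered([2.0], thresholds)           # one point: both labels reachable
assert not shattered([1.0, 3.0], thresholds)  # (+1, -1) is never realised
# => the VC-dimension of thresholds on the line is 1
```

Two points cannot be shattered because a threshold always labels the left point −1 whenever it labels the right point −1, so the dichotomy (+1, −1) is impossible.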

VC Dimension: Hyperplanes

In R^d, VC(hyperplanes) = d + 1

VC Dimension: Number of Parameters

Is the VC dimension equal to the number of parameters?
• One parameter: {sgn(sin(tx)) : t ∈ R}
• Infinite VC dimension!

VC Dimension

• We want to know S_G(n), but we only know S_G(n) = 2ⁿ for n ≤ h.
• What happens for n ≥ h?

[Figure: log S(n) grows linearly in n up to n = h, then bends over.]

Vapnik-Chervonenkis-Sauer-Shelah Lemma

Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
S_G(n) ≤ Σ_{i=0}^h (n choose i)
and for all n ≥ h,
S_G(n) ≤ (en/h)^h

⇒ phase transition
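The two estimates of the lemma can be verified numerically for a range of (h, n) pairs; this sketch (not from the lectures) checks the binomial sum against the (en/h)^h bound and against the trivial 2ⁿ:

```python
import math

# Sauer-Shelah: for n >= h, sum_{i=0}^{h} C(n, i) <= (e n / h)^h.
def sauer_sum(n, h):
    return sum(math.comb(n, i) for i in range(h + 1))

for h in [1, 2, 5, 10]:
    for n in [h, 2 * h, 10 * h]:
        assert sauer_sum(n, h) <= (math.e * n / h) ** h
        assert sauer_sum(n, h) <= 2 ** n  # at most all dichotomies
```

For n well beyond h the polynomial (en/h)^h is vastly smaller than 2ⁿ — this is the "phase transition" the slide refers to.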

VC Bound

Let G be a class with VC dimension h. With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √( 8 (h log(2en/h) + log(4/δ)) / n )

So the error is of order
√( h log n / n )

Interpretation

VC dimension: a measure of effective dimension
• Depends on the geometry of the class
• Gives a natural definition of simplicity (by quantifying the potential overfitting)
• Not related to the number of parameters
• Finiteness guarantees learnability under any distribution

Symmetrization (lemma)

Key ingredient in VC bounds: symmetrization.

Let Z'1, ..., Z'n be an independent (ghost) sample and P'n the corresponding empirical measure.

Lemma 4. For any t > 0 such that nt² ≥ 2,
P[ sup_{f∈F} (P − Pn)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P'n − Pn)f ≥ t/2 ]

Symmetrization (proof – 1)

Let fn be the function achieving the supremum (it depends on Z1, ..., Zn).

1[(P−Pn)fn>t] 1[(P−P'n)fn<t/2] = 1[(P−Pn)fn>t ∧ (P−P'n)fn<t/2] ≤ 1[(P'n−Pn)fn>t/2]

Taking expectations with respect to the second sample gives
1[(P−Pn)fn>t] P[(P − P'n)fn < t/2] ≤ P[(P'n − Pn)fn > t/2]

Symmetrization (proof – 2)

• By Chebyshev's inequality,
P[(P − P'n)fn ≥ t/2] ≤ 4 Var[fn] / (nt²) ≤ 1/(nt²)
• Hence
1[(P−Pn)fn>t] (1 − 1/(nt²)) ≤ P[(P'n − Pn)fn > t/2]
Take the expectation with respect to the first sample.

Proof of VC bound (1)

• Symmetrization allows us to replace the expectation by the average on the ghost sample
• Function class projected on the double sample: F_{Z1,...,Zn,Z'1,...,Z'n}
• Union bound on F_{Z1,...,Zn,Z'1,...,Z'n}
• Variant of Hoeffding's inequality:
P[ P'nf − Pnf > t ] ≤ 2 e^{−nt²/2}

Proof of VC bound (2)

P[ sup_{f∈F} (P − Pn)f ≥ t ]
  ≤ 2 P[ sup_{f∈F} (P'n − Pn)f ≥ t/2 ]
  = 2 P[ sup_{f∈F_{Z1,...,Zn,Z'1,...,Z'n}} (P'n − Pn)f ≥ t/2 ]
  ≤ 2 S_F(2n) P[ (P'n − Pn)f ≥ t/2 ]
  ≤ 4 S_F(2n) e^{−nt²/8}

VC Entropy (1)

• The VC dimension is distribution independent
⇒ The same bound holds for any distribution
⇒ It is loose for most distributions
• A similar proof can give a distribution-dependent result

VC Entropy (2)

• Denote the size of the projection by N(F, z1, ..., zn) := #F_{z1,...,zn}
• The VC entropy is defined as
H_F(n) = log E[N(F, Z1, ..., Zn)]
• VC entropy bound: with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √( 8 (H_G(2n) + log(2/δ)) / n )

VC Entropy (proof)

Introduce Rademacher variables σi ∈ {−1, 1} (each value with probability 1/2). Then

2 P[ sup_{f∈F_{Z,Z'}} (P'n − Pn)f ≥ t/2 ]
  ≤ 2 E[ Pσ[ sup_{f∈F_{Z,Z'}} (1/n) Σ_{i=1}^n σi(f(Z'i) − f(Zi)) ≥ t/2 ] ]
  ≤ 2 E[N(F, Z, Z')] e^{−nt²/8}

From Bounds to Algorithms

• For any distribution, H_G(n)/n → 0 ensures consistency of the empirical risk minimizer (i.e. convergence to the best risk in the class)
• Does it mean we can learn anything?
• No, because of the approximation error of the class
• Need to trade off approximation and estimation error (assessed by the bound)
⇒ Use the bound to control the trade-off

Structural risk minimization

• Structural risk minimization (SRM) (Vapnik, 1979): minimize the right-hand side of
R(g) ≤ Rn(g) + B(h, n)
• To this end, introduce a structure on G
• Learning machine ≡ a set of functions and an induction principle

SRM: The Picture

[Figure: the bound on the test error is the sum of the training error (decreasing in h) and the capacity term (increasing in h); it is minimized, near R(f*), at an intermediate element of the structure S_{n−1} ⊂ S_n ⊂ S_{n+1}.]

Lecture 4

Capacity Measures

• Covering numbers
• Rademacher averages
• Relationships

. . ε. e.Covering numbers • Deﬁne a (random) distance d between functions. ε) • Covering number N (F . O. ε. fN covers F at radius ε if F ⊂ ∪i=1B(fi. d(f. . . . f ) = 1 #{f (Zi) = f (Zi) : i = 1. . n} n Normalized Hamming distance of the ’projections’ on the sample • A set f1. .g. n) = N (G. Bousquet – Statistical Learning Theory – Lecture 4 71 N . ε. n) is the minimum size of a cover of radius ε Note that N (F . . n).

Bousquet – Statistical Learning Theory – Lecture 4 72 . t. n)] e −nt2 /128 O.Bound with covering numbers • When the covering numbers are ﬁnite. one can approximate the class G by a ﬁnite set of functions • Result P [∃g ∈ G : R(g) − Rn(g) ≥ t] ≤ 8E [N (G.

Covering numbers and VC dimension

• Notice that for all t, N(G, t, n) ≤ #G_Z = N(G, Z)
• Hence log N(G, t, n) ≤ h log(en/h)
• Haussler:
N(G, t, n) ≤ C h (4e)^h t^{−h}
• Independent of n

Refinement

• VC entropy corresponds to log covering numbers at the minimal scale
• The covering number bound is a generalization where the scale is adapted to the error
• Is this the right scale?
• It turns out that results can be improved by considering all scales (→ chaining)

Rademacher averages

• Rademacher variables: σ1, ..., σn independent random variables with
P[σi = 1] = P[σi = −1] = 1/2
• Notation (randomized empirical measure): Rn f = (1/n) Σ_{i=1}^n σi f(Zi)
• Rademacher average: R(F) = E[sup_{f∈F} Rn f]
• Conditional Rademacher average: Rn(F) = Eσ[sup_{f∈F} Rn f]

Result

• Distribution dependent: ∀f ∈ F,
Pf ≤ Pnf + 2 R(F) + √( log(1/δ) / (2n) )
• Data dependent: ∀f ∈ F,
Pf ≤ Pnf + 2 Rn(F) + √( 2 log(2/δ) / n )

Concentration

• Hoeffding's inequality is a concentration inequality
• When n increases, the average concentrates around the expectation
• Generalizations to arbitrary functions of i.i.d. random variables exist

McDiarmid's Inequality

Assume that for all i = 1, ..., n,
sup_{z1,...,zn,z'i} |F(z1, ..., zi, ..., zn) − F(z1, ..., z'i, ..., zn)| ≤ c

Then for all ε > 0,
P[|F − E[F]| > ε] ≤ 2 exp( −2ε² / (nc²) )

Proof of Rademacher average bounds

• Use concentration to relate sup_{f∈F} Pf − Pnf to its expectation
• Use symmetrization to relate the expectation to the Rademacher average
• Use concentration again to relate the Rademacher average to the conditional one

Application (1)

sup_{f∈F} (A(f) + B(f)) ≤ sup_{f∈F} A(f) + sup_{f∈F} B(f)

Hence
|sup_{f∈F} C(f) − sup_{f∈F} A(f)| ≤ sup_{f∈F} (C(f) − A(f))

This gives
|sup_{f∈F} (Pf − Pnf) − sup_{f∈F} (Pf − P'nf)| ≤ sup_{f∈F} (P'nf − Pnf)

Application (2)

f ∈ {0, 1}, hence, when only Zi is replaced by Z'i,
P'nf − Pnf = (1/n)(f(Z'i) − f(Zi)) ≤ 1/n

thus
|sup_{f∈F} (Pf − Pnf) − sup_{f∈F} (Pf − P'nf)| ≤ 1/n

McDiarmid's inequality can be applied with c = 1/n

Symmetrization (1)

• Upper bound:
E[sup_{f∈F} Pf − Pnf] ≤ 2 E[sup_{f∈F} Rn f]
• Lower bound:
E[sup_{f∈F} |Pf − Pnf|] ≥ (1/2) E[sup_{f∈F} Rn f] − 1/(2√n)

Symmetrization (2)

E[sup_{f∈F} Pf − Pnf]
  = E[sup_{f∈F} E[P'nf − Pnf]]
  ≤ E_{Z,Z'}[sup_{f∈F} P'nf − Pnf]
  = E_{σ,Z,Z'}[sup_{f∈F} (1/n) Σ_{i=1}^n σi(f(Z'i) − f(Zi))]
  ≤ 2 E[sup_{f∈F} Rn f]

Loss class and initial class

R(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi 1[g(Xi)≠Yi] ]
     = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi (1 − Yi g(Xi))/2 ]
     = (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi Yi g(Xi) ]   (the σi/2 term has zero expectation)
     = (1/2) R(G)   (since σi Yi has the same distribution as σi)

Computing Rademacher averages (1)

(1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi g(Xi) ]
  = 1/2 + (1/2) E[ sup_{g∈G} −(2/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ]
  = 1/2 − E[ inf_{g∈G} (1/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ]
  = 1/2 − E[ inf_{g∈G} Rn(g, σ) ]

where Rn(g, σ) is the empirical error of g with respect to the random labels σi.

Computing Rademacher averages (2)

• Not harder than computing the empirical risk minimizer:
pick the σi randomly and minimize the error with respect to the labels σi
• Intuition: measures how much the class can fit random noise
• Large class ⇒ R(G) = 1/2
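This recipe can be run directly on a toy class. The sketch below (an illustration, not from the lectures; the sample and class are made up) estimates a conditional Rademacher average for threshold classifiers by drawing random signs and taking the best correlation over the class:

```python
import random

# Estimate Rn(G) = E_sigma sup_g (1/n) sum_i sigma_i g(x_i) for the class of
# threshold classifiers g_t(x) = sign(x - t) on a fixed sample.
random.seed(1)
xs = [i / 10 for i in range(10)]             # fixed sample of n = 10 points
thresholds = [x - 0.05 for x in xs] + [1.0]  # one cut point per gap

def correlation(sigma, t):
    # (1/n) sum_i sigma_i g_t(x_i)
    return sum(s * (1 if x > t else -1) for s, x in zip(sigma, xs)) / len(xs)

draws = 500
total = 0.0
for _ in range(draws):
    sigma = [random.choice([-1, 1]) for _ in xs]
    total += max(correlation(sigma, t) for t in thresholds)
rademacher = total / draws

# Thresholds form a small class: a class that could fit any sign pattern
# would reach correlation 1 on every draw; here the average stays well below.
assert 0.0 < rademacher < 0.5
```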

Concentration again

• Let Φ = Eσ[sup_{f∈F} Rn f], the expectation being with respect to the σi only, with (Xi, Yi) fixed
• Φ satisfies McDiarmid's assumptions with c = 1/n
⇒ E[Φ] = R(F) can be estimated by Φ = Rn(F)

Relationship with VC dimension

• For a finite set F = {f1, ..., fN},
R(F) ≤ √( 2 log N / n )
• Consequence for a VC class F with dimension h:
R(F) ≤ √( 2 h log(en/h) / n )
⇒ Recovers the VC bound with a concentration proof

Chaining

• Using covering numbers at all scales, the geometry of the class is better captured
• Dudley:
Rn(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt
• Consequence:
R(F) ≤ C √(h/n)
• Removes the unnecessary log n factor!

Lecture 5

Advanced Topics

• Relative error bounds
• Noise conditions
• Localized Rademacher averages
• PAC-Bayesian bounds

Binomial tails

• n Pnf ∼ B(p, n), a binomial distribution with p = Pf
• P[Pf − Pnf ≥ t] = Σ_{k=0}^{⌊n(p−t)⌋} (n choose k) p^k (1 − p)^{n−k}
• This can be upper bounded by:
  - Exponential: ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}
  - Bennett: e^{ −(np/(1−p)) ((1−t/p) log(1−t/p) + t/p) }
  - Bernstein: e^{ −nt² / (2p(1−p) + 2t/3) }
  - Hoeffding: e^{−2nt²}
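Two of these bounds are easy to compare against the exact binomial tail; the sketch below (an illustration, not from the lectures) does so for a small-variance case, where Bernstein is much tighter than Hoeffding:

```python
import math

# Exact lower tail of Bin(n, p): P[Pnf <= p - t] = P[Bin(n, p) <= n(p - t)].
def binom_lower_tail(n, p, k):
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p, t = 100, 0.1, 0.05
exact = binom_lower_tail(n, p, int(n * (p - t)))
bernstein = math.exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))
hoeffding = math.exp(-2 * n * t**2)

assert exact <= bernstein <= 1.0  # Bernstein is a valid upper bound
assert bernstein < hoeffding      # small variance: Bernstein wins
```

With p = 0.5 (maximum variance) the ordering is much less favorable to Bernstein, which is exactly the point of the two illustration slides that follow.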

Tail behavior

• For small deviations, Gaussian behavior ≈ exp( −nt² / 2p(1−p) )
⇒ Gaussian with variance p(1 − p)
• For large deviations, Poisson behavior ≈ exp(−3nt/2)
⇒ Tails heavier than Gaussian
• Can upper bound with a Gaussian with the largest (maximum) variance: exp(−2nt²)

Illustration (1)

Maximum variance (p = 0.5)

[Figure: deviation bound as a function of x, comparing the binomial CDF, Bernstein, Bennett, the exact binomial tail and the best Gaussian.]

Illustration (2)

Small variance (p = 0.1)

[Figure: deviation bound as a function of x, comparing the binomial CDF, sub-Gaussian, Bernstein, Bennett, the exact binomial tail and the best Gaussian.]

Taking the variance into account (1)

• Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf.
• For each f ∈ F, by Bernstein's inequality,
Pf ≤ Pnf + √( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n)
• The Gaussian part dominates (for Pf not too small, or n large enough); it depends on Pf

Taking the variance into account (2)

• Central Limit Theorem:
√n (Pf − Pnf) / √( Pf(1 − Pf) ) → N(0, 1)

⇒ The idea is to consider the ratio
(Pf − Pnf) / √(Pf)

Normalization

• Here (f ∈ {0, 1}), Var[f] ≤ P(f²) = Pf
• Large variance ⇒ large risk. Focus of learning: functions with small error Pf (hence small variance)
• The supremum
sup_{f∈F} (Pf − Pnf) / √(Pf)
is not necessarily attained at functions with large variance
⇒ The normalized supremum takes this into account: after normalization, fluctuations are more "uniform"

Relative deviations

Vapnik-Chervonenkis 1974: for δ > 0, with probability at least 1 − δ,
∀f ∈ F, (Pf − Pnf) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n )
and
∀f ∈ F, (Pnf − Pf) / √(Pnf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n )

Proof sketch

1. Symmetrization:
P[ sup_{f∈F} (Pf − Pnf)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f∈F} (P'nf − Pnf)/√((Pnf + P'nf)/2) ≥ t ]
2. Randomization:
··· = 2 E[ Pσ[ sup_{f∈F} (1/n) Σ_{i=1}^n σi(f(Z'i) − f(Zi)) / √((Pnf + P'nf)/2) ≥ t ] ]
3. Tail bound

Consequences

From the fact
A ≤ B + C√A  ⇒  A ≤ B + C² + C√B
we get
∀f ∈ F, Pf ≤ Pnf + 2 √( Pnf (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n

Zero noise

Ideal situation:
• gn is the empirical risk minimizer
• t ∈ G
• R* = 0 (no noise, n(X) = 0 a.s.)

In that case:
• Rn(gn) = 0 ⇒ R(gn) = O( d log n / n )

Interpolating between rates?

• Rates are not correctly estimated by this inequality
• Consequence of relative error bounds:
Pfn ≤ Pf* + 2 √( Pf* (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
• The quantity which is small is not Pf* but Pfn − Pf*
• But relative error bounds do not apply to differences

Definitions

• P = P_X × P(Y|X)
• Regression function η(x) = E[Y|X = x]
• Target function t(x) = sgn η(x)
• Noise level n(x) = (1 − |η(x)|)/2
• Bayes risk R* = E[n(X)]
• R(g) = E[(1 − η(X))/2] + E[η(X) 1[g(X)≤0]]
• R(g) − R* = E[|η(X)| 1[g(X)η(X)≤0]]

Intermediate noise

Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0, the deterministic case), one can assume that n is well-behaved.

Two kinds of assumptions:
• n not too close to 1/2
• n not often too close to 1/2

Massart Condition

• For some c > 0, assume |η(X)| > 1/c almost surely
• There is no region where the decision is completely random
• Noise bounded away from 1/2

Tsybakov Condition

Let α ∈ [0, 1]. The following conditions are equivalent:
(1) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α
(2) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α
(3) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}

Equivalence

• (1) ⇔ (2): Recall R(g) − R* = E[|η(X)| 1[gη≤0]]. For each function g, there exists a set A such that 1[A] = 1[gη≤0].
• (2) ⇒ (3): Let A = {x : |η(x)| ≤ t}. Then
P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α
⇒ P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}

• (3) ⇒ (1):
R(g) − R* = E[ |η(X)| 1[gη≤0] ]
  ≥ t E[ 1[gη≤0] 1[|η|>t] ]
  ≥ t P[|η| > t] − t E[ 1[gη>0] 1[|η|>t] ]
  ≥ t (1 − B t^{α/(1−α)}) − t P[gη > 0]
  = t ( P[gη ≤ 0] − B t^{α/(1−α)} )
Take t = ( (1−α) P[gη ≤ 0] / B )^{(1−α)/α}:
⇒ P[gη ≤ 0] ≤ ( B^{1−α} / ((1−α)^{1−α} α^α) ) (R(g) − R*)^α

Remarks

• α is in [0, 1] because
R(g) − R* = E[|η(X)| 1[gη≤0]] ≤ E[1[gη≤0]]
• α = 0: no condition
• α = 1 gives Massart's condition

Consequences

• Under Massart's condition,
E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)
• Under Tsybakov's condition,
E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)^α

Relative loss class

• F is the loss class associated to G
• The relative loss class is defined as
F̃ = {f − f* : f ∈ F}
• It satisfies
P f̃² ≤ c (P f̃)^α

Finite case

• A union bound on F̃ with Bernstein's inequality would give
Pfn − Pf* ≤ Pnfn − Pnf* + √( 8c (Pfn − Pf*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n)
• Consequence when f* ∈ F (but R* > 0):
Pfn − Pf* ≤ C ( log(N/δ) / n )^{1/(2−α)}
always better than n^{−1/2} for α > 0

Local Rademacher average

• Definition:
R(F, r) = E[ sup_{f∈F : Pf²≤r} Rn f ]
• Allows to generalize the previous result
• Computes the capacity of a small ball in F (functions with small variance)
• Under noise conditions, small variance implies small error

Bousquet – Statistical Learning Theory – Lecture 5 114 .Sub-root functions Deﬁnition A function ψ : R → R is sub-root if • ψ is non-decreasing • ψ is non negative √ • ψ(r)/ r is non-increasing O.

Sub-root functions

Properties. A sub-root function
• is continuous
• has a unique fixed point ψ(r*) = r*

[Figure: plot of ψ(x) against the line y = x; the two curves cross at the unique fixed point.]
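For a concrete sub-root function the fixed point can be computed by simple iteration; this sketch (an illustration, not from the lectures) uses ψ(r) = c√r, whose fixed point is r* = c²:

```python
import math

# Iterate r <- psi(r); for psi(r) = c sqrt(r) the map is a contraction in
# log-scale, so the iteration converges to the unique fixed point r* = c^2.
def fixed_point(psi, r0=1.0, steps=200):
    r = r0
    for _ in range(steps):
        r = psi(r)
    return r

c = 0.3
psi = lambda r: c * math.sqrt(r)
r_star = fixed_point(psi)

assert abs(r_star - c ** 2) < 1e-9       # r* = 0.09
assert abs(psi(r_star) - r_star) < 1e-9  # indeed a fixed point
```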

Star hull

• Definition:
⋆F = {αf : f ∈ F, α ∈ [0, 1]}
• Properties:
  - Rn(⋆F, r) is sub-root
  - The entropy of ⋆F is not much bigger than the entropy of F

Bousquet – Statistical Learning Theory – Lecture 5 117 .Result • r ∗ ﬁxed point of R( F . r) • Bounded functions P f − Pn f ≤ C r ∗Var [f ] + log 1 + log log n δ n • Consequence for variance related to expectation (Var [f ] ≤ c(P f )β ) Pf ≤ C Pn f + 1 ∗ (r ) 2−β + log 1 + log log n δ n O.

Consequences

• For VC classes,
R(F, r) ≤ C √(rh/n), hence r* ≤ C h/n
• Rate of convergence of Pnf to Pf is O(1/√n)
• But the rate of convergence of Pfn to Pf* is O(1/n^{1/(2−α)})
• The only condition is t ∈ G, but it can be removed by SRM/model selection

Proof sketch (1)

• Talagrand's inequality:
sup_{f∈F} Pf − Pnf ≤ E[sup_{f∈F} Pf − Pnf] + c √( sup_{f∈F} Var[f] / n ) + c'/n
• Peeling of the class:
F_k = {f : Var[f] ∈ [x^k, x^{k+1})}

Proof sketch (2)

• Application to each F_k:
sup_{f∈F_k} Pf − Pnf ≤ E[sup_{f∈F_k} Pf − Pnf] + c √( x Var[f] / n ) + c'/n
• Symmetrization:
∀f ∈ F, Pf − Pnf ≤ 2 R(F, x Var[f]) + c √( x Var[f] / n ) + c'/n

Proof sketch (3)

• We need to 'solve' this inequality. Things are simple if R behaves like a square root — hence the sub-root property:
Pf − Pnf ≤ 2 √(r* Var[f]) + c √( x Var[f] / n ) + c'/n
• Variance-expectation: Var[f] ≤ c(Pf)^α; solve in Pf

Data-dependent version

• As in the global case, one can use data-dependent local Rademacher averages:
Rn(F, r) = Eσ[ sup_{f∈F : Pf²≤r} Rn f ]
• Using concentration one can also get
Pf ≤ C ( Pnf + (r*_n)^{1/(2−α)} + (log(1/δ) + log log n)/n )
where r*_n is the fixed point of a sub-root upper bound of Rn(F, r)

Discussion

• Improved rates under low noise conditions
• Interpolation in the rates
• The capacity measure seems 'local', but it depends on all the functions, after an appropriate rescaling: each f ∈ F is considered at scale r/Pf²

Randomized Classifiers

Given a class of functions G:
• Deterministic: pick a function gn and always use it to predict
• Randomized: construct a distribution ρn over G; for each instance to classify, pick g ∼ ρn
• The error is averaged over ρn:
R(ρn) = ρn Pf,   Rn(ρn) = ρn Pnf

Bousquet – Statistical Learning Theory – Lecture 5 125 . • Recall the reﬁned union bound ∀f ∈ F .Union Bound (1) Let π be a (ﬁxed) distribution over F . P f − Pnf ≤ • Take expectation with respect to ρn ρnP f − ρnPnf ≤ ρn 1 log π(f ) + log 1 δ 1 log π(f ) + log 1 δ 2n 2n O.

Union Bound (2)

ρn Pf − ρn Pnf ≤ ρn √( (−log π(f) + log(1/δ)) / (2n) )
  ≤ √( (−ρn log π(f) + log(1/δ)) / (2n) )
  ≤ √( (K(ρn, π) + H(ρn) + log(1/δ)) / (2n) )

• K(ρn, π) = ∫ ρn(f) log( ρn(f)/π(f) ) df — Kullback-Leibler divergence
• H(ρn) = −∫ ρn(f) log ρn(f) df — entropy

PAC-Bayesian Refinement

• It is possible to improve the previous bound.
• With probability at least 1 − δ,
ρn Pf − ρn Pnf ≤ √( (K(ρn, π) + log 4n + log(1/δ)) / (2n − 1) )
• Good if ρn is spread out (i.e. large entropy)
• Not interesting if ρn = δ_{fn}

Proof (1)

• Variational formulation of entropy: for any T,
ρ T(f) ≤ log π e^{T(f)} + K(ρ, π)
• Apply it to T(f) = λ(Pf − Pnf)²:
λ ρn (Pf − Pnf)² ≤ log π e^{λ(Pf−Pnf)²} + K(ρn, π)
• Markov's inequality: with probability 1 − δ,
λ ρn (Pf − Pnf)² ≤ log E[ π e^{λ(Pf−Pnf)²} ] + log(1/δ) + K(ρn, π)

Bousquet – Statistical Learning Theory – Lecture 5 129 2 .Proof (2) • Fubini E πe λ(P f −Pn f )2 = πE e λ(P f −Pn f )2 • Modiﬁed Chernoﬀ bound E e (2n−1)(P f −Pn f )2 ≤ 4n • Putting together (λ = 2n − 1) (2n − 1)ρn(P f − Pnf ) ≤ K(ρn. π) + log 4n + log 1 δ • Jensen (2n − 1)(ρn(P f − Pnf ))2 ≤ (2n − 1)ρn(P f − Pnf )2 O.

Lecture 6

Loss Functions

• Properties
• Consistency
• Examples
• Losses and noise

Yi) i=1 O.Motivation (1) • ERM: minimize n i=1 1[g(Xi)=Yi] in a set G ⇒ Computationally hard ⇒ Smoothing Replace binary by real-valued functions Introduce smooth loss function n (g(Xi). Bousquet – Statistical Learning Theory – Lecture 6 131 .

Motivation (2)

• Hyperplanes in infinite dimension have infinite VC-dimension but finite scale-sensitive dimension (to be defined later)
⇒ It is good to have a scale
⇒ This scale can be used to give a confidence (i.e. estimate the density)
• However, losses do not need to be related to densities
• Can get bounds in terms of the margin error instead of the empirical error (smoother → easier to optimize for model selection)

Margin

• It is convenient to work with (by symmetry of +1 and −1)
ℓ(g(x), y) = φ(y g(x))
• y g(x) is the margin of g at (x, y)
• Loss:
L(g) = E[φ(Y g(X))],   Ln(g) = (1/n) Σ_{i=1}^n φ(Yi g(Xi))
• Loss class: F = {f : (x, y) → φ(y g(x)) : g ∈ G}

Minimizing the loss

• Decomposition of L(g):
L(g) = E[ (1 + η(X)) φ(g(X))/2 + (1 − η(X)) φ(−g(X))/2 ]
• Minimization for each x:
H(η) = inf_{α∈R} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )
• L* := inf_g L(g) = E[H(η(X))]

O.Classiﬁcation-calibrated • A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t or that of η ). except possibly for η = 0). Bousquet – Statistical Learning Theory – Lecture 6 135 . • Deﬁnition φ is classiﬁcation-calibrated if. for any η = 0 α:αη≤0 inf (1+η)φ(α)+(1−η)φ(−α) > inf (1+η)φ(α)+(1−η)φ(−α) α∈R • This means the inﬁmum is achieved for an α of the correct sign (and not for an α of the wrong sign.

Consequences (1)

Results due to (Jordan, Bartlett and McAuliffe 2003):
• φ is classification-calibrated iff, for all sequences gi and every probability distribution P,
L(gi) → L*  ⇒  R(gi) → R*
• When φ is convex (convenient for optimization), φ is classification-calibrated iff it is differentiable at 0 and φ'(0) < 0

Bousquet – Statistical Learning Theory – Lecture 6 137 .Consequences (2) • Let H −(η) = inf α:αη≤0 ((1 + η)φ(α)/2 + (1 − η)φ(−α)/2) • Let ψ(η) be the largest convex function below H −(η) − H(η) • One has ψ(R(g) − R ) ≤ L(g) − L ∗ ∗ O.

Examples (1)

[Figure: plot of the hinge, squared hinge, square and exponential losses as functions of the margin.]

Examples (2)

• Hinge loss: φ(x) = max(0, 1 − x), ψ(x) = x
• Squared hinge loss: φ(x) = max(0, 1 − x)², ψ(x) = x²
• Square loss: φ(x) = (1 − x)², ψ(x) = x²
• Exponential: φ(x) = exp(−x), ψ(x) = 1 − √(1 − x²)
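These four losses share the properties the lecture relies on; the sketch below (not from the slides) evaluates them at a few margins and checks that each dominates the 0-1 loss 1[yg(x) ≤ 0] and rewards large margins:

```python
import math

# The four margin losses of Examples (2), as functions of the margin x = y g(x).
hinge = lambda x: max(0.0, 1.0 - x)
squared_hinge = lambda x: max(0.0, 1.0 - x) ** 2
square = lambda x: (1.0 - x) ** 2
exponential = lambda x: math.exp(-x)

for phi in (hinge, squared_hinge, square, exponential):
    for x in [-1.0, -0.5, 0.0, 0.5, 1.0]:
        zero_one = 1.0 if x <= 0 else 0.0
        assert phi(x) >= zero_one   # phi upper-bounds the 0-1 loss
    assert phi(1.0) <= phi(0.0)     # non-increasing toward large margins
```

Note the square loss penalizes margins larger than 1 again (it is a regression-type loss), a point taken up in the "Classification vs Regression losses" slides.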

Low noise conditions

• The relationship can be improved under low noise conditions
• Under Tsybakov's condition with exponent α and constant c,
c (R(g) − R*)^α ψ( (R(g) − R*)^{1−α} / (2c) ) ≤ L(g) − L*
• Hinge loss (no improvement):
R(g) − R* ≤ L(g) − L*
• Square loss or squared hinge loss:
R(g) − R* ≤ ( 4c (L(g) − L*) )^{1/(2−α)}

Estimation error

• Recall that Tsybakov's condition implies Pf² ≤ c(Pf)^α for the relative loss class (with 0−1 loss)
• What happens for the relative loss class associated to φ?
• Two possibilities:
  - Strictly convex loss (can modify the metric on R)
  - Piecewise linear loss

Strictly convex losses

• Noise behavior controlled by the modulus of convexity
• Result:
δ( √(Pf²) / K ) ≤ Pf / 2
with K the Lipschitz constant of φ and δ the modulus of convexity of L(g) with respect to ‖f − g‖_{L₂(P)}
• Not related to the noise exponent

Piecewise linear losses

• Noise behavior related to the noise exponent
• Result for the hinge loss:
Pf² ≤ C (Pf)^α
if the initial class G is uniformly bounded

Estimation error

• With a bounded and Lipschitz loss with convexity exponent γ, for a convex class G,
L(g) − L(g*) ≤ C ( (r*)^{2/γ} + (log(1/δ) + log log n)/n )
• Under Tsybakov's condition, for the hinge loss (and general G), Pf² ≤ C (Pf)^α and
L(g) − L(g*) ≤ C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n)/n )

Examples

Under Tsybakov's condition:
• Hinge loss:
R(g) − R* ≤ L(g*) − L* + C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n)/n )
• Squared hinge loss or square loss: δ(x) = cx², Pf² ≤ C (Pf)^α, and
R(g) − R* ≤ C ( L(g*) − L* + C' ( r* + (log(1/δ) + log log n)/n ) )^{1/(2−α)}

Classification vs Regression losses
• Consider a classification-calibrated loss function φ
• It is a classification loss if L(t) = L∗
• Otherwise it is a regression loss
O. Bousquet – Statistical Learning Theory – Lecture 6 146

Classification vs Regression losses
• Square, squared hinge, and exponential losses:
  – Noise enters the relationship between risk and loss
  – Modulus of convexity enters the estimation error
• Hinge loss:
  – Direct relationship between risk and loss
  – Noise enters the estimation error
⇒ Approximation term not affected by noise in the second case
⇒ Real value does not bring probability information in the second case
O. Bousquet – Statistical Learning Theory – Lecture 6 147

Lecture 7
Regularization
• Formulation
• Capacity measures
• Computing Rademacher averages
• Applications
O. Bousquet – Statistical Learning Theory – Lecture 7 148

Equivalent problems
Up to the choice of the regularization parameters, the following problems are equivalent:

min_{f∈F} Ln(f) + λΩ(f)

min_{f∈F : Ω(f)≤R} Ln(f)

min_{f∈F : Ln(f)≤e} Ω(f)

The solution sets are the same.

O. Bousquet – Statistical Learning Theory – Lecture 7 149
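The equivalence of the penalized and constrained forms can be seen on a one-parameter least-squares toy problem (squared loss, Ω(w) = w²; the data and λ below are arbitrary illustrative choices): setting R to Ω of the penalized solution makes the constrained minimizer coincide with it.

```python
# Toy check: f_w(x) = w*x, Ln(w) = (1/n) sum (w*a_i - b_i)^2, Omega(w) = w^2.
# Data and lambda are arbitrary; only the coincidence of minimizers matters.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.2, 1.9, 3.2, 4.1]
n = len(a)
lam = 0.5

sab = sum(x * y for x, y in zip(a, b))
saa = sum(x * x for x in a)

# Penalized form: closed-form minimizer of Ln(w) + lam * w^2
w_pen = sab / (saa + n * lam)

# Constrained form: min Ln(w) subject to w^2 <= R, with R = w_pen^2.
# In one dimension Ln is a convex parabola, so the constrained minimizer
# is the unconstrained least-squares solution clipped to the interval.
w_ls = sab / saa
R = w_pen ** 2
w_con = max(-R ** 0.5, min(R ** 0.5, w_ls))
```

Since |w_ls| > |w_pen|, the constraint is active and both problems pick the same point on the boundary of the ball.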

• Computationally, a variant of SRM
• A variant of model selection by penalization
⇒ one has to choose a regularizer which makes sense
• Need a class that is large enough (for universal consistency)
• but that has small balls

O. Bousquet – Statistical Learning Theory – Lecture 7 150

Rates
• To obtain bounds, consider ERM on balls
• Relevant capacity is that of balls
• Real-valued functions: need a generalization of VC dimension, entropy or covering numbers
• Involve scale-sensitive capacity (takes into account the value and not only the sign)

O. Bousquet – Statistical Learning Theory – Lecture 7 151

Scale-sensitive capacity
• Generalization of VC entropy and VC dimension to real-valued functions
• Definition: a set x1, …, xn is shattered by F (at scale ε) if there exists a function s such that for all choices of αi ∈ {−1, 1} there exists f ∈ F with αi(f(xi) − s(xi)) ≥ ε
• The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is the maximum cardinality of a shattered set
O. Bousquet – Statistical Learning Theory – Lecture 7 152

• Like the VC dimension, the fat-shattering dimension can be used to upper bound covering numbers
• Result:

N(F, t, n) ≤ (C1/t)^(C2 vc(F, C3 t))

• Note that one can also define data-dependent versions (restriction to the sample)

O. Bousquet – Statistical Learning Theory – Lecture 7 153

• Consequence of covering number estimates:

Rn(F) ≤ (C1/√n) ∫₀^∞ √( vc(F, t) log(C2/t) ) dt

• Another link via Gaussian averages (replace the Rademacher variables by Gaussian N(0,1) variables):

Gn(F) = E_g sup_{f∈F} (1/n) Σ_{i=1}^n g_i f(Z_i)

O. Bousquet – Statistical Learning Theory – Lecture 7 154

• Worst case average:

R̃n(F) = sup_{x1,…,xn} Eσ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(xi)

• Associated "dimension": t(F, ε) = sup{ n ∈ N : R̃n(F) ≥ ε }
• Result (Mendelson & Vershynin, 2003):

vc(F, cε) ≤ t(F, ε) ≤ (K/ε²) vc(F, cε)

O. Bousquet – Statistical Learning Theory – Lecture 7 155

Rademacher averages and Lipschitz losses
• What matters is the capacity of F (the loss class)
• If φ is Lipschitz with constant M
• then Rn(F) ≤ M Rn(G)
• Relates to the Rademacher average of the initial class (easier to compute)
O. Bousquet – Statistical Learning Theory – Lecture 7 156

Dualization
• Consider the problem min_{‖g‖≤R} Ln(g)
• Rademacher average of the ball: Eσ sup_{‖g‖≤R} Rn g
• Duality:

Eσ sup_{‖g‖≤R} Rn g = (R/n) Eσ ‖ Σ_{i=1}^n σi δXi ‖∗

with ‖·‖∗ the dual norm and δXi the evaluation at Xi (an element of the dual under appropriate conditions)
O. Bousquet – Statistical Learning Theory – Lecture 7 157

RKHS
Given a positive definite kernel k
• Space of functions: the reproducing kernel Hilbert space associated to k
• Regularizer: the RKHS norm ‖·‖k
• Properties: representer theorem,

gn = Σ_{i=1}^n αi k(Xi, ·)

O. Bousquet – Statistical Learning Theory – Lecture 7 158

Shattering dimension of hyperplanes
• Set of functions: G = {g(x) = ⟨w, x⟩ : ‖w‖ = 1}
• Assume ‖x‖ ≤ R
• Result: vc(G, ρ) ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 7 159

Proof Strategy (Gurvits, 1997)
Assume that x1, …, xr are ρ-shattered by hyperplanes with ‖w‖ = 1, i.e., for all y1, …, yr ∈ {±1} there exists a w such that

yi ⟨w, xi⟩ ≥ ρ for all i = 1, …, r.   (2)

Two steps:
• prove that the more points we want to shatter via (2), the larger ‖Σ_{i=1}^r yi xi‖ must be
• upper bound the size of ‖Σ_{i=1}^r yi xi‖ in terms of R
Combining the two tells us how many points we can at most shatter.
O. Bousquet – Statistical Learning Theory – Lecture 7 160

Part I
• Summing (2) yields ⟨w, Σ_{i=1}^r yi xi⟩ ≥ rρ
• By the Cauchy–Schwarz inequality:

⟨w, Σ_{i=1}^r yi xi⟩ ≤ ‖w‖ ‖Σ_{i=1}^r yi xi‖ = ‖Σ_{i=1}^r yi xi‖

• Combining both: rρ ≤ ‖Σ_{i=1}^r yi xi‖   (3)
O. Bousquet – Statistical Learning Theory – Lecture 7 161

Part II
Consider the labels yi ∈ {±1} as (Rademacher) random variables. Then

E ‖Σ_{i=1}^r yi xi‖² = E [ Σ_{i,j=1}^r yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r E[⟨xi, xi⟩] + E[ Σ_{i≠j} yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r ‖xi‖²

since E[yi yj] = 0 for i ≠ j.

O. Bousquet – Statistical Learning Theory – Lecture 7 162
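The identity E‖Σ yi xi‖² = Σ ‖xi‖² can be checked by Monte Carlo over random sign vectors (the points and trial count below are arbitrary):

```python
import random

random.seed(0)

def sq_norm(v):
    return sum(c * c for c in v)

# A few fixed points in R^3 (arbitrary toy data)
xs = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0], [2.0, 1.0, 0.0], [-1.0, 0.5, 0.5]]

def mc_expectation(trials=100000):
    """Monte Carlo estimate of E || sum_i y_i x_i ||^2 over random signs y_i."""
    total = 0.0
    for _ in range(trials):
        ys = [random.choice([-1.0, 1.0]) for _ in xs]
        s = [sum(y * x[d] for y, x in zip(ys, xs)) for d in range(3)]
        total += sq_norm(s)
    return total / trials

# Exact value: the cross terms vanish, leaving sum_i ||x_i||^2
exact = sum(sq_norm(x) for x in xs)
```

The estimate concentrates around Σ‖xi‖², exactly as the cancellation of the i ≠ j terms predicts.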

Part II, ctd.
• Since ‖xi‖ ≤ R, we get

E ‖Σ_{i=1}^r yi xi‖² ≤ rR².

• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
• Hence

‖Σ_{i=1}^r yi xi‖² ≤ rR².

O. Bousquet – Statistical Learning Theory – Lecture 7 163

Part I and II Combined
• Part I: (rρ)² ≤ ‖Σ_{i=1}^r yi xi‖²
• Part II: ‖Σ_{i=1}^r yi xi‖² ≤ rR²
• Hence r²ρ² ≤ rR², i.e.,

r ≤ R²/ρ²

O. Bousquet – Statistical Learning Theory – Lecture 7 164

Boosting
Given a class H of functions
• Space of functions: the linear span of H
• Regularizer: 1-norm of the weights, ‖g‖₁ = inf{ Σ |αi| : g = Σ αi hi }
• Properties: weight concentrated on the (weighted) margin maximizers, i.e., on the h achieving min_{h∈H} Σ_i di Yi h(Xi)
O. Bousquet – Statistical Learning Theory – Lecture 7 165

Rademacher averages for boosting
• Function class of interest: GR = {g ∈ span H : ‖g‖₁ ≤ R}
• Result: Rn(GR) = R Rn(H)
⇒ Capacity (as measured by global Rademacher averages) is not affected by taking linear combinations!
O. Bousquet – Statistical Learning Theory – Lecture 7 166
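The identity Rn(GR) = R Rn(H) rests on the fact that a linear function of the weights α attains its supremum over the ℓ1 ball at a vertex ±R e_h. A small enumeration sketch for one draw of σ (a toy finite H of our choosing, implicitly symmetrized by the absolute value):

```python
import random

random.seed(1)

# A small finite base class H on 5 sample points, each h given by its
# value vector (h(x_1), ..., h(x_5)). Arbitrary toy choices.
H = [[1, -1, 1, 1, -1], [1, 1, -1, 1, 1], [-1, -1, -1, 1, 1]]
R = 2.0
sigma = [random.choice([-1, 1]) for _ in range(5)]

# Right-hand side: R * max_h | sum_i sigma_i h(x_i) |
rhs = R * max(abs(sum(s * v for s, v in zip(sigma, h))) for h in H)

# Left-hand side: the sup of a linear function of alpha over the l1 ball
# ||alpha||_1 <= R is attained at a vertex +/- R e_h, so enumerate vertices.
vertices = [[R * sgn if j == k else 0.0 for j in range(len(H))]
            for k in range(len(H)) for sgn in (1, -1)]

def g_value(alpha, i):
    return sum(a * h[i] for a, h in zip(alpha, H))

lhs = max(sum(s * g_value(alpha, i) for i, s in enumerate(sigma))
          for alpha in vertices)
```

The two quantities agree exactly for every σ, which is why averaging over σ leaves the factor R intact.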

Lecture 8
SVM
• Computational aspects
• Capacity Control
• Universality
• Special case of RBF kernel
O. Bousquet – Statistical Learning Theory – Lecture 8 167

Formulation (1)
• Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
subject to yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

• Convex objective function and convex constraints
• Unique solution
• Efficient procedures to find it
→ Is it the right criterion?
O. Bousquet – Statistical Learning Theory – Lecture 8 168

Formulation (2)
• Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi,  yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

• Optimal value of ξi:

ξi∗ = max(0, 1 − yi(⟨w, xi⟩ + b))

• Substitute above to get

min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − yi(⟨w, xi⟩ + b))

O. Bousquet – Statistical Learning Theory – Lecture 8 169
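The unconstrained hinge-loss form lends itself directly to subgradient descent. A minimal sketch on a toy separable data set (the data, step size, and iteration count are arbitrary choices, not from the slides):

```python
# Tiny linearly separable 2-D toy set (arbitrary illustrative data)
X = [[2.0, 2.0], [1.5, 2.5], [2.5, 1.0], [-2.0, -1.5], [-1.0, -2.0], [-2.5, -2.5]]
Y = [1, 1, 1, -1, -1, -1]
C = 1.0

def train(steps=2000, lr=0.01):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(<w,x_i> + b))."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = [w[0], w[1]], 0.0   # gradient of the (1/2)||w||^2 term
        for x, y in zip(X, Y):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:           # hinge active: subgradient is -y*x_i, -y
                gw[0] -= C * y * x[0]
                gw[1] -= C * y * x[1]
                gb -= C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

w, b = train()
preds = [1 if (w[0] * x[0] + w[1] * x[1] + b) > 0 else -1 for x in X]
```

Since the objective is convex, this simple scheme converges; in practice the dual quadratic program mentioned on the slide is the standard route.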

Regularization
General form of the regularization problem:

min_{f∈F} (1/m) Σ_{i=1}^m c(yi f(xi)) + λ‖f‖²

→ Capacity control by regularization with a convex cost
O. Bousquet – Statistical Learning Theory – Lecture 8 170

Loss Function

φ(Y f(X)) = max(0, 1 − Y f(X))

• Convex, non-increasing, upper bounds 1[Y f(X) ≤ 0]
• Classification-calibrated
• Classification type (L∗ = L(t))
• R(g) − R∗ ≤ L(g) − L∗
O. Bousquet – Statistical Learning Theory – Lecture 8 171

Regularization
Choosing a kernel corresponds to:
• choosing a sequence (ak)
• setting ‖f‖² := Σ_{k≥0} ak ∫ |f^(k)|² dx
⇒ penalization of high-order derivatives (high frequencies)
⇒ enforces smoothness of the solution
O. Bousquet – Statistical Learning Theory – Lecture 8 172

Capacity: VC dimension
• The VC dimension of the set of hyperplanes in R^d is d + 1. Dimension of the feature space? ∞ for the RBF kernel
• w is chosen in the span of the data (w = Σ αi yi xi). The span of the data has dimension m for the RBF kernel (the k(·, xi) are linearly independent)
• The VC bound does not give any information (h/m = 1)
⇒ Need to take the margin into account
O. Bousquet – Statistical Learning Theory – Lecture 8 173

Capacity: Shattering dimension
Hyperplanes with margin: if ‖x‖ ≤ R,

vc(hyperplanes with margin ρ, 1) ≤ R²/ρ²

O. Bousquet – Statistical Learning Theory – Lecture 8 174

Margin
• The shattering dimension is related to the margin
• Maximizing the margin means minimizing the shattering dimension
• Small shattering dimension ⇒ good control of the risk
⇒ this control is automatic (no need to choose the margin beforehand)
⇒ but requires tuning of the regularization parameter
O. Bousquet – Statistical Learning Theory – Lecture 8 175

Capacity: Rademacher Averages (1)
• Consider hyperplanes with ‖w‖ ≤ M
• Rademacher average:

(M/(√2 n)) √(Σ_{i=1}^n k(xi, xi)) ≤ Rn ≤ (M/n) √(Σ_{i=1}^n k(xi, xi))

• Trace of the Gram matrix
• Notice that Rn ≤ √(R²/(n ρ²)) when k(xi, xi) ≤ R² and M = 1/ρ
O. Bousquet – Statistical Learning Theory – Lecture 8 176
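Both sides of the trace bound can be estimated numerically for the linear kernel k(x, y) = ⟨x, y⟩, using the closed form sup_{‖w‖≤M} (1/n) Σ σi ⟨w, xi⟩ = (M/n) ‖Σ σi xi‖ (sample points and M are arbitrary):

```python
import random

random.seed(0)

# Arbitrary toy sample in R^2
xs = [[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3], [1.2, 1.1], [0.0, -0.6]]
n, M = len(xs), 2.0

def norm(v):
    return sum(c * c for c in v) ** 0.5

def rademacher_mc(trials=20000):
    """Monte Carlo estimate of E_sigma (M/n) || sum_i sigma_i x_i ||."""
    total = 0.0
    for _ in range(trials):
        s = [random.choice([-1.0, 1.0]) for _ in xs]
        v = [sum(si * x[d] for si, x in zip(s, xs)) for d in range(2)]
        total += (M / n) * norm(v)
    return total / trials

# Trace bound from the slide, with the linear kernel k(x_i, x_i) = ||x_i||^2
trace_bound = (M / n) * (sum(norm(x) ** 2 for x in xs)) ** 0.5
```

The gap between the estimate and the bound is the Jensen gap incurred when passing from E‖·‖ to √(E‖·‖²).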

Rademacher Averages (2)

E sup_{‖w‖≤M} (1/n) Σ_{i=1}^n σi ⟨w, δxi⟩
= E sup_{‖w‖≤M} (1/n) ⟨w, Σ_{i=1}^n σi δxi⟩
≤ (M/n) E ‖Σ_{i=1}^n σi δxi‖

O. Bousquet – Statistical Learning Theory – Lecture 8 177

Rademacher Averages (3)

(M/n) E ‖Σ_{i=1}^n σi δxi‖
≤ (M/n) √( E ‖Σ_{i=1}^n σi δxi‖² )
= (M/n) √( E Σ_{i,j} σi σj ⟨δxi, δxj⟩ )
= (M/n) √( Σ_{i=1}^n k(xi, xi) )

O. Bousquet – Statistical Learning Theory – Lecture 8 178

Improved rates – Noise condition
• Under Massart's condition (|η| > η0), for g with ‖g‖∞ ≤ M:

E[(φ(Y g(X)) − φ(Y t(X)))²] ≤ (M − 1 + 2/η0)(L(g) − L∗)

→ If the noise is nice, the variance is linearly related to the expectation
→ Estimation error of order r∗ (of the class G)
O. Bousquet – Statistical Learning Theory – Lecture 8 179

Improved rates – Capacity (1)
• r∗n is related to the decay of the eigenvalues of the Gram matrix:

r∗n ≤ (c/n) min_{d∈N} ( d + √( n Σ_{j>d} λj ) )

• Note that d = 0 gives the trace bound
• r∗n is always better than the trace bound (equality when the λi are constant)
O. Bousquet – Statistical Learning Theory – Lecture 8 180

Improved rates – Capacity (2)
Example: exponential decay
• λi = e^(−αi)
• Global Rademacher average of order 1/√n
• r∗n of order (log n)/n
O. Bousquet – Statistical Learning Theory – Lecture 8 181
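Assuming the fixed-point bound takes the form r∗n ≤ (c/n) min_d ( d + √(n Σ_{j>d} λj) ), which is our reading of the preceding slide (taken here with c = 1), a quick computation for λj = e^(−j) shows the local quantity beating the trace-style global bound, with the gap widening as n grows:

```python
import math

alpha = 1.0

def lam(j):
    # Eigenvalue decay lambda_j = exp(-alpha * j)
    return math.exp(-alpha * j)

def tail(d, terms=200):
    # sum_{j > d} lambda_j, truncated (the geometric tail is negligible)
    return sum(lam(j) for j in range(d + 1, d + 1 + terms))

def local_bound(n):
    """(1/n) * min over d of ( d + sqrt(n * sum_{j>d} lambda_j) )."""
    return min(d + math.sqrt(n * tail(d)) for d in range(0, 200)) / n

def global_bound(n):
    """Trace-style global bound sqrt(sum_j lambda_j / n), of order 1/sqrt(n)."""
    return math.sqrt(tail(0) / n)
```

The optimal cutoff d grows like log n, which is how the (log n)/n rate on the slide arises.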

Exponent of the margin
• Estimation error analysis shows that in GM = {g : ‖g‖ ≤ M}, R(gn) − R(g∗) ≤ M · …
• The wrong power (M² penalty) is used in the algorithm
• Computationally easier
• But does not give λ a dimension-free status
• Using M could improve the cutoff detection
O. Bousquet – Statistical Learning Theory – Lecture 8 182

Kernel
Why is it good to use kernels?
• Gaussian kernel (RBF): k(x, y) = e^(−‖x−y‖²/(2σ²))
• σ is the width of the kernel
→ What is the geometry of the feature space?
O. Bousquet – Statistical Learning Theory – Lecture 8 183

RBF Geometry
• Norms: ‖Φ(x)‖² = ⟨Φ(x), Φ(x)⟩ = e⁰ = 1 → sphere of radius 1
• Angles: cos(Φ(x), Φ(y)) = ⟨Φ(x), Φ(y)⟩ / (‖Φ(x)‖ ‖Φ(y)‖) = e^(−‖x−y‖²/(2σ²)) ≥ 0 → angles less than 90 degrees
• ⟨Φ(x), Φ(y)⟩ ≥ 0 → positive quadrant
O. Bousquet – Statistical Learning Theory – Lecture 8 184
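The three geometric facts (unit norms, positive inner products, contracted distances) all follow from k(x, x) = 1 and ‖Φ(x) − Φ(y)‖² = 2 − 2k(x, y). A short numerical check (the points are arbitrary):

```python
import math

def k(x, y, sigma=1.0):
    """Gaussian RBF kernel: <Phi(x), Phi(y)> = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def feat_dist(x, y, sigma=1.0):
    # ||Phi(x) - Phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 - 2 k(x,y)
    return math.sqrt(2.0 - 2.0 * k(x, y, sigma))

# Arbitrary toy inputs
pts = [[0.0, 0.0], [1.0, 0.5], [-2.0, 1.0], [3.0, -1.0]]
```

The contraction 2(1 − e^(−u)) ≤ 2u (with u = ‖x−y‖²/(2σ²)) gives feat_dist(x, y) ≤ ‖x−y‖/σ, the 'shortcut' effect mentioned below.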

RBF
[figure]
O. Bousquet – Statistical Learning Theory – Lecture 8 185

RBF Differential Geometry
• Flat Riemannian metric → 'distance' along the sphere is equal to the distance in input space
• Distances are contracted → 'shortcuts' by getting outside the sphere
O. Bousquet – Statistical Learning Theory – Lecture 8 186

RBF Geometry of the span
[figure: ellipsoid x²/a² + y²/b² = 1]
• K = (k(xi, xj)) Gram matrix
• Eigenvalues λ1, …, λm
• Data points mapped to an ellipsoid with half-axis lengths √λ1, …, √λm
O. Bousquet – Statistical Learning Theory – Lecture 8 187
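For two points the Gram matrix is [[1, c], [c, 1]], whose eigenvalues 1 ± c are available in closed form, so the half-axis lengths √λ can be checked directly (a minimal two-point sketch, inputs arbitrary):

```python
import math

def rbf(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

# Two arbitrary 1-D points; their RBF Gram matrix is K = [[1, c], [c, 1]]
x1, x2 = 0.0, 1.0
c = rbf(x1, x2)

# A symmetric 2x2 matrix [[a, b], [b, a]] has eigenvectors (1, 1) and (1, -1)
# with eigenvalues a + b and a - b, so here lambda = 1 +/- c.
lam1, lam2 = 1.0 + c, 1.0 - c
axis1, axis2 = math.sqrt(lam1), math.sqrt(lam2)
```

The closer the points (c → 1), the more degenerate the ellipse: one long axis √(1+c) and one vanishing axis √(1−c).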

RBF Universality
• Consider the set of functions H = span{k(x, ·) : x ∈ X}
• H is dense in C(X)
→ Any continuous function can be approximated (in the ∞ norm) by functions in H
⇒ with enough data one can construct any function
O. Bousquet – Statistical Learning Theory – Lecture 8 188

RBF Eigenvalues
• Exponentially decreasing
• Fourier domain: exponential penalization of derivatives
• Enforces smoothness with respect to the Lebesgue measure in input space
O. Bousquet – Statistical Learning Theory – Lecture 8 189

RBF Induced Distance and Flexibility
• σ → 0: 1-nearest neighbor in input space. Each point lies in a separate dimension, everything orthogonal
• σ → ∞: linear classifier in input space. All points very close on the sphere, initial geometry
• Tuning σ allows to try all possible intermediate combinations
O. Bousquet – Statistical Learning Theory – Lecture 8 190
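The two limits can be read off the kernel values themselves; a tiny check with two fixed distinct inputs (the σ values are arbitrary stand-ins for "small" and "large"):

```python
import math

def k(x, y, sigma):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

x, y = [0.0, 1.0], [1.0, 0.0]   # two distinct input points

small = k(x, y, sigma=0.05)     # sigma -> 0: features nearly orthogonal
large = k(x, y, sigma=100.0)    # sigma -> infinity: features nearly identical
```

With nearly orthogonal features only the closest training point matters (nearest neighbor); with nearly identical features only linear structure in input space survives.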

RBF Ideas
• Works well if the Euclidean distance is good
• Works well if the decision boundary is smooth
• Adapt smoothness via σ
• Universal
O. Bousquet – Statistical Learning Theory – Lecture 8 191

Choosing the Kernel
• Major issue of current research
• Prior knowledge (e.g. invariances, distance)
• Cross-validation (limited to 1–2 parameters)
• Bound (better with a convex class)
⇒ Lots of open questions.
O. Bousquet – Statistical Learning Theory – Lecture 8 192

Learning Theory: some informal thoughts
• Need assumptions/restrictions to learn
• Data cannot replace knowledge
• No universal learning (simplicity measure)
• SVMs work because of capacity control
• Choice of kernel = choice of prior/regularizer
• RBF works well if the Euclidean distance is meaningful
• Knowledge improves performance (e.g. invariances)
O. Bousquet – Statistical Learning Theory – Lecture 8 193