Statistical Learning Theory

Olivier Bousquet
Department of Empirical Inference
Max Planck Institute of Biological Cybernetics
olivier.bousquet@tuebingen.mpg.de
Machine Learning Summer School, August 2003
Roadmap (1)
• Lecture 1: Introduction
• Part I: Binary Classification
Lecture 2: Basic bounds
Lecture 3: VC theory
Lecture 4: Capacity measures
Lecture 5: Advanced topics
O. Bousquet – Statistical Learning Theory 1
Roadmap (2)
• Part II: Real-Valued Classification
Lecture 6: Margin and loss functions
Lecture 7: Regularization
Lecture 8: SVM
O. Bousquet – Statistical Learning Theory 2
Lecture 1
The Learning Problem
• Context
• Formalization
• Approximation/Estimation trade-off
• Algorithms and Bounds
O. Bousquet – Statistical Learning Theory – Lecture 1 3
Learning and Inference
The inductive inference process:
1. Observe a phenomenon
2. Construct a model of the phenomenon
3. Make predictions
⇒ This is more or less the definition of natural sciences !
⇒ The goal of Machine Learning is to automate this process
⇒ The goal of Learning Theory is to formalize it.
O. Bousquet – Statistical Learning Theory – Lecture 1 4
Pattern recognition
We consider here the supervised learning framework for pattern
recognition:
• Data consists of pairs (instance, label)
• Label is +1 or −1
• Algorithm constructs a function (instance → label)
• Goal: make few mistakes on future unseen instances
O. Bousquet – Statistical Learning Theory – Lecture 1 5
Approximation/Interpolation
It is always possible to build a function that fits exactly the data.
But is it reasonable ?
O. Bousquet – Statistical Learning Theory – Lecture 1 6
Occam’s Razor
Idea: look for regularities in the observed phenomenon
These can be generalized from the observed past to the future
⇒ choose the simplest consistent model
How to measure simplicity ?
• Physics: number of constants
• Description length
• Number of parameters
• ...
O. Bousquet – Statistical Learning Theory – Lecture 1 7
No Free Lunch
• No Free Lunch
if there is no assumption on how the past is related to the future,
prediction is impossible
if there is no restriction on the possible phenomena, generalization
is impossible
• We need to make assumptions
• Simplicity is not absolute
• Data will never replace knowledge
• Generalization = data + knowledge
O. Bousquet – Statistical Learning Theory – Lecture 1 8
Assumptions
Two types of assumptions
• Future observations related to past ones
→ Stationarity of the phenomenon
• Constraints on the phenomenon
→ Notion of simplicity
O. Bousquet – Statistical Learning Theory – Lecture 1 9
Goals
⇒ How can we make predictions from the past ? What are the assumptions ?
• Give a formal definition of learning, generalization, overfitting
• Characterize the performance of learning algorithms
• Design better algorithms
O. Bousquet – Statistical Learning Theory – Lecture 1 10
Probabilistic Model
Relationship between past and future observations
⇒ Sampled independently from the same distribution
• Independence: each new observation yields maximum information
• Identical distribution: the observations give information about the
underlying phenomenon (here a probability distribution)
O. Bousquet – Statistical Learning Theory – Lecture 1 11
Probabilistic Model
We consider an input space X and an output space Y.
Here: classification case, Y = {−1, 1}.
Assumption: the pairs (X, Y) ∈ X × Y are distributed according to P (unknown).
Data: we observe a sequence of n i.i.d. pairs (X_i, Y_i) sampled according to P.
Goal: construct a function g : X → Y which predicts Y from X.
O. Bousquet – Statistical Learning Theory – Lecture 1 12
Probabilistic Model
Criterion to choose our function:
Low probability of error P(g(X) ≠ Y).
Risk
R(g) = P(g(X) ≠ Y) = E[1_{g(X) ≠ Y}]
• P is unknown so that we cannot directly measure the risk
• Can only measure the agreement on the data
• Empirical Risk
R_n(g) = (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}
O. Bousquet – Statistical Learning Theory – Lecture 1 13
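To make the distinction concrete, here is a minimal numpy sketch (not part of the original slides) contrasting the empirical risk R_n(g), computed on the observed data, with a Monte Carlo approximation of the true risk R(g); the toy distribution and the classifier g are illustrative assumptions.

```python
# Sketch: empirical risk R_n(g) vs. (approximate) true risk R(g) for a toy P and a fixed g.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy P: X uniform on [0,1], P(Y=1|X=x) = 0.9 if x > 0.5 else 0.1
    X = rng.random(n)
    Y = np.where(rng.random(n) < np.where(X > 0.5, 0.9, 0.1), 1, -1)
    return X, Y

def g(x):
    # A fixed classifier: threshold at 0.4 (deliberately not the optimal one)
    return np.where(x > 0.4, 1, -1)

X, Y = sample(50)                    # observed data
R_n = np.mean(g(X) != Y)             # empirical risk R_n(g)
Xb, Yb = sample(1_000_000)           # large sample to approximate R(g) = P(g(X) != Y)
R = np.mean(g(Xb) != Yb)
print(f"R_n(g) = {R_n:.3f}, R(g) ~ {R:.3f}")
```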
Target function
• P can be decomposed as P_X × P(Y|X)
• η(x) = E[Y|X = x] = 2P[Y = 1|X = x] − 1 is the regression function
• t(x) = sgn η(x) is the target function
• in the deterministic case Y = t(X) (P[Y = 1|X] ∈ {0, 1})
• in general, n(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 is the noise level
O. Bousquet – Statistical Learning Theory – Lecture 1 14
Assumptions about P
Need assumptions about P.
Indeed, if t(x) is totally chaotic, there is no possible generalization
from finite data.
Assumptions can be
• Preference (e.g. a priori probability distribution on possible functions)
• Restriction (set of possible functions)
Treating lack of knowledge
• Bayesian approach: uniform distribution
• Learning Theory approach: worst case analysis
O. Bousquet – Statistical Learning Theory – Lecture 1 15
Approximation/Interpolation (again)
How to trade-off knowledge and data ?
O. Bousquet – Statistical Learning Theory – Lecture 1 16
Overfitting/Underfitting
The data can mislead you.
• Underfitting
model too small to fit the data
• Overfitting
artificially good agreement with the data
No way to detect them from the data ! Need extra validation data.
O. Bousquet – Statistical Learning Theory – Lecture 1 17
Empirical Risk Minimization
• Choose a model G (set of possible functions)
• Minimize the empirical risk in the model
min_{g∈G} R_n(g)
What if the Bayes classifier is not in the model ?
O. Bousquet – Statistical Learning Theory – Lecture 1 18
Approximation/Estimation
• Bayes risk
R* = inf_g R(g).
Best risk a deterministic function can have (risk of the target function, or Bayes classifier).
• Decomposition: with g* such that R(g*) = inf_{g∈G} R(g),
R(g_n) − R* = [R(g*) − R*] + [R(g_n) − R(g*)]
              (Approximation)   (Estimation)
• Only the estimation error is random (i.e. depends on the data).
O. Bousquet – Statistical Learning Theory – Lecture 1 19
Structural Risk Minimization
• Choose a collection of models {G_d : d = 1, 2, . . .}
• Minimize the empirical risk in each model
• Minimize the penalized empirical risk
min_d min_{g∈G_d} R_n(g) + pen(d, n)
pen(d, n) gives preference to models where the estimation error is small
pen(d, n) measures the size or capacity of the model
O. Bousquet – Statistical Learning Theory – Lecture 1 20
Regularization
• Choose a large model G (possibly dense)
• Choose a regularizer ‖g‖
• Minimize the regularized empirical risk
min_{g∈G} R_n(g) + λ‖g‖²
• Choose an optimal trade-off λ (regularization parameter).
Most methods can be thought of as regularization methods.
O. Bousquet – Statistical Learning Theory – Lecture 1 21
Bounds (1)
A learning algorithm
• Takes as input the data (X_1, Y_1), . . . , (X_n, Y_n)
• Produces a function g_n
Can we estimate the risk of g_n ?
⇒ random quantity (depends on the data).
⇒ need probabilistic bounds
O. Bousquet – Statistical Learning Theory – Lecture 1 22
Bounds (2)
• Error bounds
R(g_n) ≤ R_n(g_n) + B
⇒ Estimation from an empirical quantity
• Relative error bounds
Best in a class:
R(g_n) ≤ R(g*) + B
Bayes risk:
R(g_n) ≤ R* + B
⇒ Theoretical guarantees
O. Bousquet – Statistical Learning Theory – Lecture 1 23
Lecture 2
Basic Bounds
• Probability tools
• Relationship with empirical processes
• Law of large numbers
• Union bound
• Relative error bounds
O. Bousquet – Statistical Learning Theory – Lecture 2 24
Probability Tools (1)
Basic facts
• Union: P[A or B] ≤ P[A] + P[B]
• Inclusion: If A ⇒ B, then P[A] ≤ P[B].
• Inversion: If P[X ≥ t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^{−1}(δ).
• Expectation: If X ≥ 0, E[X] = ∫_0^∞ P[X ≥ t] dt.
O. Bousquet – Statistical Learning Theory – Lecture 2 25
Probability Tools (2)
Basic inequalities
• Jensen: for f convex, f(E[X]) ≤ E[f(X)]
• Markov: if X ≥ 0 then for all t > 0, P[X ≥ t] ≤ E[X]/t
• Chebyshev: for t > 0, P[|X − E[X]| ≥ t] ≤ Var[X]/t²
• Chernoff: for all t ∈ R, P[X ≥ t] ≤ inf_{λ≥0} E[e^{λ(X−t)}]
O. Bousquet – Statistical Learning Theory – Lecture 2 26
Error bounds
Recall that we want to bound R(g_n) = E[1_{g_n(X) ≠ Y}] where g_n has been constructed from (X_1, Y_1), . . . , (X_n, Y_n).
• Cannot be observed (P is unknown)
• Random (depends on the data)
⇒ we want to bound
P[R(g_n) − R_n(g_n) > ε]
O. Bousquet – Statistical Learning Theory – Lecture 2 27
Loss class
For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G define the loss class
F = {f : (x, y) ↦ 1_{g(x) ≠ y} : g ∈ G}
Denote Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i)
Quantity of interest:
Pf − P_n f
We will go back and forth between F and G (bijection)
O. Bousquet – Statistical Learning Theory – Lecture 2 28
Empirical process
Empirical process:
{Pf − P_n f}_{f∈F}
• Process = collection of random variables (here indexed by the functions in F)
• Empirical = distribution of each random variable
⇒ Many techniques exist to control the supremum
sup_{f∈F} Pf − P_n f
O. Bousquet – Statistical Learning Theory – Lecture 2 29
The Law of Large Numbers
R(g) − R_n(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Z_i)
→ difference between the expectation and the empirical average of the r.v. f(Z)
Law of large numbers
P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)] = 0 ] = 1.
⇒ can we quantify it ?
O. Bousquet – Statistical Learning Theory – Lecture 2 30
Hoeffding’s Inequality
Quantitative version of law of large numbers.
Assumes bounded random variables
Theorem 1. Let Z_1, . . . , Z_n be n i.i.d. random variables. If f(Z) ∈ [a, b], then for all ε > 0, we have
P[ |(1/n) Σ_{i=1}^n f(Z_i) − E[f(Z)]| > ε ] ≤ 2 exp( −2nε² / (b − a)² ).
⇒ Let’s rewrite it to better understand
O. Bousquet – Statistical Learning Theory – Lecture 2 31
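As a quick sanity check, the following sketch (an illustration, not from the slides) simulates the deviation |P_n f − Pf| for a {0,1}-valued loss, for which Hoeffding's bound reads 2 exp(−2nε²); the parameters n, ε and p are arbitrary assumptions.

```python
# Sketch: Monte Carlo check that P(|P_n f - Pf| > eps) stays below Hoeffding's 2*exp(-2*n*eps^2)
# when f(Z) = 1_{g(X)!=Y}, i.e. [a,b] = [0,1].
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 100, 0.1, 20000
p = 0.3                                   # Pf, the true error probability of a fixed g

deviations = np.abs(rng.binomial(n, p, size=trials) / n - p)
empirical = np.mean(deviations > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"P(|P_n f - Pf| > {eps}) ~ {empirical:.4f}  <=  Hoeffding bound {hoeffding:.4f}")
```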
Hoeffding’s Inequality
Write
δ = 2 exp( −2nε² / (b − a)² )
Then
P[ |P_n f − Pf| > (b − a) √(log(2/δ) / (2n)) ] ≤ δ
or [Inversion] with probability at least 1 − δ,
|P_n f − Pf| ≤ (b − a) √(log(2/δ) / (2n))
O. Bousquet – Statistical Learning Theory – Lecture 2 32
Hoeffding’s inequality
Let’s apply it to f(Z) = 1_{g(X) ≠ Y}.
For any g, and any δ > 0, with probability at least 1 − δ
R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).    (1)
Notice that one has to consider a fixed function f and the probability is
with respect to the sampling of the data.
If the function depends on the data this does not apply !
O. Bousquet – Statistical Learning Theory – Lecture 2 33
Limitations
• For each fixed function f ∈ F, there is a set S of samples for which
Pf − P_n f ≤ √(log(2/δ) / (2n))    (with P[S] ≥ 1 − δ)
• They may be different for different functions
• The function chosen by the algorithm depends on the sample
⇒ For the observed sample, only some of the functions in J will satisfy
this inequality !
O. Bousquet – Statistical Learning Theory – Lecture 2 34
Limitations
What we need to bound is
Pf_n − P_n f_n
where f_n is the function chosen by the algorithm based on the data.
For any fixed sample, there exists a function f such that
Pf − P_n f = 1
Take the function which is f(X_i) = Y_i on the data and f(X) = −Y everywhere else.
This does not contradict Hoeffding but shows it is not enough
O. Bousquet – Statistical Learning Theory – Lecture 2 35
Limitations
[Figure: true risk R[f] and empirical risk R_emp[f] plotted over the function class; the empirical minimizer f_m differs from the optimum f_opt]
Hoeffding’s inequality quantifies differences for a fixed function
O. Bousquet – Statistical Learning Theory – Lecture 2 36
Uniform Deviations
Before seeing the data, we do not know which function the algorithm
will choose.
The trick is to consider uniform deviations
R(f_n) − R_n(f_n) ≤ sup_{f∈F} (R(f) − R_n(f))
We need a bound which holds simultaneously for all functions in a class
O. Bousquet – Statistical Learning Theory – Lecture 2 37
Union Bound
Consider two functions f_1, f_2 and define
C_i = {(x_1, y_1), . . . , (x_n, y_n) : Pf_i − P_n f_i > ε}
From Hoeffding’s inequality, for each i,
P[C_i] ≤ δ
We want to bound the probability of being ’bad’ for i = 1 or i = 2:
P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2]
O. Bousquet – Statistical Learning Theory – Lecture 2 38
Finite Case
More generally
P[C_1 ∪ . . . ∪ C_N] ≤ Σ_{i=1}^N P[C_i]
We have
P[∃f ∈ {f_1, . . . , f_N} : Pf − P_n f > ε]
≤ Σ_{i=1}^N P[Pf_i − P_n f_i > ε]
≤ N exp(−2nε²)
O. Bousquet – Statistical Learning Theory – Lecture 2 39
Finite Case
We obtain, for G = {g_1, . . . , g_N} and for all δ > 0,
with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ R_n(g) + √((log N + log(1/δ)) / (2n))
This is a generalization bound !
Coding interpretation
log N is the number of bits needed to specify a function in F
O. Bousquet – Statistical Learning Theory – Lecture 2 40
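The capacity term of this finite-class bound is easy to evaluate; the sketch below (illustrative values only, not from the slides) shows how it scales with N and n.

```python
# Sketch: the finite-class capacity term sqrt((log N + log(1/delta)) / (2n)) for a few (N, n).
import numpy as np

def finite_class_bound(n, N, delta=0.05):
    return np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

for N in (10, 1000, 10**6):
    print(f"N={N:>8}: n=100 -> {finite_class_bound(100, N):.3f}, "
          f"n=10000 -> {finite_class_bound(10000, N):.3f}")
```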
Approximation/Estimation
Let
g* = arg min_{g∈G} R(g)
If g_n minimizes the empirical risk in G,
R_n(g*) − R_n(g_n) ≥ 0
Thus
R(g_n) = R(g_n) − R(g*) + R(g*)
       ≤ R_n(g*) − R_n(g_n) + R(g_n) − R(g*) + R(g*)
       ≤ 2 sup_{g∈G} |R(g) − R_n(g)| + R(g*)
O. Bousquet – Statistical Learning Theory – Lecture 2 41
Approximation/Estimation
We obtain, with probability at least 1 − δ,
R(g_n) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n))
The first term decreases if N increases
The second term increases
The size of G controls the trade-off
O. Bousquet – Statistical Learning Theory – Lecture 2 42
Summary (1)
• Inference requires assumptions
- Data sampled i.i.d. from P
- Restrict the possible functions to G
- Choose a sequence of models G_m to have more flexibility/control
O. Bousquet – Statistical Learning Theory – Lecture 2 43
Summary (2)
• Bounds are valid w.r.t. repeated sampling
- For a fixed function g, for most of the samples
R(g) − R_n(g) ≈ 1/√n
- For most of the samples, if |G| = N,
sup_{g∈G} R(g) − R_n(g) ≈ √(log N / n)
⇒ Extra variability because the chosen g_n changes with the data
O. Bousquet – Statistical Learning Theory – Lecture 2 44
Improvements
We obtained
sup_{g∈G} R(g) − R_n(g) ≤ √((log N + log(2/δ)) / (2n))
To be improved
• Hoeffding only uses boundedness, not the variance
• Union bound as bad as if independent
• Supremum is not what the algorithm chooses.
Next we improve the union bound and extend it to the infinite case
O. Bousquet – Statistical Learning Theory – Lecture 2 45
Refined union bound (1)
For each f ∈ F,
P[ Pf − P_n f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f)
P[ ∃f ∈ F : Pf − P_n f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f∈F} δ(f)
Choose δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1
O. Bousquet – Statistical Learning Theory – Lecture 2 46
Refined union bound (2)
With probability at least 1 − δ,
∀f ∈ F, Pf ≤ P_n f + √((log(1/p(f)) + log(1/δ)) / (2n))
• Applies to countably infinite F
• Can put knowledge about the algorithm into p(f)
• But p is chosen before seeing the data
O. Bousquet – Statistical Learning Theory – Lecture 2 47
Refined union bound (3)
• A good p means a good bound. The bound can be improved if you know ahead of time the chosen function (knowledge improves the bound)
• In the infinite case, how to choose p (it implies an ordering of F) ?
• The trick is to look at F through the data
O. Bousquet – Statistical Learning Theory – Lecture 2 48
Lecture 3
Infinite Case: Vapnik-Chervonenkis Theory
• Growth function
• Vapnik-Chervonenkis dimension
• Proof of the VC bound
• VC entropy
• SRM
O. Bousquet – Statistical Learning Theory – Lecture 3 49
Infinite Case
Measure of the size of an infinite class ?
• Consider
F_{z_1,...,z_n} = {(f(z_1), . . . , f(z_n)) : f ∈ F}
The size of this set is the number of possible ways in which the data (z_1, . . . , z_n) can be classified.
• Growth function
S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|
• Note that S_F(n) = S_G(n)
O. Bousquet – Statistical Learning Theory – Lecture 3 50
Infinite Case
• Result (Vapnik-Chervonenkis)
With probability at least 1 − δ
∀g ∈ G, R(g) ≤ R_n(g) + √((log S_G(2n) + log(4/δ)) / (8n))
• Always better than N in the finite case
• How to compute S_G(n) in general ?
⇒ use VC dimension
O. Bousquet – Statistical Learning Theory – Lecture 3 51
VC Dimension
Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2^n
If S_G(n) = 2^n, the class of functions can generate any classification on n points (shattering)
Definition 2. The VC-dimension of G is the largest n such that S_G(n) = 2^n
O. Bousquet – Statistical Learning Theory – Lecture 3 52
VC Dimension
Hyperplanes
In R^d, VC(hyperplanes) = d + 1
O. Bousquet – Statistical Learning Theory – Lecture 3 53
VC Dimension
Number of Parameters
Is VC dimension equal to number of parameters ?
• One parameter
{sgn(sin(tx)) : t ∈ R}
• Infinite VC dimension !
O. Bousquet – Statistical Learning Theory – Lecture 3 54
VC Dimension
• We want to know S_G(n) but we only know
S_G(n) = 2^n for n ≤ h
What happens for n ≥ h ?
[Figure: log S(m) as a function of m, with the VC dimension h marked]
O. Bousquet – Statistical Learning Theory – Lecture 3 55
Vapnik-Chervonenkis-Sauer-Shelah Lemma
Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
S_G(n) ≤ Σ_{i=0}^h (n choose i)
and for all n ≥ h,
S_G(n) ≤ (en/h)^h
⇒ phase transition
O. Bousquet – Statistical Learning Theory – Lecture 3 56
VC Bound
Let G be a class with VC dimension h.
With probability at least 1 − δ
∀g ∈ G, R(g) ≤ R_n(g) + √((h log(2en/h) + log(4/δ)) / (8n))
So the error is of order
√(h log n / n)
O. Bousquet – Statistical Learning Theory – Lecture 3 57
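As an illustration, the following snippet evaluates the capacity term of the VC bound as stated above for a few hypothetical values of h and n; the numbers are assumptions, not from the slides.

```python
# Sketch: the VC capacity term sqrt((h*log(2en/h) + log(4/delta)) / (8n)) for a few (h, n).
import numpy as np

def vc_capacity_term(n, h, delta=0.05):
    return np.sqrt((h * np.log(2 * np.e * n / h) + np.log(4 / delta)) / (8 * n))

for h in (5, 50):
    for n in (1000, 100000):
        print(f"h={h:>3}, n={n:>6}: {vc_capacity_term(n, h):.4f}")
```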
Interpretation
VC dimension: measure of effective dimension
• Depends on geometry of the class
• Gives a natural definition of simplicity (by quantifying the potential
overfitting)
• Not related to the number of parameters
• Finiteness guarantees learnability under any distribution
O. Bousquet – Statistical Learning Theory – Lecture 3 58
Symmetrization (lemma)
Key ingredient in VC bounds: Symmetrization
Let Z'_1, . . . , Z'_n be an independent (ghost) sample and P'_n the corresponding empirical measure.
Lemma 4. For any t > 0 such that nt² ≥ 2,
P[ sup_{f∈F} (P − P_n)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n − P_n)f ≥ t/2 ]
O. Bousquet – Statistical Learning Theory – Lecture 3 59
Symmetrization (proof – 1)
Let f_n be the function achieving the supremum (it depends on Z_1, . . . , Z_n).
1_{(P−P_n)f_n > t} 1_{(P−P'_n)f_n < t/2} = 1_{(P−P_n)f_n > t ∧ (P−P'_n)f_n < t/2} ≤ 1_{(P'_n−P_n)f_n > t/2}
Taking expectations with respect to the second sample gives
1_{(P−P_n)f_n > t} P'[ (P − P'_n)f_n < t/2 ] ≤ P'[ (P'_n − P_n)f_n > t/2 ]
O. Bousquet – Statistical Learning Theory – Lecture 3 60
Symmetrization (proof – 2)
• By Chebyshev’s inequality,
P'[ (P − P'_n)f_n ≥ t/2 ] ≤ 4 Var[f_n] / (nt²) ≤ 1/(nt²)
• Hence
1_{(P−P_n)f_n > t} (1 − 1/(nt²)) ≤ P'[ (P'_n − P_n)f_n > t/2 ]
Take the expectation with respect to the first sample.
O. Bousquet – Statistical Learning Theory – Lecture 3 61
Proof of VC bound (1)
• Symmetrization allows to replace the expectation by the average on the ghost sample
• Function class projected on the double sample: F_{Z_1,...,Z_n,Z'_1,...,Z'_n}
• Union bound on F_{Z_1,...,Z_n,Z'_1,...,Z'_n}
• Variant of Hoeffding’s inequality
P[ P_n f − P'_n f > t ] ≤ 2 e^{−nt²/2}
O. Bousquet – Statistical Learning Theory – Lecture 3 62
Proof of VC bound (2)
P[ sup_{f∈F} (P − P_n)f ≥ t ]
≤ 2 P[ sup_{f∈F} (P'_n − P_n)f ≥ t/2 ]
= 2 P[ sup_{f∈F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (P'_n − P_n)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P'_n − P_n)f ≥ t/2 ]
≤ 4 S_F(2n) e^{−nt²/8}
O. Bousquet – Statistical Learning Theory – Lecture 3 63
VC Entropy (1)
• VC dimension is distribution independent
⇒ The same bound holds for any distribution
⇒ It is loose for most distributions
• A similar proof can give a distribution-dependent result
O. Bousquet – Statistical Learning Theory – Lecture 3 64
VC Entropy (2)
• Denote the size of the projection N(F, z_1, . . . , z_n) := #F_{z_1,...,z_n}
• The VC entropy is defined as
H_F(n) = log E[N(F, Z_1, . . . , Z_n)],
• VC entropy bound: with probability at least 1 − δ
∀g ∈ G, R(g) ≤ R_n(g) + √((H_G(2n) + log(2/δ)) / (8n))
O. Bousquet – Statistical Learning Theory – Lecture 3 65
VC Entropy (proof)
Introduce σ_i ∈ {−1, 1} (probability 1/2), the Rademacher variables.
2 P[ sup_{f∈F_{Z,Z'}} (P'_n − P_n)f ≥ t/2 ]
≤ 2 E[ P_σ[ sup_{f∈F_{Z,Z'}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z') P[ (1/n) Σ_{i=1}^n σ_i ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z') ] e^{−nt²/8}
O. Bousquet – Statistical Learning Theory – Lecture 3 66
From Bounds to Algorithms
• For any distribution, H_G(n)/n → 0 ensures consistency of the empirical risk minimizer (i.e. convergence to the best in the class)
• Does it mean we can learn anything ?
• No, because of the approximation error of the class
• Need to trade-off approximation and estimation error (assessed by the bound)
⇒ Use the bound to control the trade-off
O. Bousquet – Statistical Learning Theory – Lecture 3 67
Structural risk minimization
• Structural risk minimization (SRM) (Vapnik, 1979): minimize the right hand side of
R(g) ≤ R_n(g) + B(h, n).
• To this end, introduce a structure on G.
• Learning machine ≡ a set of functions and an induction principle
O. Bousquet – Statistical Learning Theory – Lecture 3 68
SRM: The Picture
[Figure: SRM — error vs. capacity for a structure S_{n−1} ⊂ S_n ⊂ S_{n+1}; the training error decreases, the capacity term increases, and the bound on the test error is minimized at an intermediate model, near R(f*)]
O. Bousquet – Statistical Learning Theory – Lecture 3 69
Lecture 4
Capacity Measures
• Covering numbers
• Rademacher averages
• Relationships
O. Bousquet – Statistical Learning Theory – Lecture 4 70
Covering numbers
• Define a (random) distance d between functions, e.g.
d(f, f') = (1/n) #{ f(Z_i) ≠ f'(Z_i) : i = 1, . . . , n }
the normalized Hamming distance of the ’projections’ on the sample
• A set f_1, . . . , f_N covers F at radius ε if
F ⊂ ∪_{i=1}^N B(f_i, ε)
• The covering number N(F, ε, n) is the minimum size of a cover of radius ε
Note that N(F, ε, n) = N(G, ε, n).
O. Bousquet – Statistical Learning Theory – Lecture 4 71
Bound with covering numbers
• When the covering numbers are finite, one can approximate the class G by a finite set of functions
• Result
P[ ∃g ∈ G : R(g) − R_n(g) ≥ t ] ≤ 8 E[N(G, t, n)] e^{−nt²/128}
O. Bousquet – Statistical Learning Theory – Lecture 4 72
Covering numbers and VC dimension
• Notice that for all t, N(G, t, n) ≤ #G_Z = N(G, Z)
• Hence log N(G, t, n) ≤ h log(en/h)
• Haussler
N(G, t, n) ≤ C h (4e)^h t^{−h}
• Independent of n
O. Bousquet – Statistical Learning Theory – Lecture 4 73
Refinement
• VC entropy corresponds to log covering numbers at minimal scale
• Covering number bound is a generalization where the scale is adapted
to the error
• Is this the right scale ?
• It turns out that results can be improved by considering all scales (→
chaining)
O. Bousquet – Statistical Learning Theory – Lecture 4 74
Rademacher averages
• Rademacher variables: σ_1, . . . , σ_n independent random variables with
P[σ_i = 1] = P[σ_i = −1] = 1/2
• Notation (randomized empirical measure): R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i)
• Rademacher average: ℛ(F) = E[ sup_{f∈F} R_n f ]
• Conditional Rademacher average: ℛ_n(F) = E_σ[ sup_{f∈F} R_n f ]
O. Bousquet – Statistical Learning Theory – Lecture 4 75
Result
• Distribution dependent
∀f ∈ F, Pf ≤ P_n f + 2ℛ(F) + √(log(1/δ) / (2n)),
• Data dependent
∀f ∈ F, Pf ≤ P_n f + 2ℛ_n(F) + √(2 log(2/δ) / n),
O. Bousquet – Statistical Learning Theory – Lecture 4 76
Concentration
• Hoeffding’s inequality is a concentration inequality
• When n increases, the average is concentrated around the expectation
• Generalization to functions that depend on i.i.d. random variables
exist
O. Bousquet – Statistical Learning Theory – Lecture 4 77
McDiarmid’s Inequality
Assume that for all i = 1, . . . , n,
sup_{z_1,...,z_n,z'_i} |F(z_1, . . . , z_i, . . . , z_n) − F(z_1, . . . , z'_i, . . . , z_n)| ≤ c
then for all ε > 0,
P[ |F − E[F]| > ε ] ≤ 2 exp( −2ε² / (nc²) )
O. Bousquet – Statistical Learning Theory – Lecture 4 78
Proof of Rademacher average bounds
• Use concentration to relate sup_{f∈F} Pf − P_n f to its expectation
• Use symmetrization to relate the expectation to the Rademacher average
• Use concentration again to relate the Rademacher average to the conditional one
O. Bousquet – Statistical Learning Theory – Lecture 4 79
Application (1)
sup_{f∈F} (A(f) + B(f)) ≤ sup_{f∈F} A(f) + sup_{f∈F} B(f)
Hence
| sup_{f∈F} C(f) − sup_{f∈F} A(f) | ≤ sup_{f∈F} (C(f) − A(f))
and this gives
| sup_{f∈F} (Pf − P_n f) − sup_{f∈F} (Pf − P'_n f) | ≤ sup_{f∈F} (P'_n f − P_n f)
O. Bousquet – Statistical Learning Theory – Lecture 4 80
Application (2)
f ∈ {0, 1}, hence when Z_i is replaced by Z'_i,
P'_n f − P_n f = (1/n)(f(Z'_i) − f(Z_i)) ≤ 1/n
thus
| sup_{f∈F} (Pf − P_n f) − sup_{f∈F} (Pf − P'_n f) | ≤ 1/n
McDiarmid’s inequality can be applied with c = 1/n
O. Bousquet – Statistical Learning Theory – Lecture 4 81
Symmetrization (1)
• Upper bound
E[ sup_{f∈F} Pf − P_n f ] ≤ 2 E[ sup_{f∈F} R_n f ]
• Lower bound
E[ sup_{f∈F} |Pf − P_n f| ] ≥ (1/2) E[ sup_{f∈F} R_n f ] − 1/(2√n)
O. Bousquet – Statistical Learning Theory – Lecture 4 82
Symmetrization (2)
E[ sup_{f∈F} Pf − P_n f ]
= E[ sup_{f∈F} E[P'_n f] − P_n f ]
≤ E_{Z,Z'}[ sup_{f∈F} P'_n f − P_n f ]
= E_{σ,Z,Z'}[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) ]
≤ 2 E[ sup_{f∈F} R_n f ]
O. Bousquet – Statistical Learning Theory – Lecture 4 83
Loss class and initial class
ℛ(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i) ≠ Y_i} ]
     = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 − Y_i g(X_i))/2 ]
     = (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i Y_i g(X_i) ]
     = (1/2) ℛ(G)
O. Bousquet – Statistical Learning Theory – Lecture 4 84
Computing Rademacher averages (1)
(1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(X_i) ]
= 1/2 + E[ sup_{g∈G} (1/n) Σ_{i=1}^n −(1 − σ_i g(X_i))/2 ]
= 1/2 − E[ inf_{g∈G} (1/n) Σ_{i=1}^n (1 − σ_i g(X_i))/2 ]
= 1/2 − E[ inf_{g∈G} R_n(g, σ) ]
O. Bousquet – Statistical Learning Theory – Lecture 4 85
Computing Rademacher averages (2)
• Not harder than computing the empirical risk minimizer
• Pick σ_i randomly and minimize the error with respect to the labels σ_i
• Intuition: measures how much the class can fit random noise
• Large class ⇒ ℛ(G) = 1/2 (a small simulation sketch follows)
O. Bousquet – Statistical Learning Theory – Lecture 4 86
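A small simulation sketch of this recipe, assuming a toy class of threshold classifiers on [0,1]: it estimates 1/2 − E_σ[inf_g R_n(g, σ)], i.e. (1/2)ℛ_n(G), by repeatedly fitting random labels. The class, sample and number of repetitions are illustrative assumptions.

```python
# Sketch: estimate (1/2) * R_n(G) = 1/2 - E_sigma[ inf_g R_n(g, sigma) ] for threshold classifiers.
import numpy as np

rng = np.random.default_rng(2)
X = np.sort(rng.random(50))                     # fixed sample x_1..x_n
thresholds = np.linspace(0, 1, 101)             # class G: g_t(x) = sign(x - t), both orientations

def best_fit_error(sigma):
    # inf over g in G of the empirical error w.r.t. the random labels sigma
    preds = np.where(X[None, :] > thresholds[:, None], 1, -1)
    errs = np.mean(preds != sigma[None, :], axis=1)
    return min(errs.min(), (1 - errs).min())    # allow either orientation of the threshold

sigmas = rng.choice([-1, 1], size=(200, X.size))
half_rademacher = 0.5 - np.mean([best_fit_error(s) for s in sigmas])
print(f"estimated (1/2) R_n(G) ~ {half_rademacher:.3f}")
```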
Concentration again
• Let
F = E_σ[ sup_{f∈F} R_n f ]
the expectation being with respect to the σ_i only, with (X_i, Y_i) fixed.
• F satisfies McDiarmid’s assumptions with c = 1/n
⇒ E[F] = ℛ(F) can be estimated by F = ℛ_n(F)
O. Bousquet – Statistical Learning Theory – Lecture 4 87
Relationship with VC dimension
• For a finite set F = {f_1, . . . , f_N}
ℛ(F) ≤ 2 √(log N / n)
• Consequence for a VC class F with dimension h
ℛ(F) ≤ 2 √(h log(en/h) / n).
⇒ Recovers VC bound with a concentration proof
O. Bousquet – Statistical Learning Theory – Lecture 4 88
Chaining
• Using covering numbers at all scales, the geometry of the class is
better captured
• Dudley
ℛ_n(F) ≤ (C/√n) ∫_0^∞ √(log N(F, t, n)) dt
• Consequence
ℛ(F) ≤ C √(h/n)
• Removes the unnecessary log n factor !
O. Bousquet – Statistical Learning Theory – Lecture 4 89
Lecture 5
Advanced Topics
• Relative error bounds
• Noise conditions
• Localized Rademacher averages
• PAC-Bayesian bounds
O. Bousquet – Statistical Learning Theory – Lecture 5 90
Binomial tails
• P_n f ∼ B(p, n) binomial distribution, with p = Pf
• P[Pf − P_n f ≥ t] = Σ_{k=0}^{⌊n(p−t)⌋} (n choose k) p^k (1 − p)^{n−k}
• Can be upper bounded:
Exponential: ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}
Bennett: exp( −(np/(1−p)) ((1 − t/p) log(1 − t/p) + t/p) )
Bernstein: exp( −nt² / (2p(1−p) + 2t/3) )
Hoeffding: exp(−2nt²)
O. Bousquet – Statistical Learning Theory – Lecture 5 91
Tail behavior
• For small deviations, Gaussian behavior ≈ exp(−nt²/(2p(1−p)))
⇒ Gaussian with variance p(1−p)
• For large deviations, Poisson behavior ≈ exp(−3nt/2)
⇒ Tails heavier than Gaussian
• Can upper bound with a Gaussian with the largest (maximum) variance: exp(−2nt²)
O. Bousquet – Statistical Learning Theory – Lecture 5 92
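The sketch below (illustrative parameters, not from the slides) compares the exact binomial tail with the Bernstein and Hoeffding upper bounds, making the effect of the variance visible.

```python
# Sketch: exact binomial tail P[Pf - P_n f >= t] vs. Bernstein and Hoeffding bounds.
import numpy as np
from math import comb, exp

n, p = 100, 0.1
for t in (0.02, 0.05, 0.1):
    k_max = int(np.floor(n * (p - t)))
    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1)) if k_max >= 0 else 0.0
    bernstein = exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))
    hoeffding = exp(-2 * n * t**2)
    print(f"t={t}: exact={exact:.2e}, Bernstein={bernstein:.2e}, Hoeffding={hoeffding:.2e}")
```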
Illustration (1)
Maximum variance (p = 0.5)
[Figure: deviation bound vs. x — binomial CDF/tail compared with the Bernstein, Bennett, and best Gaussian bounds]
O. Bousquet – Statistical Learning Theory – Lecture 5 93
Illustration (2)
Small variance (p = 0.1)
[Figure: deviation bound vs. x — binomial tail compared with the sub-Gaussian, Bernstein, Bennett, and best Gaussian bounds, with a zoom on the tail region]
O. Bousquet – Statistical Learning Theory – Lecture 5 94
Taking the variance into account (1)
• Each function f ∈ F has a different variance Pf(1 − Pf) ≤ Pf.
• For each f ∈ F, by Bernstein’s inequality
Pf ≤ P_n f + √(2 Pf log(1/δ) / n) + 2 log(1/δ) / (3n)
• The Gaussian part dominates (for Pf not too small, or n large enough); it depends on Pf
O. Bousquet – Statistical Learning Theory – Lecture 5 95
Taking the variance into account (2)
• Central Limit Theorem
√n (Pf − P_n f) / √(Pf(1 − Pf)) → N(0, 1)
⇒ Idea is to consider the ratio
(Pf − P_n f) / √(Pf)
O. Bousquet – Statistical Learning Theory – Lecture 5 96
Normalization
• Here (f ∈ {0, 1}), Var[f] ≤ P(f²) = Pf
• Large variance ⇒ large risk.
• After normalization, fluctuations are more ”uniform”:
sup_{f∈F} (Pf − P_n f) / √(Pf)
is not necessarily attained at functions with large variance.
• Focus of learning: functions with small error Pf (hence small variance).
⇒ The normalized supremum takes this into account.
O. Bousquet – Statistical Learning Theory – Lecture 5 97
Relative deviations
Vapnik-Chervonenkis 1974
For δ > 0, with probability at least 1 − δ,
∀f ∈ F, (Pf − P_n f) / √(Pf) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
and
∀f ∈ F, (P_n f − Pf) / √(P_n f) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
O. Bousquet – Statistical Learning Theory – Lecture 5 98
Proof sketch
1. Symmetrization
P[ sup_{f∈F} (Pf − P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n f − P_n f)/√((P_n f + P'_n f)/2) ≥ t ]
2. Randomization
= 2 E[ P_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) − f(Z_i)) / √((P_n f + P'_n f)/2) ≥ t ] ]
3. Tail bound
O. Bousquet – Statistical Learning Theory – Lecture 5 99
Consequences
From the fact
A ≤ B + C√A ⇒ A ≤ B + C² + C√B
we get
∀f ∈ F, Pf ≤ P_n f + 2 √(P_n f (log S_F(2n) + log(4/δ)) / n) + 4 (log S_F(2n) + log(4/δ)) / n
O. Bousquet – Statistical Learning Theory – Lecture 5 100
Zero noise
Ideal situation
• g_n empirical risk minimizer
• t ∈ G
• R* = 0 (no noise, n(X) = 0 a.s.)
In that case
• R_n(g_n) = 0
⇒ R(g_n) = O(d log n / n).
O. Bousquet – Statistical Learning Theory – Lecture 5 101
Interpolating between rates ?
• Rates are not correctly estimated by this inequality
• Consequence of the relative error bounds
Pf_n ≤ Pf* + 2 √(Pf* (log S_F(2n) + log(4/δ)) / n) + 4 (log S_F(2n) + log(4/δ)) / n
• The quantity which is small is not Pf* but Pf_n − Pf*
• But relative error bounds do not apply to differences
O. Bousquet – Statistical Learning Theory – Lecture 5 102
Definitions
• P = P_X × P(Y|X)
• regression function η(x) = E[Y|X = x]
• target function t(x) = sgn η(x)
• noise level n(x) = (1 − |η(x)|)/2
• Bayes risk R* = E[n(X)]
• R(g) = E[(1 − η(X))/2] + E[η(X) 1_{g(X)≤0}]
• R(g) − R* = E[|η(X)| 1_{g(X)η(X)≤0}]
O. Bousquet – Statistical Learning Theory – Lecture 5 103
Intermediate noise
Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0), the deterministic case, one can assume that n is well-behaved.
Two kinds of assumptions
• n not too close to 1/2
• n not often too close to 1/2
O. Bousquet – Statistical Learning Theory – Lecture 5 104
Massart Condition
• For some c > 0, assume
|η(X)| > 1/c almost surely
• There is no region where the decision is completely random
• Noise bounded away from 1/2
O. Bousquet – Statistical Learning Theory – Lecture 5 105
Tsybakov Condition
Let α ∈ [0, 1]. The following conditions are equivalent:
(1) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α
(2) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c (∫_A |η(x)| dP(x))^α
(3) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}
O. Bousquet – Statistical Learning Theory – Lecture 5 106
Equivalence
• (1) ⇔ (2) Recall R(g) − R* = E[|η(X)| 1_{gη≤0}]. For each function g, there exists a set A such that 1_A = 1_{gη≤0}
• (2) ⇒ (3) Let A = {x : |η(x)| ≤ t}
P[|η| ≤ t] = ∫_A dP(x) ≤ c (∫_A |η(x)| dP(x))^α ≤ c t^α (∫_A dP(x))^α
⇒ P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}
O. Bousquet – Statistical Learning Theory – Lecture 5 107
• (3) ⇒ (1)
R(g) − R* = E[|η(X)| 1_{gη≤0}]
≥ t E[1_{gη≤0} 1_{|η|>t}]
= t P[|η| > t] − t E[1_{gη>0} 1_{|η|>t}]
≥ t (1 − B t^{α/(1−α)}) − t P[gη > 0] = t (P[gη ≤ 0] − B t^{α/(1−α)})
Take t = ((1−α) P[gη ≤ 0] / B)^{(1−α)/α}
⇒ P[gη ≤ 0] ≤ B^{1−α} / ((1−α)^{1−α} α^α) (R(g) − R*)^α
O. Bousquet – Statistical Learning Theory – Lecture 5 108
Remarks
• α is in [0, 1] because
R(g) − R* = E[|η(X)| 1_{gη≤0}] ≤ E[1_{gη≤0}]
• α = 0: no condition
• α = 1 gives Massart’s condition
O. Bousquet – Statistical Learning Theory – Lecture 5 109
Consequences
• Under Massart’s condition
E[(1_{g(X)≠Y} − 1_{t(X)≠Y})²] ≤ c (R(g) − R*)
• Under Tsybakov’s condition
E[(1_{g(X)≠Y} − 1_{t(X)≠Y})²] ≤ c (R(g) − R*)^α
O. Bousquet – Statistical Learning Theory – Lecture 5 110
Relative loss class
• F is the loss class associated to G
• The relative loss class is defined as
F̃ = {f − f* : f ∈ F}
• It satisfies
Pf² ≤ c (Pf)^α
O. Bousquet – Statistical Learning Theory – Lecture 5 111
Finite case
• A union bound on F̃ with Bernstein’s inequality would give
Pf_n − Pf* ≤ P_n f_n − P_n f* + √(8c (Pf_n − Pf*)^α log(N/δ) / n) + 4 log(N/δ) / (3n)
• Consequence when f* ∈ F (but R* > 0)
Pf_n − Pf* ≤ C (log(N/δ) / n)^{1/(2−α)}
which is always better than n^{−1/2} for α > 0
O. Bousquet – Statistical Learning Theory – Lecture 5 112
Local Rademacher average
• Definition
ℛ(F, r) = E[ sup_{f∈F : Pf² ≤ r} R_n f ]
• Allows to generalize the previous result
• Computes the capacity of a small ball in F (functions with small variance)
• Under noise conditions, small variance implies small error
O. Bousquet – Statistical Learning Theory – Lecture 5 113
Sub-root functions
Definition
A function ψ : R → R is sub-root if
• ψ is non-decreasing
• ψ is non-negative
• ψ(r)/√r is non-increasing
O. Bousquet – Statistical Learning Theory – Lecture 5 114
Sub-root functions
Properties
A sub-root function
• is continuous
• has a unique fixed point: ψ(r*) = r*
[Figure: a sub-root function ψ(x) crossing the line y = x at its unique fixed point]
O. Bousquet – Statistical Learning Theory – Lecture 5 115
Star hull
• Definition
⋆F = {αf : f ∈ F, α ∈ [0, 1]}
• Properties
ℛ_n(⋆F, r) is sub-root
• The entropy of ⋆F is not much bigger than the entropy of F
O. Bousquet – Statistical Learning Theory – Lecture 5 116
Result
• Let r* be the fixed point of ℛ(F, r)
• Bounded functions
Pf − P_n f ≤ C ( √(r* Var[f]) + (log(1/δ) + log log n) / n )
• Consequence when the variance is related to the expectation (Var[f] ≤ c (Pf)^β)
Pf ≤ C ( P_n f + (r*)^{1/(2−β)} + (log(1/δ) + log log n) / n )
O. Bousquet – Statistical Learning Theory – Lecture 5 117
Consequences
• For VC classes ℛ(F, r) ≤ C √(rh/n), hence r* ≤ C h/n
• Rate of convergence of P_n f to Pf in O(1/√n)
• But the rate of convergence of Pf_n to Pf* is O(1/n^{1/(2−α)})
The only condition is t ∈ G, but it can be removed by SRM/model selection
O. Bousquet – Statistical Learning Theory – Lecture 5 118
Proof sketch (1)
• Talagrand’s inequality
sup_{f∈F} Pf − P_n f ≤ E[ sup_{f∈F} Pf − P_n f ] + c √(sup_{f∈F} Var[f] / n) + c'/n
• Peeling of the class
F_k = {f : Var[f] ∈ [x^k, x^{k+1})}
O. Bousquet – Statistical Learning Theory – Lecture 5 119
Proof sketch (2)
• Application
sup_{f∈F_k} Pf − P_n f ≤ E[ sup_{f∈F_k} Pf − P_n f ] + c √(x Var[f] / n) + c'/n
• Symmetrization
∀f ∈ F, Pf − P_n f ≤ 2 ℛ(F, x Var[f]) + c √(x Var[f] / n) + c'/n
O. Bousquet – Statistical Learning Theory – Lecture 5 120
Proof sketch (3)
• We need to ’solve’ this inequality. Things are simple if ℛ behaves like a square root, hence the sub-root property
Pf − P_n f ≤ 2 √(r* Var[f]) + c √(x Var[f] / n) + c'/n
• Variance-expectation
Var[f] ≤ c (Pf)^α
Solve in Pf
O. Bousquet – Statistical Learning Theory – Lecture 5 121
Data-dependent version
• As in the global case, one can use data-dependent local Rademacher averages
ℛ_n(F, r) = E_σ[ sup_{f∈F : Pf² ≤ r} R_n f ]
• Using concentration one can also get
Pf ≤ C ( P_n f + (r*_n)^{1/(2−α)} + (log(1/δ) + log log n) / n )
where r*_n is the fixed point of a sub-root upper bound of ℛ_n(F, r)
O. Bousquet – Statistical Learning Theory – Lecture 5 122
Discussion
• Improved rates under low noise conditions
• Interpolation in the rates
• The capacity measure seems ’local’,
• but it depends on all the functions,
• after appropriate rescaling: each f ∈ F is considered at scale r/Pf²
O. Bousquet – Statistical Learning Theory – Lecture 5 123
Randomized Classifiers
Given a class G of functions
• Deterministic: pick a function g_n and always use it to predict
• Randomized:
construct a distribution ρ_n over G
for each instance to classify, pick g ∼ ρ_n
• Error is averaged over ρ_n
R(ρ_n) = ρ_n Pf
R_n(ρ_n) = ρ_n P_n f
O. Bousquet – Statistical Learning Theory – Lecture 5 124
Union Bound (1)
Let π be a (fixed) distribution over F.
• Recall the refined union bound
∀f ∈ F, Pf − P_n f ≤ √((log(1/π(f)) + log(1/δ)) / (2n))
• Take the expectation with respect to ρ_n
ρ_n Pf − ρ_n P_n f ≤ ρ_n √((log(1/π(f)) + log(1/δ)) / (2n))
O. Bousquet – Statistical Learning Theory – Lecture 5 125
Union Bound (2)
ρ_n Pf − ρ_n P_n f ≤ ρ_n √((−log π(f) + log(1/δ)) / (2n))
≤ √((−ρ_n log π(f) + log(1/δ)) / (2n))
= √((K(ρ_n, π) + H(ρ_n) + log(1/δ)) / (2n))
• K(ρ_n, π) = ∫ ρ_n(f) log(ρ_n(f)/π(f)) df is the Kullback-Leibler divergence
• H(ρ_n) = −∫ ρ_n(f) log ρ_n(f) df is the entropy
O. Bousquet – Statistical Learning Theory – Lecture 5 126
PAC-Bayesian Refinement
• It is possible to improve the previous bound.
• With probability at least 1 − δ,
ρ_n Pf − ρ_n P_n f ≤ √((K(ρ_n, π) + log 4n + log(1/δ)) / (2n − 1))
• Good if ρ_n is spread (i.e. large entropy)
• Not interesting if ρ_n = δ_{f_n}
O. Bousquet – Statistical Learning Theory – Lecture 5 127
Proof (1)
• Variational formulation of entropy: for any T,
ρ T(f) ≤ log π e^{T(f)} + K(ρ, π)
• Apply it to λ(Pf − P_n f)²
λ ρ_n (Pf − P_n f)² ≤ log π e^{λ(Pf−P_n f)²} + K(ρ_n, π)
• Markov’s inequality: with probability 1 − δ,
λ ρ_n (Pf − P_n f)² ≤ log E[π e^{λ(Pf−P_n f)²}] + K(ρ_n, π) + log(1/δ)
O. Bousquet – Statistical Learning Theory – Lecture 5 128
Proof (2)
• Fubini
E[π e^{λ(Pf−P_n f)²}] = π E[e^{λ(Pf−P_n f)²}]
• Modified Chernoff bound
E[e^{(2n−1)(Pf−P_n f)²}] ≤ 4n
• Putting things together (λ = 2n − 1)
(2n − 1) ρ_n (Pf − P_n f)² ≤ K(ρ_n, π) + log 4n + log(1/δ)
• Jensen: (2n − 1)(ρ_n (Pf − P_n f))² ≤ (2n − 1) ρ_n (Pf − P_n f)²
O. Bousquet – Statistical Learning Theory – Lecture 5 129
Lecture 6
Loss Functions
• Properties
• Consistency
• Examples
• Losses and noise
O. Bousquet – Statistical Learning Theory – Lecture 6 130
Motivation (1)
• ERM: minimize
Σ_{i=1}^n 1_{g(X_i) ≠ Y_i}
in a set G
⇒ Computationally hard
⇒ Smoothing
Replace binary by real-valued functions
Introduce a smooth loss function
Σ_{i=1}^n ℓ(g(X_i), Y_i)
O. Bousquet – Statistical Learning Theory – Lecture 6 131
Motivation (2)
• Hyperplanes in infinite dimension have
infinite VC-dimension
but finite scale-sensitive dimension (to be defined later)
⇒ It is good to have a scale
⇒ This scale can be used to give a confidence (i.e. estimate the density)
• However, losses do not need to be related to densities
• Can get bounds in terms of margin error instead of empirical error
(smoother → easier to optimize for model selection)
O. Bousquet – Statistical Learning Theory – Lecture 6 132
Margin
• It is convenient to work with (symmetry of +1 and −1)
ℓ(g(x), y) = φ(yg(x))
• yg(x) is the margin of g at (x, y)
• Loss
L(g) = E[φ(Y g(X))],   L_n(g) = (1/n) Σ_{i=1}^n φ(Y_i g(X_i))
• Loss class F = {f : (x, y) ↦ φ(yg(x)) : g ∈ G}
O. Bousquet – Statistical Learning Theory – Lecture 6 133
Minimizing the loss
• Decomposition of L(g)
(1/2) E[ E[(1 + η(X)) φ(g(X)) + (1 − η(X)) φ(−g(X)) | X] ]
• Minimization for each x
H(η) = inf_{α∈R} ((1 + η) φ(α)/2 + (1 − η) φ(−α)/2)
• L* := inf_g L(g) = E[H(η(X))]
O. Bousquet – Statistical Learning Theory – Lecture 6 134
Classification-calibrated
• A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t, or that of η).
• Definition
φ is classification-calibrated if, for any η ≠ 0,
inf_{α : αη≤0} (1 + η) φ(α) + (1 − η) φ(−α) > inf_{α∈R} (1 + η) φ(α) + (1 − η) φ(−α)
• This means the infimum is achieved for an α of the correct sign (and not for an α of the wrong sign, except possibly for η = 0).
O. Bousquet – Statistical Learning Theory – Lecture 6 135
Consequences (1)
Results due to (Jordan, Bartlett and McAuliffe 2003)
• φ is classification-calibrated iff for all sequences g_i and every probability distribution P,
L(g_i) → L* ⇒ R(g_i) → R*
• When φ is convex (convenient for optimization), φ is classification-calibrated iff it is differentiable at 0 and φ'(0) < 0
O. Bousquet – Statistical Learning Theory – Lecture 6 136
Consequences (2)
• Let H^−(η) = inf_{α : αη≤0} ((1 + η) φ(α)/2 + (1 − η) φ(−α)/2)
• Let ψ(η) be the largest convex function below H^−(η) − H(η)
• One has
ψ(R(g) − R*) ≤ L(g) − L*
O. Bousquet – Statistical Learning Theory – Lecture 6 137
Examples (1)
[Figure: the 0–1 loss together with the hinge, squared hinge, square, and exponential losses, plotted as functions of the margin]
O. Bousquet – Statistical Learning Theory – Lecture 6 138
Examples (2)
• Hinge loss
φ(x) = max(0, 1 − x), ψ(x) = x
• Squared hinge loss
φ(x) = max(0, 1 − x)², ψ(x) = x²
• Square loss
φ(x) = (1 − x)², ψ(x) = x²
• Exponential
φ(x) = exp(−x), ψ(x) = 1 − √(1 − x²)
O. Bousquet – Statistical Learning Theory – Lecture 6 139
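For reference, a short sketch (not from the slides) evaluating these surrogate losses as functions of the margin m = y g(x); the margin values are arbitrary.

```python
# Sketch: the surrogate losses above as functions of the margin m = y*g(x).
import numpy as np

def hinge(m):          return np.maximum(0.0, 1.0 - m)
def squared_hinge(m):  return np.maximum(0.0, 1.0 - m) ** 2
def square(m):         return (1.0 - m) ** 2
def exponential(m):    return np.exp(-m)
def zero_one(m):       return (m <= 0).astype(float)

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, phi in [("0-1", zero_one), ("hinge", hinge), ("sq. hinge", squared_hinge),
                  ("square", square), ("exp", exponential)]:
    print(f"{name:>9}: {phi(margins)}")
```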
Low noise conditions
• The relationship can be improved under low noise conditions
• Under Tsybakov’s condition with exponent α and constant c,
c (R(g) − R*)^α ψ((R(g) − R*)^{1−α} / (2c)) ≤ L(g) − L*
• Hinge loss (no improvement)
R(g) − R* ≤ L(g) − L*
• Square loss or squared hinge loss
R(g) − R* ≤ (4c (L(g) − L*))^{1/(2−α)}
O. Bousquet – Statistical Learning Theory – Lecture 6 140
Estimation error
• Recall that Tsybakov’s condition implies Pf² ≤ c (Pf)^α for the relative loss class (with 0−1 loss)
• What happens for the relative loss class associated to φ ?
• Two possibilities
Strictly convex loss (can modify the metric on R)
Piecewise linear loss
O. Bousquet – Statistical Learning Theory – Lecture 6 141
Strictly convex losses
• Noise behavior controlled by the modulus of convexity
• Result
δ(√(Pf²) / K) ≤ Pf/2
with K the Lipschitz constant of φ and δ the modulus of convexity of L(g) with respect to ‖f − g‖_{L₂(P)}
• Not related to the noise exponent
O. Bousquet – Statistical Learning Theory – Lecture 6 142
Piecewise linear losses
• Noise behavior related to the noise exponent
• Result for the hinge loss
Pf² ≤ C Pf^α
if the initial class G is uniformly bounded
O. Bousquet – Statistical Learning Theory – Lecture 6 143
Estimation error
• With a bounded and Lipschitz loss with convexity exponent γ, for a convex class G,
L(g) − L(g*) ≤ C ( (r*)^{2/γ} + (log(1/δ) + log log n) / n )
• Under Tsybakov’s condition, for the hinge loss (and general G), Pf² ≤ C Pf^α and
L(g) − L(g*) ≤ C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )
O. Bousquet – Statistical Learning Theory – Lecture 6 144
Examples
Under Tsybakov’s condition
• Hinge loss
R(g) − R* ≤ L(g*) − L* + C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )
• Squared hinge loss or square loss (δ(x) = cx², Pf² ≤ C Pf)
R(g) − R* ≤ C ( L(g*) − L* + C' (r* + (log(1/δ) + log log n) / n) )^{1/(2−α)}
O. Bousquet – Statistical Learning Theory – Lecture 6 145
Classification vs Regression losses
• Consider a classification-calibrated function φ
• It is a classification loss if L(t) = L*
• Otherwise it is a regression loss
O. Bousquet – Statistical Learning Theory – Lecture 6 146
Classification vs Regression losses
• Square, squared hinge, exponential losses
Noise enters relationship between risk and loss
Modulus of convexity enters in estimation error
• Hinge loss
Direct relationship between risk and loss
Noise enters in estimation error
⇒ Approximation term not affected by noise in second case
⇒ Real value does not bring probability information in second case
O. Bousquet – Statistical Learning Theory – Lecture 6 147
Lecture 7
Regularization
• Formulation
• Capacity measures
• Computing Rademacher averages
• Applications
O. Bousquet – Statistical Learning Theory – Lecture 7 148
Equivalent problems
Up to the choice of the regularization parameters, the following
problems are equivalent
min_{f∈F} L_n(f) + λ Ω(f)
min_{f∈F : L_n(f) ≤ e} Ω(f)
min_{f∈F : Ω(f) ≤ R} L_n(f)
The solution sets are the same.
O. Bousquet – Statistical Learning Theory – Lecture 7 149
Comments
• Computationally, variant of SRM
• variant of model selection by penalization
⇒ one has to choose a regularizer which makes sense
• Need a class that is large enough (for universal consistency)
• but has small balls
O. Bousquet – Statistical Learning Theory – Lecture 7 150
Rates
• To obtain bounds, consider ERM on balls
• Relevant capacity is that of balls
• Real-valued functions, need a generalization of VC dimension, entropy
or covering numbers
• Involve scale sensitive capacity (takes into account the value and not
only the sign)
O. Bousquet – Statistical Learning Theory – Lecture 7 151
Scale-sensitive capacity
• Generalization of VC entropy and VC dimension to real-valued functions
• Definition: a set x_1, . . . , x_n is shattered by F (at scale ε) if there exists a function s such that for all choices of α_i ∈ {−1, 1} there exists f ∈ F with
α_i (f(x_i) − s(x_i)) ≥ ε
• The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is the maximum cardinality of a shattered set
O. Bousquet – Statistical Learning Theory – Lecture 7 152
Link with covering numbers
• Like VC dimension, fat-shattering dimension can be used to upper
bound covering numbers
• Result
N(F, t, n) ≤ (C_1/t)^{C_2 · vc(F, C_3 t)}
• Note that one can also define data-dependent versions (restriction on
the sample)
O. Bousquet – Statistical Learning Theory – Lecture 7 153
Link with Rademacher averages (1)
• Consequence of the covering number estimates
ℛ_n(F) ≤ (C_1/√n) ∫_0^∞ √(vc(F, t) log(C_2/t)) dt
• Another link via Gaussian averages (replace the Rademacher variables by Gaussian N(0,1) variables)
G_n(F) = E_g[ sup_{f∈F} (1/n) Σ_{i=1}^n g_i f(Z_i) ]
O. Bousquet – Statistical Learning Theory – Lecture 7 154
Link with Rademacher averages (2)
• Worst case average
ℓ_n(F) = sup_{x_1,...,x_n} E_σ[ sup_{f∈F} G_n f ]
• Associated ”dimension”: t(F, ε) = sup{n ∈ N : ℓ_n(F) ≥ ε}
• Result (Mendelson & Vershynin 2003)
vc(F, cε) ≤ t(F, ε) ≤ (K/ε²) vc(F, cε)
O. Bousquet – Statistical Learning Theory – Lecture 7 155
Rademacher averages and Lipschitz losses
• What matters is the capacity of F (the loss class)
• If φ is Lipschitz with constant M,
• then
ℛ_n(F) ≤ M ℛ_n(G)
• This relates to the Rademacher average of the initial class (easier to compute)
O. Bousquet – Statistical Learning Theory – Lecture 7 156
Dualization
• Consider the problem min_{‖g‖≤R} L_n(g)
• Rademacher average of the ball
E_σ[ sup_{‖g‖≤R} R_n g ]
• Duality
E_σ[ sup_{‖f‖≤R} R_n f ] = (R/n) E_σ[ ‖ Σ_{i=1}^n σ_i δ_{X_i} ‖_* ]
with ‖·‖_* the dual norm and δ_{X_i} the evaluation at X_i (an element of the dual under appropriate conditions)
O. Bousquet – Statistical Learning Theory – Lecture 7 157
RKHS
Given a positive definite kernel k
• Space of functions: the reproducing kernel Hilbert space associated to k
• Regularizer: the RKHS norm ‖·‖_k
• Properties: Representer theorem
g_n = Σ_{i=1}^n α_i k(X_i, ·)
O. Bousquet – Statistical Learning Theory – Lecture 7 158
Shattering dimension of hyperplanes
• Set of functions
G = {g(x) = w · x : ‖w‖ = 1}
• Assume ‖x‖ ≤ R
• Result
vc(G, ρ) ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 7 159
Proof Strategy (Gurvits, 1997)
Assume that x_1, . . . , x_r are ρ-shattered by hyperplanes with ‖w‖ = 1, i.e., for all y_1, . . . , y_r ∈ {±1}, there exists a w such that
y_i ⟨w, x_i⟩ ≥ ρ for all i = 1, . . . , r.    (2)
Two steps:
• prove that the more points we want to shatter (2), the larger ‖Σ_{i=1}^r y_i x_i‖ must be
• upper bound the size of ‖Σ_{i=1}^r y_i x_i‖ in terms of R
Combining the two tells us how many points we can at most shatter.
O. Bousquet – Statistical Learning Theory – Lecture 7 160
Part I
• Summing (2) yields ⟨w, Σ_{i=1}^r y_i x_i⟩ ≥ rρ
• By the Cauchy-Schwarz inequality
⟨w, Σ_{i=1}^r y_i x_i⟩ ≤ ‖w‖ ‖Σ_{i=1}^r y_i x_i‖ = ‖Σ_{i=1}^r y_i x_i‖
• Combine both:
rρ ≤ ‖Σ_{i=1}^r y_i x_i‖.    (3)
O. Bousquet – Statistical Learning Theory – Lecture 7 161
Part II
Consider the labels y_i ∈ {±1} as (Rademacher) variables.
E[ ‖Σ_{i=1}^r y_i x_i‖² ] = E[ Σ_{i,j=1}^r y_i y_j ⟨x_i, x_j⟩ ]
= Σ_{i=1}^r E[⟨x_i, x_i⟩] + E[ Σ_{i≠j} y_i y_j ⟨x_i, x_j⟩ ]
= Σ_{i=1}^r ‖x_i‖²
O. Bousquet – Statistical Learning Theory – Lecture 7 162
Part II, ctd.
• Since ‖x_i‖ ≤ R, we get E[ ‖Σ_{i=1}^r y_i x_i‖² ] ≤ rR².
• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
• Hence
‖Σ_{i=1}^r y_i x_i‖² ≤ rR².
O. Bousquet – Statistical Learning Theory – Lecture 7 163
Part I and II Combined
• Part I: (rρ)² ≤ ‖Σ_{i=1}^r y_i x_i‖²
• Part II: ‖Σ_{i=1}^r y_i x_i‖² ≤ rR²
• Hence
r²ρ² ≤ rR²,
i.e.,
r ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 7 164
Boosting
Given a class H of functions
• Space of functions: the linear span of H
• Regularizer: 1-norm of the weights ‖g‖ = inf{ Σ|α_i| : g = Σ α_i h_i }
• Properties: the weight concentrates on the (weighted) margin maximizers
g_n = Σ_h w_h h, with w_h ≠ 0 only for h such that Σ_i d_i Y_i h(X_i) = min_{h'∈H} Σ_i d_i Y_i h'(X_i)
O. Bousquet – Statistical Learning Theory – Lecture 7 165
Rademacher averages for boosting
• Function class of interest
G_R = {g ∈ span H : ‖g‖_1 ≤ R}
• Result
ℛ_n(G_R) = R ℛ_n(H)
⇒ The capacity (as measured by global Rademacher averages) is not affected by taking linear combinations !
O. Bousquet – Statistical Learning Theory – Lecture 7 166
Lecture 8
SVM
• Computational aspects
• Capacity Control
• Universality
• Special case of RBF kernel
O. Bousquet – Statistical Learning Theory – Lecture 8 167
Formulation (1)
• Soft margin
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i
subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0
• Convex objective function and convex constraints
• Unique solution
• Efficient procedures to find it
→ Is it the right criterion ?
O. Bousquet – Statistical Learning Theory – Lecture 8 168
Formulation (2)
• Soft margin
min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i
subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0
• Optimal value of ξ_i
ξ_i* = max(0, 1 − y_i(⟨w, x_i⟩ + b))
• Substitute above to get
min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^n max(0, 1 − y_i(⟨w, x_i⟩ + b))
O. Bousquet – Statistical Learning Theory – Lecture 8 169
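A minimal sketch of this substituted (unconstrained) form, assuming a toy two-dimensional dataset: plain subgradient descent on (1/2)‖w‖² + C Σ max(0, 1 − y_i(⟨w, x_i⟩ + b)). This only illustrates the objective; it is not one of the solvers used in practice, and the data and step sizes are arbitrary assumptions.

```python
# Sketch: subgradient descent on the unconstrained soft-margin SVM objective (linear kernel).
import numpy as np

rng = np.random.default_rng(3)
n, d, C = 200, 2, 1.0
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1, -1)

w, b = np.zeros(d), 0.0
for t in range(1, 2001):
    margins = y * (X @ w + b)
    active = margins < 1                       # points with a non-zero hinge subgradient
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    eta = 1.0 / t                              # decreasing step size
    w, b = w - eta * grad_w, b - eta * grad_b

train_err = np.mean(np.sign(X @ w + b) != y)
print(f"training error: {train_err:.3f}")
```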
Regularization
General form of the regularization problem
min_{f∈F} (1/n) Σ_{i=1}^n c(y_i f(x_i)) + λ‖f‖²
→ Capacity control by regularization with a convex cost
[Figure: the convex cost c(u), equal to 1 at u = 0 and 0 for u ≥ 1]
O. Bousquet – Statistical Learning Theory – Lecture 8 170
Loss Function
φ(Y f(X)) = max(0, 1 − Y f(X))
• Convex, non-increasing, upper bounds 1_{Y f(X) ≤ 0}
• Classification-calibrated
• Classification type (L* = L(t))
R(g) − R* ≤ L(g) − L*
O. Bousquet – Statistical Learning Theory – Lecture 8 171
Regularization
Choosing a kernel corresponds to
• Choosing a sequence (a_k)
• Setting
‖f‖² := Σ_{k≥0} a_k ∫ |f^{(k)}|² dx
⇒ penalization of high order derivatives (high frequencies)
⇒ enforces smoothness of the solution
O. Bousquet – Statistical Learning Theory – Lecture 8 172
Capacity: VC dimension
• The VC dimension of the set of hyperplanes is d + 1 in R^d.
Dimension of the feature space ?
∞ for the RBF kernel
• w is chosen in the span of the data (w = Σ α_i y_i x_i)
The span of the data has dimension n for the RBF kernel (the k(·, x_i) are linearly independent)
• The VC bound does not give any information: √(h/n) = 1
⇒ Need to take the margin into account
O. Bousquet – Statistical Learning Theory – Lecture 8 173
Capacity: Shattering dimension
Hyperplanes with Margin
If ‖x‖ ≤ R,
vc(hyperplanes with margin ρ, 1) ≤ R²/ρ²
O. Bousquet – Statistical Learning Theory – Lecture 8 174
Margin
• The shattering dimension is related to the margin
• Maximizing the margin means minimizing the shattering dimension
• Small shattering dimension ⇒ good control of the risk
⇒ this control is automatic (no need to choose the margin beforehand)
⇒ but requires tuning of regularization parameter
O. Bousquet – Statistical Learning Theory – Lecture 8 175
Capacity: Rademacher Averages (1)
• Consider hyperplanes with ‖w‖ ≤ M
• Rademacher average
(M/(n√2)) √(Σ_{i=1}^n k(x_i, x_i)) ≤ ℛ_n ≤ (M/n) √(Σ_{i=1}^n k(x_i, x_i))
• Trace of the Gram matrix
• Notice that ℛ_n ≤ √(R²/(nρ²))
O. Bousquet – Statistical Learning Theory – Lecture 8 176
Rademacher Averages (2)
E[ sup_{‖w‖≤M} (1/n) Σ_{i=1}^n σ_i ⟨w, δ_{x_i}⟩ ] = E[ sup_{‖w‖≤M} ⟨w, (1/n) Σ_{i=1}^n σ_i δ_{x_i}⟩ ]
≤ E[ sup_{‖w‖≤M} ‖w‖ ‖(1/n) Σ_{i=1}^n σ_i δ_{x_i}‖ ]
= (M/n) E[ √(⟨Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i}⟩) ]
O. Bousquet – Statistical Learning Theory – Lecture 8 177
Rademacher Averages (3)
(M/n) E[ √(⟨Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i}⟩) ]
≤ (M/n) √( E[⟨Σ_{i=1}^n σ_i δ_{x_i}, Σ_{i=1}^n σ_i δ_{x_i}⟩] )
= (M/n) √( E[Σ_{i,j} σ_i σ_j ⟨δ_{x_i}, δ_{x_j}⟩] )
= (M/n) √( Σ_{i=1}^n k(x_i, x_i) )
O. Bousquet – Statistical Learning Theory – Lecture 8 178
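The trace bound above is straightforward to compute from a Gram matrix; the sketch below does so for an RBF kernel (where k(x, x) = 1, so the bound reduces to M/√n). All values are illustrative assumptions.

```python
# Sketch: the trace bound (M/n) * sqrt(sum_i k(x_i, x_i)) for an RBF Gram matrix.
import numpy as np

rng = np.random.default_rng(4)
n, sigma, M = 500, 1.0, 10.0
X = rng.normal(size=(n, 3))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))          # Gram matrix K_ij = k(x_i, x_j)

trace_bound = M / n * np.sqrt(np.trace(K))
print(f"(M/n) * sqrt(trace K) = {trace_bound:.4f}  (= M/sqrt(n) = {M/np.sqrt(n):.4f} for RBF)")
```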
Improved rates – Noise condition
• Under Massart’s condition (|η| > η_0), with ‖g‖_∞ ≤ M,
E[(φ(Y g(X)) − φ(Y t(X)))²] ≤ (M − 1 + 2/η_0)(L(g) − L*).
→ If the noise is nice, the variance is linearly related to the expectation
→ Estimation error of order r* (of the class G)
O. Bousquet – Statistical Learning Theory – Lecture 8 179
Improved rates – Capacity (1)
• r*_n is related to the decay of the eigenvalues of the Gram matrix
r*_n ≤ (c/n) min_{d∈N} ( d + √(Σ_{j>d} λ_j) )
• Note that d = 0 gives the trace bound
• r*_n is always better than the trace bound (with equality when the λ_i are constant)
O. Bousquet – Statistical Learning Theory – Lecture 8 180
Improved rates – Capacity (2)
Example: exponential decay
• λ_i = e^{−αi}
• Global Rademacher average of order 1/√n
• r*_n of order (log n)/n
O. Bousquet – Statistical Learning Theory – Lecture 8 181
Exponent of the margin
• The estimation error analysis shows that in G_M = {g : ‖g‖ ≤ M}
R(g_n) − R(g*) ≤ M · ...
• The wrong power (an M² penalty) is used in the algorithm
• Computationally easier
• But it does not give λ a dimension-free status
• Using M could improve the cutoff detection
O. Bousquet – Statistical Learning Theory – Lecture 8 182
Kernel
Why is it good to use kernels ?
• Gaussian kernel (RBF)
k(x, y) = e^{−‖x−y‖²/(2σ²)}
• σ is the width of the kernel
→ What is the geometry of the feature space ?
O. Bousquet – Statistical Learning Theory – Lecture 8 183
RBF
Geometry
• Norms
‖Φ(x)‖² = ⟨Φ(x), Φ(x)⟩ = e^0 = 1
→ sphere of radius 1
• Angles
cos(∠(Φ(x), Φ(y))) = ⟨Φ(x)/‖Φ(x)‖, Φ(y)/‖Φ(y)‖⟩ = e^{−‖x−y‖²/(2σ²)} ≥ 0
→ angles less than 90 degrees
• Φ(x) = k(x, ·) ≥ 0
→ positive quadrant
O. Bousquet – Statistical Learning Theory – Lecture 8 184
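These three facts can be checked numerically from the Gram matrix alone; a small sketch with random, illustrative data:

```python
# Sketch: in the RBF feature space every point has norm 1, all pairwise angles are at most
# 90 degrees, and all inner products are non-negative.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
sigma = 1.5

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))          # K_ij = <Phi(x_i), Phi(x_j)>

norms = np.sqrt(np.diag(K))                     # ||Phi(x_i)|| should equal 1 for all i
cosines = K / np.outer(norms, norms)            # cosine of the angle between Phi(x_i), Phi(x_j)
print("all norms = 1:", np.allclose(norms, 1.0))
print("all cosines in [0, 1]:", bool((cosines >= 0).all() and (cosines <= 1 + 1e-12).all()))
```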
RBF
O. Bousquet – Statistical Learning Theory – Lecture 8 185
RBF
Differential Geometry
• Flat Riemannian metric
→ ’distance’ along the sphere is equal to distance in input space
• Distances are contracted
→ ’shortcuts’ by getting outside the sphere
O. Bousquet – Statistical Learning Theory – Lecture 8 186
RBF
Geometry of the span
[Figure: an ellipsoid with semi-axes a and b, x²/a² + y²/b² = 1]
• K = (k(x_i, x_j)) Gram matrix
• Eigenvalues λ_1, . . . , λ_n
• Data points are mapped to an ellipsoid with semi-axis lengths √λ_1, . . . , √λ_n
O. Bousquet – Statistical Learning Theory – Lecture 8 187
RBF
Universality
• Consider the set of functions
H = span{k(x, ·) : x ∈ X}
• H is dense in C(X)
→ Any continuous function can be approximated (in the ‖·‖_∞ norm) by functions in H
O. Bousquet – Statistical Learning Theory – Lecture 8 188
RBF
Eigenvalues
• Exponentially decreasing
• Fourier domain: exponential penalization of derivatives
• Enforces smoothness with respect to the Lebesgue measure in input
space
O. Bousquet – Statistical Learning Theory – Lecture 8 189
RBF
Induced Distance and Flexibility
• σ → 0
1-nearest neighbor in input space
Each point in a separate dimension, everything orthogonal
• σ → ∞
linear classifier in input space
All points very close on the sphere, initial geometry
• Tuning σ allows to try all possible intermediate combinations
O. Bousquet – Statistical Learning Theory – Lecture 8 190
RBF
Ideas
• Works well if the Euclidean distance is good
• Works well if decision boundary is smooth
• Adapt smoothness via σ
• Universal
O. Bousquet – Statistical Learning Theory – Lecture 8 191
Choosing the Kernel
• Major issue of current research
• Prior knowledge (e.g. invariances, distance)
• Cross-validation (limited to 1-2 parameters)
• Bound (better with convex class)
⇒ Lots of open questions...
O. Bousquet – Statistical Learning Theory – Lecture 8 192
Learning Theory: some informal thoughts
• Need assumptions/restrictions to learn
• Data cannot replace knowledge
• No universal learning (simplicity measure)
• SVM work because of capacity control
• Choice of kernel = choice of prior/ regularizer
• RBF works well if Euclidean distance meaningful
• Knowledge improves performance (e.g. invariances)
O. Bousquet – Statistical Learning Theory – Lecture 8 193

Bousquet – Statistical Learning Theory – Lecture 1 18 .Empirical Risk Minimization • Choose a model G (set of possible functions) • Minimize the empirical risk in the model min Rn(g) g∈G What if the Bayes classifier is not in the model ? O.

Approximation/Estimation • Bayes risk R = inf R(g) . or Bayes classifier). depends on the data).e. Bousquet – Statistical Learning Theory – Lecture 1 19 . O. g ∗ Best risk a deterministic function can have (risk of the target function. • Decomposition: R(g ∗) = inf g∈G R(g) R(gn) − R = R(g) − R ∗ ∗ + R(gn) − R(g ) Estimation ∗ Approximation • Only the estimation error is random (i.

Structural Risk Minimization • Choose a collection of models {Gd : d = 1. . . n) d g∈Gd pen(d. .} • Minimize the empirical risk in each model • Minimize the penalized empirical risk min min Rn(g) + pen(d. 2. n) measures the size or capacity of the model O. Bousquet – Statistical Learning Theory – Lecture 1 20 . n) gives preference to models where estimation error is small pen(d.

Bousquet – Statistical Learning Theory – Lecture 1 21 . Most methods can be thought of as regularization methods. O.Regularization • Choose a large model G (possibly dense) • Choose a regularizer g • Minimize the regularized empirical risk min Rn(g) + λ g g∈G 2 • Choose an optimal trade-off λ (regularization parameter).

Bousquet – Statistical Learning Theory – Lecture 1 22 . ⇒ need probabilistic bounds O. .Bounds (1) A learning algorithm • Takes as input the data (X1. Yn) • Produces a function gn Can we estimate the risk of gn ? ⇒ random quantity (depends on the data). Y1). . . . (Xn.

Bousquet – Statistical Learning Theory – Lecture 1 23 .Bounds (2) • Error bounds R(gn) ≤ Rn(gn) + B ⇒ Estimation from an empirical quantity • Relative error bounds Best in a class R(gn) ≤ R(g ) + B Bayes risk ∗ R(gn) ≤ R + B ⇒ Theoretical guarantees ∗ O.

Lecture 2 Basic Bounds • Probability tools • Relationship with empirical processes • Law of large numbers • Union bound • Relative error bounds O. Bousquet – Statistical Learning Theory – Lecture 2 24 .

Bousquet – Statistical Learning Theory – Lecture 2 25 .Probability Tools (1) Basic facts • Union: P [A or B] ≤ P [A] + P [B] • Inclusion: If A ⇒ B . • Inversion: If P [X ≥ t] ≤ F (t) then with probability at least 1 − δ . E [X] = ∞ 0 P [X ≥ t] dt. • Expectation: If X ≥ 0. O. X ≤ F −1(δ). then P [A] ≤ P [B].

P [|X − E [X] | ≥ t] ≤ E[X] t Var[X] t2 • Chernoff: for all t ∈ R. Bousquet – Statistical Learning Theory – Lecture 2 26 . f (E [X]) ≤ E [f (X)] • Markov: If X ≥ 0 then for all t > 0. P [X ≥ t] ≤ inf λ≥0 E eλ(X−t) O. P [X ≥ t] ≤ • Chebyshev: for t > 0.Probability Tools (2) Basic inequalities • Jensen: for f convex.

. • Cannot be observed (P is unknown) • Random (depends on the data) ⇒ we want to bound P [R(gn) − Rn(gn) > ε] O.Error bounds Recall that we want to bound R(gn) = E 1[gn(X)=Y ] where gn has been constructed from (X1. . Yn). Bousquet – Statistical Learning Theory – Lecture 2 27 . . (Xn. Y1). .

Given G define the loss class F = {f : (x. Y )] and Pnf = Quantity of interest: 1 n n i=1 f (Xi. y) → 1[g(x)=y] : g ∈ G} Denote P f = E [f (X. Bousquet – Statistical Learning Theory – Lecture 2 28 . Y ). Yi) and Z = (X. let Zi = (Xi. Yi) P f − Pn f We will go back and forth between F and G (bijection) O.Loss class For convenience.

Empirical process Empirical process: {P f − Pnf }f ∈F • Process = collection of random variables (here indexed by functions in F ) • Empirical = distribution of each random variable ⇒ Many techniques exist to control the supremum sup P f − Pnf f ∈F O. Bousquet – Statistical Learning Theory – Lecture 2 29 .

Bousquet – Statistical Learning Theory – Lecture 2 30 . f (Z) Law of large numbers n 1 P lim n→∞ n ⇒ can we quantify it ? n f (Zi) − E [f (Z)] = 0 i=1 = 1. O.v.The Law of Large Numbers 1 R(g) − Rn(g) = E [f (Z)] − f (Zi) n i=1 → difference between the expectation and the empirical average of the r.

⇒ Let’s rewrite it to better understand O. Zn be n i.i.d. Assumes bounded random variables Theorem 1. . . b]. Then for all ε > 0. If f (Z) ∈ [a. Let Z1. . we have P 1 n n f (Zi) − E [f (Z)] > ε i=1 ≤ 2 exp 2nε2 − (b − a)2 . random variables.Hoeffding’s Inequality Quantitative version of law of large numbers. . Bousquet – Statistical Learning Theory – Lecture 2 31 .

Bousquet – Statistical Learning Theory – Lecture 2 32 . |Pnf − P f | ≤ (b − a) log 2 δ 2n O.Hoeffding’s Inequality Write δ = 2 exp Then 2nε2 − (b − a)2 log 2 δ 2n  ≤δ  P |Pnf − P f | > (b − a) or [Inversion] with probability at least 1 − δ .

If the function depends on the data this does not apply ! O. (1) Notice that one has to consider a fixed function f and the probability is with respect to the sampling of the data. Bousquet – Statistical Learning Theory – Lecture 2 33 . For any g .Hoeffding’s inequality Let’s apply to f (Z) = 1[g(X)=Y ]. and any δ > 0. with probability at least 1 − δ R(g) ≤ Rn(g) + log 2 δ 2n .

Limitations • For each fixed function f ∈ F . only some of the functions in F will satisfy this inequality ! O. there is a set S of samples for which P f − Pn f ≤ log 2 δ 2n (P [S] ≥ 1 − δ ) • They may be different for different functions • The function chosen by the algorithm depends on the sample ⇒ For the observed sample. Bousquet – Statistical Learning Theory – Lecture 2 34 .

Bousquet – Statistical Learning Theory – Lecture 2 35 . This does not contradict Hoeffding but shows it is not enough O.Limitations What we need to bound is P f n − P n fn where fn is the function chosen by the algorithm based on the data. For any fixed sample. there exists a function f such that P f − Pn f = 1 Take the function which is f (Xi) = Yi on the data and f (X) = −Y everywhere else.

Bousquet – Statistical Learning Theory – Lecture 2 36 .Limitations Risk Remp [f] R[f] R Remp f f opt fm Function class Hoeffding’s inequality quantifies differences for a fixed function O.

Bousquet – Statistical Learning Theory – Lecture 2 37 . we do not know which function the algorithm will choose. The trick is to consider uniform deviations R(fn) − Rn(fn) ≤ sup(R(f ) − Rn(f )) f ∈F We need a bound which holds simultaneously for all functions in a class O.Uniform Deviations Before seeing the data.

. . f2 and define Ci = {(x1. for each i P [Ci] ≤ δ We want to bound the probability of being ’bad’ for i = 1 or i = 2 P [C1 ∪ C2] ≤ P [C1] + P [C2] O. yn) : P fi − Pnfi > ε} From Hoeffding’s inequality.Union Bound Consider two functions f1. (xn. y1). . Bousquet – Statistical Learning Theory – Lecture 2 38 . .

Bousquet – Statistical Learning Theory – Lecture 2 39 . . . . ∪ CN ] ≤ i=1 P [Ci] We have P [∃f ∈ {f1. fN } : P f − Pnf > ε] N ≤ i=1 P [P fi − Pnfi > ε] ≤ N exp −2nε 2 O. .Finite Case More generally N P [C1 ∪ . . .

for G = {g1. for all δ > 0 with probability at least 1 − δ . . Bousquet – Statistical Learning Theory – Lecture 2 40 . .Finite Case We obtain. ∀g ∈ G. gN }. . R(g) ≤ Rn(g) + This is a generalization bound ! log N + log 1 δ 2m Coding interpretation log N is the number of bits to specify a function in F O. .

Bousquet – Statistical Learning Theory – Lecture 2 41 .Approximation/Estimation Let g = arg min R(g) g∈G ∗ If gn minimizes the empirical risk in G . Rn(g ) − Rn(gn) ≥ 0 Thus ∗ R(gn) = ≤ ≤ R(gn) − R(g ) + R(g ) Rn(g ) − Rn(gn) + R(gn) − R(g ) + R(g ) 2 sup |R(g) − Rn(g)| + R(g ) g∈G ∗ ∗ ∗ ∗ ∗ ∗ O.

Approximation/Estimation We obtain with probability at least 1 − δ ∗ R(gn) ≤ R(g ) + 2 The first term decreases if N increases The second term increases log N + log 2 δ 2m The size of G controls the trade-off O. Bousquet – Statistical Learning Theory – Lecture 2 42 .

Choose a sequence of models Gm to have more flexibility/control O.Summary (1) • Inference requires assumptions .d. Bousquet – Statistical Learning Theory – Lecture 2 43 .Restrict the possible functions to G .Data sampled i. from P .i.

for most of the samples √ R(g) − Rn(g) ≈ 1/ n .For most of the samples if |G| = N sup R(g) − Rn(g) ≈ g∈G log N/n ⇒ Extra variability because the chosen gn changes with the data O.Summary (2) • Bounds are valid w.t. Bousquet – Statistical Learning Theory – Lecture 2 44 .r. repeated sampling .For a fixed function g .

Next we improve the union bound and extend it to the infinite case O. Bousquet – Statistical Learning Theory – Lecture 2 45 .Improvements We obtained sup R(g) − Rn(g) ≤ g∈G log N + log 2 δ 2n To be improved • Hoeffding only uses boundedness. not the variance • Union bound as bad as if independent • Supremum is not what the algorithm chooses.

Refined union bound (1) For each f ∈ F . Bousquet – Statistical Learning Theory – Lecture 2 46 .  P P f − Pnf > log 1 δ(f )   ≤ δ(f )  ≤ f ∈F 2n log  P ∃f ∈ F : P f − Pnf > Choose δ(f ) = δp(f ) with f ∈F 1 δ(f ) 2n δ(f ) p(f ) = 1 O.

Refined union bound (2) With probability at least 1 − δ . 1 log p(f ) + log 1 δ ∀f ∈ F . P f ≤ Pnf + • Applies to countably infinite F 2n • Can put knowledge about the algorithm into p(f ) • But p chosen before seeing the data O. Bousquet – Statistical Learning Theory – Lecture 2 47 .

Bousquet – Statistical Learning Theory – Lecture 2 48 . how to choose the p (since it implies an ordering) • The trick is to look at F through the data O. The bound can be improved if you know ahead of time the chosen function (knowledge improves the bound) • In the infinite case.Refined union bound (3) • Good p means good bound.

Lecture 3 Infinite Case: Vapnik-Chervonenkis Theory • Growth function • Vapnik-Chervonenkis dimension • Proof of the VC bound • VC entropy • SRM O. Bousquet – Statistical Learning Theory – Lecture 3 49 .

.zn ) |Fz1. ..... .. .zn = {(f (z1). .. ..zn | O. zn) can be classified. Bousquet – Statistical Learning Theory – Lecture 3 50 . .Infinite Case Measure of the size of an infinite class ? • Consider Fz1. f (zn)) : f ∈ F } The size of this set is the number of possible ways in which the data (z1. ... . • Growth function SF (n) = • Note that SF (n) = SG (n) sup (z1 ...

R(g) ≤ Rn(g) + • Always better than N in the finite case • How to compute SG (n) in general ? ⇒ use VC dimension O.Infinite Case • Result (Vapnik-Chervonenkis) With probability at least 1 − δ log SG (2n) + log 4 δ 8n ∀g ∈ G. Bousquet – Statistical Learning Theory – Lecture 3 51 .

SG (n) ≤ 2n If SG (n) = 2n.VC Dimension Notice that since g ∈ {−1. Bousquet – Statistical Learning Theory – Lecture 3 52 . The VC-dimension of G is the largest n such that SG (n) = 2 n O. the class of functions can generate any classification on n points (shattering) Definition 2. 1}.

VC Dimension Hyperplanes In Rd. Bousquet – Statistical Learning Theory – Lecture 3 53 . V C(hyperplanes) = d + 1 O.

VC Dimension Number of Parameters Is VC dimension equal to number of parameters ? • One parameter {sgn(sin(tx)) : t ∈ R} • Infinite VC dimension ! O. Bousquet – Statistical Learning Theory – Lecture 3 54 .

Bousquet – Statistical Learning Theory – Lecture 3 55 .VC Dimension • We want to know SG (n) but we only know SG (n) = 2n for n ≤ h What happens for n ≥ h ? log(S(m)) h m O.

SG (n) ≤ en h h ⇒ phase transition O. Bousquet – Statistical Learning Theory – Lecture 3 56 . Let G be a class of functions with finite VC-dimension h. h n SG (n) ≤ i i=0 and for all n ≥ h. Then for all n ∈ N.Vapnik-Chervonenkis-Sauer-Shelah Lemma Lemma 3.

Bousquet – Statistical Learning Theory – Lecture 3 57 . With probability at least 1 − δ ∀g ∈ G.VC Bound Let G be a class with VC dimension h. R(g) ≤ Rn(g) + h log 2en + log 4 h δ 8n So the error is of order h log n n O.

Bousquet – Statistical Learning Theory – Lecture 3 58 .Interpretation VC dimension: measure of effective dimension • Depends on geometry of the class • Gives a natural definition of simplicity (by quantifying the potential overfitting) • Not related to the number of parameters • Finiteness guarantees learnability under any distribution O.

Bousquet – Statistical Learning Theory – Lecture 3 59 . such that nt2 ≥ 2.Symmetrization (lemma) Key ingredient in VC bounds: Symmetrization Let Z1. . For any t > 0. Zn an independent (ghost) sample and Pn the corresponding empirical measure. . . Lemma 4. . P sup(P − Pn)f ≥ t f ∈F ≤ 2P sup(Pn − Pn)f ≥ t/2 f ∈F O.

. . . .Symmetrization (proof – 1) fn the function achieving the supremum (depends on Z1. Zn) 1[(P −Pn)fn>t]1[(P −Pn)fn<t/2] = ≤ 1[(P −Pn)fn>t ∧ (P −Pn)fn<t/2] 1[(Pn−Pn)fn>t/2] Taking expectations with respect to the second sample gives 1[(P −Pn)fn>t]P (P − Pn)fn < t/2 ≤ P (Pn − Pn)fn > t/2 O. Bousquet – Statistical Learning Theory – Lecture 3 60 .

O. P (P − Pn)fn ≥ t/2 ≤ 4Var [fn] 1 ≤ nt2 nt2 • Hence 1[(P −Pn)fn>t](1 − 1 ) ≤ P (Pn − Pn)fn > t/2 2 nt Take expectation with respect to first sample.Symmetrization (proof – 2) • By Chebyshev inequality. Bousquet – Statistical Learning Theory – Lecture 3 61 .

..Zn 1 . Bousquet – Statistical Learning Theory – Lecture 3 62 ....Z • Union bound on FZ1....Zn • Variant of Hoeffding’s inequality P Pnf − Pnf > t ≤ 2e −nt2 /2 O..Zn.Z 1 ..Zn.......Proof of VC bound (1) • Symmetrization allows to replace expectation by average on ghost sample • Function class projected on the double sample FZ1.

..Proof of VC bound (2) P supf ∈F (P − Pn)f ≥ t ≤ = ≤ ≤ 2P supf ∈F (Pn − Pn)f ≥ t/2 2P supf ∈F Z1 ..Z1 ..Zn (Pn − Pn)f ≥ t/2 2SF (2n)P (Pn − Pn)f ≥ t/2 4SF (2n)e −nt2 /8 O....Zn .. Bousquet – Statistical Learning Theory – Lecture 3 63 .

Bousquet – Statistical Learning Theory – Lecture 3 64 .VC Entropy (1) • VC dimension is distribution independent ⇒ The same bound holds for any distribution ⇒ It is loose for most distributions • A similar proof can give a distribution-dependent result O.

• VC entropy bound: with probability at least 1 − δ HG (2n) + log 2 δ 8n ∀g ∈ G. . . .. ..zn • The VC entropy is defined as HF (n) = log E [N (F . zn) := #Fz1. . . R(g) ≤ Rn(g) + O. z1 .VC Entropy (2) • Denote the size of the projection N (F . Zn)] . Z1. ... Bousquet – Statistical Learning Theory – Lecture 3 65 .

Rademacher variables 2P supf ∈F ≤ Z. Z. Z 2E N F .VC Entropy (proof) Introduce σi ∈ {−1. Z 1 P n e σi ≥ t/2 i=1 −nt2 /8 O. Bousquet – Statistical Learning Theory – Lecture 3 66 .Z (Pn − Pn)f ≥ t/2 1 n n 2E Pσ supf ∈F Z. 1} (probability 1/2).Z σi(f (Zi ) − f (Zi)) ≥ t/2 i=1 n ≤ ≤ 2E N F . Z.

e. Bousquet – Statistical Learning Theory – Lecture 3 67 .From Bounds to Algorithms • For any distribution. convergence to best in the class) • Does it means we can learn anything ? • No because of the approximation of the class • Need to trade-off approximation and estimation error (assessed by the bound) ⇒ Use the bound to control the trade-off O. HG (n)/n → 0 ensures consistency of empirical risk minimizer (i.

introduce a structure on G .Structural risk minimization • Structural risk minimization (SRM) (Vapnik. Bousquet – Statistical Learning Theory – Lecture 3 68 . • Learning machine ≡ a set of functions and an induction principle O. • To this end. n). 1979): minimize the right hand side of R(g) ≤ Rn(g) + B(h.

Bousquet – Statistical Learning Theory – Lecture 3 69 .SRM: The Picture R(f* ) error bound on test error capacity term training error h structure Sn−1 Sn Sn+1 O.

Bousquet – Statistical Learning Theory – Lecture 4 70 .Lecture 4 Capacity Measures • Covering numbers • Rademacher averages • Relationships O.

. . ε. e.Covering numbers • Define a (random) distance d between functions. ε) • Covering number N (F . O. ε. fN covers F at radius ε if F ⊂ ∪i=1B(fi. d(f. . . . f ) = 1 #{f (Zi) = f (Zi) : i = 1. . n} n Normalized Hamming distance of the ’projections’ on the sample • A set f1. .g. n) = N (G. Bousquet – Statistical Learning Theory – Lecture 4 71 N . ε. n) is the minimum size of a cover of radius ε Note that N (F . . n).

Bousquet – Statistical Learning Theory – Lecture 4 72 . t. n)] e −nt2 /128 O.Bound with covering numbers • When the covering numbers are finite. one can approximate the class G by a finite set of functions • Result P [∃g ∈ G : R(g) − Rn(g) ≥ t] ≤ 8E [N (G.

Bousquet – Statistical Learning Theory – Lecture 4 73 . t. N (G. n) ≤ h log en h • Haussler 1 N (G. n) ≤ Ch(4e) h t h • Independent of n O. n) ≤ #GZ = N (G.Covering numbers and VC dimension • Notice that for all t. t. t. Z) • Hence N (G.

Refinement • VC entropy corresponds to log covering numbers at minimal scale • Covering number bound is a generalization where the scale is adapted to the error • Is this the right scale ? • It turns out that results can be improved by considering all scales (→ chaining) O. Bousquet – Statistical Learning Theory – Lecture 4 74 .

. .Rademacher averages • Rademacher variables: σ1. σn independent random variables with 1 P [σi = 1] = P [σi = −1] = 2 • Notation (randomized empirical measure) Rnf = • Rademacher average: R(F ) = E supf ∈F Rnf • Conditional Rademacher average Rn(F ) = Eσ supf ∈F Rnf 1 n n i=1 σif (Zi) O. Bousquet – Statistical Learning Theory – Lecture 4 75 . . .

Result • Distribution dependent log 1 δ 2n ∀f ∈ F . O. P f ≤ Pnf + 2Rn(F ) + 2 log 2 δ n . P f ≤ Pnf + 2R(F ) + • Data dependent . ∀f ∈ F . Bousquet – Statistical Learning Theory – Lecture 4 76 .

Bousquet – Statistical Learning Theory – Lecture 4 77 .d. random variables exist O.Concentration • Hoeffding’s inequality is a concentration inequality • When n increases. the average is concentrated around the expectation • Generalization to functions that depend on i.i.

zi. . . .z i |F (z1. . . .McDiarmid’s Inequality Assume for all i = 1. .... . zi. . zn) − F (z1. . Bousquet – Statistical Learning Theory – Lecture 4 78 . P [|F − E [F ] | > ε] ≤ 2 exp 2ε2 − 2 nc O. . . . . . .. n. . sup z1 . .zn . . zn)| ≤ c then for all ε > 0. .

Bousquet – Statistical Learning Theory – Lecture 4 79 .Proof of Rademacher average bounds • Use concentration to relate supf ∈F P f − Pnf to its expectation • Use symmetrization to relate expectation to Rademacher average • Use concentration again to relate Rademacher average to conditional one O.

Application (1) sup A(f ) + B(f ) ≤ sup A(f ) + sup B(f ) f ∈F f ∈F f ∈F Hence | sup C(f ) − sup A(f )| ≤ sup(C(f ) − A(f )) f ∈F f ∈F f ∈F this gives | sup(P f − Pnf ) − sup(P f − Pnf )| ≤ sup(Pnf − Pnf ) f ∈F f ∈F f ∈F O. Bousquet – Statistical Learning Theory – Lecture 4 80 .

Pn f − Pn f = thus 1 1 (f (Zi ) − f (Zi)) ≤ n n 1 n | sup(P f − Pnf ) − sup(P f − Pnf )| ≤ f ∈F f ∈F McDiarmid’s inequality can be applied with c = 1/n O.Application (2) f ∈ {0. Bousquet – Statistical Learning Theory – Lecture 4 81 . 1} hence.

Symmetrization (1) • Upper bound E sup P f − Pnf f ∈F ≤ 2E sup Rnf f ∈F • Lower bound E sup |P f − Pnf | f ∈F ≥ 1 E sup Rnf 2 f ∈F 1 − √ 2 n O. Bousquet – Statistical Learning Theory – Lecture 4 82 .

Z [sup Pnf − Pnf ] f ∈F = Eσ. Bousquet – Statistical Learning Theory – Lecture 4 83 .Z 1 sup f ∈F n n σi(f (Zi ) − f (Zi)) i=1 ≤ 2E[sup Rnf ] f ∈F O.Symmetrization (2) E[sup P f − Pnf ] f ∈F = ≤ E[sup E Pnf − Pnf ] f ∈F EZ.Z.

Loss class and initial class n R(F ) = 1 E sup g∈G n 1 E sup g∈G n σi1[g(Xi)=Yi] i=1 n = i=1 n 1 σi (1 − Yig(Xi)) 2 σiYig(Xi) = 1 R(G) 2 = 1 1 E sup 2 g∈G n i=1 O. Bousquet – Statistical Learning Theory – Lecture 4 84 .

Computing Rademacher averages (1) 1 1 E sup 2 g∈G n = n σig(Xi) i=1 n 1 1 + E sup 2 g∈G n 1 1 − E inf g∈G n 2 − i=1 n 1 − σig(Xi) 2 = i=1 1 − σig(Xi) 2 = 1 − E inf Rn(g. σ) g∈G 2 85 O. Bousquet – Statistical Learning Theory – Lecture 4 .

Bousquet – Statistical Learning Theory – Lecture 4 86 .Computing Rademacher averages (2) • Not harder than computing empirical risk minimizer • Pick σi randomly and minimize error with respect to labels σi • Intuition: measure how much the class can fit random noise • Large class ⇒ R(G) = 1 2 O.

with (Xi.Concentration again • Let F = Eσ sup Rnf f ∈F Expectation with respect to σi only. Bousquet – Statistical Learning Theory – Lecture 4 87 . Yi) fixed. • F satisfies McDiarmid’s assumptions with c = 1 n ⇒ E [F ] = R(F ) can be estimated by F = Rn(F ) O.

. fN } R(F ) ≤ 2 log N/n • Consequence for VC class F with dimension h R(F ) ≤ 2 h log en h n . Bousquet – Statistical Learning Theory – Lecture 4 88 . . . . ⇒ Recovers VC bound with a concentration proof O.Relationship with VC dimension • For a finite set F = {f1.

Bousquet – Statistical Learning Theory – Lecture 4 89 .Chaining • Using covering numbers at all scales. n) dt O. t. the geometry of the class is better captured • Dudley C Rn(F ) ≤ √ n • Consequence R(F ) ≤ C • Removes the unnecessary log n factor ! h n 0 ∞ log N (F .

Bousquet – Statistical Learning Theory – Lecture 5 90 .Lecture 5 Advanced Topics • Relative error bounds • Noise conditions • Localized Rademacher averages • PAC-Bayesian bounds O.

n) binomial distribution p = P f • P [P f − Pnf ≥ t] = • Can be upper bounded n(1−p−t) n(p+t) 1−p p Exponential 1−p−t p+t np − 1−p ((1−t/p) log(1−t/p)+t/p) Bennett e n(p−t) k=0 n k pk (1 − p)n−k Bernstein e − nt2 2p(1−p)+2t/3 2 Hoeffding e−2nt O.Binomial tails • Pnf ∼ B(p. Bousquet – Statistical Learning Theory – Lecture 5 91 .

Tail behavior • For small deviations. Poisson behavior ≈ exp(−3nt/2) ⇒ Tails heavier than Gaussian • Can upper bound with a Gaussian with large (maximum) variance exp(−2nt2) O. Bousquet – Statistical Learning Theory – Lecture 5 92 . Gaussian behavior ≈ exp(−nt2/2p(1 − p)) ⇒ Gaussian with variance p(1 − p) • For large deviations.

7 0.8 0.5) 1 0.3 0.5 0.85 0. Bousquet – Statistical Learning Theory – Lecture 5 93 .2 0.5 BinoCDF Bernstein Bennett Binomial Tail Best Gausian Deviation bound 0.7 0.6 0.9 0.65 0.75 x 0.55 0.Illustration (1) Maximum variance (p = 0.95 1 O.1 0 0.9 0.8 0.6 0.4 0.

05 Deviation bound Deviation bound 0.1 0 0.42 0.35 0.01 0.3 0.02 x 0.7 0.6 0.3 BinoCDF Sub−Gaussian Bernstein Bennett Binomial Tail Best Gausian 0.15 0.8 0.5 0.34 0.07 0.25 0.1) 1 0.4 0.04 0.9 0.03 0.4 0.3 0.2 0. Bousquet – Statistical Learning Theory – Lecture 5 94 .Illustration (2) Small variance (p = 0.06 BinoCDF Bernstein Bennett Binomial Tail Best Gausian 0.44 O.32 0.1 0 0.38 0.2 0.36 x 0.

by Bernstein’s inequality 2P f log 1 δ n 2 log 1 δ 3n P f ≤ Pn f + + • The Gaussian part dominates (for P f not too small. Bousquet – Statistical Learning Theory – Lecture 5 95 . it depends on P f O. or n large enough). • For each f ∈ F .Taking the variance into account (1) • Each function f ∈ F has a different variance P f (1 − P f ) ≤ P f .

1) ⇒ Idea is to consider the ratio P f − Pn f √ Pf O. Bousquet – Statistical Learning Theory – Lecture 5 96 .Taking the variance into account (2) • Central Limit Theorem √ n P f − Pn f P f (1 − P f ) → N (0.

1}). O. • After normalization.Normalization • Here (f ∈ {0. fluctuations are more ”uniform” sup f ∈F P f − Pn f √ Pf not necessarily attained at functions with large variance. ⇒ The normalized supremum takes this into account. Bousquet – Statistical Learning Theory – Lecture 5 97 . Var [f ] ≤ P f 2 = P f • Large variance ⇒ large risk. • Focus of learning: functions with small error P f (hence small variance).

P f − Pn f ≤2 ∀f ∈ F . √ Pn f O. Bousquet – Statistical Learning Theory – Lecture 5 98 .Relative deviations Vapnik-Chervonenkis 1974 For δ > 0 with probability at least 1 − δ . √ Pf and log SF (2n) + log 4 δ n log SF (2n) + log 4 δ n Pn f − P f ≤2 ∀f ∈ F .

Symmetrization P f − Pn f P sup √ ≥t Pf f ∈F 2.Proof sketch 1. Tail bound O. Bousquet – Statistical Learning Theory – Lecture 5 99 . Randomization ≤ 2P sup f ∈F Pn f − Pn f (Pnf + Pnf )/2 ≥t · · · = 2E Pσ sup f ∈F 1 n n i=1 σi(f (Zi ) − f (Zi)) (Pnf + Pnf )/2 ≥t 3.

P f ≤ Pn f + 2 +4 Pn f log SF (2n) + log 4 δ n log SF (2n) + log 4 δ n O.Consequences From the fact A≤B+C we get √ A⇒A≤B+C + 2 √ BC ∀f ∈ F . Bousquet – Statistical Learning Theory – Lecture 5 100 .

) In that case • Rn(gn) = 0 ⇒ R(gn) = O( d log n ).Zero noise Ideal situation • gn empirical risk minimizer • t∈G • R∗ = 0 (no noise. Bousquet – Statistical Learning Theory – Lecture 5 101 . n(X) = 0 a. n O.s.

Bousquet – Statistical Learning Theory – Lecture 5 102 .Interpolating between rates ? • Rates are not correctly estimated by this inequality • Consequence of relative error bounds ≤ ∗ P fn Pf + 2 +4 Pf log SF (2n) + log 4 δ ∗ n log SF (2n) + log 4 δ n • The quantity which is small is not P f ∗ but P fn − P f ∗ • But relative error bounds do not apply to differences O.

Bousquet – Statistical Learning Theory – Lecture 5 103 .Definitions • P = PX × P (Y |X) • regression function η(x) = E [Y |X = x] • target function t(x) = sgn η(x) • noise level n(X) = (1 − |η(x)|)/2 • Bayes risk R∗ = E [n(X)] • R(g) = E [(1 − η(X))/2] + E η(X)1[g≤0] • R(g) − R∗ = E |η(X)|1[gη≤0] O.

Two kinds of assumptions • n not too close to 1/2 • n not often too close to 1/2 O. the deterministic case.Intermediate noise Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0). one can assume that n is well-behaved. Bousquet – Statistical Learning Theory – Lecture 5 104 .

Massart Condition • For some c > 0. assume |η(X)| > 1 almost surely c • There is no region where the decision is completely random • Noise bounded away from 1/2 O. Bousquet – Statistical Learning Theory – Lecture 5 105 .

1} . equivalent conditions (1) ∃c > 0.Tsybakov Condition Let α ∈ [0. P [g(X)η(X) ≤ 0] ≤ c(R(g) − R ) ∗ α α X (2) (3) ∃c > 0. 1]. ∀g ∈ {−1. A dP (x) ≤ c( A |η(x)|dP (x)) α ∃B > 0. Bousquet – Statistical Learning Theory – Lecture 5 106 . ∀t ≥ 0. ∀A ⊂ X . P [|η(X)| ≤ t] ≤ Bt 1−α O.

Equivalence • (1) ⇔ (2) Recall R(g) − R∗ = E |η(X)|1[gη≤0] . there exists a set A such that 1[A] = 1[gη≤0] • (2) ⇒ (3) Let A = {x : |η(x)| ≤ t} P [|η| ≤ t] = A dP (x) ≤ ≤ c( A |η(x)|dP (x)) dP (x)) A α α α ct ( 1 α ⇒ P [|η| ≤ t] ≤ c 1−α t 1−α O. Bousquet – Statistical Learning Theory – Lecture 5 107 . For each function g .

Bousquet – Statistical Learning Theory – Lecture 5 108 .• (3) ⇒ (1) R(g) − R = E |η(X)| 1[gη≤0] ≥ = ≥ tE 1[gη≤0]1[|η|>t] tP [|η| > t] − tE 1[gη>0]1[|η|>t] t(1 − Bt 1−α ) − tP [gη > 0] = t(P [gη ≤ 0] − Bt 1−α ) (1−α)P[gη≤0] B (1−α)/α α α ∗ Take t = B 1−α ∗ α ⇒ P [gη ≤ 0] ≤ (R(g) − R ) (1 − α)(1 − α)αα O.

1] because R(g) − R = E |η(X)|1[gη≤0] ≤ E 1[gη≤0] ∗ • α = 0 no condition • α = 1 gives Massart’s condition O.Remarks • α is in [0. Bousquet – Statistical Learning Theory – Lecture 5 109 .

Consequences • Under Massart’s condition E (1[g(X)=Y ] − 1[t(X)=Y ]) 2 ≤ c(R(g) − R ) ∗ • Under Tsybakov’s condition E (1[g(X)=Y ] − 1[t(X)=Y ]) 2 ≤ c(R(g) − R ) ∗ α O. Bousquet – Statistical Learning Theory – Lecture 5 110 .

Bousquet – Statistical Learning Theory – Lecture 5 111 .Relative loss class • F is the loss class associated to G • The relative loss class is defined as ∗ ˜ F = {f − f : f ∈ F } • It satisfies P f ≤ c(P f ) 2 α O.

Finite case ˜ • Union bound on F with Bernstein’s inequality would give P fn−P f ≤ Pnfn−Pnf + ∗ ∗ 8c(P fn − P f ∗)α log N δ n + 4 log N δ 3n • Consequence when f ∗ ∈ F (but R∗ > 0) ∗ P fn − P f ≤ C always better than n−1/2 for α > 0 O. Bousquet – Statistical Learning Theory – Lecture 5 log n N δ 1 2−α 112 .

Bousquet – Statistical Learning Theory – Lecture 5 113 . r) = E sup f ∈F :P f 2 ≤r Rnf • Allows to generalize the previous result • Computes the capacity of a small ball in F (functions with small variance) • Under noise conditions.Local Rademacher average • Definition R(F . small variance implies small error O.

Bousquet – Statistical Learning Theory – Lecture 5 114 .Sub-root functions Definition A function ψ : R → R is sub-root if • ψ is non-decreasing • ψ is non negative √ • ψ(r)/ r is non-increasing O.

5 3 O.5 2 1.Sub-root functions Properties A sub-root function • is continuous • has a unique fixed point ψ(r ∗) = r ∗ 3 x phi(x) 2. Bousquet – Statistical Learning Theory – Lecture 5 115 .5 2 2.5 1 1.5 1 0.5 0 0 0.

1]} • Properties Rn( F . r) is sub-root • Entropy of F is not much bigger than entropy of F O. α ∈ [0. Bousquet – Statistical Learning Theory – Lecture 5 116 .Star hull • Definition F = {αf : f ∈ F .

Bousquet – Statistical Learning Theory – Lecture 5 117 .Result • r ∗ fixed point of R( F . r) • Bounded functions P f − Pn f ≤ C r ∗Var [f ] + log 1 + log log n δ n • Consequence for variance related to expectation (Var [f ] ≤ c(P f )β ) Pf ≤ C Pn f + 1 ∗ (r ) 2−β + log 1 + log log n δ n O.

Bousquet – Statistical Learning Theory – Lecture 5 118 .Consequences • For VC classes R(F . r) ≤ C rh n h hence r ∗ ≤ C n √ • Rate of convergence of Pnf to P f in O(1/ n) • But rate of convergence of P fn to P f ∗ is O(1/n1/(2−α)) Only condition is t ∈ G but can be removed by SRM/Model selection O.

Proof sketch (1) • Talagrand’s inequality sup P f −Pnf ≤ E sup P f − Pnf +c f ∈F f ∈F sup Var [f ] /n+c /n f ∈F • Peeling of the class Fk = {f : Var [f ] ∈ [x . x k k+1 )} O. Bousquet – Statistical Learning Theory – Lecture 5 119 .

P f −Pnf ≤ 2R(F . xVar [f ])+c xVar [f ] /n+c /n O. Bousquet – Statistical Learning Theory – Lecture 5 120 .Proof sketch (2) • Application sup P f −Pnf ≤ E sup P f − Pnf +c f ∈Fk xVar [f ] /n+c /n f ∈Fk • Symmetrization ∀f ∈ F .

Proof sketch (3) • We need to ’solve’ this inequality. hence the sub-root property P f − Pn f ≤ 2 • Variance-expectation Var [f ] ≤ c(P f ) Solve in P f α r ∗Var [f ] + c xVar [f ] /n + c /n O. Bousquet – Statistical Learning Theory – Lecture 5 121 . Things are simple if R behave like a square root.

one can use data-dependent local Rademcher averages Rn(F . r) O. Bousquet – Statistical Learning Theory – Lecture 5 122 . r) = Eσ sup f ∈F :P f 2 ≤r Rnf • Using concentration one can also get Pf ≤ C Pn f + 1 ∗ (rn) 2−α + log 1 + log log n δ n ∗ where rn is the fixed point of a sub-root upper bound of Rn(F .Data-dependent version • As in the global case.

Discussion • Improved rates under low noise conditions • Interpolation in the rates • Capacity measure seems ’local’. Bousquet – Statistical Learning Theory – Lecture 5 123 . • after appropriate rescaling: each f ∈ F is considered at scale r/P f 2 O. • but depends on all the functions.

Bousquet – Statistical Learning Theory – Lecture 5 124 . pick g ∼ ρn • Error is averaged over ρn R(ρn) = ρnP f Rn(ρn) = ρnPnf O.Randomized Classifiers Given G a class of functions • Deterministic: picks a function gn and always use it to predict • Randomized construct a distribution ρn over G for each instance to classify.

Bousquet – Statistical Learning Theory – Lecture 5 125 . • Recall the refined union bound ∀f ∈ F .Union Bound (1) Let π be a (fixed) distribution over F . P f − Pnf ≤ • Take expectation with respect to ρn ρnP f − ρnPnf ≤ ρn 1 log π(f ) + log 1 δ 1 log π(f ) + log 1 δ 2n 2n O.

Bousquet – Statistical Learning Theory – Lecture 5 . π) + H(ρn) + log 1 /(2n) δ • K(ρn.Union Bound (2) ρnP f − ρnPnf ≤ ≤ ≤ ρn − log π(f ) + log 1 /(2n) δ −ρn log π(f ) + log 1 /(2n) δ K(ρn. π) = • H(ρn) = n (f ρn(f ) log ρπ(f )) df Kullback-Leibler divergence ρn(f ) log ρn(f ) df Entropy 126 O.

large entropy) • Not interesting if ρn = δfn O. Bousquet – Statistical Learning Theory – Lecture 5 127 . π) + log 4n + log 1 δ 2n − 1 • Good if ρn is spread (i. ρnP f − ρnPnf ≤ K(ρn.PAC-Bayesian Refinement • It is possible to improve the previous bound.e. • With probability at least 1 − δ .

π) • Markov’s inequality: with probability 1 − δ . π) + log 1 δ O.Proof (1) • Variational formulation of entropy: for any T ρT (f ) ≤ log πe • Apply it to λ(P f − Pnf )2 λρn(P f − Pnf ) ≤ log πe 2 λ(P f −Pn f )2 T (f ) + K(ρ. π) + K(ρn. λρn(P f − Pnf ) ≤ log E πe 2 λ(P f −Pn f )2 + K(ρn. Bousquet – Statistical Learning Theory – Lecture 5 128 .

Bousquet – Statistical Learning Theory – Lecture 5 129 2 .Proof (2) • Fubini E πe λ(P f −Pn f )2 = πE e λ(P f −Pn f )2 • Modified Chernoff bound E e (2n−1)(P f −Pn f )2 ≤ 4n • Putting together (λ = 2n − 1) (2n − 1)ρn(P f − Pnf ) ≤ K(ρn. π) + log 4n + log 1 δ • Jensen (2n − 1)(ρn(P f − Pnf ))2 ≤ (2n − 1)ρn(P f − Pnf )2 O.

Lecture 6 Loss Functions • Properties • Consistency • Examples • Losses and noise O. Bousquet – Statistical Learning Theory – Lecture 6 130 .

Yi) i=1 O.Motivation (1) • ERM: minimize n i=1 1[g(Xi)=Yi] in a set G ⇒ Computationally hard ⇒ Smoothing Replace binary by real-valued functions Introduce smooth loss function n (g(Xi). Bousquet – Statistical Learning Theory – Lecture 6 131 .

losses do not need to be related to densities • Can get bounds in terms of margin error instead of empirical error (smoother → easier to optimize for model selection) O. Bousquet – Statistical Learning Theory – Lecture 6 132 . estimate the density) • However.Motivation (2) • Hyperplanes in infinite dimension have infinite VC-dimension but finite scale-sensitive dimension (to be defined later) ⇒ It is good to have a scale ⇒ This scale can be used to give a confidence (i.e.

Margin • It is convenient to work with (symmetry of +1 and −1) (g(x). Ln(g) = n n φ(Yig(Xi)) i=1 • Loss class F = {f : (x. y) → φ(yg(x)) : g ∈ G} O. Bousquet – Statistical Learning Theory – Lecture 6 133 . y) • Loss 1 L(g) = E [φ(Y g(X))] . y) = φ(yg(x)) • yg(x) is the margin of g at (x.

Minimizing the loss • Decomposition of L(g) 1 E [E [(1 + η(X))φ(g(X)) + (1 − η(X))φ(−g(X))|X]] 2 • Minimization for each x H(η) = inf ((1 + η)φ(α)/2 + (1 − η)φ(−α)/2) α∈R • L∗ := inf g L(g) = E [H(η(X))] O. Bousquet – Statistical Learning Theory – Lecture 6 134 .

O.Classification-calibrated • A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t or that of η ). except possibly for η = 0). Bousquet – Statistical Learning Theory – Lecture 6 135 . • Definition φ is classification-calibrated if. for any η = 0 α:αη≤0 inf (1+η)φ(α)+(1−η)φ(−α) > inf (1+η)φ(α)+(1−η)φ(−α) α∈R • This means the infimum is achieved for an α of the correct sign (and not for an α of the wrong sign.

L(gi) → L ∗ ⇒ R(gi) → R ∗ • When φ is convex (convenient for optimization) φ is classificationcalibrated iff it is differentiable at 0 and φ (0) < 0 O. Bartlett and McAuliffe 2003) • φ is classification-calibrated iff for all sequences gi and every probability distribution P .Consequences (1) Results due to (Jordan. Bousquet – Statistical Learning Theory – Lecture 6 136 .

Bousquet – Statistical Learning Theory – Lecture 6 137 .Consequences (2) • Let H −(η) = inf α:αη≤0 ((1 + η)φ(α)/2 + (1 − η)φ(−α)/2) • Let ψ(η) be the largest convex function below H −(η) − H(η) • One has ψ(R(g) − R ) ≤ L(g) − L ∗ ∗ O.

5 1 1.5 2 O.5 0 −1 −0.5 0−1 hinge squared hinge square exponential 3 2.5 2 1. Bousquet – Statistical Learning Theory – Lecture 6 138 .Examples (1) 4 3.5 0 0.5 1 0.

Bousquet – Statistical Learning Theory – Lecture 6 139 . 1 − x). ψ(x) = x • Square loss φ(x) = (1 − x) . 1 − x) .Examples (2) • Hinge loss φ(x) = max(0. ψ(x) = x • Exponential φ(x) = exp(−x). ψ(x) = 1 − 1 − x2 2 2 2 2 O. ψ(x) = x • Squared hinge loss φ(x) = max(0.

c(R(g) − R ) ψ((R(g) − R ) • Hinge loss (no improvement) R(g) − R ≤ L(g) − L • Square loss or squared hinge loss R(g) − R ≤ (4c(L(g) − L )) 2−α ∗ ∗ 1 ∗ ∗ ∗ α ∗ 1−α /2c) ≤ L(g) − L ∗ O. Bousquet – Statistical Learning Theory – Lecture 6 140 .Low noise conditions • Relationship can be improved under low noise conditions • Under Tsybakov’s condition with exponent α and constant c.

Estimation error • Recall that Tsybakov condition implies P f 2 ≤ c(P f )α for the relative loss class (with 0 − 1 loss) • What happens for the relative loss class associated to φ ? • Two possibilities Strictly convex loss (can modify the metric on R) Piecewise linear O. Bousquet – Statistical Learning Theory – Lecture 6 141 .

Strictly convex losses • Noise behavior controlled by modulus of convexity • Result P f2 δ( ) ≤ P f /2 K with K Lipschitz constant of φ and δ modulus of convexity of L(g) with respect to f − g L (P ) 2 • Not related to noise exponent O. Bousquet – Statistical Learning Theory – Lecture 6 142 .

Piecewise linear losses • Noise behavior related to noise exponent • Result for hinge loss P f ≤ CP f if initial class G is uniformly bounded 2 α O. Bousquet – Statistical Learning Theory – Lecture 6 143 .

Bousquet – Statistical Learning Theory – Lecture 6 144 . for a convex class G . L(g) − L(g ) ≤ C ∗ ∗ 2 (r ) γ + log 1 + log log n δ n • Under Tsybakov’s condition for the hinge loss (and general G ) P f 2 ≤ CP f α L(g) − L(g ) ≤ C ∗ 1 ∗ 2−α (r ) + log 1 + log log n δ n O.Estimation error • With bounded and Lipschitz loss with convexity exponent γ .

Examples Under Tsybakov’s condition • Hinge loss R(g) − R ≤ L(g ) − L + C ∗ ∗ ∗ 1 ∗ 2−α (r ) + log 1 + log log n δ n • Squared hinge loss or square loss δ(x) = cx2. P f 2 ≤ CP f ∗ ∗ ∗ ∗ R(g)−R ≤ C L(g ) − L + C (r + log 1 δ + log log n n 1 2−α ) O. Bousquet – Statistical Learning Theory – Lecture 6 145 .

Classification vs Regression losses • Consider a classification-calibrated function φ • It is a classification loss if L(t) = L∗ • otherwise it is a regression loss O. Bousquet – Statistical Learning Theory – Lecture 6 146 .

exponential losses Noise enters relationship between risk and loss Modulus of convexity enters in estimation error • Hinge loss Direct relationship between risk and loss Noise enters in estimation error ⇒ Approximation term not affected by noise in second case ⇒ Real value does not bring probability information in second case O.Classification vs Regression losses • Square. squared hinge. Bousquet – Statistical Learning Theory – Lecture 6 147 .

Lecture 7 Regularization • Formulation • Capacity measures • Computing Rademacher averages • Applications O. Bousquet – Statistical Learning Theory – Lecture 7 148 .

Equivalent problems
Up to the choice of the regularization parameters, the following problems are equivalent

min Ln(f ) + λΩ(f )
f ∈F

f ∈F :Ln (f )≤e f ∈F :Ω(f )≤R

min

Ω(f )

min

Ln(f )

The solution sets are the same

O. Bousquet – Statistical Learning Theory – Lecture 7

149

Comments
• Computationally, variant of SRM • variant of model selection by penalization ⇒ one has to choose a regularizer which makes sense • Need a class that is large enough (for universal consistency) • but has small balls

O. Bousquet – Statistical Learning Theory – Lecture 7

150

Rates
• To obtain bounds, consider ERM on balls • Relevant capacity is that of balls • Real-valued functions, need a generalization of VC dimension, entropy or covering numbers • Involve scale sensitive capacity (takes into account the value and not only the sign)

O. Bousquet – Statistical Learning Theory – Lecture 7

151

Scale-sensitive capacity • Generalization of VC entropy and VC dimension to real-valued functions • Definition: a set x1. . there exists f ∈ F αi(f (xi) − s(xi)) ≥ ε • The fat-shattering dimension of F at scale ε (denoted vc(F . . . xn is shattered by F (at scale ε) if there exists a function s such that for all choices of αi ∈ {−1. . 1}. Bousquet – Statistical Learning Theory – Lecture 7 152 . ε)) is the maximum cardinality of a shattered set O.

Link with covering numbers
• Like VC dimension, fat-shattering dimension can be used to upper bound covering numbers • Result N (F , t, n) ≤ C1 t

C2 vc(F ,C3 t)

• Note that one can also define data-dependent versions (restriction on the sample)

O. Bousquet – Statistical Learning Theory – Lecture 7

153

Link with Rademacher averages (1)
• Consequence of covering number estimates C1 Rn(F ) ≤ √ n

vc(F , t) log
0

C2 dt t

• Another link via Gaussian averages (replace Rademacher by Gaussian N(0,1) variables) Gn(F ) = Eg 1 sup f ∈F n
n

gif (Zi)
i=1

O. Bousquet – Statistical Learning Theory – Lecture 7

154

Link with Rademacher averages (2)
• Worst case average
n (F )

= sup Eσ sup Gnf
x1 ,...,xn f ∈F n (F )

• Associated ”dimension” t(F , ) = sup{n ∈ N : • Result (Mendelson & Vershynin 2003) vc(F , c ) ≤ t(F , ) ≤ K
2

≥ }

vc(F , c )

O. Bousquet – Statistical Learning Theory – Lecture 7

155

Rademacher averages and Lipschitz losses • What matters is the capacity of F (loss class) • If φ is Lipschitz with constant M • then Rn(F ) ≤ M Rn(G) • Relates to Rademacher average of the initial class (easier to compute) O. Bousquet – Statistical Learning Theory – Lecture 7 156 .

Dualization • Consider the problem min • Rademacher of ball g ≤R Ln(g) Eσ sup Rng g ≤R • Duality Eσ ∗ sup Rnf f ≤R R = Eσ n n ∗ σiδXi i=1 dual norm. δXi evaluation at Xi (element of the dual under appropriate conditions) O. Bousquet – Statistical Learning Theory – Lecture 7 157 .

Bousquet – Statistical Learning Theory – Lecture 7 158 . ·) O.RHKS Given a positive definite kernel k • Space of functions: reproducing kernel Hilbert space associated to k • Regularizer: rkhs norm · k • Properties: Representer theorem n gn = i=1 αik(Xi.

Shattering dimension of hyperplanes • Set of functions G = {g(x) = w · x : w = 1} • Assume x ≤ R • Result vc(G. ρ) ≤ R /ρ 2 2 O. Bousquet – Statistical Learning Theory – Lecture 7 159 .

. r.e. the larger r i=1 yi xi must be r • upper bound the size of i=1 yi xi in terms of R Combining the two tells us how many points we can at most shatter O. . Bousquet – Statistical Learning Theory – Lecture 7 160 . xi ≥ ρ for all i = 1. . 1997) Assume that x1. . . . yr ∈ {±1}. there exists a w such that yi w. Two steps: (2) • prove that the more points we want to shatter (2). .. . xr are ρ-shattered by hyperplanes with w = 1. for all y1. .Proof Strategy (Gurvits. . i. . .

Bousquet – Statistical Learning Theory – Lecture 7 161 . i=1 y i xi ≤ w i=1 y i xi = i=1 y i xi • Combine both: rρ ≤ r y i xi . ( r i=1 yixi) ≥ rρ • By Cauchy-Schwarz inequality r r r w.Part I • Summing (2) yields w. i=1 (3) O.

Part II
Consider labels yi ∈ {±1}, as (Rademacher variables).    
r 2 r

E
i=1

y i xi

=

E
i,j=1 r

y i y j xi , xj  
E [ xi, xi ] + E 
r


xi , xj 

=
i=1 r

i=j 2

=
i=1

xi

O. Bousquet – Statistical Learning Theory – Lecture 7

162

Part II, ctd.
• Since xi ≤ R, we get E
r i=1

y i xi

2

≤ rR2.

• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set. • Hence
r 2

yixi
i=1

≤ rR .

2

O. Bousquet – Statistical Learning Theory – Lecture 7

163

Part I and II Combined
• Part I: (rρ) ≤ • Part II: • Hence r ρ ≤ rR ,
i.e.,
2 2 2 r i=1 2 r i=1 2

y i xi

2

y i xi

≤ rR2

R2 r≤ 2 ρ

O. Bousquet – Statistical Learning Theory – Lecture 7

164

Bousquet – Statistical Learning Theory – Lecture 7 165 .Boosting Given a class H of functions • Space of functions: linear span of H • Regularizer: 1-norm of the weights αihi} g = inf{ |αi| : g = • Properties: weight concentrated on the (weighted) margin maximizers gn = wh h diYih (Xi) diYih(Xi) = min h ∈H O.

Bousquet – Statistical Learning Theory – Lecture 7 166 .Rademacher averages for boosting • Function class of interest GR = {g ∈ span H : g 1 ≤ R} • Result Rn(GR ) = RRn(H) ⇒ Capacity (as measured by global Rademacher averages) not affected by taking linear combinations ! O.

Bousquet – Statistical Learning Theory – Lecture 8 167 .Lecture 8 SVM • Computational aspects • Capacity Control • Universality • Special case of RBF kernel O.

Formulation (1) • Soft margin 1 min w w.ξ 2 m 2 +C i=1 ξi yi( w. xi + b) ≥ 1 − ξi ξi ≥ 0 • Convex objective function and convex constraints • Unique solution • Efficient procedures to find it → Is it the right criterion ? O.b. Bousquet – Statistical Learning Theory – Lecture 8 168 .

b 2 m 2 ∗ m +C i=1 max(0. 1 − yi( w. Bousquet – Statistical Learning Theory – Lecture 8 169 . xi + b) ≥ 1 − ξi.b. xi + b)) O. 1 − yi( w.Formulation (2) • Soft margin 1 2 ξi w +C min w. xi + b)) • Substitute above to get 1 min w w. ξi ≥ 0 • Optimal value of ξi ξi = max(0.ξ 2 i=1 yi( w.

Bousquet – Statistical Learning Theory – Lecture 8 170 .Regularization General form of regularization problem 1 min f ∈F m n c(yif (xi)) + λ f i=1 2 → Capacity control by regularization with convex cost 1 0 1 O.

Loss Function φ(Y f (X)) = max(0. 1 − Y f (X)) • Convex. Bousquet – Statistical Learning Theory – Lecture 8 171 . non-increasing. upper bounds 1[Y f (X)≤0] • Classification-calibrated • Classification type (L∗ = L(t)) R(g) − R ≤ L(g) − L ∗ ∗ O.

Regularization Choosing a kernel corresponds to • Choose a sequence (ak ) • Set f 2 (k) 2 := k≥0 ak |f | dx ⇒ penalization of high order derivatives (high frequencies) ⇒ enforce smoothness of the solution O. Bousquet – Statistical Learning Theory – Lecture 8 172 .

. Bousquet – Statistical Learning Theory – Lecture 8 173 .Capacity: VC dimension • The VC dimension of the set of hyperplanes is d + 1 in Rd. Dimension of feature space ? ∞ for RBF kernel • w choosen in the span of the data (w = αiyixi) The span of the data has dimension m for RBF kernel (k(. xi) linearly independent) • The VC bound does not give any information h =1 m ⇒ Need to take the margin into account O.

vc(hyperplanes with margin ρ. 1) ≤ R2/ρ2 O. Bousquet – Statistical Learning Theory – Lecture 8 174 .Capacity: Shattering dimension Hyperplanes with Margin If x ≤ R.

Bousquet – Statistical Learning Theory – Lecture 8 175 .Margin • The shattering dimension is related to the margin • Maximizing the margin means minimizing the shattering dimension • Small shattering dimension ⇒ good control of the risk ⇒ this control is automatic (no need to choose the margin beforehand) ⇒ but requires tuning of regularization parameter O.

Bousquet – Statistical Learning Theory – Lecture 8 176 . xi) i=1 • Trace of the Gram matrix • Notice that Rn ≤ R2/(n2ρ2) O.Capacity: Rademacher Averages (1) • Consider hyperplanes with w ≤ M • Rademacher average M √ n 2 n i=1 M k(xi. xi) ≤ Rn ≤ n n k(xi.

δxi w. w 1 n 1 n n i=1 n = ≤ = E sup E sup w ≤M σiδxi σiδxi σiδxi w ≤M i=1 n M E n n i=1 σiδxi . Bousquet – Statistical Learning Theory – Lecture 8 177 . i=1 O.Rademacher Averages (2) n i=1 E sup 1 w ≤M n σi w.

σiδxi i.Rademacher Averages (3) M E n ≤ = = M n M n M n E E n i=1 n i=1 σiδxi . δxj k(xi. n i=1 n i=1 σiδxi n i=1 σiδxi .j σiσj δxi . Bousquet – Statistical Learning Theory – Lecture 8 178 . xi) O.

with g E (φ(Y g(X)) − φ(Y t(X))) 2 ∞ ≤M ∗ ≤ (M −1+2/η0)(L(g)−L ) . → If noise is nice. Bousquet – Statistical Learning Theory – Lecture 8 179 . variance linearly related to expectation → Estimation error of order r ∗ (of the class G ) O.Improved rates – Noise condition • Under Massart’s condition (|η| > η0).

Bousquet – Statistical Learning Theory – Lecture 8 180 .Improved rates – Capacity (1) ∗ • rn related to decay of eigenvalues of the Gram matrix rn ≤ ∗ c min d + n d∈N   λj  j>d • Note that d = 0 gives the trace bound ∗ • rn always better than the trace bound (equality when λi constant) O.

Improved rates – Capacity (2) Example: exponential decay • λi = e−αi • Global Rademacher of order ∗ • rn of order 1 √ n log n n O. Bousquet – Statistical Learning Theory – Lecture 8 181 .

. • Wrong power (M 2 penalty) is used in the algorithm • Computationally easier • But does not give λ a dimension-free status • Using M could improve the cutoff detection ∗ O..Exponent of the margin • Estimation error analysis shows that in GM = {g : g ≤ M } R(gn) − R(g ) ≤ M. Bousquet – Statistical Learning Theory – Lecture 8 182 .

Kernel Why is it good to use kernels ? • Gaussian kernel (RBF) k(x. Bousquet – Statistical Learning Theory – Lecture 8 183 . y) = e − x−y 2 2σ 2 • σ is the width of the kernel → What is the geometry of the feature space ? O.

) ≥ 0 → positive quadrant O. Φ(x) Φ(y) =e − x−y 2 /2σ 2 ≥0 → Angles less than 90 degrees • Φ(x) = k(x.RBF Geometry • Norms Φ(x) → sphere of radius 1 • Angles cos(Φ(x). Φ(x) = e = 1 0 Φ(y) Φ(x) . Φ(y)) = 2 = Φ(x). . Bousquet – Statistical Learning Theory – Lecture 8 184 .

RBF O. Bousquet – Statistical Learning Theory – Lecture 8 185 .

Bousquet – Statistical Learning Theory – Lecture 8 186 .RBF Differential Geometry • Flat Riemannian metric → ’distance’ along the sphere is equal to distance in input space • Distances are contracted → ’shortcuts’ by getting outside the sphere O.

.RBF Geometry of the span Ellipsoid x^2/a^2 + y^2/b^2 = 1 b a • K = (k(xi. Bousquet – Statistical Learning Theory – Lecture 8 187 . . . λm √ √ • Data points mapped to ellispoid with lengths λ1. λm O. . . . . . xj )) Gram matrix • Eigenvalues λ1.

Bousquet – Statistical Learning Theory – Lecture 8 188 ∞ norm) by . ·) : x ∈ X } • H is dense in C(X ) → Any continuous function can be approximated (in the functions in H ⇒ with enough data one can construct any function O.RBF Universality • Consider the set of functions H = span{k(x.

Bousquet – Statistical Learning Theory – Lecture 8 189 .RBF Eigenvalues • Exponentially decreasing • Fourier domain: exponential penalization of derivatives • Enforces smoothness with respect to the Lebesgue measure in input space O.

everything orthogonal • σ→∞ linear classifier in input space All points very close on the sphere. initial geometry • Tuning σ allows to try all possible intermediate combinations O.RBF Induced Distance and Flexibility • σ→0 1-nearest neighbor in input space Each point in a separate dimension. Bousquet – Statistical Learning Theory – Lecture 8 190 .

RBF Ideas • Works well if the Euclidean distance is good • Works well if decision boundary is smooth • Adapt smoothness via σ • Universal O. Bousquet – Statistical Learning Theory – Lecture 8 191 .

Choosing the Kernel • Major issue of current research • Prior knowledge (e. O.g.. distance) • Cross-validation (limited to 1-2 parameters) • Bound (better with convex class) ⇒ Lots of open questions. Bousquet – Statistical Learning Theory – Lecture 8 192 . invariances..

g.Learning Theory: some informal thoughts • Need assumptions/restrictions to learn • Data cannot replace knowledge • No universal learning (simplicity measure) • SVM work because of capacity control • Choice of kernel = choice of prior/ regularizer • RBF works well if Euclidean distance meaningful • Knowledge improves performance (e. Bousquet – Statistical Learning Theory – Lecture 8 193 . invariances) O.

Sign up to vote on this title
UsefulNot useful