3: Is Learning Feasible?
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main
Dichotomies
Growth Function
VC Dimension
Examples
Bibliography
i.e. we can bound the probability that X differs from its mean, and the bound decays exponentially in N.
We will set $X = E_{in}(h)$ and use $\mathbb{E}[X] = E_{out}(h)$ to generalise this and prove the Generalization Bound for ML.
Generalization Strategy
Assume we fix the hypothesis h ∈ H in advance. Then from Hoeffding's
inequality we conclude:
\[ \mathbb{P}\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2e^{-2\epsilon^2 N} \]
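As a quick numerical sanity check (a sketch, not from the slides; the coin model and all names are illustrative), we can estimate the left-hand side by Monte Carlo for a fair coin, where $E_{out} = 0.5$ and $E_{in}$ is the sample mean:

```python
# Monte-Carlo check of Hoeffding's inequality for a fair coin.
# Here E_out = 0.5 (true head probability) and E_in is the sample mean;
# all variable names are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 100, 0.1, 100_000

E_out = 0.5
E_in = rng.binomial(N, E_out, size=trials) / N   # in-sample means
empirical = np.mean(np.abs(E_in - E_out) > eps)  # P[|E_in - E_out| > eps]
bound = 2 * np.exp(-2 * eps**2 * N)

print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
# The bound (~0.27) must dominate the empirical frequency (~0.035).
```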
To handle more than one hypothesis we use the union bound
\[ \mathbb{P}\Big[\bigcup_i A_i\Big] \le \sum_i \mathbb{P}[A_i], \]
which holds true for any probability measure and set of events $A_i$. It is crude, however:
[Figure: three Venn diagrams of events $A_1, A_2, A_3$ — heavily overlapping ($P(A_1 \cup A_2 \cup A_3) \approx P(A_i)$), partially overlapping ($\approx 2P(A_i)$), and disjoint ($\approx 3P(A_i)$).]
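How crude the union bound is depends on the overlap; a toy sketch with three hypothetical events on a fair die (not from the slides) makes this concrete:

```python
# Union bound vs. exact probability for heavily overlapping events.
# Events on a fair die; the sets are hypothetical, chosen for illustration.
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

p = lambda s: len(s) / 6
exact = p(A1 | A2 | A3)              # = 5/6
union_bound = p(A1) + p(A2) + p(A3)  # = 3/2, a useless bound > 1

print(f"exact: {exact:.3f}  union bound: {union_bound:.3f}")
```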
|H| = ∞!
Number of Hypotheses – Examples
How many hypotheses do we have for the perceptron?
Answer: infinitely many, as any choice of the weight vector $w$ is a hypothesis!
But how many are really different?
[Figure: four sample points in the $(x_1, x_2)$ plane, two labelled $\circ$ and two labelled $+$; many different lines realize the same dichotomy.]
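A sketch of the "really different" question (the point set and sampling scheme are my own choices): sample random perceptrons and count the distinct dichotomies they induce on four fixed points.

```python
# Sample random perceptrons h(x) = sign(w·x + b) and count how many
# *distinct* dichotomies they induce on a fixed set of 4 points.
# The point set and the sampling are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]])  # 4 points

patterns = set()
for _ in range(100_000):
    w = rng.normal(size=2)
    b = rng.normal()
    patterns.add(tuple(np.sign(X @ w + b).astype(int)))

print(len(patterns))  # at most 14 < 2^4 = 16: infinitely many w, few dichotomies
```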
Strategy of Generalization
1. We will show that there exists a function $m_H(N)$ measuring the real,
i.e. effective, complexity of the hypothesis set $H$.
2. This function can be used to replace $M$ in the union bound:
\[ \mathbb{P}\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2M e^{-2\epsilon^2 N} \]
Definition 2
Let $H$ be a set of hypotheses. Let $x_1, \ldots, x_N \in D$. Then the dichotomies
generated by $H$ are defined by
\[ H(x_1, \ldots, x_N) := \{\,(h(x_1), \ldots, h(x_N)) \mid h \in H\,\}. \]
Thus, the growth function
\[ m_H(N) := \max_{x_1, \ldots, x_N} |H(x_1, \ldots, x_N)| \]
is the maximum number of dichotomies that can be generated by $H$ on any $N$
points; hence it can be interpreted as taking the worst possible
$x_1, \ldots, x_N$ (most difficult to separate). It is a measure of the
complexity of the hypothesis set $H$. (Note that $m_H(N) \le 2^N$ – why?)
Questions:
1. Can we bound $m_H(N)$ by a polynomial in $N$?
2. Can we replace $|H|$ by $m_H(N)$ in the generalization bound?
Perceptron (cont.)
Remember, the growth function is defined as the max; thus, to prove that
$m_H(3) < 2^3 = 8$ we would need that no configuration of 3 points can be
shattered. So let us look at the generic configuration, i.e. 3
non-collinear points:
[Figure: all $2^3 = 8$ dichotomies of three non-collinear points in the $(x_1, x_2)$ plane, each realized by a line.]
Thus, $m_H(3) = 2^3 = 8$.
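The shattering claim can be verified mechanically: a labeling $y$ is realizable by a perceptron iff the linear program $y_i(w \cdot x_i + b) \ge 1$ is feasible. A sketch (the helper name and the coordinates are illustrative):

```python
# Exact check that 3 non-collinear points are shattered by the perceptron.
# A labeling y is linearly separable iff some (w, b) satisfies
# y_i (w·x_i + b) >= 1 for all i, an LP feasibility problem.
import itertools
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    # Feasibility LP: minimize 0 subject to -y_i (w·x_i + b) <= -1.
    A = -(y[:, None] * np.c_[X, np.ones(len(X))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A,
                  b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear
count = sum(is_separable(X3, np.array(y))
            for y in itertools.product([-1, 1], repeat=3))
print(count)  # 8: every dichotomy is realized, so m_H(3) = 2^3
```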
Perceptron (cont.)
What about k = 4?
[Figure: four points labelled in an XOR pattern ($+$ and $\circ$ on opposite corners); no line realizes this dichotomy, so this configuration of 4 points is not shattered.]
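Reusing `is_separable` from the sketch above on four points in convex position (coordinates again illustrative) shows that only 14 of the 16 labelings are realizable, the two XOR patterns being the exceptions:

```python
# Four points in convex position: the two XOR labelings are not separable,
# so this configuration realizes only 14 of 2^4 = 16 dichotomies.
X4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
count4 = sum(is_separable(X4, np.array(y))
             for y in itertools.product([-1, 1], repeat=4))
print(count4)  # 14 < 16, consistent with k = 4 being a break point
```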
[Figure: $N$ points on the real line between $-3$ and $3$, once with a single threshold $a$ and once with two thresholds $a$ and $b$ (positive rays and positive intervals).]
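Assuming the figure depicts the classic one-dimensional "positive rays" $h_a(x) = \mathrm{sign}(x - a)$, the growth function can be enumerated directly; the sketch below confirms $m_H(N) = N + 1$:

```python
# Enumerate the dichotomies of "positive rays" h_a(x) = sign(x - a):
# sliding the threshold a over N sorted points yields exactly N + 1 patterns.
# (Assumes the figure above shows this hypothesis set.)
import numpy as np

def dichotomies_positive_rays(xs):
    xs = np.sort(xs)
    # one threshold below all points, one between each pair, one above all
    thresholds = np.r_[xs[0] - 1, (xs[:-1] + xs[1:]) / 2, xs[-1] + 1]
    return {tuple(np.where(xs > a, 1, -1)) for a in thresholds}

rng = np.random.default_rng(2)
for N in range(1, 6):
    xs = rng.uniform(size=N)
    print(N, len(dichotomies_positive_rays(xs)))  # prints N, N + 1
```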
Definition 4
If $H$ can generate all possible dichotomies on a set $x_1, \ldots, x_N$, i.e.
$H(x_1, \ldots, x_N) = \{-1, +1\}^N$, we say that $H$ shatters the set
$x_1, \ldots, x_N$.
Definition 5
If no data set of size $k$ can be shattered by $H$, then $k$ is said to be a
break point for $H$. (For the 2-dimensional perceptron, $k = 4$ is a break point.)
Define $B(N, k)$ as the maximum number of dichotomies on $N$ points such that
no subset of $k$ points is shattered. Example ($N = 4$, $k = 2$):

x1  x2  x3  x4
◦   ◦   ◦   ◦
◦   ◦   ◦   •
◦   ◦   •   ◦
◦   •   ◦   ◦
•   ◦   ◦   ◦
•   •   ◦   ◦

Adding the sixth line would shatter $\{x_1, x_2\}$; thus it is easy to check that
$B(4, 2) = 5$.
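For tiny $N$ the quantity $B(N, k)$ can be computed by exhaustive search, confirming the value above (a brute-force sketch, infeasible beyond very small $N$):

```python
# Brute-force B(N, k): largest set of dichotomies on N points such that
# no subset of k points is shattered (all 2^k patterns present on it).
# Feasible only for tiny N; illustrative sketch.
import itertools

def shatters(dichotomies, cols):
    patterns = {tuple(d[c] for c in cols) for d in dichotomies}
    return len(patterns) == 2 ** len(cols)

def B(N, k):
    all_d = list(itertools.product([0, 1], repeat=N))
    for r in range(len(all_d), 0, -1):   # largest candidate size first
        for subset in itertools.combinations(all_d, r):
            if not any(shatters(subset, cols)
                       for cols in itertools.combinations(range(N), k)):
                return r
    return 0

print(B(4, 2))  # 5, matching the table argument above
```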
If $k$ is a break point for $H$, then
\[ m_H(N) \le B(N, k). \]
Proof.
Strategy: $m_H(N) \le \max\{m_H(N) \mid H \text{ has break point } k\} \le B(N, k) \le \sum_{i=0}^{k-1} a_i N^i$ for suitable coefficients $a_i$, i.e. a polynomial in $N$ of degree at most $k - 1$.
Bound for B(N, k)
We try to bound $B(N, k)$ (we will compute an exact solution below, too,
but the bound is all we need).
Suppose we know that

Lemma 8
The following recursion inequality holds:
\[ B(N, k) \le B(N - 1, k) + B(N - 1, k - 1). \]

Then it follows:

Lemma 9 (Sauer's Lemma)
\[ B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}. \]
        k=1   k=2   k=3   k=4   k=5   k=6   ...
N=1      1     2     2     2     2     2    ...
N=2      1     3     4     4     4     4    ...
N=3      1     4     7     8     8     8    ...
N=4      1     5    11     …     …     …
N=5      1     6     …     …     …     …

(Each entry is bounded by the sum of the entry above it and the entry above-left, per the recursion of Lemma 8.)
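The table follows from the base cases stated below and the recursion of Lemma 8; a short sketch fills it in and compares it with Sauer's binomial sum (they coincide, by Pascal's rule):

```python
# Fill the upper-bound table from B(N, 1) = 1, B(1, k) = 2 and the recursion
# B(N, k) <= B(N-1, k) + B(N-1, k-1), then compare with Sauer's sum.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 6):
    row = [B_upper(N, k) for k in range(1, 7)]
    sauer = [sum(comb(N, i) for i in range(k)) for k in range(1, 7)]
    print(N, row, row == sauer)  # rows match the table; bound equals the sum
```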
We have
\[ B(N, 1) = 1 \quad \forall N, \qquad B(1, k) = 2 \quad \forall k > 1. \]
We have $B(N, 1) = 1$ as only one dichotomy is allowed (if we had at least two,
they would have to differ in at least one place, and thus a subset of size 1 would be
shattered). We have $B(1, k) = 2$ as the condition is void.
Consider a set of dichotomies $S$ on $x_1, \ldots, x_N$ with no shattered subset of size $k$, achieving $|S| = B(N, k)$, for $N \ge 2$ and $k \ge 2$.
We group $S$ into three sets $S = S_1 \cup S_2^- \cup S_2^+$, where $S_2^\pm$ contains those
dichotomies that come in pairs: identical on the first $x_1, \ldots, x_{N-1}$ coordinates and
differing only in the last coordinate, which is $-1$ (the set $S_2^-$) or $+1$ (the set $S_2^+$).
The set $S_1$ contains the remaining dichotomies.
Define $\alpha := |S_1|$ and $\beta := |S_2^-| = |S_2^+|$.
Proof of Recursion Inequality
Proof (cont.)
Restricting the dichotomies in $S_1 \cup S_2^+$ to the first $N - 1$ points yields $\alpha + \beta$ distinct dichotomies, hence
\[ \alpha + \beta \le B(N - 1, k). \]
We claim that
\[ \beta \le B(N - 1, k - 1). \]
Indeed, if a subset of size $k - 1$ of the $N - 1$ points could be shattered by the
dichotomies of $S_2^+$, then we could take the corresponding dichotomies in $S_2^-$ and
add $x_N$ to the data points. This yields a subset of size $k$ that can be
shattered, contradicting the definition of $B(N, k)$.
Putting everything together we get the desired inequality:
\[ B(N, k) = \alpha + 2\beta = (\alpha + \beta) + \beta \le B(N - 1, k) + B(N - 1, k - 1). \]
Proof.
We only have to show $B(N, k) \ge \sum_{i=0}^{k-1} \binom{N}{i}$.
Consider the dichotomy having zero $-1$'s. Clearly no subset of
size $k$ can be shattered. (Why?)
Consider the set of dichotomies having $i$ $-1$'s, with $i < k$. Clearly
no subset of size $k$ can be shattered. (Why?)
Consider the union of all of these. Clearly no subset of size $k$ of this
set can be shattered.
There are $\binom{N}{i}$ dichotomies having exactly $i$ $-1$'s, which proves the claim.
[Figure: two growth curves over $N$, a rapidly growing one labelled "Bad" and a slower one labelled "Good" (exponential versus polynomial growth).]
Proof.
To prove that, $\forall N \in \mathbb{N}$, the points $\{2^1, 2^2, \ldots, 2^N\}$ are shattered by
hypotheses of the form $h(x) = \mathrm{sign}(\sin(\omega x))$, let $d = 0.y_1 y_2 \ldots y_N$ be a
binary encoding of the desired labels $y_i$, where we map $-1$ to $0$ and $+1$ to $1$.
Let $d'$ be the corresponding decimal number and define $\omega = -\pi d'$.
Essentially, each $x_i$ bit-shifts $\omega$ to produce the desired label, as a
result of the fact that $\mathrm{sign}(\sin(\pi z)) = (-1)^{\lfloor z \rfloor}$.
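A numerical sketch of the argument (the tiny trailing term $2^{-(N+1)}$ is my own safeguard against the degenerate case $\sin(\pi \cdot \text{integer}) = 0$):

```python
# Verify that x_i = 2^i, i = 1..N, are shattered by h(x) = sign(sin(w x)).
# Labels y_i in {-1,+1} are encoded as bits b_i = (y_i+1)/2 of d = 0.b1...bN,
# and w = -pi * d; then sign(sin(w * 2^i)) reproduces y_i.
# The small trailing term avoids sin(pi * integer) = 0 (my own safeguard).
import numpy as np

N = 8
xs = 2.0 ** np.arange(1, N + 1)
rng = np.random.default_rng(3)

for _ in range(100):                  # random labelings (all 2^N also feasible)
    y = rng.choice([-1, 1], size=N)
    bits = (y + 1) // 2
    d = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits)) + 2.0 ** -(N + 1)
    w = -np.pi * d
    assert np.all(np.sign(np.sin(w * xs)) == y)
print("all tested labelings reproduced")
```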
Lemma 14
For a hypothesis set $H$ composed of $H_1, \ldots, H_K$ via a combining set $\tilde{H}$:
\[ m_H(N) \le m_{\tilde{H}}(N) \prod_{i=1}^{K} m_{H_i}(N) \]
This can be used to reason about composed machines, e.g. neural networks!
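A rough sketch of such reasoning (the architecture and the VC dimensions are illustrative): bound the growth function of three 2-input perceptrons feeding one 3-input combining perceptron via Lemma 14 and Sauer's bound.

```python
# Lemma 14 for a toy "network": K = 3 perceptron units on 2 inputs
# (d_VC = 3) feeding one combining perceptron on 3 inputs (d_VC = 4).
# Growth functions are replaced by Sauer's bound; all numbers illustrative.
from math import comb

def sauer_bound(N, d_vc):
    # m_H(N) <= sum_{i=0}^{d_VC} C(N, i) when the break point is d_VC + 1
    return sum(comb(N, i) for i in range(d_vc + 1))

def m_composed_bound(N, K=3):
    return sauer_bound(N, 4) * sauer_bound(N, 3) ** K

for N in (10, 100, 1000):
    print(N, m_composed_bound(N))  # grows polynomially in N, not like 2^N
```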
Vapnik-Chervonenkis Theorem
[Figure: error versus VC dimension $d_{VC}$; the in-sample error decreases and the model complexity term increases, so the out-of-sample error is minimized at an intermediate $d_{VC}$.]
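To get a feel for the sample sizes the theorem implies, one can solve the VC bound for $N$ by fixed-point iteration. The form $N \ge \frac{8}{\epsilon^2}\ln\frac{4\,m_H(2N)}{\delta}$ with $m_H(N) \le N^{d_{VC}} + 1$ is the one used in the Learning From Data textbook; I am assuming the slides state an equivalent version, and all constants below are illustrative:

```python
# Iteratively estimate the sample size N so that, with probability at least
# 1 - delta, |E_in - E_out| <= eps, using the VC bound with
# m_H(N) <= N^d_vc + 1. Constants follow the LFD textbook treatment;
# this is a rough rule of thumb, not a formula taken from the slides.
import math

def sample_complexity(d_vc, eps=0.1, delta=0.05, iters=100):
    N = 1000.0                      # initial guess
    for _ in range(iters):          # fixed-point iteration
        N = (8 / eps**2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return math.ceil(N)

print(sample_complexity(3))  # on the order of tens of thousands of examples
```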
Conclusion
Nonetheless, it is conceptually the most important mathematical theorem
for ML, as it proves that learning is possible at all and provides guidance
for model (complexity) selection.