
Learning From Data

3: Is Learning Feasible?
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Wissen durch Praxis stärkt (Knowledge through practice strengthens)


Content
Motivation

Dichotomies

Growth Function

Bounding the Growth Function

VC Dimension

Examples

When is Learning Feasible?

Bibliography



By far, the greatest danger of Artificial Intelligence is that people
conclude too early that they understand it.
– Eliezer Yudkowsky



Recap and Proof Strategy
We know that deterministic learning is not feasible.
However, what if we just want to estimate the probability that we are wrong by a certain amount?
We will use Hoeffding's inequality from probability theory to learn something about these probabilities.
Let Xi be independent random variables taking values in [0, 1], and let X := (1/N) Σ_{i=1}^N Xi be their average with mean E[X]. Then, according to Hoeffding's inequality, we have:

P[|X − E[X]| > ε] ≤ 2e^{−2ε²N}

i.e. the probability that X differs from its mean by more than ε is exponentially small in N.
We will set X = Ein(h), use E[X] = Eout(h), and generalise this to prove the Generalization Bound for ML.
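The following is a small numerical sketch (Python, not part of the original slides) of the bin experiment discussed on the next slides: it estimates P[|ν − µ| > ε] by simulation and compares it with the Hoeffding bound 2e^{−2ε²N}. The parameters µ, N, ε and the number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions): true fraction of red marbles mu,
# sample size N, tolerance eps, and number of repeated experiments.
mu, N, eps, trials = 0.6, 100, 0.1, 100_000

# Each experiment draws N marbles and records the sample fraction nu of red ones.
nu = rng.binomial(N, mu, size=trials) / N

empirical = np.mean(np.abs(nu - mu) > eps)   # estimate of P[|nu - mu| > eps]
hoeffding = 2 * np.exp(-2 * eps**2 * N)      # distribution-free Hoeffding bound

print(f"empirical P[|nu - mu| > {eps}] ~ {empirical:.4f}")
print(f"Hoeffding bound               = {hoeffding:.4f}")
```

The empirical frequency stays below the bound for any choice of µ, illustrating that the bound is distribution-free.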
Generalization Strategy
Assume we fix the hypothesis h ∈ H in advance. Then from Hoeffding's inequality we conclude:

P[|Ein(h) − Eout(h)| > ε] ≤ 2e^{−2ε²N}

We can calculate (the random variable) Ein from the data D.
We cannot calculate Eout.
Assume that Ein is small on the data D, i.e. f(x) ≈ h(x) on D. Then we can conclude with high probability that Eout ≈ 0. This means P[f(x) ≠ h(x)] ≈ 0. Thus we have learned something about the unknown f from the data D.
Assume that Ein is not small on the data D, i.e. that the machine learned the training set badly. Then we know that, with high probability, it will fail on unknown data too. However, we have not learned much in this case.
Illustration
(Exposition inspired by [AMMIL12].)
Take a bin with red and green marbles.
µ := P[picking red marble]
P[picking green marble] = 1 − µ
µ is unknown.
We repeat N times independently.
The fraction of red marbles in the sample is denoted by ν.
Question: Does ν tell us anything about µ?
Answer: Yes and no. µ is probably close to ν!


Hoeffding
According to Hoeffding's inequality:

P[|ν − µ| > ε] ≤ 2e^{−2ε²N}

i.e. µ ≈ ν is probably approximately correct (P.A.C.)!

Note that the bound does not depend on µ, i.e. the bound is distribution-free.
In learning,
the unknown is a function f : X → Y, not just a number µ.
however, each marble is a point x ∈ X:
a point where the hypothesis got it right, h(x) = f(x), is green,
a point where the hypothesis got it wrong, h(x) ≠ f(x), is red.
However,

P[|Ein(h) − Eout(h)| > ε] ≤ 2e^{−2ε²N}

is wrong for the learned hypothesis. Hoeffding is valid for a single, fixed h ∈ H!
Generalization
In reality, however,
the hypothesis h ∈ H is not fixed in advance, but rather computed from the training data.
Hence, it becomes a random variable and we cannot apply Hoeffding directly.
Let us assume that |H| = M, i.e. that we have only finitely many hypotheses. Then we have, by the union bound:

P[|Ein(h) − Eout(h)| > ε] ≤ P[⋃_{i=1}^M {|Ein(hi) − Eout(hi)| > ε}]
                          ≤ Σ_{i=1}^M P[|Ein(hi) − Eout(hi)| > ε]
                          ≤ Σ_{i=1}^M 2e^{−2ε²N} = M · 2e^{−2ε²N}


The Union Bound
We have used the union bound:
P[⋃_{i=1}^M Ai] ≤ Σ_{i=1}^M P(Ai),

which holds true for any probability measure and set of events Ai. It is crude, however:

[Figure: three sketches of events A1, A2, A3 — almost coinciding (P(A1 ∪ A2 ∪ A3) ≈ P(Ai)), partially overlapping (P(A1 ∪ A2 ∪ A3) ≈ 2P(Ai)), and disjoint (P(A1 ∪ A2 ∪ A3) ≈ 3P(Ai)).]


Learning with Finitely Many Hypotheses

Nonetheless, learning with finitely many hypotheses is feasible, as we have

Lemma 1
Let |H| = M. Then we have

P[|Ein(h) − Eout(h)| > ε] ≤ M · 2e^{−2ε²N}

We face, however, a serious problem now:

For any interesting learning machine, we have

|H| = ∞!
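As a small illustration (not part of the original slides), Lemma 1 can be solved for the sample size N that makes the right-hand side at most a chosen confidence level δ; the numbers below are assumptions for illustration.

```python
import math

# Illustrative values (assumptions): tolerance eps, confidence delta, |H| = M.
eps, delta, M = 0.1, 0.05, 1000

# Lemma 1: P[|Ein(h) - Eout(h)| > eps] <= 2 * M * exp(-2 * eps^2 * N).
# Requiring the bound to be <= delta and solving for N:
N = math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))
print(f"M = {M}: N >= {N} examples make the bound <= {delta}")
```

Note how N grows only logarithmically in M; this is why replacing M by a polynomially growing quantity (the growth function below) will still give a useful bound.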


Number of Hypotheses – Examples
How many hypotheses do we have for the perceptron?
Answer: infinitely many, as any choice of the weight vector w is a hypothesis!
But how many are really different?

[Figure: four points in the plane, two labelled o (top) and two labelled + (bottom).]
Strategy of Generalization

1. We will show that there exists a function mH(N) measuring the real, i.e. effective, complexity of the hypothesis set.
2. This function can be used to replace M in the union bound:

P[|Ein(h) − Eout(h)| > ε] ≤ M · 2e^{−2ε²N}

3. This will yield the famous Vapnik-Chervonenkis Bound:

P[|Ein(h) − Eout(h)| > ε] ≤ 4 mH(2N) e^{−ε²N/8}


Dichotomies

Definition 2
Let H be a set of hypotheses. Let x1, . . . , xN ∈ D. Then the dichotomies generated by H are defined by

H(x1, . . . , xN) := {(h(x1), . . . , h(xN)) | h ∈ H}.

One can think of H(x1, . . . , xN) as a projection of H onto the given data set x1, . . . , xN ∈ D. Note that H(x1, . . . , xN) is finite, even if H contains infinitely many hypotheses.


Growth Function
Definition 3
The growth function of a hypothesis set H is defined as a map from the natural numbers to the natural numbers:

mH(N) := max_{x1,...,xN ∈ X} |H(x1, . . . , xN)|,

where | · | denotes the cardinality of the set.

Thus, the growth function mH(N) is the maximum number of dichotomies that can be generated by H on any N points; hence it can be interpreted as evaluating H on the worst possible x1, . . . , xN (most difficult to separate). It is a measure of the complexity of the hypothesis set H. (Note that mH(N) ≤ 2^N – why?)
Questions:
1. Can we bound mH(N) by a polynomial in N?
2. Can we replace |H| by mH(N) in the generalization bound?


Examples of Growth Functions – Perceptron
Let us compute the growth function for the perceptron (for small N).
mH(1) = 2 (Why?)
mH(2) = 4 (Why?)
For mH(3) we look at three collinear points.

Obviously, this configuration cannot be shattered (see Definition 4 below)!
Does this imply mH(3) < 2³ = 8?
No!
Perceptron (cont.)
Remember, the growth function is defined as a maximum over point configurations. Thus, to prove mH(3) < 2³ = 8 we would have to show that no configuration of 3 points can be shattered; conversely, a single configuration that can be shattered already gives mH(3) = 8. This lets us look at the generic configuration, i.e. 3 non-collinear points:

[Figure: eight panels, each showing three non-collinear points with one of the 2³ possible +/o labellings, and in each panel a line realising that labelling.]

Thus, mH(3) = 2³ = 8.
Perceptron (cont.)
What about N = 4 points?

[Figure: four points arranged in a square, with diagonally opposite points carrying the same label (o/+ on one diagonal, +/o on the other).]

We conclude that we cannot realise this dichotomy, nor the corresponding symmetric one with + and o flipped (all others can be realised). (Note that this holds true for any configuration of 4 points and is not specific to the "square" configuration depicted above.) Thus, mH(4) = 14 < 2⁴ = 16. We call N = 4 a break point (see below).
Example: Positive Rays

The positive rays example consists of
hypotheses h : R → {−1, +1}
of the form h(x) = sign(x − a)
with mH(N) = N + 1.
(The N points split the real line into N + 1 intervals, and the dichotomy only depends on which interval the threshold a falls into.)

[Figure: a number line from −3 to 3 with a threshold a; points to the right of a are classified +1, points to the left −1.]
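The claim mH(N) = N + 1 can be checked by brute force. The following sketch (Python, not from the slides) enumerates the distinct dichotomies that positive rays generate on N sample points; it suffices to test one threshold per gap between consecutive (sorted) points.

```python
import numpy as np

def ray_dichotomies(xs):
    """Distinct label vectors (h(x1), ..., h(xN)) for h(x) = sign(x - a)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    # One representative threshold per gap, plus one below and one above all points.
    cuts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return {tuple(np.where(xs > a, 1, -1)) for a in cuts}

rng = np.random.default_rng(1)
for N in range(1, 7):
    xs = rng.uniform(-3, 3, size=N)
    print(N, len(ray_dichotomies(xs)))   # prints N + 1, i.e. m_H(N) = N + 1
```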


Example: Positive Intervals
The positive intervals example consists of two thresholds a and b, defined analogously: h(x) = +1 for x between a and b, and −1 elsewhere.

[Figure: a number line from −3 to 3 with two thresholds a < b; points between a and b are classified +1, all other points −1.]

The N points split the line into N + 1 segments, so the growth function counts how to select the two segments containing a and b, i.e. (N+1 choose 2) choices. However, we have to add one configuration with all points −1 (all points red), corresponding to a and b lying in the same segment. Thus,

mH(N) = (N+1 choose 2) + 1 = ½N² + ½N + 1
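Again this can be verified by enumeration. The sketch below (not from the slides) tests one interval end per segment and counts the distinct dichotomies, reproducing ½N² + ½N + 1.

```python
import numpy as np
from itertools import combinations_with_replacement

def interval_dichotomies(xs):
    """Distinct label vectors for h(x) = +1 iff a < x < b, on the points xs."""
    xs = np.sort(np.asarray(xs, dtype=float))
    # One representative position per segment between consecutive points.
    cuts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return {tuple(np.where((xs > a) & (xs < b), 1, -1))
            for a, b in combinations_with_replacement(cuts, 2)}

rng = np.random.default_rng(2)
for N in range(1, 7):
    xs = rng.uniform(-3, 3, size=N)
    print(N, len(interval_dichotomies(xs)), N * (N + 1) // 2 + 1)  # both columns agree
```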


Example: Convex Sets
The convex sets example consists of hypotheses in two dimensions that are positive inside a convex set and negative elsewhere.
In general, an arbitrarily chosen set of N points cannot be shattered, as some points may lie inside the convex hull of others, which makes some labellings impossible.
However, we may choose the N points to lie on a circle.
We connect all +1 points.
If we have at least three +1 points, we get a convex polygon.
If a dichotomy has fewer than three +1 points, the convex set will be a line segment, a point, or the empty set.
In all cases, the +1 and −1 points are correctly classified.
This proves that any dichotomy on N points can be realised with a convex hypothesis if we choose the N points carefully.
Thus,

mH(N) = 2^N
Overview Example Growth Function

Table: Example Growth Functions

                N = 1   2   3   4   5
Perceptron          2   4   8  14  ...
1-D pos. ray        2   3   4   5  ...
1-D intervals       2   4   7  11   16
Convex sets         2   4   8  16   32

Calculating the growth function is difficult, in general.

However, only if mH(N) drops below 2^N can we hope for a solution anyway, so let us find those cases!
Luckily, it turns out – as a bonus – that in those cases we do not even have to calculate the growth function exactly!


Break Points

Definition 4
If H can generate all possible dichotomies on a set x1, . . . , xN, i.e. H(x1, . . . , xN) = {−1, +1}^N, we say that H can shatter the set x1, . . . , xN.

Definition 5
If no data set of size k can be shattered by H, then k is said to be a break point for H.

Thus, a break point is any k for which mH(k) < 2^k.


Characterization of the Growth Function
Problem:
In general, the growth function mH(N) is difficult to compute.
However, we only have to estimate it, i.e. to bound it (ideally by a polynomial)!

To this end, a purely combinatorial function is helpful:

Definition 6
B(N, k) is defined as the maximum number of dichotomies on N points such that no subset of size k of these N points is shattered by these dichotomies.
Remark: We can think of B(N, k) as the maximum number of dichotomies on N points which still allow for a break point k.


Example B(N, k)

How many dichotomies can you list on 4 points so that no subset of 2 points is shattered?

x1 x2 x3 x4
◦  ◦  ◦  ◦
◦  ◦  ◦  •
◦  ◦  •  ◦
◦  •  ◦  ◦
•  ◦  ◦  ◦

Can we add a 6th dichotomy?


Example B(N, k)

x1 x2 x3 x4
◦  ◦  ◦  ◦
◦  ◦  ◦  •
◦  ◦  •  ◦
◦  •  ◦  ◦
•  ◦  ◦  ◦
•  •  ◦  ◦

Adding the sixth line shatters {x1, x2}; thus it is easy to check that B(4, 2) = 5.




Usefulness of B(N, k)
Theorem 7
Let k be a break point for H. Then

mH(N) ≤ B(N, k)

Proof.

Fix any N points and consider the dichotomies H(x1, . . . , xN).
No subset of size k of these points can be shattered, as otherwise k would not be a break point by definition.
B(N, k) is the largest possible size of such a list of dichotomies (remember, it is defined as a purely combinatorial number, independent of the hypothesis set), hence |H(x1, . . . , xN)| ≤ B(N, k) and, taking the maximum over all points, mH(N) ≤ B(N, k).

Strategy: mH(N) ≤ B(N, k) ≤ Σ_{i=0}^{k−1} a_i N^i, i.e. a polynomial in N of degree at most k − 1.
Bound for B(N, k)
We try to bound B(N, k) (we will also compute an exact formula below, but the bound is all we need).
Suppose we know that

Lemma 8
The following recursion inequality holds:

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

Then it follows

Lemma 9 (Sauer's Lemma)

B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i)


Recursion Inequality Illustrated

              k = 1   2   3   4   5   6  ...
N = 1              1   2   2   2   2   2  ...
N = 2              1   3   4   4   4   4  ...
N = 3              1   4   7   8   8   8  ...
N = 4              1   5  11   ...
N = 5              1   6   ...

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

Each entry is bounded by the entry directly above it plus the entry above and to the left, e.g. B(4, 3) ≤ B(3, 3) + B(3, 2) = 7 + 4 = 11.

Note, this should remind you of dynamic programming (see the sketch below)!
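The recursion can be turned directly into a short dynamic-programming routine. The sketch below (Python, not part of the slides) fills the table above; applying the recursion with equality reproduces the upper bounds row by row.

```python
def sauer_table(max_n, max_k):
    """Upper bounds for B(N, k) from B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    B = [[0] * (max_k + 1) for _ in range(max_n + 1)]
    for k in range(1, max_k + 1):
        B[1][k] = 1 if k == 1 else 2          # base case B(1, k)
    for n in range(1, max_n + 1):
        B[n][1] = 1                           # base case B(N, 1)
    for n in range(2, max_n + 1):
        for k in range(2, max_k + 1):
            B[n][k] = B[n - 1][k] + B[n - 1][k - 1]   # dynamic-programming fill
    return B

table = sauer_table(5, 6)
for n in range(1, 6):
    print(f"N = {n}:", table[n][1:])   # reproduces 1 2 2 2..., 1 3 4 4..., 1 4 7 8..., ...
```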


Proof of Recursion Inequality
Proof.

We have

B(N, 1) = 1 for all N,
B(1, k) = 2 for all k > 1.

We have B(N, 1) = 1 as only one dichotomy is possible (if there were at least two, they would have to differ in some coordinate, and thus a subset of size 1 would be shattered). We have B(1, k) = 2 for k > 1 as the constraint is void.
Now consider N ≥ 2 and k ≥ 2, and let S be a maximal set of dichotomies on x1, . . . , xN such that no subset of size k is shattered, i.e. |S| = B(N, k).
We group S into three sets S = S1 ∪ S2− ∪ S2+, where S2− and S2+ contain those dichotomies that appear in pairs: identical on the first coordinates x1, . . . , xN−1 and differing only in the last coordinate, which is −1 (the set S2−) or +1 (the set S2+).
The set S1 contains the remaining dichotomies.
Define α := |S1| and β := |S2−| = |S2+|.
Proof of Recursion Inequality
Proof (cont.)

[Table: the dichotomies of S listed as rows over x1, . . . , xN−1, xN. The α rows of S1 are those whose pattern on x1, . . . , xN−1 occurs only once; the remaining rows form β pairs, listed as S2+ (last coordinate +1) and S2− (last coordinate −1), which agree on x1, . . . , xN−1.]

By definition: B(N, k) = α + 2β


Proof of Recursion Inequality
Proof (cont.)

Now, consider the total number of distinct dichotomies on the first N − 1 points.

This number is α + β, as S2− is identical to S2+ on those points (each pair collapses into one dichotomy). No subset of size k of the first N − 1 points can be shattered by them, since it would then also be shattered as a subset of the original N points. Thus, by definition of B(N, k),

α + β ≤ B(N − 1, k).

We claim that

β ≤ B(N − 1, k − 1).

Indeed, if a subset of size k − 1 of the first N − 1 points could be shattered by the (restrictions of the) dichotomies in S2+, then adding xN to this subset would yield a subset of size k shattered by S2+ ∪ S2− (the last coordinate takes both values by construction), contradicting the definition of B(N, k).
Putting everything together, B(N, k) = α + 2β = (α + β) + β ≤ B(N − 1, k) + B(N − 1, k − 1), the desired inequality.


Proof of Sauer’s Lemma
Proof.
Proof by induction. The statement is true for k = 1 or N = 1 by inspection.
Assume it is true for all N ≤ N0 and all k.
The recursion inequality implies B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1), thus

B(N0 + 1, k) ≤ Σ_{i=0}^{k−1} (N0 choose i) + Σ_{i=0}^{k−2} (N0 choose i)
            = 1 + Σ_{i=1}^{k−1} (N0 choose i) + Σ_{i=1}^{k−1} (N0 choose i−1)
            = 1 + Σ_{i=1}^{k−1} [ (N0 choose i) + (N0 choose i−1) ]
            = 1 + Σ_{i=1}^{k−1} (N0+1 choose i)
            = Σ_{i=0}^{k−1} (N0+1 choose i)


Excursion: Exact formula for B(N, k)
We do not need it, but:
Theorem 10
B(N, k) = Σ_{i=0}^{k−1} (N choose i)

Proof.
We only have to show B(N, k) ≥ Σ_{i=0}^{k−1} (N choose i).
Consider the dichotomy having zero −1's. Clearly no subset of size k is shattered. (Why?)
Consider the set of all dichotomies having exactly i −1's, with i < k. Clearly no subset of size k is shattered. (Why?)
Consider the union of all of these. Clearly no subset of size k of the N points is shattered by this collection.
There are (N choose i) dichotomies having exactly i −1's, so the collection contains Σ_{i=0}^{k−1} (N choose i) dichotomies.


Summary
The growth function mH(N) either has a break point k or it does not.
If it has no break point, we know mH(N) = 2^N (bad case).
If it has a break point k, we know mH(N) ≤ B(N, k) = Σ_{i=0}^{k−1} (N choose i) (good case).
We know that having a break point is the good case, because¹

Lemma 11

Σ_{i=0}^{k−1} (N choose i) ≤ N^{k−1} + 1   and   Σ_{i=0}^{k−1} (N choose i) ≤ (eN/(k−1))^{k−1}

As both alternative expressions are polynomial in N, we know that the growth function mH(N) is polynomially bounded; hence the "good case".

¹ Proof: homework exercises
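A quick numerical sanity check of Lemma 11 (a sketch, not from the slides; the tested values of N and k are arbitrary and chosen with N ≥ k − 1, where the second bound applies):

```python
import math

def sauer_sum(N, k):
    return sum(math.comb(N, i) for i in range(k))     # sum_{i=0}^{k-1} C(N, i)

for N in (5, 10, 20, 50):
    for k in (2, 3, 5):
        s = sauer_sum(N, k)
        poly = N ** (k - 1) + 1                        # first bound of Lemma 11
        expo = (math.e * N / (k - 1)) ** (k - 1)       # second bound of Lemma 11
        assert s <= poly and s <= expo
        print(f"N={N:2d} k={k}: sum={s:6d}  N^(k-1)+1={poly:7d}  (eN/(k-1))^(k-1)={expo:10.1f}")
```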
Growth of the Growth Function
We either have no luck at all or we are in good shape – nothing in between. Once broken (a break point exists!), the growth stays polynomial.

[Figure: log mH(N) versus N. Either mH grows exponentially, mH(N) = 2^N ("Bad"), or it grows only polynomially once a break point exists ("Good"); the region in between ("Ugly") is void.]


Link between Growth Function and VC Dimension
Note that mH could have several break points. The smaller this number, the better for the generalisation (see below). Thus, we define the Vapnik-Chervonenkis dimension as follows:

Definition 12
Let H be a set of hypotheses. The Vapnik-Chervonenkis dimension of the set H, denoted by dVC(H), is the largest value of N such that mH(N) = 2^N. If no such number exists, then dVC(H) := ∞.

The Vapnik-Chervonenkis dimension can be considered as a measure of the real complexity of a set of hypotheses. It characterizes the learning model's complexity.

Corollary 13

mH(N) ≤ N^{dVC(H)} + 1
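Since dVC is the largest N for which some configuration of N points is shattered, small cases can be checked by brute force. The sketch below (not part of the slides) does this for the positive rays example, reusing the enumeration from above; it reports that some 1-point set is shattered but no 2- or 3-point set is, i.e. dVC(positive rays) = 1.

```python
import numpy as np

def ray_dichotomies(xs):
    """Distinct dichotomies of positive rays h(x) = sign(x - a) on the points xs."""
    xs = np.sort(np.asarray(xs, dtype=float))
    cuts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    return {tuple(np.where(xs > a, 1, -1)) for a in cuts}

def shattered(xs):
    return len(ray_dichotomies(xs)) == 2 ** len(xs)

rng = np.random.default_rng(0)
for N in (1, 2, 3):
    found = any(shattered(rng.uniform(-3, 3, size=N)) for _ in range(200))
    print(f"N = {N}: some point set shattered by positive rays? {found}")
# Expected output: True, False, False  ->  dVC(positive rays) = 1.
```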


Warning: Counter Example sin(ωx )

Intuitively, we tend to believe that the Vapnik-Chervonenkis dimension is related to the number of parameters.
This intuition is wrong.
Here is a counter example:
Let X = R and H = {hω | ω ∈ R}, where for a fixed number ω ∈ R the hypothesis is given by

hω(x) = sign(sin(ωx)).

Now we claim:
For all N ∈ N the points {2¹, 2², . . . , 2^N} are shattered by H.
Thus the VC dimension is infinite, despite the fact that we only have a single(!) parameter (see also [GH15]).




Warning: Counter Example sin(ωx ) (cont.)

Proof.
To prove that for all N ∈ N the points {2¹, 2², . . . , 2^N} are shattered, let d = 0.y1y2 . . . yN be a binary encoding of the desired labels yi, where we map −1 to 0 and +1 to 1.
Let d′ be the value of this binary expansion and define ω = −π · d′.
Essentially, each point xi = 2^i bit-shifts d′ so that the desired label appears, as a consequence of the fact that sign(sin(πz)) = (−1)^⌊z⌋ (for non-integer z).

Intuitive reason: ω can be chosen huge, so that sin(ωx) becomes arbitrarily "wiggly" and can realise any labelling of the points.
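The construction can be carried out literally in a few lines. The sketch below (not part of the slides) builds ω from the desired labels and verifies all 2^N dichotomies for a small N; one extra trailing bit is added to d′ to avoid the degenerate case sin(ωxi) = 0, a detail the slide glosses over.

```python
import numpy as np
from itertools import product

def omega_for(labels):
    """omega such that sign(sin(omega * 2^i)) = labels[i-1] for i = 1, ..., N."""
    bits = [1 if y == +1 else 0 for y in labels]          # map -1 -> 0, +1 -> 1
    d = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
    d += 2.0 ** -(len(bits) + 1)                          # trailing bit: keeps sin != 0
    return -np.pi * d

N = 8
xs = 2.0 ** np.arange(1, N + 1)                           # the points 2^1, ..., 2^N
for labels in product((-1, +1), repeat=N):                # all 2^N dichotomies
    w = omega_for(labels)
    assert tuple(np.sign(np.sin(w * xs)).astype(int)) == labels
print(f"all {2**N} dichotomies on {N} points realised by a single parameter omega")
```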


Composition of Growth Functions
Let H1, . . . , HK be hypothesis sets with VC dimensions d1, . . . , dK.
Let H̃ be a hypothesis set of functions taking inputs in R^K. Assume H̃ has VC dimension d̃.
Then we can define the composition H := H̃ ◦ (H1, . . . , HK) as the hypothesis set of all functions

h(x) = h̃(h1(x), . . . , hK(x)),  with h̃ ∈ H̃ and hi ∈ Hi.

Lemma 14

mH(N) ≤ mH̃(N) · Π_{i=1}^K mHi(N)

This can be used to reason about composed machines, e.g. neural networks!
Vapnik-Chervonenkis Theorem

The following is the famous Vapnik-Chervonenkis Theorem, also known as the Generalization Bound for ML.

Theorem 15
Let H be a hypothesis set with finite VC dimension dVC. Assume we draw sample data D of size N independently from the same unknown distribution. Then for any ε > 0 we can bound the out-of-sample error Eout by the in-sample error Ein as follows:

P[|Ein(h) − Eout(h)| > ε] ≤ 4 mH(2N) e^{−ε²N/8},

where mH(N) denotes the growth function and h ∈ H denotes the learned hypothesis.


Sketch of proof
We intuitively want to replace M in the union bound by mH(N). Having finite VC dimension dVC, this makes sense.
However, we cannot directly apply Hoeffding, as the hypothesis h is not fixed.
Thus, estimating the probability of the event {|Ein(h) − Eout(h)| > ε} is difficult.
We introduce a "ghost" data set D′ with the same distribution as D and draw 2N samples in total (thus, we replace mH(N) by mH(2N), which explains the change of some of the constants).
This provides another in-sample error Ein′(h).
Intuitively, if Ein(h) ≈ Eout(h), then also Ein′(h) ≈ Eout(h) and Ein(h) ≈ Ein′(h).
This can be made rigorous using conditional expectations.
Hence, we can approximate the event {|Ein(h) − Eout(h)| > ε} (difficult) by {|Ein(h) − Ein′(h)| > ε} (easy), as the latter involves only finitely many data points (and here the growth function comes into play, too).
The full details are the subject of a seminar talk – enjoy!


Model Complexity and Error
There is a trade-off:

[Figure: error versus VC dimension dVC. The in-sample error decreases with dVC, the model-complexity term increases, and the out-of-sample error (their combination) is minimised at an intermediate value of dVC.]


Interpreting the Generalization Bound
We can rephrase the inequality of the Generalization Bound as follows:
For any fixed tolerance level δ we have, with probability at least 1 − δ:

Eout(h) ≤ Ein(h) + sqrt( (8/N) log(4 mH(2N)/δ) )    (1)
Eout(h) ≥ Ein(h) − sqrt( (8/N) log(4 mH(2N)/δ) )    (2)

(The proof is left as an exercise to the reader!)
Note that
1. equation 1 implies that we can bound the out-of-sample error by the in-sample error (independent of any knowledge of the distribution).
2. equation 2 implies that if we have learned a hypothesis well (with small error), then no other hypothesis would have done much better, if at all. Thus we have found a close-to-optimal solution.
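To get a feeling for the numbers, the error bar in equations (1) and (2) can be evaluated directly, bounding mH(2N) by the polynomial bound of Corollary 13. The values below are an illustrative sketch (not from the slides), assuming dVC = 3 (e.g. a perceptron in two dimensions) and δ = 0.05.

```python
import math

def vc_error_bar(N, d_vc, delta):
    """sqrt((8/N) * log(4 * m_H(2N) / delta)) with m_H(N) bounded by N^d_vc + 1."""
    m_2n = (2 * N) ** d_vc + 1                # Corollary 13 applied to 2N points
    return math.sqrt(8.0 / N * math.log(4.0 * m_2n / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:6d}:  Eout <= Ein + {vc_error_bar(N, 3, 0.05):.3f}")
```

The bars shrink only slowly with N, illustrating point 3 of the next slide: the bound is conceptually crucial but numerically very loose.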
Interpreting the Generalization Bound (cont.)

1. The Generalization Bound is a universal result.
2. It is a kind of worst case (as it is distribution-independent).
3. Thus, the estimates are very loose.

Conclusion
Nonetheless, it is conceptually the most important mathematical theorem for ML, as it proves that learning is possible at all and provides guidance for model (complexity) selection.

Remark: You can find a very useful explanation of the generalisation bound here: https://mostafa-samir.github.io/ml-theory-pt2/.


References I

[AMMIL12] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.

[GH15] H. N. G. Harman and S. R. Kulkarni, "sin(ωx) can approximate almost every finite set of samples," Constructive Approximation, vol. 42, pp. 303–311, June 2015.

[SSBD14] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, 1st ed. Cambridge University Press, May 2014. [Online]. Available: https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf

[Vap95] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.
