3: Is Learning Feasible?
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main
Dichotomies
Growth Function
VC Dimension
Examples
Bibliography
i.e. we can bound the probability that X differs from its mean, and the bound decays exponentially in N.
We will set $X = E_{in}(h)$ and use $\mathbb{E}[X] = E_{out}(h)$ to generalise this and prove the Generalization Bound for ML.
Generalization Strategy
Assume we fix the hypothesis h ∈ H in advance. Then from Hoeffding's
inequality we conclude:
\[ \mathbb{P}\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2e^{-2\epsilon^2 N} \]
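As a quick numerical sanity check (a sketch, not from the slides; the coin model and all names are illustrative), we can estimate the left-hand side by Monte Carlo for a fair coin, where $E_{out} = 0.5$ and $E_{in}$ is the sample mean:

```python
# Monte-Carlo check of Hoeffding's inequality for a fair coin.
# Here E_out = 0.5 (true head probability) and E_in is the sample mean;
# all variable names are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 100, 0.1, 100_000

E_out = 0.5
E_in = rng.binomial(N, E_out, size=trials) / N   # in-sample means
empirical = np.mean(np.abs(E_in - E_out) > eps)  # P[|E_in - E_out| > eps]
bound = 2 * np.exp(-2 * eps**2 * N)

print(f"empirical: {empirical:.4f}  Hoeffding bound: {bound:.4f}")
# The bound (~0.27) must dominate the empirical frequency (~0.035).
```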
To handle more than one hypothesis we use the union bound
\[ \mathbb{P}\Big[\bigcup_i A_i\Big] \le \sum_i \mathbb{P}[A_i], \]
which holds true for any probability measure and set of events $A_i$. It is crude, however:
[Figure: three Venn diagrams of events $A_1, A_2, A_3$ — heavily overlapping ($P(A_1 \cup A_2 \cup A_3) \approx P(A_i)$), partially overlapping ($\approx 2P(A_i)$), and disjoint ($\approx 3P(A_i)$).]
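How crude the union bound is depends on the overlap; a toy sketch with three hypothetical events on a fair die (not from the slides) makes this concrete:

```python
# Union bound vs. exact probability for heavily overlapping events.
# Events on a fair die; the sets are hypothetical, chosen for illustration.
A1, A2, A3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

p = lambda s: len(s) / 6
exact = p(A1 | A2 | A3)              # = 5/6
union_bound = p(A1) + p(A2) + p(A3)  # = 3/2, a useless bound > 1

print(f"exact: {exact:.3f}  union bound: {union_bound:.3f}")
```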
|H| = ∞!
Number of Hypotheses – Examples
How many hypotheses do we have for the perceptron?
Answer: infinitely many, as any choice of the weight vector $w$ is a hypothesis!
But how many are really different?
[Figure: four sample points in the $(x_1, x_2)$ plane, two labelled $\circ$ and two labelled $+$; many different lines realize the same dichotomy.]
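A sketch of the "really different" question (the point set and sampling scheme are my own choices): sample random perceptrons and count the distinct dichotomies they induce on four fixed points.

```python
# Sample random perceptrons h(x) = sign(w·x + b) and count how many
# *distinct* dichotomies they induce on a fixed set of 4 points.
# The point set and the sampling are illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]])  # 4 points

patterns = set()
for _ in range(100_000):
    w = rng.normal(size=2)
    b = rng.normal()
    patterns.add(tuple(np.sign(X @ w + b).astype(int)))

print(len(patterns))  # at most 14 < 2^4 = 16: infinitely many w, few dichotomies
```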
Strategy of Generalization
1. We will show that there exists a function $m_H(N)$ measuring the real,
i.e. effective, complexity of the hypothesis set $H$.
2. This function can be used to replace $M$ in the union bound:
\[ \mathbb{P}\big[\,|E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2M e^{-2\epsilon^2 N} \]
Definition 2
Let $H$ be a set of hypotheses. Let $x_1, \ldots, x_N \in D$. Then the dichotomies
generated by $H$ are defined by
\[ H(x_1, \ldots, x_N) := \{\,(h(x_1), \ldots, h(x_N)) \mid h \in H\,\}. \]
Thus, the growth function
\[ m_H(N) := \max_{x_1, \ldots, x_N} |H(x_1, \ldots, x_N)| \]
is the maximum number of dichotomies that can be generated by $H$ on any $N$
points; hence it can be interpreted as taking the worst possible
$x_1, \ldots, x_N$ (most difficult to separate). It is a measure of the
complexity of the hypothesis set $H$. (Note that $m_H(N) \le 2^N$ – why?)
Questions:
1. Can we bound $m_H(N)$ by a polynomial in $N$?
2. Can we replace $|H|$ by $m_H(N)$ in the generalization bound?
Perceptron (cont.)
Remember, the growth function is defined as the max; thus, to prove that
$m_H(3) < 2^3 = 8$ we would need that no configuration of 3 points can be
shattered. So let us look at the generic configuration, i.e. 3
non-collinear points:
[Figure: all $2^3 = 8$ dichotomies of three non-collinear points in the $(x_1, x_2)$ plane, each realized by a line.]
Thus, $m_H(3) = 2^3 = 8$.
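The shattering claim can be verified mechanically: a labeling $y$ is realizable by a perceptron iff the linear program $y_i(w \cdot x_i + b) \ge 1$ is feasible. A sketch (the helper name and the coordinates are illustrative):

```python
# Exact check that 3 non-collinear points are shattered by the perceptron.
# A labeling y is linearly separable iff some (w, b) satisfies
# y_i (w·x_i + b) >= 1 for all i, an LP feasibility problem.
import itertools
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    # Feasibility LP: minimize 0 subject to -y_i (w·x_i + b) <= -1.
    A = -(y[:, None] * np.c_[X, np.ones(len(X))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A,
                  b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear
count = sum(is_separable(X3, np.array(y))
            for y in itertools.product([-1, 1], repeat=3))
print(count)  # 8: every dichotomy is realized, so m_H(3) = 2^3
```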
Perceptron (cont.)
What about k = 4?
[Figure: four points labelled in an XOR pattern ($+$ and $\circ$ on opposite corners); no line realizes this dichotomy, so this configuration of 4 points is not shattered.]
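Reusing `is_separable` from the sketch above on four points in convex position (coordinates again illustrative) shows that only 14 of the 16 labelings are realizable, the two XOR patterns being the exceptions:

```python
# Four points in convex position: the two XOR labelings are not separable,
# so this configuration realizes only 14 of 2^4 = 16 dichotomies.
X4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
count4 = sum(is_separable(X4, np.array(y))
             for y in itertools.product([-1, 1], repeat=4))
print(count4)  # 14 < 16, consistent with k = 4 being a break point
```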
[Figure: $N$ points on the real line between $-3$ and $3$, once with a single threshold $a$ and once with two thresholds $a$ and $b$ (positive rays and positive intervals).]
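Assuming the figure depicts the classic one-dimensional "positive rays" $h_a(x) = \mathrm{sign}(x - a)$, the growth function can be enumerated directly; the sketch below confirms $m_H(N) = N + 1$:

```python
# Enumerate the dichotomies of "positive rays" h_a(x) = sign(x - a):
# sliding the threshold a over N sorted points yields exactly N + 1 patterns.
# (Assumes the figure above shows this hypothesis set.)
import numpy as np

def dichotomies_positive_rays(xs):
    xs = np.sort(xs)
    # one threshold below all points, one between each pair, one above all
    thresholds = np.r_[xs[0] - 1, (xs[:-1] + xs[1:]) / 2, xs[-1] + 1]
    return {tuple(np.where(xs > a, 1, -1)) for a in thresholds}

rng = np.random.default_rng(2)
for N in range(1, 6):
    xs = rng.uniform(size=N)
    print(N, len(dichotomies_positive_rays(xs)))  # prints N, N + 1
```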
Definition 4
If $H$ can generate all possible dichotomies on a set $x_1, \ldots, x_N$, i.e.
$H(x_1, \ldots, x_N) = \{-1, +1\}^N$, we say that $H$ shatters the set
$x_1, \ldots, x_N$.
Definition 5
If no data set of size $k$ can be shattered by $H$, then $k$ is said to be a
break point for $H$. (For the 2-dimensional perceptron, $k = 4$ is a break point.)
Define $B(N, k)$ as the maximum number of dichotomies on $N$ points such that
no subset of $k$ points is shattered. Example ($N = 4$, $k = 2$):

x1  x2  x3  x4
◦   ◦   ◦   ◦
◦   ◦   ◦   •
◦   ◦   •   ◦
◦   •   ◦   ◦
•   ◦   ◦   ◦
•   •   ◦   ◦

Adding the sixth line would shatter $\{x_1, x_2\}$; thus it is easy to check that
$B(4, 2) = 5$.
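For tiny $N$ the quantity $B(N, k)$ can be computed by exhaustive search, confirming the value above (a brute-force sketch, infeasible beyond very small $N$):

```python
# Brute-force B(N, k): largest set of dichotomies on N points such that
# no subset of k points is shattered (all 2^k patterns present on it).
# Feasible only for tiny N; illustrative sketch.
import itertools

def shatters(dichotomies, cols):
    patterns = {tuple(d[c] for c in cols) for d in dichotomies}
    return len(patterns) == 2 ** len(cols)

def B(N, k):
    all_d = list(itertools.product([0, 1], repeat=N))
    for r in range(len(all_d), 0, -1):   # largest candidate size first
        for subset in itertools.combinations(all_d, r):
            if not any(shatters(subset, cols)
                       for cols in itertools.combinations(range(N), k)):
                return r
    return 0

print(B(4, 2))  # 5, matching the table argument above
```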
If $k$ is a break point for $H$, then
\[ m_H(N) \le B(N, k). \]
Proof.
Strategy: $m_H(N) \le \max\{m_H(N) \mid H \text{ has break point } k\} \le B(N, k) \le \sum_{i=0}^{k-1} a_i N^i$ for suitable coefficients $a_i$, i.e. a polynomial in $N$ of degree at most $k - 1$.
Bound for B(N, k)
We try to bound $B(N, k)$ (we will compute an exact solution below, too,
but the bound is all we need).
Suppose we know that

Lemma 8
The following recursion inequality holds:
\[ B(N, k) \le B(N - 1, k) + B(N - 1, k - 1). \]

Then it follows:

Lemma 9 (Sauer's Lemma)
\[ B(N, k) \le \sum_{i=0}^{k-1} \binom{N}{i}. \]
        k=1   k=2   k=3   k=4   k=5   k=6   ...
N=1      1     2     2     2     2     2    ...
N=2      1     3     4     4     4     4    ...
N=3      1     4     7     8     8     8    ...
N=4      1     5    11     …     …     …
N=5      1     6     …     …     …     …

(Each entry is bounded by the sum of the entry above it and the entry above-left, per the recursion of Lemma 8.)
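The table follows from the base cases stated below and the recursion of Lemma 8; a short sketch fills it in and compares it with Sauer's binomial sum (they coincide, by Pascal's rule):

```python
# Fill the upper-bound table from B(N, 1) = 1, B(1, k) = 2 and the recursion
# B(N, k) <= B(N-1, k) + B(N-1, k-1), then compare with Sauer's sum.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 6):
    row = [B_upper(N, k) for k in range(1, 7)]
    sauer = [sum(comb(N, i) for i in range(k)) for k in range(1, 7)]
    print(N, row, row == sauer)  # rows match the table; bound equals the sum
```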
We have
\[ B(N, 1) = 1 \quad \forall N, \qquad B(1, k) = 2 \quad \forall k > 1. \]
We have $B(N, 1) = 1$ as only one dichotomy is allowed (if we had at least two,
they would have to differ in at least one place, and thus a subset of size 1 would be
shattered). We have $B(1, k) = 2$ as the condition is void.
Consider a set of dichotomies $S$ on $x_1, \ldots, x_N$ with no shattered subset of size $k$, achieving $|S| = B(N, k)$, for $N \ge 2$ and $k \ge 2$.
We group $S$ into three sets $S = S_1 \cup S_2^- \cup S_2^+$, where $S_2^\pm$ contains those
dichotomies that come in pairs: identical on the first $x_1, \ldots, x_{N-1}$ coordinates and
differing only in the last coordinate, which is $-1$ (the set $S_2^-$) or $+1$ (the set $S_2^+$).
The set $S_1$ contains the remaining dichotomies.
Define $\alpha := |S_1|$ and $\beta := |S_2^-| = |S_2^+|$.
Proof of Recursion Inequality
Proof (cont.)
Restricting the dichotomies in $S_1 \cup S_2^+$ to the first $N - 1$ points yields $\alpha + \beta$ distinct dichotomies, hence
\[ \alpha + \beta \le B(N - 1, k). \]
We claim that
\[ \beta \le B(N - 1, k - 1). \]
Indeed, if a subset of size $k - 1$ of the $N - 1$ points could be shattered by the
dichotomies of $S_2^+$, then we could take the corresponding dichotomies in $S_2^-$ and
add $x_N$ to the data points. This yields a subset of size $k$ that can be
shattered, contradicting the definition of $B(N, k)$.
Putting everything together we get the desired inequality:
\[ B(N, k) = \alpha + 2\beta = (\alpha + \beta) + \beta \le B(N - 1, k) + B(N - 1, k - 1). \]
Proof.
We only have to show $B(N, k) \ge \sum_{i=0}^{k-1} \binom{N}{i}$.
Consider the dichotomy having zero $-1$'s. Clearly no subset of
size $k$ can be shattered. (Why?)
Consider the set of dichotomies having $i$ $-1$'s, with $i < k$. Clearly
no subset of size $k$ can be shattered. (Why?)
Consider the union of all of these. Clearly no subset of size $k$ of this
set can be shattered.
There are $\binom{N}{i}$ dichotomies having exactly $i$ $-1$'s, which proves the claim.
[Figure: two growth curves over $N$, a rapidly growing one labelled "Bad" and a slower one labelled "Good" (exponential versus polynomial growth).]
Proof.
To prove that, $\forall N \in \mathbb{N}$, the points $\{2^1, 2^2, \ldots, 2^N\}$ are shattered by
hypotheses of the form $h(x) = \mathrm{sign}(\sin(\omega x))$, let $d = 0.y_1 y_2 \ldots y_N$ be a
binary encoding of the desired labels $y_i$, where we map $-1$ to $0$ and $+1$ to $1$.
Let $d'$ be the corresponding decimal number and define $\omega = -\pi d'$.
Essentially, each $x_i$ bit-shifts $\omega$ to produce the desired label, as a
result of the fact that $\mathrm{sign}(\sin(\pi z)) = (-1)^{\lfloor z \rfloor}$.
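A numerical sketch of the argument (the tiny trailing term $2^{-(N+1)}$ is my own safeguard against the degenerate case $\sin(\pi \cdot \text{integer}) = 0$):

```python
# Verify that x_i = 2^i, i = 1..N, are shattered by h(x) = sign(sin(w x)).
# Labels y_i in {-1,+1} are encoded as bits b_i = (y_i+1)/2 of d = 0.b1...bN,
# and w = -pi * d; then sign(sin(w * 2^i)) reproduces y_i.
# The small trailing term avoids sin(pi * integer) = 0 (my own safeguard).
import numpy as np

N = 8
xs = 2.0 ** np.arange(1, N + 1)
rng = np.random.default_rng(3)

for _ in range(100):                  # random labelings (all 2^N also feasible)
    y = rng.choice([-1, 1], size=N)
    bits = (y + 1) // 2
    d = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits)) + 2.0 ** -(N + 1)
    w = -np.pi * d
    assert np.all(np.sign(np.sin(w * xs)) == y)
print("all tested labelings reproduced")
```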
Lemma 14
For a hypothesis set $H$ composed of $H_1, \ldots, H_K$ via a combining set $\tilde{H}$:
\[ m_H(N) \le m_{\tilde{H}}(N) \prod_{i=1}^{K} m_{H_i}(N) \]
This can be used to reason about composed machines, e.g. neural networks!
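A rough sketch of such reasoning (the architecture and the VC dimensions are illustrative): bound the growth function of three 2-input perceptrons feeding one 3-input combining perceptron via Lemma 14 and Sauer's bound.

```python
# Lemma 14 for a toy "network": K = 3 perceptron units on 2 inputs
# (d_VC = 3) feeding one combining perceptron on 3 inputs (d_VC = 4).
# Growth functions are replaced by Sauer's bound; all numbers illustrative.
from math import comb

def sauer_bound(N, d_vc):
    # m_H(N) <= sum_{i=0}^{d_VC} C(N, i) when the break point is d_VC + 1
    return sum(comb(N, i) for i in range(d_vc + 1))

def m_composed_bound(N, K=3):
    return sauer_bound(N, 4) * sauer_bound(N, 3) ** K

for N in (10, 100, 1000):
    print(N, m_composed_bound(N))  # grows polynomially in N, not like 2^N
```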
Vapnik-Chervonenkis Theorem
[Figure: error versus VC dimension $d_{VC}$; the in-sample error decreases and the model complexity term increases, so the out-of-sample error is minimized at an intermediate $d_{VC}$.]
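To get a feel for the sample sizes the theorem implies, one can solve the VC bound for $N$ by fixed-point iteration. The form $N \ge \frac{8}{\epsilon^2}\ln\frac{4\,m_H(2N)}{\delta}$ with $m_H(N) \le N^{d_{VC}} + 1$ is the one used in the Learning From Data textbook; I am assuming the slides state an equivalent version, and all constants below are illustrative:

```python
# Iteratively estimate the sample size N so that, with probability at least
# 1 - delta, |E_in - E_out| <= eps, using the VC bound with
# m_H(N) <= N^d_vc + 1. Constants follow the LFD textbook treatment;
# this is a rough rule of thumb, not a formula taken from the slides.
import math

def sample_complexity(d_vc, eps=0.1, delta=0.05, iters=100):
    N = 1000.0                      # initial guess
    for _ in range(iters):          # fixed-point iteration
        N = (8 / eps**2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return math.ceil(N)

print(sample_complexity(3))  # on the order of tens of thousands of examples
```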
Conclusion
Nonetheless, it is conceptually the most important mathematical theorem
for ML, as it proves that learning is possible at all and provides guidance
for model (complexity) selection.