
Learning From Data

2: Learning Foundations
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main



Content

Motivation

Deterministic Learning

Error Measures

Statistical Learning

Bibliography



The Future

I visualize a time when we will be to robots what dogs are to humans, and I'm rooting for the machines. – Claude Shannon



Recap: The Learning Model

[Diagram: the unknown target function f : X → Y generates the training examples (x1, y1), . . . , (xn, yn); the learning algorithm selects the final hypothesis h ≈ f from the hypothesis set H.]

Figure: Learning Model, source: [AMMIL12]



Is the Learning Setup Realistic?

Assume we have the following setup:

Let X = {0, 1}^3. Let Y = {0, 1}.
Assume we are given the data set D as follows:
x_n     y_n
0 0 0   0
0 0 1   1
0 1 0   1
0 1 1   0
1 0 0   1
Can we predict the remaining 3 values?



Is the Learning Setup Realistic? (cont.)
There are 8 different functions that could agree with the data set D (why?):
x_n     y_n  g    f1  f2  f3  f4  f5  f6  f7  f8
0 0 0   0    0    0   0   0   0   0   0   0   0
0 0 1   1    1    1   1   1   1   1   1   1   1
0 1 0   1    1    1   1   1   1   1   1   1   1
0 1 1   0    0    0   0   0   0   0   0   0   0
1 0 0   1    1    1   1   1   1   1   1   1   1
1 0 1        ?    0   1   0   0   1   1   0   1
1 1 0        ?    0   0   1   0   1   0   1   1
1 1 1        ?    0   0   0   1   0   1   1   1
So which one is the “right” g?
We do not know! It is impossible to choose.
Conclusion
Even in this simple, concrete bit-guessing task, learning is impossible.
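
The counting is easy to verify by brute force. Below is a minimal Python sketch (my own illustration, not part of the slides) that enumerates all 2^8 = 256 Boolean functions on X = {0, 1}^3 and keeps those consistent with D:

```python
from itertools import product

# All 8 inputs in X = {0,1}^3 and the five labelled examples from D.
X = list(product([0, 1], repeat=3))
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

consistent = []
for labels in product([0, 1], repeat=len(X)):   # all 2^8 = 256 Boolean functions
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in D.items()):    # does f agree with D?
        consistent.append(f)

print(len(consistent))  # 8 -- one for each labelling of the three unseen inputs
```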
Real Line Example – Two Points
How many straight lines fit through two points? (Exactly one.)


Ok, but that’s only for two points, what about many?



Real Line Example – Many Points
How many straight lines fit through many points?

[Figure: five data points fitted exactly by a curve with kinks]
Ok, but that’s only if we allow for curves with quirky kinks, what about
smooth curves?
Real Line Example
Can we fit a polynomial to any N points?
Lemma 1
Let N be any number of points (x_1, y_1), . . . , (x_N, y_N) ∈ R^2 with pairwise distinct x_i. Then there exists a polynomial p(x) = Σ_{i=0}^{M} β_i x^i of degree M such that p(x_i) = y_i for all i ∈ 1, . . . , N if M ≥ N − 1.

Proof.
We write the regression problem Σ_{i=0}^{M} β_i x^i in matrix form as follows:

\[
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^M \\
1 & x_2 & x_2^2 & \cdots & x_2^M \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^M
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_M \end{pmatrix},
\qquad \text{i.e.} \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta}.
\]



Real Line Example (cont.)
Proof.
Consider the function φ(x, β) := ½ ‖y − Xβ‖².
To get its minimum, we take derivatives:

\[
\frac{\partial \varphi(x, \beta)}{\partial \beta_j}
= \frac{\partial}{\partial \beta_j}\,\frac{1}{2}\sum_{i=1}^{N}\Bigl(y_i - \sum_{k=0}^{M} X_{ik}\beta_k\Bigr)^{2}
= -\sum_{i=1}^{N}\Bigl(y_i - \sum_{k=0}^{M} X_{ik}\beta_k\Bigr) X_{ij}
= -\sum_{i=1}^{N} X^{t}_{ji}\, y_i + \sum_{i=1}^{N}\sum_{k=0}^{M} X^{t}_{ji} X_{ik}\beta_k .
\]

Thus
\[
\nabla_{\beta}\,\varphi = -\mathbf{X}^{t}\mathbf{y} + (\mathbf{X}^{t}\mathbf{X})\boldsymbol{\beta}.
\]
Hence ∇_β φ = 0 is equivalent to X^t y = (X^t X)β.



Real Line Example (cont.)
Proof.
Note that the matrix X is a so-called Vandermonde matrix. Consider first the case M = N − 1.
Then det(X) = \prod_{1 \le i < j \le N} (x_j − x_i). Since the x_i are pairwise distinct, the matrix X is invertible, and hence X^t X is, too. If we add columns to X (as M ≥ N − 1), the rank of X is still maximal and hence XX^t has full rank. Therefore, we can define the matrix X^†, the so-called Moore-Penrose inverse, as follows:

\[
\mathbf{X}^{\dagger} := \mathbf{X}^{t} (\mathbf{X}\mathbf{X}^{t})^{-1}.
\]

If we define
\[
\hat{\boldsymbol{\beta}} := \mathbf{X}^{\dagger}\mathbf{y},
\]
then we compute that
\[
\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}\mathbf{X}^{\dagger}\mathbf{y} = \mathbf{X}\mathbf{X}^{t}(\mathbf{X}\mathbf{X}^{t})^{-1}\mathbf{y} = \mathbf{y}.
\]

Thus, the minimum of φ is zero and we have found an exact solution.
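
As a sanity check, the construction in the proof is easy to reproduce numerically. The following small NumPy sketch (my own illustration; the data values are made up) builds the Vandermonde matrix for N = 5 points and solves the normal equations X^t X β = X^t y with M = N − 1:

```python
import numpy as np

# Fit a polynomial of degree M = N - 1 exactly through N points with
# pairwise distinct x-values, via the normal equations from the proof.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([3.0, 0.5, -1.0, 2.0, 4.0])

N = len(x)
M = N - 1
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M

# Solve X^t X beta = X^t y; here X is square and invertible, so the fit is exact.
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(X @ beta, y))            # True: p(x_i) = y_i for all i
```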



Real Line Example (cont.)

Now, given N points, how many polynomials exist that fit these points?

Infinitely many!
Proof.
Take any additional point (x_{N+1}, y_{N+1}) with x_{N+1} distinct from x_1, . . . , x_N. Then the lemma above implies that we can find a polynomial fitting all N + 1 points; different choices of y_{N+1} yield different polynomials.



Real Line Example (cont.)
How many polynomials fit through N points?

[Figure: five data points fitted by a linear function, a degree-3 polynomial, and a "wiggly" degree-9 polynomial]

(We will look at the "wiggly" thing and its implications later.)



Conclusion

Deterministic Learning
It is impossible to learn the true target function from the hypothesis set H and
finite data D alone.



Why Deterministic Learning is Impossible
As computer scientists we should not be surprised:
From an information point of view, consider d ∈ D as a bit sequence of information.
The function f to be learned comprises a longer (even infinite) sequence.
If (deterministic) learning were feasible, we could learn f from d.
Hence we could learn the longer (even infinite) sequence that f represents from a shorter sequence of size |d|.
Thus, we could compress a long (even infinite) sequence into a short sequence.
Repeating the argument, we could compress everything down to the length of the shortest data set d ∈ D that still enables learning.
This is clearly absurd.



What Can We Do?

Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in. – Isaac Asimov

So we should rethink our assumptions!

Modesty is the color of virtue. – Diogenes



Analogy: Opinion Polls

Source: https://commons.wikimedia.org/wiki/File:Dewey_Defeats_Truman.jpg

Suppose X is the set of all voters in a country.


Let f : X → P where P is the set of all admissible parties.
Do we really have to predict f ?



Analogy: Opinion Polls (cont.)

In reality, we just want to get the aggregated votes correct.

We are not interested in individual votes.
Thus, we do not need to estimate f.
Also, we are prepared to accept (small) errors.
But we want to be sure that huge errors are unlikely.

If you possess a restricted amount of information for solving some problem try to solve the problem directly and never solve a more general problem as an intermediate step. – Vladimir N. Vapnik

Thus, we should not try to estimate f but rather something about the probability of errors.



Classifying the Errors
Thus, we need to describe our errors first (for discrete functions only; we will see other error measures for real-valued data later):
Definition 2
Let Y = {−1, +1}. Let the target function be denoted by f : X → Y. Let the data be given as D := {(x_1, y_1), . . . , (x_n, y_n)}, where y_i = f(x_i). We define for any hypothesis h the in-sample error as

\[
E_{\mathrm{in}}(h) := E_{\mathrm{in}} := \frac{1}{n} \sum_{x_i \in D} \mathbb{1}\bigl[ f(x_i) \neq h(x_i) \bigr].
\]

For any data set D' with D' ∩ D = ∅ we define the out-of-sample error as

\[
E_{\mathrm{out}}(h) := E_{\mathrm{out}} := \frac{1}{|D'|} \sum_{x_i \in D'} \mathbb{1}\bigl[ f(x_i) \neq h(x_i) \bigr].
\]
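
To make the definition concrete, here is a small Python sketch (my own example; the target f and hypothesis h are made up) that computes Ein on a training set and estimates Eout on fresh points:

```python
import numpy as np

def error_rate(h, points):
    """Fraction of points (x, y) on which hypothesis h disagrees with the label y."""
    return np.mean([h(x) != y for x, y in points])

# Hypothetical example on X = R^2, Y = {-1, +1}.
f = lambda x: 1 if x[0] + x[1] > 0 else -1   # unknown target
h = lambda x: 1 if x[0] > 0 else -1          # a candidate hypothesis

rng = np.random.default_rng(0)
D     = [(x, f(x)) for x in rng.uniform(-1, 1, size=(20, 2))]    # training set
D_out = [(x, f(x)) for x in rng.uniform(-1, 1, size=(1000, 2))]  # fresh points

print("E_in  =", error_rate(h, D))
print("E_out ~", error_rate(h, D_out))   # estimated on a large held-out sample
```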



Classifying the Errors (cont.)

Can we estimate Eout from the data D?


More precisely, can we estimate Eout from Ein on the data D?



Sampling

Assume we have the following classification problem:

[Figure: a two-dimensional sample classified by a target g and a hypothesis h; red and green regions mark where the two disagree and agree]

Should not Eout be related to the ratio of the red and green areas?
Problem: We do not know the distribution function (i.e. probability) of
the data.



Are we stuck again?
We do not know the distribution, but we know about rare events:

Assume we have Ein = 0, i.e. the distribution shown on the left. How likely is it that we get the picture on the right for out-of-sample data?

[Figure: left, an in-sample data set on which Ein = 0 for the hypothesis g; right, a possible out-of-sample configuration with many errors]

This is, in fact, the clue to understanding the learning problem!
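
To get a feeling for why such a mismatch is a rare event, consider a fixed hypothesis whose true error is Eout = 0.1: how often does a sample of size N = 50 nevertheless show Ein = 0? A small Monte-Carlo sketch (my own illustration, not from the slides):

```python
import numpy as np

# For a *fixed* hypothesis, each sample point is misclassified independently
# with probability Eout. How often does a sample of size N show Ein = 0 anyway?
rng = np.random.default_rng(1)
Eout, N, trials = 0.1, 50, 100_000

mistakes = rng.binomial(N, Eout, size=trials)   # number of in-sample errors per trial
print(np.mean(mistakes == 0))                   # ~ (1 - Eout)**N = 0.9**50, about 0.005
```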



Impossibility of Deterministic Learning
We cannot learn the true function, but we can (maybe) learn the probability that we are wrong (let's be more modest).

In fact, we have the following paradigm:

Non-Deterministic Learning
Given enough data, the probability of getting things not terribly wrong is reasonably high.

PAC
This principle is also called "probably approximately correct (PAC) learning".



Probability to the Rescue

Source: Ralf Roletschek, https://commons.wikimedia.org/wiki/File:13-02-27-spielbank-wiesbaden-by-RalfR-094.jpg



Vapnik-Chervonenkis Theorem – Generalization Bound
The following is the famous Vapnik-Chervonenkis theorem, also known as the
generalization bound for ML.
Theorem 3
Let H be a hypothesis set with finite VC dimension d_VC. Assume we draw sample data D of size N independently from the same unknown distribution. Then for any ε > 0 we can bound the out-of-sample error Eout by the in-sample error Ein as follows:

\[
\mathbb{P}\bigl[\,|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \varepsilon\,\bigr] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\varepsilon^{2} N / 8},
\]

where m_H(N) denotes the growth function and h ∈ H denotes the learned hypothesis.

(We will explain these notions later, bear with me!)
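
The growth function m_H is unknown in general, but plugging in the standard polynomial bound m_H(N) ≤ N^{d_VC} + 1 (an assumption on my part, not derived here) lets us evaluate the right-hand side numerically:

```python
import numpy as np

def vc_bound(N, d_vc, eps):
    """Right-hand side of the VC bound, with the growth function replaced by
    the polynomial bound m_H(2N) <= (2N)**d_vc + 1 (not derived here)."""
    return 4 * ((2 * N) ** d_vc + 1) * np.exp(-(eps ** 2) * N / 8)

# How much data do we need before the bound says anything useful (i.e. < 1)?
d_vc, eps = 3, 0.1
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {N:>9}: bound = {vc_bound(N, d_vc, eps):.3e}")
```

Only for fairly large N does the right-hand side drop below 1, which is exactly the "given enough data" caveat of the PAC paradigm above.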



Improved Learning Model

[Diagram: the unknown target function f : X → Y together with an unknown probability distribution P(x) generates the training examples (x1, y1), . . . , (xn, yn); the learning algorithm, guided by the error measure(s), selects the final hypothesis h ≈ f from the hypothesis set H.]

Figure: Learning Model, source: [AMMIL12]



Consequences
Can we prove that ML will always work?

Source: National Transportation Safety Board, https://commons.wikimedia.org/wiki/File:Tesla_Model_S_(35366284636).jpg



Simple Probabilistic Example Learning
Suppose we have two two-dimensional Gaussian distributions with µ1 = (2, 3), µ2 = (−15, −15) and covariance matrices

\[
\Sigma_1 = \begin{pmatrix} 100 & 36 \\ 36 & 100 \end{pmatrix}
\qquad \text{and} \qquad
\Sigma_2 = \begin{pmatrix} 64 & -49 \\ -49 & 100 \end{pmatrix}
\]

as follows:

[Figure: "Two 2-dim Gaussian Distributions", a scatter plot of samples from the two distributions in the (x1, x2) plane]
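The setup above is easy to reproduce. The sketch below (my own code; the separator h is hand-picked and not one of the hypotheses shown on the next slide) samples both distributions and counts how many points a simple linear rule misclassifies:

```python
import numpy as np

# Sample the two Gaussians from the example and evaluate one candidate hypothesis.
rng = np.random.default_rng(42)
mu1, Sigma1 = np.array([2.0, 3.0]),     np.array([[100.0, 36.0], [36.0, 100.0]])
mu2, Sigma2 = np.array([-15.0, -15.0]), np.array([[64.0, -49.0], [-49.0, 100.0]])

n = 200
class1 = rng.multivariate_normal(mu1, Sigma1, size=n)   # labelled +1
class2 = rng.multivariate_normal(mu2, Sigma2, size=n)   # labelled -1

# A hand-picked linear separator h(x) = sign(x1 + x2 + 12).
def h(X):
    return np.sign(X[:, 0] + X[:, 1] + 12.0)

misclassified = int(np.sum(h(class1) != 1) + np.sum(h(class2) != -1))
print("misclassified:", misclassified, "out of", 2 * n)
```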
Simple Probabilistic Example Learning (cont.)
Then we could have, for example, the following two linear hypotheses and one curved hypothesis:
[Figure: three scatter plots of the two classes with decision boundaries, titled "Linear Separator, 35 Misclassified", "Linear Separator, 8 Misclassified", and "Non-Linear Separator, 8 Misclassified"]


Challenges

Proof of Generalization Bound!

When is it valid – i.e. what are our assumptions (including implicit ones)?
How to select "best" hypothesis – i.e. what are (working) learning algorithms?
If we cannot prove that we are always "right" (in fact with some probability bigger than zero we will be wrong), how can we validate our models?



References I

[AMMIL12] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.

[Vap95] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.

