
Learning From Data

2: Learning Foundations
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main



Content

Motivation

Deterministic Learning

Error Measures

Statistical Learning

Bibliography



The Future

I visualize a time when we will be to robots what dogs are to humans, and I'm rooting for the machines. – Claude Shannon



Recap: The Learning Model

[Diagram: the unknown target function f : X → Y generates the training examples (x1, y1), . . . , (xn, yn); the learning algorithm selects the final hypothesis h ≈ f from the hypothesis set H.]

Figure: Learning Model, source: [AMMIL12]



Is the Learning Setup Realistic?

Assume we have the following setup:

Let X = {0, 1}^3. Let Y = {0, 1}.
Assume we are given the data set D as follows:
x_n     y_n
0 0 0   0
0 0 1   1
0 1 0   1
0 1 1   0
1 0 0   1
Can we predict the remaining 3 values?



Is the Learning Setup Realistic? (cont.)
There are 8 different functions that could agree with the data set D (why?):
x_n     y_n  g    f1  f2  f3  f4  f5  f6  f7  f8
0 0 0   0    0    0   0   0   0   0   0   0   0
0 0 1   1    1    1   1   1   1   1   1   1   1
0 1 0   1    1    1   1   1   1   1   1   1   1
0 1 1   0    0    0   0   0   0   0   0   0   0
1 0 0   1    1    1   1   1   1   1   1   1   1
1 0 1        ?    0   1   0   0   1   1   0   1
1 1 0        ?    0   0   1   0   1   0   1   1
1 1 1        ?    0   0   0   1   0   1   1   1
So which one is the “right” g?
We do not know! It is impossible to choose.
Conclusion
Even in this simple, concrete bit-guessing task, learning is impossible.
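
The counting is easy to verify by brute force. Below is a minimal Python sketch (my own illustration, not part of the slides) that enumerates all 2^8 = 256 Boolean functions on X = {0, 1}^3 and keeps those consistent with D:

```python
from itertools import product

# All 8 inputs in X = {0,1}^3 and the five labelled examples from D.
X = list(product([0, 1], repeat=3))
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

consistent = []
for labels in product([0, 1], repeat=len(X)):   # all 2^8 = 256 Boolean functions
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in D.items()):    # does f agree with D?
        consistent.append(f)

print(len(consistent))  # 8 -- one for each labelling of the three unseen inputs
```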
Real Line Example – Two Points
How many straight lines fit through two points? (Exactly one.)


Ok, but that’s only for two points, what about many?



Real Line Example – Many Points
How many straight lines fit through many points?

[Figure: five data points fitted exactly by a curve with kinks]
Ok, but that’s only if we allow for curves with quirky kinks, what about
smooth curves?
Real Line Example
Can we fit a polynomial to any N points?
Lemma 1
Let N be any number of points (x_1, y_1), . . . , (x_N, y_N) ∈ R^2 with pairwise distinct x_i. Then there exists a polynomial p(x) = Σ_{i=0}^{M} β_i x^i of degree M such that p(x_i) = y_i for all i ∈ 1, . . . , N if M ≥ N − 1.

Proof.
We write the regression problem Σ_{i=0}^{M} β_i x^i in matrix form as follows:

\[
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^M \\
1 & x_2 & x_2^2 & \cdots & x_2^M \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^M
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_M \end{pmatrix},
\qquad \text{i.e.} \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta}.
\]



Real Line Example (cont.)
Proof.
Consider the function φ(x, β) := ½ ‖y − Xβ‖².
To get its minimum, we take derivatives:

\[
\frac{\partial \varphi(x, \beta)}{\partial \beta_j}
= \frac{\partial}{\partial \beta_j}\,\frac{1}{2}\sum_{i=1}^{N}\Bigl(y_i - \sum_{k=0}^{M} X_{ik}\beta_k\Bigr)^{2}
= -\sum_{i=1}^{N}\Bigl(y_i - \sum_{k=0}^{M} X_{ik}\beta_k\Bigr) X_{ij}
= -\sum_{i=1}^{N} X^{t}_{ji}\, y_i + \sum_{i=1}^{N}\sum_{k=0}^{M} X^{t}_{ji} X_{ik}\beta_k .
\]

Thus
\[
\nabla_{\beta}\,\varphi = -\mathbf{X}^{t}\mathbf{y} + (\mathbf{X}^{t}\mathbf{X})\boldsymbol{\beta}.
\]
Hence ∇_β φ = 0 is equivalent to X^t y = (X^t X)β.



Real Line Example (cont.)
Proof.
Note that the matrix X is a so-called Vandermonde matrix. Consider first the case M = N − 1.
Then det(X) = \prod_{1 \le i < j \le N} (x_j − x_i). Since the x_i are pairwise distinct, the matrix X is invertible, and hence X^t X is, too. If we add columns to X (as M ≥ N − 1), the rank of X is still maximal and hence XX^t has full rank. Therefore, we can define the matrix X^†, the so-called Moore-Penrose inverse, as follows:

\[
\mathbf{X}^{\dagger} := \mathbf{X}^{t} (\mathbf{X}\mathbf{X}^{t})^{-1}.
\]

If we define
\[
\hat{\boldsymbol{\beta}} := \mathbf{X}^{\dagger}\mathbf{y},
\]
then we compute that
\[
\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}\mathbf{X}^{\dagger}\mathbf{y} = \mathbf{X}\mathbf{X}^{t}(\mathbf{X}\mathbf{X}^{t})^{-1}\mathbf{y} = \mathbf{y}.
\]

Thus, the minimum of φ is zero and we have found an exact solution.
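
As a sanity check, the construction in the proof is easy to reproduce numerically. The following small NumPy sketch (my own illustration; the data values are made up) builds the Vandermonde matrix for N = 5 points and solves the normal equations X^t X β = X^t y with M = N − 1:

```python
import numpy as np

# Fit a polynomial of degree M = N - 1 exactly through N points with
# pairwise distinct x-values, via the normal equations from the proof.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([3.0, 0.5, -1.0, 2.0, 4.0])

N = len(x)
M = N - 1
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M

# Solve X^t X beta = X^t y; here X is square and invertible, so the fit is exact.
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(X @ beta, y))            # True: p(x_i) = y_i for all i
```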



Real Line Example (cont.)

Now, given N points, how many polynomials exist that fit these points?

Infinitely many!
Proof.
Take any additional point (x_{N+1}, y_{N+1}) with x_{N+1} distinct from x_1, . . . , x_N. Then the lemma above implies that we can find a polynomial fitting all N + 1 points; different choices of y_{N+1} yield different polynomials.



Real Line Example (cont.)
How many polynomials fit through N points?

[Figure: five data points fitted by a linear function, a degree-3 polynomial, and a "wiggly" degree-9 polynomial]

(We will look at the "wiggly" thing and its implications later.)



Conclusion

Deterministic Learning
It is impossible to learn the true target function from the hypothesis set H and
finite data D alone.



Why Deterministic Learning is Impossible
As computer scientists we should not be surprised:
From an information point of view, consider d ∈ D as a bit sequence of information.
The function f to be learned comprises a longer (even infinite) sequence.
If (deterministic) learning were feasible, we could learn f from d.
Hence we could learn the longer (even infinite) sequence that f represents from a shorter sequence of size |d|.
Thus, we could compress a long (even infinite) sequence into a short sequence.
Repeating the argument, we could compress everything down to the length of the shortest data set d ∈ D that still enables learning.
This is clearly absurd.



What Can We Do?

Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in. – Isaac Asimov

So we should rethink our assumptions!

Modesty is the color of virtue. – Diogenes



Analogy: Opinion Polls

Source: https://commons.wikimedia.org/wiki/File:Dewey_Defeats_Truman.jpg

Suppose X is the set of all voters in a country.


Let f : X → P where P is the set of all admissible parties.
Do we really have to predict f ?



Analogy: Opinion Polls (cont.)

In reality, we just want to get the aggregated votes correct.

We are not interested in individual votes.
Thus, we do not need to estimate f.
Also, we are prepared to accept (small) errors.
But we want to be sure that huge errors are unlikely.

If you possess a restricted amount of information for solving some problem try to solve the problem directly and never solve a more general problem as an intermediate step. – Vladimir N. Vapnik

Thus, we should not try to estimate f but rather something about the probability of errors.



Classifying the Errors
Thus, we need to describe our errors first (for discrete functions only; we will see other error measures for real-valued data later):
Definition 2
Let Y = {−1, +1}. Let the target function be denoted by f : X → Y. Let the data be given as D := {(x_1, y_1), . . . , (x_n, y_n)}, where y_i = f(x_i). We define for any hypothesis h the in-sample error as

\[
E_{\mathrm{in}}(h) := E_{\mathrm{in}} := \frac{1}{n} \sum_{x_i \in D} \mathbb{1}\bigl[ f(x_i) \neq h(x_i) \bigr].
\]

For any data set D' with D' ∩ D = ∅ we define the out-of-sample error as

\[
E_{\mathrm{out}}(h) := E_{\mathrm{out}} := \frac{1}{|D'|} \sum_{x_i \in D'} \mathbb{1}\bigl[ f(x_i) \neq h(x_i) \bigr].
\]
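
To make the definition concrete, here is a small Python sketch (my own example; the target f and hypothesis h are made up) that computes Ein on a training set and estimates Eout on fresh points:

```python
import numpy as np

def error_rate(h, points):
    """Fraction of points (x, y) on which hypothesis h disagrees with the label y."""
    return np.mean([h(x) != y for x, y in points])

# Hypothetical example on X = R^2, Y = {-1, +1}.
f = lambda x: 1 if x[0] + x[1] > 0 else -1   # unknown target
h = lambda x: 1 if x[0] > 0 else -1          # a candidate hypothesis

rng = np.random.default_rng(0)
D     = [(x, f(x)) for x in rng.uniform(-1, 1, size=(20, 2))]    # training set
D_out = [(x, f(x)) for x in rng.uniform(-1, 1, size=(1000, 2))]  # fresh points

print("E_in  =", error_rate(h, D))
print("E_out ~", error_rate(h, D_out))   # estimated on a large held-out sample
```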



Classifying the Errors (cont.)

Can we estimate Eout from the data D?


More precisely, can we estimate Eout from Ein on the data D?



Sampling

Assume we have the following classification problem:

[Figure: a two-dimensional sample classified by a target g and a hypothesis h; red and green regions mark where the two disagree and agree]

Should not Eout be related to the ratio of the red and green areas?
Problem: We do not know the distribution function (i.e. probability) of
the data.



Are we stuck again?
We do not know the distribution, but we know about rare events:

Assume we have Ein = 0, i.e. the distribution shown on the left. How likely is it that we get the picture on the right for out-of-sample data?

[Figure: left, an in-sample data set on which Ein = 0 for the hypothesis g; right, a possible out-of-sample configuration with many errors]

This is, in fact, the clue to understanding the learning problem!
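
To get a feeling for why such a mismatch is a rare event, consider a fixed hypothesis whose true error is Eout = 0.1: how often does a sample of size N = 50 nevertheless show Ein = 0? A small Monte-Carlo sketch (my own illustration, not from the slides):

```python
import numpy as np

# For a *fixed* hypothesis, each sample point is misclassified independently
# with probability Eout. How often does a sample of size N show Ein = 0 anyway?
rng = np.random.default_rng(1)
Eout, N, trials = 0.1, 50, 100_000

mistakes = rng.binomial(N, Eout, size=trials)   # number of in-sample errors per trial
print(np.mean(mistakes == 0))                   # ~ (1 - Eout)**N = 0.9**50, about 0.005
```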



Impossibility of Deterministic Learning
We cannot learn the true function, but we can (maybe) learn the probability that we are wrong (let's be more modest).

In fact, we have the following paradigm:

Non-Deterministic Learning
Given enough data, the probability of getting things not terribly wrong is reasonably high.

PAC
This principle is also called "probably approximately correct (PAC) learning".



Probability to the Rescue

Source: Ralf Roletschek, https://commons.wikimedia.org/wiki/File:13-02-27-spielbank-wiesbaden-by-RalfR-094.jpg



Vapnik-Chervonenkis Theorem – Generalization Bound
The following is the famous Vapnik-Chervonenkis theorem, also known as the
generalization bound for ML.
Theorem 3
Let H be a hypothesis set with finite VC dimension d_VC. Assume we draw sample data D of size N independently from the same unknown distribution. Then for any ε > 0 we can bound the out-of-sample error Eout by the in-sample error Ein as follows:

\[
\mathbb{P}\bigl[\,|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)| > \varepsilon\,\bigr] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\varepsilon^{2} N / 8},
\]

where m_H(N) denotes the growth function and h ∈ H denotes the learned hypothesis.

(We will explain these notions later, bear with me!)
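
The growth function m_H is unknown in general, but plugging in the standard polynomial bound m_H(N) ≤ N^{d_VC} + 1 (an assumption on my part, not derived here) lets us evaluate the right-hand side numerically:

```python
import numpy as np

def vc_bound(N, d_vc, eps):
    """Right-hand side of the VC bound, with the growth function replaced by
    the polynomial bound m_H(2N) <= (2N)**d_vc + 1 (not derived here)."""
    return 4 * ((2 * N) ** d_vc + 1) * np.exp(-(eps ** 2) * N / 8)

# How much data do we need before the bound says anything useful (i.e. < 1)?
d_vc, eps = 3, 0.1
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {N:>9}: bound = {vc_bound(N, d_vc, eps):.3e}")
```

Only for fairly large N does the right-hand side drop below 1, which is exactly the "given enough data" caveat of the PAC paradigm above.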



Improved Learning Model

[Diagram: the unknown target function f : X → Y together with an unknown probability distribution P(x) generates the training examples (x1, y1), . . . , (xn, yn); the learning algorithm, guided by the error measure(s), selects the final hypothesis h ≈ f from the hypothesis set H.]

Figure: Learning Model, source: [AMMIL12]



Consequences
Can we prove that ML will always work?

Source: National Transportation Safety Board, https://commons.wikimedia.org/wiki/File:Tesla_Model_S_(35366284636).jpg



Simple Probabilistic Example Learning
Suppose we have two two-dimensional Gaussian distributions with µ1 = (2, 3), µ2 = (−15, −15) and covariance matrices

\[
\Sigma_1 = \begin{pmatrix} 100 & 36 \\ 36 & 100 \end{pmatrix}
\qquad \text{and} \qquad
\Sigma_2 = \begin{pmatrix} 64 & -49 \\ -49 & 100 \end{pmatrix}
\]

as follows:

[Figure: "Two 2-dim Gaussian Distributions", a scatter plot of samples from the two distributions in the (x1, x2) plane]
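The setup above is easy to reproduce. The sketch below (my own code; the separator h is hand-picked and not one of the hypotheses shown on the next slide) samples both distributions and counts how many points a simple linear rule misclassifies:

```python
import numpy as np

# Sample the two Gaussians from the example and evaluate one candidate hypothesis.
rng = np.random.default_rng(42)
mu1, Sigma1 = np.array([2.0, 3.0]),     np.array([[100.0, 36.0], [36.0, 100.0]])
mu2, Sigma2 = np.array([-15.0, -15.0]), np.array([[64.0, -49.0], [-49.0, 100.0]])

n = 200
class1 = rng.multivariate_normal(mu1, Sigma1, size=n)   # labelled +1
class2 = rng.multivariate_normal(mu2, Sigma2, size=n)   # labelled -1

# A hand-picked linear separator h(x) = sign(x1 + x2 + 12).
def h(X):
    return np.sign(X[:, 0] + X[:, 1] + 12.0)

misclassified = int(np.sum(h(class1) != 1) + np.sum(h(class2) != -1))
print("misclassified:", misclassified, "out of", 2 * n)
```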
Simple Probabilistic Example Learning (cont.)
Then we could have, for example, the following two linear hypotheses and one curved hypothesis:
[Figure: three scatter plots of the two classes with decision boundaries, titled "Linear Separator, 35 Misclassified", "Linear Separator, 8 Misclassified", and "Non-Linear Separator, 8 Misclassified"]


Challenges

Proof of Generalization Bound!

When is it valid – i.e. what are our assumptions (including implicit ones)?
How to select "best" hypothesis – i.e. what are (working) learning algorithms?
If we cannot prove that we are always "right" (in fact with some probability bigger than zero we will be wrong), how can we validate our models?



References I

[AMMIL12] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.

[Vap95] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.

