2: Learning Foundations
Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main
Motivation
Deterministic Learning
Error Measures
Statistical Learning
Bibliography
[Diagram: the learning setup. Training examples (x1, y1), ..., (xn, yn) feed the learning algorithm, which picks the final hypothesis h ≈ f from the hypothesis set H.]
Real Line Example – Two Points

[Figure: two points on the real line and a straight line through them.]

Ok, but that's only for two points, what about many?
Real Line Example – Many Points

How many straight lines fit through many points?

[Figure: several points; no single straight line passes through all of them, but a curve with kinks does.]

Ok, but that's only if we allow for curves with quirky kinks, what about smooth curves?
Real Line Example

Can we fit a polynomial to any N points?

Lemma 1
Let $(x_1, y_1), \dots, (x_N, y_N) \in \mathbb{R}^2$ be any $N$ points with distinct $x_i$. Then there exists a polynomial $p(x) = \sum_{i=0}^{M} \beta_i x^i$ of degree $M$ such that $p(x_i) = y_i$ for all $i \in \{1, \dots, N\}$ if $M \geq N - 1$.
Proof.
We write the regression problem $\sum_{i=0}^{M} \beta_i x^i$ in matrix form as follows:
\[
\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & \dots & x_1^M \\
1 & x_2 & x_2^2 & \dots & x_2^M \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_N & x_N^2 & \dots & x_N^M
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_M \end{pmatrix},
\qquad \text{i.e.} \quad y = X\beta.
\]
Minimizing the squared error $\varphi(\beta) = \frac{1}{2}\lVert X\beta - y \rVert^2$, we obtain
\[
\nabla\varphi(\beta) = -X^t y + (X^t X)\beta.
\]
Hence $\nabla\varphi(\beta) = 0$ is equivalent to $X^t y = (X^t X)\beta$. Since the $x_i$ are distinct and $M \geq N - 1$, the Vandermonde matrix $X$ has full row rank, so $X X^t$ is invertible and we may set
\[
X^\dagger := X^t (X X^t)^{-1}.
\]
If we define
\[
\hat{\beta} := X^\dagger y,
\]
then we compute that
\[
X\hat{\beta} = X X^t (X X^t)^{-1} y = y,
\]
i.e. $p(x_i) = y_i$ for all $i$.
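As a quick numerical check of the proof, one can set up the Vandermonde system and apply the right pseudoinverse directly. A minimal sketch in NumPy (the random points are illustrative, not from the lecture):

```python
# Numerical check of Lemma 1 (a sketch; points are made up).
import numpy as np

rng = np.random.default_rng(0)
N = 5
x = rng.uniform(-1.0, 1.0, size=N)   # distinct x_i almost surely
y = rng.uniform(-1.0, 1.0, size=N)

M = N - 1                                  # degree from the lemma
X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M

# beta_hat = X^t (X X^t)^{-1} y, the right pseudoinverse applied to y
beta_hat = X.T @ np.linalg.solve(X @ X.T, y)

# The interpolation condition p(x_i) = y_i holds up to rounding:
p_at_x = np.polynomial.polynomial.polyval(x, beta_hat)
assert np.allclose(p_at_x, y)
```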
Now, given N points, how many polynomials exist that fit these points?
Infinitely many!
Proof.
Take any additional point $(x_{N+1}, y_{N+1})$ with a new $x_{N+1}$. The lemma above implies that we can find a polynomial fitting all $N + 1$ points, and different choices of $y_{N+1}$ yield different such polynomials.
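A minimal sketch of this argument in NumPy (the concrete points and the interpolate helper are made up for illustration): two different choices of the extra point give two different cubics through the same three points.

```python
# Non-uniqueness of interpolating polynomials (illustrative sketch).
import numpy as np

def interpolate(x, y):
    """Coefficients of the unique degree-(len(x)-1) interpolant."""
    return np.linalg.solve(np.vander(x, len(x), increasing=True), y)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 0.0, 1.0])

# Two different choices for the extra point (x_{N+1}, y_{N+1}) ...
p1 = interpolate(np.append(x, 3.0), np.append(y, 5.0))
p2 = interpolate(np.append(x, 3.0), np.append(y, -5.0))

# ... give two different cubics, both passing through the original points.
V = np.vander(x, 4, increasing=True)
assert np.allclose(V @ p1, y) and np.allclose(V @ p2, y)
assert not np.allclose(p1, p2)
```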
[Figure: the same data points fitted by a linear function, a polynomial of degree 3, and a "wiggly" polynomial of degree 9.]
(We will look at the “wiggly” thing and its implications later.)
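To preview the effect numerically, the following sketch (a made-up sine target with Gaussian noise, not the lecture's data) compares the three degrees off-sample:

```python
# A numerical look at the "wiggly" effect (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(10)

x_test = np.linspace(-1.0, 1.0, 200)
for degree in (1, 3, 9):
    coeffs = np.polynomial.polynomial.polyfit(x, y, degree)
    pred = np.polynomial.polynomial.polyval(x_test, coeffs)
    off_sample = np.mean((pred - np.sin(np.pi * x_test)) ** 2)
    # Degree 9 reproduces the sample exactly, yet typically does
    # worst between the sample points.
    print(degree, off_sample)
```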
Deterministic Learning

It is impossible to learn the true function g from the hypothesis set H and finite data D alone.
Source: https://commons.wikimedia.org/wiki/File:Dewey_Defeats_Truman.jpg
[Figure: two panels comparing a hypothesis h with the target g; both fit all sample points, with red and green areas marking where h and g disagree and agree.]
Should not E_out be related to the ratio of the red and green areas?
Problem: We do not know the distribution (i.e. the probability measure) of the data.
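For intuition: if P(x) were known, E_out(h) = P[h(x) ≠ g(x)] is exactly the probability mass of the disagreement region, i.e. the shaded disagreement area. A Monte Carlo sketch with made-up g and h and a uniform P(x):

```python
# Monte Carlo sketch of E_out as a disagreement area (g, h and the
# uniform P(x) on [-1, 1] are made-up stand-ins for illustration).
import numpy as np

g = lambda x: np.sign(np.sin(2 * np.pi * x))  # hypothetical target
h = lambda x: np.sign(x)                      # hypothetical hypothesis

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=100_000)      # samples from P(x)
e_out = np.mean(g(x) != h(x))
print(e_out)  # ~0.5: g and h disagree on half of the interval
```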
Assume we have E_in = 0, i.e. the following in-sample fit. How likely is it that we get the following out-of-sample behaviour?

[Figure: a hypothesis h that fits all sample points (E_in = 0), while the target g deviates from h away from the sample.]
Non-Deterministic Learning

Given enough data, the probability of getting things not terribly wrong is reasonably high.

PAC
This principle is also called "probably approximately correct (PAC) learning".
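This can be made quantitative. For a single fixed hypothesis, Hoeffding's inequality gives P[|E_in − E_out| > ε] ≤ 2·exp(−2ε²N), a standard bound shown here as an illustration rather than the lecture's own derivation:

```python
# "Probably approximately correct" via Hoeffding's inequality for a
# single fixed hypothesis (standard bound, shown as an illustration):
#   P[|E_in - E_out| > eps] <= 2 * exp(-2 * eps^2 * N)
import math

def hoeffding_bound(eps: float, n: int) -> float:
    """Upper bound on the probability that E_in misses E_out by > eps."""
    return 2.0 * math.exp(-2.0 * eps**2 * n)

for n in (100, 1_000, 10_000):
    print(n, hoeffding_bound(0.05, n))
# N = 100 gives a vacuous bound (> 1); N = 10 000 gives ~ 4e-22:
# with enough data, being approximately correct is very probable.
```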
[Diagram: the learning setup revisited. Training examples (x1, y1), ..., (xn, yn), now drawn from an unknown probability distribution P(x), feed the learning algorithm, which picks the final hypothesis h ≈ f from the hypothesis set H.]
Source: https://commons.wikimedia.org/wiki/File:Tesla_Model_S_(35366284636).jpg
[Figure: two scatter plots in the (x1, x2) plane.]