
ECS171: Machine Learning

Lecture 6: Training versus Testing (LFD 2.1)

Cho-Jui Hsieh
UC Davis

Jan 29, 2018


Preamble to the theory
Training versus testing

Out-of-sample error (generalization error):

Eout = Ex[e(h(x), f(x))]

What we want: small Eout


In-sample error (training error):
Ein = (1/N) ∑_{n=1}^{N} e(h(xn), f(xn))

This is what we can minimize
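
As a concrete illustration of the two quantities under 0/1 error, here is a minimal Python sketch (not from the lecture; the target f, the hypothesis h, and the input distribution are made-up assumptions). Ein is an average over the N training points, while Eout is an expectation over x and can only be approximated here with a large fresh sample:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative target f and hypothesis h (assumptions, not from the slides).
f = lambda X: np.sign(X[:, 0] + X[:, 1])
h = lambda X: np.sign(X[:, 0] + 0.8 * X[:, 1] - 0.1)

# In-sample error: average of e(h(xn), f(xn)) over the N training points.
X_train = rng.uniform(-1, 1, size=(20, 2))
E_in = np.mean(h(X_train) != f(X_train))

# Out-of-sample error: expectation over x, approximated by a large fresh sample.
X_fresh = rng.uniform(-1, 1, size=(100_000, 2))
E_out_approx = np.mean(h(X_fresh) != f(X_fresh))

print(f"Ein  = {E_in:.3f}")
print(f"Eout ≈ {E_out_approx:.3f}")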


The 2 questions of learning

Eout(g) ≈ 0 is achieved through:

Eout(g) ≈ Ein(g) and Ein(g) ≈ 0


Learning is thus split into 2 questions:


Can we make sure that Eout(g) ≈ Ein(g)?
Hoeffding’s inequality (?)
Can we make Ein(g) small?
Optimization (done)
What the theory will achieve

Currently we only know


P[|Ein(g) − Eout(g)| > ε] ≤ 2Me^(−2ε²N)

What if M = ∞?
(e.g., perceptron)
Todo:
We will establish a finite quantity to replace M
P[|Ein(g) − Eout(g)| > ε] ≤ 2mH(N)e^(−2ε²N)   (?)

Study mH (N) to understand the trade-off for model complexity


Reducing M to a finite number
Where did the M come from?

The Bad events Bm :


“|Ein(hm) − Eout(hm)| > ε” with probability ≤ 2e^(−2ε²N)
The union bound:
P[B1 or B2 or · · · or BM]
≤ P[B1] + P[B2] + · · · + P[BM]   (worst case: no overlaps)
≤ 2Me^(−2ε²N)
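
To see the role of M numerically, here is a small Python check of the bound 2M·e^(−2ε²N) (my own sketch; the values of ε, N, and M are arbitrary). The bound grows linearly in M, so it becomes vacuous (≥ 1) once M is large, and says nothing at all when M = ∞:

import math

def hoeffding_union_bound(M, eps, N):
    # Upper bound on P[|Ein(g) - Eout(g)| > eps] over M hypotheses.
    return 2 * M * math.exp(-2 * eps**2 * N)

eps, N = 0.1, 200
for M in [1, 10, 1000, 10**6]:
    b = hoeffding_union_bound(M, eps, N)
    note = "  (vacuous)" if b >= 1 else ""
    print(f"M = {M:>7}: bound = {b:.4g}{note}")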
Can we improve on M?

If h1 and h2 are two similar hypotheses:

∆Eout : small change in the +1 and −1 areas

∆Ein : small change in the labels of the data points

⇒ |Ein(h1) − Eout(h1)| ≈ |Ein(h2) − Eout(h2)|
Overlapping bad events!
What can we replace M with?

Instead of the whole input space


Let’s consider a finite set of input points
How many patterns of red and blue (i.e., of ±1 labels) can you get?
Dichotomies: mini-hypotheses

A hypothesis: h : X → {−1, +1}


A dichotomy: h : {x1 , x2 , · · · , xN } → {−1, +1}
Number of hypotheses |H| can be infinite
Number of dichotomies |H(x1 , x2 , · · · , xN )|:
at most 2^N
Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any N points:

mH(N) = max_{x1,··· ,xN ∈ X} |H(x1, · · · , xN)|

The growth function satisfies:

mH(N) ≤ 2^N
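
One rough way to see this definition in action is a Monte-Carlo sketch (my own construction; the hypothesis set, point placements, and sample sizes are arbitrary assumptions): sample many random 2-D perceptrons h(x) = sign(w·x + b), count the distinct label patterns they produce on a fixed point set, and take the max over random placements. Random sampling only gives a lower bound on |H(x1, · · · , xN)|, but it illustrates the "max over placements" in the definition:

import numpy as np

rng = np.random.default_rng(0)

def sampled_dichotomies(points, num_hypotheses=20_000):
    # Distinct label patterns produced on `points` by randomly sampled 2-D
    # perceptrons h(x) = sign(w·x + b); a lower bound on |H(x1, ..., xN)|.
    W = rng.normal(size=(num_hypotheses, 2))
    b = rng.normal(size=num_hypotheses)
    labels = np.sign(points @ W.T + b)       # shape (N, num_hypotheses)
    labels[labels == 0] = 1                  # break exact ties toward +1
    return len({tuple(col) for col in labels.T})

def growth_function_estimate(N, num_placements=50):
    # mH(N) = max over placements x1, ..., xN of |H(x1, ..., xN)|,
    # approximated here by trying random placements.
    return max(sampled_dichotomies(rng.uniform(-1, 1, size=(N, 2)))
               for _ in range(num_placements))

for N in range(1, 5):
    print(N, growth_function_estimate(N))    # true values for 2-D perceptrons: 2, 4, 8, 14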
Growth function for perceptrons

Compute mH(3) in 2-D space when H is the perceptron (linear hyperplanes)

What's |H(x1, x2, x3)|?

mH(3) = 8

Some placements (e.g., three collinear points) give fewer dichotomies, but that
doesn't matter because we only count the most dichotomies over any placement
Growth function for perceptrons

What’s mH (4)?
(At least) two dichotomies are missing: the two alternating ("XOR"-like) labelings,
which no line can realize

mH(4) = 14 < 2^4
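
These counts can also be verified exactly by brute force. The sketch below (my own, using scipy; the point coordinates are illustrative) enumerates all 2^N labelings of a point set and checks each for linear separability with a small feasibility LP, reproducing mH(3) = 8 and mH(4) = 14:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Is this labeling realizable by a perceptron sign(w·x + b)?
    # Feasibility LP: find (w1, w2, b) with labels_i * (w·x_i + b) >= 1 for all i.
    n = len(points)
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0          # 0 = a feasible (w, b) was found

def num_dichotomies(points):
    points = np.asarray(points, dtype=float)
    return sum(separable(points, np.array(signs))
               for signs in itertools.product([-1.0, 1.0], repeat=len(points)))

print(num_dichotomies([[0, 0], [1, 0], [0, 1]]))           # 8  -> mH(3)
print(num_dichotomies([[0, 0], [1, 0], [0, 1], [1, 1]]))   # 14 -> mH(4)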
Example I: positive rays
Example II: positive intervals
Example III: convex sets

H is the set of h : R² → {−1, +1} for which the region h(x) = +1 is convex

How many dichotomies can we generate?

mH(N) = 2^N for any N (e.g., place the N points on a circle; any subset of them
can be enclosed by a convex region)

We say the N points are “shattered” by convex sets
The 3 growth functions

H is positive rays:
mH (N) = N + 1
H is positive intervals:
mH(N) = (1/2)N² + (1/2)N + 1
H is convex sets:
mH(N) = 2^N
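
The first two formulas are easy to confirm by brute force, assuming the standard LFD definitions of these hypothesis sets (a positive ray labels +1 everything above a threshold a; a positive interval labels +1 everything inside an interval). This sketch sweeps thresholds between sorted 1-D points and counts the distinct label patterns; convex sets are not enumerated here:

import itertools
import numpy as np

def ray_dichotomies(x):
    # Patterns realizable by a positive ray: h(x) = +1 iff x > a.
    x = np.sort(np.asarray(x, dtype=float))
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    return {tuple(np.where(x > a, 1, -1)) for a in cuts}

def interval_dichotomies(x):
    # Patterns realizable by a positive interval: h(x) = +1 iff a < x <= b.
    x = np.sort(np.asarray(x, dtype=float))
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    pats = {tuple(np.where((x > a) & (x <= b), 1, -1))
            for a, b in itertools.combinations(cuts, 2)}
    pats.add((-1,) * len(x))                 # the all-(-1) pattern (empty interval)
    return pats

rng = np.random.default_rng(0)
for N in range(1, 7):
    x = rng.uniform(size=N)
    print(N, len(ray_dichotomies(x)), "vs", N + 1, "|",
          len(interval_dichotomies(x)), "vs", N * (N + 1) // 2 + 1)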
What’s next?

Remember the inequality


P[|Ein − Eout| > ε] ≤ 2Me^(−2ε²N)

What happens if we replace M by mH (N)?


mH(N) polynomial ⇒ the bound still goes to 0 as N grows ⇒ Good!
How to show mH (N) is polynomial?
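
A quick numeric illustration of why "polynomial ⇒ Good" (my own sketch; ε = 0.1 and the values of N are arbitrary choices): plug a polynomial mH(N), here the positive-interval count (1/2)N² + (1/2)N + 1, and an exponential one, 2^N, into 2·mH(N)·e^(−2ε²N). The polynomial case drives the bound toward 0 as N grows; the exponential case blows up:

import math

def bound(m_H, N, eps=0.1):
    # The bound 2 * mH(N) * e^(-2 eps^2 N), with mH(N) plugged in for M.
    return 2 * m_H * math.exp(-2 * eps**2 * N)

for N in [100, 200, 500, 1000]:
    poly = 0.5 * N**2 + 0.5 * N + 1     # polynomial mH(N), e.g. positive intervals
    expo = 2.0 ** N                     # exponential mH(N), e.g. convex sets
    print(f"N = {N:>4}:  polynomial mH -> {bound(poly, N):.3e}   "
          f"exponential mH -> {bound(expo, N):.3e}")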
Conclusions

Next class: LFD 2.1, 2.2

Questions?
