
ECS171: Machine Learning

Lecture 6: Training versus Testing (LFD 2.1)

Cho-Jui Hsieh
UC Davis

Jan 29, 2018


Preamble to the theory
Training versus testing

Out-of-sample error (generalization error):

Eout = Ex[e(h(x), f(x))]

What we want: small Eout


In-sample error (training error):
Ein = (1/N) ∑_{n=1}^{N} e(h(xn), f(xn))

This is what we can minimize
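
As a concrete illustration of the two quantities under 0/1 error, here is a minimal Python sketch (not from the lecture; the target f, the hypothesis h, and the input distribution are made-up assumptions). Ein is an average over the N training points, while Eout is an expectation over x and can only be approximated here with a large fresh sample:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative target f and hypothesis h (assumptions, not from the slides).
f = lambda X: np.sign(X[:, 0] + X[:, 1])
h = lambda X: np.sign(X[:, 0] + 0.8 * X[:, 1] - 0.1)

# In-sample error: average of e(h(xn), f(xn)) over the N training points.
X_train = rng.uniform(-1, 1, size=(20, 2))
E_in = np.mean(h(X_train) != f(X_train))

# Out-of-sample error: expectation over x, approximated by a large fresh sample.
X_fresh = rng.uniform(-1, 1, size=(100_000, 2))
E_out_approx = np.mean(h(X_fresh) != f(X_fresh))

print(f"Ein  = {E_in:.3f}")
print(f"Eout ≈ {E_out_approx:.3f}")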


The 2 questions of learning

Eout(g) ≈ 0 is achieved through:

Eout(g) ≈ Ein(g) and Ein(g) ≈ 0


Learning is thus split into 2 questions:


Can we make sure that Eout(g) ≈ Ein(g)?
Hoeffding’s inequality (?)
Can we make Ein(g) small?
Optimization (done)
What the theory will achieve

Currently we only know


P[|Ein(g) − Eout(g)| > ε] ≤ 2Me^(−2ε²N)

What if M = ∞?
(e.g., perceptron)
Todo:
We will establish a finite quantity to replace M
P[|Ein(g) − Eout(g)| > ε] ≤ 2mH(N)e^(−2ε²N)   (?)

Study mH (N) to understand the trade-off for model complexity


Reducing M to a finite number
Where did the M come from?

The Bad events Bm :


“|Ein(hm) − Eout(hm)| > ε” with probability ≤ 2e^(−2ε²N)
The union bound:
P[B1 or B2 or · · · or BM]
≤ P[B1] + P[B2] + · · · + P[BM]   (worst case: no overlaps)
≤ 2Me^(−2ε²N)
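
To see the role of M numerically, here is a small Python check of the bound 2M·e^(−2ε²N) (my own sketch; the values of ε, N, and M are arbitrary). The bound grows linearly in M, so it becomes vacuous (≥ 1) once M is large, and says nothing at all when M = ∞:

import math

def hoeffding_union_bound(M, eps, N):
    # Upper bound on P[|Ein(g) - Eout(g)| > eps] over M hypotheses.
    return 2 * M * math.exp(-2 * eps**2 * N)

eps, N = 0.1, 200
for M in [1, 10, 1000, 10**6]:
    b = hoeffding_union_bound(M, eps, N)
    note = "  (vacuous)" if b >= 1 else ""
    print(f"M = {M:>7}: bound = {b:.4g}{note}")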
Can we improve on M?

If h1 and h2 are two similar hypotheses:

∆Eout : small change in the +1 and −1 areas

∆Ein : small change in the labels of the data points

⇒ |Ein(h1) − Eout(h1)| ≈ |Ein(h2) − Eout(h2)|
Overlapping bad events!
What can we replace M with?

Instead of the whole input space


Let’s consider a finite set of input points
How many patterns of red and blue (i.e., of ±1 labels) can you get?
Dichotomies: mini-hypotheses

A hypothesis: h : X → {−1, +1}


A dichotomy: h : {x1 , x2 , · · · , xN } → {−1, +1}
Number of hypotheses |H| can be infinite
Number of dichotomies |H(x1 , x2 , · · · , xN )|:
at most 2^N
Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any N points:

mH(N) = max_{x1,··· ,xN ∈ X} |H(x1, · · · , xN)|

The growth function satisfies:

mH(N) ≤ 2^N
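
One rough way to see this definition in action is a Monte-Carlo sketch (my own construction; the hypothesis set, point placements, and sample sizes are arbitrary assumptions): sample many random 2-D perceptrons h(x) = sign(w·x + b), count the distinct label patterns they produce on a fixed point set, and take the max over random placements. Random sampling only gives a lower bound on |H(x1, · · · , xN)|, but it illustrates the "max over placements" in the definition:

import numpy as np

rng = np.random.default_rng(0)

def sampled_dichotomies(points, num_hypotheses=20_000):
    # Distinct label patterns produced on `points` by randomly sampled 2-D
    # perceptrons h(x) = sign(w·x + b); a lower bound on |H(x1, ..., xN)|.
    W = rng.normal(size=(num_hypotheses, 2))
    b = rng.normal(size=num_hypotheses)
    labels = np.sign(points @ W.T + b)       # shape (N, num_hypotheses)
    labels[labels == 0] = 1                  # break exact ties toward +1
    return len({tuple(col) for col in labels.T})

def growth_function_estimate(N, num_placements=50):
    # mH(N) = max over placements x1, ..., xN of |H(x1, ..., xN)|,
    # approximated here by trying random placements.
    return max(sampled_dichotomies(rng.uniform(-1, 1, size=(N, 2)))
               for _ in range(num_placements))

for N in range(1, 5):
    print(N, growth_function_estimate(N))    # true values for 2-D perceptrons: 2, 4, 8, 14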
Growth function for perceptrons

Compute mH(3) in 2-D space when H is the perceptron (linear hyperplanes)

What's |H(x1, x2, x3)|?

mH(3) = 8

Some placements (e.g., three collinear points) give fewer dichotomies, but that
doesn't matter because we only count the most dichotomies over any placement
Growth function for perceptrons

What’s mH (4)?
(At least) two dichotomies are missing: the two alternating ("XOR"-like) labelings,
which no line can realize

mH(4) = 14 < 2^4
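
These counts can also be verified exactly by brute force. The sketch below (my own, using scipy; the point coordinates are illustrative) enumerates all 2^N labelings of a point set and checks each for linear separability with a small feasibility LP, reproducing mH(3) = 8 and mH(4) = 14:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Is this labeling realizable by a perceptron sign(w·x + b)?
    # Feasibility LP: find (w1, w2, b) with labels_i * (w·x_i + b) >= 1 for all i.
    n = len(points)
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0          # 0 = a feasible (w, b) was found

def num_dichotomies(points):
    points = np.asarray(points, dtype=float)
    return sum(separable(points, np.array(signs))
               for signs in itertools.product([-1.0, 1.0], repeat=len(points)))

print(num_dichotomies([[0, 0], [1, 0], [0, 1]]))           # 8  -> mH(3)
print(num_dichotomies([[0, 0], [1, 0], [0, 1], [1, 1]]))   # 14 -> mH(4)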
Example I: positive rays
Example II: positive intervals
Example III: convex sets

H is the set of h : R² → {−1, +1} for which the region h(x) = +1 is convex

How many dichotomies can we generate?

mH(N) = 2^N for any N (e.g., place the N points on a circle; any subset of them
can be enclosed by a convex region)

We say the N points are “shattered” by convex sets
The 3 growth functions

H is positive rays:
mH (N) = N + 1
H is positive intervals:
mH(N) = (1/2)N² + (1/2)N + 1
H is convex sets:
mH(N) = 2^N
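
The first two formulas are easy to confirm by brute force, assuming the standard LFD definitions of these hypothesis sets (a positive ray labels +1 everything above a threshold a; a positive interval labels +1 everything inside an interval). This sketch sweeps thresholds between sorted 1-D points and counts the distinct label patterns; convex sets are not enumerated here:

import itertools
import numpy as np

def ray_dichotomies(x):
    # Patterns realizable by a positive ray: h(x) = +1 iff x > a.
    x = np.sort(np.asarray(x, dtype=float))
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    return {tuple(np.where(x > a, 1, -1)) for a in cuts}

def interval_dichotomies(x):
    # Patterns realizable by a positive interval: h(x) = +1 iff a < x <= b.
    x = np.sort(np.asarray(x, dtype=float))
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    pats = {tuple(np.where((x > a) & (x <= b), 1, -1))
            for a, b in itertools.combinations(cuts, 2)}
    pats.add((-1,) * len(x))                 # the all-(-1) pattern (empty interval)
    return pats

rng = np.random.default_rng(0)
for N in range(1, 7):
    x = rng.uniform(size=N)
    print(N, len(ray_dichotomies(x)), "vs", N + 1, "|",
          len(interval_dichotomies(x)), "vs", N * (N + 1) // 2 + 1)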
What’s next?

Remember the inequality


P[|Ein − Eout| > ε] ≤ 2Me^(−2ε²N)

What happens if we replace M by mH (N)?


mH(N) polynomial ⇒ the bound still goes to 0 as N grows ⇒ Good!
How to show mH (N) is polynomial?
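
A quick numeric illustration of why "polynomial ⇒ Good" (my own sketch; ε = 0.1 and the values of N are arbitrary choices): plug a polynomial mH(N), here the positive-interval count (1/2)N² + (1/2)N + 1, and an exponential one, 2^N, into 2·mH(N)·e^(−2ε²N). The polynomial case drives the bound toward 0 as N grows; the exponential case blows up:

import math

def bound(m_H, N, eps=0.1):
    # The bound 2 * mH(N) * e^(-2 eps^2 N), with mH(N) plugged in for M.
    return 2 * m_H * math.exp(-2 * eps**2 * N)

for N in [100, 200, 500, 1000]:
    poly = 0.5 * N**2 + 0.5 * N + 1     # polynomial mH(N), e.g. positive intervals
    expo = 2.0 ** N                     # exponential mH(N), e.g. convex sets
    print(f"N = {N:>4}:  polynomial mH -> {bound(poly, N):.3e}   "
          f"exponential mH -> {bound(expo, N):.3e}")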
Conclusions

Next class: LFD 2.1, 2.2

Questions?
