
Machine Learning

Lecture 4: Theory of Generalization
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/49


Theory of Generalization

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 4: Theory of Generalization


Effective Number of Lines
Effective Number of Hypotheses
Break Point
Definition of VC Dimension
VC Dimension of Perceptrons
Physical Intuition of VC Dimension
Interpreting VC Dimension

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/49


Theory of Generalization Effective Number of Lines
Is M = ∞ Feasible?

• input x ∈ [−1, +1] ⊂ R, generated i.i.d. from a uniform distribution


• target f(x) = sign(x), taking sign(0) = +1

• hypothesis set: h_a(x) = sign(x − a) for a ∈ [−1, 1]
  (infinitely many choices of a)

• algorithm: g = h_{a*} with a* = min_{y_n = +1} x_n, so E_in(g) = 0,
  assuming at least one y_n = +1

• for ε < 0.5, E_out(g) > ε if every y_n = +1 example satisfies x_n > 2ε, i.e. no sample lands in (0, 2ε]
$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \;\le\; \Big(\tfrac{2 - 2\epsilon}{2}\Big)^{N} = (1 - \epsilon)^{N} $$

BAD data can happen rarely, even for infinitely many hypotheses
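To make this concrete, here is a small Monte Carlo sketch of exactly this setup (uniform x on [−1, +1], f(x) = sign(x), g chosen as the smallest positive example); the function names and the choices of ε and N are illustrative assumptions, not from the slides. It estimates how often |E_in(g) − E_out(g)| > ε and compares against the (1 − ε)^N bound above.

```python
import numpy as np

def run_trial(N, eps, rng):
    """One dataset: learn g = h_{a*} and check whether |Ein - Eout| > eps."""
    x = rng.uniform(-1.0, 1.0, size=N)
    y = np.where(x >= 0, 1, -1)           # f(x) = sign(x), with sign(0) = +1
    pos = x[y == 1]
    if len(pos) == 0:                     # assumption from the slide: at least one y_n = +1
        return False
    a_star = pos.min()                    # g = h_{a*}, so E_in(g) = 0
    e_in = 0.0
    e_out = a_star / 2.0                  # mass of [0, a*) under the uniform distribution on [-1, +1]
    return abs(e_in - e_out) > eps

rng = np.random.default_rng(0)
N, eps, trials = 20, 0.1, 100_000
bad = sum(run_trial(N, eps, rng) for _ in range(trials)) / trials
print(f"estimated P[|Ein - Eout| > eps] ~ {bad:.4f}")
print(f"(1 - eps)^N bound              = {(1 - eps) ** N:.4f}")
```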
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/49
Theory of Generalization Effective Number of Lines

Where Did M Come From?


$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 2 \cdot M \cdot \exp(-2\epsilon^2 N) $$

• BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε   (note: something bad happening to hypothesis h_m)

• to give A freedom of choice: bound P[B_1 or B_2 or ... or B_M]

• worst case: all B_m non-overlapping

P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M]


(union bound)

(note: the algorithm is free to choose among many hypotheses, so the probability that "something bad happens" is split by the union bound into a sum of the individual bad-event probabilities; if M is infinite, this bound becomes meaningless)

where did the uniform bound fail to consider for M = ∞?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/49


Theory of Generalization Effective Number of Lines

Where Did Uniform Bound Fail?


union bound: P[B_1] + P[B_2] + ... + P[B_M]

• BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε
  (figure: Venn diagram of overlapping events B_1, B_2, B_3)
  (note: BAD data of different hypotheses need not overlap, but for similar hypotheses it overlaps heavily)

overlapping for similar hypotheses h_1 ≈ h_2
(e.g. if a_1 ≈ a_2 in the previous example)
• why? ① E_out(h_1) ≈ E_out(h_2)
       ② for most D, E_in(h_1) = E_in(h_2)
• union bound over-estimating
  (note: the union bound adds the areas separately, but in reality the events can overlap, so the upper bound over-estimates and cannot handle the M = ∞ case; we need to account for the overlapping parts)

to account for overlap,
can we group similar hypotheses by kind?
(note: can the infinitely many h be grouped into finitely many categories of similar-looking hypotheses?)
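As a rough illustration of this overlap (a sketch with assumed values, not from the slides), the snippet below reuses the 1D positive-ray setup with two nearby thresholds a_1 ≈ a_2 and compares P[B_1 or B_2] with P[B_1] + P[B_2] by simulation; the BAD events of the two similar hypotheses almost coincide, so the union-bound sum roughly double-counts.

```python
import numpy as np

def bad_event(x, y, a, eps):
    """Is |Ein(h_a) - Eout(h_a)| > eps on this dataset? (positive ray h_a(x) = sign(x - a))"""
    pred = np.where(x - a >= 0, 1, -1)
    e_in = np.mean(pred != y)
    e_out = abs(a) / 2.0           # for f = sign and uniform x on [-1, 1], Eout(h_a) = |a| / 2
    return abs(e_in - e_out) > eps

rng = np.random.default_rng(1)
N, eps, trials = 20, 0.05, 50_000
a1, a2 = 0.30, 0.31                # two similar hypotheses (assumed values)
b1 = b2 = b_union = 0
for _ in range(trials):
    x = rng.uniform(-1.0, 1.0, size=N)
    y = np.where(x >= 0, 1, -1)
    e1, e2 = bad_event(x, y, a1, eps), bad_event(x, y, a2, eps)
    b1 += e1; b2 += e2; b_union += (e1 or e2)
print(f"P[B1] + P[B2] ~ {(b1 + b2) / trials:.4f}   (union bound)")
print(f"P[B1 or B2]  ~ {b_union / trials:.4f}   (actual, smaller due to overlap)")
```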

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/49


Theory of Generalization Effective Number of Lines

How Many Lines Are There? (1/2)


H = { all lines in R² }
(note: lines in the plane are exactly 2D perceptrons)

• how many lines? ∞


• how many kinds of lines if viewed from one input vector x1 ?
(note: with only one data point x_1, viewed from that point there are only 2 kinds of lines)

(figure: a single input x_1 with lines on either side)

2 kinds: h_1-like(x_1) = ○ or h_2-like(x_1) = ×

Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/49




Theory of Generalization Effective Number of Lines

How Many Lines Are There? (2/2)


H = { all lines in R² }

• how many kinds of lines if viewed from two inputs x1 , x2 ?


(note: with 2 input points there are 4 kinds of lines)

(figure: two inputs x_1, x_2)

4 kinds: ○○, ○×, ×○, ××

one input: 2; two inputs: 4; three inputs?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/49




Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Three Inputs? (1/2)


H = { all lines in R² }

(note: three inputs arranged as a triangle)

for three inputs x_1, x_2, x_3

(figure: three non-collinear inputs x_1, x_2, x_3; 8 kinds of lines, one per dichotomy)

always 8 for three inputs?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/49
Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Three Inputs? (2/2)


H = { all lines in R² }

(note: if the three points lie on the same line)

for another three inputs x_1, x_2, x_3 (collinear)

(figure: three collinear inputs; the dichotomies ○×○ and ×○× cannot be produced by any line)

6: 'fewer than 8' when degenerate
(e.g. collinear or same inputs)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/49


Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Four Inputs?


H = { all lines in R² }

(note: with 4 input points)

for four inputs x_1, x_2, x_3, x_4

(figure: four inputs; the two 'diagonal' dichotomies, e.g. ○×○× across the diagonal, cannot be produced by a line)

14: for any four inputs, at most 14

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/49


Theory of Generalization Effective Number of Lines

Effective Number of Lines


maximum kinds of lines with respect to N inputs x_1, x_2, ..., x_N
⟺ effective number of lines

• must be ≤ 2^N (why? each input can only be ○ or ×, 2 possibilities)
• finite 'grouping' of infinitely-many lines ∈ H
• wish:

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 2 \cdot \text{effective}(N) \cdot \exp(-2\epsilon^2 N) $$

  lines in 2D:
  N   effective(N)
  1   2
  2   4
  3   8
  4   14 < 2^N

if ① effective(N) can replace M and
   ② effective(N) ≪ 2^N
   (note: then, when N is large enough, the probability of BAD data approaches 0, i.e. most data is good)
⟹ learning possible with infinite lines :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/49


Theory of Generalization Effective Number of Lines

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/49


Theory of Generalization Effective Number of Hypotheses

Dichotomies: Mini-hypotheses
(note: an H beyond just lines)

H = { hypothesis h: X → {×, ○} }

• call h(x_1, x_2, ..., x_N) = (h(x_1), h(x_2), ..., h(x_N)) ∈ {×, ○}^N
  a dichotomy: a hypothesis 'limited' to the eyes of x_1, x_2, ..., x_N
  (note: how many ways can the hypothesis set split the points in hand into two piles, a ○ pile and a × pile?)

• H(x_1, x_2, ..., x_N):
  all dichotomies 'implemented' by H on x_1, x_2, ..., x_N
  (note: this set considers only the distinct dichotomies on these N particular points, not all the original hypotheses)
           hypotheses H          dichotomies H(x_1, x_2, ..., x_N)
  e.g.     all lines in R²       { ○○···○, ○○···×, ○···××, ... }
  size     possibly infinite     upper bounded by 2^N

|H(x1 , x2 , . . . , xN )|: candidate for replacing M

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/49


Theory of Generalization Effective Number of Hypotheses

Growth Function
(note: before using the size of the dichotomy set to replace M: the dichotomy set is determined by the previously chosen inputs, which is troublesome for theoretical analysis; to remove the dependence on x, take the maximum of |H(x_1, ..., x_N)| over all possible choices of the inputs)

• |H(x_1, x_2, ..., x_N)|: depends on the inputs (x_1, x_2, ..., x_N)
• growth function: remove the dependence by taking the max over all possible (x_1, x_2, ..., x_N)

$$ m_H(N) = \max_{x_1, x_2, \ldots, x_N \in \mathcal{X}} |H(x_1, x_2, \ldots, x_N)| $$

  lines in 2D:
  N   m_H(N)
  1   2
  2   4
  3   max(..., 6, 8) = 8
  4   14 < 2^N

• finite, upper-bounded by 2^N

how to ‘calculate’ the growth function?
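One way to get a feel for m_H(N) before calculating it analytically is brute force. The sketch below (an illustration added here, not part of the lecture) counts the distinct dichotomies that 2D perceptrons sign(w_0 + w_1 x + w_2 y) produce on a fixed point set by sampling many random weight vectors; sampling only lower-bounds the true count, but for small N on points in general position it typically recovers 2, 4, 8, 14.

```python
import numpy as np

def count_line_dichotomies(points, num_w=200_000, seed=0):
    """Lower-bound |H(x_1,...,x_N)| for 2D perceptrons by random sampling of w."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(points), 1)), np.asarray(points)])  # add the constant coordinate x_0 = 1
    seen = set()
    for _ in range(num_w):
        w = rng.normal(size=3)
        seen.add(tuple(np.where(X @ w >= 0, 1, -1)))
    return len(seen)                     # sampling may miss rare dichotomies, so this is a lower bound

rng = np.random.default_rng(42)
for N in range(1, 5):
    pts = rng.normal(size=(N, 2))        # N points in general position (with probability 1)
    print(N, count_line_dichotomies(pts))   # typically prints 2, 4, 8, 14
```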

Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Positive Rays
(note: perceptrons are harder, so first look at a simpler case)

(figure: 1D axis with threshold a; h(x) = −1 to its left, h(x) = +1 to its right, over inputs x_1, x_2, x_3, ..., x_N)

• X = R (one dimensional)
• H contains h, where each h(x) = sign(x − a) for threshold a
• 'positive half' of 1D perceptrons
  (note: one threshold value; −1 below it, +1 above it)

(note: how many dichotomies can this case produce?)
one dichotomy for a in each spot (x_n, x_{n+1}):

$$ m_H(N) = N + 1 $$

(figure: dichotomy table on x_1, x_2, x_3, x_4 — ○○○○, ×○○○, ××○○, ×××○, ××××)

(N + 1) ≪ 2^N when N large!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Positive Intervals


(figure: 1D axis; h(x) = −1, then h(x) = +1 on an interval, then −1 again, over inputs x_1, x_2, x_3, ..., x_N)

• X = R (one dimensional)
• H contains h, where each h(x) = +1 iff x ∈ [ℓ, r), −1 otherwise

one dichotomy for each 'interval kind':

$$ m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1 $$

(the interval ends fall in 2 of the N + 1 spots; the extra +1 counts the all-× dichotomy)

(figure: dichotomy table on x_1, x_2, x_3, x_4)

½N² + ½N + 1 ≪ 2^N when N large!
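Both growth functions above are easy to verify numerically. The sketch below (illustrative code added here; the point positions are arbitrary) enumerates every threshold or interval placement between sorted inputs and counts distinct dichotomies, reproducing N + 1 and ½N² + ½N + 1.

```python
import numpy as np
from itertools import combinations

def ray_dichotomies(x):
    """Dichotomies of positive rays h_a(x) = sign(x - a) on sorted inputs x."""
    cuts = [x[0] - 1] + [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)] + [x[-1] + 1]
    return {tuple(np.where(x >= a, 1, -1)) for a in cuts}

def interval_dichotomies(x):
    """Dichotomies of positive intervals: h(x) = +1 iff x in [l, r)."""
    spots = [x[0] - 1] + [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)] + [x[-1] + 1]
    dich = {tuple([-1] * len(x))}                      # the all-minus dichotomy
    for l, r in combinations(spots, 2):                # interval ends in 2 of the N + 1 spots
        dich.add(tuple(np.where((x >= l) & (x < r), 1, -1)))
    return dich

for N in range(1, 8):
    x = np.sort(np.random.default_rng(N).uniform(size=N))
    print(N, len(ray_dichotomies(x)), N + 1,
          len(interval_dichotomies(x)), N * (N + 1) // 2 + 1)
```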
Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/49
Theory of Generalization Effective Number of Hypotheses

Growth Function for Convex Sets (1/2)


(note: each hypothesis h corresponds to a convex set)

(figure: a convex region in blue vs. a non-convex region)

• X = R² (two dimensional)
• H contains h, where h(x) = +1 iff x in a convex region, −1 otherwise

what is m_H(N)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Convex Sets (2/2)


(note: an extreme possibility: place all the inputs on a circle; then no matter which dichotomy is wanted, connecting the positive points with a convex polygon and extending it slightly outward realizes it)

• one possible set of N inputs: x_1, x_2, ..., x_N on a big circle

(figure: points on a circle labelled + and −, with a convex region drawn through the + points)

• every dichotomy can be implemented by H using a convex region slightly extended from the contour of the positive inputs
  (note: every dichotomy, whichever it is, can be made with a convex polygon, i.e. a convex set)

$$ m_H(N) = 2^N $$

• call those N inputs 'shattered' by H
  (note: |H(x_1, x_2, ..., x_N)| = 2^N; on these N points all 2^N dichotomies appear)

m_H(N) = 2^N ⟺ there exist N inputs that can be shattered

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/49


Theory of Generalization Break Point

The Four Growth Functions


• positive rays: m_H(N) = N + 1
• positive intervals: m_H(N) = ½N² + ½N + 1
• convex sets: m_H(N) = 2^N
• 2D perceptrons: m_H(N) < 2^N in some cases

what if m_H(N) replaces M?

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \overset{?}{\le} 2 \cdot m_H(N) \cdot \exp(-2\epsilon^2 N) $$

polynomial: good; exponential: bad

(note: if m_H(N) is polynomial, the exponential decay wins, the upper bound shrinks, and the probability of BAD data approaches 0; but if m_H(N) is exponential, as for convex sets, the exponential growth in front fights the exponential decay, and even a large N cannot guarantee that E_in and E_out are close)

for 2D or general perceptrons,
is m_H(N) polynomial?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/49


Theory of Generalization Break Point

Break Point of H
what do we know about 2D perceptrons now?
three inputs: ‘exists’ shatter;
four inputs, ‘for all’ no shatter

if no k inputs can be shattered by H,


call k a break point for H
• m_H(k) < 2^k
• k + 1, k + 2, k + 3, . . . also break points!
• will study minimum break point k

2D perceptrons: minimum break point at 4


(note: 3 points can produce all dichotomies, but 4 points cannot; what matters is the first number of inputs at which not every dichotomy can be produced — for 2D perceptrons that is 4, so 4 is the break point)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/49


Theory of Generalization Break Point

The Four Minimum Break Points


• positive rays: m_H(N) = N + 1 = O(N)
  minimum break point at 2
• positive intervals: m_H(N) = ½N² + ½N + 1 = O(N²)
  minimum break point at 3
• convex sets: m_H(N) = 2^N
  no break point (always 2^N)
• 2D perceptrons: m_H(N) < 2^N in some cases
  minimum break point at 4

theorem from combinatorics
(not going to prove in class):
• no break point: m_H(N) = 2^N (sure!)
• minimum break point k: m_H(N) = O(N^{k−1})
  (note: if a break point k exists, one can show the growth function grows only polynomially, at rate O(N^{k−1}))
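To see numerically how strong this theorem is, here is a small check added for illustration (not from the lecture). It uses the standard combinatorial bound behind the theorem, m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i) when k is a break point (this explicit form is an assumption here, since the slide only states the O(N^{k−1}) consequence), and compares it with 2^N for the 2D-perceptron break point k = 4.

```python
from math import comb

def bounding_function(N, k):
    """Upper bound on m_H(N) when k is a break point: sum_{i<k} C(N, i), which is O(N^{k-1})."""
    return sum(comb(N, i) for i in range(k))

k = 4                                   # minimum break point of 2D perceptrons
for N in [4, 10, 20, 50, 100]:
    b = bounding_function(N, k)
    print(f"N={N:3d}  bound={b:10d}  N^(k-1)={N ** (k - 1):10d}  2^N={2 ** N}")
```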

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/49


Theory of Generalization Break Point

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/49


Theory of Generalization Definition of VC Dimension

BAD Bound for General H


want:

$$ P\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2 \cdot m_H(N) \cdot \exp(-2\epsilon^2 N) $$

actually, when N large enough,

$$ P\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2 \cdot 2\, m_H(2N) \cdot \exp\!\big(-2 \cdot \tfrac{1}{16} \epsilon^2 N\big) $$

called Vapnik-Chervonenkis (VC) Bound

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/49


Theory of Generalization Definition of VC Dimension
Interpretation of Vapnik-Chervonenkis (VC) Bound
For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, k ≥ 3:

(note: the choice made by the algorithm is governed by this bound, assuming the dataset is large enough; the probability that g's E_in (training) and E_out (testing) are far apart turns out to be small)

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] $$
$$ \le P_D\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \quad \text{(BAD case)} $$
$$ \le 4\, m_H(2N)\, \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \le 4\, (2N)^{k-1} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) \quad \text{if } k \text{ exists} $$

(note: the conditions that make learning possible; the growth function must show a 'ray of hope' somewhere, and the data must be large enough)

if ① m_H(N) breaks at k                      (good H)
   ② N large enough                          (good D)
⟹ probably generalized: 'E_out ≈ E_in', and
if ③ A picks a g with small E_in             (good A)
⟹ probably learned!                          (:-) good luck)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/49
Theory of Generalization Definition of VC Dimension
VC Dimension
(note: if the 'ray of hope', i.e. the break, appears at k points, then the maximum non-break point is k − 1, which is exactly the VC dimension)

the formal name of the maximum non-break point: d_VC
= (minimum break point k) − 1

Definition
(a property of the hypothesis set)

The VC dimension of H, denoted d_VC(H), is the
largest N for which m_H(N) = 2^N
(the most inputs that H can shatter)
= (minimum break point k) − 1

N ≤ d_VC ⟹ H can shatter some N inputs   (not a break point)
k > d_VC ⟹ k is a break point for H

if N ≥ 2, d_VC ≥ 2:  m_H(N) ≤ N^{d_VC}   (a bound on the growth function)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/49


Theory of Generalization Definition of VC Dimension

The Four VC Dimensions


• positive rays: m_H(N) = N + 1
  d_VC = 1   (note: any 1 input can be shattered; 2 cannot)
• positive intervals: m_H(N) = ½N² + ½N + 1
  d_VC = 2
• convex sets: m_H(N) = 2^N
  d_VC = ∞   (note: no matter how many points, when they lie on a circle they can be shattered by convex sets)
• 2D perceptrons: m_H(N) ≤ N³ for N ≥ 2
  d_VC = 3

good: finite d_VC

Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/49


Theory of Generalization Definition of VC Dimension

VC Dimension and Learning


finite d_VC ⟹ g 'will' generalize (E_out(g) ≈ E_in(g))

① regardless of learning algorithm A
② regardless of input distribution P
③ regardless of target function f

(learning flow diagram: an unknown target function f: X → Y (ideal credit approval formula) and an unknown distribution P on X generate the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); the learning algorithm A, searching the hypothesis set H (set of candidate formulas), outputs the final hypothesis g ≈ f ('learned' formula to be used); the VC bound gives a 'worst case' guarantee on generalization)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/49
Theory of Generalization Definition of VC Dimension

From Noiseless VC to Noisy VC

real-world learning problems are often noisy

  age                 23 years
  gender              female
  annual salary       NTD 1,000,000
  year in residence   1 year
  year in job         0.5 year
  current debt        200,000
  credit?             {no(−1), yes(+1)}

but more!
• noise in x (covered by P(x)): inaccurate customer information?
• noise in y (covered by P(y|x)): good customer, 'mislabeled' as bad?

does VC bound work under noise?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/49


Theory of Generalization Definition of VC Dimension

Probabilistic Marbles
one key of VC bound: marbles!

(figure: sampling marbles from a bin)

'deterministic' marbles               'probabilistic' (noisy) marbles
• marble x ~ P(x)                     • marble x ~ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧   • probabilistic color ⟦y ≠ h(x)⟧ with y ~ P(y|x)

same nature: can estimate P[orange] if sampled i.i.d.

VC holds for x ~ P(x) i.i.d. and y ~ P(y|x) i.i.d., i.e. (x, y) ~ P(x, y) i.i.d.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/49


Theory of Generalization Definition of VC Dimension

The New Learning Flow


(learning flow diagram, updated: an unknown target distribution P(y|x) containing f(x) + noise, together with an unknown P on X (ideal credit approval formula), generates the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); the learning algorithm A, using hypothesis set H (set of candidate formulas), outputs the final hypothesis g ≈ f ('learned' formula to be used))

VC still works under noise

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/49


Theory of Generalization Definition of VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/49


Theory of Generalization VC Dimension of Perceptrons

2D PLA Revisited
linearly separable D with x_n ~ P and y_n = f(x_n)

• PLA can converge (T large): some line separates the data well, so PLA guarantees E_in(g) = 0
• by d_VC = 3 (N large): P[|E_in(g) − E_out(g)| > ε] ≤ ..., so when N is large enough, E_out(g) ≈ E_in(g)

⟹ E_out(g) ≈ 0 :-)

general PLA for x with more than 2 features?
(note: what happens with higher-dimensional data?)
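As a concrete companion to this slide, here is a minimal PLA sketch added for illustration (synthetic data, arbitrary dimension d; the data generation and stopping rule are assumptions, not from the lecture). On linearly separable data it stops with E_in(g) = 0, and the VC argument above (with d_VC = d + 1) is what lets us expect E_out(g) ≈ E_in(g) for large N.

```python
import numpy as np

def pla(X, y, max_updates=100_000):
    """Perceptron learning algorithm with x_0 = 1 prepended; returns w with Ein = 0 if it converges."""
    Z = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Z.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(Z @ w) != y)[0]     # sign(0) counts as a mistake here
        if len(mistakes) == 0:
            return w
        n = mistakes[0]
        w = w + y[n] * Z[n]                             # correct the first mistake
    return w

rng = np.random.default_rng(0)
d, N = 5, 1000
w_true = rng.normal(size=d + 1)
X = rng.normal(size=(N, d))
y = np.sign(np.hstack([np.ones((N, 1)), X]) @ w_true)   # linearly separable labels
w = pla(X, y)
e_in = np.mean(np.sign(np.hstack([np.ones((N, 1)), X]) @ w) != y)
print("Ein(g) =", e_in)                                 # expected 0 once PLA converges
```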

Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/49


Theory of Generalization VC Dimension of Perceptrons

VC Dimension of Perceptrons
• 1D perceptron (pos/neg rays): d_VC = 2
• 2D perceptrons: d_VC = 3
  • d_VC ≥ 3: some set of 3 inputs can be shattered
  • d_VC ≤ 3: no set of 4 inputs can be shattered
• d-D perceptrons: d_VC = d + 1 ?

two steps:
• d_VC ≥ d + 1
• d_VC ≤ d + 1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/49


Theory of Generalization VC Dimension of Perceptrons

Extra Fun Time


What statement below shows that d_VC ≥ d + 1?
1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 1
d_VC is the maximum N for which m_H(N) = 2^N, and
m_H(N) is the most number of dichotomies on N
inputs. So if we can find 2^{d+1} dichotomies on
some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and
hence d_VC ≥ d + 1.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/49


Theory of Generalization VC Dimension of Perceptrons

d_VC ≥ d + 1
There are some d + 1 inputs we can shatter.

• some 'trivial' inputs:

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ -\,\mathbf{x}_3^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & & & \ddots & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} $$

• visually in 2D: (figure: the 'trivial' inputs plotted)

note: X invertible!  (the inverse matrix exists and is unique)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/49


Theory of Generalization VC Dimension of Perceptrons

Can We Shatter X?

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} \quad \text{invertible} $$

to shatter ...
for any y = (y_1, ..., y_{d+1})^T, find w such that

$$ \text{sign}(X\mathbf{w}) = \mathbf{y} \;\Longleftarrow\; X\mathbf{w} = \mathbf{y} \;\Longleftrightarrow\; \mathbf{w} = X^{-1}\mathbf{y} \quad \text{(X invertible!)} $$

'special' X can be shattered ⟹ d_VC ≥ d + 1
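A direct numeric check of this construction (a sketch added for illustration; d is an arbitrary small value) builds the 'trivial' X above, solves w = X⁻¹y for every one of the 2^{d+1} sign vectors y, and verifies sign(Xw) = y, i.e. these d + 1 inputs are shattered.

```python
import numpy as np
from itertools import product

d = 4
# the 'trivial' (d+1) x (d+1) inputs: constant coordinate x_0 = 1, then a shifted identity
X = np.hstack([np.ones((d + 1, 1)), np.eye(d + 1)[:, 1:]])   # rows: (1,0,...,0), (1,1,0,...), ..., (1,0,...,0,1)
assert np.linalg.matrix_rank(X) == d + 1                     # invertible

shattered = True
for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)                                # w = X^{-1} y, so Xw = y exactly
    if not np.array_equal(np.sign(X @ w), y):
        shattered = False
print("all", 2 ** (d + 1), "dichotomies realized:", shattered)   # True
```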

Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/49


Theory of Generalization VC Dimension of Perceptrons

Extra Fun Time


What statement below shows that d_VC ≤ d + 1?
1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 4
d_VC is the maximum N for which m_H(N) = 2^N, and
m_H(N) is the most number of dichotomies on N
inputs. So if we cannot find 2^{d+2} dichotomies
on any d + 2 inputs (i.e. d + 2 is a break point),
then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2.
That is, d_VC ≤ d + 1.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/49


Theory of Generalization VC Dimension of Perceptrons

d_VC ≤ d + 1 (1/2)
A 2D Special Case

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ -\,\mathbf{x}_3^T- \\ -\,\mathbf{x}_4^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$

(figure: the four inputs, with x_1 = ×, x_2 = ○, x_3 = ○, x_4 = ?)

? cannot be × (note: this kind of dichotomy cannot be produced), because x_4 = x_2 + x_3 − x_1, so

$$ \mathbf{w}^T\mathbf{x}_4 = \underbrace{\mathbf{w}^T\mathbf{x}_2}_{>0} + \underbrace{\mathbf{w}^T\mathbf{x}_3}_{>0} - \underbrace{\mathbf{w}^T\mathbf{x}_1}_{<0} > 0 $$

linear dependence restricts dichotomy
Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/49
Theory of Generalization VC Dimension of Perceptrons
d_VC ≤ d + 1 (2/2)
d-D General Case

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \\ -\,\mathbf{x}_{d+2}^T- \end{bmatrix} \quad ((d+2) \text{ rows, } (d+1) \text{ columns}) $$

more rows than columns: linear dependence (some a_i non-zero)

$$ \mathbf{x}_{d+2} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{x}_{d+1} $$

• can you generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? if so, what w?

$$ \mathbf{w}^T\mathbf{x}_{d+2} = a_1\mathbf{w}^T\mathbf{x}_1 + a_2\mathbf{w}^T\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{w}^T\mathbf{x}_{d+1} > 0 \quad \text{(contradiction!)} $$

(note: whenever there is a linear dependence, some dichotomy cannot be generated)

'general' X no-shatter ⟹ d_VC ≤ d + 1
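For the 2D special case two slides back, this brute-force sketch (added for illustration; the random-sampling budget is arbitrary) samples many weight vectors and records which dichotomies sign(Xw) can produce on the four points; the forbidden pattern (×, ○, ○, ×) never appears and at most 14 distinct dichotomies are found, consistent with d_VC ≤ d + 1 = 3.

```python
import numpy as np

X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 1., 1.]])            # x_4 = x_2 + x_3 - x_1

rng = np.random.default_rng(0)
seen = set()
for _ in range(500_000):
    w = rng.normal(size=3)
    seen.add(tuple(np.where(X @ w >= 0, 1, -1)))

print("distinct dichotomies found:", len(seen))                   # at most 14 (< 2^4 = 16)
print("forbidden (-1,+1,+1,-1) found:", (-1, 1, 1, -1) in seen)    # False
```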


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/49
Theory of Generalization VC Dimension of Perceptrons

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/49


Theory of Generalization Physical Intuition of VC Dimension

Degrees of Freedom

(figure: rows of dials with different numbers of free positions;
modified from the work of Hugues Vermeiren on http://www.texample.net)

• hypothesis parameters w = (w_0, w_1, ..., w_d):
  creates degrees of freedom
• hypothesis quantity M = |H|:
  'analog' degrees of freedom
• hypothesis 'power' d_VC = d + 1:
  effective 'binary' degrees of freedom
  (note: how many degrees of freedom are available for binary classification, i.e. how many dichotomies can be made)

d_VC(H): powerfulness of H
Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/49
Theory of Generalization Physical Intuition of VC Dimension

Two Old Friends


Positive Rays (d_VC = 1)
(figure: 1D axis with threshold a; h(x) = −1 / h(x) = +1 over x_1, x_2, x_3, ..., x_N)
free parameters: a   (so d_VC = 1)

Positive Intervals (d_VC = 2)
(figure: 1D axis; h(x) = −1 / h(x) = +1 / h(x) = −1 over x_1, x_2, x_3, ..., x_N)
free parameters: ℓ, r   (so d_VC = 2)

practical rule of thumb:
(note: roughly, how many knobs can be tuned)

d_VC ≈ #free parameters (but not always, e.g.,
mystery about deep learning models)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/49
Theory of Generalization Physical Intuition of VC Dimension

M and dVC
copied from Lecture 3 :-)
1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

small M:    1 Yes!  P[BAD] ≤ 2 · M · exp(...)   (small chance of BAD)
            2 No!   too few choices (the algorithm may not reach a good E_in)
large M:    1 No!   P[BAD] ≤ 2 · M · exp(...)   (larger chance of BAD)
            2 Yes!  many choices

small d_VC: 1 Yes!  P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)
            2 No!   too limited power (restricted degrees of freedom)
large d_VC: 1 No!   P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)
            2 Yes!  lots of power

using the right d_VC (or H) is important

Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/49


Theory of Generalization Physical Intuition of VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/49


Theory of Generalization Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

Rephrase (note: small chance of BAD ⟺ large chance of GOOD)
..., with probability ≥ 1 − δ, GOOD: |E_in(g) − E_out(g)| ≤ ε

$$ \text{set } \delta = 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \frac{\delta}{4(2N)^{d_{VC}}} = \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \ln\!\Big(\frac{4(2N)^{d_{VC}}}{\delta}\Big) = \tfrac{1}{8}\epsilon^2 N $$
$$ \sqrt{\tfrac{8}{N}\ln\!\Big(\frac{4(2N)^{d_{VC}}}{\delta}\Big)} = \epsilon $$

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/49


Theory of Generalization Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

Rephrase
..., with probability ≥ 1 − δ, GOOD!  (note: the gap is then confined within this range)

$$ \text{gen. error } |E_{in}(g) - E_{out}(g)| \le \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} $$

$$ E_{in}(g) - \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} \;\le\; E_{out}(g) \;\le\; E_{in}(g) + \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} $$

√(...): penalty for model complexity, Ω(N, H, δ)
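A quick way to get a feel for this penalty term is to evaluate it numerically. The helper below (added as an illustration; the example values of N, d_VC, δ and E_in are assumptions) computes Ω(N, H, δ) = sqrt((8/N) ln(4(2N)^{d_VC}/δ)) and prints the resulting upper bound E_in(g) + Ω.

```python
import math

def vc_penalty(N, d_vc, delta):
    """Model-complexity penalty Omega(N, H, delta) from the rephrased VC bound."""
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** d_vc / delta))

N, d_vc, delta, e_in = 10_000, 3, 0.1, 0.05       # assumed illustrative values
omega = vc_penalty(N, d_vc, delta)
print(f"Omega = {omega:.3f}")                      # the generalization penalty
print(f"with prob >= {1 - delta}: Eout <= {e_in + omega:.3f}")
```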
Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/49
Theory of Generalization Interpreting VC Dimension

THE VC Message
with a high probability,

$$ E_{out}(g) \le E_{in}(g) + \underbrace{\sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)}}_{\Omega(N, H, \delta)} $$

(figure: error versus VC dimension; the in-sample error E_in decreases with d_VC, the model-complexity term Ω increases with d_VC, and the out-of-sample error E_out is their sum, minimized at some d_VC* in the middle)

• d_VC ↑: E_in ↓ but Ω ↑
  (note: with a larger d_VC more points can be shattered, so more dichotomies are available and E_in drops)
• d_VC ↓: Ω ↓ but E_in ↑
• best d_VC* in the middle

powerful H not always good!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/49


Theory of Generalization Interpreting VC Dimension
VC Bound Rephrase: Sample Complexity
(note: sample complexity = how much data is needed)

For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

(note: want E_in and E_out to differ by at most 0.1, with BAD data happening at most 10% of the time, for a learning model with d_VC = 3)

given specs ε = 0.1, δ = 0.1, d_VC = 3, want 4(2N)^{d_VC} exp(−⅛ε²N) ≤ δ

  N         bound
  100       2.82 × 10^7     (not enough data)
  1,000     9.17 × 10^9
  10,000    1.19 × 10^8
  100,000   1.65 × 10^−38
  29,300    9.99 × 10^−2

sample complexity:
need N ≈ 10,000 · d_VC in theory

practical rule of thumb:
N ≈ 10 · d_VC often enough!
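The table above is straightforward to reproduce. This sketch (added for illustration) evaluates the bound 4(2N)^{d_VC} exp(−ε²N/8) for the listed N and then searches for the smallest N that pushes it below δ; it lands near N = 29,300, i.e. roughly 10,000 · d_VC.

```python
import math

def vc_bound(N, eps=0.1, d_vc=3):
    return 4.0 * (2.0 * N) ** d_vc * math.exp(-eps ** 2 * N / 8.0)

for N in [100, 1_000, 10_000, 100_000, 29_300]:
    print(f"N = {N:>7,d}   bound = {vc_bound(N):.3g}")

# smallest N with bound <= delta = 0.1 (simple linear scan; matches the ~29,300 in the table)
delta, N = 0.1, 1
while vc_bound(N) > delta:
    N += 1
print("sample complexity N ≈", N)
```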


Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/49
Theory of Generalization Interpreting VC Dimension

Looseness of VC Bound
(note: the factor-of-1,000 gap between 10 and 10,000 per d_VC: the VC bound is very loose)

$$ P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

theory: N ≈ 10,000 · d_VC;  practice: N ≈ 10 · d_VC

Why? (sources of looseness)
• Hoeffding for unknown E_out: works for any distribution, any target
  (note: P and f are unknown, yet Hoeffding still applies)
• m_H(N) instead of |H(x_1, ..., x_N)|: works for 'any' data
  (note: the analysis holds for any data, not just the particular dataset in hand)
• N^{d_VC} instead of m_H(N): works for 'any' H of the same d_VC
  (note: a polynomial used as an upper bound of an upper bound of an upper bound, instead of the true growth function; only d_VC of H matters, no other details)
• union bound on worst cases: works for any choice made by A

—but hardly better, and 'similarly loose for all models'

philosophical message of VC bound:
important for improving ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/49


Theory of Generalization Interpreting VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/49


Theory of Generalization Interpreting VC Dimension

Summary
1 When Can Machines Learn?

Lecture 3: Feasibility of Learning


2 Why Can Machines Learn?

Lecture 4: Theory of Generalization


Effective Number of Lines
Effective Number of Hypotheses
Break Point
Definition of VC Dimension
VC Dimension of Perceptrons
Physical Intuition of VC Dimension
Interpreting VC Dimension
• next: beyond VC theory, please :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/49
