
Machine Learning

Lecture 4: Theory of Generalization
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/49


Theory of Generalization

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 4: Theory of Generalization


Effective Number of Lines
Effective Number of Hypotheses
Break Point
Definition of VC Dimension
VC Dimension of Perceptrons
Physical Intuition of VC Dimension
Interpreting VC Dimension

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/49


Theory of Generalization Effective Number of Lines
Is M = ∞ Feasible?

• input x ∈ [−1, +1] ⊂ R, generated i.i.d. from a uniform distribution


• target f(x) = sign(x), taking sign(0) = +1

• hypothesis set: h_a(x) = sign(x − a) for a ∈ [−1, 1]
  (infinitely many choices of a)

• algorithm: g = h_{a*} with a* = min_{y_n = +1} x_n, so E_in(g) = 0,
  assuming at least one y_n = +1

• for ε < 0.5, E_out(g) > ε if every y_n = +1 example satisfies x_n > 2ε, i.e. no sample lands in (0, 2ε]
$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \;\le\; \Big(\tfrac{2 - 2\epsilon}{2}\Big)^{N} = (1 - \epsilon)^{N} $$

BAD data can happen rarely, even for infinitely many hypotheses
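To make this concrete, here is a small Monte Carlo sketch of exactly this setup (uniform x on [−1, +1], f(x) = sign(x), g chosen as the smallest positive example); the function names and the choices of ε and N are illustrative assumptions, not from the slides. It estimates how often |E_in(g) − E_out(g)| > ε and compares against the (1 − ε)^N bound above.

```python
import numpy as np

def run_trial(N, eps, rng):
    """One dataset: learn g = h_{a*} and check whether |Ein - Eout| > eps."""
    x = rng.uniform(-1.0, 1.0, size=N)
    y = np.where(x >= 0, 1, -1)           # f(x) = sign(x), with sign(0) = +1
    pos = x[y == 1]
    if len(pos) == 0:                     # assumption from the slide: at least one y_n = +1
        return False
    a_star = pos.min()                    # g = h_{a*}, so E_in(g) = 0
    e_in = 0.0
    e_out = a_star / 2.0                  # mass of [0, a*) under the uniform distribution on [-1, +1]
    return abs(e_in - e_out) > eps

rng = np.random.default_rng(0)
N, eps, trials = 20, 0.1, 100_000
bad = sum(run_trial(N, eps, rng) for _ in range(trials)) / trials
print(f"estimated P[|Ein - Eout| > eps] ~ {bad:.4f}")
print(f"(1 - eps)^N bound              = {(1 - eps) ** N:.4f}")
```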
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/49
Theory of Generalization Effective Number of Lines

Where Did M Come From?


$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 2 \cdot M \cdot \exp(-2\epsilon^2 N) $$

• BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε   (note: something bad happening to hypothesis h_m)

• to give A freedom of choice: bound P[B_1 or B_2 or ... or B_M]

• worst case: all B_m non-overlapping

P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M]


(union bound)

(note: the algorithm is free to choose among many hypotheses, so the probability that "something bad happens" is split by the union bound into a sum of the individual bad-event probabilities; if M is infinite, this bound becomes meaningless)

where did the uniform bound fail to consider for M = ∞?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/49


Theory of Generalization Effective Number of Lines

Where Did Uniform Bound Fail?


union bound: P[B_1] + P[B_2] + ... + P[B_M]

• BAD events B_m: |E_in(h_m) − E_out(h_m)| > ε
  (figure: Venn diagram of overlapping events B_1, B_2, B_3)
  (note: BAD data of different hypotheses need not overlap, but for similar hypotheses it overlaps heavily)

overlapping for similar hypotheses h_1 ≈ h_2
(e.g. if a_1 ≈ a_2 in the previous example)
• why? ① E_out(h_1) ≈ E_out(h_2)
       ② for most D, E_in(h_1) = E_in(h_2)
• union bound over-estimating
  (note: the union bound adds the areas separately, but in reality the events can overlap, so the upper bound over-estimates and cannot handle the M = ∞ case; we need to account for the overlapping parts)

to account for overlap,
can we group similar hypotheses by kind?
(note: can the infinitely many h be grouped into finitely many categories of similar-looking hypotheses?)
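As a rough illustration of this overlap (a sketch with assumed values, not from the slides), the snippet below reuses the 1D positive-ray setup with two nearby thresholds a_1 ≈ a_2 and compares P[B_1 or B_2] with P[B_1] + P[B_2] by simulation; the BAD events of the two similar hypotheses almost coincide, so the union-bound sum roughly double-counts.

```python
import numpy as np

def bad_event(x, y, a, eps):
    """Is |Ein(h_a) - Eout(h_a)| > eps on this dataset? (positive ray h_a(x) = sign(x - a))"""
    pred = np.where(x - a >= 0, 1, -1)
    e_in = np.mean(pred != y)
    e_out = abs(a) / 2.0           # for f = sign and uniform x on [-1, 1], Eout(h_a) = |a| / 2
    return abs(e_in - e_out) > eps

rng = np.random.default_rng(1)
N, eps, trials = 20, 0.05, 50_000
a1, a2 = 0.30, 0.31                # two similar hypotheses (assumed values)
b1 = b2 = b_union = 0
for _ in range(trials):
    x = rng.uniform(-1.0, 1.0, size=N)
    y = np.where(x >= 0, 1, -1)
    e1, e2 = bad_event(x, y, a1, eps), bad_event(x, y, a2, eps)
    b1 += e1; b2 += e2; b_union += (e1 or e2)
print(f"P[B1] + P[B2] ~ {(b1 + b2) / trials:.4f}   (union bound)")
print(f"P[B1 or B2]  ~ {b_union / trials:.4f}   (actual, smaller due to overlap)")
```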

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/49


Theory of Generalization Effective Number of Lines

How Many Lines Are There? (1/2)


H = { all lines in R² }
(note: lines in the plane are exactly 2D perceptrons)

• how many lines? ∞


• how many kinds of lines if viewed from one input vector x1 ?
(note: with only one data point x_1, viewed from that point there are only 2 kinds of lines)

(figure: a single input x_1 with lines on either side)

2 kinds: h_1-like(x_1) = ○ or h_2-like(x_1) = ×

Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/49




Theory of Generalization Effective Number of Lines

How Many Lines Are There? (2/2)


H = { all lines in R² }

• how many kinds of lines if viewed from two inputs x1 , x2 ?


(note: with 2 input points there are 4 kinds of lines)

(figure: two inputs x_1, x_2)

4 kinds: ○○, ○×, ×○, ××

one input: 2; two inputs: 4; three inputs?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/49




Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Three Inputs? (1/2)


H = { all lines in R² }

(note: three inputs arranged as a triangle)

for three inputs x_1, x_2, x_3

(figure: three non-collinear inputs x_1, x_2, x_3; 8 kinds of lines, one per dichotomy)

always 8 for three inputs?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/49
Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Three Inputs? (2/2)


H = { all lines in R² }

(note: if the three points lie on the same line)

for another three inputs x_1, x_2, x_3 (collinear)

(figure: three collinear inputs; the dichotomies ○×○ and ×○× cannot be produced by any line)

6: 'fewer than 8' when degenerate
(e.g. collinear or same inputs)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/49


Theory of Generalization Effective Number of Lines

How Many Kinds of Lines for Four Inputs?


H = { all lines in R² }

(note: with 4 input points)

for four inputs x_1, x_2, x_3, x_4

(figure: four inputs; the two 'diagonal' dichotomies, e.g. ○×○× across the diagonal, cannot be produced by a line)

14: for any four inputs, at most 14

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/49


Theory of Generalization Effective Number of Lines

Effective Number of Lines


maximum kinds of lines with respect to N inputs x_1, x_2, ..., x_N
⟺ effective number of lines

• must be ≤ 2^N (why? each input can only be ○ or ×, 2 possibilities)
• finite 'grouping' of infinitely-many lines ∈ H
• wish:

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 2 \cdot \text{effective}(N) \cdot \exp(-2\epsilon^2 N) $$

  lines in 2D:
  N   effective(N)
  1   2
  2   4
  3   8
  4   14 < 2^N

if ① effective(N) can replace M and
   ② effective(N) ≪ 2^N
   (note: then, when N is large enough, the probability of BAD data approaches 0, i.e. most data is good)
⟹ learning possible with infinite lines :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/49


Theory of Generalization Effective Number of Lines

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/49


Theory of Generalization Effective Number of Hypotheses

Dichotomies: Mini-hypotheses
(note: an H beyond just lines)

H = { hypothesis h: X → {×, ○} }

• call h(x_1, x_2, ..., x_N) = (h(x_1), h(x_2), ..., h(x_N)) ∈ {×, ○}^N
  a dichotomy: a hypothesis 'limited' to the eyes of x_1, x_2, ..., x_N
  (note: how many ways can the hypothesis set split the points in hand into two piles, a ○ pile and a × pile?)

• H(x_1, x_2, ..., x_N):
  all dichotomies 'implemented' by H on x_1, x_2, ..., x_N
  (note: this set considers only the distinct dichotomies on these N particular points, not all the original hypotheses)
           hypotheses H          dichotomies H(x_1, x_2, ..., x_N)
  e.g.     all lines in R²       { ○○···○, ○○···×, ○···××, ... }
  size     possibly infinite     upper bounded by 2^N

|H(x1 , x2 , . . . , xN )|: candidate for replacing M

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/49


Theory of Generalization Effective Number of Hypotheses

Growth Function
(note: before using the size of the dichotomy set to replace M: the dichotomy set is determined by the previously chosen inputs, which is troublesome for theoretical analysis; to remove the dependence on x, take the maximum of |H(x_1, ..., x_N)| over all possible choices of the inputs)

• |H(x_1, x_2, ..., x_N)|: depends on the inputs (x_1, x_2, ..., x_N)
• growth function: remove the dependence by taking the max over all possible (x_1, x_2, ..., x_N)

$$ m_H(N) = \max_{x_1, x_2, \ldots, x_N \in \mathcal{X}} |H(x_1, x_2, \ldots, x_N)| $$

  lines in 2D:
  N   m_H(N)
  1   2
  2   4
  3   max(..., 6, 8) = 8
  4   14 < 2^N

• finite, upper-bounded by 2^N

how to ‘calculate’ the growth function?
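One way to get a feel for m_H(N) before calculating it analytically is brute force. The sketch below (an illustration added here, not part of the lecture) counts the distinct dichotomies that 2D perceptrons sign(w_0 + w_1 x + w_2 y) produce on a fixed point set by sampling many random weight vectors; sampling only lower-bounds the true count, but for small N on points in general position it typically recovers 2, 4, 8, 14.

```python
import numpy as np

def count_line_dichotomies(points, num_w=200_000, seed=0):
    """Lower-bound |H(x_1,...,x_N)| for 2D perceptrons by random sampling of w."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(points), 1)), np.asarray(points)])  # add the constant coordinate x_0 = 1
    seen = set()
    for _ in range(num_w):
        w = rng.normal(size=3)
        seen.add(tuple(np.where(X @ w >= 0, 1, -1)))
    return len(seen)                     # sampling may miss rare dichotomies, so this is a lower bound

rng = np.random.default_rng(42)
for N in range(1, 5):
    pts = rng.normal(size=(N, 2))        # N points in general position (with probability 1)
    print(N, count_line_dichotomies(pts))   # typically prints 2, 4, 8, 14
```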

Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Positive Rays
(note: perceptrons are harder, so first look at a simpler case)

(figure: 1D axis with threshold a; h(x) = −1 to its left, h(x) = +1 to its right, over inputs x_1, x_2, x_3, ..., x_N)

• X = R (one dimensional)
• H contains h, where each h(x) = sign(x − a) for threshold a
• 'positive half' of 1D perceptrons
  (note: one threshold value; −1 below it, +1 above it)

(note: how many dichotomies can this case produce?)
one dichotomy for a in each spot (x_n, x_{n+1}):

$$ m_H(N) = N + 1 $$

(figure: dichotomy table on x_1, x_2, x_3, x_4 — ○○○○, ×○○○, ××○○, ×××○, ××××)

(N + 1) ≪ 2^N when N large!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Positive Intervals


(figure: 1D axis; h(x) = −1, then h(x) = +1 on an interval, then −1 again, over inputs x_1, x_2, x_3, ..., x_N)

• X = R (one dimensional)
• H contains h, where each h(x) = +1 iff x ∈ [ℓ, r), −1 otherwise

one dichotomy for each 'interval kind':

$$ m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1 $$

(the interval ends fall in 2 of the N + 1 spots; the extra +1 counts the all-× dichotomy)

(figure: dichotomy table on x_1, x_2, x_3, x_4)

½N² + ½N + 1 ≪ 2^N when N large!
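Both growth functions above are easy to verify numerically. The sketch below (illustrative code added here; the point positions are arbitrary) enumerates every threshold or interval placement between sorted inputs and counts distinct dichotomies, reproducing N + 1 and ½N² + ½N + 1.

```python
import numpy as np
from itertools import combinations

def ray_dichotomies(x):
    """Dichotomies of positive rays h_a(x) = sign(x - a) on sorted inputs x."""
    cuts = [x[0] - 1] + [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)] + [x[-1] + 1]
    return {tuple(np.where(x >= a, 1, -1)) for a in cuts}

def interval_dichotomies(x):
    """Dichotomies of positive intervals: h(x) = +1 iff x in [l, r)."""
    spots = [x[0] - 1] + [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)] + [x[-1] + 1]
    dich = {tuple([-1] * len(x))}                      # the all-minus dichotomy
    for l, r in combinations(spots, 2):                # interval ends in 2 of the N + 1 spots
        dich.add(tuple(np.where((x >= l) & (x < r), 1, -1)))
    return dich

for N in range(1, 8):
    x = np.sort(np.random.default_rng(N).uniform(size=N))
    print(N, len(ray_dichotomies(x)), N + 1,
          len(interval_dichotomies(x)), N * (N + 1) // 2 + 1)
```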
Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/49
Theory of Generalization Effective Number of Hypotheses

Growth Function for Convex Sets (1/2)


(note: each hypothesis h corresponds to a convex set)

(figure: a convex region in blue vs. a non-convex region)

• X = R² (two dimensional)
• H contains h, where h(x) = +1 iff x in a convex region, −1 otherwise

what is m_H(N)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/49


Theory of Generalization Effective Number of Hypotheses

Growth Function for Convex Sets (2/2)


(note: an extreme possibility: place all the inputs on a circle; then no matter which dichotomy is wanted, connecting the positive points with a convex polygon and extending it slightly outward realizes it)

• one possible set of N inputs: x_1, x_2, ..., x_N on a big circle

(figure: points on a circle labelled + and −, with a convex region drawn through the + points)

• every dichotomy can be implemented by H using a convex region slightly extended from the contour of the positive inputs
  (note: every dichotomy, whichever it is, can be made with a convex polygon, i.e. a convex set)

$$ m_H(N) = 2^N $$

• call those N inputs 'shattered' by H
  (note: |H(x_1, x_2, ..., x_N)| = 2^N; on these N points all 2^N dichotomies appear)

m_H(N) = 2^N ⟺ there exist N inputs that can be shattered

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/49


Theory of Generalization Break Point

The Four Growth Functions


• positive rays: m_H(N) = N + 1
• positive intervals: m_H(N) = ½N² + ½N + 1
• convex sets: m_H(N) = 2^N
• 2D perceptrons: m_H(N) < 2^N in some cases

what if m_H(N) replaces M?

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \overset{?}{\le} 2 \cdot m_H(N) \cdot \exp(-2\epsilon^2 N) $$

polynomial: good; exponential: bad

(note: if m_H(N) is polynomial, the exponential decay wins, the upper bound shrinks, and the probability of BAD data approaches 0; but if m_H(N) is exponential, as for convex sets, the exponential growth in front fights the exponential decay, and even a large N cannot guarantee that E_in and E_out are close)

for 2D or general perceptrons,
is m_H(N) polynomial?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/49


Theory of Generalization Break Point

Break Point of H
what do we know about 2D perceptrons now?
three inputs: ‘exists’ shatter;
four inputs, ‘for all’ no shatter

if no k inputs can be shattered by H,


call k a break point for H
• m_H(k) < 2^k
• k + 1, k + 2, k + 3, . . . also break points!
• will study minimum break point k

2D perceptrons: minimum break point at 4


(note: 3 points can produce all dichotomies, but 4 points cannot; what matters is the first number of inputs at which not every dichotomy can be produced — for 2D perceptrons that is 4, so 4 is the break point)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/49


Theory of Generalization Break Point

The Four Minimum Break Points


• positive rays: m_H(N) = N + 1 = O(N)
  minimum break point at 2
• positive intervals: m_H(N) = ½N² + ½N + 1 = O(N²)
  minimum break point at 3
• convex sets: m_H(N) = 2^N
  no break point (always 2^N)
• 2D perceptrons: m_H(N) < 2^N in some cases
  minimum break point at 4

theorem from combinatorics
(not going to prove in class):
• no break point: m_H(N) = 2^N (sure!)
• minimum break point k: m_H(N) = O(N^{k−1})
  (note: if a break point k exists, one can show the growth function grows only polynomially, at rate O(N^{k−1}))
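To see numerically how strong this theorem is, here is a small check added for illustration (not from the lecture). It uses the standard combinatorial bound behind the theorem, m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i) when k is a break point (this explicit form is an assumption here, since the slide only states the O(N^{k−1}) consequence), and compares it with 2^N for the 2D-perceptron break point k = 4.

```python
from math import comb

def bounding_function(N, k):
    """Upper bound on m_H(N) when k is a break point: sum_{i<k} C(N, i), which is O(N^{k-1})."""
    return sum(comb(N, i) for i in range(k))

k = 4                                   # minimum break point of 2D perceptrons
for N in [4, 10, 20, 50, 100]:
    b = bounding_function(N, k)
    print(f"N={N:3d}  bound={b:10d}  N^(k-1)={N ** (k - 1):10d}  2^N={2 ** N}")
```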

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/49


Theory of Generalization Break Point

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/49


Theory of Generalization Definition of VC Dimension

BAD Bound for General H


want:

$$ P\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2 \cdot m_H(N) \cdot \exp(-2\epsilon^2 N) $$

actually, when N large enough,

$$ P\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \le 2 \cdot 2\, m_H(2N) \cdot \exp\!\big(-2 \cdot \tfrac{1}{16} \epsilon^2 N\big) $$

called Vapnik-Chervonenkis (VC) Bound

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/49


Theory of Generalization Definition of VC Dimension
Interpretation of Vapnik-Chervonenkis (VC) Bound
For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, k ≥ 3:

(note: the choice made by the algorithm is governed by this bound, assuming the dataset is large enough; the probability that g's E_in (training) and E_out (testing) are far apart turns out to be small)

$$ P\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] $$
$$ \le P_D\big[\,\exists h \in H \text{ s.t. } |E_{in}(h) - E_{out}(h)| > \epsilon\,\big] \quad \text{(BAD case)} $$
$$ \le 4\, m_H(2N)\, \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \le 4\, (2N)^{k-1} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) \quad \text{if } k \text{ exists} $$

(note: the conditions that make learning possible; the growth function must show a 'ray of hope' somewhere, and the data must be large enough)

if ① m_H(N) breaks at k                      (good H)
   ② N large enough                          (good D)
⟹ probably generalized: 'E_out ≈ E_in', and
if ③ A picks a g with small E_in             (good A)
⟹ probably learned!                          (:-) good luck)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/49
Theory of Generalization Definition of VC Dimension
VC Dimension
(note: if the 'ray of hope', i.e. the break, appears at k points, then the maximum non-break point is k − 1, which is exactly the VC dimension)

the formal name of the maximum non-break point: d_VC
= (minimum break point k) − 1

Definition
(a property of the hypothesis set)

The VC dimension of H, denoted d_VC(H), is the
largest N for which m_H(N) = 2^N
(the most inputs that H can shatter)
= (minimum break point k) − 1

N ≤ d_VC ⟹ H can shatter some N inputs   (not a break point)
k > d_VC ⟹ k is a break point for H

if N ≥ 2, d_VC ≥ 2:  m_H(N) ≤ N^{d_VC}   (a bound on the growth function)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/49


Theory of Generalization Definition of VC Dimension

The Four VC Dimensions


• positive rays: m_H(N) = N + 1
  d_VC = 1   (note: any 1 input can be shattered; 2 cannot)
• positive intervals: m_H(N) = ½N² + ½N + 1
  d_VC = 2
• convex sets: m_H(N) = 2^N
  d_VC = ∞   (note: no matter how many points, when they lie on a circle they can be shattered by convex sets)
• 2D perceptrons: m_H(N) ≤ N³ for N ≥ 2
  d_VC = 3

good: finite d_VC

Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/49


Theory of Generalization Definition of VC Dimension

VC Dimension and Learning


finite d_VC ⟹ g 'will' generalize (E_out(g) ≈ E_in(g))

① regardless of learning algorithm A
② regardless of input distribution P
③ regardless of target function f

(learning flow diagram: an unknown target function f: X → Y (ideal credit approval formula) and an unknown distribution P on X generate the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); the learning algorithm A, searching the hypothesis set H (set of candidate formulas), outputs the final hypothesis g ≈ f ('learned' formula to be used); the VC bound gives a 'worst case' guarantee on generalization)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/49
Theory of Generalization Definition of VC Dimension

From Noiseless VC to Noisy VC

real-world learning problems are often noisy

  age                 23 years
  gender              female
  annual salary       NTD 1,000,000
  year in residence   1 year
  year in job         0.5 year
  current debt        200,000
  credit?             {no(−1), yes(+1)}

but more!
• noise in x (covered by P(x)): inaccurate customer information?
• noise in y (covered by P(y|x)): good customer, 'mislabeled' as bad?

does VC bound work under noise?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/49


Theory of Generalization Definition of VC Dimension

Probabilistic Marbles
one key of VC bound: marbles!

(figure: sampling marbles from a bin)

'deterministic' marbles               'probabilistic' (noisy) marbles
• marble x ~ P(x)                     • marble x ~ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧   • probabilistic color ⟦y ≠ h(x)⟧ with y ~ P(y|x)

same nature: can estimate P[orange] if sampled i.i.d.

VC holds for x ~ P(x) i.i.d. and y ~ P(y|x) i.i.d., i.e. (x, y) ~ P(x, y) i.i.d.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/49


Theory of Generalization Definition of VC Dimension

The New Learning Flow


(learning flow diagram, updated: an unknown target distribution P(y|x) containing f(x) + noise, together with an unknown P on X (ideal credit approval formula), generates the training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank); the learning algorithm A, using hypothesis set H (set of candidate formulas), outputs the final hypothesis g ≈ f ('learned' formula to be used))

VC still works under noise

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/49


Theory of Generalization Definition of VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/49


Theory of Generalization VC Dimension of Perceptrons

2D PLA Revisited
linearly separable D with x_n ~ P and y_n = f(x_n)

• PLA can converge (T large): some line separates the data well, so PLA guarantees E_in(g) = 0
• by d_VC = 3 (N large): P[|E_in(g) − E_out(g)| > ε] ≤ ..., so when N is large enough, E_out(g) ≈ E_in(g)

⟹ E_out(g) ≈ 0 :-)

general PLA for x with more than 2 features?
(note: what happens with higher-dimensional data?)
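As a concrete companion to this slide, here is a minimal PLA sketch added for illustration (synthetic data, arbitrary dimension d; the data generation and stopping rule are assumptions, not from the lecture). On linearly separable data it stops with E_in(g) = 0, and the VC argument above (with d_VC = d + 1) is what lets us expect E_out(g) ≈ E_in(g) for large N.

```python
import numpy as np

def pla(X, y, max_updates=100_000):
    """Perceptron learning algorithm with x_0 = 1 prepended; returns w with Ein = 0 if it converges."""
    Z = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Z.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(Z @ w) != y)[0]     # sign(0) counts as a mistake here
        if len(mistakes) == 0:
            return w
        n = mistakes[0]
        w = w + y[n] * Z[n]                             # correct the first mistake
    return w

rng = np.random.default_rng(0)
d, N = 5, 1000
w_true = rng.normal(size=d + 1)
X = rng.normal(size=(N, d))
y = np.sign(np.hstack([np.ones((N, 1)), X]) @ w_true)   # linearly separable labels
w = pla(X, y)
e_in = np.mean(np.sign(np.hstack([np.ones((N, 1)), X]) @ w) != y)
print("Ein(g) =", e_in)                                 # expected 0 once PLA converges
```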

Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/49


Theory of Generalization VC Dimension of Perceptrons

VC Dimension of Perceptrons
• 1D perceptron (pos/neg rays): d_VC = 2
• 2D perceptrons: d_VC = 3
  • d_VC ≥ 3: some set of 3 inputs can be shattered
  • d_VC ≤ 3: no set of 4 inputs can be shattered
• d-D perceptrons: d_VC = d + 1 ?

two steps:
• d_VC ≥ d + 1
• d_VC ≤ d + 1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/49


Theory of Generalization VC Dimension of Perceptrons

Extra Fun Time


What statement below shows that d_VC ≥ d + 1?
1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 1
d_VC is the maximum N for which m_H(N) = 2^N, and
m_H(N) is the most number of dichotomies on N
inputs. So if we can find 2^{d+1} dichotomies on
some d + 1 inputs, then m_H(d + 1) = 2^{d+1} and
hence d_VC ≥ d + 1.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/49


Theory of Generalization VC Dimension of Perceptrons

d_VC ≥ d + 1
There are some d + 1 inputs we can shatter.

• some 'trivial' inputs:

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ -\,\mathbf{x}_3^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & 0 \\ \vdots & & & \ddots & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} $$

• visually in 2D: (figure: the 'trivial' inputs plotted)

note: X invertible!  (the inverse matrix exists and is unique)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/49


Theory of Generalization VC Dimension of Perceptrons

Can We Shatter X?

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & 0 \\ 1 & 0 & \cdots & 0 & 1 \end{bmatrix} \quad \text{invertible} $$

to shatter ...
for any y = (y_1, ..., y_{d+1})^T, find w such that

$$ \text{sign}(X\mathbf{w}) = \mathbf{y} \;\Longleftarrow\; X\mathbf{w} = \mathbf{y} \;\Longleftrightarrow\; \mathbf{w} = X^{-1}\mathbf{y} \quad \text{(X invertible!)} $$

'special' X can be shattered ⟹ d_VC ≥ d + 1
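A direct numeric check of this construction (a sketch added for illustration; d is an arbitrary small value) builds the 'trivial' X above, solves w = X⁻¹y for every one of the 2^{d+1} sign vectors y, and verifies sign(Xw) = y, i.e. these d + 1 inputs are shattered.

```python
import numpy as np
from itertools import product

d = 4
# the 'trivial' (d+1) x (d+1) inputs: constant coordinate x_0 = 1, then a shifted identity
X = np.hstack([np.ones((d + 1, 1)), np.eye(d + 1)[:, 1:]])   # rows: (1,0,...,0), (1,1,0,...), ..., (1,0,...,0,1)
assert np.linalg.matrix_rank(X) == d + 1                     # invertible

shattered = True
for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)                                # w = X^{-1} y, so Xw = y exactly
    if not np.array_equal(np.sign(X @ w), y):
        shattered = False
print("all", 2 ** (d + 1), "dichotomies realized:", shattered)   # True
```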

Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/49


Theory of Generalization VC Dimension of Perceptrons

Extra Fun Time


What statement below shows that d_VC ≤ d + 1?
1 There are some d + 1 inputs we can shatter.
2 We can shatter any set of d + 1 inputs.
3 There are some d + 2 inputs we cannot shatter.
4 We cannot shatter any set of d + 2 inputs.

Reference Answer: 4
d_VC is the maximum N for which m_H(N) = 2^N, and
m_H(N) is the most number of dichotomies on N
inputs. So if we cannot find 2^{d+2} dichotomies
on any d + 2 inputs (i.e. d + 2 is a break point),
then m_H(d + 2) < 2^{d+2} and hence d_VC < d + 2.
That is, d_VC ≤ d + 1.

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/49


Theory of Generalization VC Dimension of Perceptrons

d_VC ≤ d + 1 (1/2)
A 2D Special Case

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ -\,\mathbf{x}_3^T- \\ -\,\mathbf{x}_4^T- \end{bmatrix}
     = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$

(figure: the four inputs, with x_1 = ×, x_2 = ○, x_3 = ○, x_4 = ?)

? cannot be × (note: this kind of dichotomy cannot be produced), because x_4 = x_2 + x_3 − x_1, so

$$ \mathbf{w}^T\mathbf{x}_4 = \underbrace{\mathbf{w}^T\mathbf{x}_2}_{>0} + \underbrace{\mathbf{w}^T\mathbf{x}_3}_{>0} - \underbrace{\mathbf{w}^T\mathbf{x}_1}_{<0} > 0 $$

linear dependence restricts dichotomy
Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/49
Theory of Generalization VC Dimension of Perceptrons
d_VC ≤ d + 1 (2/2)
d-D General Case

$$ X = \begin{bmatrix} -\,\mathbf{x}_1^T- \\ -\,\mathbf{x}_2^T- \\ \vdots \\ -\,\mathbf{x}_{d+1}^T- \\ -\,\mathbf{x}_{d+2}^T- \end{bmatrix} \quad ((d+2) \text{ rows, } (d+1) \text{ columns}) $$

more rows than columns: linear dependence (some a_i non-zero)

$$ \mathbf{x}_{d+2} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{x}_{d+1} $$

• can you generate the dichotomy (sign(a_1), sign(a_2), ..., sign(a_{d+1}), ×)? if so, what w?

$$ \mathbf{w}^T\mathbf{x}_{d+2} = a_1\mathbf{w}^T\mathbf{x}_1 + a_2\mathbf{w}^T\mathbf{x}_2 + \ldots + a_{d+1}\mathbf{w}^T\mathbf{x}_{d+1} > 0 \quad \text{(contradiction!)} $$

(note: whenever there is a linear dependence, some dichotomy cannot be generated)

'general' X no-shatter ⟹ d_VC ≤ d + 1
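For the 2D special case two slides back, this brute-force sketch (added for illustration; the random-sampling budget is arbitrary) samples many weight vectors and records which dichotomies sign(Xw) can produce on the four points; the forbidden pattern (×, ○, ○, ×) never appears and at most 14 distinct dichotomies are found, consistent with d_VC ≤ d + 1 = 3.

```python
import numpy as np

X = np.array([[1., 0., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 1., 1.]])            # x_4 = x_2 + x_3 - x_1

rng = np.random.default_rng(0)
seen = set()
for _ in range(500_000):
    w = rng.normal(size=3)
    seen.add(tuple(np.where(X @ w >= 0, 1, -1)))

print("distinct dichotomies found:", len(seen))                   # at most 14 (< 2^4 = 16)
print("forbidden (-1,+1,+1,-1) found:", (-1, 1, 1, -1) in seen)    # False
```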


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/49
Theory of Generalization VC Dimension of Perceptrons

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/49


Theory of Generalization Physical Intuition of VC Dimension

Degrees of Freedom

(figure: rows of dials with different numbers of free positions;
modified from the work of Hugues Vermeiren on http://www.texample.net)

• hypothesis parameters w = (w_0, w_1, ..., w_d):
  creates degrees of freedom
• hypothesis quantity M = |H|:
  'analog' degrees of freedom
• hypothesis 'power' d_VC = d + 1:
  effective 'binary' degrees of freedom
  (note: how many degrees of freedom are available for binary classification, i.e. how many dichotomies can be made)

d_VC(H): powerfulness of H
Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/49
Theory of Generalization Physical Intuition of VC Dimension

Two Old Friends


Positive Rays (d_VC = 1)
(figure: 1D axis with threshold a; h(x) = −1 / h(x) = +1 over x_1, x_2, x_3, ..., x_N)
free parameters: a   (so d_VC = 1)

Positive Intervals (d_VC = 2)
(figure: 1D axis; h(x) = −1 / h(x) = +1 / h(x) = −1 over x_1, x_2, x_3, ..., x_N)
free parameters: ℓ, r   (so d_VC = 2)

practical rule of thumb:
(note: roughly, how many knobs can be tuned)

d_VC ≈ #free parameters (but not always, e.g.,
mystery about deep learning models)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/49
Theory of Generalization Physical Intuition of VC Dimension

M and dVC
copied from Lecture 3 :-)
1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

small M:    1 Yes!  P[BAD] ≤ 2 · M · exp(...)   (small chance of BAD)
            2 No!   too few choices (the algorithm may not reach a good E_in)
large M:    1 No!   P[BAD] ≤ 2 · M · exp(...)   (larger chance of BAD)
            2 Yes!  many choices

small d_VC: 1 Yes!  P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)
            2 No!   too limited power (restricted degrees of freedom)
large d_VC: 1 No!   P[BAD] ≤ 4 · (2N)^{d_VC} · exp(...)
            2 Yes!  lots of power

using the right d_VC (or H) is important

Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/49


Theory of Generalization Physical Intuition of VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/49


Theory of Generalization Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

Rephrase (note: small chance of BAD ⟺ large chance of GOOD)
..., with probability ≥ 1 − δ, GOOD: |E_in(g) − E_out(g)| ≤ ε

$$ \text{set } \delta = 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \frac{\delta}{4(2N)^{d_{VC}}} = \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$
$$ \ln\!\Big(\frac{4(2N)^{d_{VC}}}{\delta}\Big) = \tfrac{1}{8}\epsilon^2 N $$
$$ \sqrt{\tfrac{8}{N}\ln\!\Big(\frac{4(2N)^{d_{VC}}}{\delta}\Big)} = \epsilon $$

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/49


Theory of Generalization Interpreting VC Dimension

VC Bound Rephrase: Penalty for Model Complexity


For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

Rephrase
..., with probability ≥ 1 − δ, GOOD!  (note: the gap is then confined within this range)

$$ \text{gen. error } |E_{in}(g) - E_{out}(g)| \le \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} $$

$$ E_{in}(g) - \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} \;\le\; E_{out}(g) \;\le\; E_{in}(g) + \sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)} $$

√(...): penalty for model complexity, Ω(N, H, δ)
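A quick way to get a feel for this penalty term is to evaluate it numerically. The helper below (added as an illustration; the example values of N, d_VC, δ and E_in are assumptions) computes Ω(N, H, δ) = sqrt((8/N) ln(4(2N)^{d_VC}/δ)) and prints the resulting upper bound E_in(g) + Ω.

```python
import math

def vc_penalty(N, d_vc, delta):
    """Model-complexity penalty Omega(N, H, delta) from the rephrased VC bound."""
    return math.sqrt(8.0 / N * math.log(4.0 * (2.0 * N) ** d_vc / delta))

N, d_vc, delta, e_in = 10_000, 3, 0.1, 0.05       # assumed illustrative values
omega = vc_penalty(N, d_vc, delta)
print(f"Omega = {omega:.3f}")                      # the generalization penalty
print(f"with prob >= {1 - delta}: Eout <= {e_in + omega:.3f}")
```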
Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/49
Theory of Generalization Interpreting VC Dimension

THE VC Message
with a high probability,

$$ E_{out}(g) \le E_{in}(g) + \underbrace{\sqrt{\tfrac{8}{N}\ln\!\Big(\tfrac{4(2N)^{d_{VC}}}{\delta}\Big)}}_{\Omega(N, H, \delta)} $$

(figure: error versus VC dimension; the in-sample error E_in decreases with d_VC, the model-complexity term Ω increases with d_VC, and the out-of-sample error E_out is their sum, minimized at some d_VC* in the middle)

• d_VC ↑: E_in ↓ but Ω ↑
  (note: with a larger d_VC more points can be shattered, so more dichotomies are available and E_in drops)
• d_VC ↓: Ω ↓ but E_in ↑
• best d_VC* in the middle

powerful H not always good!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/49


Theory of Generalization Interpreting VC Dimension
VC Bound Rephrase: Sample Complexity
(note: sample complexity = how much data is needed)

For any g = A(D) ∈ H and 'statistically' large D, for N ≥ 2, d_VC ≥ 2:

$$ \underbrace{P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big]}_{\text{BAD}} \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

(note: want E_in and E_out to differ by at most 0.1, with BAD data happening at most 10% of the time, for a learning model with d_VC = 3)

given specs ε = 0.1, δ = 0.1, d_VC = 3, want 4(2N)^{d_VC} exp(−⅛ε²N) ≤ δ

  N         bound
  100       2.82 × 10^7     (not enough data)
  1,000     9.17 × 10^9
  10,000    1.19 × 10^8
  100,000   1.65 × 10^−38
  29,300    9.99 × 10^−2

sample complexity:
need N ≈ 10,000 · d_VC in theory

practical rule of thumb:
N ≈ 10 · d_VC often enough!
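The table above is straightforward to reproduce. This sketch (added for illustration) evaluates the bound 4(2N)^{d_VC} exp(−ε²N/8) for the listed N and then searches for the smallest N that pushes it below δ; it lands near N = 29,300, i.e. roughly 10,000 · d_VC.

```python
import math

def vc_bound(N, eps=0.1, d_vc=3):
    return 4.0 * (2.0 * N) ** d_vc * math.exp(-eps ** 2 * N / 8.0)

for N in [100, 1_000, 10_000, 100_000, 29_300]:
    print(f"N = {N:>7,d}   bound = {vc_bound(N):.3g}")

# smallest N with bound <= delta = 0.1 (simple linear scan; matches the ~29,300 in the table)
delta, N = 0.1, 1
while vc_bound(N) > delta:
    N += 1
print("sample complexity N ≈", N)
```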


Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/49
Theory of Generalization Interpreting VC Dimension

Looseness of VC Bound
(note: the factor-of-1,000 gap between 10 and 10,000 per d_VC: the VC bound is very loose)

$$ P_D\big[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\big] \le 4(2N)^{d_{VC}} \exp\!\big(-\tfrac{1}{8}\epsilon^2 N\big) $$

theory: N ≈ 10,000 · d_VC;  practice: N ≈ 10 · d_VC

Why? (sources of looseness)
• Hoeffding for unknown E_out: works for any distribution, any target
  (note: P and f are unknown, yet Hoeffding still applies)
• m_H(N) instead of |H(x_1, ..., x_N)|: works for 'any' data
  (note: the analysis holds for any data, not just the particular dataset in hand)
• N^{d_VC} instead of m_H(N): works for 'any' H of the same d_VC
  (note: a polynomial used as an upper bound of an upper bound of an upper bound, instead of the true growth function; only d_VC of H matters, no other details)
• union bound on worst cases: works for any choice made by A

—but hardly better, and 'similarly loose for all models'

philosophical message of VC bound:
important for improving ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/49


Theory of Generalization Interpreting VC Dimension

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/49


Theory of Generalization Interpreting VC Dimension

Summary
1 When Can Machines Learn?

Lecture 3: Feasibility of Learning


2 Why Can Machines Learn?

Lecture 4: Theory of Generalization


Effective Number of Lines
Effective Number of Hypotheses
Break Point
Definition of VC Dimension
VC Dimension of Perceptrons
Physical Intuition of VC Dimension
Interpreting VC Dimension
• next: beyond VC theory, please :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/49
