
Machine Learning

(機器學習)
Lecture 3: Feasibility of Learning
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程學系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/35


Feasibility of Learning

Roadmap
1 When Can Machines Learn?

Lecture 3: Feasibility of Learning


Learning is Impossible?
Probability to the Rescue
Connection to Learning
Connection to Real Learning
Feasibility of Learning Decomposed

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/35


Feasibility of Learning Learning is Impossible?

A Learning Puzzle
(setting: batch, supervised, binary classification)

[figure: six 3×3 black-and-white patterns, with x1, x2, x3 labeled yn = −1 and x4, x5, x6 labeled yn = +1; what should g(x) be on a new pattern?]

(note: perhaps no definite answer; it depends on what hypotheses we choose for g(x))

let’s test your ‘human learning’ with 6 examples :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/35
Feasibility of Learning Learning is Impossible?

Two Controversial Answers


whatever you say about g(x), the truth can be claimed to be the opposite:

[figure: the same six labeled patterns (x1–x3 with yn = −1, x4–x6 with yn = +1) plus the new pattern asking g(x) = ?]

truth f(x) = +1 because . . .
• symmetry ⇒ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇒ +1

truth f(x) = −1 because . . .
• left-top black ⇒ −1
• middle column contains at most 1 black and right-top white ⇒ −1

which reason is correct?

all valid reasons; your adversarial teacher can always call you ‘didn’t learn’. :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/35


Feasibility of Learning Learning is Impossible?

A Brain-Storming Problem
(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a ‘learning problem’ with N = 1, x1 = (5, 3, 2), y1 = 151022.


Learn a hypothesis from the one example to predict on x = (7, 2, 5).
What is your answer?

one answer (151026): g(x) = 151012 + x1 + x2 + x3
another answer (143547): g(x) = x1 · x2 · 10000 + x1 · x3 · 100 + (x1 · x2 + x1 · x3 − x2)

which one is the smarter answer that only the top 2% of people can think of?
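Both formulas can be checked mechanically. A minimal Python sketch (not from the slides; the function names g1 and g2 are made up) confirming that each hypothesis reproduces y1 = 151022 on x1 = (5, 3, 2) yet gives a different prediction on (7, 2, 5):

```python
# Two hypotheses that both fit the single training example (5, 3, 2) -> 151022
# but disagree on the query point (7, 2, 5).
def g1(x1, x2, x3):
    # 'simple' answer: a constant plus the sum of the inputs
    return 151012 + x1 + x2 + x3

def g2(x1, x2, x3):
    # 'clever' answer: pack pairwise products into digit groups
    return x1 * x2 * 10000 + x1 * x3 * 100 + (x1 * x2 + x1 * x3 - x2)

assert g1(5, 3, 2) == 151022 and g2(5, 3, 2) == 151022   # both consistent with D
print(g1(7, 2, 5), g2(7, 2, 5))                          # 151026 143547
```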

Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/35


Feasibility of Learning Learning is Impossible?

What is the Next Number?


1,4,1,5

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/35


Feasibility of Learning Learning is Impossible?

What is the Next Number?


1,4,1,5

1, 4, 1, 5, 0, −1, 1, 6
by yt = yt−4 − yt−2

1, 4, 1, 5, 1, 6, 1, 7
by yt = yt−2 + Jt is evenK

1, 4, 1, 5, 2, 9, 3, 14
by yt = yt−4 + yt−2

any number can be the next!
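Each rule above really does fit the four given numbers. A small sketch (not part of the slides) that extends 1, 4, 1, 5 under the three recurrences:

```python
# Extend 1, 4, 1, 5 under three rules that all fit the given numbers.
def extend(seed, rule, upto=8):
    y = list(seed)
    for t in range(len(y) + 1, upto + 1):   # t is 1-indexed, as on the slide
        y.append(rule(y, t))
    return y

seed = [1, 4, 1, 5]
rules = {
    "yt = y(t-4) - y(t-2)": lambda y, t: y[t - 5] - y[t - 3],
    "yt = y(t-2) + [t is even]": lambda y, t: y[t - 3] + (t % 2 == 0),
    "yt = y(t-4) + y(t-2)": lambda y, t: y[t - 5] + y[t - 3],
}
for name, rule in rules.items():
    print(name, extend(seed, rule))
# every rule agrees on the first four numbers but continues differently
```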

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/35


Feasibility of Learning Learning is Impossible?

A ‘Simple’ Binary Classification Problem


xn      yn = f(xn)
000     ○
001     ×
010     ×
011     ○
100     ×

(note: there are 2⁸ = 256 possible Boolean functions on X)

• X = {0, 1}³, Y = {○, ×}, can enumerate all candidate f as H

pick g ∈ H with all g(xn) = yn (like PLA),
does g ≈ f?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/35


Feasibility of Learning Learning is Impossible?

Infeasibility of Learning
 x     y   g   f1  f2  f3  f4  f5  f6  f7  f8
 000   ○   ○   ○   ○   ○   ○   ○   ○   ○   ○    (in D)
 001   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 010   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 011   ○   ○   ○   ○   ○   ○   ○   ○   ○   ○    (in D)
 100   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 101       ?   ○   ○   ○   ○   ×   ×   ×   ×
 110       ?   ○   ○   ×   ×   ○   ○   ×   ×
 111       ?   ○   ×   ○   ×   ○   ×   ○   ×

• g ≈ f inside D: sure!
• g ≈ f outside D: No! (but that’s really what we want!)

learning from D (to infer something outside D)
is doomed if any ‘unknown’ f can happen. :-(
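The counting can be checked directly. A short sketch (not from the slides; labels 'o'/'x' stand for ○/×) that enumerates every Boolean target on X = {0,1}³ consistent with the five examples: exactly 2³ = 8 remain, and together they realize every labeling of the three unseen points, so no choice of g can be guaranteed correct outside D.

```python
from itertools import product

X = list(product([0, 1], repeat=3))              # all 8 points of X = {0,1}^3
D = {(0, 0, 0): 'o', (0, 0, 1): 'x', (0, 1, 0): 'x',
     (0, 1, 1): 'o', (1, 0, 0): 'x'}             # the five training examples
unseen = [x for x in X if x not in D]

# enumerate all 2^8 = 256 Boolean functions on X, keep those consistent with D
consistent = []
for labels in product('ox', repeat=len(X)):
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in D.items()):
        consistent.append(tuple(f[x] for x in unseen))

print(len(consistent))           # 8 remaining targets
print(sorted(set(consistent)))   # they cover all 8 labelings of the 3 unseen points
```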

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/35


Feasibility of Learning Learning is Impossible?

No Free Lunch Theorem for Machine Learning


Without any assumptions on the learning problem at hand,
all learning algorithms perform the same.

(CC-BY-SA 2.0 by Gaspar Torriero on Flickr)

no algorithm is best
for all learning problems
Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/35
Feasibility of Learning Learning is Impossible?

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/35


Feasibility of Learning Probability to the Rescue

Inferring Something Unknown with Assumptions


difficult to infer unknown target f outside D in learning;
can we infer something unknown in other scenarios?

• consider a bin of many, many orange and green marbles
• do we know the orange portion
(probability)? No!

can you infer the orange probability?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/35


Feasibility of Learning Probability to the Rescue
Statistics 101: Inferring Orange Probability
[figure: a bin of orange and green marbles, and a sample of marbles drawn from it]

assume N marbles sampled independently:

bin: orange probability = μ, green probability = 1 − μ, with μ unknown
sample: orange fraction = ν, green fraction = 1 − ν, with ν now known

does in-sample ν say anything about out-of-sample μ?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/35


Feasibility of Learning Probability to the Rescue

Possible versus Probable


does in-sample ν say anything about out-of-sample μ?

No!
possibly not: sample can be mostly green while bin is mostly orange

Yes!
probably yes: in-sample ν likely close to unknown μ

formally, what does ν say about μ?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/35


Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (1/2)


[figure: bin with orange probability μ; sample of size N with orange fraction ν]

• in a big sample (N large), ν is probably close to μ (within ε):

      P[|ν − μ| > ε] ≤ 2 exp(−2ε²N)

• called Hoeffding’s Inequality, for marbles, coin, polling, . . .

the statement ‘ν = μ’ is
probably approximately correct (PAC)
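A Monte-Carlo sketch (not from the slides; μ = 0.6 and the sample sizes are arbitrary choices) comparing the empirical chance of a large deviation |ν − μ| > ε with the Hoeffding bound 2·exp(−2ε²N):

```python
import math
import random

random.seed(0)
mu, eps, trials = 0.6, 0.1, 10000     # assumed orange probability and tolerance

for N in (20, 100, 500):
    bad = 0
    for _ in range(trials):
        nu = sum(random.random() < mu for _ in range(N)) / N   # orange fraction
        bad += abs(nu - mu) > eps
    bound = 2 * math.exp(-2 * eps ** 2 * N)
    print(f"N={N:4d}  P[|nu - mu| > eps] ~ {bad / trials:.3f}  bound {bound:.3f}")
# the empirical deviation probability stays below the bound and shrinks with N
```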
Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/35
Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (2/2)
(note: related to Chebyshev-type concentration inequalities)


      P[|ν − μ| > ε] ≤ 2 exp(−2ε²N)

(note: as N grows, the right-hand side goes to 0, so the ‘bad’ event that ν and μ differ a lot becomes unlikely)

• valid for all N and ε


• does not depend on μ, no need to ‘know’ μ
• larger sample size N or looser gap ε
  ⟹ higher probability for ‘ν ≈ μ’

if large N, can probably infer unknown μ by known ν
(under i.i.d. sampling assumption)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/35


Feasibility of Learning Probability to the Rescue

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/35


Feasibility of Learning Connection to Learning
Connection to Learning
(note: what we want to know is whether our hypothesis agrees with the target function; the key is to evaluate how the predictions differ from the ideal values)

the bin ↔ learning correspondence:
• unknown orange probability μ ↔ fixed hypothesis h(x) ≟ target f(x), with f unknown
• marble • in bin ↔ point x ∈ X
• orange marble ↔ h is wrong: h(x) ≠ f(x)
• green marble ↔ h is right: h(x) = f(x)
• size-N sample of i.i.d. marbles from bin ↔ check h on dataset D = {(xn, yn)} with i.i.d. xn, where yn = f(xn)

if large N & i.i.d. xn, can probably infer
unknown Jh(x) ≠ f(x)K probability
by known Jh(xn) ≠ ynK fraction
(using the dataset at hand with Hoeffding’s inequality)
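To make the analogy concrete, here is a small simulation sketch (not from the slides; the target f, the hypothesis h, and the distribution P are invented for illustration): Ein(h) plays the role of the sample fraction ν, and Eout(h) plays the role of the bin probability μ.

```python
import random

random.seed(1)

def f(x):
    # pretend target function (unknown in a real learning problem)
    return 1 if x[0] + x[1] > 1.0 else -1

def h(x):
    # one fixed hypothesis to be checked
    return 1 if x[0] > 0.5 else -1

def draw(n):
    # i.i.d. inputs x from an (assumed) uniform distribution P on [0,1]^2
    return [(random.random(), random.random()) for _ in range(n)]

N = 200
D = [(x, f(x)) for x in draw(N)]                              # training examples
E_in = sum(h(x) != y for x, y in D) / N                       # known fraction (nu)
E_out = sum(h(x) != f(x) for x in draw(100000)) / 100000      # unknown probability (mu), approximated
print(f"E_in(h) = {E_in:.3f}   E_out(h) ~ {E_out:.3f}")
# with N large and i.i.d. xn, E_in(h) is probably close to E_out(h)
```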

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/35


Feasibility of Learning Connection to Learning
Added Components

[learning flow diagram, with the new components added]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown distribution P on X, from which x1, x2, · · · , xN are sampled (and on which h ≈ f is evaluated)
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, producing final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas), with one fixed h under consideration

for any fixed h, can probably infer

unknown Eout(h) = E_{x∼P} Jh(x) ≠ f(x)K
(note: the expectation, over all possible x drawn from P, of how often h makes an error, i.e., the error rate to be estimated)

by known Ein(h) = (1/N) Σ_{n=1}^{N} Jh(xn) ≠ ynK
(note: the error fraction on the data at hand)

(under i.i.d. sampling assumption)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/35


Feasibility of Learning Connection to Learning

The Formal Guarantee


for any fixed h, in ‘big’ data (N large),
in-sample error Ein(h) is probably close to
out-of-sample error Eout(h) (within ε):

      P[|Ein(h) − Eout(h)| > ε] ≤ 2 exp(−2ε²N)

same as the ‘bin’ analogy . . .


• valid for all N and ε
• does not depend on Eout(h), no need to ‘know’ Eout(h)
—f and P can stay unknown
• ‘Ein(h) = Eout(h)’ is probably approximately correct (PAC)

if ‘Ein(h) ≈ Eout(h)’ and ‘Ein(h) small’
⟹ Eout(h) small ⟹ h ≈ f with respect to P!!!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/35


Feasibility of Learning Connection to Learning

Verification of One h
for any fixed h, when data large enough,
Ein(h) ≈ Eout(h)
(note: but what if we only have that one h?)

Can we claim ‘good learning’ (g ≈ f)?

Yes! (this is verification/testing):
if Ein(h) small for the fixed h, and A picks that h as g
⟹ ‘g = f’ PAC

No! (A is stuck with that one h no matter what):
if A forced to pick THE h as g
⟹ Ein(h) almost always not small
⟹ ‘g ≠ f’ PAC!

real learning:
A shall make choices ∈ H (like PLA)
rather than being forced to pick one h. :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/35


Feasibility of Learning Connection to Learning

The ‘Verification’ Flow
(note: with only one h, this is verification, not learning)

[verification flow diagram]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown P on X, generating x1, x2, · · · , xN and the future x
• verifying examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• one hypothesis h (one candidate formula)
• final hypothesis g ≈ f, with g = h (given formula to be verified)

can now use ‘historical records’ (data) to verify ‘one candidate formula’ h

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/35


Feasibility of Learning Connection to Learning

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/35


Feasibility of Learning Connection to Real Learning

Multiple h
(note: each hypothesis h gives its own coloring of the marbles, so each h corresponds to its own bin)

[figure: M bins, one per hypothesis]
h1: Eout(h1), Ein(h1)   h2: Eout(h2), Ein(h2)   . . . . . . . .   hM: Eout(hM), Ein(hM)
real learning (say like PLA):
BINGO when getting ••••••••••?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/35


Feasibility of Learning Connection to Real Learning

Coin Game


Q: if everyone in a size-400 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin ‘g’, is ‘g’ really magical?

A: No. Even if all coins are fair, the probability that at least one of the coins results in 5 heads is 1 − (31/32)^400 > 99%.
(note: for one bin the bad-sample probability is small, but with more bins it may get worse)

BAD sample: Ein and Eout far away (so a correct estimate cannot be made)
—can get worse when involving ‘choice’
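The arithmetic is easy to verify. A tiny sketch (not from the slides) computing 1 − (31/32)^400 exactly and confirming it by simulation:

```python
import random

random.seed(2)
p_exact = 1 - (31 / 32) ** 400        # P[at least one of 400 fair coins shows 5 heads]
print(f"exact: {p_exact:.6f}")         # > 0.99

trials, hits = 20000, 0
for _ in range(trials):
    # 400 students each flip a fair coin 5 times; did anyone get 5 heads?
    hits += any(all(random.random() < 0.5 for _ in range(5)) for _ in range(400))
print(f"simulated: {hits / trials:.4f}")
```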
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/35
Feasibility of Learning Connection to Real Learning

BAD Sample and BAD Data


BAD Sample
e.g., Eout = 1/2, but getting all heads (Ein = 0)!

BAD Data for One h


Eout (h) and Ein (h) far away:
e.g., Eout big (far from f ), but Ein small (correct on most examples)
[table: the possible datasets D1, D2, . . . , D1126, . . . , D5678 as columns; for the single h, a few columns are marked BAD, and by Hoeffding, PD[BAD D for h] ≤ . . .]

(note: imagine that the chance of drawing a BAD dataset for this one h is small)

Hoeffding: PD[BAD D] = Σ over all possible D of P(D) · JBAD DK is small

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/35


Feasibility of Learning Connection to Real Learning

BAD Data for Many h


(note: BAD data means that whatever we observed on the dataset does not tell us the truth about the underlying bin)

GOOD data for many h
⟺ GOOD data for verifying any h
⟺ there exists no BAD h such that Eout(h) and Ein(h) far away

there exists some h such that Eout(h) and Ein(h) far away
⟺ BAD data for many h

[table: datasets D1, D2, . . . , D1126, . . . , D5678 as columns and hypotheses h1, h2, h3, . . . , hM as rows; each row marks its own BAD datasets, with PD[BAD D for hm] ≤ . . . by Hoeffding; the bottom ‘all’ row is BAD whenever any hm is BAD, and GOOD (e.g., for D1126) only when every hm is GOOD, with bound ?]

(note: a finite union bound over the rows is powerful: it bounds the BAD-data probability, keeps it small, and can be used no matter which hypothesis is picked)

do not know if D is BAD or not;


wish PD [BAD D] small & pray for “GOOD luck”
Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/35
Feasibility of Learning Connection to Real Learning

Bound of BAD Data


PD[BAD D]   (for many hypotheses)
= PD[BAD D for h1 or BAD D for h2 or . . . or BAD D for hM]
≤ PD[BAD D for h1] + PD[BAD D for h2] + . . . + PD[BAD D for hM]   (union bound; worst case, M finite)
≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)
= 2M exp(−2ε²N)
(note: M is the number of hypotheses)

• finite-bin version of Hoeffding, valid for all M, N and ε
• does not depend on any Eout(hm), no need to ‘know’ Eout(hm)
—f and P can stay unknown
• ‘Ein(g) = Eout(g)’ is PAC, regardless of A
‘most reasonable’ A (like PLA):
pick the hm with lowest Ein (hm ) as g
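As a numeric illustration (not from the slides; the choices of M, N, ε are arbitrary), the finite-bin bound 2·M·exp(−2ε²N) can be tabulated to see how a larger M loosens the guarantee and a larger N tightens it again:

```python
import math

def bad_data_bound(M, N, eps):
    # finite-bin Hoeffding: P[|Ein(g) - Eout(g)| > eps] <= 2 M exp(-2 eps^2 N)
    return 2 * M * math.exp(-2 * eps ** 2 * N)

eps = 0.1
for M in (1, 100, 10000):
    for N in (100, 1000, 5000):
        print(f"M={M:6d} N={N:5d}  bound={bad_data_bound(M, N, eps):.6f}")
# a larger M loosens the bound; a larger N tightens it again
```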
Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/35
Feasibility of Learning Connection to Real Learning

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/35


Feasibility of Learning Feasibility of Learning Decomposed
The ‘Statistical’ Learning Flow
if |H| = M finite, N large enough,
for whatever g picked by A, Eout(g) ≈ Ein(g)   (Hoeffding)
if A finds one g with Ein(g) ≈ 0,
PAC guarantee for Eout(g) ≈ 0 ⟹ learning possible :-)

[learning flow diagram]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown P on X, generating x1, x2, · · · , xN and the future x
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, producing final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)

Eout(g) ≈ Ein(g) (test) ≈ 0 (train)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/35


Feasibility of Learning Feasibility of Learning Decomposed

Two Central Questions


for batch & supervised binary classification, g ≈ f ⟺ Eout(g) ≈ 0
(the setup from lecture 2, the goal from lecture 1)

achieved through Eout(g) ≈ Ein(g) and Ein(g) ≈ 0
(the former from lecture 3, the latter from lecture 1)

learning split into two central questions:
1 can we make sure that Eout(g) is close enough to Ein(g)? (test/generalize)
2 can we make Ein(g) small enough? (train/optimize)

what role does M = |H| play for the two questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/35


Feasibility of Learning Feasibility of Learning Decomposed

Trade-off on M
1 can we make sure that Eout (g) is close enough to Ein (g)?
2 can we make Ein (g) small enough?

small M:
1 Yes!, P[BAD] ≤ 2 · M · exp(. . .)
2 No!, too few choices

large M:
1 No!, P[BAD] ≤ 2 · M · exp(. . .)
2 Yes!, many choices

using the right M (or H) is important


M = ∞ doomed?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/35


Feasibility of Learning Feasibility of Learning Decomposed

Preview
Known
      P[|Ein(g) − Eout(g)| > ε] ≤ 2 · M · exp(−2ε²N)

Todo
• establish a finite quantity that replaces M:

      P[|Ein(g) − Eout(g)| > ε] ≤ 2 · mH · exp(−2ε²N)   (?)

• justify the feasibility of learning for infinite M
• study mH to understand its trade-off for ‘right’ H, just like M

mysterious PLA to be fully resolved


“soon” :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/35


Feasibility of Learning Feasibility of Learning Decomposed

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/35


Feasibility of Learning Feasibility of Learning Decomposed

Summary
1 When Can Machines Learn?

Lecture 2: The Learning Problems


Lecture 3: Feasibility of Learning
Learning is Impossible?
absolutely no free lunch outside D
Probability to the Rescue
probably approximately correct outside D
Connection to Learning
verification possible if Ein (h) small for fixed h
Connection to Real Learning
learning possible if |H| finite and Ein (g) small
Feasibility of Learning Decomposed
two questions: Eout(g) ≈ Ein(g), and Ein(g) ≈ 0

2 Why Can Machines Learn?


• next: what if |H| = ∞?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/35
