
Machine Learning

(機器學習)
Lecture 3: Feasibility of Learning
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程學系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/35


Feasibility of Learning

Roadmap
1 When Can Machines Learn?

Lecture 3: Feasibility of Learning


Learning is Impossible?
Probability to the Rescue
Connection to Learning
Connection to Real Learning
Feasibility of Learning Decomposed

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/35


Feasibility of Learning Learning is Impossible?

A Learning Puzzle
(setting: batch, supervised, binary classification)

[figure: six 3×3 black-and-white patterns, with x1, x2, x3 labeled yn = −1 and x4, x5, x6 labeled yn = +1; what should g(x) be on a new pattern?]

(note: perhaps no definite answer; it depends on what hypotheses we choose for g(x))

let’s test your ‘human learning’ with 6 examples :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/35
Feasibility of Learning Learning is Impossible?

Two Controversial Answers


whatever you say about g(x), the truth can be claimed to be the opposite:

[figure: the same six labeled patterns (x1–x3 with yn = −1, x4–x6 with yn = +1) plus the new pattern asking g(x) = ?]

truth f(x) = +1 because . . .
• symmetry ⇒ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇒ +1

truth f(x) = −1 because . . .
• left-top black ⇒ −1
• middle column contains at most 1 black and right-top white ⇒ −1

which reason is correct?

all valid reasons; your adversarial teacher can always call you ‘didn’t learn’. :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/35


Feasibility of Learning Learning is Impossible?

A Brain-Storming Problem
(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a ‘learning problem’ with N = 1, x1 = (5, 3, 2), y1 = 151022.


Learn a hypothesis from the one example to predict on x = (7, 2, 5).
What is your answer?

one answer (151026): g(x) = 151012 + x1 + x2 + x3
another answer (143547): g(x) = x1 · x2 · 10000 + x1 · x3 · 100 + (x1 · x2 + x1 · x3 − x2)

which one is the smarter answer that only the top 2% of people can think of?
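Both formulas can be checked mechanically. A minimal Python sketch (not from the slides; the function names g1 and g2 are made up) confirming that each hypothesis reproduces y1 = 151022 on x1 = (5, 3, 2) yet gives a different prediction on (7, 2, 5):

```python
# Two hypotheses that both fit the single training example (5, 3, 2) -> 151022
# but disagree on the query point (7, 2, 5).
def g1(x1, x2, x3):
    # 'simple' answer: a constant plus the sum of the inputs
    return 151012 + x1 + x2 + x3

def g2(x1, x2, x3):
    # 'clever' answer: pack pairwise products into digit groups
    return x1 * x2 * 10000 + x1 * x3 * 100 + (x1 * x2 + x1 * x3 - x2)

assert g1(5, 3, 2) == 151022 and g2(5, 3, 2) == 151022   # both consistent with D
print(g1(7, 2, 5), g2(7, 2, 5))                          # 151026 143547
```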

Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/35


Feasibility of Learning Learning is Impossible?

What is the Next Number?


1,4,1,5

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/35


Feasibility of Learning Learning is Impossible?

What is the Next Number?


1,4,1,5

1, 4, 1, 5, 0, −1, 1, 6
by yt = yt−4 − yt−2

1, 4, 1, 5, 1, 6, 1, 7
by yt = yt−2 + Jt is evenK

1, 4, 1, 5, 2, 9, 3, 14
by yt = yt−4 + yt−2

any number can be the next!
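Each rule above really does fit the four given numbers. A small sketch (not part of the slides) that extends 1, 4, 1, 5 under the three recurrences:

```python
# Extend 1, 4, 1, 5 under three rules that all fit the given numbers.
def extend(seed, rule, upto=8):
    y = list(seed)
    for t in range(len(y) + 1, upto + 1):   # t is 1-indexed, as on the slide
        y.append(rule(y, t))
    return y

seed = [1, 4, 1, 5]
rules = {
    "yt = y(t-4) - y(t-2)": lambda y, t: y[t - 5] - y[t - 3],
    "yt = y(t-2) + [t is even]": lambda y, t: y[t - 3] + (t % 2 == 0),
    "yt = y(t-4) + y(t-2)": lambda y, t: y[t - 5] + y[t - 3],
}
for name, rule in rules.items():
    print(name, extend(seed, rule))
# every rule agrees on the first four numbers but continues differently
```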

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/35


Feasibility of Learning Learning is Impossible?

A ‘Simple’ Binary Classification Problem


xn      yn = f(xn)
000     ○
001     ×
010     ×
011     ○
100     ×

(note: there are 2⁸ = 256 possible Boolean functions on X)

• X = {0, 1}³, Y = {○, ×}, can enumerate all candidate f as H

pick g ∈ H with all g(xn) = yn (like PLA),
does g ≈ f?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/35


Feasibility of Learning Learning is Impossible?

Infeasibility of Learning
 x     y   g   f1  f2  f3  f4  f5  f6  f7  f8
 000   ○   ○   ○   ○   ○   ○   ○   ○   ○   ○    (in D)
 001   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 010   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 011   ○   ○   ○   ○   ○   ○   ○   ○   ○   ○    (in D)
 100   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×    (in D)
 101       ?   ○   ○   ○   ○   ×   ×   ×   ×
 110       ?   ○   ○   ×   ×   ○   ○   ×   ×
 111       ?   ○   ×   ○   ×   ○   ×   ○   ×

• g ≈ f inside D: sure!
• g ≈ f outside D: No! (but that’s really what we want!)

learning from D (to infer something outside D)
is doomed if any ‘unknown’ f can happen. :-(
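The counting can be checked directly. A short sketch (not from the slides; labels 'o'/'x' stand for ○/×) that enumerates every Boolean target on X = {0,1}³ consistent with the five examples: exactly 2³ = 8 remain, and together they realize every labeling of the three unseen points, so no choice of g can be guaranteed correct outside D.

```python
from itertools import product

X = list(product([0, 1], repeat=3))              # all 8 points of X = {0,1}^3
D = {(0, 0, 0): 'o', (0, 0, 1): 'x', (0, 1, 0): 'x',
     (0, 1, 1): 'o', (1, 0, 0): 'x'}             # the five training examples
unseen = [x for x in X if x not in D]

# enumerate all 2^8 = 256 Boolean functions on X, keep those consistent with D
consistent = []
for labels in product('ox', repeat=len(X)):
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in D.items()):
        consistent.append(tuple(f[x] for x in unseen))

print(len(consistent))           # 8 remaining targets
print(sorted(set(consistent)))   # they cover all 8 labelings of the 3 unseen points
```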

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/35


Feasibility of Learning Learning is Impossible?

No Free Lunch Theorem for Machine Learning


Without any assumptions on the learning problem at hand,
all learning algorithms perform the same.

(CC-BY-SA 2.0 by Gaspar Torriero on Flickr)

no algorithm is best
for all learning problems
Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/35
Feasibility of Learning Learning is Impossible?

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/35


Feasibility of Learning Probability to the Rescue

Inferring Something Unknown with Assumptions


difficult to infer unknown target f outside D in learning;
can we infer something unknown in other scenarios?

• consider a bin of many, many orange and green marbles
• do we know the orange portion
(probability)? No!

can you infer the orange probability?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/35


Feasibility of Learning Probability to the Rescue
Statistics 101: Inferring Orange Probability
[figure: a bin of orange and green marbles, and a sample of marbles drawn from it]

assume N marbles sampled independently:

bin: orange probability = μ, green probability = 1 − μ, with μ unknown
sample: orange fraction = ν, green fraction = 1 − ν, with ν now known

does in-sample ν say anything about out-of-sample μ?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/35


Feasibility of Learning Probability to the Rescue

Possible versus Probable


does in-sample ν say anything about out-of-sample μ?

No!
possibly not: sample can be mostly green while bin is mostly orange

Yes!
probably yes: in-sample ν likely close to unknown μ

formally, what does ν say about μ?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/35


Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (1/2)


[figure: bin with orange probability μ; sample of size N with orange fraction ν]

• in a big sample (N large), ν is probably close to μ (within ε):

      P[|ν − μ| > ε] ≤ 2 exp(−2ε²N)

• called Hoeffding’s Inequality, for marbles, coin, polling, . . .

the statement ‘ν = μ’ is
probably approximately correct (PAC)
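A Monte-Carlo sketch (not from the slides; μ = 0.6 and the sample sizes are arbitrary choices) comparing the empirical chance of a large deviation |ν − μ| > ε with the Hoeffding bound 2·exp(−2ε²N):

```python
import math
import random

random.seed(0)
mu, eps, trials = 0.6, 0.1, 10000     # assumed orange probability and tolerance

for N in (20, 100, 500):
    bad = 0
    for _ in range(trials):
        nu = sum(random.random() < mu for _ in range(N)) / N   # orange fraction
        bad += abs(nu - mu) > eps
    bound = 2 * math.exp(-2 * eps ** 2 * N)
    print(f"N={N:4d}  P[|nu - mu| > eps] ~ {bad / trials:.3f}  bound {bound:.3f}")
# the empirical deviation probability stays below the bound and shrinks with N
```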
Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/35
Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (2/2)
(note: related to Chebyshev-type concentration inequalities)


      P[|ν − μ| > ε] ≤ 2 exp(−2ε²N)

(note: as N grows, the right-hand side goes to 0, so the ‘bad’ event that ν and μ differ a lot becomes unlikely)

• valid for all N and ε


• does not depend on μ, no need to ‘know’ μ
• larger sample size N or looser gap ε
  ⟹ higher probability for ‘ν ≈ μ’

if large N, can probably infer unknown μ by known ν
(under i.i.d. sampling assumption)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/35


Feasibility of Learning Probability to the Rescue

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/35


Feasibility of Learning Connection to Learning
Connection to Learning
(note: what we want to know is whether our hypothesis agrees with the target function; the key is to evaluate how the predictions differ from the ideal values)

the bin ↔ learning correspondence:
• unknown orange probability μ ↔ fixed hypothesis h(x) ≟ target f(x), with f unknown
• marble • in bin ↔ point x ∈ X
• orange marble ↔ h is wrong: h(x) ≠ f(x)
• green marble ↔ h is right: h(x) = f(x)
• size-N sample of i.i.d. marbles from bin ↔ check h on dataset D = {(xn, yn)} with i.i.d. xn, where yn = f(xn)

if large N & i.i.d. xn, can probably infer
unknown Jh(x) ≠ f(x)K probability
by known Jh(xn) ≠ ynK fraction
(using the dataset at hand with Hoeffding’s inequality)
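To make the analogy concrete, here is a small simulation sketch (not from the slides; the target f, the hypothesis h, and the distribution P are invented for illustration): Ein(h) plays the role of the sample fraction ν, and Eout(h) plays the role of the bin probability μ.

```python
import random

random.seed(1)

def f(x):
    # pretend target function (unknown in a real learning problem)
    return 1 if x[0] + x[1] > 1.0 else -1

def h(x):
    # one fixed hypothesis to be checked
    return 1 if x[0] > 0.5 else -1

def draw(n):
    # i.i.d. inputs x from an (assumed) uniform distribution P on [0,1]^2
    return [(random.random(), random.random()) for _ in range(n)]

N = 200
D = [(x, f(x)) for x in draw(N)]                              # training examples
E_in = sum(h(x) != y for x, y in D) / N                       # known fraction (nu)
E_out = sum(h(x) != f(x) for x in draw(100000)) / 100000      # unknown probability (mu), approximated
print(f"E_in(h) = {E_in:.3f}   E_out(h) ~ {E_out:.3f}")
# with N large and i.i.d. xn, E_in(h) is probably close to E_out(h)
```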

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/35


Feasibility of Learning Connection to Learning
Added Components

[learning flow diagram, with the new components added]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown distribution P on X, from which x1, x2, · · · , xN are sampled (and on which h ≈ f is evaluated)
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, producing final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas), with one fixed h under consideration

for any fixed h, can probably infer

unknown Eout(h) = E_{x∼P} Jh(x) ≠ f(x)K
(note: the expectation, over all possible x drawn from P, of how often h makes an error, i.e., the error rate to be estimated)

by known Ein(h) = (1/N) Σ_{n=1}^{N} Jh(xn) ≠ ynK
(note: the error fraction on the data at hand)

(under i.i.d. sampling assumption)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/35


Feasibility of Learning Connection to Learning

The Formal Guarantee


for any fixed h, in ‘big’ data (N large),
in-sample error Ein(h) is probably close to
out-of-sample error Eout(h) (within ε):

      P[|Ein(h) − Eout(h)| > ε] ≤ 2 exp(−2ε²N)

same as the ‘bin’ analogy . . .


• valid for all N and ε
• does not depend on Eout(h), no need to ‘know’ Eout(h)
—f and P can stay unknown
• ‘Ein(h) = Eout(h)’ is probably approximately correct (PAC)

if ‘Ein(h) ≈ Eout(h)’ and ‘Ein(h) small’
⟹ Eout(h) small ⟹ h ≈ f with respect to P!!!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/35


Feasibility of Learning Connection to Learning

Verification of One h
for any fixed h, when data large enough,
Ein(h) ≈ Eout(h)
(note: but what if we only have that one h?)

Can we claim ‘good learning’ (g ≈ f)?

Yes! (this is verification/testing):
if Ein(h) small for the fixed h, and A picks that h as g
⟹ ‘g = f’ PAC

No! (A is stuck with that one h no matter what):
if A forced to pick THE h as g
⟹ Ein(h) almost always not small
⟹ ‘g ≠ f’ PAC!

real learning:
A shall make choices ∈ H (like PLA)
rather than being forced to pick one h. :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/35


Feasibility of Learning Connection to Learning

The ‘Verification’ Flow
(note: with only one h, this is verification, not learning)

[verification flow diagram]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown P on X, generating x1, x2, · · · , xN and the future x
• verifying examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• one hypothesis h (one candidate formula)
• final hypothesis g ≈ f, with g = h (given formula to be verified)

can now use ‘historical records’ (data) to verify ‘one candidate formula’ h

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/35


Feasibility of Learning Connection to Learning

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/35


Feasibility of Learning Connection to Real Learning

Multiple h
(note: each hypothesis h gives its own coloring of the marbles, so each h corresponds to its own bin)

[figure: M bins, one per hypothesis]
h1: Eout(h1), Ein(h1)   h2: Eout(h2), Ein(h2)   . . . . . . . .   hM: Eout(hM), Ein(hM)
real learning (say like PLA):
BINGO when getting ••••••••••?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/35


Feasibility of Learning Connection to Real Learning

Coin Game


Q: if everyone in a size-400 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin ‘g’, is ‘g’ really magical?

A: No. Even if all coins are fair, the probability that at least one of the coins results in 5 heads is 1 − (31/32)^400 > 99%.
(note: for one bin the bad-sample probability is small, but with more bins it may get worse)

BAD sample: Ein and Eout far away (so a correct estimate cannot be made)
—can get worse when involving ‘choice’
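The arithmetic is easy to verify. A tiny sketch (not from the slides) computing 1 − (31/32)^400 exactly and confirming it by simulation:

```python
import random

random.seed(2)
p_exact = 1 - (31 / 32) ** 400        # P[at least one of 400 fair coins shows 5 heads]
print(f"exact: {p_exact:.6f}")         # > 0.99

trials, hits = 20000, 0
for _ in range(trials):
    # 400 students each flip a fair coin 5 times; did anyone get 5 heads?
    hits += any(all(random.random() < 0.5 for _ in range(5)) for _ in range(400))
print(f"simulated: {hits / trials:.4f}")
```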
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/35
Feasibility of Learning Connection to Real Learning

BAD Sample and BAD Data


BAD Sample
e.g., Eout = 1/2, but getting all heads (Ein = 0)!

BAD Data for One h


Eout (h) and Ein (h) far away:
e.g., Eout big (far from f ), but Ein small (correct on most examples)
[table: the possible datasets D1, D2, . . . , D1126, . . . , D5678 as columns; for the single h, a few columns are marked BAD, and by Hoeffding, PD[BAD D for h] ≤ . . .]

(note: imagine that the chance of drawing a BAD dataset for this one h is small)

Hoeffding: PD[BAD D] = Σ over all possible D of P(D) · JBAD DK is small

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/35


Feasibility of Learning Connection to Real Learning

BAD Data for Many h


(note: BAD data means that whatever we observed on the dataset does not tell us the truth about the underlying bin)

GOOD data for many h
⟺ GOOD data for verifying any h
⟺ there exists no BAD h such that Eout(h) and Ein(h) far away

there exists some h such that Eout(h) and Ein(h) far away
⟺ BAD data for many h

[table: datasets D1, D2, . . . , D1126, . . . , D5678 as columns and hypotheses h1, h2, h3, . . . , hM as rows; each row marks its own BAD datasets, with PD[BAD D for hm] ≤ . . . by Hoeffding; the bottom ‘all’ row is BAD whenever any hm is BAD, and GOOD (e.g., for D1126) only when every hm is GOOD, with bound ?]

(note: a finite union bound over the rows is powerful: it bounds the BAD-data probability, keeps it small, and can be used no matter which hypothesis is picked)

do not know if D is BAD or not;


wish PD [BAD D] small & pray for “GOOD luck”
Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/35
Feasibility of Learning Connection to Real Learning

Bound of BAD Data


PD[BAD D]   (for many hypotheses)
= PD[BAD D for h1 or BAD D for h2 or . . . or BAD D for hM]
≤ PD[BAD D for h1] + PD[BAD D for h2] + . . . + PD[BAD D for hM]   (union bound; worst case, M finite)
≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)
= 2M exp(−2ε²N)
(note: M is the number of hypotheses)

• finite-bin version of Hoeffding, valid for all M, N and ε
• does not depend on any Eout(hm), no need to ‘know’ Eout(hm)
—f and P can stay unknown
• ‘Ein(g) = Eout(g)’ is PAC, regardless of A
‘most reasonable’ A (like PLA):
pick the hm with lowest Ein (hm ) as g
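As a numeric illustration (not from the slides; the choices of M, N, ε are arbitrary), the finite-bin bound 2·M·exp(−2ε²N) can be tabulated to see how a larger M loosens the guarantee and a larger N tightens it again:

```python
import math

def bad_data_bound(M, N, eps):
    # finite-bin Hoeffding: P[|Ein(g) - Eout(g)| > eps] <= 2 M exp(-2 eps^2 N)
    return 2 * M * math.exp(-2 * eps ** 2 * N)

eps = 0.1
for M in (1, 100, 10000):
    for N in (100, 1000, 5000):
        print(f"M={M:6d} N={N:5d}  bound={bad_data_bound(M, N, eps):.6f}")
# a larger M loosens the bound; a larger N tightens it again
```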
Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/35
Feasibility of Learning Connection to Real Learning

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/35


Feasibility of Learning Feasibility of Learning Decomposed
The ‘Statistical’ Learning Flow
if |H| = M finite, N large enough,
for whatever g picked by A, Eout(g) ≈ Ein(g)   (Hoeffding)
if A finds one g with Ein(g) ≈ 0,
PAC guarantee for Eout(g) ≈ 0 ⟹ learning possible :-)

[learning flow diagram]
• unknown target function f: X → Y (ideal credit approval formula)
• unknown P on X, generating x1, x2, · · · , xN and the future x
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, producing final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)

Eout(g) ≈ Ein(g) (test) ≈ 0 (train)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/35


Feasibility of Learning Feasibility of Learning Decomposed

Two Central Questions


for batch & supervised binary classification, g ≈ f ⟺ Eout(g) ≈ 0
(the setup from lecture 2, the goal from lecture 1)

achieved through Eout(g) ≈ Ein(g) and Ein(g) ≈ 0
(the former from lecture 3, the latter from lecture 1)

learning split into two central questions:
1 can we make sure that Eout(g) is close enough to Ein(g)? (test/generalize)
2 can we make Ein(g) small enough? (train/optimize)

what role does M = |H| play for the two questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/35


Feasibility of Learning Feasibility of Learning Decomposed

Trade-off on M
1 can we make sure that Eout (g) is close enough to Ein (g)?
2 can we make Ein (g) small enough?

small M:
1 Yes!, P[BAD] ≤ 2 · M · exp(. . .)
2 No!, too few choices

large M:
1 No!, P[BAD] ≤ 2 · M · exp(. . .)
2 Yes!, many choices

using the right M (or H) is important


M = ∞ doomed?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/35


Feasibility of Learning Feasibility of Learning Decomposed

Preview
Known
      P[|Ein(g) − Eout(g)| > ε] ≤ 2 · M · exp(−2ε²N)

Todo
• establish a finite quantity that replaces M:

      P[|Ein(g) − Eout(g)| > ε] ≤ 2 · mH · exp(−2ε²N)   (?)

• justify the feasibility of learning for infinite M
• study mH to understand its trade-off for ‘right’ H, just like M

mysterious PLA to be fully resolved


“soon” :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/35


Feasibility of Learning Feasibility of Learning Decomposed

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/35


Feasibility of Learning Feasibility of Learning Decomposed

Summary
1 When Can Machines Learn?

Lecture 2: The Learning Problems


Lecture 3: Feasibility of Learning
Learning is Impossible?
absolutely no free lunch outside D
Probability to the Rescue
probably approximately correct outside D
Connection to Learning
verification possible if Ein (h) small for fixed h
Connection to Real Learning
learning possible if |H| finite and Ein (g) small
Feasibility of Learning Decomposed
two questions: Eout(g) ≈ Ein(g), and Ein(g) ≈ 0

2 Why Can Machines Learn?


• next: what if |H| = ∞?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/35
