
Machine Learning
Lecture 5: Linear Models

Hsuan-Tien Lin
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/52


Linear Models

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/52


Linear Models Linear Regression Problem

Credit Limit Problem


customer data: age 23 years; gender female; annual salary NTD 1,000,000;
year in residence 1 year; year in job 0.5 year; current debt 200,000;
credit limit? 100,000

unknown target function f: X → Y (ideal credit limit formula)
(note: how much credit limit should a customer get? it differs from customer to customer)

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula)

(note: the key feature of regression is that the output space is the whole set of real numbers)
Y = R: regression
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/52
Linear Models Linear Regression Problem

Linear Regression Hypothesis


(note: what should H look like when the output space is the whole real line?)

customer data: age 23 years; annual salary NTD 1,000,000;
year in job 0.5 year; current debt 200,000

• For x = (x0, x1, x2, · · · , xd) ‘features of customer’ (note: the customer's data),
  approximate the desired credit limit with a weighted sum (note: weighted by the wi):

      y \approx \sum_{i=0}^{d} w_i x_i

• linear regression hypothesis: h(x) = w^T x

h(x): like perceptron, but without the sign
(note: PLA takes the sign, positive or negative; here we do not)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/52
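A minimal Python/NumPy sketch of the weighted-sum hypothesis h(x) = w^T x above; the feature encoding and the weight values are invented for illustration and are not from the lecture.

import numpy as np

# features of one customer, with x0 = 1 as the constant term:
# (1, age, annual salary in millions, years in job, current debt in millions)
x = np.array([1.0, 23.0, 1.0, 0.5, 0.2])

# hypothetical weight vector w = (w0, w1, ..., wd); not from the lecture
w = np.array([5.0, 0.3, 40.0, 2.0, -30.0])

# linear regression hypothesis h(x) = w^T x: a weighted sum of the features
credit_limit = w @ x
print(credit_limit)   # some real number: an estimated credit limit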


Linear Models Linear Regression Problem

Illustration of Linear Regression


(figure: left, x = (x) ∈ R, a line fitted to points in the (x, y) plane; right, x = (x1, x2) ∈ R^2, a plane fitted to points in (x1, x2, y) space)
(note: in three-dimensional space the hypothesis forms a plane; we measure how close each point is to the hypothesis)

linear regression:
find lines/hyperplanes with small residuals

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/52


Linear Models Linear Regression Problem

Pointwise Error Measure for ‘Small Residuals’


final hypothesis g ≈ f

how well? often use averaged err(g(x), f(x)), like

    E_{out}(g) = \mathbb{E}_{x \sim P} \big[\!\big[ g(x) \neq f(x) \big]\!\big]     (the bracketed quantity is err(g(x), f(x)))

—err: called pointwise error measure

in-sample:       E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(x_n), f(x_n))
out-of-sample:   E_{out}(g) = \mathbb{E}_{x \sim P}\, \text{err}(g(x), f(x))

will mainly consider pointwise err for simplicity


Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/52
Linear Models Linear Regression Problem

Learning Flow with Pointwise Error Measure


(learning-flow diagram)
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula),
together with unknown P on X, generates x1, x2, · · · , xN and y1, y2, · · · , yN

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula); error measure err

extended VC theory/‘philosophy’
works for most H and err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/52


Linear Models Linear Regression Problem

Two Important Pointwise Error Measures


err(ỹ, y), with shorthand ỹ = g(x) and y = f(x)

0/1 error:       err(ỹ, y) = [[ ỹ ≠ y ]]
                 • correct or incorrect?    • often for classification

squared error:   err(ỹ, y) = (ỹ − y)^2
                 • how far is ỹ from y?     • often for regression

squared error: quantify ‘small residual’

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/52


Linear Models Linear Regression Problem

Squared Error Measure for Regression


popular/historical error measure for linear regression:
squared error err(ŷ, y) = (ŷ − y)^2

in-sample:       E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(\underbrace{h(x_n)}_{w^T x_n} - y_n\big)^2
out-of-sample:   E_{out}(w) = \mathbb{E}_{(x,y) \sim P} \big(w^T x - y\big)^2

next: how to minimize Ein (w)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/52


Linear Models Linear Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/52


Linear Models Linear Regression Algorithm

Matrix Form of Ein (w)


(note: how do we make E_in as small as possible?)

E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n^T w - y_n)^2
(note: each term compares the hypothesis's prediction with the desired value)

          = \frac{1}{N} \left\| \begin{bmatrix} x_1^T w - y_1 \\ x_2^T w - y_2 \\ \vdots \\ x_N^T w - y_N \end{bmatrix} \right\|^2
          = \frac{1}{N} \left\| \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} w - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \right\|^2     (note: factor out the shared w)

          = \frac{1}{N} \, \| X w - y \|^2     with X of size N × (d+1), w of size (d+1) × 1, y of size N × 1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/52
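A small NumPy check, on randomly generated toy data, that the averaged sum of squared residuals equals the matrix form (1/N)‖Xw − y‖^2 derived above.

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # N x (d+1), with x0 = 1
y = rng.standard_normal(N)                                      # N targets
w = rng.standard_normal(d + 1)                                  # (d+1) weights

# sum form: (1/N) * sum_n (w^T x_n - y_n)^2
E_in_sum = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(N)])

# matrix form: (1/N) * ||Xw - y||^2
E_in_mat = np.linalg.norm(X @ w - y) ** 2 / N

print(np.isclose(E_in_sum, E_in_mat))  # True: the two forms agree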


Linear Models Linear Regression Algorithm
(note: the only variable is w; X and y are known)

    \min_w E_{in}(w) = \frac{1}{N} \| Xw - y \|^2

• E_in(w): continuous, differentiable, convex

• necessary condition of ‘best’ w (note: the partial derivative in every direction equals 0):

    \nabla E_{in}(w) \equiv \begin{bmatrix} \frac{\partial E_{in}}{\partial w_0}(w) \\ \frac{\partial E_{in}}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E_{in}}{\partial w_d}(w) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}

(note: at the lowest point of E_in, no direction can decrease the function value any further; the gradient is 0)
—not possible to ‘roll down’

task: find w_{LIN} such that \nabla E_{in}(w_{LIN}) = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/52


Linear Models Linear Regression Algorithm

The Gradient ∇E_in(w)

    E_{in}(w) = \frac{1}{N} \| Xw - y \|^2 = \frac{1}{N} \Big( w^T \underbrace{X^T X}_{A} w \;-\; 2 w^T \underbrace{X^T y}_{b} \;+\; \underbrace{y^T y}_{c} \Big)

one w only (note: when w has a single dimension):
    E_{in}(w) = \frac{1}{N} (a w^2 - 2bw + c),      \nabla E_{in}(w) = \frac{1}{N} (2aw - 2b)      — simple! :-)

vector w (note: when w is a vector):
    E_{in}(w) = \frac{1}{N} (w^T A w - 2 w^T b + c),      \nabla E_{in}(w) = \frac{1}{N} (2Aw - 2b)      — similar (derived by definition)

(note: when is the gradient 0?)

    \nabla E_{in}(w) = \frac{2}{N} \big( X^T X w - X^T y \big)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/52
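A short numerical sanity check of the gradient formula ∇E_in(w) = (2/N)(X^T X w − X^T y), compared against a finite-difference approximation on synthetic data (all numbers are illustrative).

import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.standard_normal(N)
w = rng.standard_normal(d + 1)

def E_in(w):
    return np.linalg.norm(X @ w - y) ** 2 / N

# analytic gradient from the slide: (2/N) (X^T X w - X^T y)
grad = 2.0 / N * (X.T @ X @ w - X.T @ y)

# finite-difference approximation of each partial derivative
eps = 1e-6
grad_fd = np.array([
    (E_in(w + eps * np.eye(d + 1)[i]) - E_in(w - eps * np.eye(d + 1)[i])) / (2 * eps)
    for i in range(d + 1)
])

print(np.allclose(grad, grad_fd, atol=1e-5))  # True: the formula matches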


Linear Models Linear Regression Algorithm

Optimal Linear Regression Weights


task: find w_{LIN} such that \frac{2}{N}\big(X^T X w - X^T y\big) = \nabla E_{in}(w) = 0
(note: only w is unknown here)

invertible X^T X (note: the inverse exists):
• easy! unique solution
      w_{LIN} = \underbrace{(X^T X)^{-1} X^T}_{\text{pseudo-inverse } X^\dagger} \, y
• often the case because N ≫ d + 1

singular X^T X:
• many optimal solutions
• one of the solutions is
      w_{LIN} = X^\dagger y
  by defining X^\dagger in other ways

practical suggestion:
use a well-implemented † routine
instead of (X^T X)^{-1} X^T
for numerical stability when X^T X is almost singular
Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/52
Linear Models Linear Regression Algorithm

Linear Regression Algorithm


1  from D, construct input matrix X and output vector y (note: build the matrices):

       X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}, \qquad
       y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^{N \times 1}

2  calculate the pseudo-inverse X^\dagger \in \mathbb{R}^{(d+1) \times N}

3  return w_{LIN} = X^\dagger y \in \mathbb{R}^{(d+1) \times 1}

simple and efficient
with good † routine
Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/52
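A minimal sketch of the three steps with NumPy, where np.linalg.pinv plays the role of the '†' routine and the data is synthetic; np.linalg.lstsq is shown as the usual well-implemented alternative.

import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 5
w_true = rng.standard_normal(d + 1)

# step 1: construct X (N x (d+1), first column all ones) and y (N targets)
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = X @ w_true + 0.1 * rng.standard_normal(N)     # targets = linear signal + noise

# steps 2 and 3: w_LIN = pseudo-inverse(X) @ y
w_lin = np.linalg.pinv(X) @ y

# equivalent (and often preferred) least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_lin, w_lstsq))     # True
print(np.abs(w_lin - w_true).max())    # small: close to the true weights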
Linear Models Linear Regression Algorithm

Is Linear Regression a ‘Learning Algorithm’?


w_{LIN} = X^\dagger y

No!
• analytic (closed-form) solution, ‘instantaneous’
• not improving E_in nor E_out iteratively
  (note: the result comes out in one shot — no learning process? the result is never updated as data is observed!)

Yes! (note: from some angles it does count as a machine learning algorithm)
1  good E_in? yes, optimal!
2  good E_out? yes, finite d_VC like perceptrons
   (note: with enough data, good E_in leads to good E_out)
3  improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_{LIN}) is good, learning ‘happened’!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/52


Linear Regression Generalization Issue

(note: how can we guarantee that E_in is good?)
Benefit of Analytic Solution:
‘Simpler-than-VC’ Guarantee

to be shown:   \overline{E_{in}} = \mathbb{E}_{D \sim P^N} \big\{ E_{in}(w_{LIN} \text{ w.r.t. } D) \big\} = \text{noise level} \cdot \Big(1 - \frac{d+1}{N}\Big)
(note: the noise level is the noise in the data, the part the distribution gets wrong; the expectation averages over all datasets that could be drawn)

    E_{in}(w_{LIN}) = \frac{1}{N} \| y - \underbrace{\hat{y}}_{\text{predictions}} \|^2 = \frac{1}{N} \| y - X \underbrace{X^\dagger y}_{w_{LIN}} \|^2
                   = \frac{1}{N} \| (\underbrace{I}_{\text{identity}} - X X^\dagger) y \|^2

call X X^\dagger the hat matrix H
because it puts ^ on y

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/23


Linear Regression Generalization Issue

Geometric View of Hat Matrix

(figure: y, ŷ, and y − ŷ drawn relative to the span of X, in R^N)

• ŷ = X w_{LIN} lies within the span of the columns of X
• ‖y − ŷ‖ smallest: y − ŷ ⊥ span
• H: projects y to ŷ ∈ span (note: a projection)
• I − H: transforms y to y − ŷ ⊥ span

claim: trace(I − H) = N − (d + 1). Why? :-)
(note: the trace is the sum of all values on the diagonal)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/23
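A quick numerical check of the hat-matrix claims on synthetic data: H = XX† is idempotent, the residual y − ŷ is orthogonal to the span of X's columns, and trace(I − H) = N − (d + 1).

import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # full column rank (almost surely)
y = rng.standard_normal(N)

H = X @ np.linalg.pinv(X)          # hat matrix H = X X^dagger
y_hat = H @ y                      # puts the 'hat' on y

print(np.allclose(H @ H, H))                              # True: H is a projection (idempotent)
print(np.allclose(X.T @ (y - y_hat), 0))                  # True: y - y_hat is orthogonal to span of X
print(np.isclose(np.trace(np.eye(N) - H), N - (d + 1)))   # True: trace(I - H) = N - (d+1)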


Linear Regression Generalization Issue

An Illustrative ‘Proof’
(figure: y, y − ŷ, the noise, and the ideal f(X) drawn relative to the span of X)

• if y comes from some ideal f(X) ∈ span plus noise
• noise with per-dimension ‘noise level’ σ^2 is transformed by I − H to become y − ŷ

    E_{in}(w_{LIN}) = \frac{1}{N} \| y - \hat{y} \|^2 = \frac{1}{N} \| (I - H)\,\text{noise} \|^2 = \frac{1}{N} \big( N - (d+1) \big) \sigma^2

    \overline{E_{in}} = \sigma^2 \cdot \Big( 1 - \frac{d+1}{N} \Big)     (note: fitting the data we have seen makes E_in look good)

    \overline{E_{out}} = \sigma^2 \cdot \Big( 1 + \frac{d+1}{N} \Big)  (complicated!)     (note: measured on new data; the gap between the two is about 2(d+1)/N)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/23
Linear Models Linear Regression Algorithm

The Learning Curves of Linear Regression


(proof skipped this year)

(figure: learning curves — Expected Error versus Number of Data Points N; E_out decreases and E_in increases toward the noise level σ^2, starting around N = d+1)

    \overline{E_{out}} = \text{noise level} \cdot \Big( 1 + \frac{d+1}{N} \Big)
    \overline{E_{in}}  = \text{noise level} \cdot \Big( 1 - \frac{d+1}{N} \Big)

• both converge to σ^2 (the noise level) for N → ∞
• expected generalization error: \frac{2(d+1)}{N}
  —similar to the worst-case guarantee from VC

linear regression (LinReg):
learning ‘happened’!
Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/52
Linear Models Linear Regression Algorithm

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (1/2)


patient data: age 40 years; gender male; blood pressure 130/85;
cholesterol level 240; weight 70; heart disease? yes
(note: a binary classification problem)

unknown target distribution P(y|x) containing f(x) + noise

training examples D: (x1, y1), · · · , (xN, yN)
→ learning algorithm A → final hypothesis g ≈ f

hypothesis set H; error measure err

binary classification:
ideal f(x) = sign\big( P(+1|x) - \frac{1}{2} \big) ∈ {−1, +1}
because of classification err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (2/2)


patient data: age 40 years; gender male; blood pressure 130/85;
cholesterol level 240; weight 70; heart attack? 80% risk
(note: the risk of having a heart attack)

unknown target distribution P(y|x) containing f(x) + noise

training examples D: (x1, y1), · · · , (xN, yN)
→ learning algorithm A → final hypothesis g ≈ f

hypothesis set H; error measure err

‘soft’ binary classification:
f(x) = P(+1|x) ∈ [0, 1]
(note: this value, the risk, is what we want to know)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data                          actual (noisy) data
(x1, y'1 = 0.9 = P(+1|x1))                      (x1, y1 = ◦ ∼ P(y|x1))
(x2, y'2 = 0.2 = P(+1|x2))                      (x2, y2 = × ∼ P(y|x2))
      ...                                             ...
(xN, y'N = 0.6 = P(+1|xN))                      (xN, yN = × ∼ P(y|xN))

(note: we usually do not have the ideal data in hand — as before, we only see the right/wrong labels)

same data as hard binary classification,
different target function

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]
(note: view the data as a noisy version of the ideal one — sampled according to 0.9, 0.2, …)

ideal (noiseless) data                          actual (noisy) data
(x1, y'1 = 0.9 = P(+1|x1))                      (x1, y'1 ≟ 1 = [[◦]] ∼ P(y|x1))
(x2, y'2 = 0.2 = P(+1|x2))                      (x2, y'2 ≟ 0 = [[×]] ∼ P(y|x2))
      ...                                             ...
(xN, y'N = 0.6 = P(+1|xN))                      (xN, y'N ≟ 0 = [[×]] ∼ P(y|xN))

same data as hard binary classification,
different target function
(note: how can we find a good hypothesis when what we observe is only the 0/1-valued output?)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Logistic Hypothesis (note: commonly used)

patient data: age 40 years; gender male; blood pressure 130/85; cholesterol level 240

• For x = (x0, x1, x2, · · · , xd) ‘features of patient’, calculate a weighted ‘risk score’
  (note: a weighted sum of the features, with x0 the constant term):

      s = \sum_{i=0}^{d} w_i x_i

• convert the score to an estimated probability by the logistic function θ(s)
  (figure: the S-shaped curve θ(s), rising from 0 toward 1 as s increases)
  (note: turns the score into a value between 0 and 1)

logistic hypothesis: h(x) = θ(w^T x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/52


Linear Models Logistic Regression Problem

Logistic Function
(figure: the logistic curve θ(s), rising from 0 toward 1)

θ(−∞) = 0;   θ(0) = 1/2;   θ(∞) = 1
(note: close to 0 when the score is very low, close to 1 when the score is very high)

    \theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}     (an analytic form)

—smooth, monotonic, sigmoid function of s

logistic regression: use

    h(x) = \frac{1}{1 + \exp(-w^T x)}

to approximate target function f(x) = P(+1|x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/52
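A small sketch of the logistic function and the logistic hypothesis h(x) = θ(w^T x); the patient features and weights below are hypothetical values chosen only for illustration.

import numpy as np

def theta(s):
    # logistic function: smooth, monotonic, sigmoid; maps scores to (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

def logistic_h(w, x):
    # logistic hypothesis: estimated P(+1 | x)
    return theta(w @ x)

# hypothetical patient features with x0 = 1, and hypothetical weights
x = np.array([1.0, 40.0, 1.0, 130.0, 240.0])
w = np.array([-8.0, 0.05, 0.3, 0.02, 0.01])

print(theta(-np.inf), theta(0.0), theta(np.inf))  # 0.0, 0.5, 1.0
print(logistic_h(w, x))                           # a risk estimate in (0, 1)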


Linear Models Logistic Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/52


Linear Models Logistic Regression Error

Three Linear Models


linear scoring function: s = w^T x

(figure: three network-style diagrams, each combining inputs x0, x1, x2, …, xd into a score s and then an output h(x))

linear classification:  h(x) = sign(s)
    plausible err = 0/1 (small flipping noise)   (note: PLA tries to make as few errors as possible)
linear regression:      h(x) = s
    friendly err = squared (easy to minimize)
logistic regression:    h(x) = θ(s)   (note: the score is passed through the sigmoid)
    err = ?

how to define
E_in(w) for logistic regression?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/52


Linear Models Logistic Regression Error

Likelihood

(note: y takes one of two values)

target function f(x) = P(+1|x)   ⇔   P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1

consider D = {(x1, ◦), (x2, ×), . . . , (xN, ×)}

probability that f generates D (note: the probability of producing this data is …):
    P(x1) P(◦|x1) × P(x2) P(×|x2) × . . . × P(xN) P(×|xN)

likelihood that h generates D (note: pretend h is f and ask how plausible the same data would be; since h is not the true f, this is called a likelihood, not a probability):
    P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))

(note: maximum likelihood)
• if h ≈ f (note: when h is good, its likelihood is similar to that of f),
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

target function f(x) = P(+1|x)   ⇔   P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(note: assuming D is i.i.d.)

consider D = {(x1, ◦), (x2, ×), . . . , (xN, ×)}

probability that f generates D:
    P(x1) f(x1) × P(x2) (1 − f(x2)) × . . . × P(xN) (1 − f(xN))
likelihood that h generates D (pretend h is f):
    P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))

• if h ≈ f,
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = \text{argmax}_h \ \text{likelihood}(h)

when logistic: h(x) = θ(w^T x), and
    1 − h(x) = h(−x)
(note: a symmetry of the logistic function — rotating its curve by 180° gives the same shape)

likelihood(h) = P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))
(note: the P(xn) factors are the same for every h)

    \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = \text{argmax}_h \ \text{likelihood}(h)

when logistic: h(x) = θ(w^T x), and
    1 − h(x) = h(−x)

for the labels ◦, ×, . . . , ×:
likelihood(h) = P(x1) h(+x1) × P(x2) h(−x2) × . . . × P(xN) h(−xN)
(note: the factors are h(y1 x1), h(y2 x2), . . . , h(yN xN))

    \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_h \ \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)

(note: if the label is ◦, the factor uses +x_n; if the label is ×, it uses −x_n)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_w \ \text{likelihood}(w) \propto \prod_{n=1}^{N} \theta\big( y_n w^T x_n \big)

(note: written in terms of the weights w, the quantity we are really interested in)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_w \ \ln \prod_{n=1}^{N} \theta\big( y_n w^T x_n \big) = \sum_{n=1}^{N} \ln \theta\big( y_n w^T x_n \big)

(note: taking the logarithm turns the product into a sum)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \min_w \ \frac{1}{N} \sum_{n=1}^{N} -\ln \theta\big( y_n w^T x_n \big)

(note: flipping max into min and averaging gives an error function)

with \theta(s) = \frac{1}{1 + \exp(-s)}:

    \min_w \ \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

    \Longrightarrow \ \min_w \ \underbrace{\frac{1}{N} \sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{in}(w)}     (note: the error on each example)

err(w, x, y) = ln(1 + exp(−y w^T x)):
cross-entropy error (note: ‘entropy’ measures disorder)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52
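A minimal sketch computing the cross-entropy in-sample error E_in(w) = (1/N) Σ ln(1 + exp(−y_n w^T x_n)) on synthetic data; np.logaddexp is used purely for numerical stability and is not part of the lecture.

import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)          # binary labels in {-1, +1}
w = rng.standard_normal(d + 1)

def cross_entropy_Ein(w, X, y):
    # (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)); logaddexp(0, z) = ln(1 + e^z), stable for large z
    scores = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -scores))

print(cross_entropy_Ein(w, X, y))
print(cross_entropy_Ein(np.zeros(d + 1), X, y))  # ln 2 at w = 0, since theta(0) = 1/2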


Linear Models Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/52


Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    \min_w \ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

• E_in(w): continuous, differentiable, twice-differentiable, convex
• how to minimize? locate the valley (note: E_in is bowl-shaped, like a valley)
  want ∇E_in(w) = 0

first: derive ∇E_in(w)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/52


Linear Models Gradient of Logistic Regression Error

The Gradient ∇E_in(w)

    E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\Big( \underbrace{1 + \exp(\overbrace{-y_n w^T x_n}^{\bigcirc})}_{\square} \Big)

(note: apply the chain rule)

    \frac{\partial E_{in}(w)}{\partial w_i}
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\partial \ln(\square)}{\partial \square} \Big) \Big( \frac{\partial (1 + \exp(\bigcirc))}{\partial \bigcirc} \Big) \Big( \frac{\partial (-y_n w^T x_n)}{\partial w_i} \Big)
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{1}{\square} \Big) \big( \exp(\bigcirc) \big) \big( -y_n x_{n,i} \big)
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\exp(\bigcirc)}{1 + \exp(\bigcirc)} \Big) \big( -y_n x_{n,i} \big)
      = \frac{1}{N} \sum_{n=1}^{N} \theta(\bigcirc) \big( -y_n x_{n,i} \big)

    \nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w^T x_n \big) \big( -y_n x_n \big)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/52
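A short numerical check of ∇E_in(w) = (1/N) Σ θ(−y_n w^T x_n)(−y_n x_n) against a finite-difference approximation, on synthetic data.

import numpy as np

rng = np.random.default_rng(5)
N, d = 80, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d + 1)

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

def E_in(w):
    # cross-entropy error; logaddexp(0, z) = ln(1 + e^z)
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

# analytic gradient: (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)

# finite-difference approximation of each partial derivative
eps = 1e-6
grad_fd = np.array([(E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)
                    for e in np.eye(d + 1)])

print(np.allclose(grad, grad_fd, atol=1e-5))  # True: the formula matches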




Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    \min_w \ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

    want \ \nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w^T x_n \big) \big( -y_n x_n \big) = 0

—a scaled θ-weighted sum of the −y_n x_n

• all θ(·) = 0: only if y_n w^T x_n ≫ 0
  —linearly separable D
• weighted sum = 0:
  a non-linear equation of w
(figure: the convex bowl of E_in over the weights w)

closed-form solution? no :-(
Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/52
Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):

       \text{sign}\big( w_t^T x_{n(t)} \big) \neq y_{n(t)}

2  (try to) correct the mistake by

       w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):

       \text{sign}\big( w_t^T x_{n(t)} \big) \neq y_{n(t)}

2  (try to) correct the mistake by

       w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}

1  (equivalently) pick some n, and update w_t by

       w_{t+1} \leftarrow w_t + \big[\!\big[ \text{sign}(w_t^T x_n) \neq y_n \big]\!\big] \, y_n x_n     (note: the last factor is the update direction)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  (equivalently) pick some n, and update w_t by

       w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{\big[\!\big[ \text{sign}(w_t^T x_n) \neq y_n \big]\!\big] \cdot y_n x_n}_{v}

   (note: η is the step size, v is the update direction)

when stop, return last w as g

choice of (η, v) and stopping condition defines
an iterative optimization approach

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/52


Linear Models Gradient Descent

Iterative Optimization
For t = 0, 1, . . .
    w_{t+1} \leftarrow w_t + \eta v     (note: η acts like a learning rate)
when stop, return last w as g
(note: two things to decide — the direction and the step size)

• PLA: v comes from mistake correction
• smooth E_in(w) for logistic regression:
  choose v to let the ball roll ‘downhill’?
  (figure: a ball rolling down the E_in surface over the weights w)
• direction v: (assumed) of unit length
• step size η: (assumed) positive

a greedy approach for some given η > 0:

    \min_{\|v\|=1} E_{in}\big( \underbrace{w_t + \eta v}_{w_{t+1}} \big)     (note: an inner optimization problem)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/52


Linear Models Gradient Descent

Linear Approximation
a greedy approach for some given η > 0:

    \min_{\|v\|=1} E_{in}(w_t + \eta v)

• still non-linear optimization, now with constraints
  —not any easier than min_w E_in(w)
• a local approximation by a linear formula makes the problem easier:

    E_{in}(w_t + \eta v) \approx E_{in}(w_t) + \eta \, v^T \nabla E_{in}(w_t)

  if η really small (Taylor expansion)
  (note: in one dimension the Taylor expansion is the tangent line; here it is extended to many dimensions)

an approximate greedy approach for some given small η:

    \min_{\|v\|=1} \ \underbrace{E_{in}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \, \underbrace{v^T \nabla E_{in}(w_t)}_{\text{known}}

(note: the first term is a constant and η is fixed, so we want the last term to be as small as possible)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/52
Linear Models Gradient Descent

Gradient Descent
an approximate greedy approach for some given small η:

    \min_{\|v\|=1} \ \underbrace{E_{in}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \, \underbrace{v^T \nabla E_{in}(w_t)}_{\text{known}}

• optimal v: the opposite direction of ∇E_in(w_t),

    v = -\frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|}     (note: the negative normalized gradient direction)

• gradient descent: for small η, update

    w_{t+1} \leftarrow w_t - \eta \, \frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|}

gradient descent:
a simple & popular optimization tool

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/52


Linear Models Gradient Descent
Choice of η

(figure: two plots of the in-sample error E_in versus the weights w, for a small and a large η)
η too small: too slow :-( (note: tiny steps)
η too large: too unstable :-( (note: huge steps — the update may even go uphill, and the Taylor approximation no longer holds)

a naive yet effective heuristic:
• choose the red η proportional to ‖∇E_in(w_t)‖, with the purple η (the fixed learning rate) as the ratio:

    w_{t+1} \leftarrow w_t - \eta_{\text{red}} \, \frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|} = w_t - \eta_{\text{purple}} \, \nabla E_{in}(w_t)

fixed learning rate gradient descent:

    w_{t+1} \leftarrow w_t - \eta \, \nabla E_{in}(w_t)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/52
Linear Models Gradient Descent

Putting Everything Together


Logistic Regression Algorithm
initialize w_0
For t = 0, 1, · · ·
1  compute

       \nabla E_{in}(w_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w_t^T x_n \big) \big( -y_n x_n \big)

2  update by

       w_{t+1} \leftarrow w_t - \eta \, \nabla E_{in}(w_t)

...until \nabla E_{in}(w_{t+1}) = 0 or enough iterations
return last w_{t+1} as g

O(N) time complexity in step 1 per iteration


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/52
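A compact sketch of the whole algorithm on synthetic, roughly linearly separable data: fixed-learning-rate gradient descent on the cross-entropy error. The choices η = 0.1 and 2000 iterations are arbitrary illustrations, not prescriptions from the lecture.

import numpy as np

rng = np.random.default_rng(6)
N, d = 300, 2
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
w_true = np.array([0.5, 2.0, -1.5])
y = np.sign(X @ w_true + 0.2 * rng.standard_normal(N))   # noisy linear labels in {-1, +1}

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

def gradient(w):
    # grad E_in(w) = (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
    return np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)

w = np.zeros(d + 1)          # initialize w_0
eta = 0.1                    # fixed learning rate (an arbitrary but common choice)
for t in range(2000):        # ...or stop once the gradient is (nearly) zero
    g = gradient(w)
    if np.linalg.norm(g) < 1e-6:
        break
    w = w - eta * g          # w_{t+1} <- w_t - eta * grad E_in(w_t)

print(w, np.mean(np.sign(X @ w) == y))   # weights roughly aligned with w_true, high accuracy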
Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/52


Linear Models Gradient Descent

Linear Models Revisited


linear scoring function: s = w^T x

(figure: the three model diagrams, each combining inputs x0, x1, …, xd into a score s and an output h(x))

linear classification:  h(x) = sign(s)   (note: right or wrong)
    plausible err = 0/1; discrete E_in(w): NP-hard to solve in general (hard)
linear regression:      h(x) = s
    friendly err = squared; quadratic convex E_in(w): closed-form solution (easy)
logistic regression:    h(x) = θ(s)   (note: a probability between 0 and 1)
    plausible err = cross-entropy; smooth convex E_in(w): gradient descent (easy)

can linear regression or logistic regression
help linear classification?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/52


Linear Models Gradient Descent

Error Functions Revisited


linear scoring function: s = w^T x
for binary classification y ∈ {−1, +1}

linear classification:  h(x) = sign(s),     err(h, x, y) = [[ h(x) ≠ y ]]
linear regression:      h(x) = s,           err(h, x, y) = (h(x) − y)^2
logistic regression:    h(x) = θ(s),        err(h, x, y) = −ln h(y x)
(note: compute the hypothesis output, then compare it with the label)

in terms of the score s:
    err_{0/1}(s, y) = [[ sign(s) ≠ y ]] = [[ sign(ys) ≠ 1 ]]
    err_{SQR}(s, y) = (s − y)^2 = (ys − 1)^2
    err_{CE}(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
(note: the larger ys, the better)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/52


Linear Models Gradient Descent

Visualizing Error Functions (note: comparing the shapes of the error functions, drawn over the same axis)

0/1:        err_{0/1}(s, y) = [[ sign(ys) ≠ 1 ]]
sqr:        err_{SQR}(s, y) = (ys − 1)^2
ce:         err_{CE}(s, y) = ln(1 + exp(−ys))
scaled ce:  err_{SCE}(s, y) = log_2(1 + exp(−ys))

(figure: err versus ys for the four error functions, plotted for ys between −3 and 3)

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1
  (note: squared error penalizes points that 0/1 already gets right, yet it behaves similarly to 0/1 in the region where 0/1 errs)
  small err_{SQR} → small err_{0/1}
• ce: monotonic in ys
  small err_{CE} ↔ small err_{0/1}
• scaled ce (note: the same curve with log base 2): a proper upper bound of 0/1
  (note: it touches the 0/1 curve, so the two track each other)
  small err_{SCE} ↔ small err_{0/1}

upper bound:
useful for designing algorithms
Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/52
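A small computed table (rather than a plot) comparing the four error functions at a few values of ys; it simply evaluates the formulas listed above and checks the upper-bound claim numerically.

import numpy as np

def err01(ys):  return float(ys <= 0)                 # [[sign(ys) != 1]]
def errsqr(ys): return (ys - 1.0) ** 2
def errce(ys):  return np.log(1.0 + np.exp(-ys))
def errsce(ys): return np.log2(1.0 + np.exp(-ys))     # scaled ce = ce / ln 2

print(" ys    0/1     sqr      ce     sce")
for ys in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{ys:4.1f}  {err01(ys):5.2f}  {errsqr(ys):6.2f}  {errce(ys):6.3f}  {errsce(ys):6.3f}")

# scaled ce upper-bounds 0/1 everywhere, e.g.:
print(all(errsce(ys) >= err01(ys) for ys in np.linspace(-5, 5, 101)))   # True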
Linear Models Gradient Descent

Learning Flow with Algorithmic Error Measure


(learning-flow diagram)
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula),
together with unknown P on X, generates x1, x2, · · · , xN and y1, y2, · · · , yN

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A, guided by êrr → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula); error measure err

err: the goal, not always easy to optimize;
êrr: something ‘similar’ that facilitates A, e.g. an upper bound
Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/52
Linear Models Gradient Descent

Theoretical Implication of Upper Bound


For any ys where s = w^T x
(note: if we care about the 0/1 error, the scaled cross-entropy error is an upper bound for it):

    err_{0/1}(s, y) \le err_{SCE}(s, y) = \frac{1}{\ln 2} \, err_{CE}(s, y)

    \Longrightarrow  E_{in}^{0/1}(w) \le E_{in}^{SCE}(w) = \frac{1}{\ln 2} E_{in}^{CE}(w)     (note: averaging preserves the bound)
                     E_{out}^{0/1}(w) \le E_{out}^{SCE}(w) = \frac{1}{\ln 2} E_{out}^{CE}(w)

VC on 0/1 (note: starting from E_in):
    E_{out}^{0/1}(w) \le E_{in}^{0/1}(w) + \Omega^{0/1} \le \frac{1}{\ln 2} E_{in}^{CE}(w) + \Omega^{0/1}

VC-Reg on CE (note: starting from E_out):
    E_{out}^{0/1}(w) \le \frac{1}{\ln 2} E_{out}^{CE}(w) \le \frac{1}{\ln 2} E_{in}^{CE}(w) + \frac{1}{\ln 2} \Omega^{CE}

small E_{in}^{CE}(w) \Longrightarrow small E_{out}^{0/1}(w):
logistic/linear reg. for linear classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/52


Linear Models Gradient Descent
Regression for Classification
(note: use regression to get the weights, then take the sign of the score)

1  run logistic/linear reg. on D with y_n ∈ {−1, +1} to get w_{REG}
2  return g(x) = sign(w_{REG}^T x)

PLA
• pros: efficient + strong guarantee if lin. separable
• cons: works only if lin. separable

linear regression (note: one formula and it is solved)
• pros: ‘easiest’ optimization
• cons: loose bound of err_{0/1} for large |ys|
  (note: when |ys| is large, the squared curve sits far above 0/1, so it is only a loose upper bound)

logistic regression
• pros: ‘easy’ optimization
• cons: loose bound of err_{0/1} for very negative ys
  (note: when ys is very negative, the curve looks quite different from 0/1)

• linear regression sometimes used to set w_0 for PLA/logistic regression (note: as initialization)
• logistic regression often preferred in practice
(note: if linear regression looks reasonable, take its w as w_0 and continue optimizing from that good starting point)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/52
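A sketch of regression-for-classification on synthetic data: run linear regression with ±1 targets, classify with sign(w_REG^T x), and optionally continue with logistic-regression gradient descent starting from w_REG (the η and iteration counts are illustrative).

import numpy as np

rng = np.random.default_rng(7)
N, d = 400, 2
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = np.sign(X @ np.array([-0.3, 1.0, 2.0]) + 0.3 * rng.standard_normal(N))  # labels in {-1, +1}

# step 1: run linear regression on D with y_n in {-1, +1}
w_reg = np.linalg.pinv(X) @ y

# step 2: g(x) = sign(w_REG^T x)
print("linear-regression classifier accuracy:", np.mean(np.sign(X @ w_reg) == y))

# optional: use w_REG as w_0 for logistic regression and keep optimizing
theta = lambda s: 1.0 / (1.0 + np.exp(-s))
w = w_reg.copy()
eta = 0.1
for t in range(500):
    grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
    w -= eta * grad
print("after logistic-regression fine-tuning:  ", np.mean(np.sign(X @ w) == y))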


Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/52


Linear Models Stochastic Gradient Descent

Two Iterative Optimization Schemes


For t = 0, 1, . . .
    w_{t+1} \leftarrow w_t + \eta v     (note: step by step, make w better and better)
when stop, return last w as g
(note: each round picks something to correct)

PLA: pick (x_n, y_n) and decide w_{t+1} by the one example
  O(1) time per iteration :-)   (note: each round looks at a single point)
logistic regression: check D and decide w_{t+1} (or new ŵ) by all examples
  O(N) time per iteration :-(   (note: each round looks at every point, averages their gradient contributions, and then steps downhill)

logistic regression with
O(1) time per iteration?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/52


Linear Models Stochastic Gradient Descent

Logistic Regression Revisited (note: the algorithm)

    w_{t+1} \leftarrow w_t + \eta \, \underbrace{\frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w_t^T x_n \big) \big( y_n x_n \big)}_{-\nabla E_{in}(w_t)}

• want: update direction v ≈ −∇E_in(w_t) (note: the update direction should stay close to the gradient direction),
  while computing v from one single (x_n, y_n)
• technique for removing \frac{1}{N}\sum_{n=1}^{N} (note: pick one point at random instead of spending N times the effort to average all the gradients):
  view it as an expectation E over a uniform choice of n!

stochastic gradient: \nabla_w \, \text{err}(w, x_n, y_n) with random n
(note: a random gradient — not the true overall one, but the gradient at a single point)

true gradient: \nabla_w E_{in}(w) = \mathbb{E}_{\text{random } n} \, \nabla_w \, \text{err}(w, x_n, y_n)
(note: view the overall gradient as the expected value of this random process)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/52


Linear Models Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)


stochastic gradient = true gradient + zero-mean ‘noise’ directions

Stochastic Gradient Descent
• idea: replace the true gradient by the stochastic gradient
  (note: descend using the random gradient rather than the true one)
• after enough steps,
  average true gradient ≈ average stochastic gradient
  (note: run enough steps and, on average, the true gradient matches the stochastic one)
• pros: simple & cheaper computation :-)
  —useful for big data or online learning
• cons: less stable in nature

SGD logistic regression, looks familiar? :-):

    w_{t+1} \leftarrow w_t + \eta \, \underbrace{\theta\big( -y_n w_t^T x_n \big) \big( y_n x_n \big)}_{-\nabla \text{err}(w_t, x_n, y_n)}     (note: the update direction)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/52
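A minimal SGD sketch for logistic regression: each step picks one example uniformly at random and applies w ← w + η θ(−y_n w^T x_n) y_n x_n; η = 0.1 and the step count follow the rules of thumb on the next slide but are otherwise arbitrary.

import numpy as np

rng = np.random.default_rng(8)
N, d = 500, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N))

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

w = np.zeros(d + 1)
eta = 0.1                       # rule-of-thumb learning rate
for t in range(20000):          # rule of thumb: just run "long enough"
    n = rng.integers(N)         # pick one example uniformly at random
    # stochastic gradient step: w <- w + eta * theta(-y_n w^T x_n) * y_n x_n
    w += eta * theta(-y[n] * (w @ X[n])) * y[n] * X[n]

print("SGD logistic regression training accuracy:", np.mean(np.sign(X @ w) == y))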


Linear Models Stochastic Gradient Descent

PLA Revisited
SGD logistic regression:
    w_{t+1} \leftarrow w_t + \eta \cdot \theta\big( -y_n w_t^T x_n \big) \, y_n x_n

PLA:
    w_{t+1} \leftarrow w_t + 1 \cdot \big[\!\big[ y_n \neq \text{sign}(w_t^T x_n) \big]\!\big] \, y_n x_n

• SGD logistic regression ≈ ‘soft’ PLA
• PLA ≈ SGD logistic regression with η = 1 when w_t^T x_n is large

two practical rules of thumb:
• stopping condition? t large enough (note: it is hard to decide, so trust that running long enough suffices)
• η? 0.1 when x is in a proper range

Hsuan-Tien Lin (NTU CSIE) Machine Learning 50/52


Linear Models Stochastic Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 51/52


Linear Models Stochastic Gradient Descent

Summary
1 Why Can Machines Learn?

Lecture 4: Theory of Generalization


2 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent
• next: beyond simple linear models

Hsuan-Tien Lin (NTU CSIE) Machine Learning 52/52
