
Machine Learning
Lecture 5: Linear Models

Hsuan-Tien Lin
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University

Hsuan-Tien Lin (NTU CSIE) Machine Learning 0/52


Linear Models

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent

Hsuan-Tien Lin (NTU CSIE) Machine Learning 1/52


Linear Models Linear Regression Problem

Credit Limit Problem


customer data: age 23 years; gender female; annual salary NTD 1,000,000;
year in residence 1 year; year in job 0.5 year; current debt 200,000;
credit limit? 100,000

unknown target function f: X → Y (ideal credit limit formula)
(note: how much credit limit should a customer get? it differs from customer to customer)

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula)

(note: the key feature of regression is that the output space is the whole set of real numbers)
Y = R: regression
Hsuan-Tien Lin (NTU CSIE) Machine Learning 2/52
Linear Models Linear Regression Problem

Linear Regression Hypothesis


(note: what should H look like when the output space is the whole real line?)

customer data: age 23 years; annual salary NTD 1,000,000;
year in job 0.5 year; current debt 200,000

• For x = (x0, x1, x2, · · · , xd) ‘features of customer’ (note: the customer's data),
  approximate the desired credit limit with a weighted sum (note: weighted by the wi):

      y \approx \sum_{i=0}^{d} w_i x_i

• linear regression hypothesis: h(x) = w^T x

h(x): like perceptron, but without the sign
(note: PLA takes the sign, positive or negative; here we do not)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 3/52
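A minimal Python/NumPy sketch of the weighted-sum hypothesis h(x) = w^T x above; the feature encoding and the weight values are invented for illustration and are not from the lecture.

import numpy as np

# features of one customer, with x0 = 1 as the constant term:
# (1, age, annual salary in millions, years in job, current debt in millions)
x = np.array([1.0, 23.0, 1.0, 0.5, 0.2])

# hypothetical weight vector w = (w0, w1, ..., wd); not from the lecture
w = np.array([5.0, 0.3, 40.0, 2.0, -30.0])

# linear regression hypothesis h(x) = w^T x: a weighted sum of the features
credit_limit = w @ x
print(credit_limit)   # some real number: an estimated credit limit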


Linear Models Linear Regression Problem

Illustration of Linear Regression


(figure: left, x = (x) ∈ R, a line fitted to points in the (x, y) plane; right, x = (x1, x2) ∈ R^2, a plane fitted to points in (x1, x2, y) space)
(note: in three-dimensional space the hypothesis forms a plane; we measure how close each point is to the hypothesis)

linear regression:
find lines/hyperplanes with small residuals

Hsuan-Tien Lin (NTU CSIE) Machine Learning 4/52


Linear Models Linear Regression Problem

Pointwise Error Measure for ‘Small Residuals’


final hypothesis g ≈ f

how well? often use averaged err(g(x), f(x)), like

    E_{out}(g) = \mathbb{E}_{x \sim P} \big[\!\big[ g(x) \neq f(x) \big]\!\big]     (the bracketed quantity is err(g(x), f(x)))

—err: called pointwise error measure

in-sample:       E_{in}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(x_n), f(x_n))
out-of-sample:   E_{out}(g) = \mathbb{E}_{x \sim P}\, \text{err}(g(x), f(x))

will mainly consider pointwise err for simplicity


Hsuan-Tien Lin (NTU CSIE) Machine Learning 5/52
Linear Models Linear Regression Problem

Learning Flow with Pointwise Error Measure


(learning-flow diagram)
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula),
together with unknown P on X, generates x1, x2, · · · , xN and y1, y2, · · · , yN

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula); error measure err

extended VC theory/‘philosophy’
works for most H and err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 6/52


Linear Models Linear Regression Problem

Two Important Pointwise Error Measures


err(ỹ, y), with shorthand ỹ = g(x) and y = f(x)

0/1 error:       err(ỹ, y) = [[ ỹ ≠ y ]]
                 • correct or incorrect?    • often for classification

squared error:   err(ỹ, y) = (ỹ − y)^2
                 • how far is ỹ from y?     • often for regression

squared error: quantify ‘small residual’

Hsuan-Tien Lin (NTU CSIE) Machine Learning 7/52


Linear Models Linear Regression Problem

Squared Error Measure for Regression


popular/historical error measure for linear regression:
squared error err(ŷ, y) = (ŷ − y)^2

in-sample:       E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(\underbrace{h(x_n)}_{w^T x_n} - y_n\big)^2
out-of-sample:   E_{out}(w) = \mathbb{E}_{(x,y) \sim P} \big(w^T x - y\big)^2

next: how to minimize Ein (w)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 8/52


Linear Models Linear Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 9/52


Linear Models Linear Regression Algorithm

Matrix Form of Ein (w)


(note: how do we make E_in as small as possible?)

E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n^T w - y_n)^2
(note: each term compares the hypothesis's prediction with the desired value)

          = \frac{1}{N} \left\| \begin{bmatrix} x_1^T w - y_1 \\ x_2^T w - y_2 \\ \vdots \\ x_N^T w - y_N \end{bmatrix} \right\|^2
          = \frac{1}{N} \left\| \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} w - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \right\|^2     (note: factor out the shared w)

          = \frac{1}{N} \, \| X w - y \|^2     with X of size N × (d+1), w of size (d+1) × 1, y of size N × 1

Hsuan-Tien Lin (NTU CSIE) Machine Learning 10/52
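A small NumPy check, on randomly generated toy data, that the averaged sum of squared residuals equals the matrix form (1/N)‖Xw − y‖^2 derived above.

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # N x (d+1), with x0 = 1
y = rng.standard_normal(N)                                      # N targets
w = rng.standard_normal(d + 1)                                  # (d+1) weights

# sum form: (1/N) * sum_n (w^T x_n - y_n)^2
E_in_sum = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(N)])

# matrix form: (1/N) * ||Xw - y||^2
E_in_mat = np.linalg.norm(X @ w - y) ** 2 / N

print(np.isclose(E_in_sum, E_in_mat))  # True: the two forms agree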


Linear Models Linear Regression Algorithm
(note: the only variable is w; X and y are known)

    \min_w E_{in}(w) = \frac{1}{N} \| Xw - y \|^2

• E_in(w): continuous, differentiable, convex

• necessary condition of ‘best’ w (note: the partial derivative in every direction equals 0):

    \nabla E_{in}(w) \equiv \begin{bmatrix} \frac{\partial E_{in}}{\partial w_0}(w) \\ \frac{\partial E_{in}}{\partial w_1}(w) \\ \vdots \\ \frac{\partial E_{in}}{\partial w_d}(w) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}

(note: at the lowest point of E_in, no direction can decrease the function value any further; the gradient is 0)
—not possible to ‘roll down’

task: find w_{LIN} such that \nabla E_{in}(w_{LIN}) = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning 11/52


Linear Models Linear Regression Algorithm

The Gradient ∇E_in(w)

    E_{in}(w) = \frac{1}{N} \| Xw - y \|^2 = \frac{1}{N} \Big( w^T \underbrace{X^T X}_{A} w \;-\; 2 w^T \underbrace{X^T y}_{b} \;+\; \underbrace{y^T y}_{c} \Big)

one w only (note: when w has a single dimension):
    E_{in}(w) = \frac{1}{N} (a w^2 - 2bw + c),      \nabla E_{in}(w) = \frac{1}{N} (2aw - 2b)      — simple! :-)

vector w (note: when w is a vector):
    E_{in}(w) = \frac{1}{N} (w^T A w - 2 w^T b + c),      \nabla E_{in}(w) = \frac{1}{N} (2Aw - 2b)      — similar (derived by definition)

(note: when is the gradient 0?)

    \nabla E_{in}(w) = \frac{2}{N} \big( X^T X w - X^T y \big)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 12/52
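A short numerical sanity check of the gradient formula ∇E_in(w) = (2/N)(X^T X w − X^T y), compared against a finite-difference approximation on synthetic data (all numbers are illustrative).

import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.standard_normal(N)
w = rng.standard_normal(d + 1)

def E_in(w):
    return np.linalg.norm(X @ w - y) ** 2 / N

# analytic gradient from the slide: (2/N) (X^T X w - X^T y)
grad = 2.0 / N * (X.T @ X @ w - X.T @ y)

# finite-difference approximation of each partial derivative
eps = 1e-6
grad_fd = np.array([
    (E_in(w + eps * np.eye(d + 1)[i]) - E_in(w - eps * np.eye(d + 1)[i])) / (2 * eps)
    for i in range(d + 1)
])

print(np.allclose(grad, grad_fd, atol=1e-5))  # True: the formula matches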


Linear Models Linear Regression Algorithm

Optimal Linear Regression Weights


task: find w_{LIN} such that \frac{2}{N}\big(X^T X w - X^T y\big) = \nabla E_{in}(w) = 0
(note: only w is unknown here)

invertible X^T X (note: the inverse exists):
• easy! unique solution
      w_{LIN} = \underbrace{(X^T X)^{-1} X^T}_{\text{pseudo-inverse } X^\dagger} \, y
• often the case because N ≫ d + 1

singular X^T X:
• many optimal solutions
• one of the solutions is
      w_{LIN} = X^\dagger y
  by defining X^\dagger in other ways

practical suggestion:
use a well-implemented † routine
instead of (X^T X)^{-1} X^T
for numerical stability when X^T X is almost singular
Hsuan-Tien Lin (NTU CSIE) Machine Learning 13/52
Linear Models Linear Regression Algorithm

Linear Regression Algorithm


1  from D, construct input matrix X and output vector y (note: build the matrices):

       X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}, \qquad
       y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^{N \times 1}

2  calculate the pseudo-inverse X^\dagger \in \mathbb{R}^{(d+1) \times N}

3  return w_{LIN} = X^\dagger y \in \mathbb{R}^{(d+1) \times 1}

simple and efficient
with good † routine
Hsuan-Tien Lin (NTU CSIE) Machine Learning 14/52
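A minimal sketch of the three steps with NumPy, where np.linalg.pinv plays the role of the '†' routine and the data is synthetic; np.linalg.lstsq is shown as the usual well-implemented alternative.

import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 5
w_true = rng.standard_normal(d + 1)

# step 1: construct X (N x (d+1), first column all ones) and y (N targets)
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = X @ w_true + 0.1 * rng.standard_normal(N)     # targets = linear signal + noise

# steps 2 and 3: w_LIN = pseudo-inverse(X) @ y
w_lin = np.linalg.pinv(X) @ y

# equivalent (and often preferred) least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_lin, w_lstsq))     # True
print(np.abs(w_lin - w_true).max())    # small: close to the true weights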
Linear Models Linear Regression Algorithm

Is Linear Regression a ‘Learning Algorithm’?


w_{LIN} = X^\dagger y

No!
• analytic (closed-form) solution, ‘instantaneous’
• not improving E_in nor E_out iteratively
  (note: the result comes out in one shot — no learning process? the result is never updated as data is observed!)

Yes! (note: from some angles it does count as a machine learning algorithm)
1  good E_in? yes, optimal!
2  good E_out? yes, finite d_VC like perceptrons
   (note: with enough data, good E_in leads to good E_out)
3  improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_{LIN}) is good, learning ‘happened’!

Hsuan-Tien Lin (NTU CSIE) Machine Learning 15/52


Linear Regression Generalization Issue

(note: how can we guarantee that E_in is good?)
Benefit of Analytic Solution:
‘Simpler-than-VC’ Guarantee

to be shown:   \overline{E_{in}} = \mathbb{E}_{D \sim P^N} \big\{ E_{in}(w_{LIN} \text{ w.r.t. } D) \big\} = \text{noise level} \cdot \Big(1 - \frac{d+1}{N}\Big)
(note: the noise level is the noise in the data, the part the distribution gets wrong; the expectation averages over all datasets that could be drawn)

    E_{in}(w_{LIN}) = \frac{1}{N} \| y - \underbrace{\hat{y}}_{\text{predictions}} \|^2 = \frac{1}{N} \| y - X \underbrace{X^\dagger y}_{w_{LIN}} \|^2
                   = \frac{1}{N} \| (\underbrace{I}_{\text{identity}} - X X^\dagger) y \|^2

call X X^\dagger the hat matrix H
because it puts ^ on y

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/23


Linear Regression Generalization Issue

Geometric View of Hat Matrix

(figure: y, ŷ, and y − ŷ drawn relative to the span of X, in R^N)

• ŷ = X w_{LIN} lies within the span of the columns of X
• ‖y − ŷ‖ smallest: y − ŷ ⊥ span
• H: projects y to ŷ ∈ span (note: a projection)
• I − H: transforms y to y − ŷ ⊥ span

claim: trace(I − H) = N − (d + 1). Why? :-)
(note: the trace is the sum of all values on the diagonal)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/23
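A quick numerical check of the hat-matrix claims on synthetic data: H = XX† is idempotent, the residual y − ŷ is orthogonal to the span of X's columns, and trace(I − H) = N − (d + 1).

import numpy as np

rng = np.random.default_rng(3)
N, d = 30, 4
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])  # full column rank (almost surely)
y = rng.standard_normal(N)

H = X @ np.linalg.pinv(X)          # hat matrix H = X X^dagger
y_hat = H @ y                      # puts the 'hat' on y

print(np.allclose(H @ H, H))                              # True: H is a projection (idempotent)
print(np.allclose(X.T @ (y - y_hat), 0))                  # True: y - y_hat is orthogonal to span of X
print(np.isclose(np.trace(np.eye(N) - H), N - (d + 1)))   # True: trace(I - H) = N - (d+1)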


Linear Regression Generalization Issue

An Illustrative ‘Proof’
(figure: y, y − ŷ, the noise, and the ideal f(X) drawn relative to the span of X)

• if y comes from some ideal f(X) ∈ span plus noise
• noise with per-dimension ‘noise level’ σ^2 is transformed by I − H to become y − ŷ

    E_{in}(w_{LIN}) = \frac{1}{N} \| y - \hat{y} \|^2 = \frac{1}{N} \| (I - H)\,\text{noise} \|^2 = \frac{1}{N} \big( N - (d+1) \big) \sigma^2

    \overline{E_{in}} = \sigma^2 \cdot \Big( 1 - \frac{d+1}{N} \Big)     (note: fitting the data we have seen makes E_in look good)

    \overline{E_{out}} = \sigma^2 \cdot \Big( 1 + \frac{d+1}{N} \Big)  (complicated!)     (note: measured on new data; the gap between the two is about 2(d+1)/N)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/23
Linear Models Linear Regression Algorithm

The Learning Curves of Linear Regression


(proof skipped this year)

(figure: learning curves — Expected Error versus Number of Data Points N; E_out decreases and E_in increases toward the noise level σ^2, starting around N = d+1)

    \overline{E_{out}} = \text{noise level} \cdot \Big( 1 + \frac{d+1}{N} \Big)
    \overline{E_{in}}  = \text{noise level} \cdot \Big( 1 - \frac{d+1}{N} \Big)

• both converge to σ^2 (the noise level) for N → ∞
• expected generalization error: \frac{2(d+1)}{N}
  —similar to the worst-case guarantee from VC

linear regression (LinReg):
learning ‘happened’!
Hsuan-Tien Lin (NTU CSIE) Machine Learning 16/52
Linear Models Linear Regression Algorithm

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 17/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (1/2)


patient data: age 40 years; gender male; blood pressure 130/85;
cholesterol level 240; weight 70; heart disease? yes
(note: a binary classification problem)

unknown target distribution P(y|x) containing f(x) + noise

training examples D: (x1, y1), · · · , (xN, yN)
→ learning algorithm A → final hypothesis g ≈ f

hypothesis set H; error measure err

binary classification:
ideal f(x) = sign\big( P(+1|x) - \frac{1}{2} \big) ∈ {−1, +1}
because of classification err

Hsuan-Tien Lin (NTU CSIE) Machine Learning 18/52


Linear Models Logistic Regression Problem

Heart Attack Prediction Problem (2/2)


patient data: age 40 years; gender male; blood pressure 130/85;
cholesterol level 240; weight 70; heart attack? 80% risk
(note: the risk of having a heart attack)

unknown target distribution P(y|x) containing f(x) + noise

training examples D: (x1, y1), · · · , (xN, yN)
→ learning algorithm A → final hypothesis g ≈ f

hypothesis set H; error measure err

‘soft’ binary classification:
f(x) = P(+1|x) ∈ [0, 1]
(note: this value, the risk, is what we want to know)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 19/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data                          actual (noisy) data
(x1, y'1 = 0.9 = P(+1|x1))                      (x1, y1 = ◦ ∼ P(y|x1))
(x2, y'2 = 0.2 = P(+1|x2))                      (x2, y2 = × ∼ P(y|x2))
      ...                                             ...
(xN, y'N = 0.6 = P(+1|xN))                      (xN, yN = × ∼ P(y|xN))

(note: we usually do not have the ideal data in hand — as before, we only see the right/wrong labels)

same data as hard binary classification,
different target function

Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Soft Binary Classification


target function f(x) = P(+1|x) ∈ [0, 1]
(note: view the data as a noisy version of the ideal one — sampled according to 0.9, 0.2, …)

ideal (noiseless) data                          actual (noisy) data
(x1, y'1 = 0.9 = P(+1|x1))                      (x1, y'1 ≟ 1 = [[◦]] ∼ P(y|x1))
(x2, y'2 = 0.2 = P(+1|x2))                      (x2, y'2 ≟ 0 = [[×]] ∼ P(y|x2))
      ...                                             ...
(xN, y'N = 0.6 = P(+1|xN))                      (xN, y'N ≟ 0 = [[×]] ∼ P(y|xN))

same data as hard binary classification,
different target function
(note: how can we find a good hypothesis when what we observe is only the 0/1-valued output?)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 20/52


Linear Models Logistic Regression Problem

Logistic Hypothesis (note: commonly used)

patient data: age 40 years; gender male; blood pressure 130/85; cholesterol level 240

• For x = (x0, x1, x2, · · · , xd) ‘features of patient’, calculate a weighted ‘risk score’
  (note: a weighted sum of the features, with x0 the constant term):

      s = \sum_{i=0}^{d} w_i x_i

• convert the score to an estimated probability by the logistic function θ(s)
  (figure: the S-shaped curve θ(s), rising from 0 toward 1 as s increases)
  (note: turns the score into a value between 0 and 1)

logistic hypothesis: h(x) = θ(w^T x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 21/52


Linear Models Logistic Regression Problem

Logistic Function
(figure: the logistic curve θ(s), rising from 0 toward 1)

θ(−∞) = 0;   θ(0) = 1/2;   θ(∞) = 1
(note: close to 0 when the score is very low, close to 1 when the score is very high)

    \theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}     (an analytic form)

—smooth, monotonic, sigmoid function of s

logistic regression: use

    h(x) = \frac{1}{1 + \exp(-w^T x)}

to approximate target function f(x) = P(+1|x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 22/52
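A small sketch of the logistic function and the logistic hypothesis h(x) = θ(w^T x); the patient features and weights below are hypothetical values chosen only for illustration.

import numpy as np

def theta(s):
    # logistic function: smooth, monotonic, sigmoid; maps scores to (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

def logistic_h(w, x):
    # logistic hypothesis: estimated P(+1 | x)
    return theta(w @ x)

# hypothetical patient features with x0 = 1, and hypothetical weights
x = np.array([1.0, 40.0, 1.0, 130.0, 240.0])
w = np.array([-8.0, 0.05, 0.3, 0.02, 0.01])

print(theta(-np.inf), theta(0.0), theta(np.inf))  # 0.0, 0.5, 1.0
print(logistic_h(w, x))                           # a risk estimate in (0, 1)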


Linear Models Logistic Regression Problem

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 23/52


Linear Models Logistic Regression Error

Three Linear Models


linear scoring function: s = w^T x

(figure: three network-style diagrams, each combining inputs x0, x1, x2, …, xd into a score s and then an output h(x))

linear classification:  h(x) = sign(s)
    plausible err = 0/1 (small flipping noise)   (note: PLA tries to make as few errors as possible)
linear regression:      h(x) = s
    friendly err = squared (easy to minimize)
logistic regression:    h(x) = θ(s)   (note: the score is passed through the sigmoid)
    err = ?

how to define
E_in(w) for logistic regression?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 24/52


Linear Models Logistic Regression Error

Likelihood

(note: y takes one of two values)

target function f(x) = P(+1|x)   ⇔   P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1

consider D = {(x1, ◦), (x2, ×), . . . , (xN, ×)}

probability that f generates D (note: the probability of producing this data is …):
    P(x1) P(◦|x1) × P(x2) P(×|x2) × . . . × P(xN) P(×|xN)

likelihood that h generates D (note: pretend h is f and ask how plausible the same data would be; since h is not the true f, this is called a likelihood, not a probability):
    P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))

(note: maximum likelihood)
• if h ≈ f (note: when h is good, its likelihood is similar to that of f),
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

target function f(x) = P(+1|x)   ⇔   P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1
(note: assuming D is i.i.d.)

consider D = {(x1, ◦), (x2, ×), . . . , (xN, ×)}

probability that f generates D:
    P(x1) f(x1) × P(x2) (1 − f(x2)) × . . . × P(xN) (1 − f(xN))
likelihood that h generates D (pretend h is f):
    P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))

• if h ≈ f,
  then likelihood(h) ≈ probability using f
• probability using f usually large
Hsuan-Tien Lin (NTU CSIE) Machine Learning 25/52
Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = \text{argmax}_h \ \text{likelihood}(h)

when logistic: h(x) = θ(w^T x), and
    1 − h(x) = h(−x)
(note: a symmetry of the logistic function — rotating its curve by 180° gives the same shape)

likelihood(h) = P(x1) h(x1) × P(x2) (1 − h(x2)) × . . . × P(xN) (1 − h(xN))
(note: the P(xn) factors are the same for every h)

    \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Likelihood of Logistic Hypothesis


likelihood(h) ≈ (probability using f) ≈ large

    g = \text{argmax}_h \ \text{likelihood}(h)

when logistic: h(x) = θ(w^T x), and
    1 − h(x) = h(−x)

for the labels ◦, ×, . . . , ×:
likelihood(h) = P(x1) h(+x1) × P(x2) h(−x2) × . . . × P(xN) h(−xN)
(note: the factors are h(y1 x1), h(y2 x2), . . . , h(yN xN))

    \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 26/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_h \ \text{likelihood(logistic } h) \propto \prod_{n=1}^{N} h(y_n x_n)

(note: if the label is ◦, the factor uses +x_n; if the label is ×, it uses −x_n)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_w \ \text{likelihood}(w) \propto \prod_{n=1}^{N} \theta\big( y_n w^T x_n \big)

(note: written in terms of the weights w, the quantity we are really interested in)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \max_w \ \ln \prod_{n=1}^{N} \theta\big( y_n w^T x_n \big) = \sum_{n=1}^{N} \ln \theta\big( y_n w^T x_n \big)

(note: taking the logarithm turns the product into a sum)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52


Linear Models Logistic Regression Error

Cross-Entropy Error
    \min_w \ \frac{1}{N} \sum_{n=1}^{N} -\ln \theta\big( y_n w^T x_n \big)

(note: flipping max into min and averaging gives an error function)

with \theta(s) = \frac{1}{1 + \exp(-s)}:

    \min_w \ \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

    \Longrightarrow \ \min_w \ \underbrace{\frac{1}{N} \sum_{n=1}^{N} \text{err}(w, x_n, y_n)}_{E_{in}(w)}     (note: the error on each example)

err(w, x, y) = ln(1 + exp(−y w^T x)):
cross-entropy error (note: ‘entropy’ measures disorder)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 27/52
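A minimal sketch computing the cross-entropy in-sample error E_in(w) = (1/N) Σ ln(1 + exp(−y_n w^T x_n)) on synthetic data; np.logaddexp is used purely for numerical stability and is not part of the lecture.

import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)          # binary labels in {-1, +1}
w = rng.standard_normal(d + 1)

def cross_entropy_Ein(w, X, y):
    # (1/N) * sum_n ln(1 + exp(-y_n w^T x_n)); logaddexp(0, z) = ln(1 + e^z), stable for large z
    scores = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -scores))

print(cross_entropy_Ein(w, X, y))
print(cross_entropy_Ein(np.zeros(d + 1), X, y))  # ln 2 at w = 0, since theta(0) = 1/2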


Linear Models Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 28/52


Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    \min_w \ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

• E_in(w): continuous, differentiable, twice-differentiable, convex
• how to minimize? locate the valley (note: E_in is bowl-shaped, like a valley)
  want ∇E_in(w) = 0

first: derive ∇E_in(w)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 29/52


Linear Models Gradient of Logistic Regression Error

The Gradient ∇E_in(w)

    E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\Big( \underbrace{1 + \exp(\overbrace{-y_n w^T x_n}^{\bigcirc})}_{\square} \Big)

(note: apply the chain rule)

    \frac{\partial E_{in}(w)}{\partial w_i}
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\partial \ln(\square)}{\partial \square} \Big) \Big( \frac{\partial (1 + \exp(\bigcirc))}{\partial \bigcirc} \Big) \Big( \frac{\partial (-y_n w^T x_n)}{\partial w_i} \Big)
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{1}{\square} \Big) \big( \exp(\bigcirc) \big) \big( -y_n x_{n,i} \big)
      = \frac{1}{N} \sum_{n=1}^{N} \Big( \frac{\exp(\bigcirc)}{1 + \exp(\bigcirc)} \Big) \big( -y_n x_{n,i} \big)
      = \frac{1}{N} \sum_{n=1}^{N} \theta(\bigcirc) \big( -y_n x_{n,i} \big)

    \nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w^T x_n \big) \big( -y_n x_n \big)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 30/52
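A short numerical check of ∇E_in(w) = (1/N) Σ θ(−y_n w^T x_n)(−y_n x_n) against a finite-difference approximation, on synthetic data.

import numpy as np

rng = np.random.default_rng(5)
N, d = 80, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = rng.choice([-1.0, 1.0], size=N)
w = rng.standard_normal(d + 1)

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

def E_in(w):
    # cross-entropy error; logaddexp(0, z) = ln(1 + e^z)
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

# analytic gradient: (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)

# finite-difference approximation of each partial derivative
eps = 1e-6
grad_fd = np.array([(E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)
                    for e in np.eye(d + 1)])

print(np.allclose(grad, grad_fd, atol=1e-5))  # True: the formula matches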




Linear Models Gradient of Logistic Regression Error

Minimizing Ein (w)


    \min_w \ E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\big( 1 + \exp(-y_n w^T x_n) \big)

    want \ \nabla E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w^T x_n \big) \big( -y_n x_n \big) = 0

—a scaled θ-weighted sum of the −y_n x_n

• all θ(·) = 0: only if y_n w^T x_n ≫ 0
  —linearly separable D
• weighted sum = 0:
  a non-linear equation of w
(figure: the convex bowl of E_in over the weights w)

closed-form solution? no :-(
Hsuan-Tien Lin (NTU CSIE) Machine Learning 31/52
Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):

       \text{sign}\big( w_t^T x_{n(t)} \big) \neq y_{n(t)}

2  (try to) correct the mistake by

       w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):

       \text{sign}\big( w_t^T x_{n(t)} \big) \neq y_{n(t)}

2  (try to) correct the mistake by

       w_{t+1} \leftarrow w_t + y_{n(t)} x_{n(t)}

1  (equivalently) pick some n, and update w_t by

       w_{t+1} \leftarrow w_t + \big[\!\big[ \text{sign}(w_t^T x_n) \neq y_n \big]\!\big] \, y_n x_n     (note: the last factor is the update direction)

when stop, return last w as g

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

PLA Revisited: Iterative Optimization


PLA: start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1  (equivalently) pick some n, and update w_t by

       w_{t+1} \leftarrow w_t + \underbrace{1}_{\eta} \cdot \underbrace{\big[\!\big[ \text{sign}(w_t^T x_n) \neq y_n \big]\!\big] \cdot y_n x_n}_{v}

   (note: η is the step size, v is the update direction)

when stop, return last w as g

choice of (η, v) and stopping condition defines
an iterative optimization approach

Hsuan-Tien Lin (NTU CSIE) Machine Learning 32/52


Linear Models Gradient of Logistic Regression Error

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 33/52


Linear Models Gradient Descent

Iterative Optimization
For t = 0, 1, . . .
    w_{t+1} \leftarrow w_t + \eta v     (note: η acts like a learning rate)
when stop, return last w as g
(note: two things to decide — the direction and the step size)

• PLA: v comes from mistake correction
• smooth E_in(w) for logistic regression:
  choose v to let the ball roll ‘downhill’?
  (figure: a ball rolling down the E_in surface over the weights w)
• direction v: (assumed) of unit length
• step size η: (assumed) positive

a greedy approach for some given η > 0:

    \min_{\|v\|=1} E_{in}\big( \underbrace{w_t + \eta v}_{w_{t+1}} \big)     (note: an inner optimization problem)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 34/52


Linear Models Gradient Descent

Linear Approximation
a greedy approach for some given η > 0:

    \min_{\|v\|=1} E_{in}(w_t + \eta v)

• still non-linear optimization, now with constraints
  —not any easier than min_w E_in(w)
• a local approximation by a linear formula makes the problem easier:

    E_{in}(w_t + \eta v) \approx E_{in}(w_t) + \eta \, v^T \nabla E_{in}(w_t)

  if η really small (Taylor expansion)
  (note: in one dimension the Taylor expansion is the tangent line; here it is extended to many dimensions)

an approximate greedy approach for some given small η:

    \min_{\|v\|=1} \ \underbrace{E_{in}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \, \underbrace{v^T \nabla E_{in}(w_t)}_{\text{known}}

(note: the first term is a constant and η is fixed, so we want the last term to be as small as possible)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 35/52
Linear Models Gradient Descent

Gradient Descent
an approximate greedy approach for some given small η:

    \min_{\|v\|=1} \ \underbrace{E_{in}(w_t)}_{\text{known}} + \underbrace{\eta}_{\text{given positive}} \, \underbrace{v^T \nabla E_{in}(w_t)}_{\text{known}}

• optimal v: the opposite direction of ∇E_in(w_t),

    v = -\frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|}     (note: the negative normalized gradient direction)

• gradient descent: for small η, update

    w_{t+1} \leftarrow w_t - \eta \, \frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|}

gradient descent:
a simple & popular optimization tool

Hsuan-Tien Lin (NTU CSIE) Machine Learning 36/52


Linear Models Gradient Descent
Choice of η

(figure: two plots of the in-sample error E_in versus the weights w, for a small and a large η)
η too small: too slow :-( (note: tiny steps)
η too large: too unstable :-( (note: huge steps — the update may even go uphill, and the Taylor approximation no longer holds)

a naive yet effective heuristic:
• choose the red η proportional to ‖∇E_in(w_t)‖, with the purple η (the fixed learning rate) as the ratio:

    w_{t+1} \leftarrow w_t - \eta_{\text{red}} \, \frac{\nabla E_{in}(w_t)}{\| \nabla E_{in}(w_t) \|} = w_t - \eta_{\text{purple}} \, \nabla E_{in}(w_t)

fixed learning rate gradient descent:

    w_{t+1} \leftarrow w_t - \eta \, \nabla E_{in}(w_t)


Hsuan-Tien Lin (NTU CSIE) Machine Learning 37/52
Linear Models Gradient Descent

Putting Everything Together


Logistic Regression Algorithm
initialize w_0
For t = 0, 1, · · ·
1  compute

       \nabla E_{in}(w_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w_t^T x_n \big) \big( -y_n x_n \big)

2  update by

       w_{t+1} \leftarrow w_t - \eta \, \nabla E_{in}(w_t)

...until \nabla E_{in}(w_{t+1}) = 0 or enough iterations
return last w_{t+1} as g

O(N) time complexity in step 1 per iteration


Hsuan-Tien Lin (NTU CSIE) Machine Learning 38/52
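A compact sketch of the whole algorithm on synthetic, roughly linearly separable data: fixed-learning-rate gradient descent on the cross-entropy error. The choices η = 0.1 and 2000 iterations are arbitrary illustrations, not prescriptions from the lecture.

import numpy as np

rng = np.random.default_rng(6)
N, d = 300, 2
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
w_true = np.array([0.5, 2.0, -1.5])
y = np.sign(X @ w_true + 0.2 * rng.standard_normal(N))   # noisy linear labels in {-1, +1}

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

def gradient(w):
    # grad E_in(w) = (1/N) sum_n theta(-y_n w^T x_n) * (-y_n x_n)
    return np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)

w = np.zeros(d + 1)          # initialize w_0
eta = 0.1                    # fixed learning rate (an arbitrary but common choice)
for t in range(2000):        # ...or stop once the gradient is (nearly) zero
    g = gradient(w)
    if np.linalg.norm(g) < 1e-6:
        break
    w = w - eta * g          # w_{t+1} <- w_t - eta * grad E_in(w_t)

print(w, np.mean(np.sign(X @ w) == y))   # weights roughly aligned with w_true, high accuracy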
Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 39/52


Linear Models Gradient Descent

Linear Models Revisited


linear scoring function: s = w^T x

(figure: the three model diagrams, each combining inputs x0, x1, …, xd into a score s and an output h(x))

linear classification:  h(x) = sign(s)   (note: right or wrong)
    plausible err = 0/1; discrete E_in(w): NP-hard to solve in general (hard)
linear regression:      h(x) = s
    friendly err = squared; quadratic convex E_in(w): closed-form solution (easy)
logistic regression:    h(x) = θ(s)   (note: a probability between 0 and 1)
    plausible err = cross-entropy; smooth convex E_in(w): gradient descent (easy)

can linear regression or logistic regression
help linear classification?
Hsuan-Tien Lin (NTU CSIE) Machine Learning 40/52


Linear Models Gradient Descent

Error Functions Revisited


linear scoring function: s = w^T x
for binary classification y ∈ {−1, +1}

linear classification:  h(x) = sign(s),     err(h, x, y) = [[ h(x) ≠ y ]]
linear regression:      h(x) = s,           err(h, x, y) = (h(x) − y)^2
logistic regression:    h(x) = θ(s),        err(h, x, y) = −ln h(y x)
(note: compute the hypothesis output, then compare it with the label)

in terms of the score s:
    err_{0/1}(s, y) = [[ sign(s) ≠ y ]] = [[ sign(ys) ≠ 1 ]]
    err_{SQR}(s, y) = (s − y)^2 = (ys − 1)^2
    err_{CE}(s, y) = ln(1 + exp(−ys))

(ys): classification correctness score
(note: the larger ys, the better)
Hsuan-Tien Lin (NTU CSIE) Machine Learning 41/52


Linear Models Gradient Descent

Visualizing Error Functions (note: comparing the shapes of the error functions, drawn over the same axis)

0/1:        err_{0/1}(s, y) = [[ sign(ys) ≠ 1 ]]
sqr:        err_{SQR}(s, y) = (ys − 1)^2
ce:         err_{CE}(s, y) = ln(1 + exp(−ys))
scaled ce:  err_{SCE}(s, y) = log_2(1 + exp(−ys))

(figure: err versus ys for the four error functions, plotted for ys between −3 and 3)

• 0/1: 1 iff ys ≤ 0
• sqr: large if ys ≪ 1, but over-charges ys ≫ 1
  (note: squared error penalizes points that 0/1 already gets right, yet it behaves similarly to 0/1 in the region where 0/1 errs)
  small err_{SQR} → small err_{0/1}
• ce: monotonic in ys
  small err_{CE} ↔ small err_{0/1}
• scaled ce (note: the same curve with log base 2): a proper upper bound of 0/1
  (note: it touches the 0/1 curve, so the two track each other)
  small err_{SCE} ↔ small err_{0/1}

upper bound:
useful for designing algorithms
Hsuan-Tien Lin (NTU CSIE) Machine Learning 42/52
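A small computed table (rather than a plot) comparing the four error functions at a few values of ys; it simply evaluates the formulas listed above and checks the upper-bound claim numerically.

import numpy as np

def err01(ys):  return float(ys <= 0)                 # [[sign(ys) != 1]]
def errsqr(ys): return (ys - 1.0) ** 2
def errce(ys):  return np.log(1.0 + np.exp(-ys))
def errsce(ys): return np.log2(1.0 + np.exp(-ys))     # scaled ce = ce / ln 2

print(" ys    0/1     sqr      ce     sce")
for ys in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{ys:4.1f}  {err01(ys):5.2f}  {errsqr(ys):6.2f}  {errce(ys):6.3f}  {errsce(ys):6.3f}")

# scaled ce upper-bounds 0/1 everywhere, e.g.:
print(all(errsce(ys) >= err01(ys) for ys in np.linspace(-5, 5, 101)))   # True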
Linear Models Gradient Descent

Learning Flow with Algorithmic Error Measure


(learning-flow diagram)
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula),
together with unknown P on X, generates x1, x2, · · · , xN and y1, y2, · · · , yN

training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
→ learning algorithm A, guided by êrr → final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formula); error measure err

err: the goal, not always easy to optimize;
êrr: something ‘similar’ that facilitates A, e.g. an upper bound
Hsuan-Tien Lin (NTU CSIE) Machine Learning 43/52
Linear Models Gradient Descent

Theoretical Implication of Upper Bound


For any ys where s = w^T x
(note: if we care about the 0/1 error, the scaled cross-entropy error is an upper bound for it):

    err_{0/1}(s, y) \le err_{SCE}(s, y) = \frac{1}{\ln 2} \, err_{CE}(s, y)

    \Longrightarrow  E_{in}^{0/1}(w) \le E_{in}^{SCE}(w) = \frac{1}{\ln 2} E_{in}^{CE}(w)     (note: averaging preserves the bound)
                     E_{out}^{0/1}(w) \le E_{out}^{SCE}(w) = \frac{1}{\ln 2} E_{out}^{CE}(w)

VC on 0/1 (note: starting from E_in):
    E_{out}^{0/1}(w) \le E_{in}^{0/1}(w) + \Omega^{0/1} \le \frac{1}{\ln 2} E_{in}^{CE}(w) + \Omega^{0/1}

VC-Reg on CE (note: starting from E_out):
    E_{out}^{0/1}(w) \le \frac{1}{\ln 2} E_{out}^{CE}(w) \le \frac{1}{\ln 2} E_{in}^{CE}(w) + \frac{1}{\ln 2} \Omega^{CE}

small E_{in}^{CE}(w) \Longrightarrow small E_{out}^{0/1}(w):
logistic/linear reg. for linear classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning 44/52


Linear Models Gradient Descent
Regression for Classification
(note: use regression to get the weights, then take the sign of the score)

1  run logistic/linear reg. on D with y_n ∈ {−1, +1} to get w_{REG}
2  return g(x) = sign(w_{REG}^T x)

PLA
• pros: efficient + strong guarantee if lin. separable
• cons: works only if lin. separable

linear regression (note: one formula and it is solved)
• pros: ‘easiest’ optimization
• cons: loose bound of err_{0/1} for large |ys|
  (note: when |ys| is large, the squared curve sits far above 0/1, so it is only a loose upper bound)

logistic regression
• pros: ‘easy’ optimization
• cons: loose bound of err_{0/1} for very negative ys
  (note: when ys is very negative, the curve looks quite different from 0/1)

• linear regression sometimes used to set w_0 for PLA/logistic regression (note: as initialization)
• logistic regression often preferred in practice
(note: if linear regression looks reasonable, take its w as w_0 and continue optimizing from that good starting point)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 45/52
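A sketch of regression-for-classification on synthetic data: run linear regression with ±1 targets, classify with sign(w_REG^T x), and optionally continue with logistic-regression gradient descent starting from w_REG (the η and iteration counts are illustrative).

import numpy as np

rng = np.random.default_rng(7)
N, d = 400, 2
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = np.sign(X @ np.array([-0.3, 1.0, 2.0]) + 0.3 * rng.standard_normal(N))  # labels in {-1, +1}

# step 1: run linear regression on D with y_n in {-1, +1}
w_reg = np.linalg.pinv(X) @ y

# step 2: g(x) = sign(w_REG^T x)
print("linear-regression classifier accuracy:", np.mean(np.sign(X @ w_reg) == y))

# optional: use w_REG as w_0 for logistic regression and keep optimizing
theta = lambda s: 1.0 / (1.0 + np.exp(-s))
w = w_reg.copy()
eta = 0.1
for t in range(500):
    grad = np.mean(theta(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
    w -= eta * grad
print("after logistic-regression fine-tuning:  ", np.mean(np.sign(X @ w) == y))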


Linear Models Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 46/52


Linear Models Stochastic Gradient Descent

Two Iterative Optimization Schemes


For t = 0, 1, . . .
    w_{t+1} \leftarrow w_t + \eta v     (note: step by step, make w better and better)
when stop, return last w as g
(note: each round picks something to correct)

PLA: pick (x_n, y_n) and decide w_{t+1} by the one example
  O(1) time per iteration :-)   (note: each round looks at a single point)
logistic regression: check D and decide w_{t+1} (or new ŵ) by all examples
  O(N) time per iteration :-(   (note: each round looks at every point, averages their gradient contributions, and then steps downhill)

logistic regression with
O(1) time per iteration?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 47/52


Linear Models Stochastic Gradient Descent

Logistic Regression Revisited (note: the algorithm)

    w_{t+1} \leftarrow w_t + \eta \, \underbrace{\frac{1}{N} \sum_{n=1}^{N} \theta\big( -y_n w_t^T x_n \big) \big( y_n x_n \big)}_{-\nabla E_{in}(w_t)}

• want: update direction v ≈ −∇E_in(w_t) (note: the update direction should stay close to the gradient direction),
  while computing v from one single (x_n, y_n)
• technique for removing \frac{1}{N}\sum_{n=1}^{N} (note: pick one point at random instead of spending N times the effort to average all the gradients):
  view it as an expectation E over a uniform choice of n!

stochastic gradient: \nabla_w \, \text{err}(w, x_n, y_n) with random n
(note: a random gradient — not the true overall one, but the gradient at a single point)

true gradient: \nabla_w E_{in}(w) = \mathbb{E}_{\text{random } n} \, \nabla_w \, \text{err}(w, x_n, y_n)
(note: view the overall gradient as the expected value of this random process)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 48/52


Linear Models Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)


stochastic gradient = true gradient + zero-mean ‘noise’ directions

Stochastic Gradient Descent
• idea: replace the true gradient by the stochastic gradient
  (note: descend using the random gradient rather than the true one)
• after enough steps,
  average true gradient ≈ average stochastic gradient
  (note: run enough steps and, on average, the true gradient matches the stochastic one)
• pros: simple & cheaper computation :-)
  —useful for big data or online learning
• cons: less stable in nature

SGD logistic regression, looks familiar? :-):

    w_{t+1} \leftarrow w_t + \eta \, \underbrace{\theta\big( -y_n w_t^T x_n \big) \big( y_n x_n \big)}_{-\nabla \text{err}(w_t, x_n, y_n)}     (note: the update direction)

Hsuan-Tien Lin (NTU CSIE) Machine Learning 49/52
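A minimal SGD sketch for logistic regression: each step picks one example uniformly at random and applies w ← w + η θ(−y_n w^T x_n) y_n x_n; η = 0.1 and the step count follow the rules of thumb on the next slide but are otherwise arbitrary.

import numpy as np

rng = np.random.default_rng(8)
N, d = 500, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N))

theta = lambda s: 1.0 / (1.0 + np.exp(-s))

w = np.zeros(d + 1)
eta = 0.1                       # rule-of-thumb learning rate
for t in range(20000):          # rule of thumb: just run "long enough"
    n = rng.integers(N)         # pick one example uniformly at random
    # stochastic gradient step: w <- w + eta * theta(-y_n w^T x_n) * y_n x_n
    w += eta * theta(-y[n] * (w @ X[n])) * y[n] * X[n]

print("SGD logistic regression training accuracy:", np.mean(np.sign(X @ w) == y))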


Linear Models Stochastic Gradient Descent

PLA Revisited
SGD logistic regression:
    w_{t+1} \leftarrow w_t + \eta \cdot \theta\big( -y_n w_t^T x_n \big) \, y_n x_n

PLA:
    w_{t+1} \leftarrow w_t + 1 \cdot \big[\!\big[ y_n \neq \text{sign}(w_t^T x_n) \big]\!\big] \, y_n x_n

• SGD logistic regression ≈ ‘soft’ PLA
• PLA ≈ SGD logistic regression with η = 1 when w_t^T x_n is large

two practical rules of thumb:
• stopping condition? t large enough (note: it is hard to decide, so trust that running long enough suffices)
• η? 0.1 when x is in a proper range

Hsuan-Tien Lin (NTU CSIE) Machine Learning 50/52


Linear Models Stochastic Gradient Descent

Questions?

Hsuan-Tien Lin (NTU CSIE) Machine Learning 51/52


Linear Models Stochastic Gradient Descent

Summary
1 Why Can Machines Learn?

Lecture 4: Theory of Generalization


2 How Can Machines Learn?

Lecture 5: Linear Models


Linear Regression Problem
Linear Regression Algorithm
Logistic Regression Problem
Logistic Regression Error
Gradient of Logistic Regression Error
Gradient Descent
Stochastic Gradient Descent
• next: beyond simple linear models

Hsuan-Tien Lin (NTU CSIE) Machine Learning 52/52
