
Machine Learning Basics

Lecture 5: SVM II
Princeton University COS 495
Instructor: Yingyu Liang
Review: SVM objective
SVM: objective
• Let $y_i \in \{+1, -1\}$, $f_{w,b}(x) = w^T x + b$. Margin:
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
• Support Vector Machine:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
SVM: optimization
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \ \forall i$$

• Solved by the Lagrange multiplier method:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
Lagrange multiplier
Lagrangian
• Consider optimization problem:
$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$
• Lagrangian:
$$\mathcal{L}(w, \boldsymbol{\beta}) = f(w) + \sum_i \beta_i h_i(w)$$
where the $\beta_i$'s are called Lagrange multipliers
Lagrangian
• Consider optimization problem:
$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \ \forall\, 1 \le i \le l$$
• Solved by setting derivatives of the Lagrangian to 0:
$$\frac{\partial \mathcal{L}}{\partial w_i} = 0; \qquad \frac{\partial \mathcal{L}}{\partial \beta_i} = 0$$
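
As a minimal sketch of this recipe, the snippet below solves an illustrative toy problem (minimize $w_1^2 + w_2^2$ subject to $w_1 + w_2 = 1$, not an example from the lecture) symbolically with SymPy, by setting every partial derivative of the Lagrangian to zero:

```python
import sympy as sp

# Toy problem: minimize f(w) = w1^2 + w2^2 subject to h(w) = w1 + w2 - 1 = 0.
w1, w2, beta = sp.symbols('w1 w2 beta', real=True)
f = w1**2 + w2**2
h = w1 + w2 - 1

# Lagrangian L(w, beta) = f(w) + beta * h(w)
L = f + beta * h

# Set all partial derivatives of the Lagrangian to zero and solve.
stationarity = [sp.diff(L, v) for v in (w1, w2, beta)]
solution = sp.solve(stationarity, (w1, w2, beta), dict=True)
print(solution)   # expected: w1 = w2 = 1/2, beta = -1
```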
Generalized Lagrangian
• Consider optimization problem:
$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ \forall\, 1 \le i \le k; \qquad h_j(w) = 0, \ \forall\, 1 \le j \le l$$
• Generalized Lagrangian:
$$\mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$$
where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$$\theta_P(w) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Why?
$$\theta_P(w) = \begin{cases} f(w), & \text{if } w \text{ satisfies all the constraints} \\ +\infty, & \text{if } w \text{ does not satisfy the constraints} \end{cases}$$
• So minimizing $f(w)$ is the same as minimizing $\theta_P(w)$:
$$\min_w f(w) = \min_w \theta_P(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
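
To see numerically why $\theta_P$ reproduces $f$ on feasible points and blows up on infeasible ones, here is a minimal sketch on an illustrative one-dimensional problem (the finite grid of $\alpha$ values only stands in for the true maximization over all $\alpha \ge 0$):

```python
import numpy as np

# Toy problem: min_w f(w) = w^2  subject to  g(w) = 1 - w <= 0  (i.e. w >= 1).
def f(w): return w ** 2
def g(w): return 1.0 - w

def theta_P(w, alpha_grid=np.linspace(0.0, 1e4, 11)):
    # Finite-alpha stand-in for max_{alpha >= 0} L(w, alpha) = f(w) + alpha * g(w);
    # the true maximum is f(w) when g(w) <= 0 and +infinity otherwise.
    return max(f(w) + a * g(w) for a in alpha_grid)

print(theta_P(2.0))   # feasible: best alpha is 0, so theta_P = f(2) = 4
print(theta_P(0.5))   # infeasible: grows without bound as alpha increases (~5000 here)
```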
Lagrange duality
• The primal problem:
$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Always true:
$$d^* \le p^*$$
Lagrange duality
• The primal problem:
$$p^* := \min_w f(w) = \min_w \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_w \mathcal{L}(w, \boldsymbol{\alpha}, \boldsymbol{\beta})$$

• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
$$d^* = \mathcal{L}(w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$

• Moreover, $w^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial \mathcal{L}}{\partial w_i} = 0, \qquad \alpha_i g_i(w) = 0 \ \ \text{(dual complementarity)}$$
$$g_i(w) \le 0, \ h_j(w) = 0 \ \ \text{(primal constraints)}, \qquad \alpha_i \ge 0 \ \ \text{(dual constraints)}$$
Lagrange duality
• What are the proper conditions?
• A set of conditions (Slater's condition):
  • $f$, $g_i$ convex, $h_j$ affine
  • There exists $w$ satisfying $g_i(w) < 0$ for all $i$

• There exist other sets of conditions

• Search "Karush–Kuhn–Tucker conditions" on Wikipedia
SVM: optimization
SVM: optimization
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \ \forall i$$

• Generalized Lagrangian:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
SVM: optimization
• KKT conditions:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_i \alpha_i y_i x_i \quad (1)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ 0 = \sum_i \alpha_i y_i \quad (2)$$

• Plug into $\mathcal{L}$:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad (3)$$
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$

Only depends on inner products
SVM: optimization
• Reduces to dual problem:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
$$\sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0$$

• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$
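
As a sanity check on this dual view, here is a minimal sketch using scikit-learn (the toy data and the large $C$ value, used to approximate the hard-margin problem above, are illustrative choices). The fitted model exposes $\alpha_i y_i$ for the support vectors, from which $w = \sum_i \alpha_i y_i x_i$ can be recovered:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels +1 / -1.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin SVM above.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors,
# so w = sum_i alpha_i y_i x_i can be recovered from the dual solution.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual, clf.coef_)   # the two should agree
print(clf.dual_coef_)           # nonzero alpha_i only on the support vectors
```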


Kernel methods
Features
[Figure: an input image $x$ is mapped to features $\phi(x)$, e.g., a color histogram over red/green/blue values]
Features
• A proper feature mapping can turn a non-linear problem into a linear one
• Using SVM on the feature space $\{\phi(x_i)\}$: only need $\phi(x_i)^T \phi(x_j)$

• Conclusion: no need to design $\phi(\cdot)$; only need to design
$$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
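
A minimal sketch of this point with scikit-learn: the SVM can be trained from a precomputed Gram matrix $k(x_i, x_j)$ alone, without ever forming $\phi$ (the kernel choice, labeling rule, and toy data below are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] * X[:, 1])          # a non-linear labeling rule

# Any kernel k(x_i, x_j) = phi(x_i)^T phi(x_j) can be supplied as a Gram matrix;
# the SVM never needs phi itself.
def k(A, B):
    return (A @ B.T + 1.0) ** 2         # e.g., a degree-2 polynomial kernel

clf = SVC(kernel='precomputed').fit(k(X, X), y)
X_test = rng.normal(size=(5, 2))
print(clf.predict(k(X_test, X)))        # kernel between test and training points
```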
Polynomial kernels
• Fix degree $d$ and constant $c$:
$$k(x, x') = (x^T x' + c)^d$$

• What is $\phi(x)$?

• Expand the expression to get $\phi(x)$
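
For example, for $d = 2$ and two-dimensional inputs, expanding $(x^T x' + c)^2$ gives an explicit six-dimensional $\phi(x)$. The sketch below (with illustrative values of $x$, $x'$, and $c$) checks that $\phi(x)^T \phi(x')$ reproduces the kernel:

```python
import numpy as np

# Degree-2 polynomial kernel on 2-D inputs, with an illustrative constant c.
c = 1.0
x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

def phi(v):
    # Explicit feature map obtained by expanding (v^T v' + c)^2:
    return np.array([v[0]**2, v[1]**2,
                     np.sqrt(2) * v[0] * v[1],
                     np.sqrt(2 * c) * v[0],
                     np.sqrt(2 * c) * v[1],
                     c])

k_direct  = (x @ xp + c) ** 2
k_feature = phi(x) @ phi(xp)
print(k_direct, k_feature)   # both equal (x^T x' + c)^2
```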
Polynomial kernels

[Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar]
Gaussian kernels
• Fix bandwidth $\sigma$:
$$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$$

• Also called radial basis function (RBF) kernels

• What is $\phi(x)$? Consider the un-normalized version
$$k'(x, x') = \exp(x^T x' / \sigma^2)$$
• Power series expansion:
$$k'(x, x') = \sum_{i=0}^{+\infty} \frac{(x^T x')^i}{\sigma^{2i}\, i!}$$
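
A small numeric sketch of the Gaussian kernel (the bandwidth and data are illustrative): compute $\exp(-\|x - x'\|^2 / 2\sigma^2)$ directly and compare against scikit-learn's `rbf_kernel`, which is parameterized by $\gamma = 1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.5                                  # bandwidth (illustrative value)

# Direct computation of k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_manual = np.exp(-sq_dists / (2 * sigma**2))

# scikit-learn's rbf_kernel computes exp(-gamma ||x - x'||^2), so gamma = 1 / (2 sigma^2).
K_sklearn = rbf_kernel(X, gamma=1.0 / (2 * sigma**2))
print(np.allclose(K_manual, K_sklearn))      # True
```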
Mercer's condition for kernels
• Theorem: $k(x, x')$ has expansion
$$k(x, x') = \sum_{i=1}^{+\infty} a_i \phi_i(x) \phi_i(x')$$
if and only if for any function $c(x)$,
$$\int\!\!\int c(x)\, c(x')\, k(x, x')\, dx\, dx' \ge 0$$

(Omit some math conditions for $k$ and $c$)
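
A finite-sample analogue of Mercer's condition is that the Gram matrix on any set of points must be positive semidefinite, i.e. $c^T K c \ge 0$ for every vector $c$ (the discrete version of the double integral). The sketch below checks this via eigenvalues; the data are illustrative, and the "negative distance" function is just an example of something that is not a valid kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

K_poly = (X @ X.T + 1.0) ** 3                                    # polynomial kernel
K_bad  = -np.linalg.norm(X[:, None] - X[None, :], axis=-1)       # not a PSD kernel

for name, K in [("polynomial kernel", K_poly), ("negative distance", K_bad)]:
    # A valid kernel gives a min eigenvalue >= 0 (up to floating-point error).
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)
    print(name, "min eigenvalue:", eigvals.min())
```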


Constructing new kernels
• Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{i=0}^{+\infty} a_i k^i(x, x')$

• Example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is
$$k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$$

• Example: if $k_1(x, x')$ is a kernel, then so is
$$k(x, x') = \exp(k_1(x, x'))$$
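
A quick numeric check of these closure rules on Gram matrices (the kernel parameters and data are illustrative choices):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))

K1 = polynomial_kernel(X, degree=2, coef0=1.0, gamma=1.0)   # k1
K2 = rbf_kernel(X, gamma=0.5)                               # k2

K_sum = 2 * K1 + 3 * K2     # 2 k1 + 3 k2
K_exp = np.exp(K1)          # exp(k1), via the power series sum_i k1^i / i!

for name, K in [("2 k1 + 3 k2", K_sum), ("exp(k1)", K_exp)]:
    # Both combinations stay positive semidefinite:
    # min eigenvalue >= 0 (up to floating-point error).
    print(name, "min eigenvalue:", np.linalg.eigvalsh((K + K.T) / 2).min())
```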
Kernels vs. Neural networks
[Figure: input $x$ → extract features $\phi(x)$ (e.g., a color histogram over red/green/blue) → build hypothesis $y = w^T \phi(x)$. The features are part of the model and make it nonlinear; the hypothesis built on top of the features is a linear model.]
Polynomial kernels

[Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar]
Polynomial kernel SVM as a two-layer neural network
[Figure: a two-layer network view for degree 2. Inputs $x_1, x_2$ feed a fixed first layer computing $\phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c)$; the output unit computes $y = \mathrm{sign}(w^T \phi(x) + b)$.]
The first layer is fixed. If the first layer is also learned, it becomes a two-layer neural network.
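
A sketch of this fixed-first-layer view with scikit-learn (the data, the constant $c$, and the degree-2 setting are illustrative): a polynomial-kernel SVM and a linear SVM trained on the explicit features $\phi(x)$ should make the same predictions, up to solver tolerance:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1)   # illustrative non-linear labels
c = 1.0

def phi(A):
    # The fixed "first layer": explicit degree-2 polynomial features.
    x1, x2 = A[:, 0], A[:, 1]
    return np.column_stack([x1**2, x2**2,
                            np.sqrt(2) * x1 * x2,
                            np.sqrt(2 * c) * x1,
                            np.sqrt(2 * c) * x2,
                            np.full_like(x1, c)])

# Kernel SVM with k(x, x') = (x^T x' + c)^2 ...
kernel_svm = SVC(kernel='poly', degree=2, gamma=1.0, coef0=c).fit(X, y)
# ... versus a linear SVM on top of the fixed first layer phi.
linear_svm = SVC(kernel='linear').fit(phi(X), y)

X_test = rng.normal(size=(10, 2))
agree = np.mean(kernel_svm.predict(X_test) == linear_svm.predict(phi(X_test)))
print(agree)   # expected: 1.0 (up to solver tolerance)
```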
