Lecture 4: SVM I
Princeton University COS 495
Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$
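To make the setup concrete, here is a minimal Python sketch of the empirical loss; the linear predictor f and the 0-1 loss are hypothetical stand-ins for the abstract f and l above, not part of the lecture.

```python
import numpy as np

# Hypothetical linear predictor, standing in for the abstract f above
def f(w, x):
    return w @ x

# Hypothetical 0-1 loss: 1 if the predicted sign disagrees with the label
def loss(w, x, y):
    return float(np.sign(f(w, x)) != y)

# Empirical loss: \hat{L}(f) = (1/n) * sum_i l(f, x_i, y_i)
def empirical_loss(w, X, Y):
    return np.mean([loss(w, x, y) for x, y in zip(X, Y)])
```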
Machine learning 1-2-3
• Different names/faces/connections
• Occam’s razor
• VC dimension theory
• Minimum description length
• Tradeoff between bias and variance; uniform convergence
• The curse of dimensionality
[Figure: a linear separator $w^*$ between Class +1 and Class -1; the region with $(w^*)^T x < 0$ is the Class -1 side]
[Figure: three separating hyperplanes $w_1$, $w_2$, $w_3$, all consistent with the training data]
What about $w_3$? New test data
[Figure: new test data falling near the boundary of $w_3$]
Most confident: $w_2$
[Figure: the same new test data is still classified correctly by $w_2$]
Intuition: margin
[Figure: $w_2$ separates Class +1 from Class -1 with a large margin]
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$
Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$
[Figure: $x$ projected onto the direction $w$, with the hyperplane through $0$]
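A quick numerical sanity check of Lemma 1; the vector $w$ and point $x$ below are made up for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector, ||w|| = 5
x = np.array([2.0, 1.0])   # hypothetical point, f_w(x) = 10

dist = abs(w @ x) / np.linalg.norm(w)   # |f_w(x)| / ||w|| = 2
# Cross-check: project x onto the unit direction w / ||w||
unit = w / np.linalg.norm(w)
assert np.isclose(dist, abs(unit @ x))
```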
Margin: with bias
• Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
Proof:
• Pick any $x_1$ and $x_2$ on the hyperplane
• $w^T x_1 + b = 0$
• $w^T x_2 + b = 0$
• Subtracting: $w^T (x_1 - x_2) = 0$, so $w$ is orthogonal to every direction lying within the hyperplane
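Claim 1 can be checked numerically on any two points satisfying the hyperplane equation; the hyperplane and points below are made up.

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -3.0   # hypothetical hyperplane w^T x + b = 0
x1 = np.array([3.0, 0.0])           # satisfies w^T x1 + b = 0
x2 = np.array([1.0, 1.0])           # satisfies w^T x2 + b = 0
# w is orthogonal to the direction x1 - x2 within the hyperplane
assert np.isclose(w @ (x1 - x2), 0.0)
```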
Margin: with bias
• Claim 2: $0$ has (signed) distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$
Proof:
• Pick any $x_1$ on the hyperplane
• Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance
• $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
Proof:
• Let $x = x_\perp + r\frac{w}{\|w\|}$, where $x_\perp$ lies on the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^T$ and add $b$
• Left hand side: $w^T x + b = f_{w,b}(x)$
• Right hand side: $w^T x_\perp + b + r\frac{w^T w}{\|w\|} = 0 + r\|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and the distance is $|r| = \frac{|f_{w,b}(x)|}{\|w\|}$
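A sketch verifying Lemma 2 numerically, decomposing a made-up $x$ as $x_\perp + r\frac{w}{\|w\|}$; all values are hypothetical.

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 5.0   # hypothetical hyperplane
x = np.array([1.0, 1.0])           # hypothetical point

f = w @ x + b                        # f_{w,b}(x) = 12
dist = abs(f) / np.linalg.norm(w)    # claimed distance = 12 / 5

# Decompose x = x_perp + r * w/||w|| and confirm |r| matches
r = f / np.linalg.norm(w)
x_perp = x - r * w / np.linalg.norm(w)
assert np.isclose(w @ x_perp + b, 0.0)   # x_perp lies on the hyperplane
assert np.isclose(abs(r), dist)
```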
[Figure: margin geometry drawn in the alternative notation $y(x) = w^T x + w_0$]
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged:
$\frac{y_i(c w^T x_i + c b)}{\|c w\|} = \frac{y_i(w^T x_i + b)}{\|w\|}$
• So we can fix the scale by requiring $y_i(w^T x_i + b) \ge 1, \forall i$, with equality for the closest points; maximizing the margin then amounts to minimizing $\|w\|$
• How to find the optimum $\hat{w}^*$?
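Before answering, a quick numerical check of the scaling observation above; the classifier and the toy data are made-up values.

```python
import numpy as np

def margin(w, b, X, y):
    # min_i y_i (w^T x_i + b) / ||w||
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

w, b = np.array([1.0, -2.0]), 0.5          # hypothetical classifier
X = np.array([[1.0, -1.0], [-2.0, 0.0]])   # toy data
y = np.array([1.0, -1.0])

c = 10.0   # any positive scale factor
assert np.isclose(margin(w, b, X, y), margin(c * w, c * b, X, y))
```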
SVM: principle for hypothesis class
Thought experiment
• Suppose we pick an $R$, and suppose we can decide whether there exists $(w, b)$ satisfying
$\frac{1}{2}\|w\|^2 \le R$ and $y_i(w^T x_i + b) \ge 1, \forall i$
[Figure: $\hat{w}^*$]
Thought experiment
• $\hat{w}^*$ is the best weight (i.e., satisfying the smallest $R$)
[Figure: $\hat{w}^*$]
Thought experiment
• To handle the difference between empirical and expected losses:
• Choose large margin hypothesis (high confidence)
• Choose a small hypothesis class
[Figure: $\hat{w}^*$; the ball around it corresponds to the hypothesis class]
Thought experiment
• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
• Also true beyond SVM
• Also true for the case without perfect separation between the two classes
• Math formulation: VC-dim theory, etc.
• Whatever you know about the ground truth, add it as a constraint/regularizer
SVM: optimization
• Optimization (Quadratic Programming):
$\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1, \forall i$
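As a sketch, this quadratic program can be handed to an off-the-shelf convex solver; the snippet below uses cvxpy on made-up linearly separable data (the data and variable names are assumptions for illustration, not part of the lecture).

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Hard-margin SVM: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
```

At this scale any generic QP solver works; dedicated SVM packages instead solve an equivalent dual formulation, which also scales better and enables kernels.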