8 SVMs
Analytic Computing
1 Perceptron Algorithm
Binary classification
• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,
• learn a classifier $f(x)$ such that
$$f(x_i) \begin{cases} > 0, & \text{if } y_i = +1 \\ < 0, & \text{if } y_i = -1 \end{cases}$$
• Correct classification: $f(x_i)\, y_i > 0$
Linear separability
(Figure: Zisserman 2015)
Linear classifiers
(Related to, but different from, the chapter "From linear regression to classification" in 4-LogisticRegression.)
• A linear classifier has the form
$$f(x) = w^\top x + b$$
Linear classifiers
• Folding the bias into the weight vector by setting $x_0 = 1$, we then write
$$f(x) = w^\top x$$
(Figure: Zisserman 2015)
Linear classifiers
$$f(x) = w^\top x$$
• In 3D the discriminant is a plane
• In $n$ dimensions the discriminant is a hyperplane
• Only $w$ (including $b$) is needed to classify new data
(Figure: Zisserman 2015)
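A minimal sketch in Python (NumPy assumed) of this trick: prepending a constant 1 to each input lets a single weight vector carry the bias, so $w_0$ plays the role of $b$.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each row so that w[0] acts as the bias b."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[2.0, 3.0], [-1.0, 0.5]])   # two points in R^2
w = np.array([0.5, 1.0, -2.0])            # w[0] = b, w[1:] = weights
print(augment(X) @ w)                     # decision values f(x); the sign gives the class
```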
The perceptron classifier
• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,
• how do we find a weight vector $w$, i.e. the separating hyperplane, such that the two categories are separated for the dataset?
• Perceptron algorithm:
1. Initialize $w = 0$
2. While there is an $i$ such that $f_w(x_i)\, y_i < 0$ do the update shown below
Example for perceptron algorithm in 2D
1. Initialize $w = 0$
2. While there is an $i$ such that $f_w(x_i)\, y_i < 0$ do
• $w := w - \alpha\, x_i\, \mathrm{sign}(f_w(x_i))$
At convergence: $w = \sum_{i=1}^{N} \alpha_i x_i$
(Figure: Zisserman 2015)
Example
(Figure: Zisserman 2015)
2 Support Vectors
What is the best $w$?
(Figure: Zisserman 2015)
Support Vector Machine
$$f(x) = \sum_{i} \alpha_i\, y_i\, x_i^\top x + b$$
(Figure: Zisserman 2015)
SVM Optimization Problem (1)
(Figure: margin $\delta(x_i, h)$ of a data point $x_i$ to the hyperplane $h$; Zisserman 2015)
SVM Optimization Problem (2)
$$\max_{w, b} \frac{2}{\|w\|}$$
subject to
$$y_i\, (w^\top x_i + b) \ge 1, \quad \text{for } i = 1 \dots N$$
Or equivalently
$$\min_{w, b} \|w\|^2$$
subject to the same constraints.
This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum.
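A short worked step, added here as a sketch of where the $\frac{2}{\|w\|}$ objective comes from:

```latex
% Points on the two margin hyperplanes satisfy
%   w^T x_+ + b = +1   and   w^T x_- + b = -1.
% Subtracting: w^T (x_+ - x_-) = 2. Projecting the difference
% onto the unit normal w/||w|| gives the width of the margin:
\text{margin} = \frac{w^\top (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}
% Maximizing 2/||w|| is therefore equivalent to minimizing ||w||^2.
```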
Compare the two optimization criteria for classification of linearly separable data: classification by linear regression vs. classification by SVM
• Linear classification using regression: the decision line is the average between the regression lines; all data points are considered.
• Linear classification using SVM: the decision line maximizes the margin between the support vectors; far-away data points are irrelevant.
3 Soft Margin and Hinge Loss
Re-visiting linear separability
(Figure: Zisserman 2015)
Introduce "slack" variables
(Figure: Zisserman 2015)
Soft margin solution
Revised optimization problem:
$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$$
subject to
$$y_i\, (w^\top x_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1 \dots N$$
• Every constraint can be satisfied if $\xi_i$ is sufficiently large
• $C$ is a regularization parameter:
  - small $C$ allows constraints to be easily ignored ⟹ large margin
  - large $C$ makes constraints hard to ignore ⟹ narrow margin
  - $C = \infty$ enforces all constraints ⟹ hard margin
• Still a quadratic optimization problem with a unique minimum
• One hyperparameter $C$
Loss function
• Given the constraints:
$$y_i\, (w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
• we can rewrite $\xi_i$ as:
$$\xi_i = \max(0,\, 1 - y_i f_w(x_i))$$
(Figure: Zisserman 2015)
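A minimal NumPy sketch (illustrative values, not from the slides) of this hinge loss, i.e. the slack $\xi_i$ written as $\max(0, 1 - y_i f(x_i))$:

```python
import numpy as np

def hinge_loss(f_values, y):
    """xi_i = max(0, 1 - y_i * f(x_i)): zero beyond the margin, linear inside it."""
    return np.maximum(0.0, 1.0 - y * f_values)

f = np.array([ 2.0,  0.3, -0.5])   # decision values f(x_i)
y = np.array([+1,   +1,   +1  ])   # true labels
print(hinge_loss(f, y))            # [0.0, 0.7, 1.5]
```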
4 Gradient descent over convex functions
Gradient descent/ascent
(Figure: https://en.wikipedia.org/wiki/Gradient_descent#/media/File:Gradient_descent.svg)
• Climb down a hill (descent) or climb up a hill (ascent):
• go in the direction where
$$\frac{df(\mathbf{x})}{d\mathbf{x}} = \nabla_{\mathbf{x}} f(\mathbf{x})$$
is maximal/minimal.
• In general, this challenge can be difficult.
(Figure: "Gradient Descent - but without posts", https://goo.gl/images/JKN6zm)
Optimization continued
(Figure: Zisserman 2015)
Questions
• Does this cost function have a unique solution?
• Do we find it using gradient descent?
• Does the solution we find using gradient descent depend on the starting point?
To the rescue:
• If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over constraints that form a convex set, which is given in our case).
Convex functions
(Figure: Zisserman 2015)
Convex function examples
(Figure: Zisserman 2015)
Applied to hinge loss and regularization
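A short argument (added as a sketch, not spelled out on the slide) for why the soft-margin objective is convex:

```latex
% ||w||^2 is convex (its Hessian 2I is positive semi-definite).
% 1 - y_i f_w(x_i) is affine in w, and max(0, affine) is a pointwise
% maximum of convex functions, hence convex (the hinge loss).
% A nonnegatively weighted sum of convex functions is convex, so
\mathcal{C}(w) = \|w\|^2 + C \sum_{i=1}^{N} \max(0,\, 1 - y_i f_w(x_i))
% is convex: any local minimum found by (sub-)gradient descent is global.
```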
Gradient descent algorithm for SVM
To minimize a cost function $\mathcal{C}(w)$, use the iterative update
$$w_{t+1} := w_t - \eta_t \nabla_w \mathcal{C}(w_t)$$
where $\eta$ is the learning rate.
Let's rewrite the minimization problem as an average, with $\lambda = \frac{2}{NC}$:
$$\mathcal{C}(w) = \frac{1}{NC} \|w\|^2 + \frac{1}{N} \sum_{i=1}^{N} \max(0,\, 1 - y_i f_w(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\lambda}{2} \|w\|^2 + \max(0,\, 1 - y_i f_w(x_i)) \right]$$
and $f_w(x_i) = w^\top x_i + b$.
Sub-gradient for hinge loss
For $\mathcal{L}(x_i, y_i; w) = \max(0,\, 1 - y_i f_w(x_i))$, the sub-gradient with respect to $w$ is $-y_i x_i$ where $y_i f_w(x_i) < 1$, and $0$ elsewhere.
(Figure: Zisserman 2015)
Sub-gradient descent algorithm for SVM
$$\mathcal{C}(w) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\lambda}{2} \|w\|^2 + \mathcal{L}(x_i, y_i; w) \right]$$
Then each iteration $t$ involves cycling through the training data with the updates:
$$w_{t+1} := w_t - \eta_t \left( \lambda w_t - \mathbb{1}[y_i f_{w_t}(x_i) < 1]\, y_i x_i \right)$$
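A minimal sketch in Python of this sub-gradient descent (a Pegasos-style variant; the data, λ, and the decaying learning-rate schedule are illustrative assumptions):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.1, epochs=50):
    """Cycle through the data, stepping along the sub-gradient of
    (lambda/2)*||w||^2 + max(0, 1 - y_i w^T x_i). Bias folded in via x_0 = 1."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xa.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            t += 1
            eta = 1.0 / (lam * t)            # decaying learning rate (assumed schedule)
            if yi * (xi @ w) < 1:            # margin violated: hinge term is active
                w -= eta * (lam * w - yi * xi)
            else:                            # only the regularizer contributes
                w -= eta * lam * w
    return w

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = svm_subgradient_descent(X, y)
print(np.sign(np.hstack([np.ones((len(X), 1)), X]) @ w))  # predicted labels
```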
Questions 4: Gradient descent
Excursion: Lagrange Multiplier
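A classic worked example, added as an assumed illustration of the technique (not from the slides): minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$.

```latex
% Form the Lagrangian with multiplier alpha and set all derivatives to zero:
\mathcal{L}(x, y, \alpha) = x^2 + y^2 - \alpha\, (x + y - 1)
% dL/dx = 2x - alpha = 0,   dL/dy = 2y - alpha = 0   =>   x = y = alpha / 2
% dL/dalpha = -(x + y - 1) = 0   =>   x + y = 1   =>   x = y = 1/2,  alpha = 1
% The constrained minimum is f(1/2, 1/2) = 1/2.
```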
Revisit Optimization Problem for Hard Margin Case
• Minimize the quadratic form
$$\frac{\|w\|^2}{2} = \frac{w^\top w}{2}$$
• with constraints
$$y_i \cdot (w^\top x_i + b) \ge 1 \quad \forall i$$
• Lagrangian (the $\alpha_i$ are the Lagrange multipliers):
$$\mathcal{L}(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^\top x_i + b) - 1 \right)$$
• Violated constraints "punish" the objective function.
Lagrangian primal problem
• The Lagrangian primal problem is:
$$\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$$
subject to $\forall i: \alpha_i \ge 0$
Finding the optimum
• Find the optimum using derivatives:
$$\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; 0 = \sum_{i=1}^{N} \alpha_i y_i$$
$$\frac{\partial}{\partial w_j} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; w_j = \sum_{i=1}^{N} \alpha_i y_i x_{i,j} \;\Longrightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
Note: $w$ is a linear combination of the data instances!
Substitution into $\mathcal{L}(w, b, \alpha)$
$$\mathcal{L}(w, b, \alpha)\Big|_{w = \sum_j \alpha_j y_j x_j} = \frac{\|w\|^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^\top x_i + b) - 1 \right) =$$
$$= \frac{1}{2} \left( \sum_{j=1}^{N} \alpha_j y_j x_j \right)^{\!\top} \left( \sum_{k=1}^{N} \alpha_k y_k x_k \right) - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot \left( \sum_{j=1}^{N} \alpha_j y_j x_j^\top x_i + b \right) - 1 \right) =$$
$$= \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k x_j^\top x_k - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j - b \underbrace{\sum_{i=1}^{N} \alpha_i y_i}_{= 0} + \sum_{i=1}^{N} \alpha_i =$$
$$= \mathcal{L}(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k x_j^\top x_k$$
Wolfe dual problem
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k x_j^\top x_k$$
subject to $\forall i: \alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
For the soft margin case, $\min \|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i\, (w^\top x_i + b) \ge 1 - \xi_i$:
• Transform to the Lagrangian
• with additional Lagrange multipliers for the slack variables being constrained to positive values ...
Summary: Primal and dual formulations
The dual form classifier seems to work like a kNN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$ are zero; the ones that are non-zero define the support vectors $x_i$.
• Lagrangian primal problem: $\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$ subject to $\forall i: \alpha_i \ge 0$
• Wolfe dual problem: $\max_{\alpha} \mathcal{L}(\alpha)$ subject to $\forall i: \alpha_i \ge 0$ and $\sum_{i} \alpha_i y_i = 0$
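A minimal sketch (assuming scikit-learn is available; the random data is illustrative) of the summary above: after fitting a linear SVM, only a few training points carry non-zero $\alpha_i$ and survive as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_vectors_))   # typically far fewer than the 100 training points
print(clf.dual_coef_)              # the non-zero alpha_i * y_i of the support vectors
```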
6 Kernelization Tricks in SVMs
Non-linear Case
$$f(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^\top x + b$$
• $x_i$: i-th training instance
• $\alpha_i$: weight for the i-th training instance
Example 1: From 1-dim to 2-dim
Example 2: From 2-dim to 3-dim
(Figure: Zisserman 2015)
Feature engineering using $\phi(x)$
• Classifier: Given $x_i \in \mathbb{R}^d$, $\phi: \mathbb{R}^d \to \mathbb{R}^D$, $w \in \mathbb{R}^D$:
$$f_w(x) = w^\top \phi(x) + b$$
• Learning:
$$\min_{w \in \mathbb{R}^D} \|w\|^2 + C \sum_{i=1}^{N} \max(0,\, 1 - y_i f_w(x_i))$$
• In the dual form, $x$ is replaced by $\phi(x)$:
$$f(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^\top x + b \;\Longrightarrow\; f(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^\top \phi(x) + b$$
• $\phi$ only occurs in pairs $\phi(x_j)^\top \phi(x_i)$
Kernels
• Define the kernel as the inner product in feature space:
$$k(x_j, x_i) = \phi(x_j)^\top \phi(x_i)$$
• Learning:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^\top \phi(x_j) \;\Longrightarrow\; \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
• Classifying:
$$f(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^\top \phi(x) + b \;\Longrightarrow\; f(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot k(x_i, x) + b$$
• Gaussian kernels: $k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$, for $\sigma > 0$
• Infinite dimensional feature space
• Also called Radial Basis Function kernel (RBF)
• Often works quite well!
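A minimal sketch in Python of the kernelized decision function with a Gaussian kernel (the α, b, and data below are illustrative placeholders, not a trained model):

```python
import numpy as np

def gaussian_kernel(x, x2, sigma=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma ** 2))

def decision(x, X_train, y_train, alpha, b, sigma=1.0):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * yi * gaussian_kernel(xi, x, sigma)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1, +1])
alpha = np.array([0.5, 0.5])       # placeholder dual weights
print(decision(np.array([0.9, 0.9]), X_train, y_train, alpha, b=0.0))
```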
Background reading and more
Thorsten Joachims: Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209.
Maximum margin hyperplane
Reuters data set experiments (17 training documents)
IPVS
Thank you!
Steffen Staab
E-Mail: Steffen.staab@ipvs.uni-stuttgart.de
Phone: +49 (0) 711 685-To be defined
www.ipvs.uni-stuttgart.de/departments/ac/
Universität Stuttgart
Analytic Computing, IPVS
Universitätsstraße 32, 70569 Stuttgart