
IPVS – Institute for Parallel and Distributed Systems

Analytic Computing

8 Support Vector Machines

Prof. Dr. Steffen Staab


https://www.ipvs.uni-stuttgart.de/departments/ac/
• based on slides by
• Thomas Gottron, U. Koblenz-Landau,
https://west.uni-koblenz.de/de/studying/courses/ws1718/machine-learning-and-data-mining-1

• Andrew Zisserman, http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

1 Perceptron Algorithm
Binary classification

• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,

• Learn a classifier

$$f_w(x_i) \begin{cases} > 0, & \text{if } y_i = +1 \\ < 0, & \text{if } y_i = -1 \end{cases}$$

• Correct classification:
$f_w(x_i)\, y_i > 0$
Linear separability

(Zisserman 2015)

Linear classifiers
(Related to, but different from the chapter „From linear regression to classification“ in 4-LogisticRegression)

• A linear classifier has the form

$$f_w(x) = w^T x + b$$

• In 2D the discriminant is a line


• $w$ is the normal to the line, and $b$ the bias
• $w$ is known as the weight vector
(Zisserman 2015)

Linear classifiers

• Let's assume $x_i = (1, x_{i,1}, \ldots, x_{i,d})$ (as in linear regression)

• Then we write
$$f_w(x) = w^T x$$

(Zisserman 2015)
(the bias $b$ is absorbed into $w$ via $x_0 = 1$)
Linear classifiers

• A linear classifier has the form

$$f_w(x) = w^T x$$
• In 3D the discriminant is a plane
• In n-dim the discriminant is a
hyperplane
• Only $w$ (including $b$) is needed to classify new data
(Zisserman 2015)
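As a small aside (not from the slides), a minimal sketch of how such a linear classifier assigns labels once $w$ and $b$ are known:

```python
import numpy as np

def predict_linear(w, b, X):
    """Assign labels with the linear classifier f_w(x) = w^T x + b."""
    scores = X @ w + b                  # one score per row of X
    return np.where(scores >= 0, 1, -1)
```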

The perceptron classifier

• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,
• how to find a weight vector w, the separating hyperplane,
such that the two categories are separated for the dataset?

• Perceptron algorithm
1. Initialize $w = 0$
2. While there is $i$ such that $f_w(x_i)\, y_i < 0$ do
   • $w := w - \alpha x_i\, \mathrm{sign}(f_w(x_i)) = w + \alpha x_i y_i$ (not applied when $\mathrm{sign}(\cdot) = 0$)
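A minimal Python sketch of this procedure (illustrative; assumes labels in {-1, +1} and, as on the previous slides, a bias column $x_0 = 1$ already added to the data):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=1000):
    """Perceptron training: while some point is misclassified, set w := w + alpha * y_i * x_i."""
    w = np.zeros(X.shape[1])                  # 1. initialize w = 0
    for _ in range(max_epochs):               # guard against non-separable data
        updated = False
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:          # 2. misclassified (0 treated as misclassified
                w = w + alpha * y_i * x_i     #    so the loop can leave w = 0)
                updated = True
        if not updated:
            break                             # no misclassification left: converged
    return w
```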

Example for perceptron algorithm in 2D

1. Initialize $w = 0$
2. While there is $i$ such that $f_w(x_i)\, y_i < 0$ do
   • $w := w - \alpha x_i\, \mathrm{sign}(f_w(x_i))$

At convergence: $w = \sum_{i=1}^{N} \alpha_i x_i$ (Zisserman 2015)
Example

• If the data is linearly separable, then the algorithm will converge
• Convergence can be slow …
• The separating line is close to the training data

(Zisserman 2015)

2 Support Vectors
What is the best 𝒘?

• Idea: the maximum margin solution is most stable under perturbations of the inputs

(Zisserman 2015)

Support Vector Machine

$$f_w(x) = \sum_i \alpha_i y_i\, x_i^T x + b$$

(Zisserman 2015)
SVM Optimization Problem (1)

• Distance $\delta(x_i, h)$ of data point $x_i$ from hyperplane $h$:

$$\delta(x_i, h) = \frac{y_i (w^T x_i + b)}{\lVert w \rVert}$$
SVM Optimization Problem (1)

• In general, the length of vector $w$ does not matter
• Fix $w$ such that the support vectors $x_s$ satisfy $y_s \cdot (w^T x_s + b) = 1$
• Then positive and negative support vectors have distance $\delta(x_s, h) = \frac{1}{\lVert w \rVert}$ from the hyperplane, which we want to maximize
• Standard formulation: minimize $\frac{\lVert w \rVert}{2}$ (the inverse of the margin)
Support Vector Machine

(Zisserman 2015)

SVM Optimization Problem (2)

$$\max_{w, b} \frac{2}{\lVert w \rVert}$$
subject to
$$y_i (w^T x_i + b) \ge 1, \quad \text{for } i = 1 \ldots N$$
Or equivalently
$$\min_{w, b} \lVert w \rVert^2$$
subject to the same constraints.
This is a quadratic optimization problem subject to linear constraints and there is a unique minimum.
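For illustration (not part of the slides), such a quadratic program can be handed to an off-the-shelf solver; a minimal sketch with the cvxpy library, assuming linearly separable data with labels in {-1, +1}:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# Hypothetical usage on toy data:
# X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
# y = np.array([1.0, 1.0, -1.0, -1.0])
# w, b = hard_margin_svm(X, y)
```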

Compare the two optimization criteria for classification of linearly separable data by linear regression with classification by SVM:
• Linear classification using regression: the decision line is the average between the regression lines; all data points are considered
• Linear classification using SVM: the decision line maximizes the margins between the support vectors; far-away data points are irrelevant
3 Soft Margin and
Hinge Loss
Re-visiting linear separability

• Points can be linearly separated, but only with a very narrow margin

(Zisserman 2015)

Re-visiting linear separability

• Points can be linearly separated, but only with a very narrow margin
• Possibly the large margin solution is better, even though one constraint is violated
• Trade-off between the margin and the number of mistakes on the training data

(Zisserman 2015)

Introduce „slack“ variables

(Zisserman 2015)
Soft margin solution
Revised optimization problem
$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$
subject to
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1 \ldots N$$
• Every constraint can be satisfied if $\xi_i$ is sufficiently large
• $C$ is a regularization parameter:
  - small $C$ allows constraints to be easily ignored ⟹ large margin
  - large $C$ makes constraints hard to ignore ⟹ narrow margin
  - $C = \infty$ enforces all constraints ⟹ hard margin
• Still a quadratic optimization problem with a unique minimum
• One hyperparameter $C$
Loss function

• Given the constraints:
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
• We can rewrite $\xi_i$ as:
$$\xi_i = \max(0, 1 - y_i f_w(x_i))$$

• Hence, we can optimize the unconstrained optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \underbrace{\lVert w \rVert^2}_{\text{regularization}} + \underbrace{\frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))}_{\text{loss function}}$$
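A small sketch (not from the slides) that evaluates this unconstrained objective for a given $w$, with the bias $b$ kept separate:

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """||w||^2 + (1/C) * sum_i max(0, 1 - y_i * f_w(x_i)) with f_w(x) = w^T x + b."""
    margins = y * (X @ w + b)                        # y_i * f_w(x_i)
    hinge = np.maximum(0.0, 1.0 - margins).sum()     # the slacks xi_i, summed
    return w @ w + hinge / C
```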


Loss function
$$\min_{w \in \mathbb{R}^d} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

(Zisserman 2015)


Hinge loss

(Zisserman 2015)

4 Gradient descent over
convex function
Gradient descent/ascent

https://en.wikipedia.org/wiki/Gradient_descent#/media/File:Gradient_descent.svg
Climb down a hill
Climb up a hill

Given a differentiable function $f(\boldsymbol{x})$ describing the height of the hill at position $\boldsymbol{x} = (x_1, \ldots, x_d)$.

How to climb up/down fastest?

Go in the direction where
$$\frac{df(\boldsymbol{x})}{d\boldsymbol{x}} = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$
is maximal/minimal.
In general, this challenge can be difficult.
Gradient Descent (- but without posts)

https://goo.gl/images/JKN6zm
Optimization continued

Questions
• Does this cost function have a unique solution?

Optimization continued

(Zisserman 2015)
Questions
• Does this cost function have a unique solution?
• Do we find it using gradient descent?
• Does the solution we find using gradient descent depend on the starting point?

To the rescue:
• If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over constraints that form a convex set – given in our case)
Convex functions

(Zisserman 2015)
Convex function examples

• A non-negative sum of convex functions is convex (Zisserman 2015)

Applied to hinge loss and regularization

Gradient descent algorithm for SVM
To minimize a cost function $\mathcal{C}(w)$, use the iterative update
$$w_{t+1} := w_t - \eta_t \nabla_w \mathcal{C}(w_t)$$
where $\eta_t$ is the learning rate.

Let's rewrite the minimization problem as an average, with $\lambda = \frac{2}{NC}$:
$$\mathcal{C}(w) = \frac{1}{NC} \lVert w \rVert^2 + \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lambda}{2} \lVert w \rVert^2 + \max(0, 1 - y_i f_w(x_i)) \right)$$
and $f_w(x_i) = w^T x_i + b$.
Sub-gradient for hinge loss

$$\mathcal{L}(x_i, y_i; w) = \max(0, 1 - y_i f_w(x_i)), \qquad f_w(x_i) = w^T x_i + b$$
A sub-gradient of the hinge loss w.r.t. $w$ is $-y_i x_i$ if $y_i f_w(x_i) < 1$, and $0$ otherwise.

(Zisserman 2015)
Sub-gradient descent algorithm for SVM
$$\mathcal{C}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lambda}{2} \lVert w \rVert^2 + \mathcal{L}(x_i, y_i; w) \right)$$

The iterative update is
$$w_{t+1} := w_t - \eta \nabla_w \mathcal{C}(w_t) = w_t - \eta \frac{1}{N} \sum_{i=1}^{N} \left( \lambda w_t + \nabla_w \mathcal{L}(x_i, y_i; w) \right)$$

Then each iteration $t$ involves cycling through the training data with the updates:
$$w_{t+1} := \begin{cases} w_t - \eta (\lambda w_t - y_i x_i), & \text{if } y_i f_w(x_i) < 1 \\ w_t - \eta \lambda w_t, & \text{otherwise} \end{cases}$$
Typical learning rate in Pegasos: $\eta_t = \frac{1}{\lambda t}$
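A compact sketch of this update rule in the Pegasos spirit (illustrative; a stochastic, one-example-at-a-time variant with the bias absorbed into $w$, not the reference implementation):

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Sub-gradient descent on (lambda/2)||w||^2 + hinge loss with eta_t = 1/(lambda*t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # cycle through the training data
            t += 1
            eta = 1.0 / (lam * t)             # learning rate from the slide
            if y[i] * (w @ X[i]) < 1:         # hinge active: sub-gradient is lam*w - y_i*x_i
                w -= eta * (lam * w - y[i] * X[i])
            else:                             # hinge inactive: only the regularizer contributes
                w -= eta * lam * w
    return w
```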

Questions 4:
Gradient descent



5 The dual problem
Primal vs dual problem
• SVM is a linear classifier: $f_w(x) = w^T x + b$
• The primal problem: an optimization problem over $w$:
$$\min_{w \in \mathbb{R}^d} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

• The dual problem: replacing $w$ by a slightly different representation of $f_w(x)$ leads to
$$f_w(x) = \sum_{i=1}^{N} \alpha_i y_i\, x_i^T x + b$$
and a new optimization problem with the same solution, but several advantages. Let us show this on the following slides...
Revisit Optimization Problem for Hard Margin Case
• Minimize the quadratic form
$$\frac{\lVert w \rVert^2}{2} = \frac{w^T w}{2}$$
• With constraints
$$y_i \cdot (w^T x_i + b) \ge 1 \quad \forall i$$

• The constraints will reach a value of 1 for at least one instance.

• Include the hard constraints in the loss function:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$
• Violated constraints “punish“ the objective function
Excursion: Lagrange Multiplier

• We want to maximize a function $f(x)$ under the constraint $g(x) = a$

Solution with Lagrange multiplier:
• Optimize the Lagrangian
$$f(x) - \lambda \left( g(x) - a \right)$$
instead!
Nicely visual explanation of Lagrange optimization at
https://www.svm-tutorial.com/2016/09/duality-lagrange-multipliers/
Algorithm for optimization with a Lagrange multiplier

1. Write down the Lagrangian $f(x) - \lambda \cdot (g(x) - a)$
2. Take the derivative of the Lagrangian w.r.t. $x$ and set it to 0, to find an estimate of $x$ that depends on $\lambda$
3. Plug your estimate of $x$ into the Lagrangian, take the derivative w.r.t. $\lambda$, and set it to 0, to find the optimal value of the Lagrange multiplier $\lambda$
4. Plug the Lagrange multiplier into your estimate for $x$
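A tiny worked example (illustrative, not from the slides): maximize $f(x, y) = x + y$ under the constraint $g(x, y) = x^2 + y^2 = 1$.

1. Lagrangian: $\Lambda(x, y, \lambda) = x + y - \lambda (x^2 + y^2 - 1)$
2. $\frac{\partial \Lambda}{\partial x} = 1 - 2\lambda x = 0 \Rightarrow x = \frac{1}{2\lambda}$, and likewise $y = \frac{1}{2\lambda}$
3. Plugging this back in and taking the derivative w.r.t. $\lambda$ enforces the constraint: $\frac{1}{4\lambda^2} + \frac{1}{4\lambda^2} = 1 \Rightarrow \lambda = \pm\frac{1}{\sqrt{2}}$
4. The positive root gives the maximum: $x = y = \frac{1}{\sqrt{2}}$, with $f(x, y) = \sqrt{2}$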

Revisit Optimization Problem for Hard Margin Case
• Minimize the quadratic form
$$\frac{\lVert w \rVert^2}{2} = \frac{w^T w}{2}$$
• With constraints
$$y_i \cdot (w^T x_i + b) \ge 1 \quad \forall i$$

• The constraints will reach a value of 1 for at least one instance.

• Include the hard constraints in the loss function:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$
where the $\alpha_i$ are the Lagrange multipliers
• Violated constraints “punish“ the objective function
Lagrangian primal problem
• The Lagrangian primal problem is:
$$\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$$
subject to $\forall i: \alpha_i \ge 0$

Finding the optimum

• The loss is a function of $w$, $b$, and $\alpha$:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$

• Find the optimum using derivatives:
$$\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; 0 = \sum_{i=1}^{N} \alpha_i y_i$$
$$\frac{\partial}{\partial w_k} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; w_k = \sum_{i=1}^{N} \alpha_i y_i x_{i,k} \;\Longrightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
⟹ $w$ is a linear combination of the data instances!
Substitution into ℒ(𝑤, 𝑏, 𝛼)
$$\mathcal{L}(w, b, \alpha)\Big|_{w = \sum_j \alpha_j y_j x_j} = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right) =$$
$$= \frac{1}{2} \left( \sum_{j=1}^{N} \alpha_j y_j x_j \right)^{\!T} \left( \sum_{k=1}^{N} \alpha_k y_k x_k \right) - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot \left( \sum_{j=1}^{N} \alpha_j y_j x_j^T x_i + b \right) - 1 \right) =$$
$$= \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k - \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j - b \underbrace{\sum_{i=1}^{N} \alpha_i y_i}_{=0} + \sum_{i=1}^{N} \alpha_i =$$
$$= \mathcal{L}(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$
Wolfe dual problem

$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$

subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

• This problem is solvable with quadratic programming, because it fulfills the Karush-Kuhn-Tucker conditions on the $\alpha_i$ that handle inequality constraints in the Lagrange optimization (not given here!).
• It gives us the classification function:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
$\alpha_i$ is positive if $x_i$ is a support vector
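For illustration (not from the slides), the Wolfe dual can be solved with a generic constrained optimizer; a minimal sketch using scipy, reasonable only for small, linearly separable data sets:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """Hard-margin dual: max sum_i a_i - 0.5 * sum_jk a_j a_k y_j y_k x_j^T x_k,
    subject to a_i >= 0 and sum_i a_i y_i = 0."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q[j,k] = y_j y_k x_j^T x_k

    def neg_dual(a):                                   # minimize the negated dual
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, np.zeros(n),
                   bounds=[(0, None)] * n,                              # alpha_i >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
    alpha = res.x
    w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                                  # support vectors: alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                     # from y_s (w^T x_s + b) = 1
    return alpha, w, b
```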
Non-separable Case (similar to before)
• Introduce (positive) „slack variables“ $\xi_i$ to allow deviations from the minimum distance:
$$y_i (w^T x_i + b) \ge 1 - \xi_i$$
• Include a penalizing term in the optimization function:
$$C \sum_{i=1}^{N} \xi_i$$

• Transform to the Lagrangian
  • with additional Lagrange multipliers for the slack variables, which are constrained to positive values ...

Summary: Primal and dual formulations

• Primal version of the classifier:
$$f_w(x) = w^T x + b$$
• Dual version of the classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$

The dual form classifier seems to work like a kNN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$ are zero; the ones that are non-zero define the support vectors $x_i$.

Summary: Primal and dual formulations
• The Lagrangian primal problem is:
$$\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$$
subject to $\forall i: \alpha_i \ge 0$

• The Lagrangian dual problem is:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

6 Kernelization Tricks
in SVMs
Non-linear Case

• Not all classes can be separated via a hyperplane
• Essential:
  • The dual representation uses only the inner product of data instances:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
  • $x_i$: i-th training instance
  • $\alpha_i$: weight for the i-th training instance
• The same holds for the Lagrangian...


Feature engineering using $\phi(x)$
(Cf. the lecture on regression, chapter “Beyond linear input”)

• Classifier: Given $x_i \in \mathbb{R}^d$, $\phi: \mathbb{R}^d \to \mathbb{R}^D$, $w \in \mathbb{R}^D$:
$$f_w(x) = w^T \phi(x) + b$$
• Learning:
$$\min_{w \in \mathbb{R}^D} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

Example 1: From 1-dim to 2-dim

Example 2: From 2-dim to 3-dim

(Zisserman 2015)

Feature engineering using 𝝓(𝒙)

• Classifier: Given $x_i \in \mathbb{R}^d$, $\phi: \mathbb{R}^d \to \mathbb{R}^D$, $w \in \mathbb{R}^D$:
$$f_w(x) = w^T \phi(x) + b$$
• Learning:
$$\min_{w \in \mathbb{R}^D} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

• $\phi(x)$ maps to a high-dimensional space $\mathbb{R}^D$ where the data is separable
• If $D \gg d$ then there are many more parameters of $w$ to learn
Dual classifier in transformed feature space
Classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
$$\Longrightarrow\quad f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^T \phi(x) + b$$
Learning:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\Longrightarrow\quad \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

$\phi(x)$ only occurs in pairs $\phi(x_j)^T \phi(x_i)$ ⟹ kernels: $k(x_j, x_i) = \phi(x_j)^T \phi(x_i)$

(Zisserman 2015)
Dual classifier using kernels
Classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^T \phi(x) + b$$
$$\Longrightarrow\quad f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot k(x_i, x) + b$$

Learning:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$$
$$\Longrightarrow\quad \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$
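A minimal sketch (not from the slides) of this kernelized decision rule, assuming dual coefficients `alpha` and bias `b` are already available (e.g. from a dual solver) and `kernel` is any kernel function $k(x, x')$:

```python
import numpy as np

def kernel_predict(x, X_train, y_train, alpha, b, kernel):
    """Return the sign of f_w(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    score = sum(a_i * y_i * kernel(x_i, x)
                for a_i, y_i, x_i in zip(alpha, y_train, X_train)) + b
    return np.sign(score)
```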
Example kernels
• Linear kernels: $k(x, x') = x^T x'$
• Polynomial kernels: $k(x, x') = (1 + x^T x')^d$, for any $d > 0$
  • Contains all polynomial terms up to degree $d$
• Gaussian kernels: $k(x, x') = e^{-\lVert x - x' \rVert^2 / (2\sigma^2)}$, for $\sigma > 0$
  • Infinite-dimensional feature space
  • Also called radial basis function (RBF) kernel
  • Often works quite well!
• Graph kernels: random walk
• String kernels: ...
• Build your own kernel for your own problem!
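Illustrative numpy versions of the first three kernels (my own sketch, not from the slides):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, sigma=1.0):          # a.k.a. RBF kernel
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))
```

In practice one would typically not solve the dual by hand; libraries such as scikit-learn offer soft-margin kernel SVMs along these lines (e.g. `sklearn.svm.SVC(kernel='rbf', C=1.0)`).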
Summary on kernels

• „Instead of inventing funny non-linear features, we may directly invent funny kernels“ (Toussaint 2019)
• Inventing a kernel is intuitive:
  • $k(x, x')$ expresses how correlated $y$ and $y'$ should be
  • it is a measure of similarity; it compares $x$ and $x'$
  • Specifying how ‘comparable’ $x$ and $x'$ are is often more intuitive than defining “features that might work”

Background reading and more

• Smooth reading about SVMs: Alexandre Kowalczyk, Support Vector Machines Succinctly. Syncfusion. Free ebook:
https://www.syncfusion.com/ebooks/support_vector_machines_succinctly
  • Also covers the most efficient algorithms for finding support vectors (it is neither of the two presented here!)
Questions 6:
Kernel



7 Transductive
Classification
Transductive learning characteristics

Characteristics:
• Training data AND test data are known at learning time
• Learning happens specifically for the given test cases

Use cases:
• news recommender
• spam classifier
• document reorganization

Thorsten Joachims:
Transductive Inference for Text Classification using Support Vector
Machines. ICML 1999: 200-209
Maximum margin hyperplane

Training data $\ldots, (\vec{x}_i, y_i), \ldots$
Test data $\ldots, \vec{x}_j^*, \ldots$

Loss function:
$$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*$$
subject to:
$$\forall_{i=1}^{n}: \; y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i$$
$$\forall_{j=1}^{k}: \; y_j^* (\vec{w} \cdot \vec{x}_j^* + b) \ge 1 - \xi_j^*$$
$$\forall_{i=1}^{n}: \; \xi_i > 0$$
$$\forall_{j=1}^{k}: \; \xi_j^* > 0$$

Naive, intractable approach:
• for every labeling hyperplane:
  • classify the $\vec{x}_j^*$
  • compute the loss
Reuters data set experiments (3299 test documents)

Reuters data set experiments (17 training documents)

IPVS

Thank you!

Steffen Staab

E-Mail Steffen.staab@ipvs.uni-stuttgart.de
Phone +49 (0) 711 685-To be defined
www.ipvs.uni-stuttgart.de/departments/ac/

Universität Stuttgart
Analytic Computing, IPVS
Universitätsstraße 32, 70569 Stuttgart
