
IPVS – Institute for Parallel and Distributed Systems

Analytic Computing

8 Support Vector Machines

Prof. Dr. Steffen Staab


https://www.ipvs.uni-stuttgart.de/departments/ac/
• based on slides by
• Thomas Gottron, U. Koblenz-Landau,
https://west.uni-koblenz.de/de/studying/courses/ws1718/machine-learning-and-data-mining-1

• Andrew Zisserman, http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

1 Perceptron Algorithm
Binary classification

• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,

• Learn a classifier

$$f_w(x_i) \begin{cases} > 0, & \text{if } y_i = +1 \\ < 0, & \text{if } y_i = -1 \end{cases}$$

• Correct classification:
$f_w(x_i)\, y_i > 0$
Linear separability

(Zisserman 2015)

Linear classifiers
(Related to, but different from the chapter „From linear regression to classification“ in 4-LogisticRegression)

• A linear classifier has the form

$$f_w(x) = w^T x + b$$

• In 2D the discriminant is a line


• $w$ is the normal to the line, and $b$ the bias
• $w$ is known as the weight vector
(Zisserman 2015)

Linear classifiers

• Let's assume $x_i = (1, x_{i,1}, \ldots, x_{i,d})$ (as in linear regression)

• Then we write
$$f_w(x) = w^T x$$

(Zisserman 2015)
(the bias $b$ is absorbed into $w$ via $x_0 = 1$)
Linear classifiers

• A linear classifier has the form

$$f_w(x) = w^T x$$
• In 3D the discriminant is a plane
• In n-dim the discriminant is a
hyperplane
• Only $w$ (including $b$) is needed to classify new data
(Zisserman 2015)
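As a small aside (not from the slides), a minimal sketch of how such a linear classifier assigns labels once $w$ and $b$ are known:

```python
import numpy as np

def predict_linear(w, b, X):
    """Assign labels with the linear classifier f_w(x) = w^T x + b."""
    scores = X @ w + b                  # one score per row of X
    return np.where(scores >= 0, 1, -1)
```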

The perceptron classifier

• Given training data $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$,
• how to find a weight vector w, the separating hyperplane,
such that the two categories are separated for the dataset?

• Perceptron algorithm
1. Initialize $w = 0$
2. While there is $i$ such that $f_w(x_i)\, y_i < 0$ do
   • $w := w - \alpha x_i\, \mathrm{sign}(f_w(x_i)) = w + \alpha x_i y_i$ (not applied when $\mathrm{sign}(\cdot) = 0$)
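A minimal Python sketch of this procedure (illustrative; assumes labels in {-1, +1} and, as on the previous slides, a bias column $x_0 = 1$ already added to the data):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=1000):
    """Perceptron training: while some point is misclassified, set w := w + alpha * y_i * x_i."""
    w = np.zeros(X.shape[1])                  # 1. initialize w = 0
    for _ in range(max_epochs):               # guard against non-separable data
        updated = False
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:          # 2. misclassified (0 treated as misclassified
                w = w + alpha * y_i * x_i     #    so the loop can leave w = 0)
                updated = True
        if not updated:
            break                             # no misclassification left: converged
    return w
```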

Example for perceptron algorithm in 2D

1. Initialize $w = 0$
2. While there is $i$ such that $f_w(x_i)\, y_i < 0$ do
   • $w := w - \alpha x_i\, \mathrm{sign}(f_w(x_i))$

At convergence: $w = \sum_{i=1}^{N} \alpha_i x_i$ (Zisserman 2015)
Example

• If the data is linearly separable, then the algorithm will converge
• Convergence can be slow …
• The separating line is close to the training data

(Zisserman 2015)

2 Support Vectors
What is the best 𝒘?

• Idea: the maximum margin solution is most stable under perturbations of the inputs

(Zisserman 2015)

Support Vector Machine

$$f_w(x) = \sum_i \alpha_i y_i\, x_i^T x + b$$

(Zisserman 2015)
SVM Optimization Problem (1)

• Distance $\delta(x_i, h)$ of data point $x_i$ from hyperplane $h$:

$$\delta(x_i, h) = \frac{y_i (w^T x_i + b)}{\lVert w \rVert}$$
SVM Optimization Problem (1)

• In general, the length of vector $w$ does not matter
• Fix $w$ such that the support vectors $x_s$ satisfy $y_s \cdot (w^T x_s + b) = 1$
• Then positive and negative support vectors have distance $\delta(x_s, h) = \frac{1}{\lVert w \rVert}$ from the hyperplane, which we want to maximize
• Standard formulation: minimize $\frac{\lVert w \rVert}{2}$ (the inverse of the margin)
Support Vector Machine

(Zisserman 2015)

SVM Optimization Problem (2)

$$\max_{w, b} \frac{2}{\lVert w \rVert}$$
subject to
$$y_i (w^T x_i + b) \ge 1, \quad \text{for } i = 1 \ldots N$$
Or equivalently
$$\min_{w, b} \lVert w \rVert^2$$
subject to the same constraints.
This is a quadratic optimization problem subject to linear constraints and there is a unique minimum.
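For illustration (not part of the slides), such a quadratic program can be handed to an off-the-shelf solver; a minimal sketch with the cvxpy library, assuming linearly separable data with labels in {-1, +1}:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# Hypothetical usage on toy data:
# X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
# y = np.array([1.0, 1.0, -1.0, -1.0])
# w, b = hard_margin_svm(X, y)
```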

Compare the two optimization criteria for classification of linearly separable data by linear regression with classification by SVM:
• Linear classification using regression: the decision line is the average between the regression lines; all data points are considered
• Linear classification using SVM: the decision line maximizes the margins between the support vectors; far-away data points are irrelevant
3 Soft Margin and
Hinge Loss
Re-visiting linear separability

• Points can be linearly separated, but only with a very narrow margin

(Zisserman 2015)

Re-visiting linear separability

• Points can be linearly separated, but only with a very narrow margin
• Possibly the large margin solution is better, even though one constraint is violated
• Trade-off between the margin and the number of mistakes on the training data

(Zisserman 2015)

Introduce „slack“ variables

(Zisserman 2015)
Soft margin solution
Revised optimization problem
$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i$$
subject to
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \text{for } i = 1 \ldots N$$
• Every constraint can be satisfied if $\xi_i$ is sufficiently large
• $C$ is a regularization parameter:
  - small $C$ allows constraints to be easily ignored ⟹ large margin
  - large $C$ makes constraints hard to ignore ⟹ narrow margin
  - $C = \infty$ enforces all constraints ⟹ hard margin
• Still a quadratic optimization problem with a unique minimum
• One hyperparameter $C$
Loss function

• Given the constraints:
$$y_i (w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
• We can rewrite $\xi_i$ as:
$$\xi_i = \max(0, 1 - y_i f_w(x_i))$$

• Hence, we can optimize the unconstrained optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \underbrace{\lVert w \rVert^2}_{\text{regularization}} + \underbrace{\frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))}_{\text{loss function}}$$
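A small sketch (not from the slides) that evaluates this unconstrained objective for a given $w$, with the bias $b$ kept separate:

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """||w||^2 + (1/C) * sum_i max(0, 1 - y_i * f_w(x_i)) with f_w(x) = w^T x + b."""
    margins = y * (X @ w + b)                        # y_i * f_w(x_i)
    hinge = np.maximum(0.0, 1.0 - margins).sum()     # the slacks xi_i, summed
    return w @ w + hinge / C
```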


Loss function
$$\min_{w \in \mathbb{R}^d} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

(Zisserman 2015)


Hinge loss

(Zisserman 2015)

4 Gradient descent over
convex function
Gradient descent/ascent

https://en.wikipedia.org/wiki/Gradient_descent#/media/File:Gradient_descent.svg
Climb down a hill
Climb up a hill

Given a differentiable function $f(\boldsymbol{x})$ describing the height of the hill at position $\boldsymbol{x} = (x_1, \ldots, x_d)$.

How to climb up/down fastest?

Go in the direction where
$$\frac{df(\boldsymbol{x})}{d\boldsymbol{x}} = \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$
is maximal/minimal.
In general, this challenge can be difficult.
Gradient Descent (- but without posts)

https://goo.gl/images/JKN6zm
Optimization continued

Questions
• Does this cost function have a unique solution?

Optimization continued

(Zisserman 2015)
Questions
• Does this cost function have a unique solution?
• Do we find it using gradient descent?
• Does the solution we find using gradient descent depend on the starting point?

To the rescue:
• If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over constraints that form a convex set – given in our case)
Convex functions

(Zisserman 2015)
Convex function examples

• A non-negative sum of convex functions is convex (Zisserman 2015)

Applied to hinge loss and regularization

Gradient descent algorithm for SVM
To minimize a cost function $\mathcal{C}(w)$, use the iterative update
$$w_{t+1} := w_t - \eta_t \nabla_w \mathcal{C}(w_t)$$
where $\eta_t$ is the learning rate.

Let's rewrite the minimization problem as an average, with $\lambda = \frac{2}{NC}$:
$$\mathcal{C}(w) = \frac{1}{NC} \lVert w \rVert^2 + \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lambda}{2} \lVert w \rVert^2 + \max(0, 1 - y_i f_w(x_i)) \right)$$
and $f_w(x_i) = w^T x_i + b$.
Sub-gradient for hinge loss

$$\mathcal{L}(x_i, y_i; w) = \max(0, 1 - y_i f_w(x_i)), \qquad f_w(x_i) = w^T x_i + b$$
A sub-gradient of the hinge loss w.r.t. $w$ is $-y_i x_i$ if $y_i f_w(x_i) < 1$, and $0$ otherwise.

(Zisserman 2015)
Sub-gradient descent algorithm for SVM
$$\mathcal{C}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lambda}{2} \lVert w \rVert^2 + \mathcal{L}(x_i, y_i; w) \right)$$

The iterative update is
$$w_{t+1} := w_t - \eta \nabla_w \mathcal{C}(w_t) = w_t - \eta \frac{1}{N} \sum_{i=1}^{N} \left( \lambda w_t + \nabla_w \mathcal{L}(x_i, y_i; w) \right)$$

Then each iteration $t$ involves cycling through the training data with the updates:
$$w_{t+1} := \begin{cases} w_t - \eta (\lambda w_t - y_i x_i), & \text{if } y_i f_w(x_i) < 1 \\ w_t - \eta \lambda w_t, & \text{otherwise} \end{cases}$$
Typical learning rate in Pegasos: $\eta_t = \frac{1}{\lambda t}$
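A compact sketch of this update rule in the Pegasos spirit (illustrative; a stochastic, one-example-at-a-time variant with the bias absorbed into $w$, not the reference implementation):

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Sub-gradient descent on (lambda/2)||w||^2 + hinge loss with eta_t = 1/(lambda*t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # cycle through the training data
            t += 1
            eta = 1.0 / (lam * t)             # learning rate from the slide
            if y[i] * (w @ X[i]) < 1:         # hinge active: sub-gradient is lam*w - y_i*x_i
                w -= eta * (lam * w - y[i] * X[i])
            else:                             # hinge inactive: only the regularizer contributes
                w -= eta * lam * w
    return w
```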

Questions 4:
Gradient descent



5 The dual problem
Primal vs dual problem
• SVM is a linear classifier: $f_w(x) = w^T x + b$
• The primal problem: an optimization problem over $w$:
$$\min_{w \in \mathbb{R}^d} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

• The dual problem: replacing $w$ by a slightly different representation of $f_w(x)$ leads to
$$f_w(x) = \sum_{i=1}^{N} \alpha_i y_i\, x_i^T x + b$$
and a new optimization problem with the same solution, but several advantages. Let us show this on the following slides...
Revisit Optimization Problem for Hard Margin Case
• Minimize the quadratic form
$$\frac{\lVert w \rVert^2}{2} = \frac{w^T w}{2}$$
• With constraints
$$y_i \cdot (w^T x_i + b) \ge 1 \quad \forall i$$

• The constraints will reach a value of 1 for at least one instance.

• Include the hard constraints in the loss function:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$
• Violated constraints “punish“ the objective function
Excursion: Lagrange Multiplier

• We want to maximize a function $f(x)$ under the constraint $g(x) = a$

Solution with Lagrange multiplier:
• Optimize the Lagrangian
$$f(x) - \lambda \left( g(x) - a \right)$$
instead!
Nicely visual explanation of Lagrange optimization at
https://www.svm-tutorial.com/2016/09/duality-lagrange-multipliers/
Algorithm for optimization with a Lagrange multiplier

1. Write down the Lagrangian $f(x) - \lambda \cdot (g(x) - a)$
2. Take the derivative of the Lagrangian w.r.t. $x$ and set it to 0, to find an estimate of $x$ that depends on $\lambda$
3. Plug your estimate of $x$ into the Lagrangian, take the derivative w.r.t. $\lambda$, and set it to 0, to find the optimal value of the Lagrange multiplier $\lambda$
4. Plug the Lagrange multiplier into your estimate for $x$
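A tiny worked example (illustrative, not from the slides): maximize $f(x, y) = x + y$ under the constraint $g(x, y) = x^2 + y^2 = 1$.

1. Lagrangian: $\Lambda(x, y, \lambda) = x + y - \lambda (x^2 + y^2 - 1)$
2. $\frac{\partial \Lambda}{\partial x} = 1 - 2\lambda x = 0 \Rightarrow x = \frac{1}{2\lambda}$, and likewise $y = \frac{1}{2\lambda}$
3. Plugging this back in and taking the derivative w.r.t. $\lambda$ enforces the constraint: $\frac{1}{4\lambda^2} + \frac{1}{4\lambda^2} = 1 \Rightarrow \lambda = \pm\frac{1}{\sqrt{2}}$
4. The positive root gives the maximum: $x = y = \frac{1}{\sqrt{2}}$, with $f(x, y) = \sqrt{2}$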

Revisit Optimization Problem for Hard Margin Case
• Minimize the quadratic form
$$\frac{\lVert w \rVert^2}{2} = \frac{w^T w}{2}$$
• With constraints
$$y_i \cdot (w^T x_i + b) \ge 1 \quad \forall i$$

• The constraints will reach a value of 1 for at least one instance.

• Include the hard constraints in the loss function:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$
where the $\alpha_i$ are the Lagrange multipliers
• Violated constraints “punish“ the objective function
Lagrangian primal problem
• The Lagrangian primal problem is:
$$\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$$
subject to $\forall i: \alpha_i \ge 0$

Finding the optimum

• The loss is a function of $w$, $b$, and $\alpha$:
$$\mathcal{L}(w, b, \alpha) = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$$

• Find the optimum using derivatives:
$$\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; 0 = \sum_{i=1}^{N} \alpha_i y_i$$
$$\frac{\partial}{\partial w_k} \mathcal{L}(w, b, \alpha) = 0 \;\Longrightarrow\; w_k = \sum_{i=1}^{N} \alpha_i y_i x_{i,k} \;\Longrightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
⟹ $w$ is a linear combination of the data instances!
Substitution into ℒ(𝑤, 𝑏, 𝛼)
$$\mathcal{L}(w, b, \alpha)\Big|_{w = \sum_j \alpha_j y_j x_j} = \frac{\lVert w \rVert^2}{2} - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right) =$$
$$= \frac{1}{2} \left( \sum_{j=1}^{N} \alpha_j y_j x_j \right)^{\!T} \left( \sum_{k=1}^{N} \alpha_k y_k x_k \right) - \sum_{i=1}^{N} \alpha_i \left( y_i \cdot \left( \sum_{j=1}^{N} \alpha_j y_j x_j^T x_i + b \right) - 1 \right) =$$
$$= \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k - \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j - b \underbrace{\sum_{i=1}^{N} \alpha_i y_i}_{=0} + \sum_{i=1}^{N} \alpha_i =$$
$$= \mathcal{L}(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$
Wolfe dual problem

$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$

subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

• This problem is solvable with quadratic programming, because it fulfills the Karush-Kuhn-Tucker conditions on the $\alpha_i$ that handle inequality constraints in the Lagrange optimization (not given here!).
• It gives us the classification function:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
$\alpha_i$ is positive if $x_i$ is a support vector
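For illustration (not from the slides), the Wolfe dual can be solved with a generic constrained optimizer; a minimal sketch using scipy, reasonable only for small, linearly separable data sets:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """Hard-margin dual: max sum_i a_i - 0.5 * sum_jk a_j a_k y_j y_k x_j^T x_k,
    subject to a_i >= 0 and sum_i a_i y_i = 0."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q[j,k] = y_j y_k x_j^T x_k

    def neg_dual(a):                                   # minimize the negated dual
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, np.zeros(n),
                   bounds=[(0, None)] * n,                              # alpha_i >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
    alpha = res.x
    w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                                  # support vectors: alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)                     # from y_s (w^T x_s + b) = 1
    return alpha, w, b
```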
Non-separable Case (similar to before)
• Introduce (positive) „slack variables“ $\xi_i$ to allow deviations from the minimum distance:
$$y_i (w^T x_i + b) \ge 1 - \xi_i$$
• Include a penalizing term in the optimization function:
$$C \sum_{i=1}^{N} \xi_i$$

• Transform to the Lagrangian
  • with additional Lagrange multipliers for the slack variables, which are constrained to positive values ...

Summary: Primal and dual formulations

• Primal version of the classifier:
$$f_w(x) = w^T x + b$$
• Dual version of the classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$

The dual form classifier seems to work like a kNN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$ are zero; the ones that are non-zero define the support vectors $x_i$.

Summary: Primal and dual formulations
• The Lagrangian primal problem is:
$$\min_{w, b} \max_{\alpha} \mathcal{L}(w, b, \alpha)$$
subject to $\forall i: \alpha_i \ge 0$

• The Lagrangian dual problem is:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k\, x_j^T x_k$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

6 Kernelization Tricks
in SVMs
Non-linear Case

• Not all classes can be separated via a hyperplane
• Essential:
  • The dual representation uses only the inner product of data instances:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
  • $x_i$: i-th training instance
  • $\alpha_i$: weight for the i-th training instance
• The same holds for the Lagrangian...


Feature engineering using $\phi(x)$
(Cf. the lecture on regression, chapter “Beyond linear input”)

• Classifier: Given $x_i \in \mathbb{R}^d$, $\phi: \mathbb{R}^d \to \mathbb{R}^D$, $w \in \mathbb{R}^D$:
$$f_w(x) = w^T \phi(x) + b$$
• Learning:
$$\min_{w \in \mathbb{R}^D} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

Example 1: From 1-dim to 2-dim

Example 2: From 2-dim to 3-dim

(Zisserman 2015)

Feature engineering using 𝝓(𝒙)

• Classifier: Given $x_i \in \mathbb{R}^d$, $\phi: \mathbb{R}^d \to \mathbb{R}^D$, $w \in \mathbb{R}^D$:
$$f_w(x) = w^T \phi(x) + b$$
• Learning:
$$\min_{w \in \mathbb{R}^D} \lVert w \rVert^2 + \frac{1}{C} \sum_{i=1}^{N} \max(0, 1 - y_i f_w(x_i))$$

• $\phi(x)$ maps to a high-dimensional space $\mathbb{R}^D$ where the data is separable
• If $D \gg d$ then there are many more parameters of $w$ to learn
Dual classifier in transformed feature space
Classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot x_i^T x + b$$
$$\Longrightarrow\quad f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^T \phi(x) + b$$
Learning:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\Longrightarrow\quad \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$

$\phi(x)$ only occurs in pairs $\phi(x_j)^T \phi(x_i)$ ⟹ kernels: $k(x_j, x_i) = \phi(x_j)^T \phi(x_i)$

(Zisserman 2015)
Dual classifier using kernels
Classifier:
$$f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot \phi(x_i)^T \phi(x) + b$$
$$\Longrightarrow\quad f_w(x) = \sum_{i=1}^{N} \alpha_i \cdot y_i \cdot k(x_i, x) + b$$

Learning:
$$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$$
$$\Longrightarrow\quad \max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
subject to $\forall i: \alpha_i \ge 0$, and $0 = \sum_{i=1}^{N} \alpha_i y_i$
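A minimal sketch (not from the slides) of this kernelized decision rule, assuming dual coefficients `alpha` and bias `b` are already available (e.g. from a dual solver) and `kernel` is any kernel function $k(x, x')$:

```python
import numpy as np

def kernel_predict(x, X_train, y_train, alpha, b, kernel):
    """Return the sign of f_w(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    score = sum(a_i * y_i * kernel(x_i, x)
                for a_i, y_i, x_i in zip(alpha, y_train, X_train)) + b
    return np.sign(score)
```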
Example kernels
• Linear kernels: $k(x, x') = x^T x'$
• Polynomial kernels: $k(x, x') = (1 + x^T x')^d$, for any $d > 0$
  • Contains all polynomial terms up to degree $d$
• Gaussian kernels: $k(x, x') = e^{-\lVert x - x' \rVert^2 / (2\sigma^2)}$, for $\sigma > 0$
  • Infinite-dimensional feature space
  • Also called radial basis function (RBF) kernel
  • Often works quite well!
• Graph kernels: random walk
• String kernels: ...
• Build your own kernel for your own problem!
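Illustrative numpy versions of the first three kernels (my own sketch, not from the slides):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, sigma=1.0):          # a.k.a. RBF kernel
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))
```

In practice one would typically not solve the dual by hand; libraries such as scikit-learn offer soft-margin kernel SVMs along these lines (e.g. `sklearn.svm.SVC(kernel='rbf', C=1.0)`).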
Summary on kernels

• „Instead of inventing funny non-linear features, we may directly invent funny kernels“ (Toussaint 2019)
• Inventing a kernel is intuitive:
  • $k(x, x')$ expresses how correlated $y$ and $y'$ should be
  • it is a measure of similarity; it compares $x$ and $x'$
  • Specifying how ‘comparable’ $x$ and $x'$ are is often more intuitive than defining “features that might work”

Background reading and more

• Smooth reading about SVMs: Alexandre Kowalczyk, Support Vector Machines Succinctly. Syncfusion. Free ebook:
https://www.syncfusion.com/ebooks/support_vector_machines_succinctly
  • Also covers the most efficient algorithms for finding support vectors (it is neither of the two presented here!)
Questions 6:
Kernel



7 Transductive
Classification
Transductive learning characteristics

Characteristics:
• Training data AND test data are known at learning time
• Learning happens specifically for the given test cases

Use cases:
• news recommender
• spam classifier
• document reorganization

Thorsten Joachims:
Transductive Inference for Text Classification using Support Vector
Machines. ICML 1999: 200-209
Maximum margin hyperplane

Training data $\ldots, (\vec{x}_i, y_i), \ldots$
Test data $\ldots, \vec{x}_j^*, \ldots$

Loss function:
$$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*$$
subject to:
$$\forall_{i=1}^{n}: \; y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i$$
$$\forall_{j=1}^{k}: \; y_j^* (\vec{w} \cdot \vec{x}_j^* + b) \ge 1 - \xi_j^*$$
$$\forall_{i=1}^{n}: \; \xi_i > 0$$
$$\forall_{j=1}^{k}: \; \xi_j^* > 0$$

Naive, intractable approach:
• for every labeling hyperplane:
  • classify the $\vec{x}_j^*$
  • compute the loss
Reuters data set experiments (3299 test documents)

Reuters data set experiments (17 training documents)

IPVS

Thank you!

Steffen Staab

E-Mail Steffen.staab@ipvs.uni-stuttgart.de
Phone +49 (0) 711 685-To be defined
www.ipvs.uni-stuttgart.de/departments/ac/

Universität Stuttgart
Analytic Computing, IPVS
Universitätsstraße 32, 70569 Stuttgart
