
Pattern Recognition

(Pattern Classification)
Support Vector Machine (SVM)
Hypothesis set and Algorithm

Second Edition
Recall from Chapter 1
True error bound and linear hypotheses
• Linear in input space: $h(x) = w \cdot x + b$

• Linear in feature space: $h(x) = w \cdot \Phi(x) + b$

$\Lambda \ge \|h\|_{\mathbb{H}} = \mathcal{R}(W) = \|W\|_2 = \sqrt{w_1^2 + w_2^2 + w_3^2 + \dots}$

The learning algorithm tries to minimize the upper bound of the true error by finding, for a given $\lambda \ge 0$, the hypothesis

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big) = \underset{h\in H}{\mathrm{argmin}}\;L(W)$

where $\hat{R}_S(h)$ is the empirical risk, $\lambda\,\mathcal{R}(h)$ is the complexity (regularizing) term, and $L(W)$ is an upper bound of the true risk (loss):

$L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(W)$

$\mathcal{R}(W) = \|W\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

This chapter is mostly based on:


Foundations of Machine Learning, 2nd Ed., by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar,
MIT Press, 2018

1- Binary Support Vector Machine
Binary Support Vector Machines
• SVM is one of the most theoretically well-motivated and practically most
effective classification algorithms in modern machine learning.
• We first introduce the algorithm for the consistent case (H contains the concept
to learn), then present its general version designed for the inconsistent
case, and finally provide a theoretical foundation for SVMs based on the
notion of margin

SVM: a small generalization error learning
machine
• Consider an input space X that is a subset of $\mathbb{R}^N$ with $N \ge 1$, an output or target space $Y = \{-1, +1\}$,
and let $f : X \to Y$ be the target function (concept)
• Given a hypothesis set H of functions mapping X to Y, the binary classification task is
formulated as follows
• The learner receives a training sample S of size m drawn i.i.d. from X according to some
unknown distribution D,
$S = \big((x_1, y_1), \dots, (x_m, y_m)\big)$, with $y_i = f(x_i)$ for all $i \in [m]$

• The problem consists of determining a hypothesis $h \in$ H, a binary classifier, with small
generalization error: $R_D(h) = \mathbb{P}_{x \sim D}\big[h(x) \ne f(x)\big]$

H : ρ-margin linear hyperplane set
• Different hypothesis sets H can be selected for this task. In view of
Occam's razor principle, hypothesis sets with smaller complexity and
smaller VC-dimension provide better learning guarantees, when
everything else is equal
• A natural hypothesis set with relatively small complexity is that of
linear classifiers, or hyperplanes, which can be defined as follows:

$H = \{\, x \mapsto \mathrm{sign}(w \cdot x + b) \ :\ w \in \mathbb{R}^N,\ b \in \mathbb{R} \,\}$   (5.2)

• The learning problem is then referred to as a linear classification problem

H : ρ-margin linear hyperplane set
• The general equation of a hyperplane in $\mathbb{R}^N$ is $w \cdot x + b = 0$, where $w \in \mathbb{R}^N$ is a non-zero vector
normal to the hyperplane and $b \in \mathbb{R}$ is a scalar.

• A hypothesis $h \in$ H of the form $x \mapsto \mathrm{sign}(w \cdot x + b)$ thus labels positively all points falling on one
side of the hyperplane $w \cdot x + b = 0$ and negatively all others

• The definition of the SVM solution is based on the notion of margin

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

2- Binary SVM: consistent case
(H contains the concept to learn)
(Separable case)
Binary SVM - Consistent Case
• Concept class (positive class) :
• Negative class :

• Margin loss variable:

• Consistent case: h can be learned such that there is no margin loss for any training example

Binary SVM Margin Loss functions:
• Score: $s_i = y_i\,h(x_i)$

• Hinge loss function: $\Phi_\rho\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i h(x_i)\big)$

[Figure: hinge loss $1 - y_i h(x_i)$ plotted against the score $s_i = y_i h(x_i)$]

Regularization-based algorithm 1
• Upper bound of the true risk:

$\mathcal{R}(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

$R(h) \le \mathcal{L}(w, b) = \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

where the first term is the empirical margin loss $\hat{R}_{S,\rho=1}(h)$, the second term is the weighted regularizer,
$\mathcal{L}(w, b)$ is an upper bound of the true risk, and $\lambda \ge 0$ is the regularization parameter

• The solution $w$ and $b$ of the optimization problem $\min_{w,b}\ \mathcal{L}(w, b)$ gives the SVM hypothesis
Regularization-based algorithm 1.
• Regularization-based algorithm, recalled from Chapter 1:

$\underset{h\in H}{\mathrm{argmin}}\;L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(W)$

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$

hinge loss: $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

Regularization-based algorithm 1..
• Finding $(w, b)$ that has the minimum regularizer and no empirical loss

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores at 1 or more, subject to: $y_i(w \cdot x_i + b) \ge 1,\ \ i \in [m]$

• (5.7) and (5.48.1) are convex optimization problems and specific
instances of quadratic programming (QP)

Regularization-based algorithm 2
• Regularization-based algorithm with $C = 1/\lambda \ge 0$ and per-example weights $\alpha_i \equiv 1/(m\lambda) \ge 0$:

$\underset{h\in H}{\mathrm{argmin}}\;L(W) = \mathcal{R}(W) + \sum_{i=1}^{m}\alpha_i\, L_i\big(h(x_i, W),\, y_i\big)$

$\min_{w,b,\boldsymbol{\alpha}}\ L(W) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big), \qquad \boldsymbol{\alpha} = [\alpha_1 \dots \alpha_i \dots \alpha_m]$

hinge loss: $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

Dual Problem for Algorithm 2: Lagrangian function
• Lagrangian function associated to problem (5.7):

$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big)$   (5.8)

The Lagrangian is a weighted upper bound of the true risk; the $\alpha_i \ge 0$ are the Lagrange variables

• The solution $w$ and $b$ at the saddle point of $\mathcal{L}(w, b, \boldsymbol{\alpha})$ (minimum over $w, b$, maximum over $\boldsymbol{\alpha} \ge 0$) is the
solution of the primal problem (5.7)

Algorithm - QP solvers
• A variety of commercial and open-source solvers are available for
solving convex QP problems (see appendix).
• Specialized algorithms have been developed to solve this particular
convex QP problem more efficiently.
Conditions at the solution:
• Set the gradient of the Lagrangian with respect to the primal variables $w$ and $b$ to
zero: $\nabla_w \mathcal{L} = 0$, $\nabla_b \mathcal{L} = 0$
• Set the weighted terms $\alpha_i\big(1 - y_i(w \cdot x_i + b)\big) = 0$ for all $i$
(the complementary slackness conditions)

Derivatives of the Lagrangian function
$h(x_i) = w \cdot x_i + b$

$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.9)

$\nabla_b \mathcal{L} = -\sum_{i=1}^{m}\alpha_i y_i = 0 \ \Rightarrow\ \sum_{i=1}^{m}\alpha_i y_i = 0$   (5.10)

no empirical loss: $\forall i,\ \alpha_i = 0 \ \lor\ y_i(w \cdot x_i + b) = 1$   (5.11)

$x_i$ is called a support vector (support example) when $\alpha_i > 0$

w is unique
• The solution $w$ of the SVM problem is unique, but the support vectors are not

• In dimension N, N + 1 points are sufficient to define a hyperplane. When more
than N + 1 points lie on a marginal hyperplane, different choices are
possible for the support vectors

Dual optimization problem
• Plugging (5.9) and (5.10) into the Lagrangian function (5.8) yields the minimum
loss:

$\mathcal{L}(w, b, \alpha) = \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^{m}\alpha_i y_i b}_{0} + \sum_{i=1}^{m}\alpha_i$   (5.12)

where $\frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$

• which simplifies to

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.13)

Dual optimization solution
• This leads to the following dual optimization problem for SVMs in the separable case:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.14)

subject to: $\alpha_i \ge 0 \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The dual objective function is concave and differentiable. The dual optimization problem
is a QP problem; general-purpose and specialized QP solvers can be used
• The SMO (Sequential Minimal Optimization) algorithm is used to solve the dual form of the
SVM problem in the more general non-separable setting
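As a concrete illustration of handing (5.14) to a general-purpose QP solver, here is a sketch assuming the cvxopt package and linearly separable data; the helper name, the support-vector threshold, and the toy points are illustrative assumptions. It also recovers w via (5.9) and b via (5.16).

```python
# Hard-margin SVM dual (5.14) solved as a convex QP with cvxopt (a sketch).
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def svm_dual_hard(X, y):
    m = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_ij = y_i y_j (x_i . x_j)
    # cvxopt minimizes (1/2) a'Qa + q'a, so maximizing (5.14) means q = -1
    P, q = matrix(Q.astype(float)), matrix(-np.ones(m))
    G, h = matrix(-np.eye(m)), matrix(np.zeros(m))      # alpha_i >= 0
    A, b = matrix(y[None, :].astype(float)), matrix(0.0)  # sum_i alpha_i y_i = 0
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                                  # eq. (5.9)
    sv = alpha > 1e-6                                    # support vectors have alpha_i > 0
    b0 = y[sv][0] - X[sv][0] @ w                         # eq. (5.16), using any support vector
    return w, b0, alpha

# Tiny separable example
X = np.array([[0., 0.], [1., 1.], [3., 3.], [4., 4.]])
y = np.array([-1., -1., 1., 1.])
w, b0, alpha = svm_dual_hard(X, y)
print(w, b0)                                             # expected roughly w = (0.5, 0.5), b = -2
```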

Primal and dual problems are equivalent
• The solution of the dual problem (5.14) can be used directly to determine the
hypothesis returned by SVMs, using equation (5.9):

$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x) + b\Big)$   (5.15)

$w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.9)

• Since the support vectors lie on the marginal hyperplanes, for any support
vector $x_j$, $w \cdot x_j + b = y_j$, and thus $b$ can be obtained via

$b = y_j - \sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x_j)$   (5.16)

Inner products between vectors
• The dual optimization problem (5.14) and the expressions (5.15) and (5.16) reveal an
important property of SVMs:
• the hypothesis solution depends only on inner products between vectors and not
directly on the vectors themselves
• This observation is key and its importance will become clear when kernel
methods are introduced
• We can also derive the following expression (see page 85 of the text for details):

$\|w\|_2^2 = \sum_{i=1}^{m}\alpha_i = \|\boldsymbol{\alpha}\|_1$   (5.19)

Theorem 5.4
• Let S be a linearly separable sample of size m.
• Let $h_S$ be the hypothesis returned by SVMs for a sample S, and let $N_{SV}(S)$ be the
number of support vectors that define $h_S$. Then the average generalization error is
bounded by the average fraction of support vectors:

$\mathbb{E}_{S\sim\mathcal{D}^m}\big[R(h_S)\big] \ \le\ \mathbb{E}_{S\sim\mathcal{D}^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big]$   (5.4)

Leave-one-out error: $\hat{R}_{LOO}(\mathrm{SVM}) \ \le\ \frac{N_{SV}(S)}{m+1}$

• where $\mathcal{D}$ denotes the distribution according to which points are drawn


Theorem 5.4
Theorem 5.4 gives a sparsity argument in favor of SVMs:
• The average error of the algorithm is upper bounded by the average fraction of support
vectors
• One may hope that, for many distributions seen in practice, a relatively small
number of the training points will be support vectors
• The solution will then be sparse, in the sense that only a small fraction of the dual variables will be
non-zero
• (5.4) is a relatively weak bound since it applies only to the average generalization error
of the algorithm over all samples of size m. It provides no information about the variance of the
generalization error
• We present stronger high-probability bounds based on the notion of margin
(Theorem 5.10).
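A rough empirical illustration of this sparsity bound, assuming scikit-learn is available: fit a (nearly) hard-margin linear SVM on synthetic data and compare the leave-one-out error with the fraction of support vectors. The dataset and the large C value are assumptions.

```python
# Empirical check: LOO error should be bounded (roughly) by the fraction of support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1e6).fit(X, y)          # very large C approximates the hard-margin SVM
frac_sv = len(clf.support_) / len(X)                 # N_SV(S) / m
loo_err = 1 - cross_val_score(SVC(kernel='linear', C=1e6), X, y, cv=LeaveOneOut()).mean()
print(f"LOO error {loo_err:.3f}  vs  fraction of support vectors {frac_sv:.3f}")
```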
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

A Geometric Representation of SVM
• There are infinitely many separating hyperplanes.
• Linear hypotheses H of the form $h(x) = w \cdot x + b$
• Consistent case: $y_i h(x_i) \ge 0$ for all training examples
• There are an infinite number of hyperplanes that separate the training set

[Figure: a linearly separable training set with several candidate separating hyperplanes]
Which one is the best?
• Answer: To keep the upper bound of the true risk as low as possible, for a
given S and H, we are looking for the maximum "ρ-margin in the loss function
while the empirical error is zero"

A Geometric Representation of SVM
• This is equivalent to the existence of $(w, b)$ such that:

separating hyperplane: $h(x) = w \cdot x + b = 0$
for $y_i = +1$: $h(x_i) \ge 0$, so $y_i h(x_i) \ge 0$
for $y_i = -1$: $h(x_i) \le 0$, so $y_i h(x_i) \ge 0$

Score of $x_i$: $s_i = y_i h(x_i)$

Definition 5.1 - margin of examples
• The geometric margin $\rho_i$ at a point $x_i$ = the distance from $x_i$ to the hyperplane $h(x) = w \cdot x + b = 0$:

$\rho_i = \rho_h(x_i) = \frac{|w \cdot x_i + b|}{\|w\|_2} = \frac{y_i h(x_i)}{\|w\|_2}$  (for correctly classified points, $y_i h(x_i) \ge 0$)

$\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_N^2}$

Score of $x_i$: $s_i = y_i h(x_i)$

Geometric margin
• The geometric margin $\rho_h$ of a linear classifier h for a sample S is the minimum geometric
margin over the points in the sample, that is, the distance of the hyperplane defining h to the closest
sample points.

geometric margin of h: $\rho_h = \min_{i \in [m]} \rho_h(x_i)$

[Figure: separating hyperplane $h(x) = 0$ with a margin band of total width $2\rho_h$]

Geometric margin
Marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$ around the separating hyperplane $h(x) = 0$.
For a point $x_i$ with $h(x_i) \ge 0$, $y_i = +1$:

$\rho_{i,+1} = \frac{y_i\big(h(x_i) - 1\big)}{\|w\|_2} = \rho_i - \frac{1}{\|w\|_2}$

$\rho_{i,-1} = \frac{y_i\big(h(x_i) + 1\big)}{\|w\|_2} = \rho_i + \frac{1}{\|w\|_2}$

$\rho_{i,-1} - \rho_{i,+1} = \frac{2}{\|w\|_2} = 2\rho_h$
Margin of $x_i$ based on $\rho_h$

$\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

$2\rho_h = \frac{2}{\|w\|_2}$

[Figure: point $x_i$ at distance $\rho_i$ from the hyperplane $h(x) = 0$]

SVM: maximum ρ-margin & no empirical error, h ∈ H

• To keep the upper bound of the true risk as low as possible, we are looking
for the maximum "ρ-margin in the loss function while the empirical error is zero"

• It means: $\max_{w,b}\ \rho_h = \frac{1}{\|w\|_2}$

• What is the maximum $\rho_h$ possible?

SVM: maximum ρ-margin & no empirical error, h ∈ H

[Figure: a +1 point $x_i$ and a −1 point $x_j$ at distances $\rho_i$ and $\rho_j$ from the hyperplane $h(x) = 0$, with margin band $2\rho_h = 2/\|w\|_2$; the zero-one, hinge and quadratic hinge losses are plotted against the score]

No training margin loss on the −1 examples: $\rho_j \ge \rho_h \ \Rightarrow\ \frac{\rho_j}{\rho_h} \ge 1$
No training margin loss on the +1 examples: $\rho_i \ge \rho_h \ \Rightarrow\ \frac{\rho_i}{\rho_h} \ge 1$

SVM Margin Loss functions: $L_i = \max\big(0,\ 1 - y_i h(x_i)\big)$

zero-one loss: $1_{y_i h(x_i) \le 0}$
hinge loss: $\max\big(0,\ 1 - y_i h(x_i)\big)$
quadratic hinge: $\max\big(0,\ 1 - y_i h(x_i)\big)^2$

$\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss.
Dual of Algorithm 1: Lagrangian function
• The function associated to problem (5.48) is an upper bound of the true risk:

$\mathcal{R}(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

$R(h) \le \mathcal{L}(w, b) = \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

$\mathcal{L}(w, b)$ is an upper bound of the true risk; $\lambda$ is the regularization parameter (Lagrange variable)

• The solution $w$ and $b$ of $\min_{w,b}\ \mathcal{L}(w, b)$ is the solution of the primal problem
SVM Primal Algorithm 2
• Finding $(w, b)$ that has the maximum geometric margin and no empirical loss:

$\max_{w,b}\ \rho_h = \frac{1}{\|w\|_2}$   (5.7.1)

subject to no empirical loss: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$.  Note that maximizing $\rho_h$ is the same as minimizing $\|w\|_2$.
• Or, equivalently, minimizing the regularizer (a convex optimization problem and a specific
instance of quadratic programming (QP)):

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores at 1 or more, subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

SVM Primal Algorithm 2.
• Given S, find $w$ and $b$ that maximize the geometric margin while there is no
training error:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• The resulting algorithm precisely coincides with (5.48.1)

Dual of Algorithm 2
• Lagrangian function associated to problem (5.7):

$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big)$

where the $\alpha_i \ge 0$ are the Lagrange variables

• The solution $w$ and $b$ at the saddle point of $\mathcal{L}(w, b, \boldsymbol{\alpha})$ is the solution of the primal problem

• Note that: $\rho_h^2 = \frac{1}{\|w\|_2^2} = \frac{1}{\sum_{i=1}^{m}\alpha_i} = \frac{1}{\|\boldsymbol{\alpha}\|_1}$   (5.19)
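A quick numerical check of (5.19) and of $\rho_h = 1/\|w\|_2$, assuming scikit-learn, on (approximately) separable synthetic data with a very large C so the fit behaves like the hard-margin SVM; the data and C value are illustrative.

```python
# Verify ||w||^2 ~ sum_i alpha_i and compute the geometric margin rho = 1 / ||w||.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=1)
y = 2 * y01 - 1                                        # labels in {-1, +1}
clf = SVC(kernel='linear', C=1e6).fit(X, y)
w = clf.coef_.ravel()
alpha = np.abs(clf.dual_coef_).ravel()                 # dual_coef_ stores y_i * alpha_i
print(np.dot(w, w), alpha.sum())                       # the two numbers should be approximately equal
print("geometric margin rho =", 1 / np.linalg.norm(w))
```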

VC-dim of the ρ-margin hyperplane (linear) set H
• VC dimension d of the ρ-margin loss function and the linear set H:
• N is the dimension of the input space
• Let the vectors $x \in X$ belong to a sphere of radius R

$d \ \le\ \min\Big(\Big\lceil \frac{R^2}{\rho^2} \Big\rceil,\ N\Big) + 1$

• Using a large ρ, the generalization ability of the constructed hyperplane is
high:
maximizing ρ minimizes the upper bound of d

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

3- Binary SVM: Non-separable
case
Inconsistent case (non-separable case), H
• In most practical settings, the training data is not linearly separable: for
any hyperplane $w \cdot x + b = 0$, there exists $x_i \in S$ such that

$y_i\,(w \cdot x_i + b) < 1$   (5.22)

• The constraints imposed in the linearly separable case cannot all hold
simultaneously:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$
Inconsistent case (non-separable case)
• We introduce a new slack variable $\xi_i$ into the consistent SVM algorithm to measure the
empirical loss.

Figure 5.4: A separating hyperplane with a point $x_i$ classified incorrectly and a point $x_j$ correctly classified, but with margin less than 1.
Both $x_i$ and $x_j$ are outliers ($\xi_i > 0$ and $\xi_j > 0$).
Loss of the Inconsistent case
• $\xi_i$ represents the loss for $x_i$ based on the margin ρ = 1:

$L\big(y_i h(x_i)\big) = 1 - y_i h(x_i) = \xi_i$,  i.e.  $y_i h(x_i) = 1 - \xi_i$

Total empirical loss $= \sum_{i=1}^{m}\xi_i$

[Figure: hinge loss $\xi_i$ as a function of the score $y_i h(x_i)$, with the separating hyperplane $h(x) = 0$ and the marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$]
Error of outliers:
$0 < \xi_i < 1$: $x_i$ is on the correct side of the separating hyperplane

$1 < \xi_i$: $x_i$ is on the incorrect side of the separating hyperplane

Relaxed constraints
• A relaxed version of these constraints can indeed hold; that is, for
each $i \in [m]$, there exists $\xi_i \ge 0$ such that

subject to: $y_i(w \cdot x_i + b) \ge 1$   relaxed to   subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$

• And therefore the loss function becomes:

$L_i\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) = \xi_i$

• The slack variable $\xi_i$ measures the quantity by which the vector $x_i$ violates the
desired inequality, $y_i(w \cdot x_i + b) \ge 1$
Soft margin - Hard margin
• For the hyperplane $w \cdot x + b = 0$, a vector $x_i$ with $\xi_i > 1$ can be viewed as a misclassified example
• $x_i$ with $0 < \xi_i < 1$ is correctly classified by the hyperplane but is considered to be an
outlier, that is, $\xi_i > 0$

• If we omit the misclassified examples and outliers, the training data is
correctly separated by $w \cdot x + b = 0$ with a margin that we refer to as the soft margin,
as opposed to the hard margin in the separable case
Empirical loss, large margin, loss function
• One idea consists of selecting the hyperplane that minimizes the empirical
loss (that is, ERM)
• But that solution will not benefit from large-margin guarantees

• The problem of determining a hyperplane with the smallest zero-one loss,
that is, the smallest number of misclassifications, is NP-hard as a function
of the dimension of the space. Using the hinge or quadratic hinge loss functions
is computationally feasible

Loss functions

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$   (slack / error terms)

• There are many possible choices for p leading to more or less aggressive
penalizations of the slack terms:
• The choices p = 1 and p = 2 lead to the most straightforward solutions. The loss functions associated
with p = 1 and p = 2 are called the hinge loss and the quadratic hinge loss, respectively:

zero-one loss: $1_{y h(x) \le 0}$
hinge loss: $\max\big(0,\ 1 - y\,h(x)\big)$
quadratic hinge: $\max\big(0,\ 1 - y\,h(x)\big)^2$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss (plotted against the score $y\,h(x)$).
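The three losses of figure 5.5 written out as plain NumPy functions (a sketch; the sample scores are arbitrary).

```python
# Zero-one, hinge, and quadratic hinge losses as functions of the score s = y * h(x).
import numpy as np

def zero_one(s):        return (s <= 0).astype(float)
def hinge(s):           return np.maximum(0.0, 1.0 - s)
def quadratic_hinge(s): return np.maximum(0.0, 1.0 - s) ** 2

s = np.array([-1.5, -0.2, 0.3, 0.9, 1.0, 2.0])
for f in (zero_one, hinge, quadratic_hinge):
    print(f.__name__, f(s))   # hinge and quadratic hinge are >= zero-one for every score
```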
Two conflicting objectives: loss and margin
• On one hand, we wish to limit the total amount of empirical loss
(slack penalty) due to misclassified examples and outliers, which can
be measured by $\sum_{i=1}^{m}\xi_i$ or, more generally, by $\sum_{i=1}^{m}\xi_i^p$ for some $p \ge 1$.

• On the other hand, we seek a hyperplane with a large margin, though a
larger margin can lead to more misclassified examples and outliers and thus
larger amounts of loss

Primal optimization problem
• This leads to the following general optimization problem defining SVMs in the non-
separable case, where the parameter C determines the trade-off between margin
maximization (or minimization of $\|w\|_2$) and minimization of the slack penalty. A small C
means a large empirical loss (regularization using C).

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$   (5.24)

subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ \ i \in [m]$

relaxed score constraints; non-negativity constraint on the slack (error) variables.
Not all examples need to satisfy the score constraint.
• (5.24) is a convex optimization problem
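A small sketch of the trade-off controlled by C in (5.24), assuming scikit-learn: after fitting, the slacks $\xi_i = \max(0, 1 - y_i h(x_i))$ can be read off the decision function. The synthetic data and the C values are assumptions.

```python
# Larger C -> less total slack but a smaller geometric margin; smaller C -> the opposite.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)  # overlapping classes
y = 2 * y01 - 1
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    scores = y * clf.decision_function(X)
    xi = np.maximum(0.0, 1.0 - scores)               # slack variables
    margin = 1 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: total slack={xi.sum():7.2f}  geometric margin={margin:.3f}")
```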

A regularization view
• The optimization problem in (5.24) presents a regularization-based solution
• Higher C means lower training error and a smaller margin
• Lower C means higher training error and a larger margin

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big) \quad\Longleftrightarrow\quad \min_{w,b,\boldsymbol{\xi}}\ \frac{1}{m}\sum_{i=1}^{m}\xi_i^p + \lambda\,\|w\|_2^2$

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p \quad\Longleftrightarrow\quad \underset{h\in H}{\mathrm{argmin}}\;\big(1/\rho_h + C\,\hat{R}(h)\big), \qquad C \equiv 1/\lambda$

A regularization view
• Back to the results in Chapter 1 for regularization-based algorithms:

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big)$

$\underset{h\in H}{\mathrm{argmin}}\;L(w) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(w)$

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2, \qquad L\big(y_i h(x_i)\big) = \xi_i$

Lagrangian function
primal: $\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i$,  subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ i \in [m]$

• The analysis is presented in the case of the hinge loss (p = 1), which is the most widely used loss function
for SVMs.
• We introduce Lagrange variables $\alpha_i \ge 0$, associated to the score constraints, and $\beta_i \ge 0$, associated to the
non-negativity constraints of the slack variables
• We denote by $\boldsymbol{\alpha}$ the vector $(\alpha_1, \dots, \alpha_m)^\top$ and by $\boldsymbol{\beta}$ the vector $(\beta_1, \dots, \beta_m)^\top$
• The Lagrangian can then be defined, for all $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, and $\boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}_+^m$, by

dual:
$\mathcal{L}(w, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{m}\beta_i\xi_i$   (5.25)

where $\alpha_i \ge 0$ and $\beta_i \ge 0$
Derivatives of the Lagrangian function (5.25)

• A vector $x_i$ appears in the solution iff $\alpha_i \neq 0$. Such vectors are called support
vectors

$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.26)

$\nabla_b \mathcal{L} = -\sum_{i=1}^{m}\alpha_i y_i = 0 \ \Rightarrow\ \sum_{i=1}^{m}\alpha_i y_i = 0$   (5.27)

$\nabla_{\xi_i} \mathcal{L} = C - \alpha_i - \beta_i = 0 \ \Rightarrow\ \alpha_i + \beta_i = C$   (5.28)

$\forall i,\ \alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) = 0 \ \Rightarrow\ \alpha_i = 0 \ \lor\ y_i(w \cdot x_i + b) = 1 - \xi_i$   (5.29)

$\forall i,\ \beta_i\,\xi_i = 0 \ \Rightarrow\ \beta_i = 0 \ \lor\ \xi_i = 0$   (5.30)
Two types of support vectors
• By the complementary slackness condition (5.29), if $\alpha_i \neq 0$, then $y_i(w \cdot x_i + b) = 1 - \xi_i$

• If $\xi_i = 0$, then $y_i(w \cdot x_i + b) = 1$ and $x_i$ lies on a marginal hyperplane, as in the separable case, and (5.28)
requires $\alpha_i \le C$

• Otherwise, $\xi_i > 0$ and $x_i$ is an outlier. In this case, (5.30) implies $\beta_i = 0$ and (5.28)
requires $\alpha_i = C$

• As in the separable case, the weight vector solution $w$ is unique; the support vectors are not.
Two types of support vectors
• Support vectors are either outliers, in which case $\alpha_i = C$, or vectors lying
on the marginal hyperplanes, in which case $\alpha_i \le C$

Support vectors: $\{\, x_i : \alpha_i \neq 0 \,\}$
Dual optimization problem for (5.24)
• Plug the definition of $w$ in terms of the dual variables (5.26) into the Lagrangian and
apply constraint (5.27). This yields

$\mathcal{L}(w, b, \alpha) = \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^{m}\alpha_i y_i b}_{0} + \sum_{i=1}^{m}\alpha_i$   (5.31)

where $\frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$

• Remarkably, we find that the objective function is no different than in the
separable case:

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.32)

Dual optimization problem: non-separable case
• The dual problem only differs from that of the separable case (5.14) by the
constraints $0 \le \alpha_i \le C$:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.33)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The objective function is concave and differentiable and (5.33) is
equivalent to a convex QP. The problem is equivalent to the primal
problem (5.24).
Hypothesis
• The solution of the dual problem (5.33) can be used directly to determine the hypothesis
returned by SVMs, using equation (5.26):

$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x) + b\Big)$   (5.34)

$w = \sum_{i=1}^{m}\alpha_i y_i x_i$

• $b$ can be obtained from any support vector $x_j$ lying on a marginal hyperplane:

$b = y_j - \sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x_j)$   (5.35)

• Important property of SVMs: the hypothesis solution depends only on inner products
between vectors and not directly on the vectors themselves
Generalization bounds using margin theory
• Generalization bounds provide a strong theoretical justification for
the SVM algorithm.
• Confidence margin: the confidence margin of a real-valued function $h$ at a
point $x$ labeled with $y$ is the quantity $y\,h(x)$

• When $y\,h(x) > 0$, $h$ classifies $x$ correctly

• $|h(x)|$ is the confidence of the prediction made by $h$
Margin loss function
• For any parameter $\rho > 0$, we have a ρ-margin loss function that penalizes $h$ with a cost of
1 when it misclassifies the point $x$ ($y\,h(x) \le 0$), and penalizes it linearly when it correctly
classifies $x$ with confidence less than or equal to ρ ($0 \le y\,h(x) \le \rho$).

• The parameter ρ can be interpreted
as the confidence margin
demanded from a hypothesis $h$

Figure 5.6: the ρ-margin loss function (illustrated in red), plotted against $y_i h(x_i) = \rho_i / \rho_h$
Empirical margin loss
• Definition 5.6 Given a sample S and a hypothesis h, the empirical margin
loss is defined by

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m}\Phi_\rho\big(y_i h(x_i)\big)$   (5.37)
Generalization bound for linear hypotheses
• Corollary 5.11 Let $H = \{x \mapsto w \cdot x : \|w\|_2 \le \Lambda\}$ and assume that $\|x\|_2 \le r$ for all $x \in X$. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least
$1 - \delta$ over the choice of a sample S of size m, the following holds for any $h \in H$:

$R(h) \ \le\ \hat{R}_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (5.44)

• In the separable case, for a linear hypothesis with geometric margin $\rho_h$ and the choice of
confidence margin parameter $\rho = \rho_h$, the empirical margin loss is 0
Generalization bound for linear hypotheses
• (5.44) ⇒ a small generalization error can be achieved when:
• $r\Lambda/\rho$ is small and
• the empirical margin loss is relatively small

• For a given problem, a larger margin ρ means a smaller upper bound on the
generalization error
• It is a strong justification for margin-maximization algorithms such as
SVMs

From Bound to Optimization problem
• An algorithm based on this theoretical guarantee consists of
minimizing the right-hand side of (5.44), that is, minimizing an objective
function with a term corresponding to the sum of the slack variables $\sum_{i=1}^{m}\xi_i$, and
another one minimizing $\|w\|_2^2$

$R(h) \ \le\ \hat{R}_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (5.44)
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

4- Kernel Methods
Kernel Methods and the non-separable case, H
• When S is not linearly separable, the target function is nonlinear
• Q: How can we use a linear hypothesis set H to learn a non-linear target function?
• A: Kernel method:
• use a nonlinear mapping Φ from the input space to a higher-dimensional feature
space,
• H is linear in the feature space
• H is nonlinear in the input space
• Now we train a nonlinear hypothesis. In some applications it is possible to find a mapping Φ for which the mapped sample is linearly separable. So the
problem becomes consistent in the feature space, with no training error (consistent case)

Example: degree-2 Polynomial Kernel
• Suppose the input space is X ⊆ ℝ^N
• Apply a nonlinear mapping Φ to the input
• Then the hypothesis is nonlinear in the input space
• And the hypothesis $h(\Phi(x)) = w \cdot \Phi(x) + b$ is linear in the feature space

• The hypothesis is nonlinear in the input space and linear in the feature space

• The complexity of the linear hypothesis set in the feature space is twice that of the linear hypothesis set in the input space

Figure 6.1

(a) $\hat{R} > 0$: linear in the input space     (b) $\hat{R} = 0$: linear in the feature space and
                                                  nonlinear in the input space

Figure 6.1: Non-linearly separable case. The classification task consists of discriminating between blue and red points.
(a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.

Polynomial Kernel and complexity of linear H
• Complexity of a linear H in the feature space: the dimension of the feature space grows rapidly with the polynomial degree
• For example: an input space in $\mathbb{R}^N$ and a polynomial kernel of degree d give a feature space of dimension $\binom{N+d}{d}$
• When N and d are both large, this dimension is huge
• The VC-dimension is then huge and the hypothesis can easily overfit the training data
SVMs with kernels
• Replace each instance of an inner product $(x_i \cdot x_j)$ in (5.33) with a kernel $K(x_i, x_j)$:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{m}\alpha_i$   (6.13)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The solution can be written as:

$h(x) = \mathrm{sgn}\big(w \cdot \Phi(x) + b\big) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b\Big)$   (6.14)

with $b = y_j - \sum_{i=1}^{m}\alpha_i y_i K(x_i, x_j)$ for any $x_j$ with $0 < \alpha_j < C$

Definition 6.1 (Kernels)
• A function $K : X \times X \to \mathbb{R}$ is called a kernel over X
• The idea is to define a kernel K such that, for any two examples $x, x' \in X$, $K(x, x')$ is equal to an inner
product of vectors $\Phi(x)$ and $\Phi(x')$:

$\forall x, x' \in X,\quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle$   (6.1)

• Since an inner product is a measure of the similarity of two vectors, K is often
interpreted as a similarity measure between elements of the input space

Polynomial kernels
• For any constant $c > 0$, a polynomial kernel of degree $d$ is the kernel K defined
over $\mathbb{R}^N$ by:

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = (x \cdot x' + c)^d$   (6.3)

• Example (N = 2, d = 2): $\forall x, x' \in \mathbb{R}^2,\ K(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2$

$K(x, x') = \langle \Phi(x), \Phi(x') \rangle = \big[\,x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\,\big] \cdot \big[\,x_1'^2,\ x_2'^2,\ \sqrt{2}\,x'_1 x'_2,\ \sqrt{2c}\,x'_1,\ \sqrt{2c}\,x'_2,\ c\,\big]^\top = (x_1 x'_1 + x_2 x'_2 + c)^2$
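A short numerical check of the identity (6.1) for this degree-2 polynomial kernel, using the explicit feature map above (NumPy; the constant c and the test points are arbitrary).

```python
# Verify K(x, x') = <Phi(x), Phi(x')> for the degree-2 polynomial kernel in R^2.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

def K(x, xp, c):
    return (np.dot(x, xp) + c) ** 2

c = 1.0
x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(K(x, xp, c), np.dot(phi(x, c), phi(xp, c)))   # the two numbers agree (42.25)
```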
Example: XOR and a 2nd-degree polynomial

Feature map (c = 1): $\Phi(x) = \big[\,x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1\,\big]$

Mapped XOR points:
$(1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$

The SVM solution $h(\Phi(x))$ separates the classes in the feature space.

Figure 6.3: Illustration of the XOR classification problem and the use of polynomial kernels. (a) The XOR problem is linearly
non-separable in the input space. (b) It is linearly separable using a second-degree polynomial kernel.

Gaussian kernels or radial basis function (RBF)
• Gaussian kernels are among the most frequently used kernels in applications

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \exp\Big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\Big)$   (6.5)

• For $x = x'$, $K(x, x') = 1$ (maximum similarity)
• For $\|x - x'\| \to \infty$, $K(x, x') \to 0$ (maximum dissimilarity)

• What is the non-linear mapping Φ?
• What is the complexity of a linear H in the feature space?
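The Gaussian kernel (6.5) as a plain function, for reference (a sketch; the value of σ is an assumption).

```python
# RBF kernel: similarity decays with squared Euclidean distance.
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, x))                        # identical points: maximum similarity, K = 1
print(rbf_kernel(x, np.array([10.0, 0.0])))    # distant points: K close to 0
```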
Sigmoid kernels
• For any real constants $a, b \ge 0$, a sigmoid kernel is defined over $\mathbb{R}^N$ by:

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \tanh\big(a\,(x \cdot x') + b\big)$   (6.6)

• Using sigmoid kernels with SVMs leads to an algorithm that is closely
related to learning algorithms based on simple neural networks,
which have a sigmoid activation function.

Example-1D
• Suppose we have 5 one-dimensional data points as the training set:
• $(x_1=1, y_1=+1)$, $(x_2=2, y_2=+1)$, $(x_3=4, y_3=-1)$, $(x_4=5, y_4=-1)$, $(x_5=6, y_5=+1)$

Class label     +1   +1   −1   −1   +1
data point      x₁   x₂   x₃   x₄   x₅
input space      1    2    4    5    6

$\Phi(x_i) = \big[\,x_i^2,\ \sqrt{2}\,x_i,\ 1\,\big]^\top$, $\quad K(x, x_i) = (x\,x_i + 1)^2$

Example-1D, Feature Space

$\Phi(x_i) = \big[\,x_i^2,\ \sqrt{2}\,x_i,\ 1\,\big]^\top$

$x_2 = 2,\ \ \Phi(x_2) = [4,\ 2\sqrt{2},\ 1]$ $\qquad x_5 = 6,\ \ \Phi(x_5) = [36,\ 6\sqrt{2},\ 1]$

[Figure: mapped points $\Phi(x_1), \dots, \Phi(x_5)$ in the feature space; $\Phi(x_2)$, $\Phi(x_4)$ and $\Phi(x_5)$ are the support vectors (sv)]

Find the widest strip just by looking at the data!

Example-1D, Applying the SVM algorithm
• Polynomial kernel of degree 2: $K(x, x_i) = (x \cdot x_i + 1)^2$

• C is set to 100 (using a large C means we want to emphasize minimizing the training
error)
• We first find $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_5)$ by solving

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j + 1)^2$

subject to: $0 \le \alpha_i \le 100 \ \wedge\ \sum_{i=1}^{5}\alpha_i y_i = 0$

Example-1D: the kernel values $K(x_i, x_j) = (x_i x_j + 1)^2$

data point      x₁   x₂   x₃   x₄   x₅
input value      1    2    4    5    6

e.g. $K(x_5, x_5) = (6 \times 6 + 1)^2 = 1369$, $\quad K(x_5, x_1) = (6 \times 1 + 1)^2 = 49$

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j + 1)^2$

(the double sum is expanded term by term using the kernel values above)

Example-1D
• Finding α to maximize $\mathcal{L}_{dual}(\boldsymbol{\alpha})$
• Solving the following linear equations (one per $\alpha_i$, together with $\sum_i y_i\alpha_i = 0$), or using a QP solver:

$\frac{\partial\mathcal{L}}{\partial\alpha_1} = 1 - 0.5\,(2\times 4\,\alpha_1 + 2\times 9\,\alpha_2 - 2\times 25\,\alpha_3 - 2\times 36\,\alpha_4 + 2\times 49\,\alpha_5) = 0$

(and similarly for $\alpha_2, \dots, \alpha_5$)

Example-1D
• We get:
• $\alpha_1 = 0$, $\alpha_2 = 2.5$, $\alpha_3 = 0$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$  (check: $2.5 + 4.833 - 7.333 = 0$)
• Note that all $\alpha_i < C = 100$, so there is no training error
• The support vectors are $\{x_2 = 2,\ x_4 = 5,\ x_5 = 6\}$,
with $\alpha_2 = 2.5$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$

data point      x₁   x₂   x₃   x₄   x₅
input value      1    2    4    5    6   (support vectors: x₂, x₄, x₅)
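A sketch reproducing this 1-D example with scikit-learn (an assumption; any SVM solver would do). In scikit-learn the polynomial kernel is $(\gamma\langle x, x'\rangle + \text{coef0})^{\text{degree}}$, so $\gamma = 1$, coef0 = 1, degree = 2 matches $K(x, x_i) = (x \cdot x_i + 1)^2$; dual_coef_ stores $y_i\alpha_i$ for the support vectors and should roughly match the α values above, with intercept_ close to b = 9.

```python
# Reproduce the 1-D worked example with a degree-2 polynomial kernel and C = 100.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=100.0).fit(X, y)
print("support vectors:", clf.support_vectors_.ravel())   # expect x = 2, 5, 6
print("y_i * alpha_i   :", clf.dual_coef_.ravel())         # roughly +2.5, -7.333, +4.833
print("b               :", clf.intercept_)                 # roughly 9
print("predictions     :", clf.predict(X))                 # all training points classified correctly
```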

Example-1D: weight vector and margin
Support vectors: $x_2, x_4, x_5$

$w = \alpha_2 y_2 \Phi(x_2) + \alpha_4 y_4 \Phi(x_4) + \alpha_5 y_5 \Phi(x_5) = [+0.663,\ -3.77,\ 0]$

$h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

$\rho_h = \frac{1}{\|w\|} = 0.261 = \frac{1}{\sqrt{\alpha_2 + \alpha_4 + \alpha_5}}$

[Figure: mapped points $\Phi(x_1), \dots, \Phi(x_5)$ in the $(z_1, z_2)$ feature space with the weight vector $w$ and the maximum-margin strip]
Example-1D: recovering the hypothesis, $K(x, x_i) = (x \cdot x_i + 1)^2$
Support vectors: $x_2, x_4, x_5$

• The hypothesis is $h(x) = \sum_{x_i \in S}\alpha_i y_i K(x, x_i) + b$:

$h(x) = 2.5\,(1)(2x + 1)^2 + 7.333\,(-1)(5x + 1)^2 + 4.833\,(1)(6x + 1)^2 + b$
$h(x) = 0.6667\,x^2 - 5.333\,x + b$

• $b$ can be recovered by solving $h(2) = +1$, or $h(6) = +1$, or $h(5) = -1$, since $x_2$ and $x_5$ lie on the marginal hyperplane $h(x) = +1$ and $x_4$ lies on $h(x) = -1$

• All three give $b = 9$:

$h(x) = 0.6667\,x^2 - 5.333\,x + 9 \qquad h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

Example-1D: decision boundaries in the input space

$h(x) = w^\top\Phi(x) + b = 0:\quad 0.6667\,x^2 - 5.333\,x + 9 = 0$
$h(x) = w^\top\Phi(x) + b = +1:\quad 0.6667\,x^2 - 5.333\,x + 9 = +1$
$h(x) = w^\top\Phi(x) + b = -1:\quad 0.6667\,x^2 - 5.333\,x + 9 = -1$

[Figure: the five training points on the x-axis (1, 2, 4, 5, 6); the decision function labels the regions
x < 2.42 and x > 5.58 as +1 and the region in between as −1; the curves h(x) = ±1 pass through the support vectors]

Example-XOR

Input vector      y
x1: [−1, −1]     −1
x2: [−1, +1]     +1
x3: [+1, −1]     +1
x4: [+1, +1]     −1

Kernel matrix $K_{ij} = (x_i \cdot x_j + 1)^2$, e.g. $K(x_1, x_2) = (x_1 \cdot x_2 + 1)^2 = (0 + 1)^2 = 1$:

    | 9 1 1 1 |
K = | 1 9 1 1 |
    | 1 1 9 1 |
    | 1 1 1 9 |
Example

Note that H is a linear hyperplane set.
The non-linearity is due to the kernel function.

RBF kernel
C = 0.01 (lower C):
higher training error and a larger margin

C = 100 (higher C):
lower training error and a smaller margin

RBF kernel

γ = 10 ≫ 1: decreasing radius of influence
of the support vectors (no amount
of regularization with C will be
able to prevent overfitting)
Too complex a model

RBF-Kernel
• γ ≫ 1 ⇒ the radius of the area of influence of the support vectors only includes the
support vector itself, and no amount of regularization with C will be
able to prevent overfitting.
• γ ≪ 1 ⇒ the model is too simple and cannot capture the complexity of the target
function c. The region of influence of any selected support vector
would include the whole training set.
• γ = intermediate values ⇒ good models can be found on a diagonal of
C and γ.
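A sketch of searching that C-γ diagonal with a small cross-validated grid search, assuming scikit-learn; the dataset and grid values are illustrative.

```python
# Grid-search C and gamma for an RBF SVM; the best parameters land away from the extremes.
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
param_grid = {"C": [0.01, 1, 100], "gamma": [0.01, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
# Very large gamma overfits regardless of C; very small gamma underfits;
# intermediate values on the C-gamma diagonal give the best cross-validated accuracy.
```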

RBF kernel
γ ≪ 1: too simple a model (the region of influence of any support vector includes the whole training set)
γ ≫ 1: too complex a model (decreasing radius of influence of the support vectors)
Decreasing C increases the margin.

[Figure: cross-validation accuracy as a function of C and γ; good models lie on a diagonal of C and γ]

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

5- Multiclass SVM

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research, 2, 2002.
Multiclass SVM
• Let X denote the input space and Y denote the output space, and let D be an unknown
distribution over X according to which input points are drawn. We will distinguish
between two cases:
• the mono-label case, where Y is a finite set of k classes that we mark with numbers for convenience,
$Y = \{1, \dots, k\}$. Learning: given a dataset $S = ((x_1, y_1), \dots, (x_m, y_m))$
• the multi-label case, where $Y = \{-1, +1\}^k$
• In the mono-label case, each example is labeled with a single class, while in the multi-
label case it can be labeled with several. Text documents can be labeled with
several different relevant topics, e.g., sports, business, and society. The positive
components of a vector in $\{-1, +1\}^k$ indicate the classes associated with an example.

Multi-class SVM, Mono-label case
• In a risk minimization framework
• Each label $l$ has a different weight vector $w_l$

• Learning (Training): maximizing the multiclass margin
• Equivalently, minimizing the total norm of the weight vectors such that the true
label is scored at least 1 more than the second-best one
• Training results in $(w_1, b_1), \dots, (w_k, b_k)$
• Testing (Inference): select the label with the highest score, $h(x) = \arg\max_{l \in \{1,\dots,k\}}\big(w_l \cdot x + b_l\big)$

Multiclass Margin loss
• Suppose a 5-class task.
• For a pattern x the scores are $s_l = w_l^\top x + b_l$, $l = 1, \dots, 5$.
• The margin loss $\max\big(0,\ 1 - (s_y - \max_{l \ne y} s_l)\big)$ is shown for 3 different possibilities:

[Figure: three bar charts of the 5 class scores.
(a) the true label scores at least 1 more than every other label: margin loss = 0;
(b) the gap between the true label and the runner-up is 0.3 (e.g. 3.1 vs 2.8): margin loss = 0.7;
(c) the true label scores 0.6 below the best other label: margin loss = 1.6]
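The multiclass margin loss used above, $\max(0, 1 - (s_y - \max_{l \ne y} s_l))$, as a small NumPy function; the score vector in the usage lines is hypothetical, not the one in the figure.

```python
# Multiclass margin (hinge) loss for one example.
import numpy as np

def multiclass_margin_loss(scores, y):
    others = np.delete(scores, y)                    # scores of all labels except the true one
    return max(0.0, 1.0 - (scores[y] - others.max()))

scores = np.array([1.0, 2.8, 3.1, 0.5, 2.2])          # hypothetical scores s_l for 5 classes
print(multiclass_margin_loss(scores, y=2))            # true label 3.1, runner-up 2.8 -> loss 0.7
print(multiclass_margin_loss(scores, y=4))            # true label 2.2, best other 3.1 -> loss 1.9
```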

Linear Hard SVM (no empirical error)

• Recall the hard binary linear SVM:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)  (regularizer)
subject to (score constraint): $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• Single-task hard multiclass linear SVM:

$\min_{(w_1; b_1), \dots, (w_k; b_k)}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2$  (regularizer)

subject to: $s_{y_i} - s_l \ge 1 \ \equiv\ 1 - (s_{y_i} - s_l) \le 0$ for all $i \in [m]$ and $l \ne y_i$
(the score for the true label is higher than the score for any other label by at least 1)

Linear Soft SVM
• Recall the soft binary linear SVM:

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$
subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ i \in [m]$
(relaxed score constraints; non-negativity constraint on the slack variables)

• Single-task soft multiclass linear SVM:

$\min_{(w_1; b_1), \dots, (w_k; b_k),\, \boldsymbol{\xi}}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$

subject to: $s_{y_i} - s_l \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0$ for all $i \in [m]$ and $l \ne y_i$

Lagrangian of the optimization problem
• To solve the optimization problem we use the Karush-Kuhn-Tucker theorem. We
add a dual set of variables, one for each constraint, and get the Lagrangian of the
optimization problem.
• Recall the single-task soft binary linear SVM Lagrangian:

$\mathcal{L}(w, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{m}\beta_i\xi_i$   (5.25)

• Single-task soft multiclass linear SVM:

$\mathcal{L}\big(\{w_l, b_l\}_{l=1}^{k}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}\big) = \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2 + C\sum_{i=1}^{m}\xi_i^p - \sum_{i=1}^{m}\sum_{l=1}^{k}\alpha_{i,l}\big[s_{y_i} - s_l - 1 + \xi_i\big] - \sum_{i=1}^{m}\beta_i\xi_i$

subject to: $\alpha_{i,l} \ge 0,\ \beta_i \ge 0$

Dual Problem
• Recall the binary dual (5.32):

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$   (5.32)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0$

• We can rewrite the multiclass dual program in the following vector form:

$\max_{A}\ \mathcal{L}_{dual} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$ for all $i \in [m]$

• where $A_i = 1_{y_i} - \alpha_i$ and $\alpha_i = (\alpha_{i,1}, \dots, \alpha_{i,k})$
• Let $1_y$ be the vector whose components are all zero except for the y-th component, which is equal to 1
• Let $1$ be the vector whose components are all 1.

Dual Problem.

$\max_{A}\ \mathcal{L}_{dual} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$

$\alpha_i = \{\alpha_{i,1},\ \alpha_{i,2},\ \dots,\ \alpha_{i,k}\}$
$1_{y_i} = [0\ \dots\ 0\ 1\ 0\ \dots\ 0] \ \Rightarrow\ A_i = 1_{y_i} - \alpha_i = [\,-\alpha_{i,1},\ -\alpha_{i,2},\ \dots,\ 1 - \alpha_{i,y_i},\ \dots,\ -\alpha_{i,k}\,]$

$A_i \cdot 1_{y_i} = 1 - \alpha_{i,y_i}$
$A_i \cdot 1 = 1 - \sum_{l=1}^{k}\alpha_{i,l} = 0$

Applying a Kernel function K(·,·)

Recall the binary case: $h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b\Big)$

• Replacing the inner products with a kernel function K(·,·) that satisfies
Mercer's conditions, the general dual program using kernel functions
is therefore:

$\max_{A}\ \mathcal{L} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,K(x_i, x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$

• The classification function becomes:

$h(x) = \arg\max_{l=1,\dots,k}\Big\{\sum_{i=1}^{m} A_{i,l}\,K(x, x_i) + b_l\Big\}$

Support Vectors
• The first sum is over all patterns that belong to class $l$. Hence, an example
labeled $y_i = l$ is a support pattern only if $1 - \alpha_{i,l} \ne 0$
• The second sum is over the rest of the patterns, whose labels are different from $l$.
In this case, an example is a support pattern only if $\alpha_{i,l} > 0$

$w_l = \beta\Big[\sum_{\substack{i=1 \\ y_i = l}}^{m}(1 - \alpha_{i,l})\,\Phi(x_i) + \sum_{\substack{i=1 \\ y_i \ne l}}^{m}(-\alpha_{i,l})\,\Phi(x_i)\Big]$

Probabilistic interpretation of the vector α_i
• For each pattern (example) $x_i$, the vector $\alpha_i$ satisfies the constraints

$\alpha_{i,l} \ge 0 \ \wedge\ \sum_{l=1}^{k}\alpha_{i,l} = 1$

• Each set $\alpha_i = (\alpha_{i,1}, \dots, \alpha_{i,k})$ can therefore be viewed as a probability distribution over the labels
• $x_i$ is a support pattern if and only if its corresponding distribution is not
concentrated on the correct label $y_i$. That is: $\alpha_{i,y_i} < 1$ for the correct label and $\alpha_{i,l} > 0$ for some $l \ne y_i$
• Therefore, the classifier is constructed using the patterns whose labels are uncertain;
the rest of the input patterns are ignored.

Example
• Suppose k = 5 classes and consider the distribution $\alpha_i$ of each training example:

• An example whose distribution is concentrated on its correct label ($\alpha_{i,y_i} = 1$) does not support the solution
• An example whose distribution is not concentrated on its correct label supports the solution; the mapped example $\Phi(x_i)$ enters each $w_l$ scaled by its coefficient $\alpha_{i,l}$

Quadratic Programming
• Both the primal and dual problems are simple QPs generalizing those
of the standard SVM algorithm.
• However, the size of the solution and the number of constraints for both
problems is on the order of mk, which, for a large number of classes k, can make them
difficult to solve.
• However, there exist specific optimization solutions designed for this
problem, based on a decomposition of the problem into disjoint sets
of constraints.
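For reference, the Crammer-Singer formulation can be trained with scikit-learn's LinearSVC via multi_class='crammer_singer'; a minimal sketch, with an illustrative dataset and hyperparameters.

```python
# Train the Crammer-Singer multiclass SVM with scikit-learn's LinearSVC.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)
print(clf.coef_.shape)        # one weight vector w_l per class: (k, N)
print(clf.score(X, y))        # training accuracy
```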

Concluding Remarks
• Generalizes the binary SVM algorithm
• If we have only two classes, this reduces to the binary SVM (up to scale)

• Comes with generalization guarantees similar to those of the binary SVM

• Can be trained using different optimization methods
• Stochastic sub-gradient descent can be generalized

Generalization bound
• In multi-class classification, a kernel-based hypothesis is based on a matrix W of
k prototypes; the vector $w_l$ is the l-th row of W.
• Each weight vector $w_l$ defines a scoring function $x \mapsto w_l \cdot \Phi(x)$
• A family of kernel-based hypotheses we will consider is

$H_K = \big\{\, (x, y) \in X \times \{1, \dots, k\} \mapsto w_y \cdot \Phi(x) \ :\ W = (w_1, \dots, w_k)^\top,\ \|W\|_2^2 \le \Lambda^2 \,\big\}$

• in which $\|W\|_2^2 = \sum_{l=1}^{k}\|w_l\|_2^2$

Generalization bound
• Assume that there exists $r > 0$ such that $K(x, x) \le r^2$ for all $x \in X$
• For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H_K$:

$R(h) \ \le\ \frac{1}{m}\sum_{i=1}^{m}\xi_i + 4k\,\frac{r\Lambda}{\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (9.12)

$h \in H_K = \Big\{\,(x, y) \mapsto w_y \cdot \Phi(x)\ :\ \sum_{l=1}^{k}\|w_l\|_2^2 \le \Lambda^2 \,\Big\}$

• where $\xi_i = \max\big(0,\ 1 - (w_{y_i}\cdot\Phi(x_i) - \max_{l \ne y_i} w_l\cdot\Phi(x_i))\big)$ for all $i \in [m]$

Appendix
SVM solvers
SVM solvers, Exact SVM solvers
• LIBSVM
• LIBLINEAR
• liquidSVM
• Pegasos
• LASVM
• SVMLight

SVM solvers, Hierarchical solvers
• ThunderSVM
• cuML SVM
• LPSVM

SVM solvers, Approximate SVM solvers
• DC-SVM
• EnsembleSVM
• BudgetedSVM

SVM solvers run on GPU
• GTSVM
• OHD-SVM

SVM solvers, Multiclass
• Crammer-Singer SVM
• MSVMpack
• BSVM
• LaRank
• GaLa

