
Pattern Recognition

(Pattern Classification)
Support Vector Machine (SVM)
Hypothesis set and Algorithm

Second Edition
Recall from Chapter 1
True error bound and linear hypotheses
• Linear in input space: $h(x) = w \cdot x + b$

• Linear in feature space: $h(x) = w \cdot \Phi(x) + b$

$\Lambda \ge \|h\|_{\mathbb{H}} = \mathcal{R}(W) = \|W\|_2 = \sqrt{w_1^2 + w_2^2 + w_3^2 + \dots}$

The learning algorithm tries to minimize the upper bound of the true error by finding, for a given $\lambda \ge 0$, the hypothesis

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big) = \underset{h\in H}{\mathrm{argmin}}\;L(W)$

where $\hat{R}_S(h)$ is the empirical risk, $\lambda\,\mathcal{R}(h)$ is the complexity (regularizing) term, and $L(W)$ is an upper bound of the true risk (loss):

$L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(W)$

$\mathcal{R}(W) = \|W\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

This chapter is mostly based on:


Foundations of Machine Learning, 2nd Ed., by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar,
MIT Press, 2018

1- Binary Support Vector Machine
Binary Support Vector Machines
• SVM is one of the most theoretically well-motivated and practically most
effective classification algorithms in modern machine learning.
• We first introduce the algorithm for the consistent case (H contains the concept
to learn), then present its general version designed for the inconsistent
case, and finally provide a theoretical foundation for SVMs based on the
notion of margin

SVM: a small generalization error learning
machine
• Consider an input space X that is a subset of $\mathbb{R}^N$ with $N \ge 1$, an output or target space $Y = \{-1, +1\}$,
and let $f : X \to Y$ be the target function (concept)
• Given a hypothesis set H of functions mapping X to Y, the binary classification task is
formulated as follows
• The learner receives a training sample S of size m drawn i.i.d. from X according to some
unknown distribution D,
$S = \big((x_1, y_1), \dots, (x_m, y_m)\big)$, with $y_i = f(x_i)$ for all $i \in [m]$

• The problem consists of determining a hypothesis $h \in$ H, a binary classifier, with small
generalization error: $R_D(h) = \mathbb{P}_{x \sim D}\big[h(x) \ne f(x)\big]$

H : ρ-margin linear hyperplane set
• Different hypothesis sets H can be selected for this task. In view of
Occam's razor principle, hypothesis sets with smaller complexity and
smaller VC-dimension provide better learning guarantees, when
everything else is equal
• A natural hypothesis set with relatively small complexity is that of
linear classifiers, or hyperplanes, which can be defined as follows:

$H = \{\, x \mapsto \mathrm{sign}(w \cdot x + b) \ :\ w \in \mathbb{R}^N,\ b \in \mathbb{R} \,\}$   (5.2)

• The learning problem is then referred to as a linear classification problem

H : ρ-margin linear hyperplane set
• The general equation of a hyperplane in $\mathbb{R}^N$ is $w \cdot x + b = 0$, where $w \in \mathbb{R}^N$ is a non-zero vector
normal to the hyperplane and $b \in \mathbb{R}$ is a scalar.

• A hypothesis $h \in$ H of the form $x \mapsto \mathrm{sign}(w \cdot x + b)$ thus labels positively all points falling on one
side of the hyperplane $w \cdot x + b = 0$ and negatively all others

• The definition of the SVM solution is based on the notion of margin

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

2- Binary SVM: consistent case
(H contains the concept to learn)
(Separable case)
Binary SVM - Consistent Case
• Concept class (positive class) :
• Negative class :

• Margin loss variable:

• Consistent case: h can be learned such that there is no margin loss for any training example

Binary SVM Margin Loss functions:
• Score: $s_i = y_i\,h(x_i)$

• Hinge loss function: $\Phi_\rho\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i h(x_i)\big)$

[Figure: hinge loss $1 - y_i h(x_i)$ plotted against the score $s_i = y_i h(x_i)$]

Regularization-based algorithm 1
• Upper bound of the true risk:

$\mathcal{R}(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

$R(h) \le \mathcal{L}(w, b) = \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

where the first term is the empirical margin loss $\hat{R}_{S,\rho=1}(h)$, the second term is the weighted regularizer,
$\mathcal{L}(w, b)$ is an upper bound of the true risk, and $\lambda \ge 0$ is the regularization parameter

• The solution $w$ and $b$ of the optimization problem $\min_{w,b}\ \mathcal{L}(w, b)$ gives the SVM hypothesis
Regularization-based algorithm 1.
• Regularization-based algorithm, recalled from Chapter 1:

$\underset{h\in H}{\mathrm{argmin}}\;L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(W)$

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$

hinge loss: $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

Regularization-based algorithm 1..
• Finding $(w, b)$ that has the minimum regularizer and no empirical loss

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores at 1 or more, subject to: $y_i(w \cdot x_i + b) \ge 1,\ \ i \in [m]$

• (5.7) and (5.48.1) are convex optimization problems and specific
instances of quadratic programming (QP)

Regularization-based algorithm 2
• Regularization-based algorithm with $C = 1/\lambda \ge 0$ and per-example weights $\alpha_i \equiv 1/(m\lambda) \ge 0$:

$\underset{h\in H}{\mathrm{argmin}}\;L(W) = \mathcal{R}(W) + \sum_{i=1}^{m}\alpha_i\, L_i\big(h(x_i, W),\, y_i\big)$

$\min_{w,b,\boldsymbol{\alpha}}\ L(W) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big), \qquad \boldsymbol{\alpha} = [\alpha_1 \dots \alpha_i \dots \alpha_m]$

hinge loss: $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

Dual Problem for Algorithm 2: Lagrangian function
• Lagrangian function associated to problem (5.7):

$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big)$   (5.8)

The Lagrangian is a weighted upper bound of the true risk; the $\alpha_i \ge 0$ are the Lagrange variables

• The solution $w$ and $b$ at the saddle point of $\mathcal{L}(w, b, \boldsymbol{\alpha})$ (minimum over $w, b$, maximum over $\boldsymbol{\alpha} \ge 0$) is the
solution of the primal problem (5.7)

Algorithm - QP solvers
• A variety of commercial and open-source solvers are available for
solving convex QP problems (see appendix).
• Specialized algorithms have been developed to solve this particular
convex QP problem more efficiently.
Conditions at the solution:
• Set the gradient of the Lagrangian with respect to the primal variables $w$ and $b$ to
zero: $\nabla_w \mathcal{L} = 0$, $\nabla_b \mathcal{L} = 0$
• Set the weighted terms $\alpha_i\big(1 - y_i(w \cdot x_i + b)\big) = 0$ for all $i$
(the complementary slackness conditions)

Derivatives of the Lagrangian function
$h(x_i) = w \cdot x_i + b$

$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.9)

$\nabla_b \mathcal{L} = -\sum_{i=1}^{m}\alpha_i y_i = 0 \ \Rightarrow\ \sum_{i=1}^{m}\alpha_i y_i = 0$   (5.10)

no empirical loss: $\forall i,\ \alpha_i = 0 \ \lor\ y_i(w \cdot x_i + b) = 1$   (5.11)

$x_i$ is called a support vector (support example) when $\alpha_i > 0$

w is unique
• The solution $w$ of the SVM problem is unique, but the support vectors are not

• In dimension N, N + 1 points are sufficient to define a hyperplane. When more
than N + 1 points lie on a marginal hyperplane, different choices are
possible for the support vectors

Dual optimization problem
• Plugging (5.9) and (5.10) into the Lagrangian function (5.8) yields the minimum
loss:

$\mathcal{L}(w, b, \alpha) = \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^{m}\alpha_i y_i b}_{0} + \sum_{i=1}^{m}\alpha_i$   (5.12)

where $\frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$

• which simplifies to

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.13)

Dual optimization solution
• This leads to the following dual optimization problem for SVMs in the separable case:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.14)

subject to: $\alpha_i \ge 0 \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The dual objective function is concave and differentiable. The dual optimization problem
is a QP problem; general-purpose and specialized QP solvers can be used
• The SMO (Sequential Minimal Optimization) algorithm is used to solve the dual form of the
SVM problem in the more general non-separable setting
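As a concrete illustration of handing (5.14) to a general-purpose QP solver, here is a sketch assuming the cvxopt package and linearly separable data; the helper name, the support-vector threshold, and the toy points are illustrative assumptions. It also recovers w via (5.9) and b via (5.16).

```python
# Hard-margin SVM dual (5.14) solved as a convex QP with cvxopt (a sketch).
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def svm_dual_hard(X, y):
    m = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_ij = y_i y_j (x_i . x_j)
    # cvxopt minimizes (1/2) a'Qa + q'a, so maximizing (5.14) means q = -1
    P, q = matrix(Q.astype(float)), matrix(-np.ones(m))
    G, h = matrix(-np.eye(m)), matrix(np.zeros(m))      # alpha_i >= 0
    A, b = matrix(y[None, :].astype(float)), matrix(0.0)  # sum_i alpha_i y_i = 0
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                                  # eq. (5.9)
    sv = alpha > 1e-6                                    # support vectors have alpha_i > 0
    b0 = y[sv][0] - X[sv][0] @ w                         # eq. (5.16), using any support vector
    return w, b0, alpha

# Tiny separable example
X = np.array([[0., 0.], [1., 1.], [3., 3.], [4., 4.]])
y = np.array([-1., -1., 1., 1.])
w, b0, alpha = svm_dual_hard(X, y)
print(w, b0)                                             # expected roughly w = (0.5, 0.5), b = -2
```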

Primal and dual problems are equivalent
• The solution of the dual problem (5.14) can be used directly to determine the
hypothesis returned by SVMs, using equation (5.9):

$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x) + b\Big)$   (5.15)

$w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.9)

• Since the support vectors lie on the marginal hyperplanes, for any support
vector $x_j$, $w \cdot x_j + b = y_j$, and thus $b$ can be obtained via

$b = y_j - \sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x_j)$   (5.16)

Inner products between vectors
• The dual optimization problem (5.14) and the expressions (5.15) and (5.16) reveal an
important property of SVMs:
• the hypothesis solution depends only on inner products between vectors and not
directly on the vectors themselves
• This observation is key and its importance will become clear when kernel
methods are introduced
• We can also derive the following expression (see page 85 of the text for details):

$\|w\|_2^2 = \sum_{i=1}^{m}\alpha_i = \|\boldsymbol{\alpha}\|_1$   (5.19)

Theorem 5.4
• Let S be a linearly separable sample of size m.
• Let $h_S$ be the hypothesis returned by SVMs for a sample S, and let $N_{SV}(S)$ be the
number of support vectors that define $h_S$. Then the average generalization error is
bounded by the average fraction of support vectors:

$\mathbb{E}_{S\sim\mathcal{D}^m}\big[R(h_S)\big] \ \le\ \mathbb{E}_{S\sim\mathcal{D}^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big]$   (5.4)

Leave-one-out error: $\hat{R}_{LOO}(\mathrm{SVM}) \ \le\ \frac{N_{SV}(S)}{m+1}$

• where $\mathcal{D}$ denotes the distribution according to which points are drawn


Theorem 5.4
Theorem 5.4 gives a sparsity argument in favor of SVMs:
• The average error of the algorithm is upper bounded by the average fraction of support
vectors
• One may hope that, for many distributions seen in practice, a relatively small
number of the training points will be support vectors
• The solution will then be sparse, in the sense that only a small fraction of the dual variables will be
non-zero
• (5.4) is a relatively weak bound since it applies only to the average generalization error
of the algorithm over all samples of size m. It provides no information about the variance of the
generalization error
• We present stronger high-probability bounds based on the notion of margin
(Theorem 5.10).
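A rough empirical illustration of this sparsity bound, assuming scikit-learn is available: fit a (nearly) hard-margin linear SVM on synthetic data and compare the leave-one-out error with the fraction of support vectors. The dataset and the large C value are assumptions.

```python
# Empirical check: LOO error should be bounded (roughly) by the fraction of support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1e6).fit(X, y)          # very large C approximates the hard-margin SVM
frac_sv = len(clf.support_) / len(X)                 # N_SV(S) / m
loo_err = 1 - cross_val_score(SVC(kernel='linear', C=1e6), X, y, cv=LeaveOneOut()).mean()
print(f"LOO error {loo_err:.3f}  vs  fraction of support vectors {frac_sv:.3f}")
```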
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

A Geometric Representation of SVM
• There are infinitely many separating hyperplanes.
• Linear hypotheses H of the form $h(x) = w \cdot x + b$
• Consistent case: $y_i h(x_i) \ge 0$ for all training examples
• There are an infinite number of hyperplanes that separate the training set

[Figure: a linearly separable training set with several candidate separating hyperplanes]
Which one is the best?
• Answer: To keep the upper bound of the true risk as low as possible, for a
given S and H, we are looking for the maximum "ρ-margin in the loss function
while the empirical error is zero"

A Geometric Representation of SVM
• This is equivalent to the existence of $(w, b)$ such that:

separating hyperplane: $h(x) = w \cdot x + b = 0$
for $y_i = +1$: $h(x_i) \ge 0$, so $y_i h(x_i) \ge 0$
for $y_i = -1$: $h(x_i) \le 0$, so $y_i h(x_i) \ge 0$

Score of $x_i$: $s_i = y_i h(x_i)$

Definition 5.1 - margin of examples
• The geometric margin $\rho_i$ at a point $x_i$ = the distance from $x_i$ to the hyperplane $h(x) = w \cdot x + b = 0$:

$\rho_i = \rho_h(x_i) = \frac{|w \cdot x_i + b|}{\|w\|_2} = \frac{y_i h(x_i)}{\|w\|_2}$  (for correctly classified points, $y_i h(x_i) \ge 0$)

$\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_N^2}$

Score of $x_i$: $s_i = y_i h(x_i)$

Geometric margin
• The geometric margin $\rho_h$ of a linear classifier h for a sample S is the minimum geometric
margin over the points in the sample, that is, the distance of the hyperplane defining h to the closest
sample points.

geometric margin of h: $\rho_h = \min_{i \in [m]} \rho_h(x_i)$

[Figure: separating hyperplane $h(x) = 0$ with a margin band of total width $2\rho_h$]

Geometric margin
Marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$ around the separating hyperplane $h(x) = 0$.
For a point $x_i$ with $h(x_i) \ge 0$, $y_i = +1$:

$\rho_{i,+1} = \frac{y_i\big(h(x_i) - 1\big)}{\|w\|_2} = \rho_i - \frac{1}{\|w\|_2}$

$\rho_{i,-1} = \frac{y_i\big(h(x_i) + 1\big)}{\|w\|_2} = \rho_i + \frac{1}{\|w\|_2}$

$\rho_{i,-1} - \rho_{i,+1} = \frac{2}{\|w\|_2} = 2\rho_h$
Margin of $x_i$ based on $\rho_h$

$\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

$2\rho_h = \frac{2}{\|w\|_2}$

[Figure: point $x_i$ at distance $\rho_i$ from the hyperplane $h(x) = 0$]

SVM: maximum ρ-margin & no empirical error, h ∈ H

• To keep the upper bound of the true risk as low as possible, we are looking
for the maximum "ρ-margin in the loss function while the empirical error is zero"

• It means: $\max_{w,b}\ \rho_h = \frac{1}{\|w\|_2}$

• What is the maximum $\rho_h$ possible?

SVM: maximum ρ-margin & no empirical error, h ∈ H

[Figure: a +1 point $x_i$ and a −1 point $x_j$ at distances $\rho_i$ and $\rho_j$ from the hyperplane $h(x) = 0$, with margin band $2\rho_h = 2/\|w\|_2$; the zero-one, hinge and quadratic hinge losses are plotted against the score]

No training margin loss on the −1 examples: $\rho_j \ge \rho_h \ \Rightarrow\ \frac{\rho_j}{\rho_h} \ge 1$
No training margin loss on the +1 examples: $\rho_i \ge \rho_h \ \Rightarrow\ \frac{\rho_i}{\rho_h} \ge 1$

SVM Margin Loss functions: $L_i = \max\big(0,\ 1 - y_i h(x_i)\big)$

zero-one loss: $1_{y_i h(x_i) \le 0}$
hinge loss: $\max\big(0,\ 1 - y_i h(x_i)\big)$
quadratic hinge: $\max\big(0,\ 1 - y_i h(x_i)\big)^2$

$\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss.
Dual of Algorithm 1: Lagrangian function
• The function associated to problem (5.48) is an upper bound of the true risk:

$\mathcal{R}(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

$R(h) \le \mathcal{L}(w, b) = \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

$\mathcal{L}(w, b)$ is an upper bound of the true risk; $\lambda$ is the regularization parameter (Lagrange variable)

• The solution $w$ and $b$ of $\min_{w,b}\ \mathcal{L}(w, b)$ is the solution of the primal problem
SVM Primal Algorithm 2
• Finding $(w, b)$ that has the maximum geometric margin and no empirical loss:

$\max_{w,b}\ \rho_h = \frac{1}{\|w\|_2}$   (5.7.1)

subject to no empirical loss: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$.  Note that maximizing $\rho_h$ is the same as minimizing $\|w\|_2$.
• Or, equivalently, minimizing the regularizer (a convex optimization problem and a specific
instance of quadratic programming (QP)):

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores at 1 or more, subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

SVM Primal Algorithm 2.
• Given S, find $w$ and $b$ that maximize the geometric margin while there is no
training error:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• The resulting algorithm precisely coincides with (5.48.1)

Dual of Algorithm 2
• Lagrangian function associated to problem (5.7):

$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m}\alpha_i\big(1 - y_i(w \cdot x_i + b)\big)$

where the $\alpha_i \ge 0$ are the Lagrange variables

• The solution $w$ and $b$ at the saddle point of $\mathcal{L}(w, b, \boldsymbol{\alpha})$ is the solution of the primal problem

• Note that: $\rho_h^2 = \frac{1}{\|w\|_2^2} = \frac{1}{\sum_{i=1}^{m}\alpha_i} = \frac{1}{\|\boldsymbol{\alpha}\|_1}$   (5.19)
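A quick numerical check of (5.19) and of $\rho_h = 1/\|w\|_2$, assuming scikit-learn, on (approximately) separable synthetic data with a very large C so the fit behaves like the hard-margin SVM; the data and C value are illustrative.

```python
# Verify ||w||^2 ~ sum_i alpha_i and compute the geometric margin rho = 1 / ||w||.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=1)
y = 2 * y01 - 1                                        # labels in {-1, +1}
clf = SVC(kernel='linear', C=1e6).fit(X, y)
w = clf.coef_.ravel()
alpha = np.abs(clf.dual_coef_).ravel()                 # dual_coef_ stores y_i * alpha_i
print(np.dot(w, w), alpha.sum())                       # the two numbers should be approximately equal
print("geometric margin rho =", 1 / np.linalg.norm(w))
```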

VC-dim of the ρ-margin hyperplane (linear) set H
• VC dimension d of the ρ-margin loss function and the linear set H:
• N is the dimension of the input space
• Let the vectors $x \in X$ belong to a sphere of radius R

$d \ \le\ \min\Big(\Big\lceil \frac{R^2}{\rho^2} \Big\rceil,\ N\Big) + 1$

• Using a large ρ, the generalization ability of the constructed hyperplane is
high:
maximizing ρ minimizes the upper bound of d

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

3- Binary SVM: Non-separable
case
Inconsistent case (non-separable case), H
• In most practical settings, the training data is not linearly separable: for
any hyperplane $w \cdot x + b = 0$, there exists $x_i \in S$ such that

$y_i\,(w \cdot x_i + b) < 1$   (5.22)

• The constraints imposed in the linearly separable case cannot all hold
simultaneously:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$
Inconsistent case (non-separable case)
• We introduce a new slack variable $\xi_i$ into the consistent SVM algorithm to measure the
empirical loss.

Figure 5.4: A separating hyperplane with a point $x_i$ classified incorrectly and a point $x_j$ correctly classified, but with margin less than 1.
Both $x_i$ and $x_j$ are outliers ($\xi_i > 0$ and $\xi_j > 0$).
Loss of the Inconsistent case
• $\xi_i$ represents the loss for $x_i$ based on the margin ρ = 1:

$L\big(y_i h(x_i)\big) = 1 - y_i h(x_i) = \xi_i$,  i.e.  $y_i h(x_i) = 1 - \xi_i$

Total empirical loss $= \sum_{i=1}^{m}\xi_i$

[Figure: hinge loss $\xi_i$ as a function of the score $y_i h(x_i)$, with the separating hyperplane $h(x) = 0$ and the marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$]
Error of outliers:
$0 < \xi_i < 1$: $x_i$ is on the correct side of the separating hyperplane

$1 < \xi_i$: $x_i$ is on the incorrect side of the separating hyperplane

Relaxed constraints
• A relaxed version of these constraints can indeed hold; that is, for
each $i \in [m]$, there exists $\xi_i \ge 0$ such that

subject to: $y_i(w \cdot x_i + b) \ge 1$   relaxed to   subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$

• And therefore the loss function becomes:

$L_i\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) = \xi_i$

• The slack variable $\xi_i$ measures the quantity by which the vector $x_i$ violates the
desired inequality, $y_i(w \cdot x_i + b) \ge 1$
Soft margin - Hard margin
• For the hyperplane $w \cdot x + b = 0$, a vector $x_i$ with $\xi_i > 1$ can be viewed as a misclassified example
• $x_i$ with $0 < \xi_i < 1$ is correctly classified by the hyperplane but is considered to be an
outlier, that is, $\xi_i > 0$

• If we omit the misclassified examples and outliers, the training data is
correctly separated by $w \cdot x + b = 0$ with a margin that we refer to as the soft margin,
as opposed to the hard margin in the separable case
Empirical loss, large margin, loss function
• One idea consists of selecting the hyperplane that minimizes the empirical
loss (that is, ERM)
• But that solution will not benefit from large-margin guarantees

• The problem of determining a hyperplane with the smallest zero-one loss,
that is, the smallest number of misclassifications, is NP-hard as a function
of the dimension of the space. Using the hinge or quadratic hinge loss functions
is computationally feasible

Loss functions

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$   (slack / error terms)

• There are many possible choices for p leading to more or less aggressive
penalizations of the slack terms:
• The choices p = 1 and p = 2 lead to the most straightforward solutions. The loss functions associated
with p = 1 and p = 2 are called the hinge loss and the quadratic hinge loss, respectively:

zero-one loss: $1_{y h(x) \le 0}$
hinge loss: $\max\big(0,\ 1 - y\,h(x)\big)$
quadratic hinge: $\max\big(0,\ 1 - y\,h(x)\big)^2$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss (plotted against the score $y\,h(x)$).
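The three losses of figure 5.5 written out as plain NumPy functions (a sketch; the sample scores are arbitrary).

```python
# Zero-one, hinge, and quadratic hinge losses as functions of the score s = y * h(x).
import numpy as np

def zero_one(s):        return (s <= 0).astype(float)
def hinge(s):           return np.maximum(0.0, 1.0 - s)
def quadratic_hinge(s): return np.maximum(0.0, 1.0 - s) ** 2

s = np.array([-1.5, -0.2, 0.3, 0.9, 1.0, 2.0])
for f in (zero_one, hinge, quadratic_hinge):
    print(f.__name__, f(s))   # hinge and quadratic hinge are >= zero-one for every score
```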
Two conflicting objectives: loss and margin
• On one hand, we wish to limit the total amount of empirical loss
(slack penalty) due to misclassified examples and outliers, which can
be measured by $\sum_{i=1}^{m}\xi_i$ or, more generally, by $\sum_{i=1}^{m}\xi_i^p$ for some $p \ge 1$.

• On the other hand, we seek a hyperplane with a large margin, though a
larger margin can lead to more misclassified examples and outliers and thus
larger amounts of loss

Primal optimization problem
• This leads to the following general optimization problem defining SVMs in the non-
separable case, where the parameter C determines the trade-off between margin
maximization (or minimization of $\|w\|_2$) and minimization of the slack penalty. A small C
means a large empirical loss (regularization using C).

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$   (5.24)

subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ \ i \in [m]$

relaxed score constraints; non-negativity constraint on the slack (error) variables.
Not all examples need to satisfy the score constraint.
• (5.24) is a convex optimization problem
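A small sketch of the trade-off controlled by C in (5.24), assuming scikit-learn: after fitting, the slacks $\xi_i = \max(0, 1 - y_i h(x_i))$ can be read off the decision function. The synthetic data and the C values are assumptions.

```python
# Larger C -> less total slack but a smaller geometric margin; smaller C -> the opposite.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)  # overlapping classes
y = 2 * y01 - 1
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    scores = y * clf.decision_function(X)
    xi = np.maximum(0.0, 1.0 - scores)               # slack variables
    margin = 1 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: total slack={xi.sum():7.2f}  geometric margin={margin:.3f}")
```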

A regularization view
• The optimization problem in (5.24) presents a regularization-based solution
• Higher C means lower training error and a smaller margin
• Lower C means higher training error and a larger margin

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big) \quad\Longleftrightarrow\quad \min_{w,b,\boldsymbol{\xi}}\ \frac{1}{m}\sum_{i=1}^{m}\xi_i^p + \lambda\,\|w\|_2^2$

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p \quad\Longleftrightarrow\quad \underset{h\in H}{\mathrm{argmin}}\;\big(1/\rho_h + C\,\hat{R}(h)\big), \qquad C \equiv 1/\lambda$

A regularization view
• Back to the results in Chapter 1 for regularization-based algorithms:

$\underset{h\in H}{\mathrm{argmin}}\;\big(\hat{R}_S(h) + \lambda\,\mathcal{R}(h)\big)$

$\underset{h\in H}{\mathrm{argmin}}\;L(w) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W),\, y_i\big) + \lambda\,\mathcal{R}(w)$

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^{m}\max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2, \qquad L\big(y_i h(x_i)\big) = \xi_i$

Lagrangian function
primal: $\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i$,  subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ i \in [m]$

• The analysis is presented in the case of the hinge loss (p = 1), which is the most widely used loss function
for SVMs.
• We introduce Lagrange variables $\alpha_i \ge 0$, associated to the score constraints, and $\beta_i \ge 0$, associated to the
non-negativity constraints of the slack variables
• We denote by $\boldsymbol{\alpha}$ the vector $(\alpha_1, \dots, \alpha_m)^\top$ and by $\boldsymbol{\beta}$ the vector $(\beta_1, \dots, \beta_m)^\top$
• The Lagrangian can then be defined, for all $w \in \mathbb{R}^N$, $b \in \mathbb{R}$, and $\boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}_+^m$, by

dual:
$\mathcal{L}(w, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{m}\beta_i\xi_i$   (5.25)

where $\alpha_i \ge 0$ and $\beta_i \ge 0$
Derivatives of the Lagrangian function (5.25)

• A vector $x_i$ appears in the solution iff $\alpha_i \neq 0$. Such vectors are called support
vectors

$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m}\alpha_i y_i x_i = 0 \ \Rightarrow\ w = \sum_{i=1}^{m}\alpha_i y_i x_i$   (5.26)

$\nabla_b \mathcal{L} = -\sum_{i=1}^{m}\alpha_i y_i = 0 \ \Rightarrow\ \sum_{i=1}^{m}\alpha_i y_i = 0$   (5.27)

$\nabla_{\xi_i} \mathcal{L} = C - \alpha_i - \beta_i = 0 \ \Rightarrow\ \alpha_i + \beta_i = C$   (5.28)

$\forall i,\ \alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) = 0 \ \Rightarrow\ \alpha_i = 0 \ \lor\ y_i(w \cdot x_i + b) = 1 - \xi_i$   (5.29)

$\forall i,\ \beta_i\,\xi_i = 0 \ \Rightarrow\ \beta_i = 0 \ \lor\ \xi_i = 0$   (5.30)
Two types of support vectors
• By the complementary slackness condition (5.29), if $\alpha_i \neq 0$, then $y_i(w \cdot x_i + b) = 1 - \xi_i$

• If $\xi_i = 0$, then $y_i(w \cdot x_i + b) = 1$ and $x_i$ lies on a marginal hyperplane, as in the separable case, and (5.28)
requires $\alpha_i \le C$

• Otherwise, $\xi_i > 0$ and $x_i$ is an outlier. In this case, (5.30) implies $\beta_i = 0$ and (5.28)
requires $\alpha_i = C$

• As in the separable case, the weight vector solution $w$ is unique; the support vectors are not.
Two types of support vectors
• Support vectors are either outliers, in which case $\alpha_i = C$, or vectors lying
on the marginal hyperplanes, in which case $\alpha_i \le C$

Support vectors: $\{\, x_i : \alpha_i \neq 0 \,\}$
Dual optimization problem for (5.24)
• Plug the definition of $w$ in terms of the dual variables (5.26) into the Lagrangian and
apply constraint (5.27). This yields

$\mathcal{L}(w, b, \alpha) = \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^{m}\alpha_i y_i b}_{0} + \sum_{i=1}^{m}\alpha_i$   (5.31)

where $\frac{1}{2}\big\|\sum_i \alpha_i y_i x_i\big\|^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$

• Remarkably, we find that the objective function is no different than in the
separable case:

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.32)

Dual optimization problem: non-separable case
• The dual problem only differs from that of the separable case (5.14) by the
constraints $0 \le \alpha_i \le C$:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{m}\alpha_i$   (5.33)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The objective function is concave and differentiable and (5.33) is
equivalent to a convex QP. The problem is equivalent to the primal
problem (5.24).
Hypothesis
• The solution of the dual problem (5.33) can be used directly to determine the hypothesis
returned by SVMs, using equation (5.26):

$h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x) + b\Big)$   (5.34)

$w = \sum_{i=1}^{m}\alpha_i y_i x_i$

• $b$ can be obtained from any support vector $x_j$ lying on a marginal hyperplane:

$b = y_j - \sum_{i=1}^{m}\alpha_i y_i (x_i \cdot x_j)$   (5.35)

• Important property of SVMs: the hypothesis solution depends only on inner products
between vectors and not directly on the vectors themselves
Generalization bounds using margin theory
• Generalization bounds provide a strong theoretical justification for
the SVM algorithm.
• Confidence margin: the confidence margin of a real-valued function $h$ at a
point $x$ labeled with $y$ is the quantity $y\,h(x)$

• When $y\,h(x) > 0$, $h$ classifies $x$ correctly

• $|h(x)|$ is the confidence of the prediction made by $h$
Margin loss function
• For any parameter $\rho > 0$, we have a ρ-margin loss function that penalizes $h$ with a cost of
1 when it misclassifies the point $x$ ($y\,h(x) \le 0$), and penalizes it linearly when it correctly
classifies $x$ with confidence less than or equal to ρ ($0 \le y\,h(x) \le \rho$).

• The parameter ρ can be interpreted
as the confidence margin
demanded from a hypothesis $h$

Figure 5.6: the ρ-margin loss function (illustrated in red), plotted against $y_i h(x_i) = \rho_i / \rho_h$
Empirical margin loss
• Definition 5.6 Given a sample S and a hypothesis h, the empirical margin
loss is defined by

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m}\Phi_\rho\big(y_i h(x_i)\big)$   (5.37)
Generalization bound for linear hypotheses
• Corollary 5.11 Let $H = \{x \mapsto w \cdot x : \|w\|_2 \le \Lambda\}$ and assume that $\|x\|_2 \le r$ for all $x \in X$. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least
$1 - \delta$ over the choice of a sample S of size m, the following holds for any $h \in H$:

$R(h) \ \le\ \hat{R}_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (5.44)

• In the separable case, for a linear hypothesis with geometric margin $\rho_h$ and the choice of
confidence margin parameter $\rho = \rho_h$, the empirical margin loss is 0
Generalization bound for linear hypotheses
• (5.44) ⇒ a small generalization error can be achieved when:
• $r\Lambda/\rho$ is small and
• the empirical margin loss is relatively small

• For a given problem, a larger margin ρ means a smaller upper bound on the
generalization error
• It is a strong justification for margin-maximization algorithms such as
SVMs

From Bound to Optimization problem
• An algorithm based on this theoretical guarantee consists of
minimizing the right-hand side of (5.44), that is, minimizing an objective
function with a term corresponding to the sum of the slack variables $\sum_{i=1}^{m}\xi_i$, and
another one minimizing $\|w\|_2^2$

$R(h) \ \le\ \hat{R}_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (5.44)
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

4- Kernel Methods
Kernel Methods and the non-separable case, H
• When S is not linearly separable, the target function is nonlinear
• Q: How can we use a linear hypothesis set H to learn a non-linear target function?
• A: Kernel method:
• use a nonlinear mapping Φ from the input space to a higher-dimensional feature
space,
• H is linear in the feature space
• H is nonlinear in the input space
• Now we train a nonlinear hypothesis. In some applications it is possible to find a mapping Φ for which the mapped sample is linearly separable. So the
problem becomes consistent in the feature space, with no training error (consistent case)

Example: degree-2 Polynomial Kernel
• Suppose the input space is X ⊆ ℝ^N
• Apply a nonlinear mapping Φ to the input
• Then the hypothesis is nonlinear in the input space
• And the hypothesis $h(\Phi(x)) = w \cdot \Phi(x) + b$ is linear in the feature space

• The hypothesis is nonlinear in the input space and linear in the feature space

• The complexity of the linear hypothesis set in the feature space is twice that of the linear hypothesis set in the input space

Figure 6.1

(a) $\hat{R} > 0$: linear in the input space     (b) $\hat{R} = 0$: linear in the feature space and
                                                  nonlinear in the input space

Figure 6.1: Non-linearly separable case. The classification task consists of discriminating between blue and red points.
(a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.

Polynomial Kernel and complexity of linear H
• Complexity of a linear H in the feature space: the dimension of the feature space grows rapidly with the polynomial degree
• For example: an input space in $\mathbb{R}^N$ and a polynomial kernel of degree d give a feature space of dimension $\binom{N+d}{d}$
• When N and d are both large, this dimension is huge
• The VC-dimension is then huge and the hypothesis can easily overfit the training data
SVMs with kernels
• Replace each instance of an inner product $(x_i \cdot x_j)$ in (5.33) with a kernel $K(x_i, x_j)$:

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{m}\alpha_i$   (6.13)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ i \in [m]$

• The solution can be written as:

$h(x) = \mathrm{sgn}\big(w \cdot \Phi(x) + b\big) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b\Big)$   (6.14)

with $b = y_j - \sum_{i=1}^{m}\alpha_i y_i K(x_i, x_j)$ for any $x_j$ with $0 < \alpha_j < C$

Definition 6.1 (Kernels)
• A function $K : X \times X \to \mathbb{R}$ is called a kernel over X
• The idea is to define a kernel K such that, for any two examples $x, x' \in X$, $K(x, x')$ is equal to an inner
product of vectors $\Phi(x)$ and $\Phi(x')$:

$\forall x, x' \in X,\quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle$   (6.1)

• Since an inner product is a measure of the similarity of two vectors, K is often
interpreted as a similarity measure between elements of the input space

Polynomial kernels
• For any constant $c > 0$, a polynomial kernel of degree $d$ is the kernel K defined
over $\mathbb{R}^N$ by:

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = (x \cdot x' + c)^d$   (6.3)

• Example (N = 2, d = 2): $\forall x, x' \in \mathbb{R}^2,\ K(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2$

$K(x, x') = \langle \Phi(x), \Phi(x') \rangle = \big[\,x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\,\big] \cdot \big[\,x_1'^2,\ x_2'^2,\ \sqrt{2}\,x'_1 x'_2,\ \sqrt{2c}\,x'_1,\ \sqrt{2c}\,x'_2,\ c\,\big]^\top = (x_1 x'_1 + x_2 x'_2 + c)^2$
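A short numerical check of the identity (6.1) for this degree-2 polynomial kernel, using the explicit feature map above (NumPy; the constant c and the test points are arbitrary).

```python
# Verify K(x, x') = <Phi(x), Phi(x')> for the degree-2 polynomial kernel in R^2.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

def K(x, xp, c):
    return (np.dot(x, xp) + c) ** 2

c = 1.0
x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(K(x, xp, c), np.dot(phi(x, c), phi(xp, c)))   # the two numbers agree (42.25)
```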
Example: XOR and a 2nd-degree polynomial

Feature map (c = 1): $\Phi(x) = \big[\,x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1\,\big]$

Mapped XOR points:
$(1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$

The SVM solution $h(\Phi(x))$ separates the classes in the feature space.

Figure 6.3: Illustration of the XOR classification problem and the use of polynomial kernels. (a) The XOR problem is linearly
non-separable in the input space. (b) It is linearly separable using a second-degree polynomial kernel.

Gaussian kernels or radial basis function (RBF)
• Gaussian kernels are among the most frequently used kernels in applications

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \exp\Big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\Big)$   (6.5)

• For $x = x'$, $K(x, x') = 1$ (maximum similarity)
• For $\|x - x'\| \to \infty$, $K(x, x') \to 0$ (maximum dissimilarity)

• What is the non-linear mapping Φ?
• What is the complexity of a linear H in the feature space?
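The Gaussian kernel (6.5) as a plain function, for reference (a sketch; the value of σ is an assumption).

```python
# RBF kernel: similarity decays with squared Euclidean distance.
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, x))                        # identical points: maximum similarity, K = 1
print(rbf_kernel(x, np.array([10.0, 0.0])))    # distant points: K close to 0
```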
Sigmoid kernels
• For any real constants $a, b \ge 0$, a sigmoid kernel is defined over $\mathbb{R}^N$ by:

$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \tanh\big(a\,(x \cdot x') + b\big)$   (6.6)

• Using sigmoid kernels with SVMs leads to an algorithm that is closely
related to learning algorithms based on simple neural networks,
which have a sigmoid activation function.

Example-1D
• Suppose we have 5 one-dimensional data points as the training set:
• $(x_1=1, y_1=+1)$, $(x_2=2, y_2=+1)$, $(x_3=4, y_3=-1)$, $(x_4=5, y_4=-1)$, $(x_5=6, y_5=+1)$

Class label     +1   +1   −1   −1   +1
data point      x₁   x₂   x₃   x₄   x₅
input space      1    2    4    5    6

$\Phi(x_i) = \big[\,x_i^2,\ \sqrt{2}\,x_i,\ 1\,\big]^\top$, $\quad K(x, x_i) = (x\,x_i + 1)^2$

Example-1D, Feature Space

$\Phi(x_i) = \big[\,x_i^2,\ \sqrt{2}\,x_i,\ 1\,\big]^\top$

$x_2 = 2,\ \ \Phi(x_2) = [4,\ 2\sqrt{2},\ 1]$ $\qquad x_5 = 6,\ \ \Phi(x_5) = [36,\ 6\sqrt{2},\ 1]$

[Figure: mapped points $\Phi(x_1), \dots, \Phi(x_5)$ in the feature space; $\Phi(x_2)$, $\Phi(x_4)$ and $\Phi(x_5)$ are the support vectors (sv)]

Find the widest strip just by looking at the data!

Example-1D, Applying the SVM algorithm
• Polynomial kernel of degree 2: $K(x, x_i) = (x \cdot x_i + 1)^2$

• C is set to 100 (using a large C means we want to emphasize minimizing the training
error)
• We first find $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_5)$ by solving

$\max_{\boldsymbol{\alpha}}\ \mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j + 1)^2$

subject to: $0 \le \alpha_i \le 100 \ \wedge\ \sum_{i=1}^{5}\alpha_i y_i = 0$

Example-1D: the kernel values $K(x_i, x_j) = (x_i x_j + 1)^2$

data point      x₁   x₂   x₃   x₄   x₅
input value      1    2    4    5    6

e.g. $K(x_5, x_5) = (6 \times 6 + 1)^2 = 1369$, $\quad K(x_5, x_1) = (6 \times 1 + 1)^2 = 49$

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j + 1)^2$

(the double sum is expanded term by term using the kernel values above)

Example-1D
• Finding α to maximize $\mathcal{L}_{dual}(\boldsymbol{\alpha})$
• Solving the following linear equations (one per $\alpha_i$, together with $\sum_i y_i\alpha_i = 0$), or using a QP solver:

$\frac{\partial\mathcal{L}}{\partial\alpha_1} = 1 - 0.5\,(2\times 4\,\alpha_1 + 2\times 9\,\alpha_2 - 2\times 25\,\alpha_3 - 2\times 36\,\alpha_4 + 2\times 49\,\alpha_5) = 0$

(and similarly for $\alpha_2, \dots, \alpha_5$)

Example-1D
• We get:
• $\alpha_1 = 0$, $\alpha_2 = 2.5$, $\alpha_3 = 0$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$  (check: $2.5 + 4.833 - 7.333 = 0$)
• Note that all $\alpha_i < C = 100$, so there is no training error
• The support vectors are $\{x_2 = 2,\ x_4 = 5,\ x_5 = 6\}$,
with $\alpha_2 = 2.5$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$

data point      x₁   x₂   x₃   x₄   x₅
input value      1    2    4    5    6   (support vectors: x₂, x₄, x₅)
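A sketch reproducing this 1-D example with scikit-learn (an assumption; any SVM solver would do). In scikit-learn the polynomial kernel is $(\gamma\langle x, x'\rangle + \text{coef0})^{\text{degree}}$, so $\gamma = 1$, coef0 = 1, degree = 2 matches $K(x, x_i) = (x \cdot x_i + 1)^2$; dual_coef_ stores $y_i\alpha_i$ for the support vectors and should roughly match the α values above, with intercept_ close to b = 9.

```python
# Reproduce the 1-D worked example with a degree-2 polynomial kernel and C = 100.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=100.0).fit(X, y)
print("support vectors:", clf.support_vectors_.ravel())   # expect x = 2, 5, 6
print("y_i * alpha_i   :", clf.dual_coef_.ravel())         # roughly +2.5, -7.333, +4.833
print("b               :", clf.intercept_)                 # roughly 9
print("predictions     :", clf.predict(X))                 # all training points classified correctly
```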

Example-1D: weight vector and margin
Support vectors: $x_2, x_4, x_5$

$w = \alpha_2 y_2 \Phi(x_2) + \alpha_4 y_4 \Phi(x_4) + \alpha_5 y_5 \Phi(x_5) = [+0.663,\ -3.77,\ 0]$

$h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

$\rho_h = \frac{1}{\|w\|} = 0.261 = \frac{1}{\sqrt{\alpha_2 + \alpha_4 + \alpha_5}}$

[Figure: mapped points $\Phi(x_1), \dots, \Phi(x_5)$ in the $(z_1, z_2)$ feature space with the weight vector $w$ and the maximum-margin strip]
Example-1D: recovering the hypothesis, $K(x, x_i) = (x \cdot x_i + 1)^2$
Support vectors: $x_2, x_4, x_5$

• The hypothesis is $h(x) = \sum_{x_i \in S}\alpha_i y_i K(x, x_i) + b$:

$h(x) = 2.5\,(1)(2x + 1)^2 + 7.333\,(-1)(5x + 1)^2 + 4.833\,(1)(6x + 1)^2 + b$
$h(x) = 0.6667\,x^2 - 5.333\,x + b$

• $b$ can be recovered by solving $h(2) = +1$, or $h(6) = +1$, or $h(5) = -1$, since $x_2$ and $x_5$ lie on the marginal hyperplane $h(x) = +1$ and $x_4$ lies on $h(x) = -1$

• All three give $b = 9$:

$h(x) = 0.6667\,x^2 - 5.333\,x + 9 \qquad h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

Example-1D: decision boundaries in the input space

$h(x) = w^\top\Phi(x) + b = 0:\quad 0.6667\,x^2 - 5.333\,x + 9 = 0$
$h(x) = w^\top\Phi(x) + b = +1:\quad 0.6667\,x^2 - 5.333\,x + 9 = +1$
$h(x) = w^\top\Phi(x) + b = -1:\quad 0.6667\,x^2 - 5.333\,x + 9 = -1$

[Figure: the five training points on the x-axis (1, 2, 4, 5, 6); the decision function labels the regions
x < 2.42 and x > 5.58 as +1 and the region in between as −1; the curves h(x) = ±1 pass through the support vectors]

Example-XOR

Input vector      y
x1: [−1, −1]     −1
x2: [−1, +1]     +1
x3: [+1, −1]     +1
x4: [+1, +1]     −1

Kernel matrix $K_{ij} = (x_i \cdot x_j + 1)^2$, e.g. $K(x_1, x_2) = (x_1 \cdot x_2 + 1)^2 = (0 + 1)^2 = 1$:

    | 9 1 1 1 |
K = | 1 9 1 1 |
    | 1 1 9 1 |
    | 1 1 1 9 |
Example

Note that H is a linear hyperplane set.
The non-linearity is due to the kernel function.

RBF kernel
C = 0.01 (lower C):
higher training error and a larger margin

C = 100 (higher C):
lower training error and a smaller margin

RBF kernel

γ = 10 ≫ 1: decreasing radius of influence
of the support vectors (no amount
of regularization with C will be
able to prevent overfitting)
Too complex a model

RBF-Kernel
• γ ≫ 1 ⇒ the radius of the area of influence of the support vectors only includes the
support vector itself, and no amount of regularization with C will be
able to prevent overfitting.
• γ ≪ 1 ⇒ the model is too simple and cannot capture the complexity of the target
function c. The region of influence of any selected support vector
would include the whole training set.
• γ = intermediate values ⇒ good models can be found on a diagonal of
C and γ.
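A sketch of searching that C-γ diagonal with a small cross-validated grid search, assuming scikit-learn; the dataset and grid values are illustrative.

```python
# Grid-search C and gamma for an RBF SVM; the best parameters land away from the extremes.
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
param_grid = {"C": [0.01, 1, 100], "gamma": [0.01, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
# Very large gamma overfits regardless of C; very small gamma underfits;
# intermediate values on the C-gamma diagonal give the best cross-validated accuracy.
```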

RBF kernel
γ ≪ 1: too simple a model (the region of influence of any support vector includes the whole training set)
γ ≫ 1: too complex a model (decreasing radius of influence of the support vectors)
Decreasing C increases the margin.

[Figure: cross-validation accuracy as a function of C and γ; good models lie on a diagonal of C and γ]

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

5- Multiclass SVM

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research, 2, 2002.
Multiclass SVM
• Let X denote the input space and Y denote the output space, and let D be an unknown
distribution over X according to which input points are drawn. We will distinguish
between two cases:
• the mono-label case, where Y is a finite set of k classes that we mark with numbers for convenience,
$Y = \{1, \dots, k\}$. Learning: given a dataset $S = ((x_1, y_1), \dots, (x_m, y_m))$
• the multi-label case, where $Y = \{-1, +1\}^k$
• In the mono-label case, each example is labeled with a single class, while in the multi-
label case it can be labeled with several. Text documents can be labeled with
several different relevant topics, e.g., sports, business, and society. The positive
components of a vector in $\{-1, +1\}^k$ indicate the classes associated with an example.

Multi-class SVM, Mono-label case
• In a risk minimization framework
• Each label $l$ has a different weight vector $w_l$

• Learning (Training): maximizing the multiclass margin
• Equivalently, minimizing the total norm of the weight vectors such that the true
label is scored at least 1 more than the second-best one
• Training results in $(w_1, b_1), \dots, (w_k, b_k)$
• Testing (Inference): select the label with the highest score, $h(x) = \arg\max_{l \in \{1,\dots,k\}}\big(w_l \cdot x + b_l\big)$

Multiclass Margin loss
• Suppose a 5-class task.
• For a pattern x the scores are $s_l = w_l^\top x + b_l$, $l = 1, \dots, 5$.
• The margin loss $\max\big(0,\ 1 - (s_y - \max_{l \ne y} s_l)\big)$ is shown for 3 different possibilities:

[Figure: three bar charts of the 5 class scores.
(a) the true label scores at least 1 more than every other label: margin loss = 0;
(b) the gap between the true label and the runner-up is 0.3 (e.g. 3.1 vs 2.8): margin loss = 0.7;
(c) the true label scores 0.6 below the best other label: margin loss = 1.6]
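The multiclass margin loss used above, $\max(0, 1 - (s_y - \max_{l \ne y} s_l))$, as a small NumPy function; the score vector in the usage lines is hypothetical, not the one in the figure.

```python
# Multiclass margin (hinge) loss for one example.
import numpy as np

def multiclass_margin_loss(scores, y):
    others = np.delete(scores, y)                    # scores of all labels except the true one
    return max(0.0, 1.0 - (scores[y] - others.max()))

scores = np.array([1.0, 2.8, 3.1, 0.5, 2.2])          # hypothetical scores s_l for 5 classes
print(multiclass_margin_loss(scores, y=2))            # true label 3.1, runner-up 2.8 -> loss 0.7
print(multiclass_margin_loss(scores, y=4))            # true label 2.2, best other 3.1 -> loss 1.9
```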

Linear Hard SVM (no empirical error)

• Recall the hard binary linear SVM:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)  (regularizer)
subject to (score constraint): $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• Single-task hard multiclass linear SVM:

$\min_{(w_1; b_1), \dots, (w_k; b_k)}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2$  (regularizer)

subject to: $s_{y_i} - s_l \ge 1 \ \equiv\ 1 - (s_{y_i} - s_l) \le 0$ for all $i \in [m]$ and $l \ne y_i$
(the score for the true label is higher than the score for any other label by at least 1)

Linear Soft SVM
• Recall the soft binary linear SVM:

$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$
subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\ i \in [m]$
(relaxed score constraints; non-negativity constraint on the slack variables)

• Single-task soft multiclass linear SVM:

$\min_{(w_1; b_1), \dots, (w_k; b_k),\, \boldsymbol{\xi}}\ \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2 + C\sum_{i=1}^{m}\xi_i^p$

subject to: $s_{y_i} - s_l \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0$ for all $i \in [m]$ and $l \ne y_i$

Lagrangian of the optimization problem
• To solve the optimization problem we use the Karush-Kuhn-Tucker theorem. We
add a dual set of variables, one for each constraint, and get the Lagrangian of the
optimization problem.
• Recall the single-task soft binary linear SVM Lagrangian:

$\mathcal{L}(w, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big(y_i(w \cdot x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{m}\beta_i\xi_i$   (5.25)

• Single-task soft multiclass linear SVM:

$\mathcal{L}\big(\{w_l, b_l\}_{l=1}^{k}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}\big) = \frac{1}{2}\sum_{l=1}^{k}\|w_l\|_2^2 + C\sum_{i=1}^{m}\xi_i^p - \sum_{i=1}^{m}\sum_{l=1}^{k}\alpha_{i,l}\big[s_{y_i} - s_l - 1 + \xi_i\big] - \sum_{i=1}^{m}\beta_i\xi_i$

subject to: $\alpha_{i,l} \ge 0,\ \beta_i \ge 0$

Dual Problem
• Recall the binary dual (5.32):

$\mathcal{L}_{dual}(\boldsymbol{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$   (5.32)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^{m}\alpha_i y_i = 0$

• We can rewrite the multiclass dual program in the following vector form:

$\max_{A}\ \mathcal{L}_{dual} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$ for all $i \in [m]$

• where $A_i = 1_{y_i} - \alpha_i$ and $\alpha_i = (\alpha_{i,1}, \dots, \alpha_{i,k})$
• Let $1_y$ be the vector whose components are all zero except for the y-th component, which is equal to 1
• Let $1$ be the vector whose components are all 1.

Dual Problem.

$\max_{A}\ \mathcal{L}_{dual} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$

$\alpha_i = \{\alpha_{i,1},\ \alpha_{i,2},\ \dots,\ \alpha_{i,k}\}$
$1_{y_i} = [0\ \dots\ 0\ 1\ 0\ \dots\ 0] \ \Rightarrow\ A_i = 1_{y_i} - \alpha_i = [\,-\alpha_{i,1},\ -\alpha_{i,2},\ \dots,\ 1 - \alpha_{i,y_i},\ \dots,\ -\alpha_{i,k}\,]$

$A_i \cdot 1_{y_i} = 1 - \alpha_{i,y_i}$
$A_i \cdot 1 = 1 - \sum_{l=1}^{k}\alpha_{i,l} = 0$

Applying a Kernel function K(·,·)

Recall the binary case: $h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b\Big)$

• Replacing the inner products with a kernel function K(·,·) that satisfies
Mercer's conditions, the general dual program using kernel functions
is therefore:

$\max_{A}\ \mathcal{L} = \sum_{i=1}^{m} A_i \cdot 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^{m}(A_i \cdot A_j)\,K(x_i, x_j)$

subject to: $A_i \le 1_{y_i}$ and $A_i \cdot 1 = 0$

• The classification function becomes:

$h(x) = \arg\max_{l=1,\dots,k}\Big\{\sum_{i=1}^{m} A_{i,l}\,K(x, x_i) + b_l\Big\}$

Support Vectors
• The first sum is over all patterns that belong to class $l$. Hence, an example
labeled $y_i = l$ is a support pattern only if $1 - \alpha_{i,l} \ne 0$
• The second sum is over the rest of the patterns, whose labels are different from $l$.
In this case, an example is a support pattern only if $\alpha_{i,l} > 0$

$w_l = \beta\Big[\sum_{\substack{i=1 \\ y_i = l}}^{m}(1 - \alpha_{i,l})\,\Phi(x_i) + \sum_{\substack{i=1 \\ y_i \ne l}}^{m}(-\alpha_{i,l})\,\Phi(x_i)\Big]$

Probabilistic interpretation of the vector α_i
• For each pattern (example) $x_i$, the vector $\alpha_i$ satisfies the constraints

$\alpha_{i,l} \ge 0 \ \wedge\ \sum_{l=1}^{k}\alpha_{i,l} = 1$

• Each set $\alpha_i = (\alpha_{i,1}, \dots, \alpha_{i,k})$ can therefore be viewed as a probability distribution over the labels
• $x_i$ is a support pattern if and only if its corresponding distribution is not
concentrated on the correct label $y_i$. That is: $\alpha_{i,y_i} < 1$ for the correct label and $\alpha_{i,l} > 0$ for some $l \ne y_i$
• Therefore, the classifier is constructed using the patterns whose labels are uncertain;
the rest of the input patterns are ignored.

Example
• Suppose k = 5 classes and consider the distribution $\alpha_i$ of each training example:

• An example whose distribution is concentrated on its correct label ($\alpha_{i,y_i} = 1$) does not support the solution
• An example whose distribution is not concentrated on its correct label supports the solution; the mapped example $\Phi(x_i)$ enters each $w_l$ scaled by its coefficient $\alpha_{i,l}$

Quadratic Programming
• Both the primal and dual problems are simple QPs generalizing those
of the standard SVM algorithm.
• However, the size of the solution and the number of constraints for both
problems is on the order of mk, which, for a large number of classes k, can make them
difficult to solve.
• However, there exist specific optimization solutions designed for this
problem, based on a decomposition of the problem into disjoint sets
of constraints.
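For reference, the Crammer-Singer formulation can be trained with scikit-learn's LinearSVC via multi_class='crammer_singer'; a minimal sketch, with an illustrative dataset and hyperparameters.

```python
# Train the Crammer-Singer multiclass SVM with scikit-learn's LinearSVC.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)
print(clf.coef_.shape)        # one weight vector w_l per class: (k, N)
print(clf.score(X, y))        # training accuracy
```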

Concluding Remarks
• Generalizes the binary SVM algorithm
• If we have only two classes, this reduces to the binary SVM (up to scale)

• Comes with generalization guarantees similar to those of the binary SVM

• Can be trained using different optimization methods
• Stochastic sub-gradient descent can be generalized

Generalization bound
• In multi-class classification, a kernel-based hypothesis is based on a matrix W of
k prototypes; the vector $w_l$ is the l-th row of W.
• Each weight vector $w_l$ defines a scoring function $x \mapsto w_l \cdot \Phi(x)$
• A family of kernel-based hypotheses we will consider is

$H_K = \big\{\, (x, y) \in X \times \{1, \dots, k\} \mapsto w_y \cdot \Phi(x) \ :\ W = (w_1, \dots, w_k)^\top,\ \|W\|_2^2 \le \Lambda^2 \,\big\}$

• in which $\|W\|_2^2 = \sum_{l=1}^{k}\|w_l\|_2^2$

Generalization bound
• Assume that there exists $r > 0$ such that $K(x, x) \le r^2$ for all $x \in X$
• For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H_K$:

$R(h) \ \le\ \frac{1}{m}\sum_{i=1}^{m}\xi_i + 4k\,\frac{r\Lambda}{\sqrt{m}} + \sqrt{\frac{\log 1/\delta}{2m}}$   (9.12)

$h \in H_K = \Big\{\,(x, y) \mapsto w_y \cdot \Phi(x)\ :\ \sum_{l=1}^{k}\|w_l\|_2^2 \le \Lambda^2 \,\Big\}$

• where $\xi_i = \max\big(0,\ 1 - (w_{y_i}\cdot\Phi(x_i) - \max_{l \ne y_i} w_l\cdot\Phi(x_i))\big)$ for all $i \in [m]$

Appendix
SVM solvers
SVM solvers, Exact SVM solvers
• LIBSVM
• LIBLINEAR
• liquidSVM
• Pegasos
• LASVM
• SVMLight

SVM solvers, Hierarchical solvers
• ThunderSVM
• cuML SVM
• LPSVM

SVM solvers, Approximate SVM solvers
• DC-SVM
• EnsembleSVM
• BudgetedSVM

SVM solvers run on GPU
• GTSVM
• OHD-SVM

SVM solvers, Multiclass
• Crammer-Singer SVM
• MSVMpack
• BSVM
• LaRank
• GaLa

