Support Vector Machine
Debasis Samanta
IIT Kharagpur
dsamanta@iitkgp.ac.in
Autumn 2018
Linear SVM
w0 + w1x1 + w2x2 = 0 [e.g., ax + by + c = 0]
where w0, w1, and w2 are some constants defining the slope and intercept of the line.
w0 + w1 x1 + w2 x2 > 0 (1)
w0 + w1 x1 + w2 x2 < 0 (2)
An SVM hyperplane is an n-dimensional generalization of a
straight line in 2-D.
w1x1 + w2x2 + · · · + wmxm = b    (3)
where the wi's are real numbers and b is a real constant (called the intercept, which can be positive or negative).
W.X + b = 0    (4)
where W = [w1, w2, ..., wm] and X = [x1, x2, ..., xm] and b is a real constant.
Y = +  if W.X + b > 0
Y = −  if W.X + b < 0
W .x1 + b = 1 (7)
W .x2 + b = −1 (8)
or
W .(x1 − x2 ) = 2 (9)
This represents a dot (.) product of the two vectors W and (x1 − x2). Thus, taking the magnitudes of these vectors, the equation obtained is
d = 2 / ||W||    (10)
where ||W|| = √(w1² + w2² + · · · + wm²) in an m-dimensional space.
H1 : w1 x1 + w2 x2 − b1 = 0 (11)
H2 : w1 x1 + w2 x2 − b2 = 0 (12)
To obtain the perpendicular distance d between H1 and H2, we draw a right-angled triangle ABC as shown in Fig. 5.
[Figure 5: Right-angled triangle ABC between the hyperplanes H1: w1x1 + w2x2 − b1 = 0 and H2: w1x1 + w2x2 − b2 = 0, with A = (b1/w1, 0) and B = (b2/w1, 0) on the X1 axis, and AC = d, the perpendicular distance between H1 and H2.]
Being parallel, the slope of H1 (and H2) is tanθ = −w1/w2.
In triangle ABC, AB is the hypotenuse and AC is the perpendicular
distance between H1 and H2 .
Thus, sin(180° − θ) = AC/AB, or AC = AB · sinθ.
AB = b2/w1 − b1/w1 = |b2 − b1|/w1,   sinθ = w1/√(w1² + w2²)
(since tanθ = −w1/w2).
Hence, AC = |b2 − b1| / √(w1² + w2²).
d = |b2 − b1| / √(w1² + w2² + · · · + wn²) ≃ |b2 − b1| / ||W||
where ||W|| = √(w1² + w2² + · · · + wn²).
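As a quick numerical check of the distance formula above, the following Python sketch (with arbitrarily chosen illustrative coefficients, not taken from the slides) computes d = |b2 − b1| / ||W|| for two parallel hyperplanes.

```python
import numpy as np

# Hypothetical coefficients of two parallel hyperplanes (illustration only):
# H1: w1*x1 + w2*x2 - b1 = 0 and H2: w1*x1 + w2*x2 - b2 = 0
w = np.array([3.0, 4.0])   # common weight vector W, so ||W|| = 5
b1, b2 = 2.0, 12.0

# Perpendicular distance d = |b2 - b1| / ||W||, as derived above
d = abs(b2 - b1) / np.linalg.norm(w)
print(d)   # 2.0
```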
W .xi + b ≥ 1 if yi = 1 (13)
W .xi + b ≤ −1 if yi = −1 (14)
These conditions impose the requirements that all training tuples
from class Y = + must be located on or above the hyperplane
W .x + b = 1, while those instances from class Y = − must be
located on or below the hyperplane W .x + b = −1 (also see
Fig. 4).
µ′(W, b) = ||W|| / 2    (16)
In a nutshell, the learning task in SVM can be formulated as the following constrained optimization problem:
minimize µ′(W, b)
subject to yi(W.xi + b) ≥ 1, i = 1, 2, 3, ..., n    (17)
∂L/∂xi = 0, i = 1, 2, ..., d
∂L/∂λi = 0, i = 1, 2, ..., p
3 Solve the (d + p) equations to find the optimal value of X = [x1, x2, ..., xd] and the λi's.
Lagrange Multiplier Method
∂L/∂xi = 0, i = 1, 2, ..., d
λi ≥ 0, i = 1, 2, ..., p
hi(x) ≤ 0, i = 1, 2, ..., p
Solving the above equations, we can find the optimal value of f (x).
∂L/∂x = 2(x − 1) + λ1 + λ2 = 0
∂L/∂y = 2(y − 3) + λ1 − λ2 = 0
λ1(x + y − 2) = 0
λ2(x − y) = 0
λ1 ≥ 0, λ2 ≥ 0
(x + y) ≤ 2, y ≥ x
Case 1: λ1 = 0, λ2 = 0: 2(x − 1) = 0, 2(y − 3) = 0 ⇒ x = 1, y = 3, which violates x + y ≤ 2; it is not a feasible solution.
Case 2: λ1 = 0, λ2 ≠ 0: x − y = 0,
2(x − 1) + λ2 = 0,
2(y − 3) − λ2 = 0
⇒ x = 2, y = 2 and λ2 = −2.
Since λ2 = −2 violates λ2 ≥ 0 (and x + y = 4 violates x + y ≤ 2), this is not a feasible solution.
Case 3: λ1 ≠ 0, λ2 = 0: x + y = 2,
2(x − 1) + λ1 = 0,
2(y − 3) + λ1 = 0
⇒ x = 0, y = 2 and λ1 = 2, which satisfies all the conditions and is therefore the feasible (optimal) solution.
Case 4: λ1 ≠ 0, λ2 ≠ 0: x + y = 2,
x − y = 0,
2(x − 1) + λ1 + λ2 = 0,
2(y − 3) + λ1 − λ2 = 0
⇒ x = 1, y = 1 and λ1 = 2, λ2 = −2.
Since λ2 < 0, this is not a feasible solution.
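The case analysis above can be cross-checked numerically. The sketch below assumes the underlying problem is: minimize f(x, y) = (x − 1)² + (y − 3)² subject to x + y ≤ 2 and y ≥ x (which is what the stationarity and complementary slackness conditions above correspond to), and uses a general-purpose constrained optimizer to recover the Case 3 solution.

```python
from scipy.optimize import minimize

# Constrained problem reconstructed from the conditions above (an assumption):
# minimize f(x, y) = (x - 1)^2 + (y - 3)^2
# subject to x + y <= 2 and y >= x
objective = lambda v: (v[0] - 1.0) ** 2 + (v[1] - 3.0) ** 2
constraints = [
    {"type": "ineq", "fun": lambda v: 2.0 - (v[0] + v[1])},  # x + y <= 2
    {"type": "ineq", "fun": lambda v: v[1] - v[0]},          # y >= x
]

res = minimize(objective, x0=[0.0, 0.0], constraints=constraints)
print(res.x)   # approximately [0, 2], matching Case 3 of the analysis
```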
L = ||W||²/2 − Σ_{i=1}^{n} λi(yi(W.xi + b) − 1)    (20)
∂L/∂W = 0 ⇒ W = Σ_{i=1}^{n} λi.yi.xi
∂L/∂b = 0 ⇒ Σ_{i=1}^{n} λi.yi = 0
λi ≥ 0, i = 1, 2, ..., n
yi(W.xi + b) ≥ 1, i = 1, 2, ..., n
We first solve the above set of equations to find all the feasible
solutions.
Note:
1 Lagrangian multiplier λi must be zero unless the training instance xi
satisfies the equation yi (W .xi + b) = 1. Thus, the training tuples
with λi > 0 lie on the hyperplane margins and hence are support
vectors.
2 The training instances that do not lie on the hyperplane margin
have λi = 0.
Now, let us see how this MMH can be used to classify a test tuple
say X . This can be done as follows.
δ(X) = W.X + b = Σ_{i=1}^{n} λi.yi.(xi.X) + b    (21)
Note that
W = Σ_{i=1}^{n} λi.yi.xi
Thus, δ(X) = W.X + b = Σ_{i=1}^{n} λi.yi.(xi.X) + b
1 Once the SVM is trained with training data, the complexity of the classifier is characterized by the number of support vectors.
2 Dimensionality of the data is not an issue in SVM, unlike in other classifiers.
A1 A2 y λi
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0
Thus, the MMH is −6.64x1 − 9.32x2 + 7.93 = 0 (also see Fig. 6).
Suppose the test data is X = (0.5, 0.5). Therefore,
δ(X) = W.X + b
     = −6.64 × 0.5 − 9.32 × 0.5 + 7.93
     = −0.05,
which is negative. This implies that the test data falls on or below the MMH, and the SVM classifies X as belonging to class label −.
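The computation above can be reproduced with a few lines of Python; the sketch below simply evaluates δ(X) = W.X + b for the reported MMH coefficients and assigns the class by the sign.

```python
import numpy as np

# MMH coefficients reported above: -6.64*x1 - 9.32*x2 + 7.93 = 0
W = np.array([-6.64, -9.32])
b = 7.93

def classify(X):
    """Return the class label predicted by the MMH and the value of delta(X)."""
    delta = np.dot(W, X) + b
    return ("+" if delta > 0 else "-"), delta

print(classify(np.array([0.5, 0.5])))   # ('-', approximately -0.05): X lies below the MMH
```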
Note that the linear SVM discussed above can handle data of any dimension n ≥ 2. Now, we discuss a more generalized linear SVM to classify n-dimensional data belonging to two or more classes.
If the classes are pairwise linearly separable, then we can extend the principle of linear SVM to each pair. There are two strategies:
1 One versus one (OVO) strategy
2 One versus all (OVA) strategy
Also, see Fig. 7 for 3 classes (namely +, - and ×). Here, Hyx
denotes MMH between class labels x and y .
With the OVO strategy, we test each of the classifiers in turn and obtain δij(X), that is, the result of the MMH between the i-th and j-th classes for test data X.
If there is a class i for which δij(X) gives the same sign for all j (j ≠ i), then we can unambiguously say that X is in class i.
The OVO strategy is not useful for data with a large number of classes, as the number of classifiers (and hence the computational cost) grows quadratically with the number of classes: k(k − 1)/2 MMHs are needed for k classes.
In this approach, we choose any class, say Ci, and consider that all tuples of the other classes belong to a single class.
Let δj(X) be the test result with MMHj; X is assigned to the class whose δj(X) has the maximum magnitude.
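As an illustration of the two strategies, the sketch below (assuming scikit-learn is available; it is not part of the slides) wraps a linear SVM in one-versus-one and one-versus-all (one-versus-rest) meta-classifiers for a 3-class dataset.

```python
# A minimal sketch (assuming scikit-learn) of the two multi-class strategies
# built around a linear SVM base classifier.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                 # 3-class data, as in Fig. 7

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # k(k-1)/2 pairwise MMHs
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)  # k one-versus-all MMHs

print(ovo.predict(X[:3]), ova.predict(X[:3]))
```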
Note:
The linear SVM used to classify multi-class data fails if the classes are not all linearly separable.
Such tuples may be due to noise or errors, or the data may simply not be linearly separable.
Figure 10: Problem with linear SVM for linearly not separable data.
minimize ||W||²/2    (25)
subject to yi.(W.xi + b) ≥ 1, i = 1, 2, ..., n
In soft margin SVM, we consider a similar optimization technique, except that the inequalities are relaxed so that the formulation also covers the case of data that are not linearly separable.
subject to (W.xi + b) ≥ 1 − ξi, if yi = +1
(W.xi + b) ≤ −1 + ξi, if yi = −1    (26)
where ξi ≥ 0 for all i.
Thus, in soft margin SVM, we have to calculate W, b and the ξi's as a solution to learn the SVM.
Figure 12: MMH with wide margin and large training error.
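In practice, the trade-off between margin width and training error is controlled by the constant c of the soft margin formulation. The sketch below (assuming scikit-learn, whose parameter C plays the role of c; the data are synthetic and illustrative only) contrasts a small and a large value of C.

```python
# A minimal sketch (assuming scikit-learn) of soft margin SVM: the constant c
# of Eq. (28) corresponds to the parameter C below.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[:3] = -y[:3]                      # flip a few labels: noisy, not separable

wide_margin = SVC(kernel="linear", C=0.1).fit(X, y)    # wide margin, more slack
narrow_margin = SVC(kernel="linear", C=100).fit(X, y)  # narrow margin, less slack
print(wide_margin.n_support_, narrow_margin.n_support_)
```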
L = ||W||²/2 + c·Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} λi(yi(W.xi + b) − 1 + ξi) − Σ_{i=1}^{n} µi.ξi    (28)
λi{yi(W.xi + b) − 1 + ξi} = 0
µi.ξi = 0
Solving for Soft Margin SVM
∂L/∂ξi = c − λi − µi = 0
Note:
1 λi ≠ 0 for support vectors or for instances with ξi > 0, and
2 µi = 0 for those training data which are misclassified, that is, with ξi > 0.
A hyperplane is expressed as
linear : w1 x1 + w2 x2 + w3 x3 + c = 0 (30)
This task is indeed neither hard nor complex, and fortunately it can be accomplished by extending the formulation of linear SVM that we have already learned.
X(x1, x2, x3) = w1x1 + w2x2 + w3x3 + w4x1² + w5x1x2 + w6x1x3 + c
The 3-D input vector X(x1, x2, x3) can be mapped into a 6-D space Z(z1, z2, z3, z4, z5, z6) using the following mappings:
z1 = φ1(x) = x1
z2 = φ2(x) = x2
z3 = φ3(x) = x3
z4 = φ4(x) = x1²
z5 = φ5(x) = x1.x2
z6 = φ6(x) = x1.x3
Concept of Non-Linear Mapping
The transformed decision boundary in the 6-D space will then look like
Z: w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + c
Thus, if the Z space has input data derived from the attributes x1, x2, x3 (and hence the Z values), then we can classify them using linear decision boundaries.
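A minimal sketch of this idea (illustrative only, with a hypothetical helper named phi and synthetic data, not part of the slides) applies the explicit 3-D to 6-D mapping above and then trains an ordinary linear SVM in the Z space.

```python
# Sketch: explicitly map (x1, x2, x3) -> (z1, ..., z6) as above, then train
# a linear SVM on the transformed data (assuming scikit-learn is available).
import numpy as np
from sklearn.svm import LinearSVC

def phi(X):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([x1, x2, x3, x1**2, x1*x2, x1*x3])

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0]**2 + X[:, 0]*X[:, 1] > 0.5).astype(int)   # non-linear rule in x

clf = LinearSVC().fit(phi(X), y)      # linear decision boundary in Z space
print(clf.score(phi(X), y))
```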
...
3 Dimensionality problem: It may suffer from the curse of dimensionality often associated with high dimensional data.
Minimize ||W||²/2
Dual Formulation of Optimization Problem
Lp = ||W||²/2 − Σ_{i=1}^{n} λi(yi(W.xi + b) − 1)    (35)
∂Lp/∂W = 0 ⇒ W = Σ_{i=1}^{n} λi.yi.xi    (36)
∂Lp/∂b = 0 ⇒ Σ_{i=1}^{n} λi.yi = 0    (37)
L = Σ_{i=1}^{n} λi + (1/2) Σ_{i,j} λi.yi.λj.yj.(xi.xj) − Σ_{i=1}^{n} λi.yi Σ_{j=1}^{n} (λj.yj.xj).xi
  = Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.(xi.xj)
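For intuition, this dual can be maximized numerically for a tiny toy dataset using a general-purpose optimizer. The sketch below (illustrative only, not an efficient QP solver; the data are made up) imposes λi ≥ 0 and Σ λi.yi = 0 and then recovers W from Eq. (36).

```python
# Sketch: solve the dual numerically for a small linearly separable dataset.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, j] = y_i y_j (x_i . x_j)

neg_dual = lambda lam: 0.5 * lam @ G @ lam - lam.sum()   # minimize -L_D
cons = {"type": "eq", "fun": lambda lam: lam @ y}        # sum_i lambda_i y_i = 0
bounds = [(0, None)] * len(y)                            # lambda_i >= 0

res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=cons)
lam = res.x
W = (lam * y) @ X                  # W = sum_i lambda_i y_i x_i  (Eq. 36)
print(np.round(lam, 3), np.round(W, 3))
```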
There are key differences between the primal (Lp) and dual (LD) forms of the Lagrangian optimization problem, as follows.
...
4 The SVM classifier with the primal form is δp(x) = W.x + b with W = Σ_{i=1}^{n} λi.yi.xi, whereas the dual version of the classifier is
δD(X) = Σ_{i=1}^{m} λi.yi.(xi.x) + b
Kernel Trick
We have already covered the idea that training data which are not linearly separable can be transformed into a higher dimensional feature space, such that in the transformed space a hyperplane can be found that separates the transformed data and hence the original data (also see Fig. 16).
Clearly, the data on the left in the figure are not linearly separable.
R² ⇒ X(x1, x2)
R³ ⇒ Z(z1, z2, z3)
φ(X) ⇒ Z
z1 = x1², z2 = √2·x1x2, z3 = x2²
The hyperplane in R² is of the form
w1x1² + w2·√2·x1x2 + w3x2² = 0
w1z1 + w2z2 + w3z3 = 0
This is clearly a linear form in 3-D space. In other words, W.x + b = 0 in R² has a mapped equivalent W.z + b′ = 0 in R³.
This means that data which are not linearly separable in 2-D are separable in 3-D; that is, non-linear data can be classified by a linear SVM classifier.
The above can be generalized as follows.
Classifier:
δ(x) = Σ_{i=1}^{n} λi.yi.(xi.x) + b
δ(z) = Σ_{i=1}^{n} λi.yi.φ(xi).φ(x) + b
Learning:
Maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.(xi.xj)
Maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.φ(xi).φ(xj)
Subject to: λi ≥ 0, Σ_i λi.yi = 0
Now, the question is how to choose φ, the mapping function X ⇒ Z, so that the linear SVM can be applied directly.
Example: Correlation between Xi .Xj and φ(Xi ).φ(Xj )
Now,
φ(Xi).φ(Xj) = [xi1², √2·xi1xi2, xi2²] · [xj1², √2·xj1xj2, xj2²]ᵀ
            = xi1².xj1² + 2xi1xi2xj1xj2 + xi2².xj2²
            = ([xi1, xi2] · [xj1, xj2]ᵀ)²
            = (Xi.Xj)²
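This identity is easy to verify numerically; the short sketch below checks, for two arbitrary 2-D vectors, that the dot product in the transformed space equals (Xi.Xj)².

```python
import numpy as np

# Numerical check of the identity above for the quadratic mapping
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2); the vectors are arbitrary.
phi = lambda x: np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

Xi, Xj = np.array([0.3, 0.7]), np.array([1.2, -0.4])
lhs = phi(Xi) @ phi(Xj)       # dot product in the transformed space
rhs = (Xi @ Xj) ** 2          # kernel value K(Xi, Xj) = (Xi . Xj)^2
print(np.isclose(lhs, rhs))   # True
```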
Kernel Trick: Correlation between Xi.Xj and φ(Xi).φ(Xj)
With reference to the above example, we can conclude that φ(Xi).φ(Xj) is correlated with Xi.Xj.
In fact, the same can be proved in general, for any feature vectors
and their transformed feature vectors.
A formal proof of this is beyond the scope of this course.
More specifically, there is a correlation between dot products of
original data and transformed data.
Based on the above discussion, we can write the following
implications.
Xi.Xj ⇒ φ(Xi).φ(Xj) ⇒ K(Xi, Xj)    (41)
Here, K(Xi, Xj) denotes a function more popularly called the kernel function.
Kernel Trick : Significance
This kernel function K(Xi, Xj) physically represents the similarity in the transformed space (i.e., a non-linear similarity measure) computed using the original attributes Xi, Xj. In other words, the similarity function K computes the similarity whether the data are in the original attribute space or in the transformed attribute space.
Implicit transformation
The first and foremost significance is that we do not require any φ transformation of the original input data at all. This is evident from the following rewriting of our SVM classification problem.
Classifier: δ(X) = Σ_{i=1}^{n} λi.Yi.K(Xi, X) + b    (42)
Learning: maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.Yi.Yj.K(Xi, Xj)    (43)
Subject to λi ≥ 0 and Σ_{i=1}^{n} λi.Yi = 0    (44)
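A minimal sketch of the classifier in Eq. (42) is given below, using the quadratic kernel K(u, v) = (u.v)² as an example; the λi, yi and b values are hypothetical placeholders (in practice they come from solving (43)–(44)).

```python
import numpy as np

# Kernel-form classifier of Eq. (42): delta(X) = sum_i lambda_i * y_i * K(X_i, X) + b
K = lambda u, v: (u @ v) ** 2          # example kernel: quadratic

def delta(X, X_train, y_train, lam, b):
    return sum(l * yi * K(xi, X) for l, yi, xi in zip(lam, y_train, X_train)) + b

# Hypothetical values for illustration only (not a trained model)
X_train = np.array([[0.38, 0.47], [0.49, 0.61]])
y_train = np.array([+1.0, -1.0])
lam = np.array([65.52, 65.52])
b = 7.93

print("+" if delta(np.array([0.5, 0.5]), X_train, y_train, lam, b) > 0 else "-")
```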
Computational efficiency:
Another important significance is easy and efficient computability. Without the kernel trick, we would have to transform each input vector and then take dot products in the high dimensional space; using the kernel trick, we can do it once and with fewer dot products.
Gram matrix: K =
| X1ᵀ.X1  X1ᵀ.X2  . . .  X1ᵀ.Xn |
| X2ᵀ.X1  X2ᵀ.X2  . . .  X2ᵀ.Xn |
|  ...      ...   . . .   ...   |
| Xnᵀ.X1  Xnᵀ.X2  . . .  Xnᵀ.Xn |  (n×n)    (46)
Note that K contains the dot products among all training data, and since Xiᵀ.Xj = Xjᵀ.Xi, we in fact need to compute only half of the matrix.
More elegantly, all the dot products are obtained in a single matrix operation.
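For example, with the training vectors stacked as the rows of a matrix X, the whole Gram matrix of Eq. (46) is a single matrix product (a sketch with arbitrary illustrative values):

```python
import numpy as np

# Gram matrix of Eq. (46) obtained with one matrix product;
# X holds the n training vectors as rows (illustrative values only).
X = np.array([[0.38, 0.47],
              [0.49, 0.61],
              [0.92, 0.41]])

K = X @ X.T          # K[i, j] = X_i^T . X_j for all pairs at once
print(K)             # symmetric: K[i, j] == K[j, i]
```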
Kernel Functions
Definition 10.1:
A kernel function K(Xi, Xj) is a real-valued function defined on Rᵐ × Rᵐ such that there is another function φ: X → Z with K(Xi, Xj) = φ(Xi).φ(Xj). Symbolically, we write φ: Rᵐ → Rⁿ : K(Xi, Xj) = φ(Xi).φ(Xj), where n > m.
Kernel Functions : Example
Polynomial kernel of degree p: K(X, Y) = (XᵀY + 1)ᵖ, p > 0. It produces large dot products; the power p is specified a priori by the user.
Gaussian (RBF) kernel: K(X, Y) = e^(−||X − Y||² / 2σ²). It is a non-linear kernel called the Gaussian Radial Basis Function kernel.
Mahalanobis kernel: K(X, Y) = e^(−(X − Y)ᵀ A (X − Y)). Used when statistical properties of the data are known.
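The kernels in the table above are straightforward to implement directly; the sketch below gives one possible (illustrative) version of each, where p, σ and the matrix A are user-supplied parameters.

```python
import numpy as np

# Sketches of the kernels listed above (p, sigma, A are user choices).
def polynomial_kernel(x, y, p=2):
    return (x @ y + 1.0) ** p

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def mahalanobis_kernel(x, y, A):
    d = x - y
    return np.exp(-d @ A @ d)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), gaussian_rbf_kernel(x, y),
      mahalanobis_kernel(x, y, np.eye(2)))
```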
If K1 and K2 are kernel functions, then
1 K1 + c
2 aK1
3 aK1 + bK2
4 K1.K2
are also kernels. Here, a, b and c ∈ R⁺.
Mercer's Theorem: Properties of Kernel Functions
The kernels which satisfy Mercer's theorem are called Mercer kernels.
Additional properties of Mercer kernels are:
Symmetric: K(X, Y) = K(Y, X)
It can be proved that all the kernels listed in Table 2 satisfy (i) the kernel properties and (ii) Mercer's theorem, and hence are (iii) Mercer kernels.
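Mercer's condition can be checked empirically on a sample: the Gram matrix built from a Mercer kernel must be symmetric and positive semi-definite (all eigenvalues non-negative). The sketch below (illustrative only, using random data and the Gaussian kernel with σ = 1) performs such a check.

```python
import numpy as np

# Empirical check of Mercer's condition (a sketch, not a proof):
# the Gram matrix of a Mercer kernel is symmetric positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                     # Gaussian (RBF) Gram matrix

print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # eigenvalues non-negative
```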
Characteristics of SVM