November 2018
Boucetta Imad
Contents

1 Introduction
2 Mathematical background
  2.1 An optimization problem
  2.2 Misclassification tolerance
  2.3 Dual problem
  2.4 Kernel trick
4 Results
  4.0.1 Data using hard margin separator
  4.0.2 Data using soft margin separator
  4.0.3 Data using polynomial kernel
  4.0.4 Data using Gaussian kernel
5 Summary
1 Introduction
Support vector machines (SVMs) are a family of algorithms that are among the best supervised machine-learning algorithms for classification.
We will discuss some of their variants, apply them to 4 different types of randomly generated data, and work on finding the best results (i.e. the parameters most suitable to each data set).
Each data set is randomly generated and is composed of 50 points (2 features each) to train our separator and 300 points (2 features each) to test our decision boundary.
The whole work is carried out in MATLAB.
2 Mathematical background
To separate the data we look for a hyperplane (i.e. a decision boundary, e.g. a line for data with 2 features or a plane for data with 3 features, etc.) described by equation (1),

$$\langle w, x\rangle + b = 0, \tag{1}$$

where $w$ is the hyperplane's normal and $b$ its bias.
As there is an infinity of hyperplanes that can separate the data (perfectly or not), we look for a separator that not only separates the data but also maximizes the margin (i.e. a hyperplane whose distance to the closest points is maximized). As the data is separated, each point satisfies

$$\langle w, x_i\rangle + b \ge +1 \quad \text{for } y_i = +1, \tag{2}$$
$$\langle w, x_i\rangle + b \le -1 \quad \text{for } y_i = -1, \tag{3}$$

where $y_i$ is the label of point $i$ (i.e. it belongs to class 1 ($y_i = +1$) or class 2 ($y_i = -1$)). Equations (2) and (3) can be combined into the single equation (4):

$$y_i(\langle w, x_i\rangle + b) \ge 1. \tag{4}$$
The points that sit on the margin (i.e. $\langle w, x_i\rangle + b = \pm 1$) are called support vectors, because they are the ones that actually orient the hyperplane, being the closest points to it.
As a consequence their distance to the hyperplane is the same and equal to $\frac{1}{\|w\|}$ (this distance is easily derived), so the width of the margin is $\frac{2}{\|w\|}$.
To maximize the margin we can reformulate the problem and minimize $\|w\|$, which is the same as minimizing $\frac{1}{2}\|w\|^2$ (quadratic optimization turns out to be easier).
Thus the optimization problem is:

$$\min_{w}\ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(\langle w, x_i\rangle + b) \ge 1,\ i = 1, \dots, n. \tag{5}$$
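The claim above that a support vector lies at distance $1/\|w\|$ from the hyperplane follows in one line from the point-to-hyperplane distance formula:

```latex
% For any point x_0, the distance to the hyperplane <w, x> + b = 0 is
d(x_0) = \frac{|\langle w, x_0\rangle + b|}{\|w\|}.
% A support vector x_s satisfies y_s(\langle w, x_s\rangle + b) = 1,
% hence |\langle w, x_s\rangle + b| = 1, and therefore
d(x_s) = \frac{1}{\|w\|},
\qquad \text{margin width} = \frac{2}{\|w\|}.
```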
2.3 Dual problem
The constrained optimization can be solved using Lagrange multipliers:

$$\max_{\alpha}\ \min_{w,b}\ \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (\langle w, x_i\rangle + b) - 1 \right] \quad \text{subject to } \alpha_i \ge 0,\ i = 1, \dots, n. \tag{7}$$

Setting the derivatives with respect to $w$ and $b$ to zero gives

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \tag{8}$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0. \tag{9}$$
Figure 4: Data not linearly separable
2.4 Kernel trick
The importance of the kernel trick is that we can compute the kernel without computing (or even knowing) the transformation function.
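For example, with 2-feature points (as in our data), the degree-2 polynomial kernel corresponds to an explicit 6-dimensional feature map $\phi$ that we never need to evaluate:

```latex
k(x, z) = (1 + \langle x, z\rangle)^2 = \langle \phi(x), \phi(z)\rangle,
\qquad
\phi(x) = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\bigr).
```

Expanding $(1 + x_1 z_1 + x_2 z_2)^2$ term by term reproduces $\langle \phi(x), \phi(z)\rangle$ exactly, so evaluating the kernel costs a dot product in 2 dimensions instead of 6.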
Finally:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j\, k(x_i, x_j) \quad \text{subject to } \alpha_i \ge 0,\ i = 1, \dots, n. \tag{13}$$
MATLAB's quadprog solves quadratic programs of the form

$$\min_{x}\ \tfrac{1}{2}x^\top H x + f^\top x \quad \text{subject to } A x \le b,\ A_{eq} x = b_{eq},\ lb \le x \le ub,$$

where H, A, and Aeq are matrices, and f, b, beq, lb, ub, and x are vectors.
To solve our problem we map it onto the solver's parameters, where (in our parameter setup) x is the data matrix, y the data labels, and n the number of training points. Finally we call the solver quadprog and compute w and b.
[alphas,fval,ef,op,ld]=quadprog(H,f,A,b,Aeq,beq,lb,ub);
w=x.'*(alphas.*y);
[maxAlphas,imaxAlphas]=max(alphas);
b=y(imaxAlphas)-w.'*x(imaxAlphas,:).';
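As a cross-check of these four lines, here is a dependency-free Python sketch of the same recovery of (w, b) from the dual solution (the function and variable names are illustrative, not part of the report's MATLAB code):

```python
def recover_w_b(X, y, alpha):
    # w = X' * (alpha .* y), as in the MATLAB line above.
    d = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(d)]
    # Pick the point with the largest alpha (a support vector) and use
    # y_s = <w, x_s> + b, i.e. b = y_s - <w, x_s>.
    s = max(range(len(alpha)), key=lambda i: alpha[i])
    b = y[s] - sum(w[k] * X[s][k] for k in range(d))
    return w, b

# Toy 2-point problem whose dual optimum is alpha = [0.5, 0.5]:
w, b = recover_w_b([[1.0, 0.0], [-1.0, 0.0]], [1.0, -1.0], [0.5, 0.5])
# w == [1.0, 0.0], b == 0.0: the separator is the vertical axis, margin width 2
```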
3.2 Non-linearly separable data
For non-linearly separable data we have to use other approaches.
For problems where misclassification is tolerated, we can use equation (11); the solver's parameters are therefore:
C=Tradeoffvalue;
H=(x*x.').*(y*y.');
f=-ones(n,1);
A=[];
b=[];
lb=zeros(n,1);
ub=C*ones(n,1);
Aeq=y.';
beq=0;
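For reference (the numbered equation is not reproduced in this copy), the soft-margin dual that these parameters encode is the standard one, presumably equation (11) in the original numbering. Note how `lb` and `ub` implement the box constraint, `Aeq`/`beq` the equality constraint, and `f=-ones(n,1)` the sign flip needed because quadprog minimizes:

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
 - \frac{1}{2}\sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle
\quad \text{subject to } 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0.
```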
To apply the kernel trick we first have to create the kernel functions; in our case we created the polynomial kernel:
function [Z] = kernelPol(X1,X2,q)
[xxx1,yyy1]=meshgrid(X1(:,1),X2(:,1));
[xxx2,yyy2]=meshgrid(X1(:,2),X2(:,2));
Z=(1+xxx1.*yyy1+xxx2.*yyy2).^q;
end
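The same Gram-matrix computation in a dependency-free Python sketch (kernel_pol is a hypothetical name mirroring kernelPol), which makes the meshgrid orientation explicit:

```python
def kernel_pol(X1, X2, q):
    # Z[i][j] = (1 + <X2[i], X1[j]>)^q: rows range over X2 and columns
    # over X1, exactly as the meshgrid construction above lays them out.
    return [[(1 + sum(a * b for a, b in zip(x2, x1))) ** q
             for x1 in X1]
            for x2 in X2]

Z = kernel_pol([[1, 0], [0, 1]], [[1, 1]], 2)
# Z == [[4, 4]]: both points have <x, [1,1]> = 1, and (1 + 1)^2 = 4
```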
And the radial basis function kernel:

$$k(x_1, x_2) = e^{-\frac{\|x_1 - x_2\|^2}{2\sigma^2}}$$
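A dependency-free Python sketch of the same RBF Gram-matrix computation (kernel_rbf is an illustrative name; the report's MATLAB version is kernel_RBF):

```python
import math

def kernel_rbf(X1, X2, sigma):
    # Z[i][j] = exp(-||X1[j] - X2[i]||^2 / (2 sigma^2)):
    # rows range over X2, columns over X1, as in the polynomial kernel above.
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x1, x2)) / (2 * sigma ** 2))
             for x1 in X1]
            for x2 in X2]

Z = kernel_rbf([[0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]], 1.0)
# Z[0][0] == 1.0 (zero distance); Z[1][0] == exp(-0.5)
```

With a small σ the off-diagonal entries decay toward 0 quickly, which is why a small σ makes the boundary depend only on nearby points.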
4 Results
We have 4 types of data distributions.
Figure 7: Training data
As expected, there is no misclassified data, and therefore the linear decision boundary is suitable for our data.
4.0.2 Data using soft margin separator
In this section we will use the code mentioned in section 3.2.1 and act on the variable C to get the smallest possible misclassification percentage.
Figure 9: Training and test data separated
Figure 11: Training and test data separated
* The trade-off parameter C represents how much we want to stick to our data. A small C means we want to enlarge the margins by sacrificing training-data separation; on the other hand, a large C means we want to stick to our training data.
* Even though the data is not linearly separable, by introducing a trade-off variable to the optimizer we obtained reasonable results, with separators featuring a low misclassification percentage.
4.0.3 Data using polynomial kernel
In this section we will include the polynomial kernel using the code mentioned in section 3.2.2. We will act on the polynomial's degree to find the most suitable separator for our data.
Figure 15: Test data separated by our decision boundary
4.0.4 Data using Gaussian kernel
In this section we will include the radial basis kernel using the code mentioned in section 3.2.3. We will act on the RBF dispersion σ and the trade-off value C to find the most suitable separator for our data.
Figure 17: Training data
We can use a loop to find the separator for different values of σ and C.
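A minimal sketch of that loop, assuming a hypothetical train_and_score(C, sigma) helper that trains the SVM and returns the test misclassification rate (the report's MATLAB loop calls quadprog inside a try/catch instead):

```python
def grid_search(train_and_score, Cs, sigmas):
    # Evaluate every (C, sigma) pair and keep the lowest-error one.
    best = None
    for C in Cs:
        for sigma in sigmas:
            err = train_and_score(C, sigma)  # misclassification rate in [0, 1]
            if best is None or err < best[0]:
                best = (err, C, sigma)
    return best

# Toy stand-in for the real trainer, with its minimum at C = 10, sigma = 0.1:
score = lambda C, sigma: abs(C - 10) / 100 + abs(sigma - 0.1)
best_err, best_C, best_sigma = grid_search(score, [0.1, 1, 10, 100], [0.01, 0.1, 1])
# best_C == 10, best_sigma == 0.1
```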
Figure 19: Training and test data separated
Figure 21: Training and test data separated
* We can extract the misclassification-percentage results by changing C and σ over a search grid (i.e. fix C and vary σ, and vice versa) to find the most suitable parameters for our data distribution.
* The SVM optimization depends on the dot product ⟨xi, xj⟩ of the points; in the RBF kernel, however, this dot product is replaced by the distance between points ‖xi − xj‖, and σ acts as a scale for this distance: if σ is small, only very close points define the decision boundary, while a large σ also takes relations between widely spaced points into consideration.
The RBF kernel is especially useful for separating data where one class encircles another.
Figure 23: Test data separated by our decision boundary
5 Summary
* Support vector machines are known for their performance.
* The trade-off parameter C represents how much separation we are willing to sacrifice for a bigger margin.
* Kernels are useful for generating a decision boundary for non-linearly separable data.
* The σ in the RBF kernel determines the scale at which we define our decision boundary: whether we want it driven only by very close points, or we want to create larger decision boundaries.
A Hard margin MATLAB code
na=25;
nt=150;
[ax1,ay1,tx1,ty1]=Gen_data(1,na,nt);
figure
h=plot_data(ax1,ay1);
legend(h,'class 1','class 2')
tyf=tx1*w+b;
Err=1-sum(sign(tyf)==sign(ty1))/length(ty1)
hold on
fplot(@(x) -(w(1)*x+b)/w(2),'k');
fplot(@(x) -(w(1)*x+b-1)/w(2),'k--');
fplot(@(x) -(w(1)*x+b+1)/w(2),'k--');
legend(h,'class 1','class 2')
figure
h=plot_data(tx1,ty1);
hold on
fplot(@(x) -(w(1)*x+b)/w(2),'k');
fplot(@(x) -(w(1)*x+b-1)/w(2),'k--');
fplot(@(x) -(w(1)*x+b+1)/w(2),'k--');
legend(h,'class 1','class 2')
annotation('textbox',...
    [0.15 0.85 0.29 0.05],...
    'String',{['Misclassified %= ' num2str(round(Err*100,2)) '%']},...
    'BackgroundColor',[1 1 1]);
title('Linear separator')
hold off
na=25;
nt=150;
[ax1,ay1,tx1,ty1]=Gen_data(2,na,nt);
figure
h=plot_data(ax1,ay1);
legend(h,'class 1','class 2')
C=40;
for C=[0.00001 0.0001 0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000]
try
[alphas,fval2,exitflag2,output2,lambda2]=quadprog(H,f,A,b,Aeq,beq,lb,ub);
w=ax1.'*(alphas.*ay1);
tyf=tx1*w+b;
ayf=ax1*w+b;
Err=1-sum(sign(tyf)==sign(ty1))/length(ty1);
Errx=1-sum(sign(ayf)==sign(ay1))/length(ay1);
figure
subplot(1,2,1)
h=plot_data(ax1,ay1);
hold on
fplot(@(x) -(w(1)*x+b)/w(2),'k');
fplot(@(x) -(w(1)*x+b-1)/w(2),'k--');
fplot(@(x) -(w(1)*x+b+1)/w(2),'k--');
legend(h,'class 1','class 2')
annotation('textbox',...
    [0.15 0.85 0.12 0.05],...
    'String',{['Misclassified %= ' num2str(round(Errx*100,2)) '%']},...
    'BackgroundColor',[1 1 1]);
title(['Training data separation for C = ' num2str(C)])
hold off
subplot(1,2,2)
h=plot_data(tx1,ty1);
hold on
fplot(@(x) -(w(1)*x+b)/w(2),'k');
fplot(@(x) -(w(1)*x+b-1)/w(2),'k--');
fplot(@(x) -(w(1)*x+b+1)/w(2),'k--');
legend(h,'class 1','class 2')
annotation('textbox',...
    [0.59 0.85 0.12 0.05],...
    'String',{['Misclassified %= ' num2str(round(Err*100,2)) '%']},...
    'BackgroundColor',[1 1 1]);
title(['Test data separation for C = ' num2str(C)])
set(gcf,'units','points','position',[0,0,width,height])
hold off
catch
end
end
na=25;
nt=150;
[ax1,ay1,tx1,ty1]=Gen_data(4,na,nt);
figure
h=plot_data(ax1,ay1);
legend(h,'class 1','class 2')
C=10;
sigma=10;
for C=[0.00001 0.0001 0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000]
for sigma=[0.00001 0.0001 0.001 0.01 0.1 1 10 100 1000 10000 100000 1000000]
try
[alphas,fval,exitflag,output,lambda2]=quadprog(H,f,A,b2,Aeq,beq,lb,ub);
eps=1e-6;
svs=find(alphas>eps & alphas<C-eps);
b=ay1(svs(1))-kernel_RBF(ax1,ax1(svs(1),:),sigma)*(ay1.*alphas);
figure
subplot(1,2,1)
h=plot_data(ax1,ay1);
hold on
mn=min(ax1);
mx=max(ax1);
Xc=linspace(min(ax1(:,1)),max(ax1(:,1)),100);
Yc=linspace(min(ax1(:,2)),max(ax1(:,2)),100);
[Xc,Yc]=meshgrid(Xc,Yc);
Xc=Xc(:);
Yc=Yc(:);
Zc=kernel_RBF(ax1,[Xc,Yc],sigma)*(ay1.*alphas)+b;
n=ceil(sqrt(numel(Zc)));
Zcf=zeros(n);
Xcf=zeros(n);
Ycf=zeros(n);
Zcf(1:numel(Zcf))=Zc(:);
Ycf(1:numel(Zcf))=Yc(:);
Xcf(1:numel(Zcf))=Xc(:);
contour(Xcf,Ycf,Zcf,[0 0],'k')
legend(h,'class 1','class 2')
annotation('textbox',...
    [0.15 0.85 0.12 0.05],...
    'String',{['Misclassified %= ' num2str(round(Errx*100,2)) '%']},...
    'BackgroundColor',[1 1 1]);
title(['Training data separation by RBF kernel for \sigma = ' num2str(sigma) ...
    ', C = ' num2str(C)])
hold off
subplot(1,2,2)
h=plot_data(tx1,ty1);
hold on
mn=min(ax1);
mx=max(ax1);
Xc=linspace(min(tx1(:,1)),max(tx1(:,1)),100);
Yc=linspace(min(tx1(:,2)),max(tx1(:,2)),100);
[Xc,Yc]=meshgrid(Xc,Yc);
Xc=Xc(:);
Yc=Yc(:);
Zc=kernel_RBF(ax1,[Xc,Yc],sigma)*(ay1.*alphas)+b;
n=ceil(sqrt(numel(Zc)));
Zcf=zeros(n);
Xcf=zeros(n);
Ycf=zeros(n);
Zcf(1:numel(Zcf))=Zc(:);
Ycf(1:numel(Zcf))=Yc(:);
Xcf(1:numel(Zcf))=Xc(:);
contour(Xcf,Ycf,Zcf,[1 -1],'k--')
hold on
contour(Xcf,Ycf,Zcf,[0 0],'k')
legend(h,'class 1','class 2')
annotation('textbox',...
    [0.59 0.85 0.12 0.05],...
    'String',{['Misclassified %= ' num2str(round(Err*100,2)) '%']},...
    'BackgroundColor',[1 1 1]);
title(['Test data separation by RBF kernel for \sigma = ' num2str(sigma) ...
    ', C = ' num2str(C)])
hold off
catch
end
end
end