
SUPPORT VECTOR MACHINES

Thanks for reading this article. In it, we will go through a very powerful and popular machine learning algorithm: the Support Vector Machine (SVM). We will try to understand the underlying concept on which SVM is based through very simple examples from real-world scenarios. We will also cover a few basic concepts of linear algebra, because they will help in understanding the mathematics behind SVM, which we will see in the next article.
So, by reading this article you will get a proper understanding of the following points:
 What a Support Vector Machine is and how it works.
 Before going into the graphical explanation of SVM and its underlying concept, a few important concepts of linear algebra:
o How we plot data in a space.
o How the number of features of the data (attributes or feature columns) decides the dimension of the space.
o The concept of a plane or hyperplane.
 Then we will go back to SVM and see the objective of SVM and why it is important.
 We will also explain the margin, the maximum margin hyperplane, and why it is necessary to select the maximum margin hyperplane.
 So, after reading this article you will be able to understand the concept of SVM, the objective of SVM, and the importance of that objective.

Introduction to Support Vector Machine-


A Support Vector Machine is a supervised machine learning algorithm developed by Vladimir N. Vapnik. It can be used for both classification and regression, but it is mostly used for classification problems. Today it is one of the most widely used algorithms in machine learning. Neural networks are among the most trusted and popular algorithms in machine learning, but SVM has made a dent in their popularity: using far less computational power than neural networks, it gives very reliable results on both linear and nonlinear data.
The main idea on which SVM works is that it tries to find the classifier, or decision boundary, such that the distance from the decision boundary to the nearest data points of each class is maximum. That is why it is also called a maximal margin classifier.
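As a quick preview of the algorithm in practice, here is a minimal sketch using scikit-learn's SVC with a linear kernel; the data points and labels are made up purely for illustration:

```python
# A minimal sketch of an SVM classifier with scikit-learn.
# The toy data below is made up purely for illustration.
from sklearn.svm import SVC

# Four training points with two features each, and their class labels
X_train = [[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]]
y_train = [0, 0, 1, 1]

# A linear kernel looks for the maximal margin hyperplane directly
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Predict the class of an unseen point (likely class 0 for this toy data)
print(model.predict([[3.0, 3.0]]))
```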
For people who are familiar with linear classifiers like logistic regression or neural networks, it is easy to visualize the concept of a decision boundary with maximum distance from each class, but let's discuss the entire concept from scratch. To classify any data we first need to plot it in some space. This is a general idea, and we can see many examples of it in our day-to-day life. Suppose you were given the task of separating two kinds of fruits, say oranges and mangoes, kept in a bag. You would take them out and arrange them on a table in two groups, with some appropriate distance between the groups. So when we say we plot data in a space, think of the table as the space and the fruits as the data points.
In the same way, in a Support Vector Machine each data point is plotted in an N-dimensional space, where N is the number of features. Why do we take the dimension of the space to be the same as the number of features? The reason lies in linear algebra: if we want to plot a point where X1 = 2 and X2 = 3, we draw a graph as below and place the point so that it lies at a distance of 2 along the X1 axis and 3 along the X2 axis.
[Figure: a 2D graph with axes X1 and X2 showing the point (2, 3)]
So, to find the location at which to plot a point, we need its distance along each axis. Each axis is a dimension in linear algebra, and the entire graph can be thought of as a space. We therefore need a space with dimension equal to the number of coordinates, and the coordinates are nothing but the values of the features. So, the dimension of the space equals the number of features.
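As a quick illustration in code, a short matplotlib sketch (the library choice is ours) plots the single point with X1 = 2 and X2 = 3:

```python
# A small sketch: two feature values become the two coordinates of one point.
import matplotlib.pyplot as plt

x1, x2 = 2, 3            # the feature values act as coordinates
plt.scatter([x1], [x2])  # plot the point (2, 3)
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("A point plotted from its two feature values")
plt.show()
```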
Now let's take the same example of separating fruits to understand this properly. You have the task of separating two types of fruits which are kept together in a bag. To differentiate between the two types, you would usually look at colour, shape, size, and so on. The shape, size, and colour are nothing but the features on the basis of which you can say whether a fruit is a mango or an orange. Let's put these features in a table as below, assuming that the colour code for orange = 1 and for mango = 2.
Colour | Shape | Size | Type of Fruit
   1   |   3   |   6  | Orange

So here the number of features = 3, and all the above points will be plotted in a 3-dimensional space.
The point drawn will have coordinate values which are nothing but the values of the features. So, for the row above, the coordinates will be (1, 3, 6). Now, consider that each such coordinate tuple represents a fruit, since the coordinate values belong to that fruit alone. We plot each point in N-dimensional space (3 for this example), and then we find the hyperplane which separates these points into the two classes, orange and mango, so that we can say the data on one side of the plane belongs to orange and the data on the other side belongs to mango. This gives us an overview of what our objective is and how we can represent a day-to-day problem mathematically, or more specifically, in terms of linear algebra.
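As a small illustration of this representation, the sketch below turns a feature table into coordinates; the rows other than (1, 3, 6) are made-up values:

```python
# Sketch: each row of the feature table becomes one point in 3D space.
# The rows besides (1, 3, 6) are made up for illustration.
import numpy as np

# Columns: colour (orange=1, mango=2), shape, size
features = np.array([
    [1, 3, 6],   # an orange
    [2, 5, 9],   # a mango (hypothetical values)
    [1, 3, 5],   # another orange (hypothetical values)
])

# Each row is directly the coordinates of a point in 3-dimensional space
for point in features:
    print(point.tolist())  # [1, 3, 6], [2, 5, 9], [1, 3, 5]
```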
Now, let's take another example to understand our objective of finding the best possible hyperplane to separate data points in terms of linear algebra. Here we will take only 2 features and 2 classes, as that is easy to visualise.
Suppose we have two features X1 and X2 and two classes A and B. Based on the number of training examples (suppose we have n of them), we will have our points to be drawn, and the values of X1 and X2 will be the coordinates of these points:

Ex: (X11, X21), (X12, X22), (X13, X23), …, (X1n, X2n)


[Figure: a 2D plot with axes X1 and X2; points (X11, X21) and (X12, X22) of Class A on one side and points (X13, X23) and (X1n, X2n) of Class B on the other, separated by a straight line]

Now suppose that after placing the above points in a 2-dimensional space we get a graph like the one above. Since we have two feature columns (X1, X2), and as we know the dimension of the space depends on the number of features, we have chosen a 2D space.
Now we need a decision boundary; in the example above, the decision boundary is the line that separates classes A and B. But a line can be drawn in any direction and at any place. Its orientation and position are decided on several grounds, but the main criterion we are interested in is that the line should divide the entire dataset into two parts (since we have two classes here; with multiple classes the number of parts equals the number of classes), and the division should be such that points belonging to one class lie on one side and points belonging to the other class lie on the other side. So, in the diagram above, we can say that the line drawn is effective and can be our decision boundary.
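To recreate a picture like the one above, here is a short matplotlib sketch; all coordinates are made up for illustration:

```python
# Sketch recreating the figure above: two classes in 2D and a straight
# decision boundary between them. All coordinates are made up.
import numpy as np
import matplotlib.pyplot as plt

class_a = np.array([[1, 2], [2, 1], [2, 3]])   # points of class A
class_b = np.array([[5, 4], [6, 6], [7, 5]])   # points of class B

plt.scatter(class_a[:, 0], class_a[:, 1], label="Class A")
plt.scatter(class_b[:, 0], class_b[:, 1], label="Class B")

# A vertical line at X1 = 3.5 happens to separate the two groups here
plt.axvline(x=3.5, linestyle="--", color="gray")
plt.xlabel("X1")
plt.ylabel("X2")
plt.legend()
plt.show()
```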
Looking at the graph above, it seems quite obvious that we have found our perfect boundary, and for the training dataset this is even true. But if we look carefully at the points (X11, X21) and (X13, X23), they are very close to the decision boundary. That is acceptable for the training dataset, but suppose the test dataset contains a point (X1T, X2T) whose values are near those of (X13, X23), with a slight change in some feature that does not alter its class. Due to that change, there is a high chance the point will fall on the other side of the decision boundary; see the figure below.
[Figure: the same 2D plot with the test point (X1T, X2T) falling on the Class A side of the boundary even though it lies close to Class B; the point is labelled "Misclassified Data"]
From the above example we can say that although the point seems closer to class B, due to the decision boundary it will now be considered class A, as it falls slightly on the left-hand side of the line. So although the line was the best fit for the training data, it failed on the test data; in machine learning terminology we call this overfitting.
To avoid such cases we have several options. For example, we could apply a neural network and, through gradient descent, draw an arbitrarily shaped boundary to achieve the best classification.
See the figure below.

[Figure: the same 2D plot, but with a curved decision boundary bending around the test point (X1T, X2T) so it is no longer misclassified]
But this would require huge computational resources, so the question is: do we have any other effective mechanism to solve this problem without using that many resources?
The answer is yes: Support Vector Machines can help us here.
But before we start with the details of SVM, let's first get an idea of a few concepts which we are going to use in the explanation.
Hyperplane- As we saw above, we draw a hyperplane to separate the points, but what exactly is a hyperplane? A hyperplane is a geometric entity whose dimension is one less than that of the space surrounding it. So if the space has dimension N, then

Dimension of hyperplane = N - 1

So, in a 3-dimensional space the hyperplane has 2 dimensions, and since a 2-dimensional entity is called a plane, the hyperplane in a 3D space is a plane. Similarly, a hyperplane in a 2D space has one dimension, and a 1-dimensional entity is nothing but a line.
For example, above we had a 2D space, so the dimension of our decision boundary had to be 1D; that is why we drew a line to separate our dataset. In other words, a plane plays the same role in 3-dimensional space that a line plays in 2-dimensional space.
Before we get to the mathematical equation of a hyperplane, we should understand what a hyperplane means in machine learning. In machine learning, a hyperplane divides the dataset into its respective classes, so if we have two classes the hyperplane should divide the data into 2 parts.
The equation of a hyperplane is:

Xᵀn + b = 0

where n is the vector normal to the hyperplane and b is the offset. If we expand the equation, we get the generic equation of a plane:

X1n1 + X2n2 + X3n3 + … + XNnN + b = 0
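In code, deciding which side of the hyperplane a point falls on comes down to the sign of this expression. Here is a small sketch; the normal vector and offset values are made-up assumptions for illustration:

```python
# Sketch: classify a point by the sign of X·n + b, where n is the
# normal vector of the hyperplane and b the offset. Values are made up.
import numpy as np

n = np.array([1.0, -2.0, 0.5])   # hypothetical normal vector
b = -1.0                          # hypothetical offset

def side_of_hyperplane(x):
    value = np.dot(x, n) + b
    return "positive side" if value > 0 else "negative side"

# 1*1 + 3*(-2) + 6*0.5 - 1 = -3, so this point is on the negative side
print(side_of_hyperplane(np.array([1.0, 3.0, 6.0])))
```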

Now that we have seen the concept of a plane, let's look at Support Vector Machines in detail.
As we know, the idea behind the support vector machine is to find a plane, or decision boundary, such that the distance from the nearest points of each class to the decision boundary is maximum. Here we can ask two questions: first, why do we need maximum distance, and second, why do we need maximum distance from each class? To understand this, let's take the same example of separating fruits into 2 parts or classes, orange and mango. This time we will not draw a graph but rather take a very simple approach; see the figure below.

[Figure: oranges and mangoes arranged along a size axis, oranges on the left and mangoes on the right]
Consider only the features size and colour for now. We can say that if the size is small and the colour is orange, the class is orange, and if the size is large and the colour is yellow, it belongs to mango. Now our goal is to find a decision boundary such that data on its left-hand side belongs to the orange class and data on its right-hand side belongs to the mango class. Looking at the figure above, we could place the decision boundary anywhere, so let's take 3 cases: one closer to the orange class, denoted D1 in the figure below; one closer to the mango class, denoted D2; and, for a decision boundary with maximum distance from each class, we calculate the distance between the nearest points of each class, i.e. Po of the orange class and Pm of the mango class, and place the boundary exactly at the middle of that distance, denoted Dm in the figure below.

Size
Orange Po Pm Mango
D1 Dm D2
Let's take the case of decision boundary D1. It seems fine, as the distance between it and the nearest point of at least one class, mango, is maximum. But what happens when we get an orange (Po) whose size is a little larger than the others, as in the figure below?

Size
Orange Po Mango
D1
As per our decision boundary it belongs to the mango class, since it lies on the right-hand side of the boundary, but it is clearly very close to the orange class, so we can call this a misclassification. Our decision boundary is not capable of handling a scenario where the size is a little larger, and from personal experience we know that an orange can indeed be a little larger. In machine learning terminology, our decision boundary has not generalised to handle such variations.
Similarly, decision boundary D2, having maximum distance from the orange class, might seem an appropriate decision boundary, but in the case where the size of a mango is smaller than usual, shown as Pm below, our decision boundary will put it in the orange class even though the point is very close to the mango class. So this decision boundary is also not the best, as it too fails to handle slight variations in the data.

Size
Orange Pm Mango
D2
Now let's take our decision boundary Dm, which has equal distance from the nearest point of each class.

Size
Orange Po Pm Mango
Dm
Here we can say that this is the best decision boundary we can have, as it successfully handles the variations present in points Po and Pm, which were being misclassified by the other 2 boundaries.
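Since this example has only one feature (size), Dm is literally the midpoint between the two nearest points Po and Pm. A tiny sketch with made-up sizes shows how this boundary tolerates the variation:

```python
# Sketch of the one-feature fruit example: Dm is placed midway between
# the nearest orange (Po) and the nearest mango (Pm). Sizes are made up.
po = 6.0             # size of the largest orange in the training data
pm = 10.0            # size of the smallest mango in the training data
dm = (po + pm) / 2   # boundary with equal distance to both classes

def classify(size):
    return "orange" if size < dm else "mango"

# A slightly larger orange (size 6.5) still lands on the correct side of Dm
print(dm, classify(6.5))   # 8.0 orange
```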
So, from the above example we got the answers to our 2 questions.
First, why do we need maximum distance? Because with maximum distance we are able to avoid misclassifications that can occur due to small variations in the data.
Second, why do we need maximum distance from each class? Because in this case each class has room to absorb variations in its data properly.
When we talk about the distance from the nearest points, linear algebra actually has a term for it: the margin. We can define the margin as below.
Margin- The margin is the distance from the closest points to the decision surface. We can also say that the margin is the distance between the decision boundary and each of the classes. In the figure below we can see that points (X11, X21) and (X12, X22) are closest to the decision surface, hence the distance between these points and the decision surface is the margin.
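In linear algebra, the distance from a point X to the hyperplane Xᵀn + b = 0 is |Xᵀn + b| / ‖n‖, so the margin is this distance evaluated at the closest point. A small sketch, with made-up numbers:

```python
# Sketch: the margin is the distance from the closest point to the
# decision surface, |x·n + b| / ||n||. All numbers are made up.
import numpy as np

n = np.array([2.0, 1.0])     # hypothetical normal vector of the boundary
b = -4.0                      # hypothetical offset
closest_point = np.array([1.0, 1.0])

margin = abs(np.dot(closest_point, n) + b) / np.linalg.norm(n)
print(margin)   # |2 + 1 - 4| / sqrt(5) ≈ 0.447
```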

Let's plot the above points in the graph below to visualize the concept in more detail.

[Figure: a 2D plot with the decision boundary between Class A and Class B; the nearest points (X11, X21) and (X12, X22) lie on either side, and the margin is marked as the distance from each of them to the boundary]

For the fruit classification example, the points Po and Pm are the nearest points, so the distance between them and the decision surface Dm is the margin.
[Figure: the size axis with boundary Dm; the margin marked on each side as the distance from Po and from Pm to Dm]

So, in this article we tried to understand the concepts below:

 First, we understood what a support vector machine is.
 Next, we went through the concept of a space via an example.
 Then we saw how the dimension of the space is related to the number of features.
 Then, through a graph and an example, we got an idea of how data points are plotted in a space and how we draw a plane to divide them.
 After that, the idea of a hyperplane was explained with its equation.
 Then we went through SVM in detail and, with a few examples, got the idea of the margin and why we need the maximum margin from each class.
In the next article we will see the mathematical concepts behind SVM, and we will try to build the intuition behind it using an example of another classifier algorithm, logistic regression.
