
A mathematical essay on support vector machine

Awik Dhar
Indian Institute of Technology, Madras
Ahmedabad, India
ch18b041@smail.iitm.ac.in

Abstract—This article aims to give an overview of the Support Vector Machine algorithm and describes how it has been used to predict, with high performance, whether a star with observed characteristics is a pulsar star or not.

I. INTRODUCTION
Classification is a common machine learning problem, with a variety of techniques available for different sorts of features and data.

The Support Vector Machine is a classification algorithm which predicts using a decision boundary that separates the classes of data. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best differentiates the two classes.
Pulsars are a rare type of neutron star that produce radio emissions detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. The key task is to predict whether a star is a pulsar or not. Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve.

We inspect the data and the variable associations and fit a Support Vector Classifier to make predictions. The algorithm is suitable for multi-dimensional separable data with appropriate use of kernels.

II. SUPPORT VECTOR CLASSIFIER

A. Linear Support Vector Classifier


The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
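This maximum-margin objective can be written as a standard constrained optimization problem (a textbook formulation added here for concreteness, not stated in the original text). For training points x_i with labels y_i in {-1, +1}, the separating hyperplane w·x + b = 0 is obtained from

\min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\left(w^{\top} x_i + b\right) \ge 1, \qquad i = 1, \dots, m,

so that the margin between the two classes is 2/\lVert w \rVert, and minimizing \lVert w \rVert maximizes it.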
Fig. 1. Optimal Hyperplane

B. Non-linear decision boundary

Often the decision boundary for real-world data is non-linear. The linear SVM fails in that scenario.

Fig. 2. Linear vs non-linear decision boundary

We can manually change features and map the data from the original space into a higher dimensional feature space. The goal is that after the transformation to the higher dimensional space, the classes are linearly separable in this higher dimensional feature space. But this is manual and tedious.

Fig. 3. Linearly separable after squaring the features
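As a small illustration of such a manual mapping (a sketch on made-up two-dimensional data, not the pulsar dataset), squaring the coordinates turns a circular boundary into a linear one, after which a linear SVM suffices:

import numpy as np
from sklearn.svm import LinearSVC

# Toy data: points inside the circle x1^2 + x2^2 = 1 are one class,
# points outside are the other; no straight line separates them as-is.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Manual feature map: (x1, x2) -> (x1^2, x2^2).
# The circular boundary becomes the straight line z1 + z2 = 1 in the new space.
Z = X ** 2

clf = LinearSVC().fit(Z, y)
print("accuracy after squaring the features:", clf.score(Z, y))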
C. The Kernel trick

The kernel trick provides a solution to this problem. The “trick” is that kernel methods represent the data only through a set of pairwise similarity comparisons between the original data observations x (with the original coordinates in the lower dimensional space), instead of explicitly applying the transformation φ(x) and representing the data by these transformed coordinates in the higher dimensional feature space.

Our kernel function accepts inputs in the original lower dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. There are also theorems which guarantee the existence of such kernel functions under certain conditions.
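As a concrete illustration (a numerical check added here, not part of the original analysis): for the degree-2 polynomial kernel k(x, z) = (x·z)^2 in two dimensions, the explicit transformation is φ(v) = (v1^2, v2^2, √2·v1·v2), and the kernel evaluated in the original space equals the dot product of the transformed vectors:

import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

def poly_kernel(x, z):
    # Kernel computed entirely in the original 2-D space.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))  # dot product taken in the 3-D feature space
print(poly_kernel(x, z))       # identical value, without ever computing phi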

Fig. 4. Linearly separable after kernel transformation


III. DATA AND THE PROBLEM

The key task is to predict whether a star is a pulsar or not. Each candidate is described by 8 continuous variables and a single class variable.

A. Overview
We are given a dataset of 12k records, with features such as the integrated profile mean, standard deviation, excess kurtosis, and so forth. The dataset is imbalanced, with 91% of the records falling in the ’Not pulsar’ category and 9% falling in ’pulsar’.

Fig. 5. Imbalance of data

Since the variance of the features with missing values was high, substitution with the mean or mode was skipped and records with missing values were dropped instead.

Correlation heatmaps were obtained, and an estimate of the important features can already be made from them.

Fig. 6. Heat map of features, intensity of colours shows strong correlations
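A minimal sketch of the preprocessing described in this overview, assuming the data sits in a CSV file with a binary target column (the file name and column name below are placeholders, not taken from the paper):

import pandas as pd

df = pd.read_csv("pulsar_data.csv")          # placeholder file name

# Class imbalance: roughly 91% non-pulsar vs 9% pulsar.
print(df["target_class"].value_counts(normalize=True))

# Records with missing values are dropped rather than imputed with mean/mode.
df = df.dropna()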

B. Approach

20% of the data was held out as test data. A Support Vector Classifier (SVC) was fit with the regularization parameter C set to 1.3; a smaller C results in more regularization. The penalty is l2 and the kernel is set to rbf. The observed train accuracy was 97.25% and the test accuracy was 97.2%.

Fig. 7. Train data learning curve

But since the data is imbalanced, a better classification performance metric is the f1 score for each class, where the f1 score for any given class is defined as

f_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

The model has performed well when the precision and recall are simultaneously high for the minor class, or equivalently when its f1 score is high. The f1 score for the smaller class ’pulsar’ is 0.83, which is high given that this class accounts for only 9% of the dataset.
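A minimal sketch of this approach using the scikit-learn API (the 80/20 split, C = 1.3 and the rbf kernel are as stated above; the placeholder file and column names, the feature scaling step and the stratified split are assumptions made for the sketch):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load the prepared data (placeholder names, as in the earlier sketch).
df = pd.read_csv("pulsar_data.csv").dropna()
X = df.drop(columns=["target_class"]).values
y = df["target_class"].values

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Standardizing the features is a common step before an RBF-kernel SVC
# (an assumption; the paper does not state it).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Support Vector Classifier with C = 1.3 and an RBF kernel, as described above.
clf = SVC(C=1.3, kernel="rbf").fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))

# Per-class precision, recall and f1 scores, which matter more than accuracy
# on imbalanced data.
print(classification_report(y_test, clf.predict(X_test)))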
Fig. 8. Confusion Matrix

Fig. 9. f1 score, Classification Report

The data was highly separable, which is favourable for an SVC. Taking a few pairs of features, we can see that the pulsar and non-pulsar points are well separated. The skewness of the DM-SNR curve vs the skewness of the integrated profile, and the skewness vs kurtosis plots, illustrate this.

Fig. 12. Skewness of DM-SNR vs skewness profile
Fig. 13. Skewness vs kurtosis

C. Observations

Most of the features follow Gaussian-like distributions. For example, the majority of the excess kurtosis values (at least 75%) lie below the mean, so the distribution has a large head and the portion to the left of the mean is more tightly spread than the portion to the right. This suggests the integrated profile’s tails are generally of a size comparable to those of a normal distribution. The standard deviation of the DM-SNR curve is heavily skewed towards higher values, indicating a very widely spread DM-SNR curve.
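These distributional checks can be reproduced with simple summary statistics, for example (a sketch; the file and column names are placeholders for the actual feature names):

import pandas as pd

df = pd.read_csv("pulsar_data.csv")   # placeholder file name

# Quartiles of the excess kurtosis of the integrated profile: the 75th
# percentile lying below the mean indicates most of the mass sits to the
# left of the mean.
print(df["excess_kurtosis_profile"].describe())   # placeholder column name

# Skewness of the standard deviation of the DM-SNR curve: a large positive
# value indicates a long tail towards higher values.
print(df["std_dm_snr"].skew())                    # placeholder column name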
Fig. 10. Distribution of Excess kurtosis of integrated profile

Fig. 11. Distribution of Standard deviation of the DM-SNR curve
IV. CONCLUSION

The Support Vector Classifier is a simple machine learning algorithm that works very well when there is a clear margin of separation, and it is effective in high dimensional spaces. It does not perform well on large datasets because the required training time is high, but our training dataset is small. It also does not perform well when the dataset is noisy, i.e. when the target classes overlap, which is why interpolation or mean values were avoided as substitutes for the missing data, to prevent more points from leaking into the decision boundary. Support Vector Machines are slow to train, and speeding up their training remains an area of research.
