Awik Dhar
Indian Institute of Technology, Madras
Ahmedabad, India
ch18b041@smail.iitm.ac.in
I. INTRODUCTION
Classification is a common machine learning problem, with a
variety of techniques available for different sorts of features
and data.
Support Vector Machine (SVM) is a classification algorithm that
predicts using a decision boundary to separate classes of data.
In the SVM algorithm, we plot each data item as a point in
n-dimensional space (where n is the number of features), with
the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the
hyperplane that separates the two classes well.

Fig. 1. Optimal Hyperplane

B. Non-linear decision boundary

Often the decision boundary for real-world data is non-linear.
The linear SVM fails in that scenario.
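The linear case described above can be sketched in a few lines. This is a minimal illustrative implementation using sub-gradient descent on the hinge loss with synthetic two-cluster data; it is not the paper's actual model, which would normally use a library implementation such as scikit-learn's SVC.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimal linear SVM: sub-gradient descent on the hinge loss.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Point is inside the margin: hinge loss is active.
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:
                # Outside the margin: only the regularizer contributes.
                w += lr * (-lam * w)
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)

# Two well-separated synthetic clusters (stand-ins for real data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w, b = train_linear_svm(X, y)
print((predict(X, w, b) == y).mean())  # training accuracy
```

Because the clusters are linearly separable with a wide gap, the learned hyperplane classifies all training points correctly; on the non-linearly-separable data discussed above, this same model would fail.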
Pulsars are a rare type of neutron star that produce radio
emissions detectable here on Earth. They are of considerable
scientific interest as probes of space-time, the interstellar
medium, and states of matter. Machine learning tools are now
being used to automatically label pulsar candidates to facilitate
rapid analysis. The key task is to predict whether a star is a
pulsar or not. Each candidate is described by 8 continuous
variables and a single class variable. The first four are simple
statistics obtained from the integrated pulse profile (folded
profile). This is an array of continuous variables that describe a
longitude-resolved version of the signal that has been averaged
in both time and frequency. The remaining four variables are
similarly obtained from the DM-SNR curve.

Fig. 2. Linear vs non-linear decision boundary
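As a sketch of how those four profile statistics arise, the snippet below computes them for a synthetic folded profile (the real dataset provides these values precomputed per candidate); the analogous four statistics on the DM-SNR curve complete the 8-dimensional feature vector.

```python
import numpy as np

# Synthetic folded profile standing in for a real candidate's
# longitude-resolved, time/frequency-averaged signal.
rng = np.random.default_rng(42)
profile = rng.normal(loc=10.0, scale=2.0, size=64)

mean = profile.mean()
std = profile.std()
centered = profile - mean
skewness = (centered ** 3).mean() / std ** 3
excess_kurtosis = (centered ** 4).mean() / std ** 4 - 3  # 0 for a Gaussian

features = [mean, std, excess_kurtosis, skewness]
print([round(f, 2) for f in features])
```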
A. Overview
We are given a dataset of 12k records with features such
as integrated profile mean, standard deviation, excess kurtosis,
and so forth.
The dataset is imbalanced, with 91% of records falling in the
'Not pulsar' category and 9% falling in 'pulsar'.
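The imbalance check itself is a one-liner. As a sketch, with stand-in labels mirroring the reported split in place of the dataset's real class column:

```python
# Stand-in labels reproducing the reported 91% / 9% split; in practice
# these would be read from the dataset's class column.
labels = [0] * 91 + [1] * 9  # 0 = 'Not pulsar', 1 = 'pulsar'

pulsar_frac = sum(labels) / len(labels)
print(f"pulsar: {pulsar_frac:.0%}, not pulsar: {1 - pulsar_frac:.0%}")
# → pulsar: 9%, not pulsar: 91%
```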
Since the variance in the features with missing values was high,
substitution with the mean or mode was skipped and records with
missing values were dropped instead.

Fig. 5. Imbalance of data

Correlation heatmaps were obtained, from which an estimate of the
important features can already be made.

The f1 score is the harmonic mean of precision and recall:

f1 = 2 * precision * recall / (precision + recall)

The model has performed well when precision and recall are
simultaneously high for the minor class, i.e. when the f1 score is
high. The f1 score for the smaller class 'pulsar' is 0.83, which is
high given that this class accounts for only 9% of the dataset.
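As a quick check of the f1 formula above, it can be computed directly from confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the calculation; they are not the paper's actual results.

```python
# Hypothetical counts for the minority 'pulsar' class (illustrative
# only, not the paper's confusion matrix).
tp, fp, fn = 80, 15, 18  # true positives, false positives, false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.842 0.816 0.829
```

Note that f1 is high only when precision and recall are both high, which is why it is a more honest summary than accuracy for a 91/9 class split.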
Fig. 8. Confusion Matrix
C. Observations
Most of the features follow Gaussian-like distributions. For
example, the majority of excess kurtosis values (at least 75%) lie
below the mean, so the distribution has a large head and the region
to the left of the mean is more tightly spread than the region to
the right. This suggests the integrated profile's tails are
generally comparable in size to those of a normal distribution. The
standard deviation of the DM-SNR curve was heavily skewed towards
the higher side, indicating a very widely spread DM-SNR curve.

Fig. 10. Distribution of excess kurtosis of integrated profile

Fig. 12. Skewness of DM-SNR vs skewness of profile

Fig. 13. Skewness vs kurtosis
IV. CONCLUSION
Support Vector Classifier is a simple machine learning algorithm
that works really well when there is a clear margin of separation,
and it is effective in high-dimensional spaces. It does not scale
well to large datasets because the required training time grows
quickly, but our training dataset is small. It also does not
perform well when the dataset is noisy, i.e. when the target
classes overlap; this is why interpolation or mean values were
avoided as substitutes for the missing data, to prevent more
points from leaking into the decision boundary. Support Vector
Machines are slow to train, and speeding up their training remains
an area of research.