
SVM

Introduction

• Support Vector Machine (SVM) is the most popular margin-based supervised classifier.
• It is a discriminative classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
• In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side.
• Generally it is used to separate the dataset into two classes; an extension to multi-class classification is described later.
• The goal of designing an SVM is to find the optimal separating hyperplane that maximizes the margin (the largest separation) between the two classes. The data points on the margin of the hyperplane are called support vectors.
How to decide the separating line between the classes:
• If the data overlap, there is a chance of misclassification.
• Solution 1: tolerate some outlier points.
• Solution 2: try to achieve zero tolerance with a perfect partition (both solutions are sketched below).
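As an illustration only (the data, names and values below are assumptions, not part of the original slides), the two solutions map onto the C parameter of scikit-learn's soft-margin SVC:

```python
# Sketch with hypothetical overlapping data: a small C tolerates some outlier
# points (Solution 1), while a very large C pushes towards a perfect partition
# of the training data (Solution 2).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1e6):                        # tolerant vs. (near) zero tolerance
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = (clf.predict(X) != y).sum()     # misclassified training points
    print(f"C={C}: {errors} training points misclassified")
```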
Definitions
Define the hyperplane H such that:

xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ -1 when yi = -1

H1 and H2 are the planes:

H1: xi·w + b = +1
H2: xi·w + b = -1

The points on the planes H1 and H2 are the support vectors.

d+ = the shortest distance to the closest positive point

d- = the shortest distance to the closest negative point


The margin of a separating hyperplane is d+ + d-.
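As a sketch (the toy data and variable names below are assumptions, not from the slides), scikit-learn's linear SVC exposes w, b and the support vectors directly, so d+ and d- can be read off a fitted model:

```python
# Minimal sketch: fit a (near) hard-margin linear SVM on separable toy data and
# inspect the hyperplane H: x.w + b = 0, the support vectors on H1/H2, and d+/d-.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# The support vectors are the points lying on H1 (x.w + b = +1) or H2 (x.w + b = -1).
print("support vectors:\n", clf.support_vectors_)
print("x.w + b at the support vectors:", X[clf.support_] @ w + b)

# d+ / d- are the shortest distances from H to the closest positive / negative point.
dist = (X @ w + b) / np.linalg.norm(w)
print("d+ =", dist[y == +1].min(), " d- =", -dist[y == -1].max())
```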
Maximizing the margin
We want a classifier with as large a margin as possible.

Recall that the distance from a point (x0, y0) to the line Ax + By + C = 0 is |A·x0 + B·y0 + C| / sqrt(A^2 + B^2).

The distance between H and H1 is: |w·x + b| / ||w|| = 1 / ||w||

The distance between H1 and H2 is: 2 / ||w||

In order to maximize the margin, we need to minimize ||w||, subject to the condition that there are no data points between H1 and H2:

xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ -1 when yi = -1

These two conditions can be combined into yi(xi·w + b) ≥ 1.
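A short sketch (assumed toy data, not from the slides) verifying the combined constraint and computing the margin 2/||w|| from a fitted linear SVM:

```python
# Sketch: after fitting, every training point satisfies yi(xi.w + b) >= 1 (up to
# numerical tolerance) and the margin between H1 and H2 equals 2/||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("yi(xi.w + b):", y * (X @ w + b))       # all values >= 1
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```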
Tuning parameters
• Regularization (C): the regularization parameter tells the SVM optimization how much you want to avoid misclassifying each training example (see the tuning sketch after this list).
  • Large value = the optimizer tries to classify every training example correctly, at the cost of a narrower margin.
  • Small value = a wider margin is preferred, which increases the chance of some training misclassification.
• Gamma: the gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'.
  • High values = only nearby points influence the decision boundary.
  • Low values = far-away points are also taken into account.
• Kernel: the learning of the hyperplane in a linear SVM is done by transforming the problem using some linear algebra; this is where the kernel plays its role.
  • Linear
  • Polynomial: computes the separating boundary in a higher-dimensional space (the "kernel trick").
  • Exponential
• Margin: the margin is the separation between the line and the closest points of each class. A good margin is one where this separation is large for both classes.
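The sketch below (with hypothetical data from make_classification; the specific values are illustrative only) shows where these parameters appear in scikit-learn's SVC:

```python
# Sketch of the tuning parameters in scikit-learn's SVC: C (regularization),
# gamma (reach of a single example) and the kernel.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Large C: narrow margin, tries to classify every training point correctly.
# Small C: wider margin, tolerates more training misclassification.
# High gamma: only nearby points shape the boundary; low gamma: far points do too.
for kernel, gamma in [("linear", "scale"), ("poly", "scale"), ("rbf", 0.1), ("rbf", 10.0)]:
    clf = SVC(kernel=kernel, C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(kernel, gamma, "test accuracy:", clf.score(X_te, y_te))
```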
Margin

• A hard margin is one which cleanly separates the positive and negative points.
• A soft margin, also called a noisy linear SVM, allows some misclassified points.
• The soft-margin solution keeps a linear decision boundary while tolerating (and penalizing) the points that it misclassifies.
Loss function
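As a sketch of the standard soft-margin objective (assumed here; the slide's own formula is not reproduced in the text): minimize ½||w||² + C·Σ max(0, 1 - yi(xi·w + b)), where the second term is the hinge loss.

```python
# Sketch of the assumed soft-margin objective: margin term plus hinge loss.
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 + C * sum of max(0, 1 - yi(xi.w + b)) over the training points."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Tiny hypothetical example: only the point that violates the margin contributes.
X = np.array([[2.0, 2.0], [0.5, 0.5], [-2.0, -2.0]])
y = np.array([+1, +1, -1])
w, b = np.array([0.5, 0.5]), 0.0
print(svm_objective(w, b, X, y))   # 0.25 (margin term) + 0.5 (hinge of the middle point)
```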
Multi-class classification

• Support Vector Machines (SVM) were originally designed for binary classification.
• How to extend them effectively to multi-class classification is still an on-going research issue.
• Currently there are two common approaches for multi-class SVM, both built on binary classification (see the sketch after this list):
  • "one-against-all" (one-vs-rest)
  • "one-against-one" (one-vs-one)
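A sketch of the two strategies (the Iris dataset is used here only as stand-in multi-class data) via scikit-learn's wrappers around a binary SVC:

```python
# Sketch: one-against-all builds one classifier per class; one-against-one builds
# one classifier per pair of classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-all
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # one-against-one

print("one-against-all classifiers:", len(ovr.estimators_))  # k for k classes
print("one-against-one classifiers:", len(ovo.estimators_))  # k*(k-1)/2 pairs
```

Note that scikit-learn's SVC already applies one-against-one internally for multi-class input; the wrappers above simply make the two strategies explicit.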
Overtraining/overfitting
A well-known problem with machine learning methods is overtraining (overfitting). This means that the model has learned the training data very well, but cannot classify unseen examples correctly.

An example: a botanist who really knows trees; every time he sees a new tree, he claims it is not a tree.

Overtraining/overfitting (contd.)

One measure of the risk of overtraining with SVM (there are also other measures): it can be shown that the proportion n of unseen data that will be misclassified is bounded by

n ≤ (number of support vectors) / (number of training examples)

Ockham's razor principle: simpler systems are better than more complex ones. In the SVM case, fewer support vectors mean a simpler representation of the hyperplane.

Example: understanding a certain cancer is easier if it can be described by one gene than if we have to describe it with 5000 genes.
Limitation

• The biggest limitation of SVM lies in the choice of the kernel; the best kernel for a given problem is still a research problem (see the selection sketch after this list).
• A second limitation is speed and size, mostly in training (for large training sets it typically selects a small number of support vectors, which keeps the computational requirements during testing low).
• The optimal design for multi-class SVM classifiers is also a research area.
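Since the best kernel for a given problem is not known in advance, in practice it is usually chosen by cross-validation; a minimal sketch with hypothetical data:

```python
# Sketch: cross-validated grid search over kernels and their parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("selected kernel/parameters:", search.best_params_)
```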
