
Support Vector Machine

Machine Learning

Machine Learning is a subfield of AI concerned with the design and development of algorithms and techniques that allow computers to learn.

Types of Machine Learning

Supervised learning (target values are given; a teacher is present)

Unsupervised learning (no teacher, no target values), e.g., clustering

Semi-supervised learning (both labeled and unlabelled data)

Reinforcement learning (allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance)

Classification
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Given training data in different classes (labels known)
Predict test data (labels unknown)

Input: attribute set (x) → Classification Model → Output: class label (y)
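As a small illustration of this input/output view, here is a minimal sketch in Python, assuming scikit-learn is installed; the digits dataset merely stands in for a set of attribute vectors x and class labels y.

# Minimal sketch: attribute set (x) -> classification model -> class label (y).
# Assumes scikit-learn; the digits dataset is only an illustration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                    # attribute sets and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC()                                          # the classification model
model.fit(X_train, y_train)                            # learn f from labeled training data
print(model.predict(X_test[:5]))                       # predict labels of test data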

Classification contd

Examples:
Handwritten digit recognition
Spam filtering
Text classification
Medical diagnosis
Methods:
Nearest Neighbour
Neural Networks
Decision Trees
Rule-based classifiers
Support vector machines
etc.
Introduction to SVM

Introduction to SVM (contd)

Roots in statistical learning theory

Works well for high-dimensional data
The SVM can be trained to classify both linearly separable and non-linearly separable data
Represents the decision boundary using a subset of the training examples (from both classes), known as support vectors

Introduction to SVM (contd)

The hyperplane is determined by a subset of the datapoints
Datapoints in this subset are called support vectors
It is good if only a small fraction of the datapoints are support vectors

(Figure: the support vectors are indicated by the circles around them)
Basic Concepts
Let the set of training examples D be
{(x1, y1), (x2, y2), ..., (xr, yr)},
where
xi = (xi1, xi2, ..., xin) is an input vector in a real-valued space X ⊆ Rn, and
yi is its class label (output value), yi ∈ {1, -1}.
1: positive class and -1: negative class.
SVM finds a linear function of the form (w: weight vector)
f(x) = w · x + b

yi = +1 if w · xi + b ≥ 0
yi = -1 if w · xi + b < 0
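A tiny Python sketch of this decision rule follows; the weight vector w and bias b below are arbitrary illustrative values, not learned from data.

# Linear decision function f(x) = w . x + b and the resulting class rule.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (illustrative value)
b = -0.5                    # bias (illustrative value)

def predict(x):
    # yi = +1 if w . x + b >= 0, otherwise yi = -1
    return 1 if w @ x + b >= 0 else -1

print(predict(np.array([1.0, 0.5])), predict(np.array([-1.0, 2.0])))   # 1 -1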
Hyperplane
The hyperplane that separates positive and negative training data is
w · x + b = 0
It is also called the decision boundary (surface).
There are many possible hyperplanes; which one should we choose?

yi = +1 if w · xi + b ≥ 0
yi = -1 if w · xi + b < 0
Linear Classifiers

f(x, w, b) = sign(w · x + b)

(Figure: two-class training data, where one marker denotes class +1 and the other denotes class -1; points with w · x + b > 0 fall on the positive side and points with w · x + b < 0 on the negative side.)

How would you classify this data?
Any of these separating hyperplanes would be fine...
...but which is best?
A poorly chosen boundary can leave points misclassified to the +1 class.
Classifier Margin

f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin

f(x, w, b) = sign(w · x + b)

1. Maximizing the margin is good according to intuition.
2. It implies that only support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).
Support vectors are those datapoints that the margin pushes up against.
Large Margin Linear Classifier
Given a set of data points {(xi, yi)}, i = 1, 2, ..., n, where

For yi = +1, wT xi + b > 0
For yi = -1, wT xi + b < 0

With a scale transformation on both w and b, the above is equivalent to

For yi = +1, wT xi + b ≥ 1
For yi = -1, wT xi + b ≤ -1
Linear SVM Mathematically
M = margin width

What we know:
w · x+ + b = +1
w · x- + b = -1
w · (x+ - x-) = 2
Therefore M = (x+ - x-) · w/‖w‖ = 2/‖w‖
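A quick numeric check of these relations, with w, b, x+ and x- made up so that the two margin equations hold exactly:

# Numeric check of the margin relations; all values are illustrative.
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
x_plus  = np.array([2.0, 2.0])   # lies on the plus-side margin:  w . x+ + b = +1
x_minus = np.array([1.0, 1.0])   # lies on the minus-side margin: w . x- + b = -1

print(w @ x_plus + b, w @ x_minus + b)   # 1.0 -1.0
print(w @ (x_plus - x_minus))            # 2.0
print(2 / np.linalg.norm(w))             # margin width M = 2/||w|| ~ 1.414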
SVM imposes an additional requirement:
the margin of its decision boundary must be maximum.
Maximizing the margin is equivalent to minimizing the following objective function.

The learning task in SVM can be formalized as the following constrained optimization problem:

minimize (1/2)‖w‖²
subject to yi (w · xi + b) ≥ 1, i = 1, 2, 3, ..., N
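This is a quadratic program. A minimal sketch of solving it directly, assuming the cvxpy package is available and using a tiny linearly separable toy set:

# Hard-margin primal problem solved as a quadratic program (sketch).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # yi (w . xi + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # parameters of the maximum-margin hyperplane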
Learning a linear SVM model
Formulation:

minimize (1/2)‖w‖²

such that

For yi = +1, wT xi + b ≥ 1
For yi = -1, wT xi + b ≤ -1

i.e., yi (wT xi + b) ≥ 1
Linear SVM: Separable Case
Quadratic programming with linear constraints:

minimize (1/2)‖w‖²
s.t. yi (wT xi + b) ≥ 1

Lagrangian function:

minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1]

s.t. αi ≥ 0

The first term is the same as the original objective (1/2)‖w‖².
The second term captures the inequality constraints.

Note that (1/2)‖w‖² alone is minimized when w = 0, but that violates the constraints.
Solving the Optimization Problem
minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1]

s.t. αi ≥ 0

Setting the derivatives of Lp to zero:

∂Lp/∂w = 0  ⟹  w = Σi=1..n αi yi xi

∂Lp/∂b = 0  ⟹  Σi=1..n αi yi = 0
Solving the Optimization Problem
From the KKT condition, we know:

αi [yi (wT xi + b) - 1] = 0

αi ≥ 0

Thus, only support vectors (the points with yi (wT xi + b) = 1) have αi ≠ 0.

The solution has the form:

w = Σi=1..n αi yi xi = Σi∈SV αi yi xi

Get b from yi (wT xi + b) - 1 = 0, where xi is any support vector.
Solving the Optimization Problem
minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1],  s.t. αi ≥ 0

Lagrangian dual problem:

maximize Σi=1..n αi - (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

s.t. αi ≥ 0, and Σi=1..n αi yi = 0
Example:
Consider the two-dimensional data set below, which contains eight training instances.
Using quadratic programming we can solve the optimization problem to obtain the
Lagrange multiplier for each training instance. The Lagrange multipliers are shown
in the last column of the table. Notice that only the first two instances have
non-zero Lagrange multipliers; these instances correspond to the support vectors
for this data set.

Solution: Let w = (w1, w2) and b denote the parameters of the decision boundary. We
can solve for w1 and w2 in the following way:
w1 = Σi αi yi xi1 = 65.5261*1*0.3858 + 65.5261*(-1)*0.4871 = -6.64
w2 = Σi αi yi xi2 = 65.5261*1*0.4687 + 65.5261*(-1)*0.611 = -9.32
The bias term b can be computed from each support vector:
b1 = 1 - w·x1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300
b2 = -1 - w·x2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289
Averaging these values, we obtain b = 7.93. The decision boundary corresponding
to these parameters is
-6.64 x1 - 9.32 x2 + 7.93 = 0
x1      x2      y    Lagrange Multiplier
0.3858  0.4687   1   65.5261
0.4871  0.611   -1   65.5261
0.9218  0.4103  -1   0
0.7328  0.8936  -1   0
0.1763  0.0579   1   0
0.4057  0.3529   1   0
0.9355  0.8132  -1   0
0.2146  0.0099   1   0
Once the parameters of the decision boundary are found, a test instance z is classified as follows:
f(z) = sign(w · z + b) = sign(Σi αi yi xi · z + b)

If f(z) = 1, the test instance is classified as the positive class;
otherwise it is classified as the negative class.
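As a quick sanity check, the numbers in this example can be reproduced with a few lines of NumPy (the data and multipliers are taken from the table above; the test instance z is arbitrary):

# Recompute w, b and the decision rule from the two support vectors.
import numpy as np

X_sv  = np.array([[0.3858, 0.4687],
                  [0.4871, 0.6110]])
y_sv  = np.array([1, -1])
alpha = np.array([65.5261, 65.5261])

w = (alpha * y_sv) @ X_sv            # w = sum_i alpha_i * y_i * x_i
b = np.mean(y_sv - X_sv @ w)         # b from each support vector, then averaged
print(w, b)                          # approximately [-6.64 -9.32] and 7.93

z = np.array([0.5, 0.5])             # a test instance
print(np.sign(w @ z + b))            # its predicted class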
Linear SVM: Nonseparable Case
What if the data is not linearly separable? (noisy data, outliers, etc.)

Slack variables ξi can be added to allow misclassification of difficult or noisy data points.

(Figure: two points violating the margin, marked with slack variables ξ1 and ξ2.)
Linear SVM: Nonseparable Case
Formulation:

minimize (1/2)‖w‖² + C Σi=1..n ξi

such that
yi (wT xi + b) ≥ 1 - ξi

ξi ≥ 0

The parameter C can be viewed as a way to control over-fitting.
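A minimal sketch of this trade-off, assuming scikit-learn is installed; the two-class data is synthetic and only for illustration. Small C tolerates more slack (wider margin, more support vectors), while large C penalizes margin violations heavily.

# Effect of the soft-margin parameter C on a noisy two-class problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))   # support-vector counts and training accuracy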


Linear SVM: Nonseparable Case
Formulation (Lagrangian dual problem):

maximize Σi=1..n αi - (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

such that
0 ≤ αi ≤ C

Σi=1..n αi yi = 0
Non-linear SVMs
Datasets that are linearly separable with some noise work out great:
(Figure: one-dimensional data on the x axis, separable by a single threshold)

But what are we going to do if the dataset is just too hard?
(Figure: one-dimensional data that no single threshold can separate)

How about mapping the data to a higher-dimensional space:
(Figure: the same data mapped to (x, x²), where it becomes linearly separable)
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

(Figure: data that is not linearly separable in the original coordinates becomes linearly separable after a transformation to polar coordinates, plotted against the distance from the center (radius).)

Need to transform the coordinates: e.g., polar coordinates, or a kernel transformation into a higher-dimensional space (support vector machines).

Figure: Feature Space Representation
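The same idea in a few lines of NumPy; the quadratic map φ(x) = (x, x²) is an assumed, illustrative choice matching the one-dimensional example sketched above.

# A 1-D data set that no single threshold separates becomes linearly
# separable after mapping each point x to (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1, 1, -1)       # inner points: +1, outer points: -1

phi = np.column_stack([x, x**2])          # feature map phi(x) = (x, x^2)
pred = np.sign(phi @ np.array([0.0, -1.0]) + 2.0)   # separating line x^2 = 2 in feature space
print(np.all(pred == y))                  # True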
The Kernel Trick
The linear classifier relies on the dot product between vectors:
K(xi, xj) = xiT xj
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi, xj) = φ(xi)T φ(xj)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiT xj)².
Need to show that K(xi, xj) = φ(xi)T φ(xj):
K(xi, xj) = (1 + xiT xj)²
= 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]T [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)T φ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
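The identity can be checked numerically; the two sample vectors below are arbitrary.

# Verify (1 + xi.xj)^2 == phi(xi).phi(xj) for the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([0.7, -1.3])
xj = np.array([2.0, 0.5])

kernel_value   = (1 + xi @ xj) ** 2     # computed in the original input space
explicit_value = phi(xi) @ phi(xj)      # inner product in the expanded feature space
print(np.isclose(kernel_value, explicit_value))   # True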
Examples of Kernel Functions
Linear: K(xi, xj) = xiT xj

Polynomial of power p: K(xi, xj) = (1 + xiT xj)^p

Gaussian (radial-basis function network):

K(xi, xj) = exp(-‖xi - xj‖² / (2σ²))

Sigmoid: K(xi, xj) = tanh(β0 xiT xj + β1)
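A minimal sketch of trying these kernels on a data set that is not linearly separable, assuming scikit-learn; the ring-shaped data is synthetic and only for illustration.

# Compare the kernels above on concentric circles (gamma plays the role of 1/(2*sigma^2)).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 2}),
                       ("rbf", {"gamma": 1.0}), ("sigmoid", {})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # the linear kernel should do poorly here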


Performance
Support vector machines work very well in practice.
The user must choose the kernel function and its parameters, but the rest is automatic.
The test performance is very good.
They can be expensive in time and space for big datasets:
The computation of the maximum-margin hyperplane depends on the square of the number of training cases.
We need to store all the support vectors.
SVMs are very good if you have no idea what structure to impose on the task.
Characteristics of SVM
The SVM learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available to find the global minimum of the objective function.
SVM performs capacity control by maximizing the margin of the decision boundary.
SVM can be applied to categorical data by introducing a dummy variable for each categorical attribute value present in the data.
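A minimal sketch of the dummy-variable idea, assuming scikit-learn's OneHotEncoder; the tiny categorical data set is made up for illustration.

# One dummy (one-hot) variable per categorical attribute value, then a linear SVM.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

X_cat = np.array([["red", "small"], ["blue", "large"],
                  ["red", "large"], ["green", "small"]])
y = np.array([1, -1, -1, 1])

enc = OneHotEncoder()                 # one dummy variable per category value
X = enc.fit_transform(X_cat)          # sparse one-hot matrix (SVC accepts sparse input)
clf = SVC(kernel="linear").fit(X, y)
print(enc.get_feature_names_out(), clf.predict(X))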
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of the solution when dealing with large data sets (only support vectors are used to specify the separating hyperplane)
Ability to handle large feature spaces
Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
SVM Application
SVM has been used successfully in many real-world problems:
Text categorization
Image classification
Bioinformatics
Handwritten character recognition
Support vector machines (SVMs) at work
(a) Two-dimensional expression profiles of lymphoblastic leukemia. The SVM's task is to assign a label to the gene expression profile labeled "Unknown".
(b) A separating hyperplane.
(c) A hyperplane in one dimension. The hyperplane is shown as a single black point.
(d) A hyperplane in three dimensions.
(e) Many possible separating hyperplanes.

(f) The maximum-margin hyperplane. The three support vectors are circled.
(g) A data set containing one error, indicated by an arrow.
(h) A separating hyperplane with a soft margin. The error is indicated by an arrow.
(i) A nonseparable one-dimensional data set.
(j) Separating previously nonseparable data.
(k) A linearly nonseparable two-dimensional data set, which is linearly separable in four dimensions.
(l) An SVM that has overfit a two-dimensional data set.

