
Support Vector Machine

Machine Learning

Machine Learning is a subfield of AI concerned with the design and development of algorithms and techniques that allow computers to learn.

Types of Machine Learning

Supervised learning (target values are given; a teacher is present)

Unsupervised learning (no teacher, no target values), e.g., clustering

Semi-supervised learning (both labeled and unlabelled data)

Reinforcement learning (allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance)

Classification
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Given training data in different classes (labels known)
Predict test data (labels unknown)

Input: attribute set (x) → Classification Model → Output: class label (y)
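As a small illustration of this input/output view, here is a minimal sketch in Python, assuming scikit-learn is installed; the digits dataset merely stands in for a set of attribute vectors x and class labels y.

# Minimal sketch: attribute set (x) -> classification model -> class label (y).
# Assumes scikit-learn; the digits dataset is only an illustration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                    # attribute sets and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC()                                          # the classification model
model.fit(X_train, y_train)                            # learn f from labeled training data
print(model.predict(X_test[:5]))                       # predict labels of test data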

Classification contd

Examples:
Handwritten digit recognition
Spam filtering
Text classification
Medical diagnosis
Methods:
Nearest Neighbour
Neural Networks
Decision Trees
Rule-based classifiers
Support vector machines
etc.
Introduction to SVM

Introduction to SVM (contd)

Roots in statistical learning theory

Works well for high-dimensional data
The SVM can be trained to classify both linearly separable and non-linearly separable data
Represents the decision boundary using a subset of the training examples (from both classes), known as support vectors

Introduction to SVM (contd)

The hyperplane is determined by a subset of the datapoints
Datapoints in this subset are called support vectors
It is good if only a small fraction of the datapoints are support vectors

(Figure: the support vectors are indicated by the circles around them)
Basic Concepts
Let the set of training examples D be
{(x1, y1), (x2, y2), ..., (xr, yr)},
where
xi = (xi1, xi2, ..., xin) is an input vector in a real-valued space X ⊆ Rn, and
yi is its class label (output value), yi ∈ {1, -1}.
1: positive class and -1: negative class.
SVM finds a linear function of the form (w: weight vector)
f(x) = w · x + b

yi = +1 if w · xi + b ≥ 0
yi = -1 if w · xi + b < 0
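A tiny Python sketch of this decision rule follows; the weight vector w and bias b below are arbitrary illustrative values, not learned from data.

# Linear decision function f(x) = w . x + b and the resulting class rule.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (illustrative value)
b = -0.5                    # bias (illustrative value)

def predict(x):
    # yi = +1 if w . x + b >= 0, otherwise yi = -1
    return 1 if w @ x + b >= 0 else -1

print(predict(np.array([1.0, 0.5])), predict(np.array([-1.0, 2.0])))   # 1 -1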
Hyperplane
The hyperplane that separates positive and negative training data is
w · x + b = 0
It is also called the decision boundary (surface).
There are many possible hyperplanes; which one should we choose?

yi = +1 if w · xi + b ≥ 0
yi = -1 if w · xi + b < 0
Linear Classifiers

f(x, w, b) = sign(w · x + b)

(Figure: two-class training data, where one marker denotes class +1 and the other denotes class -1; points with w · x + b > 0 fall on the positive side and points with w · x + b < 0 on the negative side.)

How would you classify this data?
Any of these separating hyperplanes would be fine...
...but which is best?
A poorly chosen boundary can leave points misclassified to the +1 class.
Classifier Margin

f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin

f(x, w, b) = sign(w · x + b)

1. Maximizing the margin is good according to intuition.
2. It implies that only support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).
Support vectors are those datapoints that the margin pushes up against.
Large Margin Linear Classifier
Given a set of data points {(xi, yi)}, i = 1, 2, ..., n, where

For yi = +1, wT xi + b > 0
For yi = -1, wT xi + b < 0

With a scale transformation on both w and b, the above is equivalent to

For yi = +1, wT xi + b ≥ 1
For yi = -1, wT xi + b ≤ -1
Linear SVM Mathematically
M = margin width

What we know:
w · x+ + b = +1
w · x- + b = -1
w · (x+ - x-) = 2
Therefore M = (x+ - x-) · w/‖w‖ = 2/‖w‖
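A quick numeric check of these relations, with w, b, x+ and x- made up so that the two margin equations hold exactly:

# Numeric check of the margin relations; all values are illustrative.
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
x_plus  = np.array([2.0, 2.0])   # lies on the plus-side margin:  w . x+ + b = +1
x_minus = np.array([1.0, 1.0])   # lies on the minus-side margin: w . x- + b = -1

print(w @ x_plus + b, w @ x_minus + b)   # 1.0 -1.0
print(w @ (x_plus - x_minus))            # 2.0
print(2 / np.linalg.norm(w))             # margin width M = 2/||w|| ~ 1.414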
SVM imposes an additional requirement:
the margin of its decision boundary must be maximum.
Maximizing the margin is equivalent to minimizing the following objective function.

The learning task in SVM can be formalized as the following constrained optimization problem:

minimize (1/2)‖w‖²
subject to yi (w · xi + b) ≥ 1, i = 1, 2, 3, ..., N
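This is a quadratic program. A minimal sketch of solving it directly, assuming the cvxpy package is available and using a tiny linearly separable toy set:

# Hard-margin primal problem solved as a quadratic program (sketch).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # yi (w . xi + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # parameters of the maximum-margin hyperplane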
Learning a linear SVM model
Formulation:

minimize (1/2)‖w‖²

such that

For yi = +1, wT xi + b ≥ 1
For yi = -1, wT xi + b ≤ -1

i.e., yi (wT xi + b) ≥ 1
Linear SVM: Separable Case
Quadratic programming with linear constraints:

minimize (1/2)‖w‖²
s.t. yi (wT xi + b) ≥ 1

Lagrangian function:

minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1]

s.t. αi ≥ 0

The first term is the same as the original objective (1/2)‖w‖².
The second term captures the inequality constraints.

Note that (1/2)‖w‖² alone is minimized when w = 0, but that violates the constraints.
Solving the Optimization Problem
minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1]

s.t. αi ≥ 0

Setting the derivatives of Lp to zero:

∂Lp/∂w = 0  ⟹  w = Σi=1..n αi yi xi

∂Lp/∂b = 0  ⟹  Σi=1..n αi yi = 0
Solving the Optimization Problem
From the KKT condition, we know:

αi [yi (wT xi + b) - 1] = 0

αi ≥ 0

Thus, only support vectors (the points with yi (wT xi + b) = 1) have αi ≠ 0.

The solution has the form:

w = Σi=1..n αi yi xi = Σi∈SV αi yi xi

Get b from yi (wT xi + b) - 1 = 0, where xi is any support vector.
Solving the Optimization Problem
minimize Lp(w, b, αi) = (1/2)‖w‖² - Σi=1..n αi [yi (wT xi + b) - 1],  s.t. αi ≥ 0

Lagrangian dual problem:

maximize Σi=1..n αi - (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

s.t. αi ≥ 0, and Σi=1..n αi yi = 0
Example:
Consider the two-dimensional data set below, which contains eight training instances.
Using quadratic programming we can solve the optimization problem to obtain the
Lagrange multiplier for each training instance. The Lagrange multipliers are shown
in the last column of the table. Notice that only the first two instances have
non-zero Lagrange multipliers; these instances correspond to the support vectors
for this data set.

Solution: Let w = (w1, w2) and b denote the parameters of the decision boundary. We
can solve for w1 and w2 in the following way:
w1 = Σi αi yi xi1 = 65.5261*1*0.3858 + 65.5261*(-1)*0.4871 = -6.64
w2 = Σi αi yi xi2 = 65.5261*1*0.4687 + 65.5261*(-1)*0.611 = -9.32
The bias term b can be computed from each support vector:
b1 = 1 - w·x1 = 1 - (-6.64)(0.3858) - (-9.32)(0.4687) = 7.9300
b2 = -1 - w·x2 = -1 - (-6.64)(0.4871) - (-9.32)(0.611) = 7.9289
Averaging these values, we obtain b = 7.93. The decision boundary corresponding
to these parameters is
-6.64 x1 - 9.32 x2 + 7.93 = 0
x1      x2      y    Lagrange Multiplier
0.3858  0.4687   1   65.5261
0.4871  0.611   -1   65.5261
0.9218  0.4103  -1   0
0.7328  0.8936  -1   0
0.1763  0.0579   1   0
0.4057  0.3529   1   0
0.9355  0.8132  -1   0
0.2146  0.0099   1   0
Once the parameters of the decision boundary are found, a test instance z is classified as follows:
f(z) = sign(w · z + b) = sign(Σi αi yi xi · z + b)

If f(z) = 1, the test instance is classified as the positive class;
otherwise it is classified as the negative class.
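As a quick sanity check, the numbers in this example can be reproduced with a few lines of NumPy (the data and multipliers are taken from the table above; the test instance z is arbitrary):

# Recompute w, b and the decision rule from the two support vectors.
import numpy as np

X_sv  = np.array([[0.3858, 0.4687],
                  [0.4871, 0.6110]])
y_sv  = np.array([1, -1])
alpha = np.array([65.5261, 65.5261])

w = (alpha * y_sv) @ X_sv            # w = sum_i alpha_i * y_i * x_i
b = np.mean(y_sv - X_sv @ w)         # b from each support vector, then averaged
print(w, b)                          # approximately [-6.64 -9.32] and 7.93

z = np.array([0.5, 0.5])             # a test instance
print(np.sign(w @ z + b))            # its predicted class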
Linear SVM: Nonseparable Case
What if the data is not linearly separable? (noisy data, outliers, etc.)

Slack variables ξi can be added to allow misclassification of difficult or noisy data points.

(Figure: two points violating the margin, marked with slack variables ξ1 and ξ2.)
Linear SVM: Nonseparable Case
Formulation:

minimize (1/2)‖w‖² + C Σi=1..n ξi

such that
yi (wT xi + b) ≥ 1 - ξi

ξi ≥ 0

The parameter C can be viewed as a way to control over-fitting.
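A minimal sketch of this trade-off, assuming scikit-learn is installed; the two-class data is synthetic and only for illustration. Small C tolerates more slack (wider margin, more support vectors), while large C penalizes margin violations heavily.

# Effect of the soft-margin parameter C on a noisy two-class problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))   # support-vector counts and training accuracy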


Linear SVM: Nonseparable Case
Formulation (Lagrangian dual problem):

maximize Σi=1..n αi - (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

such that
0 ≤ αi ≤ C

Σi=1..n αi yi = 0
Non-linear SVMs
Datasets that are linearly separable with some noise work out great:
(Figure: one-dimensional data on the x axis, separable by a single threshold)

But what are we going to do if the dataset is just too hard?
(Figure: one-dimensional data that no single threshold can separate)

How about mapping the data to a higher-dimensional space:
(Figure: the same data mapped to (x, x²), where it becomes linearly separable)
Non-linear SVMs: Feature spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

(Figure: data that is not linearly separable in the original coordinates becomes linearly separable after a transformation to polar coordinates, plotted against the distance from the center (radius).)

Need to transform the coordinates: e.g., polar coordinates, or a kernel transformation into a higher-dimensional space (support vector machines).

Figure: Feature Space Representation
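The same idea in a few lines of NumPy; the quadratic map φ(x) = (x, x²) is an assumed, illustrative choice matching the one-dimensional example sketched above.

# A 1-D data set that no single threshold separates becomes linearly
# separable after mapping each point x to (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1, 1, -1)       # inner points: +1, outer points: -1

phi = np.column_stack([x, x**2])          # feature map phi(x) = (x, x^2)
pred = np.sign(phi @ np.array([0.0, -1.0]) + 2.0)   # separating line x^2 = 2 in feature space
print(np.all(pred == y))                  # True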
The Kernel Trick
The linear classifier relies on the dot product between vectors:
K(xi, xj) = xiT xj
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi, xj) = φ(xi)T φ(xj)
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiT xj)².
Need to show that K(xi, xj) = φ(xi)T φ(xj):
K(xi, xj) = (1 + xiT xj)²
= 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]T [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)T φ(xj), where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
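The identity can be checked numerically; the two sample vectors below are arbitrary.

# Verify (1 + xi.xj)^2 == phi(xi).phi(xj) for the degree-2 polynomial kernel.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([0.7, -1.3])
xj = np.array([2.0, 0.5])

kernel_value   = (1 + xi @ xj) ** 2     # computed in the original input space
explicit_value = phi(xi) @ phi(xj)      # inner product in the expanded feature space
print(np.isclose(kernel_value, explicit_value))   # True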
Examples of Kernel Functions
Linear: K(xi, xj) = xiT xj

Polynomial of power p: K(xi, xj) = (1 + xiT xj)^p

Gaussian (radial-basis function network):

K(xi, xj) = exp(-‖xi - xj‖² / (2σ²))

Sigmoid: K(xi, xj) = tanh(β0 xiT xj + β1)
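A minimal sketch of trying these kernels on a data set that is not linearly separable, assuming scikit-learn; the ring-shaped data is synthetic and only for illustration.

# Compare the kernels above on concentric circles (gamma plays the role of 1/(2*sigma^2)).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 2}),
                       ("rbf", {"gamma": 1.0}), ("sigmoid", {})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # the linear kernel should do poorly here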


Performance
Support vector machines work very well in practice.
The user must choose the kernel function and its parameters, but the rest is automatic.
The test performance is very good.
They can be expensive in time and space for big datasets:
The computation of the maximum-margin hyperplane depends on the square of the number of training cases.
We need to store all the support vectors.
SVMs are very good if you have no idea what structure to impose on the task.
Characteristics of SVM
The SVM learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available to find the global minimum of the objective function.
SVM performs capacity control by maximizing the margin of the decision boundary.
SVM can be applied to categorical data by introducing a dummy variable for each categorical attribute value present in the data.
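A minimal sketch of the dummy-variable idea, assuming scikit-learn's OneHotEncoder; the tiny categorical data set is made up for illustration.

# One dummy (one-hot) variable per categorical attribute value, then a linear SVM.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

X_cat = np.array([["red", "small"], ["blue", "large"],
                  ["red", "large"], ["green", "small"]])
y = np.array([1, -1, -1, 1])

enc = OneHotEncoder()                 # one dummy variable per category value
X = enc.fit_transform(X_cat)          # sparse one-hot matrix (SVC accepts sparse input)
clf = SVC(kernel="linear").fit(X, y)
print(enc.get_feature_names_out(), clf.predict(X))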
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of the solution when dealing with large data sets (only support vectors are used to specify the separating hyperplane)
Ability to handle large feature spaces
Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
SVM Application
SVM has been used successfully in many real-world problems:
Text categorization
Image classification
Bioinformatics
Handwritten character recognition
Support vector machines (SVMs) at work
(a) Two-dimensional expression profiles of lymphoblastic leukemia. The SVM's task is to assign a label to the gene expression profile labeled "Unknown".
(b) A separating hyperplane.
(c) A hyperplane in one dimension. The hyperplane is shown as a single black point.
(d) A hyperplane in three dimensions.
(e) Many possible separating hyperplanes.

(f) The maximum-margin hyperplane. The three support vectors are circled.
(g) A data set containing one error, indicated by an arrow.
(h) A separating hyperplane with a soft margin. The error is indicated by an arrow.
(i) A nonseparable one-dimensional data set.
(j) Separating previously nonseparable data.
(k) A linearly nonseparable two-dimensional data set, which is linearly separable in four dimensions.
(l) An SVM that has overfit a two-dimensional data set.

