A Linearly Separable Problem
• Consider the binary classification problem on the figure.
– The blue points belong to one class, with label +1.
– The orange points belong to the other class, with label -1.
• These two classes are linearly separable.
– Infinitely many lines separate them.
– Are any of those infinitely many lines preferable?
A Linearly Separable Problem
• Do we prefer the blue line or the red line, as decision
boundary?
• What criterion can we use?
• Both decision boundaries classify the training data with
100% accuracy.
Margin of a Decision Boundary
• The margin of a decision boundary is defined as the
smallest distance between the boundary and any of the
samples.
Margin of a Decision Boundary
• One way to visualize the margin is this:
– For each class, draw a line that:
• is parallel to the decision boundary.
• touches the class point that is the closest to the decision boundary.
– The margin is the smallest distance between the decision
boundary and one of those two parallel lines.
• In this example, the decision boundary is equally far from both lines.
Support Vector Machines
• Support Vector Machines (SVMs) are a classification
method, whose goal is to find the decision boundary
with the maximum margin.
– The idea is that, even if multiple decision boundaries give
100% accuracy on the training data, larger margins lead to
less overfitting.
– Larger margins can tolerate more perturbations of the data.
Support Vector Machines
• Note: so far, we are only discussing cases where the training data
is linearly separable.
• First, we will see how to maximize the margin for such data.
• Second, we will deal with data that are not linearly separable.
– We will define SVMs that classify such training data imperfectly.
• Third, we will see how to define nonlinear SVMs, which can
define non-linear decision boundaries.
An example of a nonlinear decision boundary produced by a nonlinear SVM.
Support Vectors
• In the figure, the red line is the maximum margin decision
boundary.
• One of the parallel lines touches a single orange point.
– If that orange point moves closer to or farther from the red line, the
optimal boundary changes.
– If other orange points move, the optimal boundary does not change,
unless those points move to the right of the blue line.
Support Vectors
• In the figure, the red line is the maximum margin decision
boundary.
• One of the parallel lines touches two blue points.
– If either of those points moves closer to or farther from the red line, the
optimal boundary changes.
– If other blue points move, the optimal boundary does not change, unless
those points move to the left of the blue line.
Support Vectors
• In summary, in this example, the maximum margin is
defined by only three points:
– One orange point.
– Two blue points.
• These points are called support vectors.
– They are indicated by a black circle around them.
Distances to the Boundary
• The decision boundary consists of all points x that are
solutions to the equation: w^T x + b = 0.
– w is a column vector of parameters (weights).
– x is an input vector.
– b is a scalar value (a real number).
• If x_n is a training point, its distance to the boundary is
computed using this equation: |w^T x_n + b| / ||w||
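The distance and margin computations above can be sketched in code. A minimal Python illustration (the boundary and the points below are made-up values, not the data from the figures):

```python
import math

def distance_to_boundary(w, b, x):
    # |w^T x + b| / ||w||: distance of point x to the hyperplane w^T x + b = 0
    wx = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(wx + b) / norm_w

def margin(w, b, points):
    # margin of the boundary: the smallest distance to any training point
    return min(distance_to_boundary(w, b, x) for x in points)

# toy boundary x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1
points = [(2.0, 2.0), (0.0, 0.0), (3.0, 1.0)]
print(margin((1.0, 1.0), -1.0, points))  # 1/sqrt(2): the closest point is (0, 0)
```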
Distances to the Boundary
• If x_n is a training point, its distance to the boundary is computed
using this equation: |w^T x_n + b| / ||w||
• Since the training data are linearly separable, the data from
each class should fall on opposite sides of the boundary.
• Suppose that t_n = +1 for points of one class, and t_n = -1 for points of the
other class.
• Then, we can rewrite the distance as: t_n (w^T x_n + b) / ||w||
Distances to the Boundary
• So, given a decision boundary defined by w and b, and given a
training input x_n, the distance of x_n to the boundary is:
  t_n (w^T x_n + b) / ||w||
• If t_n = +1, then: the distance is (w^T x_n + b) / ||w||.
• If t_n = -1, then: the distance is -(w^T x_n + b) / ||w||.
Optimization Criterion
• The optimal boundary is defined as:
  (w, b) = argmax_{w,b} { (1/||w||) min_n [ t_n (w^T x_n + b) ] }
Scale of w
• The optimal boundary is defined as:
  (w, b) = argmax_{w,b} { (1/||w||) min_n [ t_n (w^T x_n + b) ] }
• If we rescale w and b to c·w and c·b, for any c > 0, neither the boundary
nor the distances change.
• Therefore, we are free to choose the scale of w and b so that, for the
training point x_n closest to the boundary: t_n (w^T x_n + b) = 1.
Optimization Criterion
• We introduced the requirement that: t_n (w^T x_n + b) = 1 for the
points closest to the boundary.
• Therefore, for any x_n, it holds that: t_n (w^T x_n + b) ≥ 1, and
min_n [ t_n (w^T x_n + b) ] = 1.
• The original optimization criterion becomes:
  (w, b) = argmin_{w,b} (1/2) ||w||^2, subject to the constraints t_n (w^T x_n + b) ≥ 1.
Quadratic Programming
• The quadratic programming problem is defined as follows:
• Inputs:
– c: an N-dimensional column vector.
– Q: an N×N symmetric matrix.
– A: an M×N matrix.
– d: an M-dimensional column vector.
• Output:
– u: an N-dimensional column vector, such that:
  u = argmin_u { (1/2) u^T Q u + c^T u }
subject to constraint: A u ≤ d
Quadratic Programming for SVMs
• Quadratic Programming: minimize (1/2) u^T Q u + c^T u, subject to constraint: A u ≤ d
• SVM goal: minimize (1/2) ||w||^2,
subject to constraints: t_n (w^T x_n + b) ≥ 1
• We need to define appropriate values of u, Q, c, A, d, so that
quadratic programming computes w and b.
Quadratic Programming for SVMs
• Quadratic Programming constraint: A u ≤ d
• SVM constraints: t_n (w^T x_n + b) ≥ 1, or equivalently: -t_n (w^T x_n + b) ≤ -1
Quadratic Programming for SVMs
• SVM constraints: -t_n (w^T x_n + b) ≤ -1
• Define: u = (w, b), row n of A = -t_n (x_n^T, 1), d = (-1, ..., -1)^T
– Matrix A is N×(D+1), vector u has D+1 rows, vector d has N rows.
Quadratic Programming for SVMs
• Quadratic programming: minimize (1/2) u^T Q u + c^T u
• SVM: minimize (1/2) ||w||^2
• Define: u = (w, b) (already defined in the previous slides),
c = the (D+1)-dimensional column vector of zeros,
Q = a (D+1)×(D+1) matrix.
– Q is like the identity matrix, except that its last diagonal value is 0.
• Then, (1/2) u^T Q u + c^T u = (1/2) ||w||^2.
Quadratic Programming for SVMs
• Quadratic programming: minimize (1/2) u^T Q u + c^T u
• SVM: minimize (1/2) ||w||^2
• Alternative definitions that would NOT work:
• Define: u = w, Q = I, c = 0
– Q is the D×D identity matrix, u = w, c is the D-dimensional zero vector.
• It still holds that (1/2) u^T Q u + c^T u = (1/2) ||w||^2.
• Why would these definitions not work?
• Vector u must also make A u match the SVM constraints.
– The constraints involve b, which does not appear in u = w, so
no appropriate A and d can be found.
Quadratic Programming for SVMs
• Quadratic programming: minimize (1/2) u^T Q u + c^T u, subject to constraint: A u ≤ d
• SVM goal: minimize (1/2) ||w||^2,
subject to constraints: t_n (w^T x_n + b) ≥ 1
• Task: define u, Q, c, A, d, so that quadratic programming computes w and b.
• u = (w, b); Q = the (D+1)×(D+1) matrix that is like the identity, except that its
last diagonal value is 0; c = 0; row n of A = -t_n (x_n^T, 1); d = (-1, ..., -1)^T.
• w is the vector of values at dimensions 1, ..., D of u.
• b is the value at dimension D+1 of u.
Solving the Same Problem, Again
• So far, we have solved the problem of defining an SVM (i.e.,
defining and ), so as to maximize the margin between linearly
separable data.
• If this were all that SVMs can do, SVMs would not be that
important.
– Linearly separable data are a rare case, and a very easy case to deal with.
Solving the Same Problem, Again
• So far, we have solved the problem of defining an SVM (i.e.,
defining and ), so as to maximize the margin between linearly
separable data.
• We will see two extensions that
make SVMs much more powerful.
– The extensions will allow SVMs to
define highly non-linear decision
boundaries, as in this figure.
• However, first we need to solve
the same problem again.
– Maximize the margin between
linearly separable data.
• We will get a more complicated solution, but that solution will be
easier to improve upon.
Lagrange Multipliers
• Our new solutions are derived using Lagrange
multipliers.
• Here is a quick review from multivariate calculus.
• Let x be a D-dimensional vector.
• Let f(x) and g(x) be functions from R^D to R.
– Functions f and g map D-dimensional vectors to real numbers.
• Suppose that we want to minimize f(x), subject to the
constraint that g(x) ≥ 0.
• Then, we can solve this problem using a Lagrange
multiplier to define a Lagrangian function.
Lagrange Multipliers
• To minimize f(x), subject to the constraint: g(x) ≥ 0:
• We define the Lagrangian function:
  L(x, λ) = f(x) - λ g(x)
– λ is called a Lagrange multiplier, λ ≥ 0.
• We find x, and a corresponding value for λ, subject to the following
constraints:
  g(x) ≥ 0,  λ ≥ 0,  λ g(x) = 0
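As a quick worked illustration of this recipe (a made-up toy problem, not one from the slides): minimize f(x) = x_1^2 + x_2^2, subject to g(x) = x_1 + x_2 - 1 ≥ 0.

```latex
L(\mathbf{x}, \lambda) = x_1^2 + x_2^2 - \lambda\,(x_1 + x_2 - 1)

\frac{\partial L}{\partial x_1} = 2x_1 - \lambda = 0, \quad
\frac{\partial L}{\partial x_2} = 2x_2 - \lambda = 0
\;\Rightarrow\; x_1 = x_2 = \frac{\lambda}{2}
```

The unconstrained minimum (0, 0) violates g(x) ≥ 0, so the constraint must be active: λ g(x) = 0 with λ > 0 forces x_1 + x_2 = 1, which gives λ = 1 and x = (1/2, 1/2). All three conditions (g(x) ≥ 0, λ ≥ 0, λ g(x) = 0) are satisfied.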
Lagrange Dual Problems
• We have constraints: g_n(x) ≥ 0, λ_n ≥ 0, λ_n g_n(x) = 0.
• We want to minimize f(x), subject to those constraints.
• Under some conditions (which are satisfied in our SVM problem),
we can solve an alternative dual problem:
• Define the Lagrangian function as before:
  L(x, λ) = f(x) - Σ_n λ_n g_n(x)
• Solve: max_λ min_x L(x, λ),
subject to constraints: λ_n ≥ 0
Lagrange Dual Problems
• Lagrangian dual problem: Solve: max_λ min_x L(x, λ),
subject to constraints: λ_n ≥ 0
• This dual problem formulation will be used in training
SVMs.
• The key thing to remember is:
– We minimize the Lagrangian L with respect to x.
– We maximize L with respect to the Lagrange multipliers λ_n.
Lagrange Multipliers and SVMs
• SVM goal: minimize (1/2) ||w||^2, subject to constraints:
  t_n (w^T x_n + b) ≥ 1, for n = 1, ..., N
• Rewriting the constraints as g_n(w, b) = t_n (w^T x_n + b) - 1 ≥ 0,
the Lagrangian function is:
  L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^{N} a_n { t_n (w^T x_n + b) - 1 }
Lagrange Multipliers and SVMs
• The w and b that minimize L must satisfy:
  ∂L/∂w = 0  ⇒  w = Σ_{n=1}^{N} a_n t_n x_n
Lagrange Multipliers and SVMs
• The w and b that minimize L must satisfy:
  ∂L/∂b = 0  ⇒  Σ_{n=1}^{N} a_n t_n = 0
Lagrange Multipliers and SVMs
• Our Lagrangian function is:
  L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^{N} a_n { t_n (w^T x_n + b) - 1 }
Lagrange Multipliers and SVMs
• We showed that:
  w = Σ_{n=1}^{N} a_n t_n x_n  and  Σ_{n=1}^{N} a_n t_n = 0
Lagrange Multipliers and SVMs
• We showed that Σ_{n=1}^{N} a_n t_n = 0. Using that, the term involving b
drops out, and we get:
  L(w, a) = (1/2) ||w||^2 - Σ_{n=1}^{N} a_n t_n w^T x_n + Σ_{n=1}^{N} a_n
Lagrange Multipliers and SVMs
• By combining the results from the last few slides (substituting
w = Σ_n a_n t_n x_n), our optimization problem becomes:
Maximize
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
subject to: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
SVM Optimization Problem
• Our SVM optimization problem now is to find an a = (a_1, ..., a_N) such that:
  a = argmax_a { Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m }
subject to: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
Using Quadratic Programming
• Quadratic programming: minimize (1/2) u^T Q u + c^T u, subject to constraint: A u ≤ d
• SVM problem: find a maximizing
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
subject to constraints: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
• Again, we must find values for u, Q, c, A, d that convert the SVM problem
into a quadratic programming problem.
• Note that we already have a matrix in the Lagrangian. It is an
N×N matrix Q such that Q_{nm} = t_n t_m x_n^T x_m.
• We will use the same Q for quadratic programming.
Using Quadratic Programming
• Quadratic programming: minimize (1/2) u^T Q u + c^T u
• SVM problem: find a maximizing
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
• Define: u = a, and define the N-dimensional vector c = (-1, ..., -1)^T.
• Then, (1/2) u^T Q u + c^T u = -L(a), so maximizing L(a) is the same as
minimizing (1/2) u^T Q u + c^T u.
• We have mapped the SVM goal of finding a to the
quadratic programming minimization goal.
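As an illustration of this mapping, here is a minimal Python sketch that builds Q and c from toy data (the points and labels are made-up values):

```python
def dual_qp_inputs(xs, ts):
    # Q[n][m] = t_n * t_m * (x_n . x_m), and c = (-1, ..., -1),
    # so that (1/2) u^T Q u + c^T u equals -L(u)
    dot = lambda p, q: sum(pi * qi for pi, qi in zip(p, q))
    N = len(xs)
    Q = [[ts[n] * ts[m] * dot(xs[n], xs[m]) for m in range(N)] for n in range(N)]
    c = [-1.0] * N
    return Q, c

# made-up toy data: two positive points, one negative point
xs = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0)]
ts = [+1, +1, -1]
Q, c = dual_qp_inputs(xs, ts)
print(Q)  # [[2.0, 4.0, 2.0], [4.0, 8.0, 4.0], [2.0, 4.0, 2.0]]
```

Note that Q comes out symmetric, as the quadratic programming definition requires.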
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
• Define: A = an (N+2)×N matrix, d = the (N+2)-dimensional
column vector of zeros.
• The first N rows of A are the negation of the identity matrix.
• Row N+1 of A is the transpose of vector t of target outputs.
• Row N+2 of A is the negation of the previous row.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
• Define: A = the (N+2)×N matrix of the previous slide, d = the (N+2)-dimensional
column vector of zeros.
• Since we defined d = 0:
– For n ≤ N, the n-th row of A and d specifies that -a_n ≤ 0.
– The (N+1)-th row of A and d specifies that Σ_{n} a_n t_n ≤ 0.
– The (N+2)-th row of A and d specifies that -Σ_{n} a_n t_n ≤ 0.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: a_n ≥ 0, Σ_{n=1}^{N} a_n t_n = 0
• Define: A = the (N+2)×N matrix of the previous slides, d = the (N+2)-dimensional
column vector of zeros.
• Since we defined d = 0:
– For n ≤ N, the n-th row of A and d specifies that a_n ≥ 0.
– The last two rows of A and d specify that Σ_{n} a_n t_n = 0.
• We have mapped the SVM constraints to the quadratic
programming constraint A u ≤ d.
Interpretation of the Solution
• Quadratic programming, given the inputs defined in the
previous slides, outputs the optimal value for vector a.
• We used this Lagrangian function:
  L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^{N} a_n { t_n (w^T x_n + b) - 1 }
Interpretation of the Solution
• In the SVM case, it holds that (these are the KKT conditions):
  a_n ≥ 0
  t_n (w^T x_n + b) - 1 ≥ 0
  a_n { t_n (w^T x_n + b) - 1 } = 0
Interpretation of the Solution
• In the SVM case, it holds that: a_n { t_n (w^T x_n + b) - 1 } = 0.
• Therefore, for every training point x_n, either a_n = 0, or
t_n (w^T x_n + b) = 1 (that is, x_n is on the margin: a support vector).
Interpretation of the Solution
• We showed before that:
  w = Σ_{n=1}^{N} a_n t_n x_n
• This means that w is a linear combination of the
training data.
• However, since a_n ≠ 0 only for the support vectors,
obviously only the support vectors influence w.
Computing b
• If x_n is a support vector, then t_n (w^T x_n + b) = 1, which
(since t_n = ±1) implies: w^T x_n + b = t_n.
• Thus, if x_n is a support vector, we can compute b with formula:
  b = t_n - Σ_{m=1}^{N} a_m t_m x_m^T x_n
Classification Using the a_n
• To classify a test object x, we can use the original
formula: y(x) = w^T x + b.
• However, since w = Σ_{n=1}^{N} a_n t_n x_n, we can substitute that formula for w,
and classify x using:
  y(x) = Σ_{n=1}^{N} a_n t_n x_n^T x + b
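The two formulas (computing b from a support vector, and classifying via the a_n) can be sketched as follows. The multiplier values below are the known analytic solution for a trivial two-point problem, not the output of an actual quadratic programming solver:

```python
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def compute_b(a, ts, xs, sv_index):
    # b = t_n - sum_m a_m t_m (x_m . x_n), where x_n is a support vector
    xn = xs[sv_index]
    return ts[sv_index] - sum(a[m] * ts[m] * dot(xs[m], xn) for m in range(len(xs)))

def classify(a, ts, xs, b, x):
    # sign of y(x) = sum_n a_n t_n (x_n . x) + b
    y = sum(a[n] * ts[n] * dot(xs[n], x) for n in range(len(xs))) + b
    return +1 if y >= 0 else -1

# trivial 1-D problem: x = 1 labeled +1, x = -1 labeled -1.
# The analytic dual solution is a_1 = a_2 = 0.5 (both points are support vectors).
a, ts, xs = [0.5, 0.5], [+1, -1], [(1.0,), (-1.0,)]
b = compute_b(a, ts, xs, 0)
print(b)                                 # 0.0
print(classify(a, ts, xs, b, (2.0,)))    # 1
print(classify(a, ts, xs, b, (-3.0,)))   # -1
```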
Recap of Lagrangian-Based Solution
• We defined the Lagrangian function, which became (after
simplifications):
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
• Quadratic programming gives the optimal values a_n. Then, where S is the
set of support vector indices and N_S = |S|:
  w = Σ_{n∈S} a_n t_n x_n
  b = (1/N_S) Σ_{n∈S} ( t_n - Σ_{m∈S} a_m t_m x_m^T x_n )
Why Did We Do All This?
• The Lagrangian-based solution solves a problem that
we had already solved, in a more complicated way.
• Typically, we prefer simpler solutions.
• However, this more complicated solution can be
tweaked relatively easily, to produce more powerful
SVMs that:
– Can be trained even if the training data is not linearly
separable.
– Can produce nonlinear decision boundaries.
The Not Linearly Separable Case
• Our previous formulation required that the training examples
are linearly separable.
• This requirement was encoded in the constraint:
  t_n (w^T x_n + b) ≥ 1
• If t_n (w^T x_n + b) ≥ 1, then x_n is where it
should be:
– either on the blue line for
objects of its class, or on the
correct side of that blue line.
• If 0 < t_n (w^T x_n + b) < 1, then x_n is too
close to the decision boundary.
– Between the red line and the
blue line for objects of its class.
• If t_n (w^T x_n + b) = 0, x_n is on the red line (the decision boundary).
• If t_n (w^T x_n + b) < 0, x_n is misclassified (on the wrong side of the red line).
Slack Variables
• If the training data is not linearly
separable, we introduce
slack variables ξ_n ≥ 0, to
define a modified constraint:
  t_n (w^T x_n + b) ≥ 1 - ξ_n
• We then minimize: C Σ_{n=1}^{N} ξ_n + (1/2) ||w||^2,
where the parameter C > 0 controls how heavily violations are penalized.
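For a fixed boundary, the smallest slack value satisfying the modified constraint is ξ_n = max(0, 1 - t_n (w^T x_n + b)). A minimal Python sketch (the boundary and points are made-up values):

```python
def slack(w, b, x, t):
    # smallest xi >= 0 satisfying t * (w^T x + b) >= 1 - xi
    y = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - t * y)

# boundary x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1
print(slack((1.0, 1.0), -1.0, (2.0, 2.0), +1))  # 0.0: outside the margin, correct side
print(slack((1.0, 1.0), -1.0, (0.6, 0.6), +1))  # about 0.8: inside the margin
print(slack((1.0, 1.0), -1.0, (0.0, 0.0), +1))  # 2.0: misclassified
```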
Lagrangian Function
• To do our constrained optimization, we define the Lagrangian:
  L = C Σ_{n=1}^{N} ξ_n + (1/2) ||w||^2 - Σ_{n=1}^{N} a_n { t_n (w^T x_n + b) - 1 + ξ_n } - Σ_{n=1}^{N} μ_n ξ_n
Lagrangian Function
• Lagrangian L: as defined in the previous slide.
• Constraints (the KKT conditions):
  a_n ≥ 0,  t_n (w^T x_n + b) - 1 + ξ_n ≥ 0,  a_n { t_n (w^T x_n + b) - 1 + ξ_n } = 0,
  μ_n ≥ 0,  ξ_n ≥ 0,  μ_n ξ_n = 0
Lagrangian Function
• Setting to 0 the derivative of the Lagrangian L with respect to b:
  ∂L/∂b = 0  ⇒  Σ_{n=1}^{N} a_n t_n = 0
Lagrangian Function
• Setting to 0 the derivative of the Lagrangian L with respect to w:
  ∂L/∂w = 0  ⇒  w = Σ_{n=1}^{N} a_n t_n x_n
Lagrangian Function
• Setting to 0 the derivative of the Lagrangian L with respect to ξ_n:
  ∂L/∂ξ_n = 0  ⇒  a_n = C - μ_n
Lagrangian Function
• By eliminating w, b, and ξ_n, we get:
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
Lagrangian Function
• So, we have ended up with the same Lagrangian as in the linearly
separable case:
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
Lagrangian Function
• So, we have ended up with the same Lagrangian as in the linearly
separable case:
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
• However, the constraints change: since μ_n ≥ 0 and a_n = C - μ_n,
we now require 0 ≤ a_n ≤ C.
Using Quadratic Programming
• Quadratic programming: minimize (1/2) u^T Q u + c^T u, subject to constraint: A u ≤ d
• SVM problem: find a maximizing
  L(a) = Σ_{n=1}^{N} a_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n a_m t_n t_m x_n^T x_m
subject to constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• Again, we must find values for u, Q, c, A, d that convert the SVM problem
into a quadratic programming problem.
• Values for Q and c are the same as in the linearly separable case,
since in both cases we minimize the same function (1/2) u^T Q u + c^T u = -L(a).
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• Define: A = a (2N+2)×N matrix, d = a (2N+2)-dimensional column vector.
• The top N rows of A are the negation of the identity matrix;
rows 1 to N of d are set to 0.
• Rows N+1 to 2N of A are the identity matrix;
rows N+1 to 2N of d are set to C.
• Row 2N+1 of A is the transpose of vector t of target outputs;
row 2N+1 of d is set to 0.
• Row 2N+2 of A is the negation of the previous row;
row 2N+2 of d is set to 0.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• With A and d as defined in the previous slide:
• If n ≤ N, the n-th row of A gives -a_n, and the n-th row of d is 0.
• Thus, the n-th rows of A and d capture constraint a_n ≥ 0.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• With A and d as defined in the previous slides:
• If N+1 ≤ n ≤ 2N, the n-th row of A gives a_{n-N}, and the n-th row of d is C.
• Thus, these rows of A and d capture constraint a_n ≤ C.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• With A and d as defined in the previous slides:
• Row 2N+1 of A gives Σ_{n} t_n a_n, and row 2N+1 of d is 0.
• Thus, row 2N+1 of A and d captures constraint Σ_{n} a_n t_n ≤ 0.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• With A and d as defined in the previous slides:
• Row 2N+2 of A gives -Σ_{n} t_n a_n, and row 2N+2 of d is 0.
• Thus, row 2N+2 of A and d captures constraint -Σ_{n} a_n t_n ≤ 0.
Using Quadratic Programming
• Quadratic programming constraint: A u ≤ d
• SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^{N} a_n t_n = 0
• Thus, the last two rows of A and d in combination capture constraint
Σ_{n=1}^{N} a_n t_n = 0.
• We have mapped the SVM constraints to the quadratic
programming constraint A u ≤ d.
Using the Solution
• Quadratic programming, given the inputs defined in the
previous slides, outputs the optimal value for vector .
• The value of b is computed as in the linearly separable case:
  b = (1/N_S) Σ_{n∈S} ( t_n - Σ_{m∈S} a_m t_m x_m^T x_n )
The Disappearing w
• We have used vector w to define the decision boundary: w^T x + b = 0.
• However, w does not need to be either computed, or used.
• Quadratic programming computes the values a_n.
• Using those values a_n, we compute b, and classify any input x using:
  y(x) = Σ_{n=1}^{N} a_n t_n x_n^T x + b
The Role of Training Inputs
• Where, during training, do we use input vectors x_n?
• Where, during classification, do we use input vectors x_n?
• Overall, input vectors are only used in two formulas:
1. During training, we use vectors x_n to define matrix Q, where:
   Q_{nm} = t_n t_m x_n^T x_m
2. During classification, we compute: y(x) = Σ_{n=1}^{N} a_n t_n x_n^T x + b
The Role of Training Inputs
• To make this more clear, define a kernel function k as:
  k(x_n, x_m) = x_n^T x_m
The Kernel Trick
• We have defined the kernel function k(x_n, x_m) = x_n^T x_m.
• During training, we use vectors x_n to define matrix Q, where:
  Q_{nm} = t_n t_m k(x_n, x_m)
• During classification, we compute: y(x) = Σ_{n=1}^{N} a_n t_n k(x_n, x) + b.
• The kernel trick: we can replace k with a different kernel function,
and the training and classification algorithms stay exactly the same.
A Different Kernel
• Let x and z be 2-dimensional vectors.
• Consider this alternative definition for the kernel:
  k(x, z) = (x^T z + 1)^2
• Then: k(x, z) = φ(x)^T φ(z), where
  φ(x) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2)^T
Polynomial Kernels
• Let x and z be D-dimensional vectors.
• A polynomial kernel of degree p is defined as:
  k(x, z) = (x^T z + 1)^p
• A polynomial kernel of degree p = 1 produces a decision boundary
that is identical to
the result using
the standard dot
product as kernel.
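The equivalence between the degree-2 polynomial kernel and an explicit feature mapping (for 2-dimensional inputs) can be checked numerically. This sketch uses the feature map φ(x) = (1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2) from the earlier slide:

```python
import math

def poly_kernel(x, z, p):
    # k(x, z) = (x^T z + 1)^p
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1.0) ** p

def phi(x):
    # explicit feature map matching the degree-2 polynomial kernel in 2 dimensions
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly_kernel(x, z, 2)
k_feature = sum(u * v for u, v in zip(phi(x), phi(z)))
print(k_direct)                          # 4.0, since (1*3 + 2*(-1) + 1)^2 = 4
print(abs(k_direct - k_feature) < 1e-9)  # True: the two computations agree
```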
Polynomial Kernels – An Easy Case
Decision boundary with polynomial kernel of degree 2.
The decision boundary is not linear anymore.
Decision boundary with polynomial kernel of degree 3.
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 1.
Decision boundary with polynomial kernel of degree 2.
Decision boundary with polynomial kernel of degree 3.
Decision boundary with polynomial kernel of degree 4.
Decision boundary with polynomial kernel of degree 5.
Decision boundary with polynomial kernel of degree 6.
Decision boundary with polynomial kernel of degree 7.
Decision boundary with polynomial kernel of degree 8.
Decision boundary with polynomial kernel of degree 9.
Decision boundary with polynomial kernel of degree 10.
Decision boundary with polynomial kernel of degree 20.
Decision boundary with polynomial kernel of degree 100.
RBF/Gaussian Kernels
• The Radial Basis Function (RBF) kernel, also known
as Gaussian kernel, is defined as:
  k(x, z) = exp( -||x - z||^2 / (2σ^2) )
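A minimal Python sketch of this formula (the inputs and the value σ = 1 are made-up illustration values):

```python
import math

def rbf_kernel(x, z, sigma):
    # k(x, z) = exp( -||x - z||^2 / (2 sigma^2) )
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma * sigma))

print(rbf_kernel((1.0, 2.0), (1.0, 2.0), 1.0))  # 1.0: identical inputs
print(rbf_kernel((0.0, 0.0), (3.0, 4.0), 1.0))  # exp(-12.5): distance 5, output near 0
```

The output is largest (equal to 1) when x = z, and decays toward 0 as the distance grows; σ controls how fast the decay happens.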
RBF Output Vs. Distance
• X axis: distance between x and z.
• Y axis: k(x, z), plotted for several different values of σ.
RBF Kernels – An Easier Dataset
Decision boundary with a linear kernel.
Decision boundaries with an RBF kernel.
RBF Kernels – A Harder Dataset
Decision boundaries with an RBF kernel. In the first case, the
boundary is almost linear.
RBF Kernels and Basis Functions
• The RBF kernel is defined as:
  k(x, z) = exp( -||x - z||^2 / (2σ^2) )
• This kernel corresponds to a basis function φ(x) that maps x to an
infinite-dimensional vector.
Kernels for Non-Vector Data
• So far, all our methods have been taking real-valued vectors as
input.
– The inputs have been elements of R^D, for some dimensionality D.
• However, there are many interesting problems where the
input is not such real-valued vectors.
• Examples???
Kernels for Non-Vector Data
• So far, all our methods have been taking real-valued vectors as
input.
– The inputs have been elements of R^D, for some dimensionality D.
• However, there are many interesting problems where the
input is not such real-valued vectors.
• The inputs can be strings, like "cat", "elephant", "dog".
• The inputs can be sets, such as {1, 5, 3}, {5, 1, 2, 7}.
– Sets are NOT vectors. They have very different properties.
– As sets, {1, 5, 3} = {5, 3, 1}.
– As vectors, (1, 5, 3) ≠ (5, 3, 1).
• There are many other types of non-vector data…
• SVMs can be applied to such data, as long as we define an
appropriate kernel function.
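For example, one simple kernel on finite sets (used here purely as an illustration, not a kernel mentioned in the slides) is the size of the intersection, which ignores element order just as sets do:

```python
def set_kernel(s, t):
    # k(S, T) = |S intersect T|: depends only on membership, not on any ordering
    return len(s & t)

print(set_kernel({1, 5, 3}, {5, 1, 2, 7}))  # 2: the common elements are {1, 5}
print({1, 5, 3} == {5, 3, 1})               # True: as sets, order does not matter
```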
Training Time Complexity
• Solving the quadratic programming problem takes O(M^3) time,
where M is the dimensionality of vector u.
– In other words, M is the number of values we want to estimate.
• If we use quadratic programming to compute w and b directly, then we
estimate D+1 values.
– The time complexity is O(D^3), where D is the dimensionality of input vectors
(or the dimensionality of φ(x), if we use a basis function φ).
• If we use quadratic programming to compute vector a, then we
estimate N values.
– The time complexity is O(N^3), where N is the number of training inputs.
• Which one is faster?
Training Time Complexity
• If we use quadratic programming to compute directly w and b,
then the time complexity is O(D^3).
– If we use no basis function, D is the dimensionality of the input vectors.
– If we use a basis function φ, D is the dimensionality of φ(x).
• If we use quadratic programming to compute vector a, the time
complexity is O(N^3).
• For linear SVMs (i.e., SVMs with linear kernels, that use the
regular dot product), usually D is much smaller than N.
• If we use RBF kernels, φ(x) is infinite-dimensional.
– Computing φ(x) is impossible, but computing vector a still takes O(N^3) time.
• If you want to use the kernel trick, then there is no choice: it
takes O(N^3) time to do training.
SVMs for Multiclass Problems
• As usual, you can always train one-vs.-all SVMs if there are
more than two classes.
• Other, more complicated methods are also available.
• You can also train what is called "all-pairs" classifiers:
– Each SVM is trained to discriminate between two classes.
– The number of SVMs is quadratic to the number of classes.
• All-pairs classifiers can be used in different ways to classify an
input object.
– Each pairwise classifier votes for one of the two classes it was trained
on. Classification time is quadratic to the number of classes.
– There is an alternative method, called DAGSVM, where the all-pairs
classifiers are organized in a directed acyclic graph, and classification
time is linear to the number of classes.
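The all-pairs voting scheme can be sketched as follows; the pairwise "classifiers" here are made-up threshold functions standing in for trained SVMs:

```python
from itertools import combinations

def all_pairs_vote(classes, pairwise, x):
    # pairwise[(ci, cj)] returns either ci or cj for input x;
    # each of the C*(C-1)/2 classifiers votes, and the most-voted class wins
    votes = {c: 0 for c in classes}
    for ci, cj in combinations(classes, 2):
        votes[pairwise[(ci, cj)](x)] += 1
    return max(classes, key=lambda c: votes[c])

# stand-in "classifiers" for 3 classes, thresholding a 1-D input
pairwise = {
    ("a", "b"): lambda x: "a" if x < 1.0 else "b",
    ("a", "c"): lambda x: "a" if x < 2.0 else "c",
    ("b", "c"): lambda x: "b" if x < 2.0 else "c",
}
print(all_pairs_vote(["a", "b", "c"], pairwise, 1.5))  # b
```

With C classes there are C·(C-1)/2 pairwise classifiers, which is why plain voting takes time quadratic in the number of classes.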
SVMs: Recap
• Advantages:
– Training finds the globally best solution.
• No need to worry about local optima, or iterations.
– SVMs can define complex decision boundaries.
• Disadvantages:
– Training time is cubic to the number of training data. This
makes it hard to apply SVMs to large datasets.
– High-dimensional kernels increase the risk of overfitting.
• Usually larger training sets help reduce overfitting, but SVMs cannot
be applied to large training sets due to cubic time complexity.
– Some choices must still be made manually.
• Choice of kernel function, and of kernel parameters (such as the
degree p, or the width σ).
• Choice of parameter C in the soft-margin objective C Σ_{n=1}^{N} ξ_n + (1/2) ||w||^2.