
CIS 520: Machine Learning Spring 2021: Lecture 6

Support Vector Machines for Classification and Regression

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Linearly separable data: Hard margin SVMs
• Non-linearly separable data: Soft margin SVMs
• Loss minimization view
• Support vector regression (SVR)

1 Linearly Separable Data: Hard Margin SVMs

In this lecture we consider linear support vector machines (SVMs); we will consider nonlinear extensions in the next lecture. Let $X = \mathbb{R}^d$, and consider a binary classification task with $Y = \hat{Y} = \{\pm 1\}$. A training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \{\pm 1\})^m$ is said to be linearly separable if there exists a linear classifier $h_{w,b}(x) = \mathrm{sign}(w^\top x + b)$ which classifies all examples in $S$ correctly, i.e. for which $y_i(w^\top x_i + b) > 0$ $\forall i \in [m]$. For example, Figure 1 (left) shows a training sample in $\mathbb{R}^2$ that is linearly separable, together with two possible linear classifiers that separate the data correctly (note that the decision surface of a linear classifier in 2 dimensions is a line, and more generally in $d > 2$ dimensions is a hyperplane). Which of the two classifiers is likely to give better generalization performance?

Figure 1: Left: A linearly separable data set, with two possible linear classifiers that separate the data. Blue
circles represent class label 1 and red crosses −1; the arrow represents the direction of positive classification.
Right: The same data set and classifiers, with margin of separation shown.

Although both classifiers separate the data, the distance or margin with which the separation is achieved is different; this is shown in Figure 1 (right). For the rest of this section, assume that the training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is linearly separable; in this setting, the SVM algorithm selects the maximum margin linear classifier, i.e. the linear classifier that separates the training data with the largest margin.

More precisely, define the (geometric) margin of a linear classifier $h_{w,b}(x) = \mathrm{sign}(w^\top x + b)$ on an example $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$ as
$$\gamma_i = \frac{y_i(w^\top x_i + b)}{\|w\|_2} \,. \tag{1}$$
Note that the distance of $x_i$ from the hyperplane $w^\top x + b = 0$ is given by $\frac{|w^\top x_i + b|}{\|w\|_2}$; therefore the above margin on $(x_i, y_i)$ is simply a signed version of this distance, with a positive sign if the example is classified correctly and negative otherwise. The (geometric) margin of $h_{w,b}$ on the sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is then defined as the minimal margin on examples in $S$:
$$\gamma = \min_{i \in [m]} \gamma_i \,. \tag{2}$$
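As a quick numerical illustration of these definitions (an addition, not part of the original notes), the following NumPy sketch computes the per-example margins of Eq. (1) and the sample margin of Eq. (2) for a hypothetical classifier $(w, b)$ on made-up data.

```python
import numpy as np

# Hypothetical linearly separable data in R^2 (made-up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# A candidate linear classifier h_{w,b}(x) = sign(w^T x + b).
w = np.array([1.0, 1.0])
b = -0.5

# Per-example geometric margins, Eq. (1): gamma_i = y_i (w^T x_i + b) / ||w||_2.
gamma_i = y * (X @ w + b) / np.linalg.norm(w)

# Margin on the whole sample, Eq. (2): the minimum over the examples.
gamma = gamma_i.min()
print(gamma_i, gamma)
```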

Given a linearly separable training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \{\pm 1\})^m$, the hard margin SVM algorithm finds a linear classifier that maximizes the above margin on $S$. In particular, any linear classifier that separates $S$ correctly will have margin $\gamma > 0$; without loss of generality, we can represent any such classifier by some $(w, b)$ such that
$$\min_{i \in [m]} y_i(w^\top x_i + b) = 1 \,. \tag{3}$$

The margin of such a classifier on $S$ then becomes simply
$$\gamma = \min_{i \in [m]} \frac{y_i(w^\top x_i + b)}{\|w\|_2} = \frac{1}{\|w\|_2} \,. \tag{4}$$
Thus, maximizing the margin becomes equivalent to minimizing the norm $\|w\|_2$ subject to the constraints in Eq. (3), which can be written as the following optimization problem:
$$\min_{w,b} \ \frac{1}{2}\|w\|_2^2 \tag{5}$$
subject to
$$y_i(w^\top x_i + b) \ge 1 \,, \quad i = 1, \ldots, m. \tag{6}$$
This is a convex quadratic program (QP) and can in principle be solved directly. However it is useful to consider the dual of the above problem, which sheds light on the structure of the solution and also facilitates the extension to nonlinear classifiers which we will see in the next lecture. Note that by our assumption that the data is linearly separable, the above problem satisfies Slater's condition, and so strong duality holds. Therefore solving the dual problem is equivalent to solving the above primal problem. Introducing dual variables (or Lagrange multipliers) $\alpha_i \ge 0$ ($i = 1, \ldots, m$) for the inequality constraints above gives the Lagrangian function
$$L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^m \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr) \,. \tag{7}$$
The (Lagrange) dual function is then given by
$$\phi(\alpha) = \inf_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} L(w, b, \alpha) \,.$$

To compute the dual function, we set the derivatives of $L(w, b, \alpha)$ w.r.t. $w$ and $b$ to zero; this gives the following:
$$w = \sum_{i=1}^m \alpha_i y_i x_i \tag{8}$$
$$\sum_{i=1}^m \alpha_i y_i = 0 \,. \tag{9}$$

Substituting these back into $L(w, b, \alpha)$, we have the following dual function:
$$\phi(\alpha) = -\frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \,;$$
this dual function is defined over the domain $\bigl\{\alpha \in \mathbb{R}^m : \sum_{i=1}^m \alpha_i y_i = 0\bigr\}$. This leads to the following dual problem:
$$\max_{\alpha} \ -\frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \tag{10}$$
subject to
$$\sum_{i=1}^m \alpha_i y_i = 0 \tag{11}$$
$$\alpha_i \ge 0 \,, \quad i = 1, \ldots, m. \tag{12}$$

This is again a convex QP (in the $m$ variables $\alpha_i$) and can be solved efficiently using numerical optimization methods. On obtaining the solution $\hat{\alpha}$ to the above dual problem, the weight vector $\hat{w}$ corresponding to the maximal margin classifier can be obtained via Eq. (8):
$$\hat{w} = \sum_{i=1}^m \hat{\alpha}_i y_i x_i \,.$$

Now, by the complementary slackness condition in the KKT conditions, we have for each $i \in [m]$,
$$\hat{\alpha}_i \bigl(1 - y_i(\hat{w}^\top x_i + \hat{b})\bigr) = 0 \,.$$

This gives
$$\hat{\alpha}_i > 0 \implies 1 - y_i(\hat{w}^\top x_i + \hat{b}) = 0 \,.$$
In other words, $\hat{\alpha}_i$ is positive only for training points $x_i$ that lie on the margin, i.e. that are closest to the separating hyperplane; these points are called the support vectors. For all other training points $x_i$, we have $\hat{\alpha}_i = 0$. Thus the solution for $\hat{w}$ can be written as a linear combination of just the support vectors; specifically, if we define
$$\mathrm{SV} = \bigl\{ i \in [m] : \hat{\alpha}_i > 0 \bigr\} \,,$$
then we have
$$\hat{w} = \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i x_i \,.$$

Moreover, for all $i \in \mathrm{SV}$, we have
$$1 - y_i(\hat{w}^\top x_i + \hat{b}) = 0 \quad \text{or} \quad y_i - (\hat{w}^\top x_i + \hat{b}) = 0 \,.$$

This allows us to obtain $\hat{b}$ from any of the support vectors; in practice, for numerical stability, one generally averages over all the support vectors, giving
$$\hat{b} = \frac{1}{|\mathrm{SV}|} \sum_{i \in \mathrm{SV}} (y_i - \hat{w}^\top x_i) \,.$$

In order to classify a new point $x \in \mathbb{R}^d$ using the learned classifier, one then computes
$$h_{\hat{w},\hat{b}}(x) = \mathrm{sign}(\hat{w}^\top x + \hat{b}) = \mathrm{sign}\Bigl( \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i (x_i^\top x) + \hat{b} \Bigr) \,. \tag{13}$$
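To make the above pipeline concrete, here is a minimal sketch (an addition, not from the lecture) that solves the hard margin primal QP of Eqs. (5)-(6) with the cvxpy modeling library, reads off the dual variables $\alpha_i$ of the margin constraints, and recovers $\hat{w}$, $\hat{b}$ and the prediction of Eq. (13) from the support vectors. The toy data, the choice of a general-purpose QP modeling tool, and the $10^{-6}$ threshold for detecting $\hat{\alpha}_i > 0$ are all illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable toy data in R^2 (for illustration only).
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m, d = X.shape

# Hard margin primal, Eqs. (5)-(6): minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) >= 1.
w = cp.Variable(d)
b = cp.Variable()
margin_constraint = cp.multiply(y, X @ w + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), [margin_constraint])
prob.solve()

# The dual values of the margin constraints are the alpha_i of Eqs. (10)-(12).
alpha = margin_constraint.dual_value
sv = np.where(alpha > 1e-6)[0]          # support vectors: alpha_i > 0

# Recover w_hat from the support vectors (Eq. (8) restricted to SV)
# and b_hat by averaging y_i - w_hat^T x_i over the support vectors.
w_hat = (alpha[sv] * y[sv]) @ X[sv]
b_hat = np.mean(y[sv] - X[sv] @ w_hat)

# Classify a new point as in Eq. (13).
x_new = np.array([1.0, 1.5])
label = np.sign(w_hat @ x_new + b_hat)
print(sv, w_hat, b_hat, label)
```

In practice one would usually call a dedicated SVM solver rather than a general QP modeler; the point here is only to mirror the derivation above step by step.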

2 Non-Linearly Separable Data: Soft Margin SVMs

The above derivation assumed the existence of a linear classifier that can correctly classify all examples in a given training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$. But what if the sample is not linearly separable? In this case, one needs to allow for the possibility of errors in classification. This is usually done by relaxing the constraints in Eq. (6) through the introduction of slack variables $\xi_i \ge 0$ ($i = 1, \ldots, m$), and requiring only that
$$y_i(w^\top x_i + b) \ge 1 - \xi_i \,, \quad i = 1, \ldots, m. \tag{14}$$
An extra cost for errors can be assigned as follows:
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^m \xi_i \tag{15}$$
subject to
$$y_i(w^\top x_i + b) \ge 1 - \xi_i \,, \quad i = 1, \ldots, m \tag{16}$$
$$\xi_i \ge 0 \,, \quad i = 1, \ldots, m. \tag{17}$$

Thus, whenever $y_i(w^\top x_i + b) < 1$, we pay an associated cost of $C\xi_i = C(1 - y_i(w^\top x_i + b))$ in the objective function; a classification error occurs when $y_i(w^\top x_i + b) \le 0$, or equivalently when $\xi_i \ge 1$. The parameter $C > 0$ controls the tradeoff between increasing the margin (minimizing $\|w\|_2$) and reducing the errors (minimizing $\xi_i$): a large value of $C$ keeps the errors small at the cost of a reduced margin; a small value of $C$ allows for more errors while increasing the margin on the remaining examples. Forming the dual of the above problem as before leads to the same convex QP as in the linearly separable case, except that the constraints in Eq. (12) are replaced by (see Footnote 1)
$$0 \le \alpha_i \le C \,, \quad i = 1, \ldots, m. \tag{18}$$

The solution for $\hat{w}$ is obtained similarly to the linearly separable case:
$$\hat{w} = \sum_{i=1}^m \hat{\alpha}_i y_i x_i \,.$$

In this case, the complementary slackness conditions yield for each $i \in [m]$ (see Footnote 2):
$$\hat{\alpha}_i \bigl(1 - \hat{\xi}_i - y_i(\hat{w}^\top x_i + \hat{b})\bigr) = 0$$
$$(C - \hat{\alpha}_i)\hat{\xi}_i = 0 \,.$$

This gives
$$\hat{\alpha}_i > 0 \implies 1 - \hat{\xi}_i - y_i(\hat{w}^\top x_i + \hat{b}) = 0$$
$$\hat{\alpha}_i < C \implies \hat{\xi}_i = 0 \,.$$

In particular, this gives
$$0 < \hat{\alpha}_i < C \implies 1 - y_i(\hat{w}^\top x_i + \hat{b}) = 0 \,;$$
these are the points on the margin.

Footnote 1: To see this, note that in this case there are $2m$ dual variables, say $\{\alpha_i\}$ for the first set of inequality constraints and $\{\beta_i\}$ for the second set of inequality constraints $\xi_i \ge 0$. When setting the derivative of the Lagrangian $L(w, b, \xi, \alpha, \beta)$ w.r.t. $\xi_i$ to zero, one gets $\alpha_i + \beta_i = C$, allowing one to replace $\beta_i$ with $C - \alpha_i$ throughout; the constraint $\beta_i \ge 0$ then becomes $\alpha_i \le C$.

Footnote 2: Again, the second set of complementary slackness conditions here are obtained by replacing the dual variables $\beta_i$ (for the inequality constraints $\xi_i \ge 0$) with $C - \alpha_i$ throughout; see also Footnote 1.

Thus here we have three types of support vectors with $\hat{\alpha}_i > 0$ (see Figure 2):


$$\mathrm{SV}_1 = \bigl\{ i \in [m] : 0 < \hat{\alpha}_i < C \bigr\}$$
$$\mathrm{SV}_2 = \bigl\{ i \in [m] : \hat{\alpha}_i = C,\ \hat{\xi}_i < 1 \bigr\}$$
$$\mathrm{SV}_3 = \bigl\{ i \in [m] : \hat{\alpha}_i = C,\ \hat{\xi}_i \ge 1 \bigr\} \,.$$

$\mathrm{SV}_1$ contains margin support vectors ($\hat{\xi}_i = 0$; these lie on the margin and are correctly classified); $\mathrm{SV}_2$ contains non-margin support vectors with $0 < \hat{\xi}_i < 1$ (these are correctly classified, but lie within the margin); $\mathrm{SV}_3$ contains non-margin support vectors with $\hat{\xi}_i \ge 1$ (these correspond to classification errors).

Figure 2: Three types of support vectors in the non-separable case.

Let
$$\mathrm{SV} = \mathrm{SV}_1 \cup \mathrm{SV}_2 \cup \mathrm{SV}_3 \,.$$
Then we have
$$\hat{w} = \sum_{i \in \mathrm{SV}} \hat{\alpha}_i y_i x_i \,.$$

Moreover, we can use the margin support vectors in $\mathrm{SV}_1$ to compute $\hat{b}$:
$$\hat{b} = \frac{1}{|\mathrm{SV}_1|} \sum_{i \in \mathrm{SV}_1} (y_i - \hat{w}^\top x_i) \,.$$

The above formulation of the SVM algorithm for the general (nonseparable) case is often called the soft margin SVM.
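As a practical aside (an addition, not part of the original notes), the soft margin problem of Eqs. (15)-(17) with a linear kernel is what off-the-shelf solvers such as scikit-learn's SVC optimize. The sketch below, on made-up toy data, fits such a model and separates the support vectors with $0 < \hat{\alpha}_i < C$ ($\mathrm{SV}_1$) from those at the bound $\hat{\alpha}_i = C$ ($\mathrm{SV}_2 \cup \mathrm{SV}_3$), using the fact that SVC exposes $y_i \hat{\alpha}_i$ for the support vectors in its dual_coef_ attribute.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, non-separable toy data (two labels flipped so the classes overlap).
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.2, 0.1],
              [-1.0, -1.0], [-2.0, 0.0], [-0.1, -0.2]])
y = np.array([1, 1, -1, -1, -1, 1])

C = 1.0
clf = SVC(kernel="linear", C=C)
clf.fit(X, y)

# scikit-learn stores alpha_i * y_i for the support vectors in dual_coef_,
# so w_hat = sum_i alpha_i y_i x_i and b_hat = intercept_.
w_hat = clf.coef_.ravel()
b_hat = clf.intercept_[0]
alpha_times_y = clf.dual_coef_.ravel()

# Split support vectors into margin SVs (0 < alpha_i < C, i.e. SV1) and
# bound SVs (alpha_i = C, i.e. SV2 or SV3), up to numerical tolerance.
at_bound = np.isclose(np.abs(alpha_times_y), C)
margin_svs = clf.support_[~at_bound]
bound_svs = clf.support_[at_bound]
print(margin_svs, bound_svs, w_hat, b_hat)
```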

3 Loss Minimization View

An alternative motivation for the (soft margin) SVM algorithm is in terms of minimizing the hinge loss on the training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$. Specifically, define $\ell_{\mathrm{hinge}} : \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$ as
$$\ell_{\mathrm{hinge}}(y, f) = (1 - yf)_+ \,, \tag{19}$$
where $z_+ = \max(0, z)$. This loss is convex in $f$ and upper bounds the 0-1 loss, much as the logistic loss does (see Figure 3).

Figure 3: Hinge and 0-1 losses, as a function of the margin yf .



Now consider learning a linear classifier that minimizes the empirical hinge loss, plus an $L_2$ regularization term:
$$\min_{w,b} \ \frac{1}{m}\sum_{i=1}^m \bigl(1 - y_i(w^\top x_i + b)\bigr)_+ + \lambda\|w\|_2^2 \,. \tag{20}$$

Introducing slack variables $\xi_i$ ($i = 1, \ldots, m$), we can re-write this as
$$\min_{w,b,\xi} \ \frac{1}{m}\sum_{i=1}^m \xi_i + \lambda\|w\|_2^2 \tag{21}$$
subject to
$$\xi_i \ge 1 - y_i(w^\top x_i + b) \,, \quad i = 1, \ldots, m \tag{22}$$
$$\xi_i \ge 0 \,, \quad i = 1, \ldots, m. \tag{23}$$

This is equivalent to the soft margin SVM (with $C = \frac{1}{2\lambda m}$); in other words, the soft margin SVM algorithm derived earlier effectively performs $L_2$-regularized empirical hinge loss minimization (with $\lambda = \frac{1}{2Cm}$)!
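To illustrate this loss minimization view (this sketch is an addition, not from the lecture), one can minimize the objective in Eq. (20) directly by subgradient descent; the step size, iteration count, and toy data below are arbitrary choices made purely for illustration.

```python
import numpy as np

def hinge_svm_gd(X, y, lam=0.01, lr=0.1, n_iters=1000):
    """Subgradient descent on Eq. (20):
    (1/m) sum_i (1 - y_i (w^T x_i + b))_+ + lam * ||w||_2^2."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                      # examples with nonzero hinge loss
        # Subgradient of the average hinge loss plus the L2 term.
        grad_w = -(y[active] @ X[active]) / m + 2 * lam * w
        grad_b = -np.sum(y[active]) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up toy data, for illustration only.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_svm_gd(X, y)
print(w, b, np.sign(X @ w + b))
```

Since Eq. (21) is the soft margin SVM problem up to the constant $C = \frac{1}{2\lambda m}$, this procedure approximately recovers the soft margin SVM solution, up to optimization accuracy.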

4 Support Vector Regression (SVR)

Consider now a regression problem with $X = \mathbb{R}^d$ and $Y = \hat{Y} = \mathbb{R}$. Given a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \mathbb{R})^m$, the support vector regression (SVR) algorithm minimizes an $L_2$-regularized form of the $\epsilon$-insensitive loss $\ell_\epsilon : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, defined as
$$\ell_\epsilon(y, \hat{y}) = \bigl(|\hat{y} - y| - \epsilon\bigr)_+ \tag{24}$$
$$= \begin{cases} 0 & \text{if } |\hat{y} - y| \le \epsilon \\ |\hat{y} - y| - \epsilon & \text{otherwise.} \end{cases} \tag{25}$$

This yields
$$\min_{w,b} \ \frac{1}{m}\sum_{i=1}^m \bigl(|(w^\top x_i + b) - y_i| - \epsilon\bigr)_+ + \lambda\|w\|_2^2 \,. \tag{26}$$

Introducing slack variables $\xi_i, \xi_i^*$ ($i = 1, \ldots, m$) and writing $\lambda = \frac{1}{2Cm}$ for appropriate $C > 0$, we can re-write this as
$$\min_{w,b,\xi,\xi^*} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m (\xi_i + \xi_i^*) \tag{27}$$
subject to
$$\xi_i \ge y_i - (w^\top x_i + b) - \epsilon \,, \quad i = 1, \ldots, m \tag{28}$$
$$\xi_i^* \ge (w^\top x_i + b) - y_i - \epsilon \,, \quad i = 1, \ldots, m \tag{29}$$
$$\xi_i, \xi_i^* \ge 0 \,, \quad i = 1, \ldots, m. \tag{30}$$

This is again a convex QP that can in principle be solved directly; again, it is useful to consider the dual, which helps to understand the structure of the solution and facilitates the extension to nonlinear SVR. We leave the details as an exercise; the resulting dual problem has the following form:
$$\max_{\alpha,\alpha^*} \ -\frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)(x_i^\top x_j) + \sum_{i=1}^m y_i(\alpha_i - \alpha_i^*) - \epsilon \sum_{i=1}^m (\alpha_i + \alpha_i^*) \tag{31}$$
subject to
$$\sum_{i=1}^m (\alpha_i - \alpha_i^*) = 0 \tag{32}$$
$$0 \le \alpha_i \le C \,, \quad i = 1, \ldots, m \tag{33}$$
$$0 \le \alpha_i^* \le C \,, \quad i = 1, \ldots, m. \tag{34}$$

This is again a convex QP (in the $2m$ variables $\alpha_i, \alpha_i^*$); the solution $\hat{\alpha}, \hat{\alpha}^*$ can be used to find the solution $\hat{w}$ to the primal problem as follows:
$$\hat{w} = \sum_{i=1}^m (\hat{\alpha}_i - \hat{\alpha}_i^*) x_i \,.$$

In this case, the complementary slackness conditions yield for each $i \in [m]$:
$$\hat{\alpha}_i \bigl(\hat{\xi}_i - y_i + (\hat{w}^\top x_i + \hat{b}) + \epsilon\bigr) = 0$$
$$\hat{\alpha}_i^* \bigl(\hat{\xi}_i^* + y_i - (\hat{w}^\top x_i + \hat{b}) + \epsilon\bigr) = 0$$
$$(C - \hat{\alpha}_i)\hat{\xi}_i = 0$$
$$(C - \hat{\alpha}_i^*)\hat{\xi}_i^* = 0 \,.$$

Analysis of these conditions shows that for each $i$, either $\hat{\alpha}_i$ or $\hat{\alpha}_i^*$ (or both) must be zero. For points inside the $\epsilon$-tube around the learned linear function, i.e. for which $|(\hat{w}^\top x_i + \hat{b}) - y_i| < \epsilon$, we have both $\hat{\alpha}_i = \hat{\alpha}_i^* = 0$. The remaining points constitute two types of support vectors:
$$\mathrm{SV}_1 = \bigl\{ i \in [m] : 0 < \hat{\alpha}_i < C \ \text{ or } \ 0 < \hat{\alpha}_i^* < C \bigr\}$$
$$\mathrm{SV}_2 = \bigl\{ i \in [m] : \hat{\alpha}_i = C \ \text{ or } \ \hat{\alpha}_i^* = C \bigr\} \,.$$

$\mathrm{SV}_1$ contains support vectors on the tube boundary (with $\hat{\xi}_i = \hat{\xi}_i^* = 0$); $\mathrm{SV}_2$ contains support vectors outside the tube (with $\hat{\xi}_i > 0$ or $\hat{\xi}_i^* > 0$). Taking
$$\mathrm{SV} = \mathrm{SV}_1 \cup \mathrm{SV}_2 \,,$$
we then have
$$\hat{w} = \sum_{i \in \mathrm{SV}} (\hat{\alpha}_i - \hat{\alpha}_i^*) x_i \,.$$

As before, the boundary support vectors in $\mathrm{SV}_1$ can be used to compute $\hat{b}$; from the complementary slackness conditions above, $\hat{b} = y_i - \hat{w}^\top x_i - \epsilon$ when $0 < \hat{\alpha}_i < C$ and $\hat{b} = y_i - \hat{w}^\top x_i + \epsilon$ when $0 < \hat{\alpha}_i^* < C$, so averaging gives
$$\hat{b} = \frac{1}{|\mathrm{SV}_1|} \left( \sum_{i \,:\, 0 < \hat{\alpha}_i < C} (y_i - \hat{w}^\top x_i - \epsilon) \ + \sum_{i \,:\, 0 < \hat{\alpha}_i^* < C} (y_i - \hat{w}^\top x_i + \epsilon) \right) .$$

The prediction for a new point $x \in \mathbb{R}^d$ is then made via
$$f_{\hat{w},\hat{b}}(x) = \hat{w}^\top x + \hat{b} = \sum_{i \in \mathrm{SV}} (\hat{\alpha}_i - \hat{\alpha}_i^*)(x_i^\top x) + \hat{b} \,.$$
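As with classification, this can be tried out with an off-the-shelf solver. The sketch below (an addition, not from the lecture) uses scikit-learn's SVR with a linear kernel on made-up noisy data, and reads the quantities $\hat{\alpha}_i - \hat{\alpha}_i^*$ for the support vectors from its dual_coef_ attribute; points strictly inside the $\epsilon$-tube do not appear among the support vectors.

```python
import numpy as np
from sklearn.svm import SVR

# Made-up 1-d regression data: a noisy line (for illustration only).
rng = np.random.default_rng(0)
X = np.linspace(0, 4, 30).reshape(-1, 1)
y = 1.5 * X.ravel() + 0.5 + 0.1 * rng.standard_normal(30)

# Linear SVR with the same C and epsilon as in Eqs. (27)-(30).
reg = SVR(kernel="linear", C=1.0, epsilon=0.2)
reg.fit(X, y)

# dual_coef_ holds (alpha_i - alpha_i*) for the support vectors, so the
# learned function is f(x) = sum_{i in SV} (alpha_i - alpha_i*) x_i^T x + b.
coef_diff = reg.dual_coef_.ravel()           # alpha_i - alpha_i* on the SVs
w_hat = coef_diff @ reg.support_vectors_     # same as reg.coef_.ravel()
b_hat = reg.intercept_[0]

# Points strictly inside the epsilon-tube are not support vectors.
inside_tube = np.setdiff1d(np.arange(len(y)), reg.support_)
x_new = np.array([[2.5]])
print(len(reg.support_), len(inside_tube), w_hat, b_hat, reg.predict(x_new))
```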

In practice, the parameter $C$ in SVM and the parameters $C$ and $\epsilon$ in SVR are generally selected by cross-validation on the training sample (or using a separate validation set). An alternative parametrization of the SVM and SVR optimization problems, termed $\nu$-SVM and $\nu$-SVR, makes use of a different parameter $\nu$ that directly bounds the fraction of training examples that end up as support vectors.
Exercise. Derive the dual of the SVR optimization problem above.

Exercise. Derive an alternative formulation of the SVR optimization problem that makes use of a single slack variable $\xi_i$ for each data point rather than two slack variables $\xi_i, \xi_i^*$. Show that this leads to the same solution as above.

Exercise. Derive a regression algorithm that, given a training sample $S$, minimizes on $S$ the $L_2$-regularized absolute loss $\ell_{\mathrm{abs}} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, given by $\ell_{\mathrm{abs}}(y, \hat{y}) = |\hat{y} - y|$, over all linear functions.

Acknowledgments. Thanks to Saurav Bose for help in preparing the plot in Figure 3.
