
Statistics & Discrete Methods of Data Sciences CS395T(51800), CSE392(63625), M393C(54377)

MW 11:00a.m.-12:30p.m. GDC 5.304


Lecture Notes: Geometry of Support Vector Machines and Kernel Trick, bajaj@cs.utexas.edu

1 Support Vector Machines (SVM)


We consider a machine learning approach to 2-class hyperplane separation.
Given a training set of instance pairs {(x_i, y_i) | x_i ∈ R^n, i = 1, . . . , m} with class labels y_i = ±1, we wish to find a hyperplane direction w ∈ R^n and an offset scalar b such that
$$
\begin{cases}
w \cdot x_i - b > 0 & \text{for } y_i = +1 \\
w \cdot x_i - b < 0 & \text{for } y_i = -1
\end{cases}
$$
or, combining both cases,
$$
y_i (w \cdot x_i - b) > 0.
$$
If such a hyperplane exists, it is not unique. In real-world classification problems it is quite likely that one will require non-linear separators with a reasonable complexity vs. accuracy tradeoff.
Since the training data are merely samples of the instance space, and not necessarily adequate "representative" samples, doing well on the training data does not guarantee (or even imply) that one will do well on the entire instance space. A related issue is that the training data distribution is unknown; in contrast to classical statistical inference, we do not estimate this unknown distribution. Nevertheless, optimal learning algorithms can be developed without first estimating it.

1.1 Idea of SVM


Consider the subset C_r of all hyperplanes which have a fixed margin r, where the margin
$$
r = \min_i \left\{ \frac{y_i (w^T x_i - b)}{\|w\|_2} \right\}
$$
represents the distance of the closest training point to the hyperplane.
The Support Vector Machine (SVM) method was first introduced by Vapnik et al. (1992). SVM seeks a
hyperplane that simultaneously minimizes empirical error and maximizes the margin.
Remark: The distance between a point x and the hyperplane is |w · x − b| / ‖w‖₂.
Note that since
$$
w \cdot x - b > 0 \iff (\lambda w) \cdot x - (\lambda b) > 0 \quad \forall \lambda > 0,
$$
the pair (w, b) can be rescaled so that the point closest to the hyperplane satisfies |w · x − b| = 1; its distance to the hyperplane then becomes 1/‖w‖₂ = 1/√(w · w). Therefore, the overall margin (from both sides) is 2/√(w · w) = 2/‖w‖₂.
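As a quick worked check of this normalization (the numbers are chosen purely for illustration):
$$
w = (3, 4),\; b = 0,\; x = (1, 1): \qquad \frac{|w \cdot x - b|}{\|w\|_2} = \frac{7}{5}.
$$
Rescaling with λ = 1/7 gives w = (3/7, 4/7), b = 0, so that |w · x − b| = 1 while the distance 1/‖w‖₂ = 7/5 is unchanged, and the overall margin is 2/‖w‖₂ = 14/5.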

Note
$$
\arg\max_w \frac{2}{\sqrt{w \cdot w}} = \arg\max_w \frac{2}{\|w\|_2} = \arg\max_w \frac{2}{\|w\|_2^2} = \arg\min_w \frac{1}{2}\,(w \cdot w)
$$

Then we can write down the optimization problem that SVM seeks to solve:
$$
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2}\, w \cdot w \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) - 1 \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{1}
$$

Here the constraints are the result of the normalization of |w · x − b| / ‖w‖₂: all points labeled ±1 lie at or beyond the "narrow-band boundary" margin, i.e.
$$
\begin{cases}
w \cdot x_i - b \ge 1 & \text{when } y_i = +1 \\
w \cdot x_i - b \le -1 & \text{when } y_i = -1
\end{cases}
$$
Combining with the label, one has
$$
y_i (w \cdot x_i - b) - 1 \ge 0, \quad i = 1, \dots, m.
$$


Noisy Data Case. One relaxes the SVM problem to a "soft" margin, so that separability holds up to some error ξ_i:
$$
\begin{aligned}
\min_{w,b,\xi_i} \quad & \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{2}
$$
With ξ_i > 0, a point can lie inside the margin. Note that Σ_{i=1}^m ξ_i = ‖ξ‖₁, i.e. the L1 norm; this promotes sparsity and thus sparse errors.
The ideal penalty would be based on the number of errors, i.e. the L0 norm ‖ξ‖₀ = |{i : ξ_i > 0}|, which directly minimizes the number of margin violations. However the L0 norm is non-convex; the L1 norm is its convex relaxation.
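As a concrete illustration, the soft-margin primal (2) can be prototyped directly with a convex-optimization package; the sketch below assumes cvxpy is available, and the data X, y and the penalty ν are made-up toy values.

```python
# Minimal sketch of the soft-margin primal (2); X, y, nu are illustrative only.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, nu = 40, 2, 1.0
X = np.vstack([rng.normal(-2, 1, (m // 2, n)), rng.normal(2, 1, (m // 2, n))])
y = np.hstack([-np.ones(m // 2), np.ones(m // 2)])

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m)                                   # slack variables xi_i

objective = cp.Minimize(0.5 * cp.sum_squares(w) + nu * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                               # separating direction and offset
```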

1.2 Primal-Dual formulation of SVM


The primal form of SVM with maximization of the soft margin is:
$$
\begin{aligned}
\min_{w,b,\xi_i} \quad & \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{3}
$$

The Lagrangian formulation using Lagrange multipliers μ_i ≥ 0 and δ_i ≥ 0 yields
$$
\sup_{\mu_i, \delta_i} \; \min_{w,b,\xi_i} L(w, b, \xi_i, \mu_i, \delta_i)
= \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i
- \sum_{i=1}^{m} \mu_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right]
- \sum_{i=1}^{m} \delta_i \xi_i
$$
We check the first-order (KKT) optimality conditions:
$$
\min_{w,b,\xi_i} L(w, b, \xi_i, \mu_i, \delta_i) \;\Longrightarrow\;
\begin{cases}
\dfrac{\partial L}{\partial w} = 0 \\[4pt]
\dfrac{\partial L}{\partial b} = 0 \\[4pt]
\dfrac{\partial L}{\partial \xi_i} = 0 \\[4pt]
\mu_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] = 0 \\[4pt]
\delta_i \xi_i = 0
\end{cases}
$$
$$
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \mu_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{m} \mu_i y_i x_i
$$
$$
\frac{\partial L}{\partial b} = \sum_{i=1}^{m} \mu_i y_i = 0
$$
$$
\frac{\partial L}{\partial \xi_i} = \nu - \mu_i - \delta_i = 0 \;\Longrightarrow\; 0 \le \mu_i \le \nu, \quad i = 1, 2, \dots, m
$$
When μ_i = 0, i.e. y_i (w · x_i − b) > 1 − ξ_i, instance x_i is correctly classified and is not a boundary point.
When μ_i > 0, i.e. y_i (w · x_i − b) = 1 − ξ_i, then x_i is a boundary point with margin error ξ_i ≥ 0 as small as possible. These boundary points are the support vectors, and w is determined by them.
Using ∂L/∂w = 0 and ∂L/∂ξ_i = 0, one obtains the following classification:

• 0 < μ_i < ν: point x_i is on the margin with no margin error (ξ_i = 0)

• μ_i = ν: point x_i is a margin-error point (since δ_i = 0, ξ_i may be positive)


• μ_i = 0: point x_i lies outside the margin (and is not a support vector)


By replacing w = Σ_{i=1}^m μ_i y_i x_i and δ_i = ν − μ_i, one obtains the Dual Problem for SVM with soft margin:
$$
\begin{aligned}
\max_{\mu_i} \quad & \sum_{i=1}^{m} \mu_i - \frac{1}{2} \sum_{i,j} (y_i y_j \, x_i \cdot x_j)\, \mu_i \mu_j \\
\text{s.t.} \quad & 0 \le \mu_i \le \nu, \quad i = 1, \dots, m \\
& \sum_{i=1}^{m} y_i \mu_i = 0
\end{aligned}
\tag{4}
$$

If we denote by 1 the vector with all elements equal to 1, then the maximization problem can be written as
$$
\max_{\mu} \; \mu^T \mathbf{1} - \frac{1}{2}\, \mu^T M \mu
$$
with Gram matrix M_ij = y_i y_j (x_i · x_j), which is a Positive Semi-Definite (PSD) matrix.
The reason for introducing the dual problem is the following: the dual form of SVM is simpler than the primal SVM, and its key feature is that the optimization objective is now expressed entirely through inner products of data instance pairs ⟨x_i, x_j⟩.
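Under the same assumptions as the primal sketch above (cvxpy plus the illustrative X, y, m, ν), the dual (4) can be prototyped as follows; writing μᵀMμ = ‖Qμ‖², with Q the matrix whose columns are y_i x_i, keeps the problem in a form cvxpy accepts directly.

```python
# Sketch of the dual problem (4); reuses X, y, m, nu from the primal sketch above.
import numpy as np
import cvxpy as cp

Q = (y[:, None] * X).T                      # columns y_i x_i, so M = Q^T Q
mu = cp.Variable(m)

objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(Q @ mu))   # mu^T M mu = ||Q mu||^2
constraints = [mu >= 0, mu <= nu, y @ mu == 0]
cp.Problem(objective, constraints).solve()

w_dual = Q @ mu.value                       # recover w = sum_i mu_i y_i x_i
support = np.where(mu.value > 1e-6)[0]      # indices with mu_i > 0: the support vectors
```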

2 Kernel Trick in Support Vector Machine


SVM is amenable to the "kernel trick", meaning that the inner products which determine the Gram matrix can be replaced by a non-linear functional inner product (which keeps the Gram matrix PSD).

2.1 Kernel mapping


A kernel K is obtained from a mapping of the original measurement vectors x_i to a higher-dimensional feature vector space. Given a map
$$
\phi : \mathbb{R}^n \longrightarrow \mathbb{R}^p, \quad p > n,
$$
the functional-space formulation of a kernel space is a Hilbert space H = {K : R^n × R^n → R defining an inner product}. Here K is given by
$$
K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
$$


The kernel trick is to use and evaluate the kernel without ever explicitly evaluating the feature map φ(·).
Some common choices of kernels used in conjunction with non-linear (kernel) SVM are:

• d-th order polynomial: K(x_i, x_j) = (x_i^T x_j + θ)^d

• Gaussian Radial Basis Function (RBF): K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))

Any algorithm whose use of the input vectors can be expressed purely through inner products between vectors is amenable to the kernel trick, where x_i · x_j is replaced by φ(x_i)^T φ(x_j).
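As a small illustration (the helper names poly_kernel and rbf_kernel below are made up for this sketch), the Gram matrices of the two kernels above can be computed with plain numpy:

```python
import numpy as np

def poly_kernel(X, Z, d=2, theta=1.0):
    """Gram matrix of the d-th order polynomial kernel K(x, z) = (x.z + theta)^d."""
    return (X @ Z.T + theta) ** d

def rbf_kernel(X, Z, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq_dists / (2 * sigma**2))
```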

We know that the Gram matrix M in the dual formulation is Positive Semi-Definite (PSD) since M = QᵀQ, where
$$
Q = \begin{bmatrix} y_1 x_1 & y_2 x_2 & \cdots & y_m x_m \end{bmatrix}
$$
with all x_i ∈ R^n as column vectors.
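A quick numerical sanity check of this factorization (reusing the illustrative X, y from the sketches above):

```python
import numpy as np

M = (y[:, None] * X) @ (y[:, None] * X).T    # M_ij = y_i y_j x_i . x_j
Q = (y[:, None] * X).T
assert np.allclose(M, Q.T @ Q)
assert np.linalg.eigvalsh(M).min() >= -1e-9  # eigenvalues nonnegative up to round-off
```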


Theorem. [Mercer] Let K(x, y) be symmetric and continuous. Then the following conditions are equivalent:
(I) K(x, y) = Σ_{i=1}^∞ α_i φ_i(x) φ_i(y) = φ(x)^T φ(y) for a uniformly converging series with α_i > 0


(II) for all ψ(·) satisfying ∫ ψ²(x) dx < ∞, we have
$$
\int_x \int_y K(x, y)\, \psi(x)\, \psi(y)\, dx\, dy \ge 0
$$

(III) for all {x_i}_{i=1}^q and all q, the matrix K with K_ij = K(x_i, x_j) is PSD
To better understand the relationship of feature maps with Kernels, consider:
• Homogeneous Polynomial Kernel: x, y ∈ R^s ⟹ k(x, y) = (x^T y)^d = (Σ_{i=1}^s x_i y_i)^d, d > 0.
  The feature map can be defined, in a vector space with one coordinate per monomial of degree d (dimension $\binom{s+d-1}{d}$), as
$$
\phi(x) \equiv \left( \sqrt{\binom{d}{n_1, \dots, n_s}}\; x_1^{n_1} \cdots x_s^{n_s} \right), \qquad \sum_{i=1}^{s} n_i = d, \;\; n_i \ge 0.
$$
  One constructs the Gram matrix of kernels as follows: M_ij = K(x_i, x_j) = (x_i^T x_j)^d.
  When s = d = 2, (x^T y)² = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂², so we pick φ(x) = (x₁², x₂², √2 x₁x₂) and thereby create the Gram matrix with M_ij = (x_i^T x_j)² (a numerical check of this identity is given after this list).
• Non-Homogeneous Polynomial Kernel: all monomials of degree ≤ d,
$$
K(x, y) = (x^T y + \alpha)^d = \left( x_1 y_1 + x_2 y_2 + \cdots + x_s y_s + \sqrt{\alpha}\,\sqrt{\alpha} \right)^d
$$
  Again, consider the case where s = d = 2, so that K(x, y) = (x₁y₁ + x₂y₂ + α)²: here φ : R² → R⁶ maps a conic curve in the measurement plane to a hyperplane in the six-dimensional feature space.
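A short numerical check of the s = d = 2 homogeneous case (the helper name phi below is made up for this sketch): the feature-space inner product φ(x)·φ(y) should reproduce the kernel value (x^T y)².

```python
import numpy as np

def phi(x):
    """Feature map for the s = d = 2 homogeneous polynomial kernel (x^T y)^2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)   # inner product in feature space = kernel value
```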

2.2 Applying the Kernel Trick to SVM


Following the notation of Section 1.2, by adopting a kernel K(·, ·) with
$$
K(x, y) = \phi(x)^T \phi(y),
$$
one can solve the dual SVM for the weight vector in feature space,
$$
\phi(w) = \sum_{i=1}^{m} \mu_i y_i \phi(x_i).
$$
Rather than explicitly representing φ(w) or evaluating φ(x_i), we store the support vectors x_i for which μ_i > 0 and use the classifier f for all test data:
$$
f(x) = \mathrm{sgn}\!\left( \phi(w)^T \phi(x) - b \right) = \mathrm{sgn}\!\left( \sum_{i=1}^{m} \mu_i y_i K(x_i, x) - b \right)
$$
The kernel trick thus allows a non-linear separator (classifier) simply by changing the Gram matrix of kernels. However, one must store and use all support vectors at classification time (rather than just w and b), which increases the time and storage cost.
Remark. The offset b can be recovered from any of the support vectors: say x₊ is a support vector with label +1 (but not a margin error, i.e. μ_i < ν); then
$$
\phi(w)^T \phi(x_+) - b = 1 \;\Longrightarrow\; b = \phi(w)^T \phi(x_+) - 1
$$
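A sketch of how prediction and the recovery of b look in practice when only the support vectors are stored; the helper names predict and recover_b are made up, and a kernel function such as rbf_kernel from the earlier sketch is assumed to be available.

```python
import numpy as np

def predict(X_test, X_sv, y_sv, mu_sv, b, kernel):
    """f(x) = sgn( sum_i mu_i y_i K(x_i, x) - b ) for each test point."""
    K = kernel(X_sv, X_test)                        # shape (n_sv, n_test)
    return np.sign((mu_sv * y_sv) @ K - b)

def recover_b(X_sv, y_sv, mu_sv, x_plus, kernel):
    """b = phi(w)^T phi(x_+) - 1 for a positive-label support vector x_+ with mu_i < nu."""
    return (mu_sv * y_sv) @ kernel(X_sv, x_plus[None, :])[:, 0] - 1.0
```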
Remark. Support vectors are typically on the order of 10% of the training examples, so the computational load and memory usage are relatively high. Approximations use a reduced number of support vectors so as to keep the cost of kernel computation manageable.
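For instance (a sketch assuming scikit-learn is installed, reusing the illustrative X, y from above; sklearn's C parameter plays the role of the penalty weight ν here), the number of stored support vectors can be inspected after fitting:

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(len(clf.support_), "support vectors out of", len(X), "training points")
```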

