Lecture 14: Kernels - Applied ML
Contents
Lecture 14: Kernels
14.1. The Kernel Trick in SVMs
14.2. Kernelized Ridge Regression
14.3. More on Kernels
In this lecture, we will introduce a new and important concept in machine learning:
the kernel. So far, the majority of the machine learning models we have seen have
been linear. Kernels are a general way to make many of these models non-linear.
In the previous two lectures, we introduced linear SVMs. Kernels will be a way to
make the SVM algorithm suitable for dealing with non-linear data.
14.1. The Kernel Trick in SVMs
Linear models for binary classification take the form
$$
f_\theta(x) = \theta^\top \phi(x) + \theta_0,
$$
where $x$ is the input and $y \in \{-1, 1\}$ is the target. Support vector machines are a
machine learning algorithm that fits a linear model by finding the maximum-margin
separating hyperplane between the two classes.
Recall that the max-margin hyperplane can be formulated as the solution to the
following primal optimization problem:
$$
\min_{\theta, \theta_0, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{i=1}^n \xi_i
$$
$$
\text{subject to } \; y^{(i)}\left((x^{(i)})^\top \theta + \theta_0\right) \geq 1 - \xi_i \;\text{ for all } i, \qquad \xi_i \geq 0.
$$
The solution to this problem also happens to be given by the following dual
problem:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$
$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0, \qquad C \geq \lambda_i \geq 0 \;\text{ for all } i.
$$
We can obtain a primal solution from the dual via the following equation:
$$
\theta^* = \sum_{i=1}^n \lambda_i^* y^{(i)} x^{(i)}.
$$
Ignoring the $\theta_0$ term for now, the score at a new point $x'$ will equal
$$
(\theta^*)^\top x' = \sum_{i=1}^n \lambda_i^* y^{(i)} (x^{(i)})^\top x'.
$$
Let's now make a brief digression and recall how features can make a linear model non-linear. A polynomial of degree $p$ in a scalar variable $x$ has the form
$$
a_p x^p + a_{p-1} x^{p-1} + \dots + a_1 x + a_0.
$$
We can define the polynomial feature function
$$
\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}
$$
and write the polynomial model as
$$
f_\theta(x) := \sum_{j=0}^p \theta_j x^j = \theta^\top \phi(x).
$$
Crucially, observe that f θ is a linear model with input features ϕ(x) and
parameters θ. The parameters θ are the coefficients of the polynomial. Thus, we
can use our algorithms for linear regression to learn θ. This yields a polynomial
model for y given x that is non-linear in x (because ϕ(x) is non-linear in x).
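To make this concrete, here is a minimal numpy sketch (not part of the original notes; the dataset, degree, and regularization strength are made up for illustration) that builds the polynomial features $\phi(x)$ and learns $\theta$ with the regularized normal equations.

import numpy as np

# Toy 1D dataset (made up for illustration): y is a noisy cubic function of x.
np.random.seed(0)
x = np.linspace(-1, 1, 30)
y = 2 * x**3 - x + 0.1 * np.random.randn(30)

p, lam = 3, 1e-3                                # assumed degree and ridge strength
Phi = np.vander(x, N=p + 1, increasing=True)    # each row is [1, x, x^2, ..., x^p]

# Regularized normal equations: theta = (Phi^T Phi + lam I)^{-1} Phi^T y
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p + 1), Phi.T @ y)
print(theta)    # fitted coefficients theta_0, ..., theta_p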
The disadvantage of the above approach is that it requires more features and more
computation. When $x$ is a scalar, we need $O(p)$ features. When applying the
normal equations to compute $\theta$, we need $O(p^3)$ time. More generally, when $x$ is a
$d$-dimensional vector, we will need $O(d^p)$ features to represent a full polynomial,
and the time complexity of applying the normal equations will be even greater.

Returning to the SVM, the dual objective with featurized inputs $\phi(x)$ becomes
$$
J(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} \phi(x^{(i)})^\top \phi(x^{(k)}),
$$
and the score at a new point $x'$ becomes
$$
\sum_{i=1}^n \lambda_i^* y^{(i)} \phi(x^{(i)})^\top \phi(x').
$$
Notice that in both equations, the features ϕ(x) are never used directly. Only their
dot product is used. If we can compute the dot product efficiently, we can
potentially use very complex features.
Can we compute the dot product ϕ(x) ⊤ ϕ(x ′ ) of polynomial features ϕ(x) more
efficiently than using the standard definition of a dot product? It turns out that we
can. Let’s look at an example.
Suppose $x \in \mathbb{R}^d$, and consider the feature map whose entries are all pairwise products of the entries of $x$:
$$
\phi(x) = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_d & x_2 x_1 & \cdots & x_d x_d \end{bmatrix}^\top.
$$
These features consist of all the pairwise products among all the entries of $x$; for $d = 3$, $\phi(x)$ has the nine entries $x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, \dots, x_3 x_3$. The dot product between two such feature vectors is
$$
\phi(x)^\top \phi(z) = \sum_{i=1}^d \sum_{j=1}^d x_i x_j z_i z_j.
$$
Normally, computing this dot product involves a sum over $d^2$ terms and takes $O(d^2)$ time.
An alternative way of computing the dot product $\phi(x)^\top \phi(z)$ is to instead compute
$(x^\top z)^2$. One can check that this has the same result as follows:
$$
\begin{aligned}
(x^\top z)^2 &= \left( \sum_{i=1}^d x_i z_i \right)^2 \\
&= \left( \sum_{i=1}^d x_i z_i \right) \cdot \left( \sum_{j=1}^d x_j z_j \right) \\
&= \sum_{i=1}^d \sum_{j=1}^d x_i z_i x_j z_j \\
&= \phi(x)^\top \phi(z).
\end{aligned}
$$
But computing (x ⊤ z) 2 can be done in only O(d) time: we simply compute the dot
product x ⊤ z in O(d) time, and then square the resulting scalar. This is much
faster than the naive O(d 2 ) procedure.
More generally, we can consider degree-$p$ features $\phi_p(x)$ consisting of all $p$-wise products $x_{i_1} x_{i_2} \cdots x_{i_p}$; there are $O(d^p)$ of them, so the naive dot product would take $O(d^p)$ time. However, using a version of the above argument, we can compute the dot product
$\phi_p(x)^\top \phi_p(z)$ in this feature space in only $O(d)$ time for any $p$ as follows:
$$
\phi_p(x)^\top \phi_p(z) = (x^\top z)^p.
$$
The key takeaways are:
We can compute the dot product between $O(d^p)$ features in only $O(d)$ time.
We can use high-dimensional features within ML algorithms that only rely on
dot products without incurring extra costs.
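As a quick sanity check, here is a small sketch (not part of the original notes) that verifies numerically that the dot product of the pairwise-product features equals $(x^\top z)^2$.

import numpy as np

def phi_pairwise(x):
    # All pairwise products x_i * x_j, flattened into a d^2-dimensional vector.
    return np.outer(x, x).ravel()

d = 5
rng = np.random.default_rng(0)
x, z = rng.normal(size=d), rng.normal(size=d)

naive = phi_pairwise(x) @ phi_pairwise(z)   # O(d^2) features and O(d^2) work
fast = (x @ z) ** 2                         # O(d) work via the kernel trick
print(np.isclose(naive, fast))              # True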
This motivates the following definition. Given features $\phi(x)$, we define the function
$$
K(x, z) = \phi(x)^\top \phi(z).
$$
We will call $K$ the kernel function. Recall that an example of a useful kernel function is the polynomial kernel
$$
K(x, z) = (x \cdot z)^p.
$$
We can rewrite the SVM dual entirely in terms of the kernel function:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} K(x^{(i)}, x^{(k)})
$$
$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0, \qquad C \geq \lambda_i \geq 0 \;\text{ for all } i.
$$
Similarly, the score at a new point $x'$ becomes
$$
\sum_{i=1}^n \lambda_i^* y^{(i)} K(x^{(i)}, x').
$$
We can efficiently use any features $\phi(x)$ (e.g., polynomial features of any degree $p$)
as long as the kernel function computes the dot products of the $\phi(x)$ efficiently.
We will see several examples of kernel functions below.
The kernel trick means that we can use complex non-linear features within these
algorithms at little additional computational cost.
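To see how a kernel plugs into the SVM dual in practice, here is a hedged sketch: it computes a polynomial Gram matrix explicitly and passes it to scikit-learn's SVC with kernel='precomputed'. The dataset and kernel parameters below are made up; the examples later in this lecture instead use sklearn's built-in kernel options.

import numpy as np
from sklearn.svm import SVC

# Tiny made-up 2D dataset with a non-linear class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0).astype(int)

def poly_kernel(A, B, p=3):
    # K(x, z) = (x . z + 1)^p evaluated for all pairs of rows of A and B.
    return (A @ B.T + 1.0) ** p

K = poly_kernel(X, X)                              # n x n matrix of kernel values
clf = SVC(kernel='precomputed', C=1.0).fit(K, y)

# Predictions require the kernel between new points and the training points.
X_new = rng.normal(size=(5, 2))
print(clf.predict(poly_kernel(X_new, X)))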
14.2. Kernelized Ridge Regression
Support vector machines are far from the only algorithm that benefits from
kernels. Another algorithm that supports kernels is ridge regression.
Recall that ridge regression fits a model of the form
$$
f_\theta(x) = \theta^\top \phi(x)
$$
by minimizing the regularized objective
$$
J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^d \theta_j^2.
$$
Recall that the normal equations give the ridge regression solution
$$
\theta = (X^\top X + \lambda I)^{-1} X^\top y.
$$
When the vectors of attributes $x^{(i)}$ are featurized, we can write this as
$$
\theta = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y,
$$
where $\Phi$ is the matrix of featurized inputs,
$$
\Phi = \begin{bmatrix} - & \phi(x^{(1)})^\top & - \\ - & \phi(x^{(2)})^\top & - \\ & \vdots & \\ - & \phi(x^{(n)})^\top & - \end{bmatrix}
= \begin{bmatrix} \phi(x^{(1)})_1 & \phi(x^{(1)})_2 & \cdots & \phi(x^{(1)})_p \\ \phi(x^{(2)})_1 & \phi(x^{(2)})_2 & \cdots & \phi(x^{(2)})_p \\ \vdots & & & \vdots \\ \phi(x^{(n)})_1 & \phi(x^{(n)})_2 & \cdots & \phi(x^{(n)})_p \end{bmatrix}.
$$
Using the push-through matrix identity
$$
(\lambda I + UV)^{-1} U = U (\lambda I + VU)^{-1},
$$
we can rewrite this solution as
$$
\theta = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} y = \Phi^\top \alpha,
$$
where we have defined $\alpha := (\Phi \Phi^\top + \lambda I)^{-1} y$.
The proof sketch for the identity is: start with $U(\lambda I + VU) = (\lambda I + UV)U$ and multiply both
sides by $(\lambda I + VU)^{-1}$ on the right and $(\lambda I + UV)^{-1}$ on the left. If you are
interested, you can try to derive it in detail by yourself.
The first approach takes $O(p^3)$ time; the second is $O(n^3)$ and is faster when $p > n$.
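The equivalence of the two formulas is easy to verify numerically. Here is a small sketch (not part of the original notes; the dimensions are made up) that confirms both expressions yield the same $\theta$ when $p > n$.

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 0.1            # more features than data points (p > n)
Phi = rng.normal(size=(n, p))
y = rng.normal(size=n)

# O(p^3) route: solve in feature space.
theta_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
# O(n^3) route: solve in "sample space" using the push-through identity.
theta_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

print(np.allclose(theta_primal, theta_dual))   # True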
An interesting corollary is that the optimal $\theta$ is a linear combination of the $n$ featurized training inputs:
$$
\theta = \sum_{i=1}^n \alpha_i \phi(x^{(i)}),
$$
where the weights are given by
$$
\alpha_i = \sum_{j=1}^n L_{ij} y_j, \qquad \text{where } L = (\Phi \Phi^\top + \lambda I)^{-1}.
$$
The prediction of the model at a new point $x'$ is therefore
$$
\phi(x')^\top \theta = \sum_{i=1}^n \alpha_i \, \phi(x')^\top \phi(x^{(i)}).
$$
The crucial observation is that the features ϕ(x) are never used directly in this
equation. Only their dot product is used!
We also don't need the features $\phi$ for learning $\theta$, just their dot products! First, recall that
each row $i$ of $\Phi$ is the $i$-th featurized input $\phi(x^{(i)})^\top$. Thus $K = \Phi \Phi^\top$ is the matrix
of all dot products between all the $\phi(x^{(i)})$:
$$
K_{ij} = \phi(x^{(i)})^\top \phi(x^{(j)}).
$$
We can therefore compute $\alpha = (K + \lambda I)^{-1} y$ using only these dot products, and, as above,
predictions also require only dot products:
$$
\phi(x')^\top \theta = \sum_{i=1}^n \alpha_i \, \phi(x')^\top \phi(x^{(i)}).
$$
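Putting the pieces together, here is a minimal kernel ridge regression sketch (not part of the original notes; the dataset and the choice of a polynomial kernel are made up for illustration): we form the kernel matrix $K$, solve for $\alpha = (K + \lambda I)^{-1} y$, and predict with kernel evaluations against the training points.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 25)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=25)   # made-up regression data

def kernel(a, b, p=4):
    # Polynomial kernel K(x, z) = (x z + 1)^p for scalar inputs, evaluated pairwise.
    return (np.outer(a, b) + 1.0) ** p

lam = 0.1
K = kernel(x_train, x_train)                                # n x n matrix of kernel values
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# Prediction at new points: f(x') = sum_i alpha_i K(x', x^(i)).
x_test = np.linspace(-1, 1, 5)
y_pred = kernel(x_test, x_train) @ alpha
print(y_pred)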
14.3. More on Kernels
We have seen two examples of kernelized algorithms: SVMs and ridge regression.
Let's now look at some additional examples of kernels.
We will look at a few examples of kernels using the following dataset. The following
code visualizes the dataset.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
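The code cell that constructs the 2D dataset did not survive the conversion of this page. Below is a minimal stand-in (assumed, not the original data) that defines a small two-class dataset X, Y and plots it, reusing the imports above, so that the SVM cells below can be run.

# Stand-in for the original dataset: 16 points in two classes, arranged in a few clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1.3, -1.0], scale=0.4, size=(4, 2)),
               rng.normal(loc=[-0.6, 1.4], scale=0.4, size=(4, 2)),
               rng.normal(loc=[1.1, 0.9], scale=0.4, size=(4, 2)),
               rng.normal(loc=[0.3, -2.2], scale=0.4, size=(4, 2))])
Y = np.array([0] * 8 + [1] * 8)

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.xlim(-3, 3)
plt.ylim(-3, 3)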
The above figure is a visualization of the 2D dataset we have created. We will then
use it to illustrate the difference between various kernels.
The simplest example is the linear kernel,
$$
K(x, z) = x^\top z,
$$
which is just the ordinary dot product of the raw inputs.
Below is an example of how we can use the SVM implementation in sklearn with a
linear kernel. Internally, this solves the dual SVM optimization problem.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='linear', gamma=2)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
The above figure shows that a linear kernel provides linear decision boundaries to
separate the data.
A more powerful example is the polynomial kernel,
$$
K(x, z) = (x^\top z + c)^p.
$$
This corresponds to a mapping to a feature space of dimension $\binom{d+p}{p}$ that has all
monomials $x_{i_1} x_{i_2} \cdots x_{i_p}$ of degree at most $p$.
For example, when $d = 3$ and $p = 2$, the feature map corresponding to $K(x, z) = (x^\top z + c)^2$ is
$$
\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix}.
$$
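One can verify numerically that this feature map indeed reproduces the kernel. The sketch below (not part of the original notes) checks the identity for random $x$ and $z$ with $d = 3$, $p = 2$, and $c = 1$.

import numpy as np

def phi_poly2(x, c=1.0):
    # Explicit features for K(x, z) = (x . z + c)^2 when d = 3:
    # all pairwise products, the inputs scaled by sqrt(2c), and the constant c.
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(phi_poly2(x) @ phi_poly2(z), (x @ z + 1.0) ** 2))   # True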
Below is an example of how we can use the polynomial kernel with the SVM implementation in sklearn.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='poly', degree=3, gamma=2)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
The above figure visualizes the decision boundary of an SVM with a polynomial kernel of
degree 3 and gamma 2. The decision boundary is non-linear, in contrast to that of the linear
kernel.
Another important example is the radial basis function (RBF) kernel,
$$
K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right),
$$
where $\sigma$ is a hyper-parameter called the bandwidth. This kernel corresponds to an
infinite-dimensional feature map.
To see why that's intuitively the case, consider the Taylor expansion
$$
\exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right) = \sum_{j=0}^{\infty} \frac{1}{j!} \left( -\frac{\|x - z\|^2}{2\sigma^2} \right)^j.
$$
Each term on the right-hand side can be expanded into a polynomial. Thus, the
above infinite series contains an infinite number of polynomial terms, and
consequently an infinite number of polynomial features.
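For reference, here is a short sketch (not part of the original notes) of how an RBF kernel matrix can be computed with numpy, using the expansion $\|x - z\|^2 = \|x\|^2 - 2 x^\top z + \|z\|^2$.

import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Pairwise squared distances via ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                - 2 * A @ B.T
                + np.sum(B ** 2, axis=1)[None, :])
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
K = rbf_kernel_matrix(A, A, sigma=0.5)
print(K.shape, np.allclose(K, K.T))   # (4, 4) True; the diagonal entries equal 1

In scikit-learn's SVC below, the RBF kernel is parameterized by gamma rather than by the bandwidth directly; gamma corresponds to $1 / (2\sigma^2)$.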
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='rbf', gamma=.5)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
When using an RBF kernel, we may see multiple decision boundaries around
different sets of points (as well as a larger number of support vectors). These
unusual shapes can be viewed as the projection of a linear max-margin separating
hyperplane from a higher-dimensional (or infinite-dimensional) feature space into
the original 2D space where $x$ lives.
There are even more kernel options in sklearn and we encourage you to try more
of them using the same data.
Not every function $K(x, z)$ is a valid kernel: it must correspond to a dot product $\phi(x)^\top \phi(z)$ for some feature map $\phi$. This requirement imposes a condition on the matrix of kernel values.
Consider the matrix $L \in \mathbb{R}^{n \times n}$ defined as $L_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top \phi(x^{(j)})$.
We claim that $L$ must be symmetric and positive semidefinite. Indeed, $L$ is
symmetric because the dot product $\phi(x^{(i)})^\top \phi(x^{(j)})$ is symmetric. Moreover, for
any $z$,
$$
\begin{aligned}
z^\top L z &= \sum_{i=1}^n \sum_{j=1}^n z_i L_{ij} z_j = \sum_{i=1}^n \sum_{j=1}^n z_i \, \phi(x^{(i)})^\top \phi(x^{(j)}) \, z_j \\
&= \sum_{i=1}^n \sum_{j=1}^n z_i \left( \sum_k \phi(x^{(i)})_k \, \phi(x^{(j)})_k \right) z_j \\
&= \sum_k \sum_{i=1}^n \sum_{j=1}^n z_i \, \phi(x^{(i)})_k \, \phi(x^{(j)})_k \, z_j \\
&= \sum_k \left( \sum_{i=1}^n z_i \, \phi(x^{(i)})_k \right)^2 \geq 0.
\end{aligned}
$$
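This property is easy to check numerically. The sketch below (not part of the original notes, with made-up data) builds an RBF kernel matrix and confirms that its eigenvalues are non-negative up to floating-point error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Kernel matrix L_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
L = np.exp(-sq_dists / 2.0)

eigenvalues = np.linalg.eigvalsh(L)      # L is symmetric, so eigvalsh applies
print(np.all(eigenvalues >= -1e-10))     # True: L is positive semidefinite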
Overall, kernels are very powerful because they can be used throughout machine
learning.
By Cornell University
© Copyright 2023.