Lecture 14: Kernels - Applied ML
Contents
Lecture 14: Kernels
14.1. The Kernel Trick in SVMs
14.2. Kernelized Ridge Regression
14.3. More on Kernels
In this lecture, we will introduce a new and important concept in machine learning:
the kernel. So far, the majority of the machine learning models we have seen have
been linear. Kernels are a general way to make many of these models non-linear.
In the previous two lectures, we introduced linear SVMs. Kernels will be a way to
make the SVM algorithm suitable for dealing with non-linear data.
14.1. The Kernel Trick in SVMs
Linear models for binary classification take the form
$$
f_\theta(x) = \theta^\top \phi(x) + \theta_0,
$$
where $x$ is the input and $y \in \{-1, 1\}$ is the target. Support vector machines are a
machine learning algorithm that fits a linear model by finding the maximum-margin
separating hyperplane between the two classes.
Recall that the max-margin hyperplane can be formulated as the solution to the
following primal optimization problem:
$$
\min_{\theta, \theta_0, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{i=1}^n \xi_i
$$
$$
\text{subject to } \; y^{(i)}\left((x^{(i)})^\top \theta + \theta_0\right) \geq 1 - \xi_i \;\text{ for all } i, \qquad \xi_i \geq 0.
$$
The solution to this problem also happens to be given by the following dual
problem:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)}
$$
$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0, \qquad C \geq \lambda_i \geq 0 \;\text{ for all } i.
$$
We can obtain a primal solution from the dual via the following equation:
$$
\theta^* = \sum_{i=1}^n \lambda_i^* y^{(i)} x^{(i)}.
$$
Ignoring the $\theta_0$ term for now, the score at a new point $x'$ will equal
$$
(\theta^*)^\top x' = \sum_{i=1}^n \lambda_i^* y^{(i)} (x^{(i)})^\top x'.
$$
Let's now make a brief digression and recall how features can make a linear model non-linear. A polynomial of degree $p$ in a scalar variable $x$ has the form
$$
a_p x^p + a_{p-1} x^{p-1} + \dots + a_1 x + a_0.
$$
We can define the polynomial feature function
$$
\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}
$$
and write the polynomial model as
$$
f_\theta(x) := \sum_{j=0}^p \theta_j x^j = \theta^\top \phi(x).
$$
Crucially, observe that f θ is a linear model with input features ϕ(x) and
parameters θ. The parameters θ are the coefficients of the polynomial. Thus, we
can use our algorithms for linear regression to learn θ. This yields a polynomial
model for y given x that is non-linear in x (because ϕ(x) is non-linear in x).
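To make this concrete, here is a minimal numpy sketch (not part of the original notes; the dataset, degree, and regularization strength are made up for illustration) that builds the polynomial features $\phi(x)$ and learns $\theta$ with the regularized normal equations.

import numpy as np

# Toy 1D dataset (made up for illustration): y is a noisy cubic function of x.
np.random.seed(0)
x = np.linspace(-1, 1, 30)
y = 2 * x**3 - x + 0.1 * np.random.randn(30)

p, lam = 3, 1e-3                                # assumed degree and ridge strength
Phi = np.vander(x, N=p + 1, increasing=True)    # each row is [1, x, x^2, ..., x^p]

# Regularized normal equations: theta = (Phi^T Phi + lam I)^{-1} Phi^T y
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p + 1), Phi.T @ y)
print(theta)    # fitted coefficients theta_0, ..., theta_p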
The disadvantage of the above approach is that it requires more features and more
computation. When $x$ is a scalar, we need $O(p)$ features. When applying the
normal equations to compute $\theta$, we need $O(p^3)$ time. More generally, when $x$ is a
$d$-dimensional vector, we will need $O(d^p)$ features to represent a full polynomial,
and the time complexity of applying the normal equations will be even greater.

Returning to the SVM, the dual objective with featurized inputs $\phi(x)$ becomes
$$
J(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} \phi(x^{(i)})^\top \phi(x^{(k)}),
$$
and the score at a new point $x'$ becomes
$$
\sum_{i=1}^n \lambda_i^* y^{(i)} \phi(x^{(i)})^\top \phi(x').
$$
Notice that in both equations, the features ϕ(x) are never used directly. Only their
dot product is used. If we can compute the dot product efficiently, we can
potentially use very complex features.
Can we compute the dot product ϕ(x) ⊤ ϕ(x ′ ) of polynomial features ϕ(x) more
efficiently than using the standard definition of a dot product? It turns out that we
can. Let’s look at an example.
Suppose $x \in \mathbb{R}^d$, and consider the feature map whose entries are all pairwise products of the entries of $x$:
$$
\phi(x) = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_d & x_2 x_1 & \cdots & x_d x_d \end{bmatrix}^\top.
$$
These features consist of all the pairwise products among all the entries of $x$; for $d = 3$, $\phi(x)$ has the nine entries $x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, \dots, x_3 x_3$. The dot product between two such feature vectors is
$$
\phi(x)^\top \phi(z) = \sum_{i=1}^d \sum_{j=1}^d x_i x_j z_i z_j.
$$
Normally, computing this dot product involves a sum over $d^2$ terms and takes $O(d^2)$ time.
An alternative way of computing the dot product $\phi(x)^\top \phi(z)$ is to instead compute
$(x^\top z)^2$. One can check that this has the same result as follows:
$$
\begin{aligned}
(x^\top z)^2 &= \left( \sum_{i=1}^d x_i z_i \right)^2 \\
&= \left( \sum_{i=1}^d x_i z_i \right) \cdot \left( \sum_{j=1}^d x_j z_j \right) \\
&= \sum_{i=1}^d \sum_{j=1}^d x_i z_i x_j z_j \\
&= \phi(x)^\top \phi(z).
\end{aligned}
$$
But computing (x ⊤ z) 2 can be done in only O(d) time: we simply compute the dot
product x ⊤ z in O(d) time, and then square the resulting scalar. This is much
faster than the naive O(d 2 ) procedure.
More generally, we can consider degree-$p$ features $\phi_p(x)$ consisting of all $p$-wise products $x_{i_1} x_{i_2} \cdots x_{i_p}$; there are $O(d^p)$ of them, so the naive dot product would take $O(d^p)$ time. However, using a version of the above argument, we can compute the dot product
$\phi_p(x)^\top \phi_p(z)$ in this feature space in only $O(d)$ time for any $p$ as follows:
$$
\phi_p(x)^\top \phi_p(z) = (x^\top z)^p.
$$
The key takeaways are:
We can compute the dot product between $O(d^p)$ features in only $O(d)$ time.
We can use high-dimensional features within ML algorithms that only rely on
dot products without incurring extra costs.
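As a quick sanity check, here is a small sketch (not part of the original notes) that verifies numerically that the dot product of the pairwise-product features equals $(x^\top z)^2$.

import numpy as np

def phi_pairwise(x):
    # All pairwise products x_i * x_j, flattened into a d^2-dimensional vector.
    return np.outer(x, x).ravel()

d = 5
rng = np.random.default_rng(0)
x, z = rng.normal(size=d), rng.normal(size=d)

naive = phi_pairwise(x) @ phi_pairwise(z)   # O(d^2) features and O(d^2) work
fast = (x @ z) ** 2                         # O(d) work via the kernel trick
print(np.isclose(naive, fast))              # True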
This motivates the following definition. Given features $\phi(x)$, we define the function
$$
K(x, z) = \phi(x)^\top \phi(z).
$$
We will call $K$ the kernel function. Recall that an example of a useful kernel function is the polynomial kernel
$$
K(x, z) = (x \cdot z)^p.
$$
We can rewrite the SVM dual entirely in terms of the kernel function:
$$
\max_{\lambda} \; \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} K(x^{(i)}, x^{(k)})
$$
$$
\text{subject to } \; \sum_{i=1}^n \lambda_i y^{(i)} = 0, \qquad C \geq \lambda_i \geq 0 \;\text{ for all } i.
$$
Similarly, the score at a new point $x'$ becomes
$$
\sum_{i=1}^n \lambda_i^* y^{(i)} K(x^{(i)}, x').
$$
We can efficiently use any features $\phi(x)$ (e.g., polynomial features of any degree $p$)
as long as the kernel function computes the dot products of the $\phi(x)$ efficiently.
We will see several examples of kernel functions below.
The kernel trick means that we can use complex non-linear features within these
algorithms at little additional computational cost.
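To see how a kernel plugs into the SVM dual in practice, here is a hedged sketch: it computes a polynomial Gram matrix explicitly and passes it to scikit-learn's SVC with kernel='precomputed'. The dataset and kernel parameters below are made up; the examples later in this lecture instead use sklearn's built-in kernel options.

import numpy as np
from sklearn.svm import SVC

# Tiny made-up 2D dataset with a non-linear class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0).astype(int)

def poly_kernel(A, B, p=3):
    # K(x, z) = (x . z + 1)^p evaluated for all pairs of rows of A and B.
    return (A @ B.T + 1.0) ** p

K = poly_kernel(X, X)                              # n x n matrix of kernel values
clf = SVC(kernel='precomputed', C=1.0).fit(K, y)

# Predictions require the kernel between new points and the training points.
X_new = rng.normal(size=(5, 2))
print(clf.predict(poly_kernel(X_new, X)))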
14.2. Kernelized Ridge Regression
Support vector machines are far from the only algorithm that benefits from
kernels. Another algorithm that supports kernels is ridge regression.
Recall that ridge regression fits a model of the form
$$
f_\theta(x) = \theta^\top \phi(x)
$$
by minimizing the regularized objective
$$
J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^d \theta_j^2.
$$
Recall that the normal equations give the ridge regression solution
$$
\theta = (X^\top X + \lambda I)^{-1} X^\top y.
$$
When the vectors of attributes $x^{(i)}$ are featurized, we can write this as
$$
\theta = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y,
$$
where $\Phi$ is the matrix of featurized inputs,
$$
\Phi = \begin{bmatrix} - & \phi(x^{(1)})^\top & - \\ - & \phi(x^{(2)})^\top & - \\ & \vdots & \\ - & \phi(x^{(n)})^\top & - \end{bmatrix}
= \begin{bmatrix} \phi(x^{(1)})_1 & \phi(x^{(1)})_2 & \cdots & \phi(x^{(1)})_p \\ \phi(x^{(2)})_1 & \phi(x^{(2)})_2 & \cdots & \phi(x^{(2)})_p \\ \vdots & & & \vdots \\ \phi(x^{(n)})_1 & \phi(x^{(n)})_2 & \cdots & \phi(x^{(n)})_p \end{bmatrix}.
$$
Using the push-through matrix identity
$$
(\lambda I + UV)^{-1} U = U (\lambda I + VU)^{-1},
$$
we can rewrite this solution as
$$
\theta = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I)^{-1} y = \Phi^\top \alpha,
$$
where we have defined $\alpha := (\Phi \Phi^\top + \lambda I)^{-1} y$.
The proof sketch for the identity is: start with $U(\lambda I + VU) = (\lambda I + UV)U$ and multiply both
sides by $(\lambda I + VU)^{-1}$ on the right and $(\lambda I + UV)^{-1}$ on the left. If you are
interested, you can try to derive it in detail by yourself.
The first approach takes $O(p^3)$ time; the second is $O(n^3)$ and is faster when $p > n$.
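The equivalence of the two formulas is easy to verify numerically. Here is a small sketch (not part of the original notes; the dimensions are made up) that confirms both expressions yield the same $\theta$ when $p > n$.

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 0.1            # more features than data points (p > n)
Phi = rng.normal(size=(n, p))
y = rng.normal(size=n)

# O(p^3) route: solve in feature space.
theta_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
# O(n^3) route: solve in "sample space" using the push-through identity.
theta_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

print(np.allclose(theta_primal, theta_dual))   # True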
An interesting corollary is that the optimal $\theta$ is a linear combination of the $n$ featurized training inputs:
$$
\theta = \sum_{i=1}^n \alpha_i \phi(x^{(i)}),
$$
where the weights are given by
$$
\alpha_i = \sum_{j=1}^n L_{ij} y_j, \qquad \text{where } L = (\Phi \Phi^\top + \lambda I)^{-1}.
$$
The prediction of the model at a new point $x'$ is therefore
$$
\phi(x')^\top \theta = \sum_{i=1}^n \alpha_i \, \phi(x')^\top \phi(x^{(i)}).
$$
The crucial observation is that the features ϕ(x) are never used directly in this
equation. Only their dot product is used!
We also don't need the features $\phi$ for learning $\theta$, just their dot products! First, recall that
each row $i$ of $\Phi$ is the $i$-th featurized input $\phi(x^{(i)})^\top$. Thus $K = \Phi \Phi^\top$ is the matrix
of all dot products between all the $\phi(x^{(i)})$:
$$
K_{ij} = \phi(x^{(i)})^\top \phi(x^{(j)}).
$$
We can therefore compute $\alpha = (K + \lambda I)^{-1} y$ using only these dot products, and, as above,
predictions also require only dot products:
$$
\phi(x')^\top \theta = \sum_{i=1}^n \alpha_i \, \phi(x')^\top \phi(x^{(i)}).
$$
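Putting the pieces together, here is a minimal kernel ridge regression sketch (not part of the original notes; the dataset and the choice of a polynomial kernel are made up for illustration): we form the kernel matrix $K$, solve for $\alpha = (K + \lambda I)^{-1} y$, and predict with kernel evaluations against the training points.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 25)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=25)   # made-up regression data

def kernel(a, b, p=4):
    # Polynomial kernel K(x, z) = (x z + 1)^p for scalar inputs, evaluated pairwise.
    return (np.outer(a, b) + 1.0) ** p

lam = 0.1
K = kernel(x_train, x_train)                                # n x n matrix of kernel values
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# Prediction at new points: f(x') = sum_i alpha_i K(x', x^(i)).
x_test = np.linspace(-1, 1, 5)
y_pred = kernel(x_test, x_train) @ alpha
print(y_pred)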
14.3. More on Kernels
We have seen two examples of kernelized algorithms: SVMs and ridge regression.
Let's now look at some additional examples of kernels.
We will look at a few examples of kernels using the following dataset. The following
code visualizes the dataset.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
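The code cell that constructs the 2D dataset did not survive the conversion of this page. Below is a minimal stand-in (assumed, not the original data) that defines a small two-class dataset X, Y and plots it, reusing the imports above, so that the SVM cells below can be run.

# Stand-in for the original dataset: 16 points in two classes, arranged in a few clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1.3, -1.0], scale=0.4, size=(4, 2)),
               rng.normal(loc=[-0.6, 1.4], scale=0.4, size=(4, 2)),
               rng.normal(loc=[1.1, 0.9], scale=0.4, size=(4, 2)),
               rng.normal(loc=[0.3, -2.2], scale=0.4, size=(4, 2))])
Y = np.array([0] * 8 + [1] * 8)

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.xlim(-3, 3)
plt.ylim(-3, 3)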
The above figure is a visualization of the 2D dataset we have created. We will then
use it to illustrate the difference between various kernels.
The simplest example is the linear kernel,
$$
K(x, z) = x^\top z,
$$
which is just the ordinary dot product of the raw inputs.
Below is an example of how we can use the SVM implementation in sklearn with a
linear kernel. Internally, this solves the dual SVM optimization problem.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='linear', gamma=2)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
The above figure shows that a linear kernel provides linear decision boundaries to
separate the data.
A more powerful example is the polynomial kernel,
$$
K(x, z) = (x^\top z + c)^p.
$$
This corresponds to a mapping to a feature space of dimension $\binom{d+p}{p}$ that has all
monomials $x_{i_1} x_{i_2} \cdots x_{i_p}$ of degree at most $p$.
For example, when $d = 3$ and $p = 2$, the feature map corresponding to $K(x, z) = (x^\top z + c)^2$ is
$$
\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix}.
$$
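One can verify numerically that this feature map indeed reproduces the kernel. The sketch below (not part of the original notes) checks the identity for random $x$ and $z$ with $d = 3$, $p = 2$, and $c = 1$.

import numpy as np

def phi_poly2(x, c=1.0):
    # Explicit features for K(x, z) = (x . z + c)^2 when d = 3:
    # all pairwise products, the inputs scaled by sqrt(2c), and the constant c.
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(phi_poly2(x) @ phi_poly2(z), (x @ z + 1.0) ** 2))   # True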
Below is an example of how we can use the polynomial kernel with the SVM implementation in sklearn.
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='poly', degree=3, gamma=2)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
The above figure visualizes the decision boundary of an SVM with a polynomial kernel of
degree 3 and gamma 2. The decision boundary is non-linear, in contrast to that of the linear
kernel.
Another important example is the radial basis function (RBF) kernel,
$$
K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right),
$$
where $\sigma$ is a hyper-parameter called the bandwidth. This kernel corresponds to an
infinite-dimensional feature map.
To see why that's intuitively the case, consider the Taylor expansion
$$
\exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right) = \sum_{j=0}^{\infty} \frac{1}{j!} \left( -\frac{\|x - z\|^2}{2\sigma^2} \right)^j.
$$
Each term on the right-hand side can be expanded into a polynomial. Thus, the
above infinite series contains an infinite number of polynomial terms, and
consequently an infinite number of polynomial features.
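For reference, here is a short sketch (not part of the original notes) of how an RBF kernel matrix can be computed with numpy, using the expansion $\|x - z\|^2 = \|x\|^2 - 2 x^\top z + \|z\|^2$.

import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    # Pairwise squared distances via ||x - z||^2 = ||x||^2 - 2 x.z + ||z||^2.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                - 2 * A @ B.T
                + np.sum(B ** 2, axis=1)[None, :])
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
K = rbf_kernel_matrix(A, A, sigma=0.5)
print(K.shape, np.allclose(K, K.T))   # (4, 4) True; the diagonal entries equal 1

In scikit-learn's SVC below, the RBF kernel is parameterized by gamma rather than by the bandwidth directly; gamma corresponds to $1 / (2\sigma^2)$.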
# Adapted from https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
clf = svm.SVC(kernel='rbf', gamma=.5)
clf.fit(X, Y)

# plot the line, the points, and the nearest vectors to the plane
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
            edgecolors='k')

x_min, x_max, y_min, y_max = -3, 3, -3, 3   # plotting range
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
When using an RBF kernel, we may see multiple decision boundaries around
different sets of points (as well as a larger number of support vectors). These
unusual shapes can be viewed as the projection of a linear max-margin separating
hyperplane from a higher-dimensional (or infinite-dimensional) feature space into
the original 2D space where $x$ lives.
There are even more kernel options in sklearn and we encourage you to try more
of them using the same data.
Not every function $K(x, z)$ is a valid kernel: it must correspond to a dot product $\phi(x)^\top \phi(z)$ for some feature map $\phi$. This requirement imposes a condition on the matrix of kernel values.
Consider the matrix $L \in \mathbb{R}^{n \times n}$ defined as $L_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^\top \phi(x^{(j)})$.
We claim that $L$ must be symmetric and positive semidefinite. Indeed, $L$ is
symmetric because the dot product $\phi(x^{(i)})^\top \phi(x^{(j)})$ is symmetric. Moreover, for
any $z$,
$$
\begin{aligned}
z^\top L z &= \sum_{i=1}^n \sum_{j=1}^n z_i L_{ij} z_j = \sum_{i=1}^n \sum_{j=1}^n z_i \, \phi(x^{(i)})^\top \phi(x^{(j)}) \, z_j \\
&= \sum_{i=1}^n \sum_{j=1}^n z_i \left( \sum_k \phi(x^{(i)})_k \, \phi(x^{(j)})_k \right) z_j \\
&= \sum_k \sum_{i=1}^n \sum_{j=1}^n z_i \, \phi(x^{(i)})_k \, \phi(x^{(j)})_k \, z_j \\
&= \sum_k \left( \sum_{i=1}^n z_i \, \phi(x^{(i)})_k \right)^2 \geq 0.
\end{aligned}
$$
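This property is easy to check numerically. The sketch below (not part of the original notes, with made-up data) builds an RBF kernel matrix and confirms that its eigenvalues are non-negative up to floating-point error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Kernel matrix L_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
L = np.exp(-sq_dists / 2.0)

eigenvalues = np.linalg.eigvalsh(L)      # L is symmetric, so eigvalsh applies
print(np.all(eigenvalues >= -1e-10))     # True: L is positive semidefinite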
Overall, kernels are very powerful because they can be used throughout machine
learning.
By Cornell University
© Copyright 2023.