
Machine Learning Course - CS-433

Kernel Ridge Regression and


the Kernel Trick

Oct 31, 2019

© Mohammad Emtiyaz Khan 2015; changes by Martin Jaggi 2016, 2017; changes by Rüdiger Urbanke 2018; changes by Martin Jaggi 2019
Last updated on: October 31, 2019
Motivation
The ridge solution w⋆ ∈ R^D has a counterpart α⋆ ∈ R^N. Using duality, we will establish a relationship between w⋆ and α⋆ which leads the way to kernels.

Ridge regression
Recall the ridge regression problem

\min_{w} \; \frac{1}{2}\|y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2 .

For its solution, we have that

w^\star = (X^\top X + \lambda I_D)^{-1} X^\top y
        = X^\top (X X^\top + \lambda I_N)^{-1} y =: X^\top \alpha^\star,

where \alpha^\star := (X X^\top + \lambda I_N)^{-1} y.

This can be proved using the following identity: let P be an N × D matrix and Q a D × N matrix, and let both PQ + I_N and QP + I_D be invertible. Then

(PQ + I_N)^{-1} P = P \, (QP + I_D)^{-1},

which follows from P(QP + I_D) = (PQ + I_N)P after multiplying by the two inverses.
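
The identity is easy to sanity-check numerically. Below is a small sketch with random matrices of the stated shapes (the sizes themselves are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
N, D = 6, 4                                    # arbitrary sizes
P = rng.standard_normal((N, D))
Q = rng.standard_normal((D, N))

lhs = np.linalg.solve(P @ Q + np.eye(N), P)    # (PQ + I_N)^{-1} P
rhs = P @ np.linalg.inv(Q @ P + np.eye(D))     # P (QP + I_D)^{-1}
print(np.allclose(lhs, rhs))                   # True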

What are the computational complexities for the above two ways of computing w⋆?
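
As a concrete comparison, here is a minimal NumPy sketch (toy sizes and regularizer chosen arbitrarily) computing w⋆ both ways. The first expression solves a D × D system, roughly O(ND² + D³); the second solves an N × N system, roughly O(N²D + N³), so the cheaper route depends on whether D or N is smaller.

import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 5, 0.1                      # toy sizes and regularizer
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# First expression: solve a D x D system.
w1 = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Second (dual) expression: solve an N x N system, then map back.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w2 = X.T @ alpha

print(np.allclose(w1, w2))                  # True: both expressions agree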
With this, we know that w⋆ = Xᵀα⋆ lies in the column space of Xᵀ, where

X = \begin{pmatrix}
x_{11} & x_{12} & \dots  & x_{1D} \\
x_{21} & x_{22} & \dots  & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots  & x_{ND}
\end{pmatrix}.

The representer theorem

The representer theorem generalizes this result: for a w⋆ minimizing the following objective, for any loss functions L_n,

\min_{w} \; \sum_{n=1}^{N} L_n(x_n^\top w, y_n) + \frac{\lambda}{2}\|w\|^2 ,

there exists α⋆ such that

w^\star = X^\top \alpha^\star .

Such a general statement was originally proved by Schölkopf, Herbrich and Smola (2001).
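
The statement can also be checked numerically for losses other than the squared error. Below is a small sketch (toy data; the logistic loss is chosen purely for illustration) that minimizes a regularized loss with a generic optimizer and then verifies that the minimizer lies in the column space of Xᵀ. With D > N this is a strict subspace of R^D, so the check is meaningful.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, D, lam = 10, 30, 0.5                     # D > N: column space of X^T is a strict subspace
X = rng.standard_normal((N, D))
y = rng.choice([-1.0, 1.0], size=N)         # labels for a logistic loss

# L2-regularized logistic loss; any loss L_n fits the representer theorem.
def objective(w):
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + 0.5 * lam * w @ w

w_star = minimize(objective, np.zeros(D)).x

# The theorem says w* = X^T alpha for some alpha; recover alpha by least
# squares and check that w* indeed lies in the column space of X^T.
alpha, *_ = np.linalg.lstsq(X.T, w_star, rcond=None)
print(np.allclose(X.T @ alpha, w_star, atol=1e-4))   # True (up to optimizer tolerance)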
Kernelized ridge regression
The representer theorem allows us to write an equivalent optimization problem in terms of α. For example, for ridge regression, the following two problems are equivalent:

w^\star = \arg\min_{w} \; \frac{1}{2}\|y - Xw\|^2 + \frac{\lambda}{2}\|w\|^2

\alpha^\star = \arg\max_{\alpha} \; -\frac{1}{2}\alpha^\top (X X^\top + \lambda I_N)\alpha + \alpha^\top y ,

i.e. they both have the same optimal value. Also, we can always have the correspondence mapping w = Xᵀα.

Most importantly, the second problem is expressed in terms of the matrix XXᵀ. This is our first example of a kernel matrix.

Note: To see the equivalence, you can show that we obtain equal optimal values for the two problems. Setting the gradient of the second objective with respect to α to zero gives (XXᵀ + λI_N)α = y, and solving for α results in α⋆ = (XXᵀ + λI_N)⁻¹y.

If we combine this with the representer theorem w⋆ = Xᵀα⋆, we recover the dual form of the ridge solution, w⋆ = Xᵀ(XXᵀ + λI_N)⁻¹y.
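
The note translates directly into code. Here is a minimal NumPy sketch (toy data, arbitrary λ) that solves (XXᵀ + λI_N)α = y and maps back via w⋆ = Xᵀα⋆; as a consequence not spelled out above, predictions xᵀw⋆ then only require inner products xᵀx_n, i.e. kernel evaluations, which is exactly what the following sections exploit.

import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 40, 6, 0.3                        # toy sizes and regularizer
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

K = X @ X.T                                   # kernel matrix of the linear kernel
alpha_star = np.linalg.solve(K + lam * np.eye(N), y)

# Correspondence from the representer theorem.
w_star = X.T @ alpha_star

# Predictions for new inputs need only inner products with the training
# points: x^T w* = sum_n alpha_n (x^T x_n).
X_new = rng.standard_normal((3, D))
print(np.allclose(X_new @ w_star, (X_new @ X.T) @ alpha_star))   # True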
Advantages of kernelized ridge regression
First, it can be computationally more efficient in some cases: solving the N × N system in α is cheaper than solving the D × D system in w whenever N < D.

Second, by defining K = XXᵀ, we can work directly with K and never have to worry about X. This is the kernel trick.

Third, working with α is sometimes advantageous, and provides additional information for each datapoint (e.g. as in SVMs).

Kernel functions
The linear kernel is defined below:

K = X X^\top = \begin{pmatrix}
x_1^\top x_1 & x_1^\top x_2 & \dots  & x_1^\top x_N \\
x_2^\top x_1 & x_2^\top x_2 & \dots  & x_2^\top x_N \\
\vdots       & \vdots       & \ddots & \vdots       \\
x_N^\top x_1 & x_N^\top x_2 & \dots  & x_N^\top x_N
\end{pmatrix}

The kernel matrix for basis functions φ(x), with K := ΦΦᵀ, is shown below:

K = \begin{pmatrix}
\phi(x_1)^\top \phi(x_1) & \phi(x_1)^\top \phi(x_2) & \dots  & \phi(x_1)^\top \phi(x_N) \\
\phi(x_2)^\top \phi(x_1) & \phi(x_2)^\top \phi(x_2) & \dots  & \phi(x_2)^\top \phi(x_N) \\
\vdots                   & \vdots                   & \ddots & \vdots                   \\
\phi(x_N)^\top \phi(x_1) & \phi(x_N)^\top \phi(x_2) & \dots  & \phi(x_N)^\top \phi(x_N)
\end{pmatrix}

The kernel trick
A big advantage of using kernels is that we do not need to specify φ(x) explicitly, since we can work directly with K.

We will use a kernel function κ(x, x′) and compute the (i, j)-th entry of K as K_ij = κ(x_i, x_j). A kernel function κ is usually associated with a feature map φ, such that

κ(x, x′) := φ(x)ᵀφ(x′) .

For example, for the trivial or ‘linear kernel’ κ(x, x′) := xᵀx′, the feature map is just the original features, φ(x) = x.

Another example: for data x ∈ R³, the kernel κ(x, x′) := (xᵀx′)² = (x₁x′₁ + x₂x′₂ + x₃x′₃)² corresponds to

\phi(x) = \big( x_1^2,\; x_2^2,\; x_3^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3,\; \sqrt{2}\,x_2 x_3 \big)^\top .

The good news is that the evaluation of a kernel is often faster when using κ instead of φ.
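
As a quick check of the last example, here is a minimal sketch (the names phi and kappa are ad hoc) verifying that a single inner product in R³ followed by squaring gives the same value as an inner product of the six-dimensional features:

import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^3.
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2, np.sqrt(2) * x1 * x3, np.sqrt(2) * x2 * x3])

def kappa(x, xp):
    # Kernel evaluation without the feature map: one inner product in R^3.
    return (x @ xp) ** 2

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
print(np.isclose(kappa(x, xp), phi(x) @ phi(xp)))   # True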
Visualization
Why would we want such general feature maps?

See the video explaining linear separation in the kernel space (where φ(x) maps to) corresponding to non-linear separation in the original x-space: https://www.youtube.com/watch?v=3liCbRZPrZA

Examples of kernels
The above kernel is an example of the polynomial kernel. Another example is the Radial Basis Function (RBF) kernel,

\kappa(x, x') = \exp\!\left( -\frac{1}{2} (x - x')^\top (x - x') \right) .

See more examples in Section 14.2 of Murphy's book.
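
As an illustration, here is a small sketch (toy points, unit length-scale as in the formula above) that evaluates the RBF kernel pairwise to build the N × N kernel matrix; note that κ(x, x) = 1, and entries decay towards 0 as points move apart.

import numpy as np

def rbf_kernel(x, xp):
    # RBF kernel as defined above (unit length-scale).
    d = x - xp
    return np.exp(-0.5 * d @ d)

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))          # 5 toy points in R^2

# Kernel matrix: K_ij = kappa(x_i, x_j); entries lie in (0, 1], diagonal is 1.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print(np.allclose(np.diag(K), 1.0))      # True, since kappa(x, x) = exp(0) = 1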

A natural question is the following: how can we ensure that there exists a φ corresponding to a given kernel K? The answer is: as long as the kernel satisfies certain properties.
Properties of a kernel
A kernel function must be an inner product in some feature space. Here are a few properties that ensure this is the case.
1. K should be symmetric, i.e. κ(x, x′) = κ(x′, x).
2. For any arbitrary input set {x_n} and all N, the kernel matrix K should be positive semi-definite.
An important subclass is the positive-definite kernel functions, giving rise to infinite-dimensional feature spaces.
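
These two properties can be checked numerically for a given kernel on a finite input set. Below is a minimal sketch (reusing the RBF kernel from above on arbitrary points) that tests symmetry and the sign of the eigenvalues of K; of course, passing the check for one input set does not prove the property in general.

import numpy as np

def rbf_kernel(x, xp):
    d = x - xp
    return np.exp(-0.5 * d @ d)

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 3))                 # arbitrary input set {x_n}
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Property 1: symmetry.
print(np.allclose(K, K.T))                      # True
# Property 2: positive semi-definiteness, i.e. all eigenvalues >= 0
# (up to floating-point round-off).
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True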
Exercises
1. Understand the relationship w⋆ = Xᵀα⋆. Understand the statement of the representer theorem.
2. Show that the primal and dual formulations of ridge regression are equivalent. Hint: show that the optimization problems corresponding to w and α have the same optimal value.
3. Get familiar with various examples of kernels. See Section 6.2 of Bishop on examples of kernel construction. Read Section 14.2 of Murphy's book for examples of kernels.
4. Revise and understand the difference between positive-definite and positive semi-definite matrices.
5. If you are interested in more details about kernels, read about Mercer and Matérn kernels from Kevin Murphy's Section 14.2. There is also a small note by Matthias Seeger on the git repository under lectures/07, in particular for the case of infinite-dimensional φ.
