in Machine Learning
Arthur Gretton
Overview
The course has two parts:
Kernel methods (Arthur Gretton)
Convex optimization
Course times, locations
Lecture times:
Optimization lectures are Wednesday, 14:00-15:30
Kernel lectures will be on Friday, 15:00-16:30
There will be lectures during reading week, due to a clash with the NeurIPS conference.
The tutor for the kernels part is Dimitri Meunier.
Lecture notes are online:
http://www.gatsby.ucl.ac.uk/gretton/rkhscourse.html
A motivation: comparing two samples
Given: Samples from unknown distributions P and Q.
Goal: do P and Q differ?
A real-life example: two-sample tests
Goal: do P and Q differ?
Testing goodness of fit
Given: a model P and samples Q.
Goal: is P a good fit for Q?
Testing independence
Given: Samples from a distribution $P_{XY}$
Goal: Are X and Y independent?
X Y
A large animal who slings slobber,
exudes a distinctive houndy odor,
and wants nothing more than to
follow his nose.
A responsive, interactive
pet, one that will blow in
your ear and follow you
everywhere.
Text from dogtime.com and petfinder.com
Course overview (kernels part)
1 Construction of RKHS
2 Simple linear algorithms in RKHS (e.g. PCA, ridge regression)
3 Kernel methods for hypothesis testing (two-sample, independence, goodness-of-fit)
4 Further applications of kernels (feature selection, clustering, ...)
5 Support vector machines for classification, regression
6 Cutting-edge kernel algorithms (to be a surprise)
Reproducing Kernel Hilbert Spaces
Kernels and feature space (1): XOR example
[Figure: XOR data in the plane, horizontal axis $x_1$, vertical axis $x_2$]
Kernels and feature space (2): document classification
Kernels and feature space (3): smoothing
Outline: reproducing kernel Hilbert space
Hilbert space
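Kernel
Definition
Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$,
$$k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}.$$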
New kernels from old: sums, transformations
Example: $k(x, x') = x^2 (x')^2$.
New kernels from old: products
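Theorem (Products of kernels are kernels)
Given $k_1$ on $\mathcal{X}_1$ and $k_2$ on $\mathcal{X}_2$, then $k_1 \times k_2$ is a kernel on $\mathcal{X}_1 \times \mathcal{X}_2$.
If $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}$, then $k := k_1 \times k_2$ is a kernel on $\mathcal{X}$.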
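New kernels from old: products
Natural feature space for the product: the matrix of pairwise feature products,
$$\Phi(x) = \phi_2(x)\, \phi_1^\top(x).$$
Kernel is:
$$k(x, x') = \sum_{i} \sum_{j} \Phi_{ij}(x) \Phi_{ij}(x') = \mathrm{tr}\Big( \phi_1(x) \underbrace{\phi_2^\top(x) \phi_2(x')}_{k_2(x, x')} \phi_1^\top(x') \Big) = \mathrm{tr}\Big( \underbrace{\phi_1^\top(x') \phi_1(x)}_{k_1(x, x')} \Big) k_2(x, x') = k_1(x, x')\, k_2(x, x'),$$
where $i$ indexes the components of $\phi_2$ and $j$ those of $\phi_1$.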
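The product construction can be checked numerically. The sketch below assumes two small hypothetical feature vectors per input (the values are made up for illustration) and compares the inner product of the outer-product features with the product $k_1 k_2$.

```python
# Hypothetical toy features: phi1 and phi2 are two feature vectors of one input.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    # Matrix u v^T as nested lists.
    return [[a * b for b in v] for a in u]

def frobenius(A, B):
    # Sum over all entries of the elementwise product, i.e. tr(A^T B).
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def product_kernel_via_features(phi1_x, phi2_x, phi1_y, phi2_y):
    # Phi(x) = phi2(x) phi1(x)^T, then k(x, x') = sum_ij Phi_ij(x) Phi_ij(x').
    return frobenius(outer(phi2_x, phi1_x), outer(phi2_y, phi1_y))

phi1_x, phi2_x = [1.0, 0.0], [0.3, 0.7]
phi1_y, phi2_y = [0.5, 0.5], [1.0, 0.0]

lhs = product_kernel_via_features(phi1_x, phi2_x, phi1_y, phi2_y)
rhs = dot(phi1_x, phi1_y) * dot(phi2_x, phi2_y)  # k1(x, x') * k2(x, x')
print(lhs, rhs)
```

Both numbers agree up to floating-point error, which is the trace identity above evaluated on one pair of inputs.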
Sums and products $\Rightarrow$ polynomials
For an integer $m \ge 1$ and a real $c \ge 0$,
$$k(x, x') := \left( \langle x, x' \rangle + c \right)^m$$
is a valid kernel.
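As an illustration, for $d = 2$ and $m = 2$ the polynomial kernel can be matched to an explicit finite feature map. The feature map below is one possible choice obtained by expanding the square; the function names are hypothetical.

```python
import math

def poly_kernel(x, y, c=1.0, m=2):
    # k(x, x') = (<x, x'> + c)^m
    return (sum(a * b for a, b in zip(x, y)) + c) ** m

def poly_features_deg2(x, c=1.0):
    # Explicit features for d = 2, m = 2. The square-root factors absorb the
    # coefficients of the expansion of (x1*y1 + x2*y2 + c)^2.
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2,
            math.sqrt(2.0 * c) * x1, math.sqrt(2.0 * c) * x2, c]

x, y = (0.5, -1.5), (2.0, 0.25)
k_direct = poly_kernel(x, y)
k_features = sum(a * b for a, b in zip(poly_features_deg2(x), poly_features_deg2(y)))
print(k_direct, k_features)
```

The two values coincide, so in this case the kernel is a dot product between six features.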
Infinite sequences
The kernels we’ve seen so far are dot products between finitely many
features. E.g.
Taylor series kernels
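Definition (Taylor series kernel)
For $r \in (0, \infty]$, with $a_n \ge 0$ for all $n \ge 0$,
$$f(z) = \sum_{n=0}^{\infty} a_n z^n, \qquad |z| < r,\ z \in \mathbb{R}.$$
Define $\mathcal{X}$ to be the $\sqrt{r}$-ball in $\mathbb{R}^d$, so $\|x\| < \sqrt{r}$,
$$k(x, x') = f\left( \langle x, x' \rangle \right) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n.$$
Exponential kernel:
$$k(x, x') := \exp\left( \langle x, x' \rangle \right).$$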
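A small numerical sketch of the definition: with $a_n = 1/n!$ (so $r = \infty$) the truncated Taylor series kernel approaches the exponential kernel. The truncation length of 30 terms and the test points are arbitrary choices for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def taylor_kernel(x, y, coeffs):
    # k(x, x') = sum_n a_n <x, x'>^n, truncated to len(coeffs) terms.
    z = dot(x, y)
    return sum(a * z ** n for n, a in enumerate(coeffs))

# a_n = 1 / n! gives the exponential kernel in the limit.
exp_coeffs = [1.0 / math.factorial(n) for n in range(30)]

x, y = (0.3, -0.8, 1.1), (1.2, 0.4, -0.5)
approx = taylor_kernel(x, y, exp_coeffs)
exact = math.exp(dot(x, y))
print(approx, exact)
```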
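Taylor series kernel (proof)
$$k(x, x') = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n$$
By Cauchy-Schwarz, $|\langle x, x' \rangle| \le \|x\| \|x'\| < r$, so the series converges. Each partial sum is a kernel by the sum and product rules, since $a_n \ge 0$.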
Exponentiated quadratic kernel
$$k(x, x') := \exp\left( -\gamma^{-2} \|x - x'\|^2 \right).$$
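One way to see why this is a kernel is to expand the squared distance, which factors the kernel into $g(x)\, \exp(2 \langle x, x' \rangle / \gamma^2)\, g(x')$ with $g(x) = \exp(-\|x\|^2 / \gamma^2)$: an exponential kernel on rescaled inputs multiplied by a rank-one kernel. The sketch below only checks this factorisation numerically; the bandwidth value and function names are illustrative assumptions.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def eq_kernel(x, y, gamma=1.5):
    # k(x, x') = exp(-||x - x'||^2 / gamma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)

def eq_kernel_factored(x, y, gamma=1.5):
    # Uses ||x - x'||^2 = ||x||^2 - 2 <x, x'> + ||x'||^2.
    g_x = math.exp(-dot(x, x) / gamma ** 2)
    g_y = math.exp(-dot(y, y) / gamma ** 2)
    return g_x * math.exp(2.0 * dot(x, y) / gamma ** 2) * g_y

x, y = (0.4, -1.0), (1.3, 0.2)
print(eq_kernel(x, y), eq_kernel_factored(x, y))
```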
Infinite sequences
Definition
The space $\ell_2$ (square summable sequences) comprises all sequences $a := (a_\ell)_{\ell \ge 1}$ for which
$$\|a\|^2_{\ell_2} = \sum_{\ell=1}^{\infty} a_\ell^2 < \infty.$$
Definition
Given a sequence of functions $(\phi_\ell(x))_{\ell \ge 1}$ in $\ell_2$, where $\phi_\ell : \mathcal{X} \to \mathbb{R}$ is the $\ell$th coordinate of $\phi(x)$. Then
$$k(x, x') := \sum_{\ell=1}^{\infty} \phi_\ell(x) \phi_\ell(x') \qquad (1)$$
Infinite sequences (proof)
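Positive definite functions
A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if, for all $n \ge 1$, all $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and all $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \ge 0.$$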
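Kernels are positive definite
Theorem
Let $\mathcal{H}$ be a Hilbert space, $\mathcal{X}$ a non-empty set and $\phi : \mathcal{X} \to \mathcal{H}$. Then $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} =: k(x, y)$ is positive definite.
Proof.
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \langle a_i \phi(x_i), a_j \phi(x_j) \rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|^2_{\mathcal{H}} \ge 0.$$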
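The proof can be traced numerically. The sketch below assumes a hypothetical three-dimensional feature map on $\mathbb{R}$ and random points and weights, and compares the quadratic form with the squared norm of the weighted feature sum.

```python
import random

def phi(x):
    # Hypothetical feature map R -> R^3, chosen only for illustration.
    return [1.0, x, x * x]

def k(x, y):
    return sum(a * b for a, b in zip(phi(x), phi(y)))

random.seed(0)
n = 6
xs = [random.uniform(-2.0, 2.0) for _ in range(n)]
a = [random.uniform(-1.0, 1.0) for _ in range(n)]

# Left-hand side: sum_i sum_j a_i a_j k(x_i, x_j).
quad_form = sum(a[i] * a[j] * k(xs[i], xs[j]) for i in range(n) for j in range(n))

# Right-hand side: || sum_i a_i phi(x_i) ||^2.
combo = [sum(a[i] * phi(xs[i])[d] for i in range(n)) for d in range(3)]
sq_norm = sum(v * v for v in combo)
print(quad_form, sq_norm)
```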
Sum of kernels is a kernel
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \left[ k_1(x_i, x_j) + k_2(x_i, x_j) \right] = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k_1(x_i, x_j) + \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k_2(x_i, x_j) \ge 0$$
The reproducing kernel Hilbert space
First example: finite space, polynomial features
[Figure: data in the plane, horizontal axis $x_1$, vertical axis $x_2$]
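Example: finite space, polynomial features
$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix},$$
with kernel
$$k(x, y) = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}^\top \begin{bmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{bmatrix}$$
(the standard inner product in $\mathbb{R}^3$ between features). Denote this feature space by $\mathcal{H}$.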
Example: finite space, polynomial features
Define a linear function of the inputs $x_1, x_2$, and their product $x_1 x_2$,
$$f(x) = f_1 x_1 + f_2 x_2 + f_3 (x_1 x_2).$$
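The feature map, kernel and linear function of this example fit in a few lines. The coefficient vector `(0, 0, 1)` used below is an illustrative choice: it picks out the product $x_1 x_2$, whose sign distinguishes opposite quadrants as in an XOR pattern.

```python
def phi(x):
    # phi : R^2 -> R^3, x -> (x1, x2, x1 * x2)
    x1, x2 = x
    return [x1, x2, x1 * x2]

def k(x, y):
    # Standard inner product in R^3 between features.
    return sum(a * b for a, b in zip(phi(x), phi(y)))

def f(x, coeffs):
    # f(x) = f1 * x1 + f2 * x2 + f3 * (x1 * x2) = <coeffs, phi(x)>
    return sum(c * p for c, p in zip(coeffs, phi(x)))

x, y = (1.0, -2.0), (3.0, 0.5)
k_xy = k(x, y)
f_x = f(x, (0.0, 0.0, 1.0))
print(k_xy, f_x)
```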
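With infinitely many features, the kernel is
$$k(x, x') = \sum_{\ell=1}^{\infty} \phi_\ell(x) \phi_\ell(x')$$
and a function in the feature space is
$$f(x) = \sum_{\ell=1}^{\infty} f_\ell \phi_\ell(x), \qquad \sum_{\ell=1}^{\infty} f_\ell^2 < \infty.$$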
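Expressing the functions with kernels
Function with exponentiated quadratic kernel:
$$f(x) = \sum_{\ell=1}^{\infty} f_\ell \phi_\ell(x) = \sum_{\ell=1}^{\infty} \underbrace{\left( \sum_{i=1}^{m} \alpha_i \phi_\ell(x_i) \right)}_{f_\ell} \phi_\ell(x) = \left\langle \underbrace{\sum_{i=1}^{m} \alpha_i \phi(x_i)}_{f}, \phi(x) \right\rangle_{\mathcal{H}} = \sum_{i=1}^{m} \alpha_i k(x_i, x),$$
where $f_\ell := \sum_{i=1}^{m} \alpha_i \phi_\ell(x_i)$.
[Figure: plot of $f(x)$ against $x$]
Function of infinitely many features expressed using $\{(\alpha_i, x_i)\}_{i=1}^{m}$.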
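The last expression is directly computable, since it needs only the $m$ pairs $(\alpha_i, x_i)$ and kernel evaluations. A minimal sketch, assuming a one-dimensional exponentiated quadratic kernel in the form $\exp(-(x - x')^2 / (2\sigma^2))$ and made-up weights and centres:

```python
import math

def eq_kernel(x, y, sigma=1.0):
    # Exponentiated quadratic kernel on R.
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def rkhs_function(alphas, centres, sigma=1.0):
    # Returns f with f(x) = sum_i alpha_i k(x_i, x).
    def f(x):
        return sum(a * eq_kernel(c, x, sigma) for a, c in zip(alphas, centres))
    return f

# Hypothetical weights and centres, for illustration only.
alphas = [1.0, -0.5, 0.8]
centres = [-2.0, 0.0, 3.0]
f = rkhs_function(alphas, centres)

for x in (-2.0, 0.0, 3.0):
    print(x, f(x))
```

With a single centre and unit weight, `rkhs_function([1.0], [x1])` is the function $k(x_1, \cdot)$ itself.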
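The feature map is also a function
On previous page,
$$f(x) := \sum_{i=1}^{m} \alpha_i k(x_i, x) = \langle f(\cdot), \phi(x) \rangle_{\mathcal{H}} \qquad \text{where } f_\ell = \sum_{i=1}^{m} \alpha_i \phi_\ell(x_i).$$
What if $m = 1$ and $\alpha_1 = 1$?
Then
$$f(x) = k(x_1, x) = \Big\langle \underbrace{k(x_1, \cdot)}_{f(\cdot)}, \phi(x) \Big\rangle_{\mathcal{H}} = \langle k(x, \cdot), \phi(x_1) \rangle_{\mathcal{H}}$$
....so the feature map is a (very simple) function!
We can write without ambiguity
$$k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}}.$$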
Features vs functions
A subtle point: $\mathcal{H}$ can be larger than the set of all $\phi(x)$.
Understanding smoothness in the RKHS
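Infinite feature space via Fourier series
Function on the interval $[-\pi, \pi]$ with periodic boundary.
Fourier series:
$$f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(\imath \ell x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \left( \cos(\ell x) + \imath \sin(\ell x) \right),$$
using the orthonormal basis on $[-\pi, \pi]$,
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} \exp(\imath \ell x)\, \overline{\exp(\imath m x)}\, dx = \begin{cases} 1 & \ell = m, \\ 0 & \ell \ne m. \end{cases}$$
Example: "top hat" function,
$$f(x) = \begin{cases} 1 & |x| < T, \\ 0 & T \le |x| < \pi. \end{cases}$$
$$\hat{f}_\ell := \frac{\sin(\ell T)}{\ell \pi} \quad (\hat{f}_0 = T / \pi), \qquad f(x) = \hat{f}_0 + \sum_{\ell=1}^{\infty} 2 \hat{f}_\ell \cos(\ell x).$$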
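The top hat coefficients can be checked against a direct numerical integral, and a long partial sum of the cosine series recovers values close to 1 inside the hat and close to 0 outside. The sketch below counts the $\ell = 0$ term once, with value $T/\pi$; the width $T = 1$, the truncation length and the grid size are arbitrary illustrative choices.

```python
import math

def top_hat_coeff(ell, T):
    # \hat f_ell = sin(ell T) / (ell pi); the ell = 0 value is the limit T / pi.
    return T / math.pi if ell == 0 else math.sin(ell * T) / (ell * math.pi)

def partial_sum(x, T, n_terms):
    # \hat f_0 + sum_{ell=1}^{n} 2 \hat f_ell cos(ell x)
    return top_hat_coeff(0, T) + sum(
        2.0 * top_hat_coeff(l, T) * math.cos(l * x) for l in range(1, n_terms + 1))

def numeric_coeff(ell, T, n_grid=20000):
    # Midpoint rule for (1 / 2 pi) * integral over [-T, T] of cos(ell x) dx.
    h = 2.0 * T / n_grid
    total = sum(math.cos(ell * (-T + (i + 0.5) * h)) for i in range(n_grid))
    return total * h / (2.0 * math.pi)

T = 1.0
inside = partial_sum(0.0, T, 2000)    # point with |x| < T
outside = partial_sum(2.5, T, 2000)   # point with T < |x| < pi
print(inside, outside)
```

Convergence is slow because the coefficients decay only like $1/\ell$, which reflects the discontinuity of the top hat.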
Fourier series for top hat function
[Figure: three panels. Top hat function $f(x)$ against $x$; basis function $\cos(\ell \times x)$; Fourier series coefficients $\hat{f}_\ell$ against $\ell$.]
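Fourier series for kernel function
Assume kernel translation invariant,
$$k(x, y) = k(x - y),$$
Fourier series representation of $k$
$$k(x - y) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell (x - y)) = \sum_{\ell=-\infty}^{\infty} \underbrace{\left[ \sqrt{\hat{k}_\ell} \exp(\imath \ell x) \right]}_{\overline{\phi_\ell(x)}} \underbrace{\left[ \sqrt{\hat{k}_\ell} \exp(-\imath \ell y) \right]}_{\phi_\ell(y)}.$$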
Fourier series for Gaussian-spectrum kernel
[Figure: three panels. Kernel $k(x)$ against $x$; basis function $\cos(\ell \times x)$; Fourier series coefficients against $\ell$.]
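RKHS via Fourier series
Recall standard dot product in $L_2$:
$$\langle f, g \rangle_{L_2} = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \overline{g(x)}\, dx$$
$$= \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(\imath \ell x) \right] \overline{\left[ \sum_{m=-\infty}^{\infty} \hat{g}_m \exp(\imath m x) \right]}\, dx$$
$$= \sum_{\ell=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} \hat{f}_\ell \overline{\hat{g}_m}\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \exp(\imath \ell x) \exp(-\imath m x)\, dx$$
$$= \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \overline{\hat{g}_\ell}.$$
Define the dot product in $\mathcal{H}$ to have a roughness penalty,
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell \overline{\hat{g}_\ell}}{\hat{k}_\ell}.$$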
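To make the penalty concrete, the sketch below evaluates truncated versions of $\langle f, f \rangle_{\mathcal{H}} = \sum_\ell |\hat{f}_\ell|^2 / \hat{k}_\ell$ for two sets of coefficients. The Gaussian-shaped spectrum $\hat{k}_\ell = \exp(-\sigma^2 \ell^2 / 2)$ and the "smooth" coefficients are assumptions made for illustration; the top hat coefficients are $\sin(\ell T)/(\ell\pi)$ with $\hat{f}_0 = T/\pi$.

```python
import math

def k_hat(ell, sigma=0.5):
    # Assumed kernel spectrum, decaying like a Gaussian in ell.
    return math.exp(-0.5 * (sigma * ell) ** 2)

def rkhs_sq_norm(f_hat, max_ell):
    # sum over |ell| <= max_ell of |\hat f_ell|^2 / \hat k_ell
    return sum(abs(f_hat(l)) ** 2 / k_hat(l) for l in range(-max_ell, max_ell + 1))

def top_hat_hat(ell, T=1.0):
    # Coefficients of the top hat: decay only like 1 / ell.
    return T / math.pi if ell == 0 else math.sin(ell * T) / (ell * math.pi)

def smooth_hat(ell):
    # Hypothetical smooth function: coefficients decay faster than k_hat.
    return math.exp(-0.5 * ell * ell)

for max_ell in (10, 20, 40):
    print(max_ell, rkhs_sq_norm(smooth_hat, max_ell), rkhs_sq_norm(top_hat_hat, max_ell))
```

Under these assumptions the truncated norm of the smooth function settles quickly, while that of the top hat keeps growing with the truncation level, consistent with the top hat not having a finite norm for this spectrum.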
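Roughness penalty explained
$$\|f\|^2_{\mathcal{H}} = \langle f, f \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell \overline{\hat{f}_\ell}}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} \frac{|\hat{f}_\ell|^2}{\hat{k}_\ell}.$$
If $\hat{k}_\ell$ decays fast, then so must $\hat{f}_\ell$ if we want $\|f\|^2_{\mathcal{H}} < \infty$.
Recall $f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \left( \cos(\ell x) + \imath \sin(\ell x) \right)$.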
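Feature map and reproducing property
Reproducing property: define a function
$$g(x) := k(x - z) = \sum_{\ell=-\infty}^{\infty} \exp(\imath \ell x) \underbrace{\hat{k}_\ell \exp(-\imath \ell z)}_{\hat{g}_\ell}.$$
Then for a function $f(\cdot) \in \mathcal{H}$,
$$\langle f(\cdot), g(\cdot) \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell \overline{\hat{g}_\ell}}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{k}_\ell \exp(-\imath \ell z)}}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(\imath \ell z) = f(z).$$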
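The calculation can be replayed with a finite number of Fourier modes. In the sketch below the spectrum $\hat{k}_\ell$, the coefficients $\hat{f}_\ell$, the truncation level and the evaluation point $z$ are all illustrative assumptions; the check is that the penalised inner product with $k(\cdot, z)$ returns $f(z)$.

```python
import cmath
import math

MAX_ELL = 25  # truncation: ell runs over -MAX_ELL .. MAX_ELL

def k_hat(ell, sigma=0.5):
    # Assumed real, symmetric, positive kernel spectrum.
    return math.exp(-0.5 * (sigma * ell) ** 2)

def evaluate(coeffs, x):
    # f(x) = sum_ell \hat f_ell exp(i ell x)
    return sum(c * cmath.exp(1j * l * x) for l, c in coeffs.items())

def inner_H(f_hat, g_hat):
    # <f, g>_H = sum_ell \hat f_ell conj(\hat g_ell) / \hat k_ell
    return sum(f_hat[l] * g_hat[l].conjugate() / k_hat(l) for l in f_hat)

ells = range(-MAX_ELL, MAX_ELL + 1)
# Hypothetical f in H: coefficients decaying faster than k_hat.
f_hat = {l: math.exp(-0.5 * l * l) * cmath.exp(0.3j * l) for l in ells}

z = 0.7
# Coefficients of g = k(. - z): \hat g_ell = \hat k_ell exp(-i ell z).
g_hat = {l: k_hat(l) * cmath.exp(-1j * l * z) for l in ells}

reproduced = inner_H(f_hat, g_hat)
direct = evaluate(f_hat, z)
print(reproduced, direct)
```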
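Feature map and reproducing property
Reproducing property for the kernel:
Recall kernel definition:
$$k(x - y) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell (x - y)) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell x) \exp(-\imath \ell y)$$
Define two functions
$$f(x) := k(x - y) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell (x - y)) = \sum_{\ell=-\infty}^{\infty} \exp(\imath \ell x) \underbrace{\hat{k}_\ell \exp(-\imath \ell y)}_{\hat{f}_\ell}$$
$$g(x) := k(x - z) = \sum_{\ell=-\infty}^{\infty} \exp(\imath \ell x) \underbrace{\hat{k}_\ell \exp(-\imath \ell z)}_{\hat{g}_\ell}$$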
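Feature map and reproducing property
$$\langle k(\cdot, y), k(\cdot, z) \rangle_{\mathcal{H}} = \langle f(\cdot), g(\cdot) \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell \overline{\hat{g}_\ell}}{\hat{k}_\ell}$$
$$= \sum_{\ell=-\infty}^{\infty} \frac{\left( \hat{k}_\ell \exp(-\imath \ell y) \right) \overline{\left( \hat{k}_\ell \exp(-\imath \ell z) \right)}}{\hat{k}_\ell}$$
$$= \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell (z - y)) = k(z - y).$$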
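Link back to original RKHS function definition
Original form of a function in the RKHS was
(detail: sum now from $-\infty$ to $\infty$, complex conjugate)
$$f(z) = \sum_{\ell=-\infty}^{\infty} f_\ell \overline{\phi_\ell(z)} = \langle f(\cdot), \phi(z) \rangle_{\mathcal{H}}.$$
We've defined the RKHS dot product as
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell \overline{\hat{g}_\ell}}{\hat{k}_\ell}, \qquad \langle f(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \sum_{\ell=-\infty}^{\infty} \frac{\hat{f}_\ell\, \overline{\hat{k}_\ell \exp(-\imath \ell z)}}{\left( \sqrt{\hat{k}_\ell} \right)^2}.$$
By inspection, $f_\ell = \hat{f}_\ell / \sqrt{\hat{k}_\ell}$ and $\phi_\ell(z) = \sqrt{\hat{k}_\ell} \exp(-\imath \ell z)$.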
Second infinite feature space (on $\mathbb{R}$)
Define a probability measure on $\mathcal{X} := \mathbb{R}$. We'll use the Gaussian density,
$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right).$$
We can write
$$k(x, x') = \sum_{\ell=1}^{\infty} \lambda_\ell e_\ell(x) e_\ell(x'),$$
which converges in $L_2(p)$.
Warning: again, need stronger conditions on kernel than $L_2$ convergence.
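Second infinite feature space (on $\mathbb{R}$)
Exponentiated quadratic kernel,
$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) = \sum_{\ell=1}^{\infty} \underbrace{\left( \sqrt{\lambda_\ell}\, e_\ell(x) \right)}_{\phi_\ell(x)} \underbrace{\left( \sqrt{\lambda_\ell}\, e_\ell(x') \right)}_{\phi_\ell(x')}$$
$$\lambda_\ell e_\ell(x) = \int k(x, x') e_\ell(x') p(x')\, dx', \qquad p(x) = \mathcal{N}(0, \sigma^2).$$
$$\lambda_\ell \propto b^\ell, \quad b < 1, \qquad e_\ell(x) \propto \exp\left( -(c - a) x^2 \right) H_\ell\left( x \sqrt{2c} \right),$$
$a, b, c$ are functions of $\sigma$, and $H_\ell$ is the $\ell$th order Hermite polynomial.
[Figure: eigenfunctions $e_1(x)$, $e_2(x)$, $e_3(x)$]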
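The eigenfunction equation can be checked numerically for the leading eigenpair. The explicit constants below come from the standard closed-form Hermite solution for a Gaussian kernel under a Gaussian measure and are not given on the slide, so treat them as an assumption; the variable names, the shared scale `SIGMA` and the quadrature settings are illustrative.

```python
import math

SIGMA = 1.0  # shared scale of the kernel and of p(x) = N(0, SIGMA^2)

def eq_kernel(x, y):
    return math.exp(-(x - y) ** 2 / (2.0 * SIGMA ** 2))

def density(x):
    return math.exp(-x * x / (2.0 * SIGMA ** 2)) / (SIGMA * math.sqrt(2.0 * math.pi))

# Constants of the assumed closed-form solution.
a = 1.0 / (4.0 * SIGMA ** 2)
b_k = 1.0 / (2.0 * SIGMA ** 2)
c = math.sqrt(a * a + 2.0 * a * b_k)
A = a + b_k + c
lam_lead = math.sqrt(2.0 * a / A)  # leading eigenvalue
ratio = b_k / A                    # geometric decay factor of the eigenvalues

def e_lead(x):
    # Leading eigenfunction, unnormalised (constant Hermite polynomial).
    return math.exp(-(c - a) * x * x)

def apply_operator(e, x, half_width=12.0, n=4000):
    # Midpoint rule for the integral of k(x, x') e(x') p(x') over x'.
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        xp = -half_width + (i + 0.5) * h
        total += eq_kernel(x, xp) * e(xp) * density(xp)
    return total * h

lhs = apply_operator(e_lead, 0.8)
rhs = lam_lead * e_lead(0.8)
print(lhs, rhs, ratio)
```

Here `ratio` plays the role of the decay factor $b < 1$ on the slide.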
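Second infinite feature space (on $\mathbb{R}$)
Reminder: for two functions $f, g$ in $L_2(p)$,
$$f(x) = \sum_{\ell=1}^{\infty} \hat{f}_\ell e_\ell(x), \qquad g(x) = \sum_{m=1}^{\infty} \hat{g}_m e_m(x),$$
dot product is
$$\langle f, g \rangle_{L_2(p)} = \int_{-\infty}^{\infty} f(x) g(x) p(x)\, dx = \int_{-\infty}^{\infty} \left( \sum_{\ell=1}^{\infty} \hat{f}_\ell e_\ell(x) \right) \left( \sum_{m=1}^{\infty} \hat{g}_m e_m(x) \right) p(x)\, dx = \sum_{\ell=1}^{\infty} \hat{f}_\ell \hat{g}_\ell.$$
The dot product in $\mathcal{H}$ has a roughness penalty,
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \hat{g}_\ell}{\lambda_\ell}, \qquad \|f\|^2_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell^2}{\lambda_\ell}.$$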
Does the reproducing property hold?
Check the reproducing property:
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \hat{g}_\ell}{\lambda_\ell}, \qquad g(\cdot) = k(\cdot, z) = \sum_{\ell=1}^{\infty} \underbrace{\lambda_\ell e_\ell(z)}_{\hat{g}_\ell}\, e_\ell(\cdot)$$
Then:
$$\langle f(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \overbrace{\left( \lambda_\ell e_\ell(z) \right)}^{\hat{g}_\ell}}{\lambda_\ell} = \sum_{\ell=1}^{\infty} \hat{f}_\ell e_\ell(z) = f(z)$$
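Link back to the original RKHS definition
Original form of a function in the RKHS was
$$f(z) = \sum_{\ell=1}^{\infty} f_\ell \phi_\ell(z) = \langle f(\cdot), \phi(z) \rangle_{\mathcal{H}}$$
Same expression with "roughness penalised" dot product:
$$\langle f, g \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \hat{g}_\ell}{\lambda_\ell}, \qquad g(\cdot) = k(\cdot, z) = \sum_{\ell=1}^{\infty} \underbrace{\lambda_\ell e_\ell(z)}_{\hat{g}_\ell}\, e_\ell(\cdot)$$
Thus:
$$\langle f(\cdot), k(\cdot, z) \rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \left( \lambda_\ell e_\ell(z) \right)}{\lambda_\ell} = \sum_{\ell=1}^{\infty} \frac{\hat{f}_\ell \left( \lambda_\ell e_\ell(z) \right)}{\left( \sqrt{\lambda_\ell} \right)^2}$$
By inspection: $f_\ell = \hat{f}_\ell / \sqrt{\lambda_\ell}$ and $\phi_\ell(z) = \sqrt{\lambda_\ell}\, e_\ell(z)$.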
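RKHS function, exponentiated quadratic kernel:
$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) = \sum_{i=1}^{m} \alpha_i \left[ \sum_{j=1}^{\infty} \lambda_j e_j(x_i) e_j(x) \right] = \sum_{\ell=1}^{\infty} f_\ell \underbrace{\left[ \sqrt{\lambda_\ell}\, e_\ell(x) \right]}_{\phi_\ell(x)}$$
where $f_\ell = \sum_{i=1}^{m} \alpha_i \sqrt{\lambda_\ell}\, e_\ell(x_i)$.
NOTE that this enforces smoothing: $\lambda_\ell$ decay as $e_\ell$ become rougher, $f_\ell$ decay since $\|f\|^2_{\mathcal{H}} = \sum_\ell f_\ell^2 < \infty$.
[Figure: plot of $f(x)$ against $x$]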
Main message
Some reproducing kernel Hilbert space theory
Reproducing kernel Hilbert space (1)
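Definition
$\mathcal{H}$ a Hilbert space of $\mathbb{R}$-valued functions on non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel of $\mathcal{H}$, and $\mathcal{H}$ is a reproducing kernel Hilbert space, if
$\forall x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}$,
$\forall x \in \mathcal{X}$, $\forall f \in \mathcal{H}$, $\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ (the reproducing property).
In particular, for any $x, y \in \mathcal{X}$,
$$k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}}.$$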
Reproducing kernel Hilbert space (2)
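Another RKHS definition:
Define $\delta_x$ to be the operator of evaluation at $x$, i.e.
$$\delta_x f = f(x) \quad \forall f \in \mathcal{H},\ x \in \mathcal{X}.$$
$\mathcal{H}$ is an RKHS if each $\delta_x$ is bounded: for every $x \in \mathcal{X}$ there is a $\lambda_x \ge 0$ such that, for all $f \in \mathcal{H}$,
$$|f(x)| = |\delta_x f| \le \lambda_x \|f\|_{\mathcal{H}}.$$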
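For a function of the form $f = \sum_i \alpha_i k(x_i, \cdot)$ the bound can be made explicit: by the reproducing property and Cauchy-Schwarz, $|f(x)| = |\langle f, k(\cdot, x) \rangle_{\mathcal{H}}| \le \sqrt{k(x, x)}\, \|f\|_{\mathcal{H}}$, so $\lambda_x = \sqrt{k(x, x)}$ is one valid constant. The sketch below checks this on a grid, assuming an exponentiated quadratic kernel and made-up weights and centres.

```python
import math

def eq_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

# Hypothetical weights and centres, for illustration only.
alphas = [1.0, -0.5, 0.8]
centres = [-2.0, 0.0, 3.0]

def f(x):
    return sum(a * eq_kernel(c, x) for a, c in zip(alphas, centres))

# ||f||_H^2 = sum_ij alpha_i alpha_j k(x_i, x_j), by the reproducing property.
sq_norm = sum(ai * aj * eq_kernel(ci, cj)
              for ai, ci in zip(alphas, centres)
              for aj, cj in zip(alphas, centres))
norm = math.sqrt(sq_norm)

def bound(x):
    # sqrt(k(x, x)) * ||f||_H, the Cauchy-Schwarz bound on |f(x)|.
    return math.sqrt(eq_kernel(x, x)) * norm

grid = [-6.0 + 0.1 * i for i in range(141)]
holds = all(abs(f(x)) <= bound(x) + 1e-12 for x in grid)
print(norm, holds)
```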
Moore-Aronszajn Theorem
Theorem (Moore-Aronszajn)
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be positive definite. There is a unique RKHS $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ with reproducing kernel $k$.
Recall feature map is not unique (as we saw earlier): only kernel is unique.