
CS446 Introduction to Machine Learning (Fall 2013)

University of Illinois at Urbana-Champaign


http://courses.engr.illinois.edu/cs446

LECTURE 10:
DUAL AND KERNELS II

Prof. Julia Hockenmaier


juliahmr@illinois.edu
Primal and dual representation
Linear classifier (primal representation):
w defines weights of features of x
f(x) = w·x
Linear classifier (dual representation):
Rewrite w as a (weighted) sum
of training items:
w = ∑n αn yn xn
f(x) = w·x = ∑n αn yn (xn·x)
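As a concrete illustration (a sketch in Python/NumPy, not part of the original slides), this is how a dual-form prediction can be computed once the coefficients αn are known; the data and weights below are made up:

import numpy as np

def dual_predict(x, X_train, y_train, alpha):
    # f(x) = sum_n alpha_n * y_n * (x_n . x)
    return np.sum(alpha * y_train * (X_train @ x))

# toy example: three training items in R^2 with made-up dual weights
X_train = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])
y_train = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 1.0, 0.25])
print(dual_predict(np.array([1.0, 1.0]), X_train, y_train, alpha))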
The kernel trick
– Define a feature function φ(x) which maps
items x into a higher-dimensional space.
– The kernel function K(xi, xj) computes the
inner product of φ(xi) and φ(xj):
K(xi, xj) = φ(xi)·φ(xj)
– Dual representation: We don’t need to
learn w in this higher-dimensional space.
It is sufficient to evaluate K(xi, xj)



The kernel matrix
The kernel matrix of a data set D = {x1, …, xn}
defined by a kernel function k(x, z) = φ(x)·φ(z)
is the n×n matrix K with Kij = k(xi, xj)

You’ll also find the term ‘Gram matrix’ used:


– The Gram matrix of a set of n vectors S = {x1, …, xn}
is the n×n matrix G with Gij = xi·xj
– The kernel matrix is the Gram matrix of
{φ(x1), …,φ(xn)}
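A minimal sketch (mine, not from the slides) of building a kernel matrix from a data set and a kernel function; with the linear kernel this is exactly the Gram matrix of the raw vectors:

import numpy as np

def kernel_matrix(X, k):
    # K[i, j] = k(x_i, x_j) for the items stored as rows of X
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
linear = lambda x, z: x @ z
print(kernel_matrix(X, linear))   # Gram matrix of {x1, x2, x3}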



Properties of the kernel matrix K
K is symmetric:
Kij = k(xi, xj) = φ(xi)·φ(xj) = φ(xj)·φ(xi) = k(xj, xi) = Kji

K is positive semi-definite (∀ vectors v: vᵀKv ≥ 0):


Proof (i, j range over the D data items; k ranges over the N feature dimensions):

vᵀKv = ∑i ∑j vi vj Kij = ∑i ∑j vi vj φ(xi)·φ(xj)
= ∑i ∑j vi vj ∑k φk(xi) φk(xj)
= ∑k ∑i ∑j vi φk(xi) · vj φk(xj)
= ∑k ( ∑i vi φk(xi) )² ≥ 0
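Both properties can also be checked numerically; a quick sketch (the kernel and data are chosen arbitrarily):

import numpy as np

X = np.random.randn(5, 3)                       # 5 items in R^3
K = (X @ X.T) ** 2                              # kernel matrix for k(x, z) = (x·z)^2

print(np.allclose(K, K.T))                      # symmetry
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # all eigenvalues >= 0 (up to rounding)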
Quadratic kernel (1)
K(x, z) = (x·z)²
This corresponds to a feature space
which contains only terms of degree 2
(products of two features)
(for x = (x1, x2) in R², these are x1x1, x1x2, x2x2)
For x = (x1, x2), z = (z1, z2):
K(x, z) = (x·z)²
= (x1z1 + x2z2)²
= x1²z1² + 2·x1z1·x2z2 + x2²z2²
= φ(x)·φ(z)
Hence, φ(x) = (x1², √2·x1x2, x2²)
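A numerical sanity check of this identity (my own sketch):

import numpy as np

def phi(x):
    # explicit degree-2 feature map for x = (x1, x2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])
print((x @ z) ** 2)      # kernel value (x·z)^2
print(phi(x) @ phi(z))   # inner product in feature space; the two values agree up to rounding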



Quadratic kernel (2)
K(x, z) = (x·z + c)²
This corresponds to a feature space
which contains a constant, linear terms
(the original features), as well as terms of degree 2
(products of two features)

(for x = (x1, x2) in R²: 1, x1, x2, x1x1, x1x2, x2x2)



Polynomial kernels
– Linear kernel:
k(x, z) = x·z

– Polynomial kernel of degree d
(only dth-order interactions):
k(x, z) = (x·z)^d

– Polynomial kernel up to degree d
(all interactions of order d or lower):
k(x, z) = (x·z + c)^d with c > 0
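For concreteness, the three kernels written as plain functions (a sketch; any of them can be plugged into the dual classifier f(x) = ∑n αn yn k(xn, x)):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, d):
    # only d-th order interactions
    return (x @ z) ** d

def poly_kernel_up_to(x, z, d, c=1.0):
    # all interactions of order d or lower (requires c > 0)
    return (x @ z + c) ** d

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z, 2), poly_kernel_up_to(x, z, 2))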



Constructing new kernels from
one existing kernel k(x, x’)
You can construct new kernels k’(x, x’) from k(x, x’) by:
– Multiplying k(x, x’) by a constant c:
k’(x, x’) = ck(x, x’)
– Multiplying k(x, x’) by a function f applied to x and x’:
k’(x, x’) = f(x)k(x, x’)f(x’)
– Applying a polynomial (with non-negative coefficients) to k(x, x’):
k’(x, x’) = P( k(x, x’) ) with P(z) = ∑i ai·z^i and ai ≥ 0
– Exponentiating k(x, x’):
k’(x, x’) = exp(k(x, x’))



Constructing new kernels by combining
two kernels k1(x, x’), k2(x, x’)
You can construct k’(x, x’) from
k1(x, x’), k2(x, x’) by:
– Adding k1(x, x’) and k2(x, x’):
k’(x, x’) = k1(x, x’) + k2(x, x’)

– Multiplying k1(x, x’) and k2(x, x’):


k’(x, x’) = k1(x, x’)k2(x, x’)
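A small numerical check (my own sketch) that sums and elementwise products of kernel matrices remain symmetric positive semi-definite:

import numpy as np

X = np.random.randn(6, 2)
K1 = X @ X.T                                            # linear kernel matrix
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel matrix (c = 1)

for K in (K1 + K2, K1 * K2):                            # sum and (elementwise) product
    print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # eigenvalues >= 0 up to rounding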



Constructing new kernels
– If φ(x) ∈ Rᵐ and k_m(z, z’) is a valid kernel in Rᵐ,
k(x, x’) = k_m(φ(x), φ(x’)) is also a valid kernel

– If A is a symmetric positive semi-definite matrix,
k(x, x’) = xᵀAx’ is also a valid kernel



Normalizing a kernel
Recall: you can normalize any vector x
(transform it into a unit vector that has the
same direction as x) by
x̂ = x/‖x‖ = x/√(x1² + … + xN²)

The same construction gives a normalized kernel:
k’(x, z) = k(x, z) / √( k(x, x)·k(z, z) )
= φ(x)·φ(z) / √( (φ(x)·φ(x))·(φ(z)·φ(z)) )
= φ(x)·φ(z) / ( ‖φ(x)‖·‖φ(z)‖ )
= ψ(x)·ψ(z) with ψ(x) = φ(x)/‖φ(x)‖
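The same normalization applied to a whole kernel matrix, as a sketch (K here is any precomputed kernel matrix; the quadratic kernel is just an example):

import numpy as np

def normalize_kernel(K):
    # K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]); the diagonal of K' is all ones
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

X = np.random.randn(4, 3)
K = (X @ X.T + 1.0) ** 2
print(np.diag(normalize_kernel(K)))   # [1. 1. 1. 1.]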



Gaussian kernel
(aka radial basis function kernel)
k(x, z) = exp( −‖x − z‖²/c )
‖x − z‖²: squared Euclidean distance between x and z
c (often called σ²): a free parameter
very small c: K ≈ identity matrix (every item looks different from every other item)
very large c: K ≈ all-ones matrix (all items look the same)
– k(x, z) ≈ 1 when x and z are close
– k(x, z) ≈ 0 when x and z are dissimilar
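A short illustration (my sketch) of how the free parameter c moves the kernel matrix between these two extremes:

import numpy as np

def rbf_kernel_matrix(X, c):
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-d2 / c)

X = np.random.randn(4, 2)
print(np.round(rbf_kernel_matrix(X, 1e-3), 2))   # very small c: roughly the identity matrix
print(np.round(rbf_kernel_matrix(X, 1e3), 2))    # very large c: roughly all ones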



Gaussian kernel
(aka radial basis function kernel)
k(x, z) = exp( −‖x − z‖²/c )
This is a valid kernel because:
k(x, z) = exp( −‖x − z‖²/2σ² )
= exp( −(x·x + z·z − 2x·z)/2σ² )
= exp(−x·x/2σ²) · exp(x·z/σ²) · exp(−z·z/2σ²)
= f(x) · exp(x·z/σ²) · f(z)
exp(x·z/σ²) is a valid kernel because:
– x·z is the linear kernel;
– we can multiply kernels by constants (1/σ²);
– we can exponentiate kernels



Kernels over (finite) sets
X, Z: subsets of a finite set D with |D| elements

k(X, Z) = |X ∩ Z| (the number of elements in both X and Z)

is a valid kernel:
k(X, Z) = φ(X)·φ(Z), where φ(X) maps X to a bit vector of length |D|
(i-th bit: does X contain the i-th element of D?)

k(X, Z) = 2^|X ∩ Z| (the number of subsets shared by X and Z)

is a valid kernel:
φ(X) maps X to a bit vector of length 2^|D|
(i-th bit: does X contain the i-th subset of D?)
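A sketch (mine, in Python) of the intersection kernel and its bit-vector feature map over a small finite domain D:

# D is a small finite domain; X and Z are subsets of D
D = ['a', 'b', 'c', 'd']

def k_intersect(X, Z):
    # k(X, Z) = |X ∩ Z|
    return len(set(X) & set(Z))

def phi(X):
    # bit vector of length |D|; i-th bit: is the i-th element of D in X?
    return [1 if e in X else 0 for e in D]

X, Z = {'a', 'b', 'c'}, {'b', 'c', 'd'}
print(k_intersect(X, Z))                                # 2
print(sum(px * pz for px, pz in zip(phi(X), phi(Z))))   # φ(X)·φ(Z) = 2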

