
CS446 Introduction to Machine Learning (Fall 2013)

University of Illinois at Urbana-Champaign


http://courses.engr.illinois.edu/cs446

LECTURE 10:
DUAL AND KERNELS II

Prof. Julia Hockenmaier


juliahmr@illinois.edu
Primal and dual representation
Linear classifier (primal representation):
w defines weights of features of x
f(x) = w·x
Linear classifier (dual representation):
Rewrite w as a (weighted) sum
of training items:
w = ∑n αn yn xn
f(x) = w·x = ∑n αn yn (xn·x)
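As a concrete illustration (a sketch in Python/NumPy, not part of the original slides), this is how a dual-form prediction can be computed once the coefficients αn are known; the data and weights below are made up:

import numpy as np

def dual_predict(x, X_train, y_train, alpha):
    # f(x) = sum_n alpha_n * y_n * (x_n . x)
    return np.sum(alpha * y_train * (X_train @ x))

# toy example: three training items in R^2 with made-up dual weights
X_train = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])
y_train = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 1.0, 0.25])
print(dual_predict(np.array([1.0, 1.0]), X_train, y_train, alpha))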
The kernel trick
– Define a feature function φ(x) which maps
items x into a higher-dimensional space.
– The kernel function K(xi, xj) computes the
inner product of φ(xi) and φ(xj):
K(xi, xj) = φ(xi)·φ(xj)
– Dual representation: We don’t need to
learn w in this higher-dimensional space.
It is sufficient to evaluate K(xi, xj)



The kernel matrix
The kernel matrix of a data set D = {x1, …, xn}
defined by a kernel function k(x, z) = φ(x)·φ(z)
is the n×n matrix K with Kij = k(xi, xj)

You’ll also find the term ‘Gram matrix’ used:


– The Gram matrix of a set of n vectors S = {x1, …, xn}
is the n×n matrix G with Gij = xi·xj
– The kernel matrix is the Gram matrix of
{φ(x1), …,φ(xn)}
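A minimal sketch (mine, not from the slides) of building a kernel matrix from a data set and a kernel function; with the linear kernel this is exactly the Gram matrix of the raw vectors:

import numpy as np

def kernel_matrix(X, k):
    # K[i, j] = k(x_i, x_j) for the items stored as rows of X
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(X[i], X[j])
    return K

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
linear = lambda x, z: x @ z
print(kernel_matrix(X, linear))   # Gram matrix of {x1, x2, x3}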



Properties of the kernel matrix K
K is symmetric:
Kij = k(xi, xj) = φ(xi)·φ(xj) = φ(xj)·φ(xi) = k(xj, xi) = Kji

K is positive semi-definite (∀ vectors v: vᵀKv ≥ 0):


Proof (i, j range over the D data items; k ranges over the N feature dimensions):

vᵀKv = ∑i ∑j vi vj Kij = ∑i ∑j vi vj φ(xi)·φ(xj)
= ∑i ∑j vi vj ∑k φk(xi) φk(xj)
= ∑k ∑i ∑j vi φk(xi) · vj φk(xj)
= ∑k ( ∑i vi φk(xi) )² ≥ 0
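Both properties can also be checked numerically; a quick sketch (the kernel and data are chosen arbitrarily):

import numpy as np

X = np.random.randn(5, 3)                       # 5 items in R^3
K = (X @ X.T) ** 2                              # kernel matrix for k(x, z) = (x·z)^2

print(np.allclose(K, K.T))                      # symmetry
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # all eigenvalues >= 0 (up to rounding)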
Quadratic kernel (1)
K(x, z) = (x·z)²
This corresponds to a feature space
which contains only terms of degree 2
(products of two features)
(for x = (x1, x2) in R², these are x1x1, x1x2, x2x2)
For x = (x1, x2), z = (z1, z2):
K(x, z) = (x·z)²
= (x1z1 + x2z2)²
= x1²z1² + 2·x1z1·x2z2 + x2²z2²
= φ(x)·φ(z)
Hence, φ(x) = (x1², √2·x1x2, x2²)
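A numerical sanity check of this identity (my own sketch):

import numpy as np

def phi(x):
    # explicit degree-2 feature map for x = (x1, x2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])
print((x @ z) ** 2)      # kernel value (x·z)^2
print(phi(x) @ phi(z))   # inner product in feature space; the two values agree up to rounding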



Quadratic kernel (2)
K(x, z) = (x·z + c)²
This corresponds to a feature space
which contains a constant, linear terms
(the original features), as well as terms of degree 2
(products of two features)

(for x = (x1, x2) in R²: 1, x1, x2, x1x1, x1x2, x2x2)



Polynomial kernels
– Linear kernel:
k(x, z) = x·z

– Polynomial kernel of degree d
(only dth-order interactions):
k(x, z) = (x·z)^d

– Polynomial kernel up to degree d
(all interactions of order d or lower):
k(x, z) = (x·z + c)^d with c > 0
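For concreteness, the three kernels written as plain functions (a sketch; any of them can be plugged into the dual classifier f(x) = ∑n αn yn k(xn, x)):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, d):
    # only d-th order interactions
    return (x @ z) ** d

def poly_kernel_up_to(x, z, d, c=1.0):
    # all interactions of order d or lower (requires c > 0)
    return (x @ z + c) ** d

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z, 2), poly_kernel_up_to(x, z, 2))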



Constructing new kernels from
one existing kernel k(x, x’)
You can construct new kernels k’(x, x’) from k(x, x’) by:
– Multiplying k(x, x’) by a constant c:
k’(x, x’) = ck(x, x’)
– Multiplying k(x, x’) by a function f applied to x and x’:
k’(x, x’) = f(x)k(x, x’)f(x’)
– Applying a polynomial (with non-negative coefficients) to k(x, x’):
k’(x, x’) = P( k(x, x’) ) with P(z) = ∑i ai·z^i and ai ≥ 0
– Exponentiating k(x, x’):
k’(x, x’) = exp(k(x, x’))



Constructing new kernels by combining
two kernels k1(x, x’), k2(x, x’)
You can construct k’(x, x’) from
k1(x, x’), k2(x, x’) by:
– Adding k1(x, x’) and k2(x, x’):
k’(x, x’) = k1(x, x’) + k2(x, x’)

– Multiplying k1(x, x’) and k2(x, x’):


k’(x, x’) = k1(x, x’)k2(x, x’)
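A small numerical check (my own sketch) that sums and elementwise products of kernel matrices remain symmetric positive semi-definite:

import numpy as np

X = np.random.randn(6, 2)
K1 = X @ X.T                                            # linear kernel matrix
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel matrix (c = 1)

for K in (K1 + K2, K1 * K2):                            # sum and (elementwise) product
    print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # eigenvalues >= 0 up to rounding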



Constructing new kernels
– If φ(x) ∈ Rᵐ and k_m(z, z’) is a valid kernel in Rᵐ,
k(x, x’) = k_m(φ(x), φ(x’)) is also a valid kernel

– If A is a symmetric positive semi-definite matrix,
k(x, x’) = xᵀAx’ is also a valid kernel



Normalizing a kernel
Recall: you can normalize any vector x
(transform it into a unit vector that has the
same direction as x) by
x̂ = x/‖x‖ = x/√(x1² + … + xN²)

The same construction gives a normalized kernel:
k’(x, z) = k(x, z) / √( k(x, x)·k(z, z) )
= φ(x)·φ(z) / √( (φ(x)·φ(x))·(φ(z)·φ(z)) )
= φ(x)·φ(z) / ( ‖φ(x)‖·‖φ(z)‖ )
= ψ(x)·ψ(z) with ψ(x) = φ(x)/‖φ(x)‖
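The same normalization applied to a whole kernel matrix, as a sketch (K here is any precomputed kernel matrix; the quadratic kernel is just an example):

import numpy as np

def normalize_kernel(K):
    # K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]); the diagonal of K' is all ones
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

X = np.random.randn(4, 3)
K = (X @ X.T + 1.0) ** 2
print(np.diag(normalize_kernel(K)))   # [1. 1. 1. 1.]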



Gaussian kernel
(aka radial basis function kernel)
k(x, z) = exp( −‖x − z‖²/c )
‖x − z‖²: squared Euclidean distance between x and z
c (often called σ²): a free parameter
very small c: K ≈ identity matrix (every item looks different from every other item)
very large c: K ≈ all-ones matrix (all items look the same)
– k(x, z) ≈ 1 when x and z are close
– k(x, z) ≈ 0 when x and z are dissimilar
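A short illustration (my sketch) of how the free parameter c moves the kernel matrix between these two extremes:

import numpy as np

def rbf_kernel_matrix(X, c):
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-d2 / c)

X = np.random.randn(4, 2)
print(np.round(rbf_kernel_matrix(X, 1e-3), 2))   # very small c: roughly the identity matrix
print(np.round(rbf_kernel_matrix(X, 1e3), 2))    # very large c: roughly all ones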



Gaussian kernel
(aka radial basis function kernel)
k(x, z) = exp( −‖x − z‖²/c )
This is a valid kernel because:
k(x, z) = exp( −‖x − z‖²/2σ² )
= exp( −(x·x + z·z − 2x·z)/2σ² )
= exp(−x·x/2σ²) · exp(x·z/σ²) · exp(−z·z/2σ²)
= f(x) · exp(x·z/σ²) · f(z)
exp(x·z/σ²) is a valid kernel because:
– x·z is the linear kernel;
– we can multiply kernels by constants (1/σ²);
– we can exponentiate kernels



Kernels over (finite) sets
X, Z: subsets of a finite set D with |D| elements

k(X, Z) = |X ∩ Z| (the number of elements in both X and Z)

is a valid kernel:
k(X, Z) = φ(X)·φ(Z), where φ(X) maps X to a bit vector of length |D|
(i-th bit: does X contain the i-th element of D?)

k(X, Z) = 2^|X ∩ Z| (the number of subsets shared by X and Z)

is a valid kernel:
φ(X) maps X to a bit vector of length 2^|D|
(i-th bit: does X contain the i-th subset of D?)
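A sketch (mine, in Python) of the intersection kernel and its bit-vector feature map over a small finite domain D:

# D is a small finite domain; X and Z are subsets of D
D = ['a', 'b', 'c', 'd']

def k_intersect(X, Z):
    # k(X, Z) = |X ∩ Z|
    return len(set(X) & set(Z))

def phi(X):
    # bit vector of length |D|; i-th bit: is the i-th element of D in X?
    return [1 if e in X else 0 for e in D]

X, Z = {'a', 'b', 'c'}, {'b', 'c', 'd'}
print(k_intersect(X, Z))                                # 2
print(sum(px * pz for px, pz in zip(phi(X), phi(Z))))   # φ(X)·φ(Z) = 2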

