
Kernel Machines

Kernel trick
• The feature mapping Φ(·) can be very high dimensional (e.g. think of the polynomial mapping)
• It can be very expensive to compute it explicitly
• Feature mappings appear only inside dot products in dual formulations
• The kernel trick consists of replacing these dot products with an equivalent kernel function:

  k(x, x') = Φ(x)^T Φ(x')

• The kernel function takes its arguments in input (not feature) space

Kernel trick
Support vector classification
• Dual optimization problem

  max_{α ∈ ℝ^m}   Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)

  subject to      0 ≤ α_i ≤ C,   i = 1, …, m
                  Σ_{i=1}^m α_i y_i = 0

  where each dot product Φ(x_i)^T Φ(x_j) is replaced by k(x_i, x_j)

• Dual decision function

  f(x) = Σ_{i=1}^m α_i y_i Φ(x_i)^T Φ(x)

  where Φ(x_i)^T Φ(x) is replaced by k(x_i, x)
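Both the dual problem and the decision function access the data only through kernel values, so a classifier can be trained from a precomputed Gram matrix. A minimal sketch on toy data (assumes NumPy and scikit-learn, which the slides do not mention):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, 1, -1)   # toy labels

def poly_kernel(A, B, d=2):
    # homogeneous polynomial kernel k(x, x') = (x^T x')^d, computed pairwise
    return (A @ B.T) ** d

K_train = poly_kernel(X_train, X_train)             # Gram matrix k(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
X_test = rng.normal(size=(5, 2))
K_test = poly_kernel(X_test, X_train)               # rows hold k(x, x_i) for each test point x
print(clf.predict(K_test))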

Kernel trick
Polynomial kernel
• Homogeneous:

  k(x, x') = (x^T x')^d

• E.g. (d = 2)

  k((x_1, x_2), (x'_1, x'_2)) = (x_1 x'_1 + x_2 x'_2)^2
                              = (x_1 x'_1)^2 + (x_2 x'_2)^2 + 2 x_1 x'_1 x_2 x'_2
                              = (x_1^2, √2 x_1 x_2, x_2^2)^T (x'_1^2, √2 x'_1 x'_2, x'_2^2)
                              = Φ(x)^T Φ(x')
Kernel trick
Polynomial kernel

• Inhomogeneous:

  k(x, x') = (1 + x^T x')^d

• E.g. (d = 2)

  k((x_1, x_2), (x'_1, x'_2)) = (1 + x_1 x'_1 + x_2 x'_2)^2
                              = 1 + (x_1 x'_1)^2 + (x_2 x'_2)^2 + 2 x_1 x'_1 + 2 x_2 x'_2 + 2 x_1 x'_1 x_2 x'_2
                              = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2)^T (1, √2 x'_1, √2 x'_2, x'_1^2, √2 x'_1 x'_2, x'_2^2)
                              = Φ(x)^T Φ(x')
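The explicit feature map above can be checked numerically. A small sketch on toy vectors (assumes NumPy):

import numpy as np

def phi(x):
    # explicit feature map for d = 2 and two-dimensional input
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k_poly(x, xp, d=2):
    # inhomogeneous polynomial kernel computed directly in input space
    return (1.0 + x @ xp) ** d

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(k_poly(x, xp), phi(x) @ phi(xp))   # same value up to rounding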

Valid Kernels
Dot product in feature space

• A valid kernel is a (similarity) function defined on the Cartesian product of the input space:

  k : X × X → ℝ

• corresponding to a dot product in a (certain) feature space:

  k(x, x') = Φ(x)^T Φ(x')

Note

• The kernel generalizes the notion of dot product to arbitrary input spaces (e.g. protein sequences)
• It can be seen as a measure of similarity between objects

Valid Kernels
Gram matrix

• Given examples {x_1, …, x_m} and a kernel function k

• The Gram matrix K is the (symmetric) matrix of pairwise kernels between examples:

  K_ij = k(x_i, x_j)   ∀ i, j

Valid Kernels
Positive definite matrix
• A symmetric m × m matrix K is positive definite (p.d.) if

  Σ_{i,j=1}^m c_i c_j K_ij ≥ 0,   ∀ c ∈ ℝ^m

• If equality only holds for c = 0, the matrix is strictly positive definite (s.p.d.)

Alternative conditions
• All eigenvalues are non-negative (positive for s.p.d.)
• There exists a matrix B such that

  K = B^T B
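The eigenvalue condition gives a quick empirical sanity check on a finite sample. A sketch on toy data (assumes NumPy):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = (X @ X.T + 1.0) ** 2          # Gram matrix K_ij = (x_i^T x_j + 1)^2
eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(np.all(eigvals >= -1e-9))   # all eigenvalues non-negative (up to rounding)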

Valid Kernels
Positive definite kernels
• A positive definite kernel is a function k : X × X → ℝ giving rise to a p.d. Gram matrix for any m and any choice of {x_1, …, x_m}
• Positive definiteness is a necessary and sufficient condition for a kernel to correspond to a dot product of some feature map Φ

How to verify kernel validity


• Prove its positive definiteness directly (difficult)
• Find a corresponding feature map (see the polynomial example)
• Use kernel combination properties (see below)

Kernel machines
Support vector regression
• Dual problem:

  max_{α ∈ ℝ^m}   − (1/2) Σ_{i,j=1}^m (α*_i − α_i)(α*_j − α_j) Φ(x_i)^T Φ(x_j)
                  − ε Σ_{i=1}^m (α*_i + α_i) + Σ_{i=1}^m y_i (α*_i − α_i)

  subject to      Σ_{i=1}^m (α_i − α*_i) = 0,   α_i, α*_i ∈ [0, C]   ∀ i ∈ [1, m]

  where Φ(x_i)^T Φ(x_j) is replaced by k(x_i, x_j)

• Regression function:

  f(x) = w^T Φ(x) + w_0 = Σ_{i=1}^m (α*_i − α_i) Φ(x_i)^T Φ(x) + w_0

  where Φ(x_i)^T Φ(x) is replaced by k(x_i, x)
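Again only kernel values are needed. A sketch of regression from a precomputed Gram matrix on toy 1-D data (assumes NumPy and scikit-learn; the Gaussian kernel used here is defined later in these notes):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def gaussian_gram(A, B, sigma=1.0):
    # pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

reg = SVR(kernel="precomputed", C=10.0, epsilon=0.1)
reg.fit(gaussian_gram(X, X), y)
print(reg.predict(gaussian_gram(X[:5], X)))   # predictions for the first five points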

Kernel machines
Smallest Enclosing Hypersphere

• Dual formulation

  max_{α ∈ ℝ^m}   Σ_{i=1}^m α_i Φ(x_i)^T Φ(x_i) − Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)

  subject to      Σ_{i=1}^m α_i = 1,   0 ≤ α_i ≤ C,   i = 1, …, m

  where Φ(x_i)^T Φ(x_i) = k(x_i, x_i) and Φ(x_i)^T Φ(x_j) = k(x_i, x_j)

• Distance function

  R^2(x) = Φ(x)^T Φ(x) − 2 Σ_{i=1}^m α_i Φ(x_i)^T Φ(x) + Σ_{i,j=1}^m α_i α_j Φ(x_i)^T Φ(x_j)
         = k(x, x) − 2 Σ_{i=1}^m α_i k(x_i, x) + Σ_{i,j=1}^m α_i α_j k(x_i, x_j)

Kernel machines
(Stochastic) Perceptron: f(x) = w^T x

1. Initialize w = 0
2. Iterate until all examples correctly classified:
   (a) For each incorrectly classified training example (x_i, y_i):

       w ← w + η y_i x_i

Kernel Perceptron: f(x) = Σ_{i=1}^m α_i k(x_i, x)

1. Initialize α_i = 0 ∀ i
2. Iterate until all examples correctly classified:
   (a) For each incorrectly classified training example (x_i, y_i):

       α_i ← α_i + η y_i
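A minimal kernel perceptron sketch following the update rule above (toy data; assumes NumPy; the epoch cap is a practical guard not in the pseudocode):

import numpy as np

def kernel_perceptron(K, y, eta=1.0, max_epochs=100):
    # K: precomputed Gram matrix (m x m); y: labels in {-1, +1}
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(y)):
            # f(x_i) = sum_j alpha_j k(x_j, x_i)
            if np.sign(alpha @ K[:, i]) != y[i]:
                alpha[i] += eta * y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X = X + 0.3 * y[:, None]                          # toy data, pushed away from the boundary
alpha = kernel_perceptron(X @ X.T, y)             # linear kernel k(x, x') = x^T x'
print(np.all(np.sign((X @ X.T) @ alpha) == y))    # True once training has converged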

Kernels
Basic kernels

• linear kernel:

  k(x, x') = x^T x'

• polynomial kernel:

  k_{d,c}(x, x') = (x^T x' + c)^d

Kernels
Gaussian kernel

  k_σ(x, x') = exp(− ‖x − x'‖^2 / (2σ^2)) = exp(− (x^T x − 2 x^T x' + x'^T x') / (2σ^2))

• Depends on a width parameter σ
• The smaller the width, the more the prediction on a point depends only on its nearest neighbours
• Example of a universal kernel: such kernels can uniformly approximate any continuous target function (in practice this is limited by the number of training examples and the choice of σ)
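A quick sketch of the width effect on toy points (assumes NumPy): smaller σ makes the similarity decay faster with distance.

import numpy as np

def gaussian_kernel(x, xp, sigma):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for sigma in (0.5, 1.0, 2.0):
    print(sigma, round(gaussian_kernel(x, xp, sigma), 4))   # 0.0183, 0.3679, 0.7788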

Kernels
Kernels on structured data
• Kernels are generalizations of dot products to arbitrary domains
• It is possible to design kernels over structured objects such as sequences, trees or graphs
• The idea is to design a pairwise function measuring the similarity of two objects
• This measure has to satisfy the p.d. conditions to be a valid kernel

Match (or delta) kernel

  k_δ(x, x') = δ(x, x') = 1 if x = x', 0 otherwise

• Simplest kernel on structures
• x does not need to be a vector! (written without boldface to stress it)

E.g. string kernel: 3-gram spectrum kernel
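A possible implementation of a 3-gram spectrum kernel, i.e. the dot product of the vectors of 3-gram counts (an assumed sketch, not taken from the slides):

from collections import Counter

def spectrum_kernel(s, t, k=3):
    grams_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    grams_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(count * grams_t[g] for g, count in grams_s.items())

print(spectrum_kernel("kernel", "kernels"))   # 4 shared 3-grams: ker, ern, rne, nel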

Kernels
Kernel combination

• Simpler kernels can be combined using certain operators (e.g. sum, product)
• Kernel combination makes it possible to design complex kernels on structures from simpler ones

• Correctly using combination operators guarantees that complex kernels are p.d.

Note

• Simplest constructive approach to build valid kernels

Kernel combination
Kernel Sum

• The sum of two kernels corresponds to the concatenation of their respective feature spaces:

  (k_1 + k_2)(x, x') = k_1(x, x') + k_2(x, x')
                     = Φ_1(x)^T Φ_1(x') + Φ_2(x)^T Φ_2(x')
                     = (Φ_1(x) Φ_2(x))^T (Φ_1(x') Φ_2(x'))   (concatenated feature vectors)

• The two kernels can be defined on different spaces (direct sum, e.g. string spectrum kernel plus string length)
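A quick numerical check on toy vectors (assumes NumPy) that summing two kernels matches the dot product of the concatenated feature vectors:

import numpy as np

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
k1 = x @ xp              # linear kernel, Phi_1(x) = x
k2 = (x @ xp) ** 2       # homogeneous polynomial (d = 2), Phi_2 as in the earlier slide

def phi_concat(z):
    # concatenation of Phi_1 and Phi_2
    return np.concatenate([z, [z[0] ** 2, np.sqrt(2) * z[0] * z[1], z[1] ** 2]])

print(k1 + k2, phi_concat(x) @ phi_concat(xp))   # same value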

Kernel combination
Kernel Product

• The product of two kernels corresponds to the Cartesian product of their features:

  (k_1 × k_2)(x, x') = k_1(x, x') k_2(x, x')
                     = Σ_{i=1}^n Φ_{1i}(x) Φ_{1i}(x')  Σ_{j=1}^m Φ_{2j}(x) Φ_{2j}(x')
                     = Σ_{i=1}^n Σ_{j=1}^m (Φ_{1i}(x) Φ_{2j}(x)) (Φ_{1i}(x') Φ_{2j}(x'))
                     = Σ_{k=1}^{nm} Φ_{12k}(x) Φ_{12k}(x') = Φ_{12}(x)^T Φ_{12}(x')

• where Φ_{12}(x) = Φ_1(x) × Φ_2(x) is the Cartesian product


• The product can also be between kernels defined on different spaces (tensor product)

Kernel combination
Linear combination

• A kernel can be rescaled by an arbitrary positive constant: k_β(x, x') = β k(x, x')

• We can e.g. define linear combinations of kernels (each rescaled by the desired weight):

  k_sum(x, x') = Σ_{i=1}^K β_i k_i(x, x')

Note

• The weights of the linear combination can be learned together with the predictor parameters (the α's)

• This amounts to performing kernel learning

Kernel combination
Decomposition kernels

• Use the combination operators (sum and product) to define kernels on structures.

• Rely on a decomposition relationship R(x) = (x1 , . . . , xD ) breaking a structure into its parts

E.g. for strings

• R(x) = (x_1, …, x_D) could e.g. break the string x into substrings such that x_1 ◦ … ◦ x_D = x (where ◦ is string concatenation)
• E.g. (D = 3, empty string not allowed): "abcd" decomposes as ("a", "b", "cd"), ("a", "bc", "d") or ("ab", "c", "d")

Kernel combination
Convolution kernels

• Decomposition kernels defining a kernel as the convolution of its parts:

  (k_1 ⋆ ⋯ ⋆ k_D)(x, x') = Σ_{(x_1,…,x_D) ∈ R(x)} Σ_{(x'_1,…,x'_D) ∈ R(x')} Π_{d=1}^D k_d(x_d, x'_d)

• where the sums run over all possible decompositions of x and x'.

Convolution kernels
Set kernel

• Let R(X) be the set membership relationship (written as ∈)

• Let k_member(ξ, ξ') be a kernel defined over set elements
• The set kernel is defined as:

  k_set(X, X') = Σ_{ξ ∈ X} Σ_{ξ' ∈ X'} k_member(ξ, ξ')

Set intersection kernel

• For the delta membership kernel we obtain:

  k_∩(X, X') = |X ∩ X'|
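A tiny sketch on toy sets: with the delta membership kernel the set kernel counts matching pairs, i.e. the size of the intersection.

def set_kernel(X, Xp, k_member):
    return sum(k_member(a, b) for a in X for b in Xp)

def delta(a, b):
    return 1 if a == b else 0

A, B = {"cat", "dog", "bird"}, {"dog", "bird", "fish"}
print(set_kernel(A, B, delta), len(A & B))   # both equal 2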

Kernel combination
Kernel normalization

• Kernel values can often be influenced by the dimension (size) of the objects

• E.g. a longer string has more substrings → higher kernel value
• This effect can be reduced by normalizing the kernel

Cosine normalization

• Cosine normalization computes the cosine of the angle between the examples in feature space:

  k̂(x, x') = k(x, x') / √(k(x, x) k(x', x'))
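A sketch (assumes NumPy) of cosine-normalizing a precomputed Gram matrix so that every normalized self-similarity equals 1:

import numpy as np

def cosine_normalize(K):
    d = np.sqrt(np.diag(K))        # sqrt(k(x_i, x_i)) for every example
    return K / np.outer(d, d)

K = np.array([[4.0, 2.0],
              [2.0, 9.0]])
print(cosine_normalize(K))         # diagonal 1, off-diagonal 2 / (2 * 3) = 1/3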

Kernel combination
Kernel composition

• Given a kernel over structured data k(x, x')

• it is always possible to use a basic kernel on top of it, e.g.:

  (k_{c,d} ∘ k)(x, x') = (k(x, x') + c)^d

  (k_σ ∘ k)(x, x') = exp(− (k(x, x) − 2 k(x, x') + k(x', x')) / (2σ^2))

• it corresponds to the composition of the mappings associated with the two kernels
• E.g. all possible conjunctions of up to d k-grams for string kernels
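A small sketch of the Gaussian composition above, using k_str as a hypothetical toy string kernel (assumes NumPy; any valid structured kernel could take its place):

import numpy as np

def k_str(s, t):
    # toy string kernel: shared characters counted with multiplicity
    return sum(min(s.count(c), t.count(c)) for c in set(s) | set(t))

def gaussian_composition(k, x, xp, sigma=1.0):
    return np.exp(-(k(x, x) - 2 * k(x, xp) + k(xp, xp)) / (2 * sigma ** 2))

print(gaussian_composition(k_str, "kernel", "kettle"))   # exp(-2) ≈ 0.135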

References

kernel trick: C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, 2(2), 121-167, 1998.

kernel properties: J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 (Section 3).

kernels: J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 (Section 9).
