Kernel trick
• The feature mapping Φ(·) can be very high-dimensional (e.g. think of the polynomial mapping)
• It can be very expensive to compute explicitly
• Feature mappings appear only in dot products in dual formulations
• The kernel trick consists in replacing these dot products with an equivalent kernel function:
$$k(x, x') = \Phi(x)^T \Phi(x')$$
Kernel trick
Support vector classification
• Dual optimization problem
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \underbrace{\Phi(x_i)^T \Phi(x_j)}_{k(x_i, x_j)}$$
$$\text{subject to} \quad 0 \le \alpha_i \le C \quad i = 1, \dots, m, \qquad \sum_{i=1}^m \alpha_i y_i = 0$$
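This dual is a quadratic program and can be solved with any off-the-shelf solver. Below is a minimal sketch using scipy's SLSQP; the toy data, the linear kernel choice, and the function name solve_svc_dual are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svc_dual(K, y, C=1.0):
    """Maximize sum(a) - 0.5 a^T (yy^T * K) a  s.t.  0 <= a_i <= C, a^T y = 0."""
    m = len(y)
    Q = (y[:, None] * y[None, :]) * K               # Q_ij = y_i y_j k(x_i, x_j)
    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()  # minimize the negated dual objective
    res = minimize(neg_dual, np.zeros(m),
                   bounds=[(0, C)] * m,
                   constraints={'type': 'eq', 'fun': lambda a: a @ y})
    return res.x

# Toy linearly separable data with the linear kernel k(x, x') = x^T x'
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = solve_svc_dual(X @ X.T, y, C=10.0)
print(alpha)  # nonzero entries mark the support vectors
```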
Kernel trick
Polynomial kernel
• Homogeneous:
$$k(x, x') = (x^T x')^d$$
• E.g. ($d = 2$):
$$k\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} x_1' \\ x_2' \end{pmatrix}\right) = (x_1 x_1' + x_2 x_2')^2 = (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'$$
$$= \underbrace{\begin{pmatrix} x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{pmatrix}}_{\Phi(x)^T} \underbrace{\begin{pmatrix} x_1'^2 \\ \sqrt{2}\, x_1' x_2' \\ x_2'^2 \end{pmatrix}}_{\Phi(x')}$$
Kernel trick
Polynomial kernel
• Inhomogeneous: $k(x, x') = (1 + x^T x')^d$
• E.g. ($d = 2$):
$$k\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} x_1' \\ x_2' \end{pmatrix}\right) = (1 + x_1 x_1' + x_2 x_2')^2 = 1 + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' + 2 x_2 x_2' + 2 x_1 x_1' x_2 x_2'$$
$$= \underbrace{\begin{pmatrix} 1 & \sqrt{2}\, x_1 & \sqrt{2}\, x_2 & x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{pmatrix}}_{\Phi(x)^T} \underbrace{\begin{pmatrix} 1 \\ \sqrt{2}\, x_1' \\ \sqrt{2}\, x_2' \\ x_1'^2 \\ \sqrt{2}\, x_1' x_2' \\ x_2'^2 \end{pmatrix}}_{\Phi(x')}$$
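Both expansions above are easy to verify numerically. A small sketch (the function names phi_hom and phi_inhom are mine) checking that the explicit maps reproduce the kernels:

```python
import numpy as np

def phi_hom(x):
    # Explicit map for the homogeneous kernel (x^T x')^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def phi_inhom(x):
    # Explicit map for the inhomogeneous kernel (1 + x^T x')^2
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose((x @ xp) ** 2, phi_hom(x) @ phi_hom(xp))
assert np.isclose((1 + x @ xp) ** 2, phi_inhom(x) @ phi_inhom(xp))
```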
Valid Kernels
Dot product in feature space
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad k(x, x') = \Phi(x)^T \Phi(x')$$
Note
• The kernel generalizes the notion of dot product to arbitrary input space (e.g. protein sequences)
• It can be seen as a measure of similarity between objects
Valid Kernels
Gram matrix
• Given examples $\{x_1, \dots, x_m\}$ and a kernel $k$, the Gram matrix $K$ is the $m \times m$ matrix with entries $K_{ij} = k(x_i, x_j)$
Valid Kernels
Positive definite matrix
• A symmetric m × m matrix K is positive definite (p.d.) if
$$\sum_{i,j=1}^m c_i c_j K_{ij} \ge 0, \quad \forall c \in \mathbb{R}^m$$
If equality only holds for $c = 0$, the matrix is strictly positive definite (s.p.d.)
Alternative conditions
• All eigenvalues are non-negative (positive for s.p.d.)
• There exists a matrix $B$ such that $K = B^T B$
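A quick numerical illustration of both conditions; the random matrix B and the tolerance are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
K = B.T @ B                        # K = B^T B is p.d. by construction

# Condition 1: c^T K c >= 0 (here for a random c; it equals ||Bc||^2)
c = rng.standard_normal(3)
assert c @ K @ c >= 0

# Condition 2: all eigenvalues non-negative (up to numerical tolerance)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```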
Valid Kernels
Positive definite kernels
• A positive definite kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ giving rise to a p.d. Gram matrix for any $m$ and any choice of $\{x_1, \dots, x_m\}$
• Positive definiteness is a necessary and sufficient condition for a kernel to correspond to a dot product under some feature map $\Phi$
Kernel machines
Support vector regression
• Dual problem:
$$\max_{\alpha \in \mathbb{R}^m} \; -\frac{1}{2} \sum_{i,j=1}^m (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \underbrace{\Phi(x_i)^T \Phi(x_j)}_{k(x_i, x_j)} - \epsilon \sum_{i=1}^m (\alpha_i^* + \alpha_i) + \sum_{i=1}^m y_i (\alpha_i^* - \alpha_i)$$
$$\text{subject to} \quad \sum_{i=1}^m (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C] \quad \forall i \in [1, m]$$
• Regression function:
$$f(x) = w^T \Phi(x) + w_0 = \sum_{i=1}^m (\alpha_i - \alpha_i^*) \underbrace{\Phi(x_i)^T \Phi(x)}_{k(x_i, x)} + w_0$$
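A minimal sketch of evaluating this regression function, assuming the dual variables have already been obtained (the values below are placeholders, and svr_predict is an illustrative name):

```python
import numpy as np

def svr_predict(x, X_train, alpha, alpha_star, w0, kernel):
    """f(x) = sum_i (alpha_i - alpha*_i) k(x_i, x) + w0."""
    coeffs = alpha - alpha_star
    return sum(c * kernel(xi, x) for c, xi in zip(coeffs, X_train)) + w0

linear = lambda a, b: a @ b
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
alpha, alpha_star = np.array([0.5, 0.0, 0.2]), np.array([0.0, 0.3, 0.0])
print(svr_predict(np.array([2.0, 2.0]), X_train, alpha, alpha_star, 0.1, linear))
```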
Kernel machines
Smallest Enclosing Hypersphere
• Dual formulation:
$$\max_{\alpha \in \mathbb{R}^m} \sum_{i=1}^m \alpha_i \underbrace{\Phi(x_i)^T \Phi(x_i)}_{k(x_i, x_i)} - \sum_{i,j=1}^m \alpha_i \alpha_j \underbrace{\Phi(x_i)^T \Phi(x_j)}_{k(x_i, x_j)}$$
$$\text{subject to} \quad \sum_{i=1}^m \alpha_i = 1, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, m.$$
• Distance function:
$$R^2(x) = \underbrace{\Phi(x)^T \Phi(x)}_{k(x, x)} - 2 \sum_{i=1}^m \alpha_i \underbrace{\Phi(x_i)^T \Phi(x)}_{k(x_i, x)} + \sum_{i,j=1}^m \alpha_i \alpha_j \underbrace{\Phi(x_i)^T \Phi(x_j)}_{k(x_i, x_j)}$$
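The distance function translates directly into code. A sketch, with placeholder dual coefficients summing to one:

```python
import numpy as np

def hypersphere_sq_dist(x, X_train, alpha, kernel):
    """R^2(x) = k(x,x) - 2 sum_i a_i k(x_i,x) + sum_ij a_i a_j k(x_i,x_j)."""
    k_xx = kernel(x, x)
    k_ix = np.array([kernel(xi, x) for xi in X_train])
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    return k_xx - 2 * alpha @ k_ix + alpha @ K @ alpha

linear = lambda a, b: a @ b
X_train = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
alpha = np.array([0.4, 0.3, 0.3])   # constraints: sum to 1, 0 <= a_i <= C
print(hypersphere_sq_dist(np.array([1.0, 1.0]), X_train, alpha, linear))
```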
Kernel machines
(Stochastic) Perceptron: $f(x) = w^T x$
1. Initialize $w = 0$
2. Iterate until all examples are correctly classified:
   for each misclassified example $(x_i, y_i)$: $w \leftarrow w + \eta\, y_i x_i$
Kernel Perceptron: $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$
1. Initialize $\alpha_i = 0 \; \forall i$
2. Iterate until all examples are correctly classified:
   for each misclassified example $(x_i, y_i)$: $\alpha_i \leftarrow \alpha_i + \eta\, y_i$
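A runnable version of the kernel perceptron pseudocode above; the epoch-based stopping rule and the toy data are assumptions added for the sketch:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, eta=1.0, max_epochs=100):
    m = len(y)
    alpha = np.zeros(m)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            f_i = alpha @ K[:, i]        # f(x_i) = sum_j alpha_j k(x_j, x_i)
            if y[i] * f_i <= 0:          # misclassified (or on the boundary)
                alpha[i] += eta * y[i]   # update only the offending coefficient
                mistakes += 1
        if mistakes == 0:                # all examples correctly classified
            break
    return alpha

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(kernel_perceptron(X, y, lambda a, b: a @ b))
```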
Kernels
Basic kernels
• linear kernel:
$$k(x, x') = x^T x'$$
• polynomial kernel:
$$k_{d,c}(x, x') = (x^T x' + c)^d$$
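As plain functions (a direct transcription; the names are mine):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                    # k(x, x') = x^T x'

def polynomial_kernel(x, xp, d=2, c=1.0):
    return (x @ xp + c) ** d         # k_{d,c}(x, x') = (x^T x' + c)^d

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp))
```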
Kernels
Gaussian kernel
$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$$
• $\sigma$ is the width parameter, controlling how fast similarity decays with distance
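And as a function (a sketch; the name and the default σ are mine):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(gaussian_kernel(x, xp), gaussian_kernel(x, xp, sigma=5.0))
```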
Kernels
Kernels on structured data
• Kernels are a generalization of dot products to arbitrary domains
• It is possible to design kernels over structured objects like sequences, trees or graphs
• The idea is to design a pairwise function measuring the similarity of two objects
• This measure has to satisfy the p.d. conditions to be a valid kernel
$$k_\delta(x, x') = \delta(x, x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{otherwise} \end{cases}$$
• Simplest kernel on structures
• $x$ does not need to be a vector! (no boldface to stress it)
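Since x need not be a vector, the delta kernel can compare arbitrary objects, e.g. strings (an illustrative sketch):

```python
def delta_kernel(x, xp):
    # k_delta(x, x') = 1 if x == x' else 0; works on any comparable objects
    return 1.0 if x == xp else 0.0

print(delta_kernel("ACGT", "ACGT"), delta_kernel("ACGT", "ACGA"))
```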
Kernels
Kernel combination
• Simpler kernels can be combined using certain operators (e.g. sum, product)
• Kernel combination allows designing complex kernels on structures from simpler ones
• Correctly using combination operators guarantees that the resulting complex kernels are p.d.
Kernel combination
Kernel Sum
• The sum of two kernels corresponds to the concatenation of their respective feature spaces:
$$(k_1 + k_2)(x, x') = k_1(x, x') + k_2(x, x') = \Phi_1(x)^T \Phi_1(x') + \Phi_2(x)^T \Phi_2(x') = \begin{pmatrix} \Phi_1(x) \\ \Phi_2(x) \end{pmatrix}^T \begin{pmatrix} \Phi_1(x') \\ \Phi_2(x') \end{pmatrix}$$
• The two kernels can be defined on different spaces (direct sum, e.g. string spectrum kernel plus string length)
Kernel combination
Kernel Product
• The product of two kernels corresponds to the Cartesian product of their features:
$$(k_1 \times k_2)(x, x') = k_1(x, x')\, k_2(x, x') = \sum_{i=1}^n \Phi_{1i}(x) \Phi_{1i}(x') \sum_{j=1}^m \Phi_{2j}(x) \Phi_{2j}(x')$$
$$= \sum_{i=1}^n \sum_{j=1}^m \big(\Phi_{1i}(x) \Phi_{2j}(x)\big)\big(\Phi_{1i}(x') \Phi_{2j}(x')\big) = \sum_{k=1}^{nm} \Phi_{12k}(x) \Phi_{12k}(x') = \Phi_{12}(x)^T \Phi_{12}(x')$$
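A numerical sanity check (a sketch) that both the sum and the elementwise (Hadamard) product of two p.d. Gram matrices remain p.d.:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
K1 = X @ X.T                      # linear-kernel Gram matrix
K2 = (1 + X @ X.T) ** 2           # inhomogeneous polynomial-kernel Gram matrix

for K in (K1 + K2, K1 * K2):      # sum kernel; elementwise product = product kernel
    assert np.all(np.linalg.eigvalsh(K) >= -1e-8)
```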
Kernel combination
Linear combination
$$k_{\text{sum}}(x, x') = \sum_{k=1}^K \beta_k k_k(x, x'), \qquad \beta_k \ge 0$$
Note
• The weights of the linear combination can be learned simultaneously with the predictor weights (the alphas)
Kernel combination
Decomposition kernels
• Use the combination operators (sum and product) to define kernels on structures
• Rely on a decomposition relationship $R(x) = (x_1, \dots, x_D)$ breaking a structure into its parts
• E.g. $R(x) = (x_1, \dots, x_D)$ could break string $x$ into substrings such that $x_1 \circ \dots \circ x_D = x$ (where $\circ$ is string concatenation)
• E.g. ($D = 3$, empty string not allowed): "abcd" decomposes as ("a","b","cd"), ("a","bc","d"), or ("ab","c","d")
Kernel combination
Convolution kernels
$$(k_1 \star \dots \star k_D)(x, x') = \sum_{(x_1, \dots, x_D) \in R(x)} \; \sum_{(x_1', \dots, x_D') \in R(x')} \; \prod_{d=1}^D k_d(x_d, x_d')$$
Convolution kernels
Set kernel
$$k_\cap(X, X') = |X \cap X'|$$
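A one-line sketch using Python sets:

```python
def set_kernel(A, B):
    # k(A, B) = |A ∩ B|: the number of shared elements
    return len(A & B)

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))  # 2
```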
Kernel combination
Kernel normalization
Cosine normalization
• Cosine normalization computes the cosine of the angle between the two examples in feature space:
$$\hat{k}(x, x') = \frac{k(x, x')}{\sqrt{k(x, x)\, k(x', x')}}$$
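A sketch wrapping an arbitrary kernel with cosine normalization (normalize_kernel is an illustrative name):

```python
import numpy as np

def normalize_kernel(k):
    """Wrap kernel k so that k_hat(x, x') = k(x,x') / sqrt(k(x,x) k(x',x'))."""
    return lambda x, xp: k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

linear = lambda a, b: a @ b
cosine = normalize_kernel(linear)   # for the linear kernel this is cosine similarity
x, xp = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(cosine(x, xp))                # ~0.707
```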
Kernel combination
Kernel composition
• Composing a kernel with another kernel, e.g. $(k_{d,c} \circ k)(x, x') = (k(x, x') + c)^d$, corresponds to the composition of the mappings associated with the two kernels
• E.g. all possible conjunctions of up to $d$ $k$-grams for string kernels
References
kernel trick: C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
kernel properties: J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 (Section 3).
kernels: J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 (Section 9).