
Kernel Methods

Feature mapping at no cost


Kernel methods
• Many machine learning algorithms depend on the data
x1, ..., xn only through the pair-wise inner products ⟨xi, xj⟩.
• Inner products capture the geometry of the data set.
• Geometrically inspired algorithms (e.g., SVM) depend only on inner products.
• When predictions are based on inner products of data points, each inner product can be replaced with a kernel function.
• This keeps the advantages of linear models, but lets them capture non-linear patterns in the data.
• K(x, x′) essentially measures the similarity between x and x′.
Similarity measure is important
Gaussian/RBF Kernel
Distances and inner products

Algorithms depending on distances can be rewritten in terms of inner products.
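A minimal sketch of this rewriting: the squared Euclidean distance expands into inner products alone, ||x − y||² = ⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# direct squared distance
dist_sq = np.sum((x - y) ** 2)

# same quantity written purely in terms of inner products
from_inner = x @ x - 2 * (x @ y) + y @ y

assert np.isclose(dist_sq, from_inner)
```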
The terminology “kernel trick” simply refers to taking any algorithm that depends on the data only through inner products, and replacing each inner product ⟨xi, xj⟩ by a kernel value k(xi, xj).
Feature Space is a dot product space

Φ : x ↦ Φ(x),   R^d → F

Φ is a non-linear mapping to F, which can be:
1. a high-dimensional space
2. an infinite-dimensional countable space
3. a function space (Hilbert space)
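A sketch of why the infinite-dimensional case arises (using the Gaussian/RBF kernel as the example; any kernel with an analytic expansion behaves similarly):

```latex
e^{-\|x-y\|^2/c}
  = e^{-\|x\|^2/c}\, e^{-\|y\|^2/c}\, e^{2\langle x, y\rangle/c}
  = e^{-\|x\|^2/c}\, e^{-\|y\|^2/c} \sum_{k=0}^{\infty} \frac{(2/c)^k}{k!}\, \langle x, y\rangle^{k}
```

Each term ⟨x, y⟩^k is itself an inner product of degree-k monomial features, so the corresponding Φ maps into a space with countably many dimensions.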

example: Φ(x, y) = (x², y², √2·xy)
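For this quadratic map, the feature-space inner product can be computed without ever forming Φ: ⟨Φ(u), Φ(v)⟩ = (⟨u, v⟩)². A quick numerical check:

```python
import numpy as np

def phi(v):
    """Quadratic feature map phi(x, y) = (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y])

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

# inner product in feature space equals the squared inner product in input space
assert np.isclose(phi(u) @ phi(v), (u @ v) ** 2)
```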

How Kernel solves XOR
Find the weight vector that solves XOR under the following feature mapping.
Look at the earlier notes: we discussed this as an exercise problem.
QUADRATIC FEATURE MAPPING
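A minimal sketch of the idea (the ±1 encoding and the particular weight vector below are one standard choice, assumed here rather than taken from the earlier notes): XOR is not linearly separable in the input space, but adding the quadratic cross term x1·x2 makes a single weight on that feature sufficient.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])            # XOR labels in {-1, +1}

def phi(x):
    # quadratic feature map: keep the inputs, add the cross term
    return np.array([x[0], x[1], x[0] * x[1]])

w = np.array([0.0, 0.0, -1.0])          # weight vector in feature space

preds = np.sign([w @ phi(x) for x in X])
assert np.array_equal(preds, y)         # XOR is separable in feature space
```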
Cubic Kernel
Complexity cost
Polynomial Kernel
Computational cost saving
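A sketch of the saving: a degree-d polynomial kernel on n-dimensional inputs corresponds to a feature space with C(n+d, d) monomial dimensions, yet evaluating k(x, y) = (⟨x, y⟩ + 1)^d costs only O(n), never touching that space. (The kernel form and constants here are illustrative.)

```python
import math
import numpy as np

n, d = 100, 5
explicit_dim = math.comb(n + d, d)      # size of the explicit feature space
print(explicit_dim)                     # ~96.5 million monomial features

rng = np.random.default_rng(0)
x, y = rng.standard_normal(n), rng.standard_normal(n)
k = (x @ y + 1.0) ** d                  # O(n) kernel evaluation instead
```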
Properties of Kernels
The Kernel (Gram) Matrix of dot products contains all the information the algorithm needs.
Formal Definitions
PSD property (not needed for this course)
Mercer’s condition (not needed for this course)
CAN ANY FUNCTION BE USED AS A KERNEL FUNCTION?

Mercer’s condition tells us which kernel functions can be expressed as a dot product of feature vectors in some feature space.
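In practice, a valid (Mercer) kernel must produce a positive semi-definite Gram matrix on any finite set of points. A numerical sketch of that check for the Gaussian/RBF kernel (data and the bandwidth c = 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))        # 10 arbitrary points in R^3

# pairwise squared distances, then the RBF Gram matrix K_ij = exp(-||xi-xj||^2 / c)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

eigvals = np.linalg.eigvalsh(K)         # K is symmetric
assert eigvals.min() > -1e-10           # PSD up to numerical tolerance
```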
Pros and Cons

Good
• Kernel algorithms are typically constrained convex optimization
problems, solved with either spectral methods or convex optimization tools.
• Efficient algorithms do exist in most cases.
• The similarity to linear methods facilitates analysis. There are strong
generalization bounds on test error.

Bad
• You need to choose the appropriate kernel.
• Kernel learning is prone to over-fitting.
• All information must pass through the kernel bottleneck (the Gram matrix).

Modularity
Kernel methods consist of two modules:

1) The choice of kernel ( HOW TO DO THIS?)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.

some kernels:

k(x, y) = e^(−||x − y||² / c)          (Gaussian/RBF)
k(x, y) = (⟨x, y⟩ + θ)^d               (polynomial)
k(x, y) = tanh(α⟨x, y⟩ + θ)            (sigmoid)
k(x, y) = 1 / √(||x − y||² + c²)       (inverse multiquadric)

some kernel algorithms:

- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
- kernel CCA
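Sketch implementations of the kernels listed above. The parameter names (c, θ, α, d) follow the slide, the default values are illustrative only, and the inverse-multiquadric form is an assumption about the slide's last formula.

```python
import numpy as np

def gaussian(x, y, c=1.0):
    # Gaussian/RBF: exp(-||x - y||^2 / c)
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    # polynomial: (<x, y> + theta)^d
    return (x @ y + theta) ** d

def sigmoid(x, y, alpha=1.0, theta=0.0):
    # sigmoid: tanh(alpha * <x, y> + theta)
    return np.tanh(alpha * (x @ y) + theta)

def inverse_multiquadric(x, y, c=1.0):
    # inverse multiquadric: 1 / sqrt(||x - y||^2 + c^2)
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian(x, y), polynomial(x, y))
```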
