
KERNEL METHODS

Prof. Navneet Goyal
CS Dept.
BITS-Pilani, Pilani Campus
INDIA

Figure source: http://wwwold.ini.ruhr-uni-bochum.de/thbio/group/neuralnet/index_p.html
Kernel Methods

§ In computer science, kernel methods (KMs) are a class of
algorithms for pattern analysis, whose best known element
is the support vector machine (SVM) (Wikipedia)
§ Transformations
§ Feature Spaces
§ Kernel Functions
§ Kernel Tricks
§ Inner Products
Kernel Methods

Algorithms capable of operating with kernels include:


§ Support vector machine (SVM)
§ Gaussian processes
§ Fisher's linear discriminant analysis (LDA)
§ Principal components analysis (PCA) (Kernel PCA)
§ Canonical correlation analysis
§ Ridge regression
§ Spectral clustering
§ Linear adaptive filters
§ …
Kernel Methods

§ Kernels are non-linear generalizations of inner products
Kernel Methods

§ Any kernel-based method consists of two modules:
ú A mapping into an embedding or feature space
ú A learning algorithm designed to discover linear
patterns in that space
Kernel Methods
§ Why does this approach work?
ú Detecting linear patterns has been the focus of
much research in statistics and machine learning
ú Resulting algorithms are well understood and
efficient
ú Computational shortcut: makes it possible to
represent linear patterns efficiently in high
dimensional space to ensure adequate
representational power
ú The shortcut is nothing but the KERNEL FUNCTION
Kernel Methods
§ Kernel methods allow us to extend algorithms such as
SVMs to define non-linear decision boundaries
§ Other algorithms that only depend on inner products
between data points can be extended similarly
§ Kernel functions that are symmetric and positive
definite allow us to implicitly define inner products in
a high-dimensional space
§ Replacing inner products in input space with positive
definite kernels immediately extends algorithms like
SVM to
ú Linear separation in high dimensional space
ú Or equivalently to a non-linear separation in input space
Types of Kernels
§ Positive definite symmetric kernels (PDS)
§ Negative definite symmetric kernels (NDS)
§ Role of NDS kernels in the construction of PDS kernels!
Kernel Methods
§ Input space, χ
§ High dimensional space, ℍ
§ ℍ can be really large!!
§ Document classification
ú Trigrams
ú Vocabulary of 100,000 words
ú Dimension of the feature space reaches 10¹⁵ (= 100,000³)
ú Generalization ability of large-margin classifiers like
SVM does not depend on dimensions of the feature
space, but on the margin and no. of training examples
Kernel Functions
§ A function K: χ × χ → ℝ is called a kernel over χ
§ For any two points x, x′ ∈ χ,
K(x, x′) = 〈ϕ(x), ϕ(x′)〉
for some mapping ϕ: χ → ℍ to a Hilbert space ℍ
called a feature space
§ K is efficient!
ú Computing K(x, x′) is O(N), where N is the input dimension
ú Computing 〈ϕ(x), ϕ(x′)〉 explicitly is O(dim ℍ), with dim ℍ ≫ N
§ K is flexible!
ú No need to explicitly define or compute ϕ
ú The kernel K can be chosen arbitrarily so long as the existence
of ϕ is guaranteed, i.e. K satisfies Mercer’s condition
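A minimal numerical sketch of this efficiency point, assuming the degree-2 polynomial kernel K(x, x′) = 〈x, x′〉² (whose feature map lists all pairwise products); the data below is arbitrary:

```python
import numpy as np

# Degree-2 polynomial kernel: K(x, x') = <x, x'>**2, computed in O(N).
def poly_kernel(x, xp):
    return np.dot(x, xp) ** 2

# The corresponding explicit feature map: all pairwise products x_i * x_j,
# which lives in a space of dimension N**2.
def phi(x):
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, xp = rng.normal(size=500), rng.normal(size=500)

# Same value either way, but the kernel never builds the N**2-dimensional vector.
print(np.isclose(poly_kernel(x, xp), np.dot(phi(x), phi(xp))))   # True
```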
Kernel Functions
§ Mercer’s Condition
§ A kernel function K can be expressed as
K(x, x′) = 〈ϕ(x), ϕ(x′)〉
iff, for every function g(x) such that ∫g(x)² dx is
finite,
∬ K(x, x′) g(x) g(x′) dx dx′ ≥ 0

Kernels satisfying Mercer’s condition are called
Positive Definite Kernel Functions!
The transformed space of SVM kernels is called a
Reproducing Kernel Hilbert Space (RKHS)
Kernel Functions
§ Examples
k(x, y) = e^(−‖x − y‖² / c)
k(x, y) = (〈x, y〉 + θ)^d
k(x, y) = tanh(α〈x, y〉 + θ)
k(x, y) = 1 / (‖x − y‖² + c²)
§ Show that the polynomial kernel function satisfies Mercer’s
condition
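A hedged numerical sanity check for this exercise (not a proof): a kernel satisfying Mercer’s condition must produce a positive semi-definite Gram matrix on any finite sample, so the eigenvalues below should all be non-negative. The parameter values and sample are illustrative assumptions.

```python
import numpy as np

def poly_gram(X, theta=1.0, d=3):
    """Gram matrix of the polynomial kernel k(x, y) = (<x, y> + theta)**d."""
    return (X @ X.T + theta) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 random points in R^5

# Positive definite (Mercer) kernels give PSD Gram matrices.
eigvals = np.linalg.eigvalsh(poly_gram(X))
print(eigvals.min() >= -1e-8)           # True, up to round-off
```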
Feature Spaces

FF: :xx®
®FF( x( ),
x), RR ®
dd
®FF

F
L2

example: ( x, y ) ® ( x , y , 2 xy )
2 2

13
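A short sketch (assuming the intended kernel is the squared inner product 〈a, b〉² on ℝ²) verifying that this feature map reproduces it; the test points are arbitrary:

```python
import numpy as np

def phi(p):
    """Feature map (x, y) -> (x**2, y**2, sqrt(2)*x*y)."""
    x, y = p
    return np.array([x ** 2, y ** 2, np.sqrt(2) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Inner product in the feature space equals the squared inner product in R^2.
print(np.dot(phi(a), phi(b)))   # 121.0
print(np.dot(a, b) ** 2)        # (1*3 + 2*4)**2 = 121.0
```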
Modularity

Kernel methods consist of two modules:

1) The choice of kernel (this is non-trivial)


2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.


some kernels:
k(x, y) = e^(−‖x − y‖² / c)
k(x, y) = (〈x, y〉 + θ)^d
k(x, y) = tanh(α〈x, y〉 + θ)
k(x, y) = 1 / (‖x − y‖² + c²)

some kernel algorithms:
- support vector machine
- Fisher discriminant analysis
- kernel regression
- kernel PCA
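A hedged scikit-learn sketch of this modularity (the Gaussian kernel width, toy data, and labels are assumptions): one precomputed kernel matrix is handed unchanged to two different kernel algorithms.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)      # toy labels

# One kernel (Gaussian), computed once as a Gram matrix ...
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

# ... plugged into two different kernel algorithms.
svm = SVC(kernel="precomputed").fit(K, y)
kpca = KernelPCA(n_components=2, kernel="precomputed").fit(K)
print(svm.score(K, y), kpca.transform(K).shape)
```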
Kernel Trick
Kernel Trick: Examples

Consider the mapping:

The points in feature space corresponding to x1 and x2:

Find:
Kernel Functions: Example
§ Given x = (x₁, x₂, x₃) and y = (y₁, y₂, y₃), find the
kernel for the function:
f(x) = (x₁x₁, x₁x₂, x₁x₃, x₂x₁, x₂x₂, x₂x₃, x₃x₁, x₃x₂, x₃x₃)
Also find f(x)·f(y) without explicitly computing f(x) and f(y).
Verify your result using x = (1, 2, 3) and y = (4, 5, 6).
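A short verification sketch for this exercise; the closed form used below, K(x, y) = (x·y)², is the expected answer and is stated here as an assumption:

```python
import numpy as np

def f(v):
    """Explicit feature map: all ordered pairwise products v_i * v_j."""
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(f(x), f(y)))    # 1024.0
print(np.dot(x, y) ** 2)     # (4 + 10 + 18)**2 = 32**2 = 1024.0
```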
Kernel Functions: Example
§ Consider a two-dimensional input space X ⊆ ℝ²
together with the feature map:
ϕ: x = (x₁, x₂) ⟼ ϕ(x) = (x₁², x₂², √2 x₁x₂) ∈ F = ℝ³
§ Find the kernel corresponding to ϕ and F
§ Also find the kernel function corresponding to:
ϕ: x = (x₁, x₂) ⟼ ϕ(x) = (x₁², x₂², x₁x₂, x₂x₁) ∈ F = ℝ⁴
§ What conclusions can you draw?
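A small numerical sketch of the expected conclusion (offered as an assumption, not the official solution): both feature maps realize the same kernel K(x, x′) = 〈x, x′〉², so the feature map associated with a kernel is not unique.

```python
import numpy as np

phi3 = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
phi4 = lambda x: np.array([x[0] ** 2, x[1] ** 2, x[0] * x[1], x[1] * x[0]])

x, xp = np.array([1.0, 3.0]), np.array([2.0, -1.0])

# Two different feature maps, one and the same kernel value.
print(np.dot(phi3(x), phi3(xp)))   # 1.0
print(np.dot(phi4(x), phi4(xp)))   # 1.0
print(np.dot(x, xp) ** 2)          # (1*2 + 3*(-1))**2 = 1.0
```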
Kernel Functions: Example
§ Find a mapping f from ℝ² → ℝ⁶ corresponding to
the kernel K(X, Y) = (2 X·Y + 3)²
§ For a given kernel, is the mapping unique?
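One possible answer, offered as a hedged sketch (expanding (2 X·Y + 3)² term by term; the particular split of the cross term and the scaling below are choices, not the only valid ones):

```python
import numpy as np

def f(v):
    """One R^2 -> R^6 map whose inner product equals (2*X.Y + 3)**2."""
    x1, x2 = v
    return np.array([2 * x1 ** 2,
                     2 * x2 ** 2,
                     2 * np.sqrt(2) * x1 * x2,
                     2 * np.sqrt(3) * x1,
                     2 * np.sqrt(3) * x2,
                     3.0])

X, Y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(f(X), f(Y)))             # 25.0
print((2 * np.dot(X, Y) + 3) ** 2)    # (2*1 + 3)**2 = 25.0
```

As on the previous slide, the cross term could instead be split into two separate coordinates, giving an equally valid map into ℝ⁷; this already answers the second question: the mapping for a given kernel is not unique.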
Kernel Functions
§ You must be wondering about:
ú How to choose the mapping ϕ ?
ú How to avoid the curse of dimensionality (CoD)?
ú Which Kernel to choose?

Let us try to find answers in the context of SVMs!


Kernel Functions
§ How can we extend the linear SVM approach to
obtain a non-linear SVM approach?
§ 2 steps:
ú Transform the original input data to a higher
dimensional space using a non-linear mapping ϕ
ú Search for a linear separating hyperplane in the
new space
Kernel Functions
§ How to choose the mapping ϕ ?
ú Several common non-linear mappings can be used
ú In fact, we do not even need to know what the mapping
is ☺
§ How to avoid CoD?
ú The dot product in feature space can be represented as a
kernel function of the input points:
K(x, x′) = ϕ(x)·ϕ(x′)
ú SVMs do not suffer from CoD
  The complexity of an SVM depends on the number of Support
Vectors (SVs), not on the dimension of the feature space
Kernel Functions
§ Which Kernel to choose?
ú Properties of the kinds of kernels that can be used
to replace the dot product have been studied ☺
ú Admissible kernel functions include:
  Polynomial Kernel
  Gaussian Radial Basis Function (RBF) Kernel
  Sigmoid Kernel
ú Each of these kernels results in a different nonlinear
classifier in the input space (see the sketch below)
ú Connection with ANN?
  Later – after we have discussed ANNs
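A hedged scikit-learn sketch of this point (dataset, parameters, and the training-accuracy check are illustrative assumptions): the same SVC algorithm, with each of the admissible kernels above, yields a different nonlinear classifier.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ["poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    # Each kernel induces its own decision boundary and set of support vectors.
    print(kernel, clf.n_support_, round(clf.score(X, y), 2))
```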
Kernel Functions
§ Which Kernel to choose?
ú There are no golden rules for determining which admissible
kernel will result in the most accurate SVM ☹
ú In practice, the kernel chosen does not generally make a large
difference in the resulting accuracy
ú SVM training always finds a global solution
§ Challenges in SVM
ú Scalability
  Need to improve the speed of training/testing so that SVM
scales to very large data sets (millions of SVs)
ú Determining the best kernel for a given data set
ú More efficient implementations of SVM for multiclass
problems
Kernel Functions

Non-linear SVM             | Input Space (low dimensional)       | Feature Space (high dimensional)
---------------------------|-------------------------------------|---------------------------------
Optimization Problem       | Convex optimization (concept of     | Convex optimization (concept of
                           | Duals and Lagrange Multipliers)     | Duals and Lagrange Multipliers)
Main operation in Training | Xᵢ · Xⱼ                             | Φ(Xᵢ) · Φ(Xⱼ)  (= K(Xᵢ, Xⱼ))
Classifier                 | Non-linear separating hypersurface  | Maximal margin hyperplane
Kernel Functions: Examples
§ Decision boundary: a circle centered at
(0.5, 0.5) with r = 0.2
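A hedged sketch of this example (the sampling scheme and kernel parameters are assumptions): points in the unit square are labeled by whether they fall inside the circle of radius 0.2 centered at (0.5, 0.5), and an RBF-kernel SVM recovers the circular boundary that no linear separator could.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
# Class 1 = inside the circle centered at (0.5, 0.5) with radius 0.2.
y = (((X - 0.5) ** 2).sum(axis=1) <= 0.2 ** 2).astype(int)

clf = SVC(kernel="rbf", gamma=10.0, C=10.0).fit(X, y)
print(round(clf.score(X, y), 3))   # typically close to 1.0 on this data
```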
Goodies and Baddies
Goodies:
• Kernel algorithms are typically constrained convex optimization
problems → solved with either spectral methods or convex
optimization tools.
• Efficient algorithms do exist in most cases.
• The similarity to linear methods facilitates analysis. There are strong
generalization bounds on test error.

Baddies:
• You need to choose the appropriate kernel
• Kernel learning is prone to over-fitting
• All information must go through the kernel bottleneck.

