
Kernel Methods and Nonlinear Classification

Piyush Rai
CS5350/6350: Machine Learning
September 15, 2011

Kernel Methods: Motivation

Often we want to capture nonlinear patterns in the data
Nonlinear Regression: Input-output relationship may not be linear
Nonlinear Classification: Classes may not be separable by a linear boundary

Linear models (e.g., linear regression, linear SVM) are just not rich enough

Kernels: Make linear models work in nonlinear settings
By mapping data to higher dimensions where it exhibits linear patterns
Apply the linear model in the new input space
Mapping ≡ changing the feature representation

Note: Such mappings can be expensive to compute in general
Kernels give such mappings for (almost) free
In most cases, the mappings need not even be computed, using the Kernel Trick!

Classifying non-linearly separable data

Consider this binary classification problem
Each example is represented by a single feature x
No linear separator exists for this data

Now map each example as x → {x, x²}
Each example now has two features ("derived" from the old representation)
Data now becomes linearly separable in the new representation
Linear in the new representation ≡ nonlinear in the old representation
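
A minimal numeric sketch of this idea (not from the original slides; the data and the threshold are made up for illustration): mapping each 1-D example x to (x, x²) turns a class layout with positives on both sides of the negatives into one that a single linear threshold on the new second feature separates.

```python
import numpy as np

# Hypothetical 1-D data: negative class near the origin, positive class on both sides.
# No single threshold on x separates them.
x = np.array([-3.0, -2.5, 2.5, 3.0,   # positive class (label +1)
              -0.5,  0.0, 0.5])       # negative class (label -1)
y = np.array([1, 1, 1, 1, -1, -1, -1])

# Map each example as x -> (x, x^2): two "derived" features.
phi = np.column_stack([x, x ** 2])

# In the new representation a linear rule works, e.g. sign(x^2 - 2) for this data.
pred = np.sign(phi[:, 1] - 2.0)
print(np.all(pred == y))  # True
```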

Classifying non-linearly separable data

Let's look at another example:
Each example is defined by two features x = {x1, x2}
No linear separator exists for this data

Now map each example as x = {x1, x2} → z = {x1², √2 x1x2, x2²}
Each example now has three features ("derived" from the old representation)
Data now becomes linearly separable in the new representation
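
A small sketch of this particular map (illustrative only; the circularly-arranged data is made up): points separated by a circle in (x1, x2) become separable by a plane in the mapped space, since x1² + x2² is a linear function of the new features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: class -1 inside the unit circle, class +1 outside.
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

# Map each example (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2).
Z = np.column_stack([X[:, 0] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1],
                     X[:, 1] ** 2])

# A linear rule in the new space: sign(z1 + z3 - 1) = sign(x1^2 + x2^2 - 1),
# i.e. a plane in Z-space that is a circle back in X-space.
pred = np.sign(Z[:, 0] + Z[:, 2] - 1.0)
print(np.mean(pred == y))  # 1.0
```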

Feature Mapping

Consider the following mapping φ for an example x = {x1, ..., xD}:

    φ : x → {x1², x2², ..., xD², x1x2, x1x3, ..., xD−1xD}

It's an example of a quadratic mapping
Each new feature uses a pair of the original features

Problem: The mapping usually makes the number of features blow up!
Computing the mapping itself can be inefficient in such cases
Moreover, using the mapped representation could be inefficient too
e.g., imagine computing the similarity between two examples: φ(x)⊤φ(z)

Thankfully, Kernels help us avoid both these issues!
The mapping doesn't have to be explicitly computed
Computations with the mapped features remain efficient
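
To make the blow-up concrete, here is a rough count (a sketch, not from the slides): the quadratic map above produces D squared terms plus D(D−1)/2 pairwise terms, so the mapped dimension grows quadratically in D, while the kernel trick of the next slide only ever touches the original D coordinates.

```python
# Number of features produced by the quadratic mapping phi for a few input sizes D:
# D squares plus D*(D-1)/2 distinct pairs x_i x_j (i < j).
for D in (10, 100, 1000, 10_000):
    mapped_dim = D + D * (D - 1) // 2
    print(f"D = {D:>6}  ->  mapped dimension = {mapped_dim:,}")
# D =     10  ->  mapped dimension = 55
# D =    100  ->  mapped dimension = 5,050
# D =   1000  ->  mapped dimension = 500,500
# D =  10000  ->  mapped dimension = 50,005,000
```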

Kernels as High Dimensional Feature Mapping

Consider two examples x = {x1, x2} and z = {z1, z2}
Let's assume we are given a function k (kernel) that takes as inputs x and z:

    k(x, z) = (x⊤z)²
            = (x1z1 + x2z2)²
            = x1²z1² + x2²z2² + 2x1x2z1z2
            = (x1², √2 x1x2, x2²)⊤ (z1², √2 z1z2, z2²)
            = φ(x)⊤φ(z)

The above k implicitly defines a mapping φ to a higher-dimensional space:

    φ(x) = {x1², √2 x1x2, x2²}

Note that we didn't have to define/compute this mapping
Simply defining the kernel a certain way gives a higher-dimensional mapping φ
Moreover, the kernel k(x, z) also computes the dot product φ(x)⊤φ(z)
φ(x)⊤φ(z) would otherwise be much more expensive to compute explicitly
All kernel functions have these properties

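A quick numeric check of this identity (a sketch; the two example vectors are arbitrary): computing (x⊤z)² directly in the 2-D input space gives exactly the same number as the dot product of the explicit 3-D features φ(x) and φ(z).

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2           # kernel evaluated in the original 2-D space
rhs = phi(x) @ phi(z)        # dot product in the mapped 3-D space
print(lhs, rhs)              # 1.0 1.0  (both equal (1*3 + 2*(-1))^2)
print(np.isclose(lhs, rhs))  # True
```
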
Kernels: Formally Defined

Recall: Each kernel k has an associated feature mapping φ
φ takes input x ∈ X (input space) and maps it to F ("feature space")
Kernel k(x, z) takes two inputs and gives their similarity in F space:

    φ : X → F
    k : X × X → R,    k(x, z) = φ(x)⊤φ(z)

F needs to be a vector space with a dot product defined on it
(also called a Hilbert space)

Can just any function be used as a kernel function?
No. It must satisfy Mercer's Condition

Mercer's Condition

For k to be a kernel function:
There must exist a Hilbert space F for which k defines a dot product
The above is true if k is a positive definite function, i.e.,

    ∫ dx ∫ dz f(x) k(x, z) f(z) ≥ 0    (∀f ∈ L2)

This is Mercer's Condition

Let k1, k2 be two kernel functions; then the following are kernels as well:
k(x, z) = k1(x, z) + k2(x, z): direct sum
k(x, z) = α k1(x, z), α > 0: scalar product
k(x, z) = k1(x, z) k2(x, z): direct product
Kernels can also be constructed by composing these rules

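The closure rules can be checked numerically on any finite sample (a sketch, with made-up data): Gram matrices built from a sum, a positive scaling, or an elementwise product of two valid kernels should have no significantly negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                            # 30 random 5-D examples

def gram(kernel, X):
    """N x N matrix of pairwise kernel values."""
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda a, b: a @ b                                 # linear kernel
k2 = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))   # RBF kernel

K1, K2 = gram(k1, X), gram(k2, X)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

# All of these should be >= 0 up to numerical round-off.
print(min_eig(K1 + K2))        # direct sum
print(min_eig(2.5 * K1))       # scaling by a positive constant
print(min_eig(K1 * K2))        # elementwise (direct) product
```
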
The Kernel Matrix

The kernel function k also defines the Kernel Matrix K over the data
Given N examples {x1, ..., xN}, the (i, j)-th entry of K is defined as:

    Kij = k(xi, xj) = φ(xi)⊤φ(xj)

Kij: similarity between the i-th and j-th examples in the feature space F
K: N × N matrix of pairwise similarities between examples in F space
K is a symmetric matrix
K is a positive semi-definite matrix (and, for many kernels, in fact positive definite)
For a P.S.D. matrix: z⊤Kz ≥ 0, ∀z ∈ R^N (equivalently, all eigenvalues are non-negative)
The Kernel Matrix K is also known as the Gram Matrix
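
A small sketch of building and inspecting a Gram matrix (illustrative; the RBF bandwidth and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))                     # N = 50 examples, D = 3

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

N = len(X)
K = np.empty((N, N))
for i in range(N):
    for j in range(N):
        K[i, j] = rbf(X[i], X[j])                # K_ij = k(x_i, x_j)

print(K.shape)                                   # (50, 50)
print(np.allclose(K, K.T))                       # True: K is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # True: eigenvalues >= 0 (up to round-off)
```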

Some Examples of Kernels

The following are the most popular kernels for real-valued vector inputs:

Linear (trivial) Kernel:
    k(x, z) = x⊤z    (the mapping function φ is the identity; no mapping)

Quadratic Kernel:
    k(x, z) = (x⊤z)²  or  (1 + x⊤z)²

Polynomial Kernel (of degree d):
    k(x, z) = (x⊤z)^d  or  (1 + x⊤z)^d

Radial Basis Function (RBF) Kernel:
    k(x, z) = exp[−γ ||x − z||²]
    γ is a hyperparameter (also called the kernel bandwidth)
    The RBF kernel corresponds to an infinite-dimensional feature space F
    (i.e., you can't actually write down the vector φ(x))

Note: Kernel hyperparameters (e.g., d, γ) are chosen via cross-validation

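These kernels are one-liners to implement; a sketch (the parameter defaults below are arbitrary):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3, c=1.0):
    # c = 0 gives the homogeneous form (x.z)^d; c = 1 the inhomogeneous (1 + x.z)^d
    return (c + x @ z) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 0.0, -1.0])
z = np.array([0.5, 2.0, 0.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z))
```
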
Using Kernels

Kernels can turn a linear model into a nonlinear one
Recall: Kernel k(x, z) represents a dot product in some high-dimensional feature space F

Any learning algorithm in which examples only appear as dot products (xi⊤xj) can be kernelized (i.e., non-linearized)
... by replacing the xi⊤xj terms with φ(xi)⊤φ(xj) = k(xi, xj)

Most learning algorithms are like that:
Perceptron, SVM, linear regression, etc.
Many of the unsupervised learning algorithms too can be kernelized (e.g., K-means clustering, Principal Component Analysis, etc.)
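
As a concrete instance of this kernelization recipe (a sketch of the standard kernel perceptron, not an algorithm from these slides): the primal perceptron only uses examples through dot products, so its kernelized form keeps a coefficient per training example and scores a point with a weighted sum of kernel values.

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """y in {-1, +1}. Returns per-example mistake counts alpha."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            f_i = K[i] @ (alpha * y)      # f(x_i) = sum_j alpha_j y_j k(x_j, x_i)
            if y[i] * f_i <= 0:           # mistake (or untrained): update
                alpha[i] += 1.0
    return alpha

def predict(X_train, y_train, alpha, kernel, x):
    return np.sign(sum(a * yy * kernel(xx, x)
                       for a, yy, xx in zip(alpha, y_train, X_train)))

# Toy use: XOR-like data that a linear perceptron cannot fit, with a quadratic kernel.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
quad = lambda a, b: (1 + a @ b) ** 2
alpha = train_kernel_perceptron(X, y, quad)
print([predict(X, y, alpha, quad, x) for x in X])  # matches y
```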

Kernelized SVM Training

Recall the SVM dual Lagrangian:

    Maximize  LD(w, b, ξ, α, β) = Σn αn − (1/2) Σm,n αm αn ym yn (xm⊤xn)
    subject to  Σn αn yn = 0,   0 ≤ αn ≤ C,   n = 1, ..., N

Replacing xm⊤xn by φ(xm)⊤φ(xn) = k(xm, xn) = Kmn, where k(·, ·) is some suitable kernel function:

    Maximize  LD(w, b, ξ, α, β) = Σn αn − (1/2) Σm,n αm αn ym yn Kmn
    subject to  Σn αn yn = 0,   0 ≤ αn ≤ C,   n = 1, ..., N

The SVM now learns a linear separator in the kernel-defined feature space F
This corresponds to a non-linear separator in the original space X

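For completeness, here is one way this is typically used in practice (a sketch assuming scikit-learn is installed; the RBF bandwidth and C are arbitrary): the solver only ever sees the N × N kernel matrix Kmn, exactly as in the kernelized dual above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy data: label +1 outside the unit circle, -1 inside (not linearly separable).
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

def rbf_gram(A, B, gamma=1.0):
    # Pairwise squared distances, then exp(-gamma * d^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_gram(X_train, X_train), y_train)          # train on K_train (200 x 200)
acc = clf.score(rbf_gram(X_test, X_train), y_test)    # test rows: k(x_test, x_train)
print(acc)
```
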
Kernelized SVM Prediction

Prediction for a test example x (assume b = 0):

    y = sign(w⊤x) = sign( Σ_{n∈SV} αn yn xn⊤x )

SV is the set of support vectors (i.e., examples for which αn > 0)

Replacing each example with its feature-mapped representation (x → φ(x)):

    y = sign( Σ_{n∈SV} αn yn φ(xn)⊤φ(x) ) = sign( Σ_{n∈SV} αn yn k(xn, x) )

The weight vector for the kernelized case can be expressed as:

    w = Σ_{n∈SV} αn yn φ(xn) = Σ_{n∈SV} αn yn k(xn, ·)

Important: The kernelized SVM needs the support vectors at test time (except when you can write φ(xn) as an explicit, reasonably-sized vector)
In the unkernelized version, w = Σ_{n∈SV} αn yn xn can be computed and stored as a D × 1 vector, so the support vectors need not be stored
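
A direct sketch of this prediction rule in pure NumPy, assuming you already have the support vectors, their labels, and their learned αn from the dual (the numbers below are made up for illustration):

```python
import numpy as np

def svm_predict(x, support_vectors, sv_labels, sv_alphas, kernel, b=0.0):
    """y = sign( sum_{n in SV} alpha_n * y_n * k(x_n, x) + b )."""
    score = sum(a * yn * kernel(xn, x)
                for a, yn, xn in zip(sv_alphas, sv_labels, support_vectors))
    return np.sign(score + b)

# Hypothetical learned quantities (illustrative only):
rbf = lambda a, c: np.exp(-0.5 * np.sum((a - c) ** 2))
support_vectors = [np.array([0.0, 1.0]), np.array([2.0, 2.0]), np.array([-1.0, -1.0])]
sv_labels = [+1, -1, +1]
sv_alphas = [0.7, 0.9, 0.4]

print(svm_predict(np.array([0.2, 0.8]), support_vectors, sv_labels, sv_alphas, rbf))
```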

SVM with an RBF Kernel

The learned decision boundary in the original space is nonlinear

Kernels: concluding notes

Kernels give a modular way to learn nonlinear patterns using linear models
All you need to do is replace the inner products with the kernel
All the computations remain as efficient as in the original space

Choice of the kernel is an important factor
Many kernels are tailor-made for specific types of data
Strings (string kernels): DNA matching, text classification, etc.
Trees (tree kernels): comparing parse trees of phrases/sentences

Kernels can even be learned from the data (a hot research topic!)
Kernel learning means learning the similarities between examples (instead of using some pre-defined notion of similarity)

A question worth thinking about: wouldn't mapping the data to a higher-dimensional space cause my classifier (say an SVM) to overfit?
The answer lies in the concepts of large margins and generalization

Next class: Intro to Probabilistic Methods for Supervised Learning

Linear Regression (probabilistic version)
Logistic Regression