Basis Expansion and Regularization


DD3364

April 1, 2012

Introduction

Main idea

Augment the vector of inputs $X$ with additional variables, which are transformations of $X$:

$$h_m(X) : \mathbb{R}^p \to \mathbb{R}, \qquad m = 1, \ldots, M.$$

Then model the relationship between $X$ and $Y$ as

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X) = \sum_{m=1}^{M} \beta_m Z_m.$$

We have a linear model w.r.t. $Z$, so we can use the same methods as before.

Which transformations?

Some examples:

- Linear: $h_m(X) = X_m$, $m = 1, \ldots, p$.
- Polynomial: $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$.
- Other nonlinear transformations: $h_m(X) = \log(X_j)$, $\sqrt{X_j}$, ..., or $h_m(X) = \|X\|$.
- Indicator functions: $h_m(X) = \mathrm{Ind}(L_m \le X_k < U_m)$ for a region of $X_k$.
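For concreteness, here is a minimal NumPy sketch (our own example, not from the slides): expand a scalar input with a few such transformations and fit by ordinary least squares, since the model is linear in the derived inputs $Z$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=100)           # one-dimensional input
y = np.sin(2 * x) + 0.1 * rng.standard_normal(100)

# Basis expansion: h_1(X)=1, h_2(X)=X, h_3(X)=X^2, h_4(X)=log(X)
Z = np.column_stack([np.ones_like(x), x, x**2, np.log(x)])

# The model is linear in Z, so ordinary least squares applies directly.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fhat = Z @ beta                                # fitted values f(x_i)
```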

Pros

- Can model more complicated decision boundaries.
- Can model more complicated regression relationships.

Cons

- Lack of locality in global basis functions. Solution: use local polynomial representations such as splines.
- There is the danger of over-fitting.

Common approaches taken:

Restriction methods
Limit the class of functions considered, e.g. use additive models

$$f(X) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j).$$

Selection methods
Adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit of the model (e.g. boosting, CART).

Regularization methods
Let

$$f(X) = \sum_{j=1}^{M} \beta_j h_j(X),$$

keep all the basis functions, but restrict the size of the coefficients, as in ridge regression and the lasso.

To obtain a piecewise polynomial function $f(X)$:

- Divide the domain of $X$ into contiguous intervals.
- Represent $f$ by a separate polynomial in each interval.

Examples

[Figure: piecewise constant (left) and piecewise linear (right) fits; the green curve is the piecewise constant/linear fit to the training data.]


Piecewise Constant

Divide $[a, b]$, the domain of $X$, into the intervals $[a, \xi_1)$, $[\xi_1, \xi_2)$, $[\xi_2, b]$ and use the indicator basis

$$h_1(X) = \mathrm{Ind}(X < \xi_1), \qquad h_2(X) = \mathrm{Ind}(\xi_1 \le X < \xi_2), \qquad h_3(X) = \mathrm{Ind}(\xi_2 \le X).$$

The least squares fit of $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$ gives $\hat\beta_m = \bar{y}_m$, the mean of the $y_i$'s in the $m$th region.
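A small numerical check of this fact (a sketch of ours; the knots and data are made up): least squares on the indicator basis returns exactly the per-region means.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 3, 90)
y = np.sin(x) + 0.1 * rng.standard_normal(90)
xi1, xi2 = 1.0, 2.0                      # the two knots

# Indicator basis h_1, h_2, h_3 for the three regions
H = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     xi2 <= x]).astype(float)

beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Least squares recovers the mean of y in each region
for m in range(3):
    assert np.isclose(beta[m], y[H[:, m] == 1].mean())
```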

Piecewise Linear

Augment the piecewise constant basis with

$$h_4(X) = X\,h_1(X), \qquad h_5(X) = X\,h_2(X), \qquad h_6(X) = X\,h_3(X).$$

The least squares fit of $f(X) = \sum_{m=1}^{6} \beta_m h_m(X)$ amounts to fitting a separate linear model to the data in each region.

[Figure 5.1: The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots $\xi_1$ and $\xi_2$. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, $h(X) = (X - \xi_1)_+$, continuous at $\xi_1$. The black points indicate the sample evaluations $h(x_i)$, $i = 1, \ldots, N$.]

Continuous Piecewise Linear

Usually we prefer $f(X)$ to be continuous at the knots $\xi_1$ and $\xi_2$. This means

$$\beta_1 + \beta_4 \xi_1 = \beta_2 + \beta_5 \xi_1, \qquad \text{and} \qquad \beta_2 + \beta_5 \xi_2 = \beta_3 + \beta_6 \xi_2.$$

This reduces the # of dof of $f(X)$ from 6 to 4.

Can incorporate the continuity constraints directly by using this basis instead:

$$h_1(X) = 1, \qquad h_2(X) = X, \qquad h_3(X) = (X - \xi_1)_+, \qquad h_4(X) = (X - \xi_2)_+,$$

where $t_+$ denotes the positive part of $t$.
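In code (again a sketch with made-up data and knots), the positive-part terms make the fitted function continuous at the knots automatically:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 3, 90))
y = np.abs(x - 1.5) + 0.1 * rng.standard_normal(90)
xi1, xi2 = 1.0, 2.0

# Basis h_1 = 1, h_2 = X, h_3 = (X - xi1)_+, h_4 = (X - xi2)_+
H = np.column_stack([np.ones_like(x), x,
                     np.clip(x - xi1, 0, None),
                     np.clip(x - xi2, 0, None)])

beta, *_ = np.linalg.lstsq(H, y, rcond=None)
fhat = H @ beta     # continuous piecewise linear, 4 degrees of freedom
```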

Smoother f(X)

Can achieve a smoother $f(X)$ by increasing the order of the local polynomials, and by increasing the order of the continuity at the knots.

Piecewise Cubic Polynomials

[Figure: piecewise cubic polynomials fit to the same data, with increasing orders of continuity at the two knots: discontinuous, continuous, and continuous with first and second derivative continuity at the knots.]

The fit in the last panel is continuous and has continuous first and second derivatives at the knots. It is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial.

Cubic Spline

It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at $\xi_1$ and $\xi_2$:

$$h_1(X) = 1, \qquad h_3(X) = X^2, \qquad h_5(X) = (X - \xi_1)^3_+,$$
$$h_2(X) = X, \qquad h_4(X) = X^3, \qquad h_6(X) = (X - \xi_2)^3_+.$$

Order-M Spline

An order-$M$ spline with knots $\xi_1, \ldots, \xi_K$ is a piecewise polynomial of order $M$ and has continuous derivatives up to order $M - 2$. A basis (the truncated power basis):

$$h_j(X) = X^{j-1}, \qquad j = 1, \ldots, M,$$
$$h_{M+l}(X) = (X - \xi_l)^{M-1}_+, \qquad l = 1, \ldots, K.$$
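A generic builder for this truncated power basis might look as follows (a sketch; for $M = 4$ it reproduces the cubic spline basis above):

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Basis for an order-M spline: X^0..X^{M-1} and (X - xi_l)_+^{M-1}."""
    cols = [x ** j for j in range(M)]                      # h_j, j = 1..M
    cols += [np.clip(x - k, 0, None) ** (M - 1) for k in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 11)
H = truncated_power_basis(x, knots=[0.3, 0.7], M=4)       # cubic spline
print(H.shape)    # (11, 6): M + K = 4 + 2 basis functions
```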

Regression Splines

Fixed-knot splines are known as regression splines. For a regression spline one needs to select:

- the order of the spline,
- the number of knots, and
- the placement of the knots.

There are many equivalent bases for representing splines, and the B-spline basis is computationally attractive.

Problem

The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines

Have the additional constraints that the function is linear beyond the boundary knots. Near the boundaries one has reduced the variance of the fit, at the price of some bias.

Smoothing Splines

Avoid the knot selection problem by using a maximal set of knots. The complexity of the fit is controlled by regularization.

Consider the following problem: find the function $f(x)$ with continuous second derivative which minimizes

$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 \, dt,$$

where $\lambda$ is the smoothing parameter, the first term measures closeness to the data, and the second term penalizes curvature.

- $\lambda = 0$: $f$ is any function which interpolates the data.
- $\lambda = \infty$: $f$ is the simple least squares line fit.
- The hope is that $\lambda \in (0, \infty)$ indexes an interesting class of functions in between.

Remarkably, the minimizer of $\mathrm{RSS}(f, \lambda)$ is a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \ldots, n$. That is,

$$\hat{f}(x) = \sum_{j=1}^{n} N_j(x)\, \theta_j,$$

where the $N_j(x)$ are an $n$-dimensional set of basis functions for representing this family of natural splines.

The criterion to be optimized thus reduces to

$$\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^T (y - N\theta) + \lambda\, \theta^T \Omega_N\, \theta,$$

where

$$\{N\}_{ij} = N_j(x_i), \qquad \{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt, \qquad y = (y_1, y_2, \ldots, y_n)^T.$$

Its solution is given by

$$\hat\theta = (N^T N + \lambda\, \Omega_N)^{-1} N^T y.$$

The fitted smoothing spline is then given by

$$\hat{f}(x) = \sum_{j=1}^{n} N_j(x)\, \hat\theta_j.$$

Assume that $\lambda$ has been set. Remember the estimated coefficients are a linear combination of the $y_i$'s:

$$\hat\theta = (N^T N + \lambda\, \Omega_N)^{-1} N^T y.$$

Let $\hat{f}$ be the $n$-vector of the fitted values $\hat{f}(x_i)$; then

$$\hat{f} = N \hat\theta = N (N^T N + \lambda\, \Omega_N)^{-1} N^T y = S_\lambda\, y,$$

where $S_\lambda = N (N^T N + \lambda\, \Omega_N)^{-1} N^T$ is the smoother matrix.
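As a numerical sketch of these formulas we can stand in a simple basis and penalty: a truncated-power cubic basis for $N$ and a squared-second-difference matrix for $\Omega_N$ (both are our assumptions, not the exact natural-spline quantities), which is enough to show how $\hat\theta$, $S_\lambda$ and $\mathrm{trace}(S_\lambda)$ are computed.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(4 * x) + 0.2 * rng.standard_normal(60)

# Stand-in basis: truncated-power cubic basis with knots at every 5th x
# (hypothetical choice; the slides use a natural-spline basis N_j).
knots = x[::5][1:-1]
N = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [np.clip(x - k, 0, None)**3 for k in knots])

# Stand-in penalty: squared second differences of the coefficients
# (an assumption; the true Omega_N integrates N_j'' N_k'').
D = np.diff(np.eye(N.shape[1]), n=2, axis=0)
Omega = D.T @ D

lam = 1e-3
theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)   # smoother matrix
df = np.trace(S)                                       # effective dof
```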

Properties of $S_\lambda$

- $S_\lambda$ is symmetric and positive semi-definite.
- $S_\lambda S_\lambda \preceq S_\lambda$, i.e. it is a shrinking smoother.
- $S_\lambda$ has rank $n$.
- The book defines the effective degrees of freedom of a smoothing spline to be $\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$.

[Figure 5.6: The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with $\lambda \approx 0.00022$. This choice corresponds to about 12 degrees of freedom.]

Let $N = U S V^T$ be the SVD of $N$. Using this decomposition it is straightforward to re-write

$$S_\lambda = N (N^T N + \lambda\, \Omega_N)^{-1} N^T$$

as

$$S_\lambda = (I + \lambda K)^{-1}, \qquad \text{where} \qquad K = U S^{-1} V^T\, \Omega_N\, V S^{-1} U^T.$$

It is also easy to show that $\hat{f} = S_\lambda\, y$ is the solution to the optimization problem

$$\min_f\; (y - f)^T (y - f) + \lambda\, f^T K f.$$

The eigen-decomposition of $S_\lambda$

Let $K = P D P^{-1}$ be the real eigen-decomposition of $K$. Then

$$S_\lambda = (I + \lambda K)^{-1} = (I + \lambda P D P^{-1})^{-1} = (P P^{-1} + \lambda P D P^{-1})^{-1} = \left(P (I + \lambda D) P^{-1}\right)^{-1} = P (I + \lambda D)^{-1} P^{-1} = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}\, p_k p_k^T,$$

where the $d_k$ and $p_k$ are the eigenvalues and eigenvectors of $K$. The $p_k$ are also the eigenvectors of $S_\lambda$, and the $1/(1 + \lambda d_k)$ are its eigenvalues.
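A quick numerical confirmation, with a random symmetric positive semi-definite stand-in for $K$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
K = A @ A.T                     # a symmetric PSD stand-in for K
lam = 0.5

d, P = np.linalg.eigh(K)        # K = P diag(d) P^T
S = np.linalg.inv(np.eye(6) + lam * K)

# Eigenvalues of S_lambda are 1/(1 + lam * d_k), eigenvectors the p_k
assert np.allclose(np.sort(np.linalg.eigvalsh(S)),
                   np.sort(1.0 / (1.0 + lam * d)))
```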

Example: Eigenvalues of $S_\lambda$

[Figure 5.7, top: smoothing spline fits of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by $\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$. Lower left: the first 25 eigenvalues of the two smoother matrices; the green curve shows the eigenvalues of $S_\lambda$ with df = 11, the red curve with df = 5. The first two eigenvalues are exactly 1, and all are $\geq 0$.]

Example: Eigenvectors of $S_\lambda$

[Figure 5.7, lower right: each blue curve is an eigenvector of $S_\lambda$ plotted against $x$; the top left curve has the highest eigenvalue, the bottom right the smallest. Each red curve is the corresponding eigenvector damped by its eigenvalue $1/(1 + \lambda d_k)$.]

The eigenvectors of $S_\lambda$ do not depend on $\lambda$. The smoothing spline decomposes $y$ w.r.t. the basis $\{p_k\}$ and differentially shrinks each contribution:

$$S_\lambda\, y = \sum_{k=1}^{n} \frac{p_k (p_k^T y)}{1 + \lambda d_k}, \qquad \mathrm{df}_\lambda = \mathrm{trace}(S_\lambda) = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}.$$

Visualization of $S_\lambda$: Equivalent Kernels

[Figure 5.8: The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel shows the smoother matrix; the other panels show its rows 12, 25, 50, 75, 100 and 115.]

Choosing $\lambda$?

This is a crucial and tricky problem. We will deal with this problem in Chapter 7, when we consider model assessment and selection.

Previously we considered a binary classifier s.t.

$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = \beta_0 + \beta^T x.$$

Now instead consider

$$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = f(x), \qquad \text{that is} \qquad P(Y=1 \mid X=x) = \frac{e^{f(x)}}{1 + e^{f(x)}},$$

where $f(x)$ is a smooth function; this yields a smooth estimate of $P(Y=1 \mid X=x)$.

Construct the penalized log-likelihood criterion

$$\ell(f; \lambda) = \sum_{i=1}^{n} \left[ y_i \log P(Y=1 \mid x_i) + (1 - y_i) \log(1 - P(Y=1 \mid x_i)) \right] - \tfrac{1}{2} \lambda \int (f''(t))^2 \, dt$$

$$= \sum_{i=1}^{n} \left[ y_i f(x_i) - \log(1 + e^{f(x_i)}) \right] - \tfrac{1}{2} \lambda \int (f''(t))^2 \, dt.$$
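One hedged sketch of fitting this by gradient ascent, expanding $f = N\theta$ in a stand-in polynomial basis with a stand-in coefficient penalty (the slides do not fix either choice here):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-2, 2, 80))
y = (rng.uniform(size=80) < 1 / (1 + np.exp(-3 * x))).astype(float)

# Stand-in expansion f = N @ theta and second-difference penalty Omega
N = np.column_stack([np.ones_like(x), x, x**2, x**3])
D = np.diff(np.eye(N.shape[1]), n=2, axis=0)
Omega, lam = D.T @ D, 0.1

theta = np.zeros(N.shape[1])
for _ in range(2000):                    # gradient ascent on l(f; lambda)
    p = 1 / (1 + np.exp(-N @ theta))     # P(Y = 1 | x_i)
    grad = N.T @ (y - p) - lam * Omega @ theta
    theta += 0.005 * grad                # small fixed step size
```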

Hilbert Spaces

There is a class of generalization problems which have the form

$$\min_{f \in H} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\, J(f) \right]$$

where

- $L(y_i, f(x_i))$ is a loss function,
- $J(f)$ is a penalty functional,
- $H$ is a space of functions on which $J(f)$ is defined.

An important subclass of these problems is generated by a positive definite kernel $K(x, y)$ and the corresponding space of functions $H_K$, called a reproducing kernel Hilbert space, in which $J(f)$ is defined in terms of the kernel as well.

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels

Definition
A kernel is a mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

These three types of kernels are equivalent:

- dot-product kernel
- positive semi-definite kernel
- Mercer kernel

Dot-product kernel

Definition
A mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a dot-product kernel if and only if

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

where $\Phi : \mathcal{X} \to H$, $H$ is a vector space and $\langle \cdot, \cdot \rangle$ is an inner-product on $H$.

Definition
A mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive semi-definite kernel on $\mathcal{X} \times \mathcal{X}$ if, $\forall m \in \mathbb{N}$ and $\forall x_1, \ldots, x_m$ with each $x_i \in \mathcal{X}$, the Gram matrix

$$K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$$

is positive semi-definite.
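A small check of the definition with a Gaussian kernel (our example):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / sigma)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma)

rng = np.random.default_rng(3)
xs = rng.standard_normal((8, 2))          # m = 8 points in R^2

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(a, b) for b in xs] for a in xs])

# Positive semi-definite: all eigenvalues >= 0 (up to round-off)
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```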

Mercer kernel

Definition
A symmetric mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$\int \int k(x, y)\, f(x)\, f(y) \, dx \, dy \;\geq\; 0$$

for all functions $f$ s.t. $\int f(x)^2 \, dx < \infty$ is a Mercer kernel.

These different definitions lead to different interpretations of what the kernel does.

Interpretation I: reproducing kernel map

$$H_k = \left\{ f(\cdot) \;\middle|\; f(\cdot) = \sum_{i} \alpha_i\, k(\cdot, x_i) \right\}, \qquad \langle f, g \rangle = \sum_{i} \sum_{j} \alpha_i\, \alpha'_j\, k(x_i, x'_j), \qquad \Phi : x \mapsto k(\cdot, x).$$

Interpretation II: Mercer kernel map

$$H_M = \ell_2 = \left\{ x \;\middle|\; \sum_i x_i^2 < \infty \right\}, \qquad \langle f, g \rangle = f^T g, \qquad \Phi : x \mapsto \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T,$$

where the $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions of $k(x, y)$ with $\lambda_i > 0$, and $\ell_2$ is the space of sequences $(a_i)$ s.t. $\sum_i a_i^2 < \infty$.

When a Gaussian kernel $k(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma)$ is used, the point $x_i \in \mathcal{X}$ is mapped into the Gaussian $G(\cdot, x_i, \sigma I)$, and $H_k$ is the space of all functions that are linear combinations of Gaussians on $\mathcal{X}$.

With the definition of $H_k$ and $\langle \cdot, \cdot \rangle$ one has the reproducing property

$$\langle k(\cdot, x), f \rangle = f(x) \qquad \forall f \in H_k.$$

This leads to the reproducing kernel Hilbert spaces.

Definition
A Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences).

Definition
Let $H$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. $H$ is a reproducing kernel Hilbert space (RKHS) with inner-product $\langle \cdot, \cdot \rangle$ if there exists a

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

s.t.

- $k(\cdot, \cdot)$ spans $H$, that is $H = \{ f(\cdot) \mid f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i) \text{ for } \alpha_i \in \mathbb{R} \text{ and } x_i \in \mathcal{X} \}$;
- $k$ has the reproducing property: $\langle k(\cdot, x), f \rangle = f(x)$ for all $f \in H$.

Theorem (Mercer)
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel. Then there exists an orthonormal set of functions,

$$\int \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij},$$

and a set of $\lambda_i \geq 0$ such that

$$\sum_{i=1}^{\infty} \lambda_i^2 = \int \int k^2(x, y)\, dx\, dy < \infty \qquad \text{and} \qquad k(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y).$$
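A numerical illustration of this (our construction, not from the slides): discretizing the integral operator of a Gaussian kernel on a grid, the eigen-decomposition approximates the $\lambda_i$ and $\phi_i$, and $\sum_i \lambda_i \phi_i(x)\phi_i(y)$ reconstructs $k(x, y)$.

```python
import numpy as np

# Grid discretization of the integral operator (Tf)(x) = ∫ k(x, y) f(y) dy
n, (a, b) = 200, (0.0, 1.0)
x = np.linspace(a, b, n)
dx = (b - a) / n
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gaussian kernel matrix

# Eigenpairs of K * dx approximate the Mercer eigenvalues / eigenfunctions
lam, U = np.linalg.eigh(K * dx)
phi = U / np.sqrt(dx)          # normalize so that sum(phi_i^2) * dx = 1

# k(x, y) ≈ sum_i lam_i phi_i(x) phi_i(y) on the grid
K_rebuilt = (phi * lam) @ phi.T
print(np.allclose(K, K_rebuilt, atol=1e-8))
```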

This eigen-decomposition gives another way to design the feature transformation induced by the kernel $k(\cdot, \cdot)$. Let $\Phi : \mathcal{X} \to \ell_2$ be defined by

$$\Phi(x) = \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T,$$

where $\ell_2$ is the space of square-summable sequences. Clearly

$$\langle \Phi(x), \Phi(y) \rangle = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\,\phi_i(x)\, \sqrt{\lambda_i}\,\phi_i(y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y) = k(x, y).$$

Issues

Therefore there is a vector space $\ell_2$, other than $H_k$, such that $k(x, y)$ is a dot product in that space. We have two very different interpretations of what the kernel does:

1. the reproducing kernel map $\Phi_r : x \mapsto k(\cdot, x)$;
2. the Mercer kernel map $\Phi(x) = \sum_i \sqrt{\lambda_i}\,\phi_i(x)\, e_i$ into $\ell_2$.

Define $\Psi : \ell_2 \to \mathrm{span}\{\phi_k\}$ by $\Psi(e_k) = \sqrt{\lambda_k}\,\phi_k(\cdot)$. Then we can write

$$(\Psi \circ \Phi)(x) = \sum_i \sqrt{\lambda_i}\,\phi_i(x)\,\sqrt{\lambda_i}\,\phi_i(\cdot) = k(\cdot, x),$$

so we have:

[Figure: points $x_i \in \mathcal{X}$ are mapped by $\Phi$ into $\ell_2$ (axes $e_1, \ldots, e_d$), and then by $\Psi$ into the functions $(\Psi \circ \Phi)(x_i) = k(\cdot, x_i)$.]

Mercer map

Define the inner-product in $M$ as

$$\langle f, g \rangle_M = \int f(x)\, g(x)\, dx.$$

Note we will normalize the eigenfunctions $\phi_l$ such that

$$\int \phi_l(x)\, \phi_k(x)\, dx = \frac{\delta_{lk}}{\lambda_l}.$$

Any function $f \in M$ can be written as $f(x) = \sum_{k=1}^{\infty} \gamma_k\, \phi_k(x)$; then

$$\int f(x)\, k(x, y)\, dx = \int \sum_{k=1}^{\infty} \gamma_k\, \phi_k(x) \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x)\, \phi_l(y)\, dx = \sum_{k,l} \gamma_k\, \lambda_l\, \phi_l(y) \int \phi_k(x)\, \phi_l(x)\, dx = \sum_{l=1}^{\infty} \gamma_l\, \lambda_l\, \frac{1}{\lambda_l}\, \phi_l(y) = f(y).$$

So $k$ is a reproducing kernel on $M$.

We want to check if the space $M = H_k$:

1. Show $H_k \subseteq M$.
2. Show $M \subseteq H_k$.
3. Show that the two inner-products agree.

$H_k \subseteq M$

If $f \in H_k$ then there exist $m \in \mathbb{N}$, $\{\alpha_i\}$ and $\{x_i\}$ such that

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i) = \sum_{i=1}^{m} \alpha_i \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x_i)\, \phi_l(\cdot) = \sum_{l=1}^{\infty} \left( \lambda_l \sum_{i=1}^{m} \alpha_i\, \phi_l(x_i) \right) \phi_l(\cdot) = \sum_{l=1}^{\infty} \gamma_l\, \phi_l(\cdot).$$

This shows that if $f \in H_k$ then $f \in M$, and therefore $H_k \subseteq M$.

Let $f, g \in H_k$ with

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m} \beta_j\, k(\cdot, y_j).$$

Then by definition

$$\langle f, g \rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j\, k(x_i, y_j),$$

while

$$\langle f, g \rangle_M = \int f(x)\, g(x)\, dx = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j \int k(x, x_i)\, k(x, y_j)\, dx = \sum_{i,j} \alpha_i\, \beta_j \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x_i)\, \phi_l(y_j) = \sum_{i,j} \alpha_i\, \beta_j\, k(x_i, y_j) = \langle f, g \rangle.$$

Hence $\langle f, g \rangle_M = \langle f, g \rangle$.

$M \subseteq H_k$

Can also show that if $f \in M$ then also $f \in H_k$. We will not prove that here, but it implies $M \subseteq H_k$.

Summary

The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer also gives us an orthonormal basis.

Interpretation I: reproducing kernel map

$$H_k = \left\{ f(\cdot) \;\middle|\; f(\cdot) = \sum_{i} \alpha_i\, k(\cdot, x_i) \right\}, \qquad \langle f, g \rangle = \sum_{i} \sum_{j} \alpha_i\, \alpha'_j\, k(x_i, x'_j), \qquad \Phi_r : x \mapsto k(\cdot, x).$$

Interpretation II: Mercer kernel map

$$H_M = \ell_2 = \left\{ x \;\middle|\; \sum_i x_i^2 < \infty \right\}, \qquad \langle f, g \rangle = f^T g, \qquad \Phi_M : x \mapsto \left(\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots\right)^T,$$

with $\Psi : \ell_2 \to \mathrm{span}\{\phi_k(\cdot)\}$ and $\Psi \circ \Phi_M = \Phi_r$.

Back to Regularization

We want to solve

$$\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\, \|f\|^2 \right].$$

Intuition: wigglier functions have larger norm than smoother functions. For $f \in H_k$ we have

$$f(x) = \sum_i \alpha_i\, k(x, x_i) = \sum_i \alpha_i \sum_l \lambda_l\, \phi_l(x)\, \phi_l(x_i) = \sum_l c_l\, \phi_l(x), \qquad \text{with } c_l = \lambda_l \sum_i \alpha_i\, \phi_l(x_i),$$

and therefore

$$\|f\|^2 = \sum_{l,k} c_l\, c_k\, \langle \phi_l, \phi_k \rangle_M = \sum_{l,k} c_l\, c_k\, \frac{\delta_{lk}}{\lambda_l} = \sum_l \frac{c_l^2}{\lambda_l}.$$

Hence functions with large eigenvalues get penalized less, and vice versa; more coefficients means more high frequencies, i.e. less smoothness.

Representer Theorem

Theorem
Let

- $\Omega : [0, \infty) \to \mathbb{R}$ be a strictly monotonically increasing function,
- $H_k$ be the RKHS associated with a kernel $k(x, y)$,
- $L(y, f(x))$ be a loss function.

Then

$$\hat{f} = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right]$$

admits a representation of the form

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i).$$

Relevance

The remarkable consequence of the theorem is that the optimization reduces to a search over a finite-dimensional space. This is because, as $\hat{f} = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$, then

$$\|\hat{f}\|^2 = \langle \hat{f}, \hat{f} \rangle = \sum_{ij} \alpha_i\, \alpha_j\, \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = \sum_{ij} \alpha_i\, \alpha_j\, k(x_i, x_j) = \alpha^T K \alpha,$$

and

$$\hat{f}(x_i) = \sum_j \alpha_j\, k(x_i, x_j) = (K\alpha)_i.$$

Representer Theorem

Theorem
Under the same assumptions,

$$\hat{f} = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right] \qquad \Longrightarrow \qquad \hat{f}(x) = \sum_{i=1}^{n} \hat\alpha_i\, k(x, x_i),$$

where

$$\hat\alpha = \arg\min_{\alpha} \left[ \sum_{i=1}^{n} L(y_i, (K\alpha)_i) + \Omega(\alpha^T K \alpha) \right].$$
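For squared-error loss and $\Omega(t) = \lambda t$ this finite-dimensional problem is kernel ridge regression, with the closed form $\hat\alpha = (K + \lambda I)^{-1} y$; a minimal sketch (our example, not from the slides):

```python
import numpy as np

def k(a, b, sigma=0.5):
    """Gaussian kernel."""
    return np.exp(-(a - b) ** 2 / sigma)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 3, 50))
y = np.sin(2 * x) + 0.1 * rng.standard_normal(50)

K = k(x[:, None], x[None, :])                           # Gram matrix
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)    # alpha-hat

# Predict at new points: f(x*) = sum_i alpha_i k(x*, x_i)
x_new = np.linspace(0, 3, 7)
f_new = k(x_new[:, None], x[None, :]) @ alpha
```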

Finding the optimal separating hyperplane

When given linearly separable data $\{(x_i, y_i)\}$, the optimal separating hyperplane is found by solving

$$\min_{\beta_0, \beta} \|\beta\|^2 \qquad \text{subject to} \qquad y_i(\beta_0 + \beta^T x_i) \geq 1 \;\; \forall i.$$

The constraints are equivalent to

$$\max(0,\, 1 - y_i(\beta_0 + \beta^T x_i)) = (1 - y_i(\beta_0 + \beta^T x_i))_+ = 0 \;\; \forall i.$$

Hence we can re-write the optimization problem as

$$\min_{\beta_0, \beta} \left[ \sum_{i=1}^{n} (1 - y_i(\beta_0 + \beta^T x_i))_+ + \lambda \|\beta\|^2 \right].$$

This has the form

$$\min_f \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right],$$

where $L(y, f(x)) = (1 - y f(x))_+$ is the hinge loss and $\Omega(\|f\|^2) = \lambda \|f\|^2$. From the Representer Theorem we know the solution to the latter problem is

$$\hat{f}(x) = \sum_{i=1}^{n} \alpha_i\, x_i^T x,$$

and therefore $\|f\|^2 = \alpha^T K \alpha$. This is the same form as the solution found via the KKT conditions,

$$\hat\beta = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i.$$
