Linear Methods for Classification


DD3364

Introduction

Want to learn a predictor $G : \mathbb{R}^p \to \mathcal{G} = \{1, \dots, K\}$.

$G$ divides the input space into regions labelled according to their classification. The boundaries between these regions are the decision boundaries.


Learn a discriminant function $\delta_k(x)$ for each class $k$ and set $G(x) = \arg\max_k \delta_k(x)$. A linear discriminant function has the form

$\delta_k(x) = \beta_{k0} + \beta_k^t x$


Example 1: Fit a linear regression model to the class indicator responses and take

$\delta_k(x) = \hat\beta_{k0} + \hat\beta_k^t x$

A logistic model for two classes instead sets the posteriors to

$P(G = 1 \mid X = x) = \dfrac{\exp(\beta_0 + \beta^t x)}{1 + \exp(\beta_0 + \beta^t x)}, \qquad P(G = 2 \mid X = x) = \dfrac{1}{1 + \exp(\beta_0 + \beta^t x)}$

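As a sketch of how the two logistic posteriors behave (my own illustration; the coefficient values are arbitrary, not from the lecture):

```python
import numpy as np

def logistic_posteriors(x, beta0, beta):
    """Two-class logistic model: returns (P(G=1|x), P(G=2|x))."""
    z = np.exp(beta0 + beta @ x)
    return z / (1.0 + z), 1.0 / (1.0 + z)

# Illustrative (made-up) parameters for p = 2 inputs
x = np.array([1.0, -0.5])
p1, p2 = logistic_posteriors(x, beta0=0.2, beta=np.array([1.5, -2.0]))
assert abs(p1 + p2 - 1.0) < 1e-12   # the two posteriors always sum to one
```

Note both expressions share the single linear function $\beta_0 + \beta^t x$; the decision boundary $P(G=1 \mid x) = P(G=2 \mid x)$ is the set where that linear function is zero, hence linear.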

For a two class problem with p-dimensional inputs this corresponds to modelling the decision boundary directly as a hyperplane. Two classical examples are

- the perceptron model and algorithm of Rosenblatt,
- the separating hyperplane (SVM) model and algorithm of Vapnik.

In the forms quoted both these algorithms find separating hyperplanes if they exist, and fail if the points are not linearly separable. There are fixes for the non-separable case but we will not consider them here.

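A minimal perceptron-style sketch (my illustration; the lecture quotes the algorithms only by name). It finds a separating hyperplane when one exists, and keeps misclassifying if the data are not linearly separable, matching the caveat above:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt-style perceptron for labels y in {-1, +1}.
    Returns (w, b) of a separating hyperplane if found within max_epochs."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # misclassified point
                w += y[i] * X[i]             # move the hyperplane toward it
                b += y[i]
                errors += 1
        if errors == 0:                      # separating hyperplane found
            return w, b
    return w, b                              # may still misclassify if not separable

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
assert all(y[i] * (X[i] @ w + b) > 0 for i in range(len(y)))
```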

Can expand the variable set $X_1, X_2, \dots, X_p$ by including their squares and cross-products. Linear decision boundaries in the augmented space correspond to quadratic decision boundaries in the original space.

[Figure 4.1 (§4.2, Linear Regression of an Indicator Matrix): the left plot shows some data from three classes, plotted as class labels 1–3, with linear decision boundaries.]

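The expansion of the variable set mentioned above can be sketched in a few lines (my illustration): a linear function of the expanded features is quadratic in the original inputs.

```python
import numpy as np

def quadratic_expand(X):
    """Augment (X1, X2) with squares and cross-product: (X1, X2, X1*X2, X1^2, X2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

X = np.array([[1.0, 2.0], [3.0, -1.0]])
Z = quadratic_expand(X)
# A linear boundary w.z + b = 0 in Z-space is a quadratic curve in (x1, x2)
assert Z.shape == (2, 5)
```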

Have training data $\{(x_i, g_i)\}_{i=1}^n$ where each $x_i \in \mathbb{R}^p$ and $g_i \in \{1, \dots, K\}$. For each class $k$:

1. For $i = 1, \dots, n$ set

$y_i = \begin{cases} 0 & \text{if } g_i \neq k \\ 1 & \text{if } g_i = k \end{cases}$

2. Compute $(\hat\beta_{0k}, \hat\beta_k) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n (y_i - \beta_0 - \beta^t x_i)^2$

Define $\hat\delta_k(x) = \hat\beta_{0k} + \hat\beta_k^t x$. Classify a new point $x$ with $\hat G(x) = \arg\max_k \hat\delta_k(x)$.

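A hedged NumPy sketch of this procedure (not the course's reference code; classes are indexed 0..K−1 here, and the per-class least-squares problems are solved jointly with `lstsq`):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress each class's 0/1 indicator on (1, x); returns Beta of shape (p+1, K)."""
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])           # prepend intercept column
    Y = np.zeros((n, K))
    Y[np.arange(n), g] = 1.0                      # indicator responses
    Beta, *_ = np.linalg.lstsq(A, Y, rcond=None)  # column k holds (beta_0k, beta_k)
    return Beta

def predict(X, Beta):
    A = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(A @ Beta, axis=1)            # G(x) = argmax_k delta_k(x)

# Toy two-class data on the line
X = np.array([[0.0], [1.0], [4.0], [5.0]])
g = np.array([0, 0, 1, 1])
Beta = fit_indicator_regression(X, g, K=2)
assert (predict(X, Beta) == g).all()
```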

3 class example: for each $k$, construct the response vectors from the class labels and fit the discriminant functions for the above 3 classes.

[Plots: the training data from 3 classes, and the fitted discriminant functions $\hat\delta_1(x)$, $\hat\delta_2(x)$, $\hat\delta_3(x)$.]

In this last example masking has occurred. This occurs because of the rigid nature of the linear discriminant functions.


To perform optimal classification we need to know $P(G \mid X)$. Let $f_k(x)$ represent the class-conditional density $P(X \mid G = k)$ and $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^K \pi_k = 1$. Then by Bayes' theorem

$P(G = k \mid X = x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^K f_l(x)\,\pi_l}$

Knowing the $f_k(x)$ and $\pi_k$ is equivalent to having $P(G = k \mid X = x)$.

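A tiny numeric sketch of the Bayes formula (my own illustration with one-dimensional Gaussian class-conditionals and made-up parameters):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x, mus, sigmas, priors):
    """P(G = k | X = x) = f_k(x) pi_k / sum_l f_l(x) pi_l."""
    f = np.array([normal_pdf(x, m, s) for m, s in zip(mus, sigmas)])
    w = f * priors
    return w / w.sum()

# Two illustrative classes, symmetric about x = 0
post = posterior(0.0, mus=[-1.0, 1.0], sigmas=[1.0, 1.0], priors=np.array([0.5, 0.5]))
assert abs(post.sum() - 1.0) < 1e-12
assert abs(post[0] - 0.5) < 1e-12  # symmetric setup: x = 0 equally likely from either class
```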

Many methods are based on specific models of $f_k(x)$:

- linear and quadratic discriminant functions use Gaussian distributions,
- more flexible mixtures of Gaussians allow for nonlinear decision boundaries,
- general nonparametric density estimates allow the most flexibility,
- naive Bayes models assume each class density is a product of marginals: $f_k(X) = \prod_{j=1}^p f_{kj}(X_j)$.


Model each $f_k(x)$ as a multivariate Gaussian

$f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k) \right\}$

Linear Discriminant Analysis (LDA) arises in the special case when $\Sigma_k = \Sigma$ for all $k$; one then gets linear decision boundaries. (Similar discriminant functions were derived earlier, where each class-conditional was normally distributed with equal covariance matrices.)

[Figure: class distributions, decision boundary, partition.]


LDA

Can see this as

$\log \dfrac{P(G = k \mid X = x)}{P(G = l \mid X = x)} = \log \dfrac{f_k(x)}{f_l(x)} + \log \dfrac{\pi_k}{\pi_l}$

$= \log \dfrac{\pi_k}{\pi_l} - \tfrac{1}{2}\mu_k^t \Sigma^{-1} \mu_k + \tfrac{1}{2}\mu_l^t \Sigma^{-1} \mu_l + x^t \Sigma^{-1}(\mu_k - \mu_l)$

$= x^t a + b$

a linear function of $x$. The equal covariance matrices allow the $x^t \Sigma_k^{-1} x$ and $x^t \Sigma_l^{-1} x$ terms to cancel. The linear discriminant functions

$\delta_k(x) = x^t \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^t \Sigma^{-1} \mu_k + \log \pi_k$

are an equivalent description of the decision rule, with

$G(x) = \arg\max_k \delta_k(x)$


In practice we don't know the parameters of the Gaussian distributions and estimate these from the training data. Let $n_k$ be the number of class $k$ observations; then

$\hat\pi_k = n_k / n$

$\hat\mu_k = \sum_{g_i = k} x_i / n_k$

$\hat\Sigma = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^t / (n - K)$

[Figure (§4.3, Linear Discriminant Analysis): three-class scatter data with LDA decision boundaries.]
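Putting the estimates and the discriminant functions together, a sketch (my NumPy transcription, not the course's reference code):

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate pi_k, mu_k and the pooled covariance Sigma from training data."""
    n, p = X.shape
    pis = np.array([np.mean(g == k) for k in range(K)])
    mus = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - mus[k]
        Sigma += D.T @ D
    Sigma /= (n - K)                               # pooled within-class covariance
    return pis, mus, Sigma

def lda_predict(X, pis, mus, Sigma):
    """delta_k(x) = x' Sigma^-1 mu_k - 0.5 mu_k' Sigma^-1 mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    deltas = X @ Sinv @ mus.T - 0.5 * np.sum(mus @ Sinv * mus, axis=1) + np.log(pis)
    return np.argmax(deltas, axis=1)

# Two well-separated synthetic Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 1, (50, 2)), rng.normal([2, 0], 1, (50, 2))])
g = np.repeat([0, 1], 50)
pis, mus, Sigma = lda_fit(X, g, K=2)
assert (lda_predict(X, pis, mus, Sigma) == g).mean() > 0.9
```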

Bivariate example (QDA). When the $\Sigma_k$ are not assumed equal, quadratic discriminant analysis (QDA) uses the discriminant functions

$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$

Have a two class problem with given means $\mu_1, \mu_2$, covariances $\Sigma_1, \Sigma_2$ and priors.

[Figure: class distributions, decision boundaries, partition.]

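The QDA rule differs from LDA only in keeping a per-class covariance and the $\log|\Sigma_k|$ term. A sketch (illustrative parameters of my choosing, not the slide's numeric example):

```python
import numpy as np

def qda_delta(x, mu, Sigma, pi):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)' Sigma_k^-1 (x-mu_k) + log pi_k."""
    d = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d) + np.log(pi)

# Two illustrative classes with different covariances
mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([0.2, -0.1])
d1 = qda_delta(x, mu1, S1, 0.5)
d2 = qda_delta(x, mu2, S2, 0.5)
assert d1 > d2   # a point near mu1 is assigned to class 1
```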

Left plot shows the quadratic decision boundaries found using LDA in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$.

1

1

2

1

3

22 2

1

2 22

3

2 22

33 3 3 3

22 2

2

1

2

2

33

3 3

2 2 22 2 22 2

3 3

3

2 222 2

1

22

2

22

3

22

1

2

2

2

3 3

2 2

22

22 2

22 22222 2 2

3 3 33

2222 2222 2222

33

2 2222 2 2 2 2222222 21

3 33 333

222 2 2

2

3 333 3333 3 3

222 222 22 2 2 2222 2 2

333 33

2 2 2222 22 22 22 22222222 22 2

13

222 222 22

3 333 3 3

1 2 22222 2122

1

2

3 3

33333

2122222122

1

2

2

1

1 3 33

2

1 1

3 3

2 122122 22

1

22

1

2 1 1222 1

11 1 33

3 3 33

22 2 2 2

1

1

11

2

1

1

333 333

1

3

1

1

2

11 1 11

2211

3 3

2

1

2121 2

1 111

3 3 3333

11 1

1

112

1 121 11 1

222

1 1 1 1 1 11 1 13313333333 33 3

211 2 1

3

3

1

3

2 111

1

3

1

1

3

1

1

1

33333333

1

1

111 1 1

11

1 11 33333

333

1

1 1 1

33333

1 1 11 1 11 1111 111 1 1 1111 1111 1

11 313 1

111 1 1

333333

3

1 11 1

1

3

3

3

3

1

3

3

1 3133 3 3

1

1

1 11

1 33333333333

11 1 1 1 1 111111 11 111 1 1 1 1

1 11 1

1

33

3 3 333333

11 3

1

1

1 1

33333 3

11

1

3

1 1 11

3 33

1

1

3

333 3

3

3 3

2

1

1

2

1

3

22 2

1

2 22

3

2 22

33 3 3 3

22 2

2

1

2

2

33

3 3

2 2 22 2 22 2

3 3

3

2 222 2

1

22

2

22

3

22

1

2

2

2

3 3

2 2

22

22 2

22 22222 2 2

3 3 33

2222 2222 2222

33

2 2222 2 2 2 2222222 21

3 33 333

222 2 2

2

3 333 3333 3 3

222 222 22 2 2 2222 2 2

333 33

2 2 2222 22 22 22 22222222 22 2

13

222 222 22

3 333 3 3

1 2 22222 2122

1

2

3 3

33333

2122222122

1

2

2

1

1 3 33

2

1 1

3 3

2 122122 22

1

22

1

2 1 1222 1

11 1 33

3 3 33

22 2 2 2

1

1

11

2

1

1

333 333

1

3

1

1

2

11 1 11

2211

3 3

2

1

2121 2

1 111

3 3 3333

11 1

1

112

1 121 11 1

222

1 1 1 1 1 11 1 13313333333 33 3

211 2 1

3

3

1

3

2 111

1

3

1

1

3

1

1

1

33333333

1

1

111 1 1

11

1 11 33333

333

1

1 1 1

33333

1 1 11 1 11 1111 111 1 1 1111 1111 1

11 31 1

111 1 1

333333

33333

1 11 1

1

3

3

1

3

1 3133 3 3

1

1

1 11

1 33333333333

11 1 1 1 1 111111 11 111 1 1 1 1

1 11 1

1

33

3 3 333333

11 3

1

1

1 1

33333 3

11

1

3

1 1 11

3 33

1

1

3

333 3

3

3 3

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

(we have chosen the last), and each difference requires $p + 1$ parameters. These methods can be surprisingly effective. We can explain this as follows.

Have $K$ centroids in a $p$-dimensional input space: $\mu_1, \dots, \mu_K$. These centroids define a $(K-1)$-dimensional affine subspace

$u = \mu_1 + \alpha_1(\mu_2 - \mu_1) + \alpha_2(\mu_3 - \mu_1) + \dots + \alpha_{K-1}(\mu_K - \mu_1)$

$= \mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1}$

Any point can be decomposed as

$x = \mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1} + x_\perp$,

where $x_\perp$ is orthogonal to $H_{K-1}$. Then

$\|x - \mu_j\| = \|\mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1} + x_\perp - \mu_j\|$


To summarize:

- $K$ centroids in a $p$-dimensional input space lie in an affine subspace of dimension $K - 1$.
- To locate the closest centroid we can ignore directions orthogonal to this subspace, reducing the dimension from $p$ to $K - 1$.

If $K > 3$ we can ask the question: which subspace of dimension $L < K - 1$ should we project onto for optimality w.r.t. LDA?

[Figure (§4.3, Linear Discriminant Analysis): projection of data from classes with 10-dimensional input vectors. The bold dots correspond to the centroids projected onto the top 2 principal directions.]


To find the sequences of optimal subspaces for LDA:

1. Compute the $K \times p$ matrix $M$ of class centroids and the common covariance matrix $W$ — the within-class variance.
2. Compute $M^* = M W^{-1/2}$, the centroids sphered with respect to $W$.
3. Compute $B^*$, the covariance matrix of $M^*$ — the between-class variance. $B^*$'s eigen-decomposition is $B^* = V^* D_B V^{*t}$. The columns $v_l^*$ of $V^*$ define the basis of the optimal subspace.

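A sketch of this computation (my own NumPy transcription of the steps above, using a symmetric square root of $W$):

```python
import numpy as np

def canonical_variates(mus, W):
    """Discriminant directions for the optimal LDA subspaces.
    mus: (K, p) class centroids; W: (p, p) pooled within-class covariance."""
    evals, U = np.linalg.eigh(W)
    W_inv_half = U @ np.diag(evals ** -0.5) @ U.T   # W^{-1/2}
    M_star = mus @ W_inv_half                        # sphered centroids M* = M W^{-1/2}
    B_star = np.cov(M_star.T, bias=True)             # between-class covariance B*
    d, V_star = np.linalg.eigh(B_star)
    order = np.argsort(d)[::-1]                      # largest eigenvalue first
    V = W_inv_half @ V_star[:, order]                # columns give the canonical variates
    return V

mus = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 0.5]])  # illustrative centroids
W = np.eye(2)
V = canonical_variates(mus, W)
assert V.shape == (2, 2)   # first column: direction of greatest between-class spread
```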

[Figure 4.8 panels omitted: scatter plots of the data projected onto pairs of canonical variates (Coordinates 1, 2, 3, 7, 9, 10).]

FIGURE 4.8. Four projections onto pairs of canonical variates. Notice that as the rank of the canonical variates increases, the centroids become less spread out. In the lower right panel they appear to be superimposed, and the classes are most confused.

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.

[Figure: two panels with the class centroids marked "+". Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).]

W is the common covariance matrix of the original data X.

B is the covariance matrix of the centroid matrix M.

Then for the projected data Z, Fisher's problem is

    max_a  (a^T B a) / (a^T W a)

or equivalently

    max_a  a^T B a    subject to    a^T W a = 1

Fisher's problem amounts to maximizing the Rayleigh quotient

    a_1 = arg max_a  (a^T B a) / (a^T W a)

and a_1 = W^{-1/2} v_1, where v_1 is the eigenvector corresponding to the largest eigenvalue of W^{-1/2} B W^{-1/2} (equivalently, a_1 is the top eigenvector of W^{-1} B).

Can find the next direction a_2:

    a_2 = arg max_a  (a^T B a) / (a^T W a)    subject to    a^T W a_1 = 0

Once again a_2 = W^{-1/2} v_2.

In a similar fashion one can find a_3, a_4, . . .
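This eigen-computation can be sketched in NumPy. This is an illustrative implementation only; the function name fisher_directions and the normalization conventions are choices made here, not the slides' code.

```python
import numpy as np

def fisher_directions(X, y):
    """Compute Fisher discriminant directions a_k = W^{-1/2} v_k, where the
    v_k are eigenvectors of W^{-1/2} B W^{-1/2}, sorted so that a_1
    corresponds to the largest eigenvalue."""
    classes = np.unique(y)
    n, p = X.shape
    mu = X.mean(axis=0)
    W = np.zeros((p, p))                    # pooled within-class covariance
    B = np.zeros((p, p))                    # between-class covariance
    for c in classes:
        Xc = X[y == c]
        d = Xc - Xc.mean(axis=0)
        W += d.T @ d
        m = (Xc.mean(axis=0) - mu)[:, None]
        B += len(Xc) * (m @ m.T)
    W /= n - len(classes)
    B /= n
    # symmetric inverse square root of W
    ew, Vw = np.linalg.eigh(W)
    W_inv_sqrt = Vw @ np.diag(ew ** -0.5) @ Vw.T
    evals, V = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    order = np.argsort(evals)[::-1]         # largest eigenvalue first
    A = W_inv_sqrt @ V[:, order]            # columns are a_1, a_2, ...
    return A, evals[order]
```

On two well-separated Gaussian classes whose centroids differ only in the first coordinate, the leading direction a_1 aligns with that coordinate.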

Classification in Reduced Subspace

[Figure 4.11 scatter plot: vowel training data plotted against Canonical Coordinate 1 and Canonical Coordinate 2, with class decision boundaries; plot residue removed.]

FIGURE 4.11. Decision boundaries for the vowel training data, in the two-dimensional subspace spanned by the first two canonical variates. Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.

The vowel data have 11 classes with 10-dimensional input vectors. The decision boundaries shown are based on basic linear discrimination in the low-dimensional space given by the first 2 canonical variates.

Logistic Regression

Logistic regression arises from trying to model the posterior probabilities of the K classes via linear functions in x, while ensuring they sum to one.

This model: for k = 1, . . . , K − 1,

    P(G = k | X = x) = exp(β_k0 + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^T x))

and for k = K,

    P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^T x))
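The posteriors above can be computed in a few lines of NumPy. A minimal sketch; the stacking convention (each parameter vector stored as (β_k0, β_k1, ..., β_kp)) is a choice made here, not the slides'.

```python
import numpy as np

def posteriors(x, betas):
    """Posterior probabilities P(G = k | X = x) under the model above.
    betas is a list of K-1 parameter vectors, each stored as
    (beta_k0, beta_k1, ..., beta_kp); class K is the reference class."""
    x1 = np.concatenate(([1.0], x))              # prepend 1 for the intercept
    scores = np.array([b @ x1 for b in betas])   # beta_k0 + beta_k^T x
    denom = 1.0 + np.exp(scores).sum()
    return np.concatenate([np.exp(scores) / denom, [1.0 / denom]])
```

By construction the K returned probabilities are positive and sum to one.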

The decision boundary between classes k and l,

    {x : P(G = k | X = x) = P(G = l | X = x)},

is the same as

    {x : (β_k0 − β_l0) + (β_k − β_l)^T x = 0}

for 1 ≤ k < l ≤ K.

To simplify notation let θ = {β_10, β_1^T, β_20, β_2^T, . . .} and write P(G = k | X = x) = p_k(x; θ).

Given n training samples (x_i, g_i), one usually fits the logistic regression model by maximizing the conditional log-likelihood of G given X:

    ℓ(θ) = log Π_{i=1}^n p_{g_i}(x_i; θ) = Σ_{i=1}^n log p_{g_i}(x_i; θ)

The two-class case

For K = 2, with x including the constant term 1, the posterior probabilities are

    p_1(x; θ) = exp(θ^T x) / (1 + exp(θ^T x))    and    p_2(x; θ) = 1 − p_1(x; θ)

A convenient way to write the likelihood for one sample (x_i, g_i) is to code the two-class g_i as a {0, 1} response y_i where

    y_i = 1 if g_i = 1,    y_i = 0 if g_i = 2

Then one can write

    log p_{g_i}(x_i; θ) = y_i log p_1(x_i; θ) + (1 − y_i) log(1 − p_1(x_i; θ))

The log-likelihood of the data becomes

    ℓ(θ) = Σ_{i=1}^n [ y_i log p_1(x_i; θ) + (1 − y_i) log(1 − p_1(x_i; θ)) ]
         = Σ_{i=1}^n [ y_i θ^T x_i − y_i log(1 + e^{θ^T x_i}) − (1 − y_i) log(1 + e^{θ^T x_i}) ]
         = Σ_{i=1}^n [ y_i θ^T x_i − log(1 + e^{θ^T x_i}) ]

To maximize

    ℓ(θ) = Σ_{i=1}^n [ y_i θ^T x_i − log(1 + e^{θ^T x_i}) ]

set its derivative to zero:

    ∂ℓ(θ)/∂θ = Σ_{i=1}^n [ x_i y_i − x_i exp(θ^T x_i) / (1 + exp(θ^T x_i)) ]
             = Σ_{i=1}^n x_i (y_i − p_1(x_i; θ)) = 0

These equations must be solved iteratively, and in the book they use the Newton-Raphson algorithm.

Newton-Raphson requires both the gradient

    ∂ℓ(θ)/∂θ = Σ_{i=1}^n x_i (y_i − p_1(x_i; θ))

and the Hessian

    ∂²ℓ(θ)/∂θ∂θ^T = − Σ_{i=1}^n x_i x_i^T p_1(x_i; θ)(1 − p_1(x_i; θ))

The update is

    θ_new = θ_old − (∂²ℓ(θ)/∂θ∂θ^T)^{-1} ∂ℓ(θ)/∂θ

Write the Hessian and gradient in matrix notation. Let

X be the n × (p + 1) matrix with (1, x_i^T) on each row,

p = (p_1(x_1; θ_old), p_1(x_2; θ_old), . . . , p_1(x_n; θ_old))^T,

W be the n × n diagonal matrix with ith diagonal element p_1(x_i; θ_old)(1 − p_1(x_i; θ_old)).

Then

    ∂ℓ(θ)/∂θ = X^T (y − p)    and    ∂²ℓ(θ)/∂θ∂θ^T = − X^T W X

The Newton step is then

    θ_new = θ_old + (X^T W X)^{-1} X^T (y − p)
          = (X^T W X)^{-1} X^T W ( X θ_old + W^{-1}(y − p) )
          = (X^T W X)^{-1} X^T W z

that is, a weighted least-squares step

    θ_new = arg min_θ (z − Xθ)^T W (z − Xθ)

with response

    z = X θ_old + W^{-1}(y − p)

known as the adjusted response. Note that at each iteration W, p and z change, hence the name iteratively reweighted least squares (IRLS).
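The IRLS update above can be sketched in NumPy. A minimal version with none of the safeguards a production implementation would add (no step halving, no regularization, no convergence test):

```python
import numpy as np

def logreg_irls(X, y, iters=15):
    """Two-class logistic regression by Newton-Raphson / IRLS, following
    theta_new = (X^T W X)^{-1} X^T W z with z = X theta_old + W^{-1}(y - p).
    X holds raw inputs (the intercept column is added here); y is in {0, 1}."""
    Xa = np.hstack([np.ones((len(X), 1)), X])        # rows (1, x_i^T)
    theta = np.zeros(Xa.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xa @ theta))        # p_1(x_i; theta_old)
        w = np.maximum(p * (1.0 - p), 1e-10)         # diagonal of W
        z = Xa @ theta + (y - p) / w                 # adjusted response
        theta = np.linalg.solve(Xa.T @ (Xa * w[:, None]), Xa.T @ (w * z))
    return theta
```

On overlapping (non-separable) data the iteration converges in a handful of steps; on perfectly separable data the unpenalized likelihood has no finite maximizer and the iterates diverge, which is why real implementations add safeguards.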

A toy example

Use logistic regression to find a decision boundary.

[Figures: successive IRLS iterations; each data point is drawn with size proportional to 1/W_ii.]

Logistic regression converges to this decision boundary.

L1-regularized logistic regression

The L1 penalty can be used for variable selection in logistic regression by maximizing a penalized version of the log-likelihood:

    max_{β_0, β}  Σ_{i=1}^n [ y_i (β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) ] − λ Σ_{j=1}^p |β_j|

Note:

the predictors should be standardized to ensure the penalty is meaningful,

the intercept β_0 is not penalized.
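One simple way to optimize this criterion is proximal gradient descent with soft-thresholding. This is a hedged sketch of that optimizer, not the coordinate-descent path algorithms used by packages such as glmnet; the function name and step sizes are choices made here.

```python
import numpy as np

def l1_logreg(X, y, lam=0.1, lr=0.5, iters=3000):
    """L1-penalized two-class logistic regression by proximal gradient
    descent; the intercept is left unpenalized and the predictors are
    assumed already standardized."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        b0 -= lr * (p1 - y).mean()                   # unpenalized intercept
        b -= lr * (X.T @ (p1 - y)) / n               # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)  # prox step
    return b0, b
```

On data where only the first two predictors carry signal, the penalty drives the coefficient of a pure-noise predictor to (essentially) zero, which is the variable-selection effect.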

Separating Hyperplanes

In this section we describe separating hyperplane classifiers; we will only consider separable training data.

These methods construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

A hyperplane is defined as

    {x : β_0 + β^T x = 0}

Perceptrons

Classifiers that compute a linear combination of the input features and return the sign,

    f(x) = β_0 + β^T x,    G(x) = sign(f(x)),

were called perceptrons (Rosenblatt, 1958). They set the foundations for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra.

[Figure 4.15: the hyperplane L : β_0 + β^T x = 0 in IR², with a point x, a point x_0 on L, and the normal direction β* = β/||β||.]

Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^T x = 0; since we are in IR² this is a line. Here we list some properties:

1. For any two points x_1 and x_2 lying in L, β^T(x_1 − x_2) = 0, and hence β* = β/||β|| is the vector normal to the surface of L.

2. For any point x_0 in L, β^T x_0 = −β_0.

3. The signed distance of any point x to L is given by

    β*^T (x − x_0) = (β^T x + β_0)/||β|| = f(x)/||f′(x)||
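Property 3 can be checked numerically. A minimal sketch:

```python
import numpy as np

def signed_distance(x, beta0, beta):
    """Signed distance from x to L = {x : beta0 + beta^T x = 0};
    property 3 above: the distance equals f(x) / ||beta||."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)
```

For the line x_1 + x_2 = 1 (β_0 = −1, β = (1, 1)), the point (1, 1) lies at distance 1/√2 on the positive side, the origin lies on the negative side, and any point on the line is at distance zero.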

Perceptron Learning

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The objective function

Have labelled training data {(x_i, y_i)} with x_i ∈ R^p and y_i ∈ {−1, 1}.

A point x_i is misclassified if sign(β_0 + β^T x_i) ≠ y_i.

This can be re-stated as: a point x_i is misclassified if

    y_i (β_0 + β^T x_i) < 0

The goal is to find β_0 and β which minimize

    D(β, β_0) = − Σ_{i∈M} y_i (x_i^T β + β_0)

where M is the set of misclassified points.

Want to find β_0 and β which minimize

    D(β, β_0) = − Σ_{i∈M} y_i (x_i^T β + β_0) = − Σ_{i∈M} y_i f_{β,β_0}(x_i)

D(β, β_0) is non-negative.

D(β, β_0) is proportional to the sum of the distances of the misclassified points to the decision boundary.

Questions:

Is there a unique β, β_0 which minimizes D(β, β_0) (disregarding re-scaling of β and β_0)?

The gradient, assuming a fixed M, is given by

    ∂D(β, β_0)/∂β = − Σ_{i∈M} y_i x_i,        ∂D(β, β_0)/∂β_0 = − Σ_{i∈M} y_i

Stochastic gradient descent visits the misclassified points one at a time and, for each misclassified (x_i, y_i), updates

    β ← β + ρ y_i x_i    and    β_0 ← β_0 + ρ y_i

where the learning rate ρ can be taken to be 1. Repeat this step until no points are misclassified.
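The update rule above fits in a short loop. A minimal sketch, assuming separable data (otherwise the loop simply stops after max_epochs without converging):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """Rosenblatt's perceptron: cycle through the data and apply
    beta <- beta + rho*y_i*x_i, beta0 <- beta0 + rho*y_i to each
    misclassified point, until none remain.  y_i in {-1, +1}."""
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (beta0 + beta @ xi) <= 0:     # misclassified (or on L)
                beta = beta + rho * yi * xi
                beta0 = beta0 + rho * yi
                mistakes += 1
        if mistakes == 0:
            break
    return beta0, beta
```

On linearly separable data the returned hyperplane classifies every training point strictly correctly, which is exactly the stopping condition.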

[Figure: successive iterates of the perceptron algorithm. A point misclassified by the current estimate β^(0) is used to update it to β^(1), and so on through β^(2), . . . , β^(17), after which no points are misclassified.]

Is this the best separating hyperplane we could have found?

Pros

If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.

Cons

All separating hyperplanes are considered equally valid.

The one found depends on the initial guess for β and β_0.

The finite number of steps can be very large.

If the data is non-separable, the algorithm will not converge.

Optimal Separating Hyperplane

Which separating hyperplane should we choose? Consider the problem of finding a separating hyperplane for a linearly separable dataset {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} with x_i ∈ R^d and y_i ∈ {−1, 1}.

[Figure: several separating hyperplanes for the same linearly separable data. A "bad" hyperplane passes too close to the training points; a "better" hyperplane is far away from all of them. A second figure marks the maximum-margin hyperplane and its margin.]

Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. Instead, it seems reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities.

Therefore, the optimal separating hyperplane will be the one with the largest margin, defined as the minimum distance of an example to the decision boundary. The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class [Vapnik, 1996]. This provides a unique solution to the separating hyperplane problem and a decision boundary that generalizes well.

[Slides adapted from "Introduction to Pattern Analysis", Ricardo Gutierrez-Osuna, Texas A&M University.]

A first attempt:

    max_{β, β_0, ||β||=1}  M    subject to    y_i (β^T x_i + β_0) ≥ M,    i = 1, . . . , n

Remove the constraint ||β|| = 1 by adjusting the constraints to

    y_i (β^T x_i + β_0) ≥ M ||β||,    i = 1, . . . , n

and set ||β|| = 1/M. Then the above optimization problem is equivalent to

    min_{β, β_0}  (1/2) ||β||²    subject to    y_i (β^T x_i + β_0) ≥ 1,    i = 1, . . . , n
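As a sanity check of this formulation, one can hand a tiny separable problem to a generic constrained solver. A sketch using SciPy's SLSQP on four made-up 2-D points; a dedicated QP solver is what one would use in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin problem:  min (1/2)||beta||^2
# subject to  y_i (beta^T x_i + beta0) >= 1  for all i.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(w):                        # w = (beta0, beta_1, beta_2)
    return 0.5 * (w[1] ** 2 + w[2] ** 2)

cons = [{'type': 'ineq',
         'fun': (lambda w, i=i: y[i] * (w[0] + w[1:] @ X[i]) - 1.0)}
        for i in range(len(X))]
res = minimize(objective, x0=np.zeros(3), constraints=cons, method='SLSQP')
beta0, beta = res.x[0], res.x[1:]
```

For these four points the minimizer works out to β_0 = 0 and β = (0.5, 0.5), so the margin 1/||β|| equals √2.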

"A*B0?-$?/,"-1*CDE

"+"-%&).%&+(/0"#1&2%)34&%,5/%44&")&(4&(&67#$)"*#&*6&

With this formulation of the problem

&:"(4&*6&).%&4%5(/()"#0&.;5%/52(#%

.+"/0%+1.%2)(+'-*.%&.+3..-%'%4#)-+%5%'-2%'%46'-.%730&8%)(

=

3 min

5 !&

, 0

1

kk2 subject to yi ( t xi + 0 ) 1, i = 1, . . . , n

2

$'6%1/4."46'-.%1'(%)-:)-)+.%(#6;+)#-(%&/%()$46/%(*'6)-,%+1.%

(0%3.%*1##(.%+1.%(#6;+)#-%:#"%31)*1%+1.%2)(*")$)-'-+%

The margin has thickness 1/kk as shown

distance between a point x and a plane (w, b)

.%:#"%+1.%+"')-)-,%.5'$46.(%*6#(.(+%+#%+1.%&#;-2'"/

slightly different).

3 5) ! & " @

=

,=

1.%!"#$#%!"&'1/4."46'-.

A

3

*.%:"#$%+1.%*6#(.(+%

2'"/%)(

3=5 ! &

3

"

$.(

$"

A

3

@

3

&

3

3=5 ! &

3

,<

in figure

(notation

|wT x+b|

is

    min_{β, β_0}  (1/2) ||β||²    subject to    y_i (β^T x_i + β_0) ≥ 1,    i = 1, . . . , n

This is a convex optimization problem: a quadratic objective with linear inequality constraints. Its Lagrange primal function is

    L_p(β, β_0, α) = (1/2) ||β||² + Σ_{i=1}^n α_i (1 − y_i (β^T x_i + β_0))

The Karush-Kuhn-Tucker conditions state that β* = (β_0, β) is a minimum of this cost function if there exists an α such that

    ∇_{β*} L_p(β*, α) = 0
    α_j ≥ 0                                  for j = 1, . . . , n
    α_j (1 − y_j (β_0 + x_j^T β)) = 0        for j = 1, . . . , n
    1 − y_j (β_0 + x_j^T β) ≤ 0              for j = 1, . . . , n

Active constraints and inactive constraints:

Let A be the set of indices with α_j > 0. Then

    L_p(β*, α) = (1/2) ||β*||² + Σ_{j∈A} α_j (1 − y_j (β_0 + x_j^T β))

Setting the gradient of L_p to zero gives

    β = Σ_{j∈A} α_j y_j x_j        and        0 = Σ_{j∈A} α_j y_j

For j ∉ A we have α_j = 0, and by complementary slackness the active terms vanish, so at the solution

    L_p(β*, α) = (1/2) ||β*||²

To summarize

As we have a convex optimization problem, it has a single (global) minimum.

If i ∈ A then y_i (β_0 + x_i^T β) = 1 and x_i lies on the boundary of the margin; such an x_i is called a support vector.

And if i ∉ A then y_i (β_0 + x_i^T β) > 1 and x_i lies outside of the margin.

β is a linear combination of the support vectors:

    β = Σ_{j∈A} α_j y_j x_j

Thus the SVM in fact depends only on a small subset of the training points, the support vectors:

    β = Σ_{j∈A} α_j y_j x_j

How do I calculate α?

You have seen that the optimal solution is a weighted sum of the support vectors. The most common approach is to solve the dual Lagrange problem. (The solutions of the primal and dual problems coincide here because of the original quadratic cost function and linear inequality constraints.)

    max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n α_i α_k y_i y_k x_i^T x_k

    subject to    α_i ≥ 0 for all i    and    Σ_{i=1}^n α_i y_i = 0

(The equality constraint comes from stationarity of L_p in β_0.)
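The same toy problem used earlier can be solved through this dual, and β recovered as the weighted sum of support vectors. Again a generic SciPy solver stands in for the specialized QP/SMO solvers used in practice, and the four data points are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Dual:  max_a  sum(a) - (1/2) a^T Q a  with  Q_ik = y_i y_k x_i^T x_k,
# subject to  a >= 0  and  sum_i a_i y_i = 0.
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = y[:, None] * X                       # rows y_i x_i
Q = G @ G.T                              # Q_ik = y_i y_k x_i^T x_k

def neg_dual(a):                         # minimize the negated dual
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, x0=np.full(len(X), 0.1),
               bounds=[(0.0, None)] * len(X),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x
beta = (alpha * y) @ X                   # beta = sum_j alpha_j y_j x_j
```

The recovered β matches the primal solution (0.5, 0.5) for this data, and only the points on the margin receive nonzero α.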

- Il 03900006 eUploaded byJonathan Parra Mondaca
- Gate Level Diagrams of LatchUploaded bymdhuq1
- LYS03124USENUploaded byDipesh
- Virtual Identity AG: Social Media as a success factor for companiesUploaded by4gen_1
- MathTalk ManualUploaded byCubby Hill
- Rectangular Tank Stiffener CalcUploaded byadonara_je
- Smart BoardsUploaded bySteve Mackenzie
- Traffic SystemUploaded byArinjayKumar