
# Chapter 4: Linear Methods for Classification

DD3364

## Introduction

## Focus on linear classification

Want to learn a predictor $G : \mathbb{R}^p \to \mathcal{G} = \{1, \dots, K\}$.

$G$ divides the input space into regions labelled according to their classification.

The boundaries between these regions are termed the **decision boundaries**.


## An example where linear decision boundaries arise

Learn a discriminant function $\delta_k(x)$ for each class $k$ and set

$$G(x) = \arg\max_k \delta_k(x)$$

The decision boundaries are linear if there is a monotone function $g$ such that

$$g(\delta_k(x)) = \beta_{k0} + \beta_k^t x$$


## Examples of discriminant functions

Example 1: Fit a linear regression model to the class indicator variables. Then the discriminant functions are

$$\delta_k(x) = \hat\beta_{k0} + \hat\beta_k^t x$$

Example 2: A popular model when there are two classes is:

$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^t x)}{1 + \exp(\beta_0 + \beta^t x)}, \qquad
P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^t x)}$$

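As a quick sanity check of the two-class model, the posteriors can be sketched in a few lines. This is my own illustration (the function name and NumPy usage are not from the slides):

```python
import numpy as np

def logistic_posteriors(x, beta0, beta):
    """Two-class posteriors:
    P(G=1|x) = exp(b0 + b^t x) / (1 + exp(b0 + b^t x)),
    P(G=2|x) = 1 / (1 + exp(b0 + b^t x))."""
    z = beta0 + np.dot(beta, x)
    p1 = np.exp(z) / (1.0 + np.exp(z))
    return p1, 1.0 - p1
```

Note that $\log(P_1/P_2) = \beta_0 + \beta^t x$, so the decision boundary $\{x : P_1 = P_2\}$ is the hyperplane $\beta_0 + \beta^t x = 0$.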

## Can directly learn the linear decision boundary

For a two-class problem with p-dimensional inputs this amounts to finding a separating hyperplane. Two classic examples:

- the perceptron model and algorithm of Rosenblatt,
- the SVM model and algorithm of Vapnik.

In the forms quoted, both of these algorithms find separating hyperplanes if they exist, and fail if the points are not linearly separable.

There are fixes for the non-separable case, but we will not cover them here.

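Rosenblatt's update rule is simple enough to sketch. This is a hedged illustration under my own naming and stopping rule, not the exact algorithm presented in class:

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's perceptron for labels y in {-1, +1}.
    Cycles through the data, nudging (beta0, beta) towards each
    misclassified point; stops once a full pass makes no mistakes,
    i.e. a separating hyperplane beta0 + beta^t x = 0 is found."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (beta0 + X[i] @ beta) <= 0:   # on or beyond the boundary
                beta0 += lr * y[i]
                beta += lr * y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                           # converged: data separated
            break
    return beta0, beta
```

If the classes are not linearly separable the loop simply exhausts `max_epochs`, which matches the failure mode noted above.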

## Linear decision boundaries can be made non-linear

Can expand the variable set $X_1, X_2, \dots, X_p$ by including their squares and cross-products $X_1^2, X_2^2, \dots, X_p^2, X_1 X_2, X_1 X_3, \dots$, giving $p(p+3)/2$ variables in total.

Linear decision boundaries in the augmented space correspond to quadratic decision boundaries in the original space.
[Figure 4.1 (ESL §4.2, p. 103): the left plot shows some data from three classes, with linear decision boundaries; scatter-plot point labels omitted.]

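The quadratic expansion just described can be sketched as a small helper; this is my own illustration, not code from the lecture:

```python
import numpy as np

def augment_quadratic(X):
    """Append squares and pairwise cross-products to the columns
    of X, so that a linear boundary in the augmented space is a
    quadratic boundary in the original space."""
    n, p = X.shape
    cols = [X, X ** 2]                       # originals and squares
    for i in range(p):
        for j in range(i + 1, p):
            cols.append((X[:, i] * X[:, j]).reshape(n, 1))
    return np.hstack(cols)                   # n x p(p+3)/2
```

For $p = 2$ this yields the five variables $X_1, X_2, X_1^2, X_2^2, X_1 X_2$ used later for the quadratic boundaries in Figure 4.6.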

## Use linear regression to find discriminant functions

Have training data $\{(x_i, g_i)\}_{i=1}^n$ where each $x_i \in \mathbb{R}^p$ and $g_i \in \{1, \dots, K\}$.

For each $k$ construct a linear discriminant $\delta_k(x)$ via:

1. For $i = 1, \dots, n$ set
   $$y_i = \begin{cases} 0 & \text{if } g_i \neq k \\ 1 & \text{if } g_i = k \end{cases}$$

2. Compute $(\hat\beta_{0k}, \hat\beta_k) = \arg\min_{\beta_0, \beta_k} \sum_{i=1}^n (y_i - \beta_0 - \beta_k^t x_i)^2$

Define

$$\delta_k(x) = \hat\beta_{0k} + \hat\beta_k^t x$$

Classify a new point $x$ with

$$G(x) = \arg\max_k \delta_k(x)$$

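The $K$ least-squares problems can be solved in one shot by regressing on an indicator matrix. A minimal sketch (function names are mine; labels assumed to be $0, \dots, K-1$):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of delta_k(x) = b0k + bk^t x for all k at once.
    g holds class labels in {0, ..., K-1}."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])          # prepend intercept column
    Y = np.zeros((n, K))
    Y[np.arange(n), g] = 1.0                      # n x K indicator matrix
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)    # (p+1) x K coefficients
    return B

def predict(B, x):
    """G(x) = arg max_k delta_k(x)."""
    return int(np.argmax(np.concatenate([[1.0], x]) @ B))
```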

## 3 class example

Use linear regression of an indicator matrix to find the discriminant functions for the above 3 classes.

## Construct K linear regression problems

For each k construct the response vectors from the class labels.

[Plots of the three indicator response vectors omitted.]

## Construct K discriminant functions

For each k fit the linear regression to the indicator responses.

[Plots of the fitted discriminant functions $\delta_1(x)$, $\delta_2(x)$, $\delta_3(x)$ omitted.]

## This approach will fail in this case

The training data from 3 classes.

[Plots of the training data and of the fitted discriminant functions $\delta_1(x)$, $\delta_2(x)$, $\delta_3(x)$ omitted.]

## The resulting decision boundary

In this last example *masking* has occurred: one class is never dominant, so it is never predicted. This occurs because of the rigid nature of the linear discriminant functions.


## Optimal classification requires the posterior

To perform optimal classification we need to know $P(G \mid X)$. Let $f_k(x)$ represent the class-conditional density $P(X \mid G = k)$ and $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^K \pi_k = 1$.

A simple application of Bayes' rule gives

$$P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^K f_l(x)\,\pi_l}$$

Therefore, for classification, having the $f_k(x)$ is almost equivalent to having $P(G = k \mid X = x)$.

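Bayes' rule here is essentially one line of code; a minimal sketch under my own naming:

```python
import numpy as np

def bayes_posteriors(f_at_x, priors):
    """P(G=k|X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l, given the
    class-conditional densities evaluated at x and the priors pi_k."""
    unnorm = np.asarray(f_at_x, float) * np.asarray(priors, float)
    return unnorm / unnorm.sum()
```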

## How to model the class densities

Many methods are based on specific models for $f_k(x)$:

- linear and quadratic discriminant analysis use Gaussian class densities,
- mixtures of Gaussians allow for more flexible, nonlinear decision boundaries,
- nonparametric density estimates allow the most flexibility,
- naive Bayes assumes each class density is a product of marginals: $f_k(X) = \prod_{j=1}^p f_{kj}(X_j)$.


## Multivariate Gaussian class densities

Model each $f_k(x)$ as a multivariate Gaussian:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\tfrac{1}{2}(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)\right\}$$

Linear Discriminant Analysis (LDA) arises in the special case when all classes have equal covariance matrices, $\Sigma_k = \Sigma$ for all $k$. (Similar discriminant functions were derived earlier for the case where each class density is Normally distributed with equal covariance matrices.)

[Figure: class distributions, decision boundary, and the resulting partition omitted.]


## LDA

Can see this as

$$\log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)}
= \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}
= \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2}\mu_k^t \Sigma^{-1} \mu_k + \tfrac{1}{2}\mu_l^t \Sigma^{-1} \mu_l + x^t \Sigma^{-1}(\mu_k - \mu_l)
= x^t a + b$$

a linear function of $x$. The equal covariance matrices allow the quadratic terms $x^t \Sigma_k^{-1} x$ and $x^t \Sigma_l^{-1} x$ to cancel.

From the log-odds function we see that the linear discriminant functions

$$\delta_k(x) = x^t \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^t \Sigma^{-1} \mu_k + \log \pi_k$$

are an equivalent description of the decision rule, with

$$G(x) = \arg\max_k \delta_k(x)$$

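The LDA discriminant functions $\delta_k(x)$ translate directly to code; a sketch under my own naming:

```python
import numpy as np

def lda_deltas(x, mus, Sigma, priors):
    """delta_k(x) = x^t S^{-1} mu_k - 0.5 mu_k^t S^{-1} mu_k + log pi_k
    for every class k; classify with argmax over the result."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)
        for mu, pi in zip(mus, priors)
    ])
```

With $\Sigma = I$ and equal priors this reduces to nearest-centroid classification.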

## LDA: Some practicalities

In practice we don't know the parameters of the Gaussian distributions and estimate these from the training data.

Let $n_k$ be the number of class-$k$ observations; then

$$\hat\pi_k = n_k / n$$

$$\hat\mu_k = \sum_{g_i = k} x_i / n_k$$

$$\hat\Sigma = \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^t / (n - K)$$

[Scatter plot of a three-class example with fitted decision boundaries omitted (ESL §4.3, p. 109).]
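These plug-in estimators can be sketched as follows (my own function name; labels assumed to be $0, \dots, K-1$):

```python
import numpy as np

def lda_fit(X, g, K):
    """Plug-in estimates: pi_k = n_k/n, mu_k = class mean,
    Sigma = pooled within-class covariance with divisor n - K."""
    n, p = X.shape
    priors, mus = np.zeros(K), np.zeros((K, p))
    Sigma = np.zeros((p, p))
    for k in range(K):
        Xk = X[g == k]
        priors[k] = len(Xk) / n
        mus[k] = Xk.mean(axis=0)
        D = Xk - mus[k]
        Sigma += D.T @ D                 # accumulate within-class scatter
    return priors, mus, Sigma / (n - K)
```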

## When the $\Sigma_k$'s are not all equal

The quadratic terms remain and we get quadratic discriminant functions (QDA):

$$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

The decision boundary between each pair of classes $k$ and $l$ is then described by a quadratic equation $\{x : \delta_k(x) = \delta_l(x)\}$.

[Bivariate two-class example: plots of the class distributions, decision boundaries, and partition omitted.]

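The QDA discriminant is a direct transcription of the formula; my sketch uses `slogdet` and `solve` for numerical stability:

```python
import numpy as np

def qda_delta(x, mu, Sigma, prior):
    """delta_k(x) = -0.5 log|Sigma_k|
                    - 0.5 (x - mu_k)^t Sigma_k^{-1} (x - mu_k) + log pi_k"""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d) + np.log(prior)
```

Unlike LDA, each class carries its own $\Sigma_k$, so the $x$-dependent quadratic term no longer cancels between classes.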

## Best way to compute a quadratic discriminant function?

The left plot below shows the quadratic decision boundaries found using LDA in the five-dimensional space $X_1, X_2, X_1 X_2, X_1^2, X_2^2$; the right plot shows those found by QDA (see Figure 4.6).


FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows
the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA
in the five-dimensional space X1 , X2 , X1 X2 , X12 , X22 ). The right plot shows the
quadratic decision boundaries found by QDA. The differences are small, as is
usually the case.


## LDA and QDA summary

These methods can be surprisingly effective. One explanation: real data can often only support simple decision boundaries, and the estimates provided by the Gaussian models are stable.

## Affine subspace defined by centroids of the classes

Have $K$ centroids in a $p$-dimensional input space: $\mu_1, \dots, \mu_K$.

These centroids define a $(K-1)$-dimensional affine subspace $H_{K-1}$, where if $u \in H_{K-1}$ then

$$u = \mu_1 + \alpha_1(\mu_2 - \mu_1) + \alpha_2(\mu_3 - \mu_1) + \dots + \alpha_{K-1}(\mu_K - \mu_1) = \mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1}$$

If $x \in \mathbb{R}^p$ then it can be written as

$$x = \mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1} + x_\perp,$$

where $x_\perp \perp H_{K-1}$.

If the data have been sphered so that the common covariance matrix is the identity matrix, then the (Mahalanobis) distance to centroid $\mu_j$ is

$$\|x - \mu_j\| = \|\mu_1 + \alpha_1 d_1 + \alpha_2 d_2 + \dots + \alpha_{K-1} d_{K-1} + x_\perp - \mu_j\|$$


## Locating the closest centroid: can ignore the orthogonal component

To summarize:

- $K$ centroids in a $p$-dimensional input space lie in an affine subspace of dimension $K - 1$.
- If $p \gg K$ this is a big drop in dimension.
- To locate the closest centroid we can ignore the directions orthogonal to $H_{K-1}$, since $x_\perp$ contributes equally to the distance to every centroid.

## What about a subspace of dimension L < K − 1?

If $K > 3$ we can ask: which subspace of dimension $L < K - 1$ should we project onto for optimality w.r.t. LDA?

Fisher's answer: project onto the subspace that spreads the (sphered) centroids out as much as possible in terms of variance.

[Figure (ESL §4.3, p. 107), "Linear Discriminant Analysis": data projected onto the first two canonical coordinates; point cloud omitted.]

In this example we have 11 classes with 10-dimensional input vectors. The bold dots correspond to the centroids projected onto the top 2 principal directions.


## The optimal sequence of subspaces

To find the sequence of optimal subspaces for LDA:

1. Compute the K × p matrix of class centroids M and the common covariance matrix W (the within-class variance).

2. Compute M* = M W^(−1/2) using the eigen-decomposition of W.

3. Compute B*, the covariance matrix of M* (the between-class variance). B*'s eigen-decomposition is B* = V* D_B V*^t. The columns v_l* of V* define the basis of the optimal subspace.


## The l-th discriminant variable is given by Z_l = v_l*^t W^(−1/2) X
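The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not from the slides: the function name and the pooled estimate of W are my own choices.

```python
import numpy as np

def canonical_variates(X, g, K):
    """Discriminant (canonical) directions by the whitening route above.
    Returns the p x p matrix A whose l-th column is W^(-1/2) v_l*, so the
    l-th discriminant variable is Z_l = A[:, l]^t x."""
    n, p = X.shape
    # step 1: K x p centroid matrix M and pooled within-class covariance W
    M = np.vstack([X[g == k].mean(axis=0) for k in range(K)])
    W = sum(np.cov(X[g == k], rowvar=False) * (np.sum(g == k) - 1)
            for k in range(K)) / (n - K)
    # step 2: M* = M W^(-1/2) via the eigen-decomposition of W
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T
    M_star = M @ W_inv_sqrt
    # step 3: eigen-decomposition of B*, the covariance of M*
    B_star = np.cov(M_star, rowvar=False)
    eigvals, V = np.linalg.eigh(B_star)
    V = V[:, np.argsort(eigvals)[::-1]]          # largest eigenvalue first
    return W_inv_sqrt @ V
```

Projecting with the first two columns, `Z = X @ A[:, :2]`, gives the two-dimensional view used in the figures. A useful sanity check is that the columns of A are W-orthonormal, i.e. A^t W A = I.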

[Figure: four scatter plots of the vowel data projected onto pairs of canonical coordinates; the axis labels include Coordinates 1, 2, 3, 7, 9 and 10.]

FIGURE 4.8. Four projections onto pairs of canonical variates.

## Note that as the rank of the canonical variates increases, the projected centroids become less spread out. In the lower right panel they appear to be superimposed.

## LDA via the Fisher criterion

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination Z = a^t X such that the between-class variance is maximized relative to the within-class variance.

## Why this criterion makes sense

FIGURE 4.9. Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).


## The Fisher criterion

W is the common covariance matrix of the original data X.
B is the covariance matrix of the centroid matrix M.
Then for the projected data Z = a^t X, the between-class variance is a^t B a and the within-class variance is a^t W a.

Fisher's problem amounts to maximizing the Rayleigh quotient

max_a (a^t B a) / (a^t W a)

or equivalently

max_a a^t B a subject to a^t W a = 1


## The Fisher criterion

Fisher's problem amounts to maximizing the Rayleigh quotient

max_a (a^t B a) / (a^t W a)

This is a generalized eigenvalue problem, with the optimal a given by the eigenvector corresponding to the largest eigenvalue of W^(−1) B.

It can be shown that a_1 is equal to W^(−1/2) v_1* defined earlier.

We can find the next direction a_2:

a_2 = arg max_a (a^t B a) / (a^t W a) subject to a^t W a_1 = 0

Once again a_2 = W^(−1/2) v_2*. In a similar fashion we can find a_3, a_4, ...
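The generalized eigenvalue view is easy to check numerically. A small NumPy sketch (helper names are mine): the leading eigenvector of W^(−1) B should beat any other direction on the Rayleigh quotient.

```python
import numpy as np

def fisher_direction(B, W):
    """First discriminant direction a1: the eigenvector of W^{-1} B
    belonging to the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
    a1 = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return a1 / np.linalg.norm(a1)

def rayleigh(a, B, W):
    """The Rayleigh quotient a^t B a / a^t W a."""
    return (a @ B @ a) / (a @ W @ a)
```

Since the quotient is invariant to rescaling of a, a1 can be normalized freely; random directions should never exceed its quotient value.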


## The a_l's are referred to as discriminant coordinates or canonical variates

Classification in the Reduced Subspace
[Figure: the vowel training data plotted against Canonical Coordinate 1 and Canonical Coordinate 2, with the LDA decision boundaries overlaid.]

FIGURE 4.11. Decision boundaries for the vowel training data, in the two-dimensional subspace spanned by the first two canonical variates. Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.

## In this example we have 11 classes with 10-dimensional input vectors

The decision boundaries are based on using basic linear discrimination in the low-dimensional space given by the first 2 canonical variates.

Logistic Regression

## Logistic regression

Arises from trying to model the posterior probabilities of the K classes via linear functions in x, while ensuring that they sum to one.

The simple model used is, for k = 1, ..., K − 1:

P(G = k|X = x) = exp(β_k0 + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^t x))

and for k = K:

P(G = K|X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^t x))

## These posterior probabilities clearly sum to one.


## Logistic regression

This model, for k = 1, ..., K − 1:

P(G = k|X = x) = exp(β_k0 + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^t x))

and for k = K:

P(G = K|X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^t x))

induces linear decision boundaries between classes, since

{x : P(G = k|X = x) = P(G = l|X = x)}

is the same set as

{x : (β_k0 − β_l0) + (β_k − β_l)^t x = 0}

for 1 ≤ k, l ≤ K.
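A minimal sketch of these posteriors in NumPy (the function name and argument layout are my own; class K is the reference class whose linear score is fixed at zero):

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Posterior probabilities under the model above.
    beta0: K-1 intercepts; beta: (K-1, p) coefficient rows."""
    e = np.exp(beta0 + beta @ x)              # exp(beta_k0 + beta_k^t x)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)  # classes 1..K-1, then class K
```

By construction the returned vector is positive and sums to one, exactly as the slide claims.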

## Fitting Logistic regression models

To simplify notation let

θ = {β_10, β_1^t, β_20, β_2^t, ...} and P(G = k|X = x) = p_k(x; θ)

Given training data {(x_i, g_i)}_{i=1}^n one usually fits the logistic regression model by maximizing the log-likelihood

ℓ(θ) = log Π_{i=1}^n p_{g_i}(x_i; θ) = Σ_{i=1}^n log p_{g_i}(x_i; θ)

## In my opinion this is an abuse of terminology, as the posterior probabilities are being used...

## Fitting Logistic regression models: The two class case

p_1(x; θ) = exp(β^t x) / (1 + exp(β^t x)) and p_2(x; θ) = 1 − p_1(x; θ)

Let β = (β_10, β_1^t) and assume the x_i's include the constant term 1.

A convenient way to write the likelihood for one sample (x_i, g_i) is to code the two-class g_i as a {0, 1} response y_i, where

y_i = 1 if g_i = 1, and y_i = 0 if g_i = 2

Then one can write

## Fitting Logistic regression models: The two class case

With this coding,

log p_{g_i}(x_i; β) = y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))

The log-likelihood of the data becomes

ℓ(β) = Σ_{i=1}^n [y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))]

     = Σ_{i=1}^n [y_i β^t x_i − y_i log(1 + e^{β^t x_i}) − (1 − y_i) log(1 + e^{β^t x_i})]

     = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

## Fitting Logistic regression models: The two class case

ℓ(β) = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

To maximize the log-likelihood, set its derivatives to zero to get

∂ℓ(β)/∂β = Σ_{i=1}^n [x_i y_i − x_i exp(β^t x_i) / (1 + exp(β^t x_i))]

         = Σ_{i=1}^n x_i [y_i − exp(β^t x_i) / (1 + exp(β^t x_i))]

         = Σ_{i=1}^n x_i (y_i − p_1(x_i; β)) = 0

These are (p + 1) non-linear equations in β. They must be solved iteratively; the book uses the Newton-Raphson algorithm.

## The two class case: Iterative optimization

The gradient is

∂ℓ(β)/∂β = Σ_{i=1}^n x_i (y_i − p_1(x_i; β))

and the Hessian matrix is

∂²ℓ(β)/∂β ∂β^t = − Σ_{i=1}^n x_i x_i^t p_1(x_i; β)(1 − p_1(x_i; β))

A Newton-Raphson update is then

β_new = β_old − (∂²ℓ(β)/∂β ∂β^t)^(−1) ∂ℓ(β)/∂β

## Iterative optimization in matrix notation

Write the Hessian and gradient in matrix notation. Let
X be the n × (p + 1) matrix with (1, x_i^t) on each row,
p = (p_1(x_1; β_old), p_1(x_2; β_old), ..., p_1(x_n; β_old))^t,
W be the n × n diagonal matrix with i-th diagonal element p_1(x_i; β_old)(1 − p_1(x_i; β_old)).

Then

∂ℓ(β)/∂β = X^t (y − p)

and

∂²ℓ(β)/∂β ∂β^t = −X^t W X

## Iterative optimization as iterative weighted LS

The Newton step is then

β_new = β_old + (X^t W X)^(−1) X^t (y − p)
      = (X^t W X)^(−1) X^t W (X β_old + W^(−1)(y − p))
      = (X^t W X)^(−1) X^t W z

We have re-expressed the Newton step as a weighted least squares step

β_new = arg min_β (z − Xβ)^t W (z − Xβ)

with response

z = X β_old + W^(−1)(y − p)

known as the adjusted response. Note that at each iteration W, p and z change.
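The whole Newton/IRLS loop above fits in a few lines of NumPy. This is a sketch under two assumptions: X already carries the constant column, and no fitted probability collapses to 0 or 1 (which would make W singular on separable data).

```python
import numpy as np

def logreg_irls(X, y, n_iter=15):
    """Two-class logistic regression via the iteratively reweighted
    least squares form of the Newton step derived above.
    X must contain the constant column of 1s; y is the {0,1} coding."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p_1(x_i; beta_old)
        w = p * (1.0 - p)                     # diagonal of W
        z = X @ beta + (y - p) / w            # adjusted response
        # weighted least squares: beta_new = argmin (z - Xb)^t W (z - Xb)
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta
```

At convergence the score equations X^t(y − p) = 0 hold, which gives a direct check on the fit.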

A toy example

## Two class problem with 2-dimensional input vectors.

Use Logistic Regression to find a decision boundary.

[Figure: successive IRLS iterations; plotted point sizes indicate 1/W_ii.]

## The current estimate β_cur

Logistic regression converges to this decision boundary.

## L1 regularized logistic regression

The L1 penalty can be used for variable selection in logistic regression by maximizing a penalized version of the log-likelihood

max_{β_0, β} Σ_{i=1}^n [y_i(β_0 + β^t x_i) − log(1 + e^{β_0 + β^t x_i})] − λ Σ_{j=1}^p |β_j|

Note:

## the intercept, β_0, is not included in the penalty term,

the predictors should be standardized to ensure the penalty is meaningful,

## the solution can be found using non-linear programming methods.
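One concrete solver, not from the slides, is proximal gradient descent (ISTA): take gradient steps on the negative log-likelihood and handle the L1 term by soft-thresholding every coefficient except the intercept. A sketch, with my own names and a fixed step size assumed small enough:

```python
import numpy as np

def l1_logreg(x, y, lam, step=0.005, n_iter=4000):
    """L1-penalized logistic regression by proximal gradient descent.
    The intercept beta0 gets a plain gradient step, never thresholded."""
    beta0, beta = 0.0, np.zeros(x.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))
        beta0 -= step * np.sum(prob - y)       # grad of -loglik wrt beta0
        beta -= step * (x.T @ (prob - y))      # grad of -loglik wrt beta
        # prox step for lam * ||beta||_1: soft-thresholding
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta0, beta
```

With λ large enough, coefficients on uninformative predictors are driven exactly to zero, which is the variable-selection effect described above.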

Separating Hyperplanes

## Directly estimating separating hyperplanes

In this section we describe separating hyperplane classifiers; we will only consider separable training data.
These classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

A hyperplane is defined as

{x : β_0 + β^t x = 0}


## Perceptrons

The perceptron appeared in the literature in the late 1950s (Rosenblatt, 1958). Perceptrons set the foundations for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^t x = 0; in R^2 this is a line.

Some properties:

1. For any two points x_1 and x_2 lying in L, β^t(x_1 − x_2) = 0, and hence β* = β/‖β‖ is the vector normal to the surface of L.

2. For any point x_0 in L, β^t x_0 = −β_0.

3. The signed distance of any point x to L is given by

β*^t (x − x_0) = (β^t x + β_0)/‖β‖ = f(x)/‖f′(x)‖
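Property 3 is a one-liner in code; a small NumPy sketch (names are mine):

```python
import numpy as np

def signed_distance(x, beta0, beta):
    """Signed distance of x to {x : beta0 + beta^t x = 0} (property 3)."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)
```

Points on the hyperplane get distance 0; the sign tells which side of the boundary x lies on.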


Perceptron Learning

## Rosenblatt's Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The Objective Function
We have labelled training data {(x_i, y_i)} with x_i ∈ R^p and y_i ∈ {−1, 1}.
A point x_i is misclassified if sign(β_0 + β^t x_i) ≠ y_i.
This can be re-stated as: a point x_i is misclassified if y_i(β_0 + β^t x_i) < 0.
The goal is to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (x_i^t β + β_0)

where M is the set of misclassified points.


## Perceptron Learning: The Objective Function

Want to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (x_i^t β + β_0) = − Σ_{i∈M} y_i f_{β,β_0}(x_i)

D(β, β_0) is non-negative.
D(β, β_0) is proportional to the distance of the misclassified points to the decision boundary defined by β_0 + β^t x = 0.

Questions:
Is there a unique β, β_0 which minimizes D(β, β_0) (disregarding re-scaling of β and β_0)?


## Perceptron Learning: Optimizing the Objective Function

The gradient, assuming a fixed M, is given by

∂D(β, β_0)/∂β = − Σ_{i∈M} y_i x_i,   ∂D(β, β_0)/∂β_0 = − Σ_{i∈M} y_i

Stochastic gradient descent is applied: visit each misclassified point in turn and update

β ← β + ρ y_i x_i  and  β_0 ← β_0 + ρ y_i

where ρ is the learning rate.

Repeat this step until no points are misclassified.

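The update rule just described can be sketched in NumPy; the pass structure and stopping rule below are implementation choices of mine, not from the slides:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_passes=1000):
    """Rosenblatt's perceptron learning algorithm: cycle through the
    data, updating on every misclassified point, until every point
    satisfies y_i (beta0 + beta^t x_i) > 0."""
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (beta0 + beta @ xi) <= 0:   # misclassified (or on boundary)
                beta = beta + rho * yi * xi
                beta0 = beta0 + rho * yi
                mistakes += 1
        if mistakes == 0:
            return beta0, beta                  # separating hyperplane found
    raise RuntimeError("no convergence: data may be non-separable")
```

On separable data the loop terminates with every training point strictly on the correct side of the boundary.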

## Perceptron Learning: One Iteration

Start from the current estimate β^(0). Find a point misclassified by β^(0) and update to get β^(1). Repeating this produces the sequence β^(2), β^(3), ..., β^(17).

## Perceptron Learning: Sequence of iterations

The algorithm stops at β^(17), when no training point is misclassified. Is this the best separating hyperplane we could have found?

## Perceptron Learning Algorithm: Properties

Pros
If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.

Cons
All separating hyperplanes are considered equally valid.
The one found depends on the initial guess for β and β_0.
The finite number of steps can be very large.
If the data is non-separable, the algorithm will not converge.


Optimal Separating Hyperplanes

## Optimal Separating Hyperplane

Consider the problem of finding a separating hyperplane for a linearly separable dataset {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with x_i ∈ R^d and y_i ∈ {−1, 1}.

Which of the infinite separating hyperplanes should we choose? Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. Instead, it seems reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities. Therefore, the optimal separating hyperplane will be the one with the largest margin, defined as the minimum distance of an example to the decision boundary.

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class [Vapnik 1996]. This provides a unique definition of the separating hyperplane, and leads to a decision boundary that generalizes well.

## Which of the infinite hyperplanes should we choose?

One which maximizes the margin.


## Stating the optimization problem

A first attempt:

max_{β, β_0, ‖β‖=1} M  subject to  y_i(β^t x_i + β_0) ≥ M, i = 1, ..., n

## Stating the optimization problem

We can remove the constraint ‖β‖ = 1 by adjusting the constraints:

max_{β, β_0} M  subject to  y_i(β^t x_i + β_0) ≥ M ‖β‖, i = 1, ..., n

Any positively scaled multiple of a feasible (β, β_0) is still feasible, so we can arbitrarily set ‖β‖ = 1/M. Then the above optimization problem is equivalent to

min_{β, β_0} (1/2) ‖β‖²  subject to  y_i(β^t x_i + β_0) ≥ 1, i = 1, ..., n

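As a concrete check of this formulation, here is a minimal numerical sketch (not from the slides; the toy data set and the use of SciPy's general-purpose SLSQP solver are assumptions for illustration) that solves the hard-margin primal directly:

```python
# Solve  min 1/2 ||beta||^2  s.t.  y_i (beta^t x_i + beta_0) >= 1
# on a tiny linearly separable data set.
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: class -1 near the origin, class +1 near (3, 3).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

def objective(w):                     # w = (beta_0, beta_1, beta_2)
    return 0.5 * np.dot(w[1:], w[1:])

# One inequality constraint per example: y_i (beta^t x_i + beta_0) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda w, i=i: y[i] * (X[i] @ w[1:] + w[0]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
beta_0, beta = res.x[0], res.x[1:]

# Every training point ends up on the correct side with functional margin >= 1.
print(beta, beta_0)
```

On this toy problem the solution can also be checked by hand: the closest points force β ∝ (1, 1), giving β ≈ (0.4, 0.4) and β0 ≈ −1.4, i.e. a margin of 1/‖β‖ ≈ 1.77.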

## Stating the optimization problem

Since we want to maximize the margin, let's express it as a function of the weights and bias of the separating hyperplane.

## Optimal separating hyperplanes

The distance between a point x and a plane (w, b) is |wᵗx + b| / ‖w‖.

Since the optimal hyperplane has infinitely many solutions obtained by simply scaling w and b, we choose the solution for which the discriminant function becomes 1 for the training examples closest to the decision boundary: |wᵗx + b| = 1 for those points.

The distance from the closest example to the boundary is then 1/‖w‖, so the margin has thickness 1/‖w‖ on each side, as shown in the figure (notation slightly different: (w, b) plays the role of (β, β0)).

Maximizing the margin is therefore equivalent to

$$\min_{\beta,\, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^t x_i + \beta_0) \ge 1, \quad i = 1, \ldots, n$$
## The solution to this constrained optimization problem

$$\min_{\beta,\, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^t x_i + \beta_0) \ge 1, \quad i = 1, \ldots, n$$

This is a convex optimization problem: a quadratic objective with linear inequality constraints.

Its associated primal Lagrangian function is

$$L_p(\beta, \beta_0, \alpha) = \frac{1}{2}\|\beta\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i(\beta^t x_i + \beta_0)\right)$$


## The solution to this constrained optimization problem

$$\min_{\beta,\, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^t x_i + \beta_0) \ge 1, \quad i = 1, \ldots, n$$

The Karush-Kuhn-Tucker conditions state that $\hat\theta = (\hat\beta_0, \hat\beta)$ is a minimum of this cost function iff there exists a unique $\hat\alpha$ s.t.

$$\nabla_\theta \, L_p(\hat\theta, \hat\alpha) = 0$$
$$\hat\alpha_j \ge 0 \quad \text{for } j = 1, \ldots, n$$
$$\hat\alpha_j \left(1 - y_j(\hat\beta_0 + x_j^t \hat\beta)\right) = 0 \quad \text{for } j = 1, \ldots, n$$
$$\left(1 - y_j(\hat\beta_0 + x_j^t \hat\beta)\right) \le 0 \quad \text{for } j = 1, \ldots, n$$

## Let's check what the KKT conditions imply

Active constraints and inactive constraints:
Let A be the set of indices with $\hat\alpha_j > 0$. Then

$$L_p(\hat\theta, \hat\alpha) = \frac{1}{2}\|\hat\beta\|^2 + \sum_{j \in A} \hat\alpha_j \left(1 - y_j(\hat\beta_0 + x_j^t \hat\beta)\right)$$

and setting $\nabla_\theta L_p = 0$ gives

$$\hat\beta = \sum_{j \in A} \hat\alpha_j y_j x_j \quad \text{and} \quad 0 = \sum_{j \in A} \hat\alpha_j y_j$$

If $y_i(\hat\beta_0 + x_i^t \hat\beta) > 1$ then $\hat\alpha_i = 0$ and $i \notin A$, so

$$L_p(\hat\theta, \hat\alpha) = \frac{1}{2}\|\hat\beta\|^2$$


To summarize

As we have a convex optimization problem, any local minimum is the global minimum.

If $i \in A$ then $y_i(\hat\beta_0 + x_i^t \hat\beta) = 1$, and therefore $x_i$ lies on the boundary of the margin. Such an $x_i$ is called a support vector.

And if $i \notin A$ then $y_i(\hat\beta_0 + x_i^t \hat\beta) > 1$ and $x_i$ lies outside of the margin.

$\hat\beta$ is a linear combination of the support vectors:

$$\hat\beta = \sum_{j \in A} \hat\alpha_j y_j x_j$$
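These implications can be verified numerically. The sketch below uses a hypothetical two-dimensional toy problem whose optimal $(\hat\beta_0, \hat\beta, \hat\alpha)$ can be worked out by hand; all concrete values are assumptions for illustration, not from the slides:

```python
# KKT implications for a solved hard-margin SVM: support vectors lie exactly
# on the margin, the other points lie strictly outside it, and beta is a
# linear combination of the support vectors alone.
import numpy as np

# Hypothetical toy data: class -1 near the origin, class +1 near (3, 3).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Optimal solution of this toy problem (can be verified analytically).
beta, beta_0 = np.array([0.4, 0.4]), -1.4
alpha = np.array([0.0, 0.08, 0.08, 0.16, 0.0, 0.0])

margins = y * (X @ beta + beta_0)
A = alpha > 0                                   # active set of constraints

assert np.allclose(margins[A], 1.0)             # support vectors: y_i(...) = 1
assert np.all(margins[~A] > 1.0)                # inactive points: strictly outside
assert np.allclose((alpha * y) @ X, beta)       # beta = sum_{j in A} alpha_j y_j x_j
assert np.isclose(alpha @ y, 0.0)               # 0 = sum_{j in A} alpha_j y_j
```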


To summarize

Only the $x_i$ that lie on one of the two hyperplanes bounding the margin contribute to $\hat\beta$; at these hyperplanes $y_i(\hat\beta_0 + x_i^t \hat\beta) = 1$, and the examples receive the name support vectors (SVs):

$$\hat\beta = \sum_{j \in A} \hat\alpha_j y_j x_j$$

If $i \notin A$, the example $x_i$ could be removed, or moved around without crossing the margin, and the solution found from the KKT conditions would be the same.

Thus the SVM in fact depends only on a small subset of the training examples: the support vectors.

How do I calculate $\hat\alpha$?

You have seen that the optimal solution $\hat\beta$ is a weighted sum of the support vectors.

But how can we calculate these weights?

The most common approach is to solve the Dual Lagrange problem as opposed to the Primal Lagrange problem. (The solutions to these problems are the same because of the original quadratic cost function and linear inequality constraints.)

The dual problem is also convex. It has the form

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k y_i y_k x_i^t x_k \quad \text{subject to} \quad \alpha_i \ge 0 \;\; \forall i$$

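A minimal sketch of this approach (the toy data and the SciPy solver are assumptions, not from the slides): maximize the dual numerically, also enforcing $\sum_i \alpha_i y_i = 0$ as derived earlier from $\partial L_p / \partial \beta_0 = 0$, then recover $\hat\beta$ as the weighted sum of support vectors:

```python
# Maximize  L_D(alpha) = sum_i alpha_i - 1/2 sum_i sum_k alpha_i alpha_k y_i y_k x_i^t x_k
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0, by minimizing -L_D.
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: class -1 near the origin, class +1 near (3, 3).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

Z = y[:, None] * X
G = Z @ Z.T                                  # G[i, k] = y_i y_k x_i^t x_k

def neg_dual(a):                             # negated dual objective
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
beta = (alpha * y) @ X                       # beta = sum_i alpha_i y_i x_i

# Support vectors have alpha_i > 0; beta_0 follows from y_j(beta^t x_j + beta_0) = 1
# on any support vector, i.e. beta_0 = y_j - beta^t x_j.
sv = alpha > 1e-4
beta_0 = np.mean(y[sv] - X[sv] @ beta)
print(alpha, beta, beta_0)
```

On this toy set the dual recovers the same hyperplane as the primal (β ≈ (0.4, 0.4), β0 ≈ −1.4), with nonzero weights only on the three points that touch the margin.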