Linear Models for Classification

assumes values in an unordered discrete set.

target value (class label) given a set of input values

x = {x1, . . . , xd} measured on the same object.

classes, in which case we usually code them as

y ∈ {0, 1}.


- Credit Scoring: applicant will default or not?
- SPAM filter: e-mail message is SPAM or not?
- Medical diagnosis: does patient have breast cancer?
- Handwritten digit recognition.
- Music Genre Classification.

At a particular point x the value of y is not uniquely determined. It can assume both its values with respective probabilities that depend on the location of the point x in the input space. We write

p(y = 1|x) = 1 − p(y = 0|x).

The goal of a classification procedure is to produce an estimate of p(y|x) at every input point x.


- Generative Models (density estimation).

distribution of y given x. The probability distribution of x itself is not modeled. For the binary classification problem:

p(y = 1|x) = f(x, w)

where f(x, w) is some deterministic function of x.

strategy.


Discriminative Models

Examples of discriminative classification methods:

- Linear probability model (this lecture)
- Logistic regression (this lecture)
- Classification Trees (Book: section 4.3)
- Feed-forward neural networks (Book: section 5.4)
- ...

Generative Models

An alternative paradigm for estimating p(y|x) is based on density estimation. Here Bayes’ theorem

$$ p(y = 1|x) = \frac{p(y = 1)\,p(x|y = 1)}{p(y = 1)\,p(x|y = 1) + p(y = 0)\,p(x|y = 0)} $$

is applied where p(x|y) are the class conditional probability density functions and p(y) are the unconditional (prior) probabilities of each class.
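The Bayes' theorem computation for the generative approach can be sketched in a few lines. This is only an illustration: the Gaussian class conditional densities, their means and standard deviation, and the equal priors are assumptions made here, not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Normal density; an illustrative (assumed) choice for p(x|y)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_class1(x, prior1, pdf1, pdf0):
    """p(y=1|x) from the prior p(y=1) and the class conditional densities."""
    numerator = prior1 * pdf1(x)
    denominator = numerator + (1.0 - prior1) * pdf0(x)
    return numerator / denominator

# Hypothetical setup: equal priors, unit-variance Gaussians centred at 0 and 2.
pdf0 = lambda x: gaussian_pdf(x, 0.0, 1.0)
pdf1 = lambda x: gaussian_pdf(x, 2.0, 1.0)

p_mid = posterior_class1(1.0, 0.5, pdf1, pdf0)    # halfway between the means
p_right = posterior_class1(3.0, 0.5, pdf1, pdf0)  # closer to the class 1 mean
print(p_mid, p_right)
```

With equal priors and symmetric densities the posterior at the midpoint is one half, and it moves toward 1 as x approaches the class 1 mean.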


Examples of density estimation based classification methods:

- Linear/Quadratic Discriminant Analysis (not discussed),
- Naive Bayes classifier (Book: section 5.3),
- ...

Consider the linear regression model

$$ y = w^T x + \varepsilon, \qquad y \in \{0, 1\} $$

Note:

$$ w^T = [\, w_0 \;\; w_1 \;\; \ldots \;\; w_d \,], \qquad x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} $$

so $w^T x = w_0 + \sum_{i=1}^{d} w_i x_i$.

By assumption $E[\varepsilon|x] = 0$, so we have

$$ E[y|x] = w^T x $$

But

$$ E[y|x] = 1 \cdot p(y = 1|x) + 0 \cdot p(y = 0|x) = p(y = 1|x) $$


[Figure: E[y|x] plotted against w^T x]

$$ E[y|x] = \frac{e^{w^T x}}{1 + e^{w^T x}} $$

or (divide numerator and denominator by $e^{w^T x}$)

$$ E[y|x] = \frac{1}{1 + e^{-w^T x}} = (1 + e^{-w^T x})^{-1} $$


Logistic Response Function

[Figure: logistic response function; vertical axis E[y|x] from 0.0 to 1.0]

Linearization: the logit transformation

Write z = w^T x:

$$
\begin{aligned}
\ln \frac{p(y = 1|x)}{1 - p(y = 1|x)}
  &= \ln \frac{(1 + e^{-z})^{-1}}{1 - (1 + e^{-z})^{-1}} \\
  &= \ln \frac{1}{(1 + e^{-z}) - 1} = \ln \frac{1}{e^{-z}} \\
  &= \ln e^{z} = z = w^T x
\end{aligned}
$$

In the second step, we divided the numerator and the denominator by $(1 + e^{-z})^{-1}$.

The ratio p(y = 1|x)/(1 − p(y = 1|x)) is called the odds.
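A small numerical check of the two forms of the logistic response function and of the logit transformation, as a sketch; the function names and the test values of z are chosen here for illustration.

```python
import math

def logistic(z):
    """Logistic response function in the form (1 + e^{-z})^{-1}."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_alt(z):
    """Equivalent form e^{z} / (1 + e^{z})."""
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    """Log odds ln(p / (1 - p)); the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    p = logistic(z)
    # Both forms of the response function agree, and the logit recovers z.
    print(z, p, logistic_alt(z), logit(p))
```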


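Assign to class 1 if p(y = 1|x) > p(y = 0|x), i.e. if

$$ \frac{p(y = 1|x)}{1 - p(y = 1|x)} > 1 $$

or equivalently, taking logarithms, if

$$ \ln \frac{p(y = 1|x)}{1 - p(y = 1|x)} > 0 $$

Since the log odds equal $w^T x$, this means: assign to class 1 if $w^T x > 0$, and to class 0 otherwise.

[Figure: the linear decision boundary $w^T x = w_0 + w_1 x_1 + w_2 x_2 = 0$ in the (x1, x2) plane]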
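The allocation rule amounts to checking the sign of w^T x. A minimal sketch follows; the weight vector and the two test points are hypothetical values picked only to show the rule in action.

```python
def linear_score(w, x):
    """w^T x with w = [w0, w1, ..., wd] and a leading 1 prepended to x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def classify(w, x):
    """Class 1 if the log odds w^T x are positive, class 0 otherwise."""
    return 1 if linear_score(w, x) > 0 else 0

# Hypothetical weights: the boundary is the line -1 + x1 + x2 = 0.
w = [-1.0, 1.0, 1.0]
print(classify(w, [2.0, 0.5]))  # point on the positive side of the line
print(classify(w, [0.2, 0.3]))  # point on the negative side of the line
```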


One coin flip

$$ p(y) = \mu^{y} (1 - \mu)^{1 - y} $$

Note that p(1) = µ, p(0) = 1 − µ as required.

Sequence of N independent coin flips

$$ p(y) = p(y_1, y_2, \ldots, y_N) = \prod_{i=1}^{N} \mu^{y_i} (1 - \mu)^{1 - y_i} $$

which defines the likelihood function when viewed as a function of µ.

y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0).

The corresponding likelihood function is

$$ p(y|\mu) = \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot \mu \cdot \mu \cdot (1 - \mu) = \mu^{7} (1 - \mu)^{3} $$

The corresponding loglikelihood function is

$$ \ln p(y|\mu) = \ln\left(\mu^{7} (1 - \mu)^{3}\right) = 7 \ln \mu + 3 \ln(1 - \mu) $$

Note: $\log ab = \log a + \log b$, $\log a^{b} = b \log a$.
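The likelihood and loglikelihood for the ten-flip sequence can be evaluated directly. The sketch below uses the sequence from the text; the trial value of µ is arbitrary and the function names are chosen here.

```python
import math

y = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]

def likelihood(mu, y):
    """Product of the Bernoulli probabilities mu^y (1 - mu)^(1 - y)."""
    result = 1.0
    for yi in y:
        result *= mu ** yi * (1.0 - mu) ** (1 - yi)
    return result

def loglikelihood(mu, y):
    """(number of ones) ln mu + (number of zeros) ln(1 - mu)."""
    ones = sum(y)
    zeros = len(y) - ones
    return ones * math.log(mu) + zeros * math.log(1.0 - mu)

mu = 0.5  # arbitrary trial value
print(likelihood(mu, y), loglikelihood(mu, y))
```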


Computing the maximum

Take the derivative of the loglikelihood with respect to µ and equate it to zero:

$$ \frac{d \ln p(y|\mu)}{d\mu} = \frac{7}{\mu} - \frac{3}{1 - \mu} = 0 $$

which yields maximum likelihood estimate $\mu_{ML} = 0.7$.

This is just the relative frequency of heads in the sample.

Note:

$$ \frac{d \ln x}{dx} = \frac{1}{x} $$

[Figure: Loglikelihood function for y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0); horizontal axis mu from 0.0 to 1.0, vertical axis loglikelihood from −30 to −10]
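The maximum of the coin-flip loglikelihood can also be located numerically. The sketch below uses a simple grid search as a stand-in for the calculus argument; the grid resolution is an arbitrary choice made here.

```python
import math

y = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
ones, zeros = sum(y), len(y) - sum(y)

def loglikelihood(mu):
    return ones * math.log(mu) + zeros * math.log(1.0 - mu)

def derivative(mu):
    """d ln p(y|mu) / d mu = ones / mu - zeros / (1 - mu)."""
    return ones / mu - zeros / (1.0 - mu)

# Grid over the open interval (0, 1); the endpoints are excluded because
# the logarithm is undefined there.
grid = [k / 1000 for k in range(1, 1000)]
mu_hat = max(grid, key=loglikelihood)
print(mu_hat, derivative(mu_hat))
```

The grid maximizer coincides with the relative frequency of ones, and the derivative vanishes there up to rounding.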

$$ \mu_i = p(y = 1|x_i) = (1 + e^{-w^T x_i})^{-1} $$

$$ 1 - \mu_i = p(y = 0|x_i) = (1 + e^{w^T x_i})^{-1} $$

we can represent its probability distribution as follows

$$ p(y_i) = \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}; \; i = 1, \ldots, N $$

i   xi   yi   p(yi)
1    8    0   (1 + e^{w0 + 8 w1})^{-1}
2   12    0   (1 + e^{w0 + 12 w1})^{-1}
3   15    1   (1 + e^{-w0 - 15 w1})^{-1}
4   10    1   (1 + e^{-w0 - 10 w1})^{-1}

$$ p(y|w) = (1 + e^{w_0 + 8 w_1})^{-1} \times (1 + e^{w_0 + 12 w_1})^{-1} \times (1 + e^{-w_0 - 15 w_1})^{-1} \times (1 + e^{-w_0 - 10 w_1})^{-1} $$


Since the yi observations are independent:

$$ p(y|w) = \prod_{i=1}^{N} p(y_i) = \prod_{i=1}^{N} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} $$

Taking the negative logarithm:

$$ -\ln p(y|w) = -\ln \prod_{i=1}^{N} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = -\sum_{i=1}^{N} \left\{ y_i \ln \mu_i + (1 - y_i) \ln(1 - \mu_i) \right\} $$

This is called the cross-entropy error function.

Since for the logistic regression model:

$$ \mu_i = (1 + e^{-w^T x_i})^{-1}, \qquad 1 - \mu_i = (1 + e^{w^T x_i})^{-1} $$

the error function becomes

$$ E(w) = \sum_{i=1}^{N} \left\{ y_i \ln(1 + e^{-w^T x_i}) + (1 - y_i) \ln(1 + e^{w^T x_i}) \right\} $$

- Non-linear function of the parameters.
- Loglikelihood function globally concave.
- Numerical Optimization required.
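To see that the cross-entropy error E(w) is just the negative logarithm of the product likelihood, both can be evaluated on the four-observation example (x = 8, 12, 15, 10 with y = 0, 0, 1, 1). This is a sketch; the trial values of w0 and w1 are arbitrary and are not fitted estimates.

```python
import math

# The four observations (x_i, y_i) from the small example table.
data = [(8, 0), (12, 0), (15, 1), (10, 1)]

def mu(w0, w1, x):
    """p(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def likelihood(w0, w1):
    """p(y|w) as the product of mu_i^{y_i} (1 - mu_i)^{1 - y_i}."""
    result = 1.0
    for x, y in data:
        m = mu(w0, w1, x)
        result *= m if y == 1 else 1.0 - m
    return result

def cross_entropy(w0, w1):
    """E(w) = sum of y ln(1 + e^{-z}) + (1 - y) ln(1 + e^{z}), z = w0 + w1 x."""
    total = 0.0
    for x, y in data:
        z = w0 + w1 * x
        total += y * math.log(1.0 + math.exp(-z)) + (1 - y) * math.log(1.0 + math.exp(z))
    return total

w0, w1 = -5.0, 0.4  # arbitrary trial values
print(likelihood(w0, w1), cross_entropy(w0, w1))
```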


Fitted Response Function

Substitute maximum likelihood estimates into the response function to obtain the fitted response function

$$ \hat{p}(y = 1|x) = \frac{e^{w_{ML}^T x}}{1 + e^{w_{ML}^T x}} $$

Example: Programming Assignment

Model the probability of successfully completing a programming assignment.

Explanatory variable: “programming experience”.

We find w0 = −3.0597 and w1 = 0.1615, so

$$ \hat{p}(y = 1|x_i) = \frac{e^{-3.0597 + 0.1615 x_i}}{1 + e^{-3.0597 + 0.1615 x_i}} $$

14 months of programming experience:

$$ \hat{p}(y = 1|x = 14) = \frac{e^{-3.0597 + 0.1615(14)}}{1 + e^{-3.0597 + 0.1615(14)}} \approx 0.31 $$
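The fitted probability for 14 months of experience can be checked with a few lines, using the estimates quoted in the text (rounded to four decimals, so the last digits may differ slightly from a fit with full precision). The function name is chosen here for illustration.

```python
import math

W0, W1 = -3.0597, 0.1615  # estimates quoted in the text

def fitted_probability(months):
    """Fitted response function e^{z} / (1 + e^{z}) with z = W0 + W1 * months."""
    z = W0 + W1 * months
    return math.exp(z) / (1.0 + math.exp(z))

p14 = fitted_probability(14)
print(round(p14, 2))
```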


     month.exp  success  fitted
 1      14         0     0.310262
 2      29         0     0.835263
 3       6         0     0.109996
 4      25         1     0.726602
 5      18         1     0.461837
 6       4         0     0.082130
 7      18         0     0.461837
 8      12         0     0.245666
 9      22         1     0.620812
10       6         0     0.109996
11      30         1     0.856299
12      11         0     0.216980
13      30         1     0.856299
14       5         0     0.095154
15      20         1     0.542404
16      13         0     0.276802
17       9         0     0.167100
18      32         1     0.891664
19      24         0     0.693379
20      13         1     0.276802
21      19         0     0.502134
22       4         0     0.082130
23      28         1     0.811825
24      22         1     0.620812
25       8         1     0.145815

Probability of the classes is equal when

−3.0597 + 0.1615x = 0

Solving for x we get x ≈ 18.95.

Allocation Rule:

x ≥ 19: assign to class 1
x < 19: assign to class 0
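As a sketch of the numerical optimization step, the estimates for the programming assignment data can be approximated with Newton-Raphson on the loglikelihood. Newton-Raphson is one common choice and is an assumption here; the lecture does not say which optimizer was used. The same block then applies the allocation rule x ≥ 19 and counts the resulting classifications.

```python
import math

# (months of experience, success) pairs from the table above.
data = [(14, 0), (29, 0), (6, 0), (25, 1), (18, 1), (4, 0), (18, 0), (12, 0),
        (22, 1), (6, 0), (30, 1), (11, 0), (30, 1), (5, 0), (20, 1), (13, 0),
        (9, 0), (32, 1), (24, 0), (13, 1), (19, 0), (4, 0), (28, 1), (22, 1),
        (8, 1)]

def fit_logistic(data, iterations=25):
    """Newton-Raphson for a one-variable logistic regression (illustrative)."""
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the loglikelihood
        h00 = h01 = h11 = 0.0  # entries of X^T S X with S = diag(mu (1 - mu))
        for x, y in data:
            mu = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            s = mu * (1.0 - mu)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += s
            h01 += s * x
            h11 += s * x * x
        det = h00 * h11 - h01 * h01
        # Newton step w <- w + (X^T S X)^{-1} gradient, using the 2x2 inverse.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        w0 += step0
        w1 += step1
    return w0, w1

w0, w1 = fit_logistic(data)
boundary = -w0 / w1  # value of x at which both classes are equally probable

# Cross-tabulate observed class against the allocation rule x >= 19.
counts = [[0, 0], [0, 0]]  # counts[observed][predicted]
for x, y in data:
    counts[y][1 if x >= 19 else 0] += 1
errors = counts[0][1] + counts[1][0]
print(w0, w1, boundary, counts, errors / len(data))
```

The starting point w = (0, 0) and the fixed number of iterations are arbitrary choices; a production fit would normally test for convergence instead.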


Cross table of observed and predicted class label:

       0    1
0     11    3
1      3    8

Row: observed, Column: predicted

Error rate: 6/25 = 0.24

Default: 11/25 = 0.44

Two possible causes:

a) Benign tumor (adenoma) of the adrenal cortex.
b) More diffuse affection of the adrenal glands (bilateral hyperplasia).

Pre-operative diagnosis on basis of

1. Sodium concentration (mmol/l)
2. CO2 concentration (mmol/l)


Conn’s syndrome: the data

     sodium   co2   cause
 1   140.6   30.3     0
 2   143.0   27.1     0
 3   140.0   27.0     0
 4   146.0   33.0     0
 5   138.7   24.1     0
 6   143.7   28.0     0
 7   137.3   29.6     0
 8   141.0   30.0     0
 9   143.8   32.2     0
10   144.6   29.5     0
11   139.5   26.0     0
12   144.0   33.7     0
13   145.0   33.0     0
14   140.2   29.1     0
15   144.7   27.4     0
16   139.0   31.4     0
17   144.8   33.5     0
18   145.7   27.4     0
19   144.0   33.0     0
20   143.5   27.5     0
21   140.3   23.4     1
22   141.2   25.8     1
23   142.0   22.0     1
24   143.5   27.8     1
25   139.7   28.0     1
26   141.1   25.0     1
27   141.0   26.0     1
28   140.5   27.0     1
29   140.0   26.0     1
30   140.0   25.6     1

Conn’s Syndrome: Plot of Data

a=1, b=0

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146); points are labelled a (cause = 1) and b (cause = 0)]


w0 = 36.6874320
w1 = −0.1164658
w2 = −0.7626711

Assign to group a if

36.69 − 0.12 × sodium − 0.76 × CO2 > 0

[Figure: scatter plot of co2 (vertical axis) against sodium (horizontal axis), points labelled a and b]


Conn’s Syndrome: Confusion Matrix

      a    b
a     7    3
b     2   18

Row: observed, Column: predicted

Error rate: 5/30 = 1/6

Default: 1/3

Conn’s Syndrome: Line with lower empirical error

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146), points labelled a and b]
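The confusion matrix for the fitted linear rule can be recomputed from the Conn's syndrome data and the reported coefficients. The sketch below assumes the coding a = 1 and b = 0 given with the data plot and applies the rule "assign to group a if w0 + w1 × sodium + w2 × CO2 > 0"; the variable names are chosen here.

```python
# (sodium, co2, cause) with cause 1 = group a, cause 0 = group b.
data = [
    (140.6, 30.3, 0), (143.0, 27.1, 0), (140.0, 27.0, 0), (146.0, 33.0, 0),
    (138.7, 24.1, 0), (143.7, 28.0, 0), (137.3, 29.6, 0), (141.0, 30.0, 0),
    (143.8, 32.2, 0), (144.6, 29.5, 0), (139.5, 26.0, 0), (144.0, 33.7, 0),
    (145.0, 33.0, 0), (140.2, 29.1, 0), (144.7, 27.4, 0), (139.0, 31.4, 0),
    (144.8, 33.5, 0), (145.7, 27.4, 0), (144.0, 33.0, 0), (143.5, 27.5, 0),
    (140.3, 23.4, 1), (141.2, 25.8, 1), (142.0, 22.0, 1), (143.5, 27.8, 1),
    (139.7, 28.0, 1), (141.1, 25.0, 1), (141.0, 26.0, 1), (140.5, 27.0, 1),
    (140.0, 26.0, 1), (140.0, 25.6, 1),
]

W0, W1, W2 = 36.6874320, -0.1164658, -0.7626711  # coefficients from the text

def predict(sodium, co2):
    """1 (group a) if the linear score is positive, else 0 (group b)."""
    return 1 if W0 + W1 * sodium + W2 * co2 > 0 else 0

# counts[observed][predicted], with index 1 = a and index 0 = b.
counts = [[0, 0], [0, 0]]
for sodium, co2, cause in data:
    counts[cause][predict(sodium, co2)] += 1

errors = counts[0][1] + counts[1][0]
print("a observed:", counts[1][1], "as a,", counts[1][0], "as b")
print("b observed:", counts[0][1], "as a,", counts[0][0], "as b")
print("error rate:", errors, "/", len(data))
```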


Likelihood and Error Rate

minimization!

i   yi   p̂1(yi = 1)   p̂2(yi = 1)
1   0    0.9          0.6
2   0    0.4          0.1
3   1    0.6          0.9
4   1    0.55         0.4

Which model has the lower error-rate?
Which one the higher likelihood?

Quadratic Model

(Intercept)     -13100.69
sodium             177.42
CO2                 41.36
sodium^2            -0.60
CO2^2               -0.12
sodium × CO2        -0.25

Cross table of observed (row) and predicted class label:

      a    b
a     8    2
b     2   18
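The two questions under "Likelihood and Error Rate" can be answered by direct computation on the four cases in that table. In this sketch a case is predicted as class 1 when the fitted probability exceeds 0.5, which is an assumption consistent with the allocation rule used earlier.

```python
y = [0, 0, 1, 1]
p1 = [0.9, 0.4, 0.6, 0.55]  # fitted p(y = 1) under model 1
p2 = [0.6, 0.1, 0.9, 0.4]   # fitted p(y = 1) under model 2

def error_rate(y, p):
    """Fraction of cases where thresholding p at 0.5 gives the wrong class."""
    wrong = sum(1 for yi, pi in zip(y, p) if (pi > 0.5) != (yi == 1))
    return wrong / len(y)

def likelihood(y, p):
    """Product of the probabilities assigned to the observed labels."""
    result = 1.0
    for yi, pi in zip(y, p):
        result *= pi if yi == 1 else 1.0 - pi
    return result

print("model 1:", error_rate(y, p1), likelihood(y, p1))
print("model 2:", error_rate(y, p2), likelihood(y, p2))
```

On these four cases model 1 has the lower error rate (1/4 against 2/4), while model 2 has the higher likelihood (0.1296 against 0.0198), so maximizing likelihood and minimizing error rate need not pick the same model.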


[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146), points labelled a and b]

