
# Data Mining: Linear Models for Classification

## Classification Problems

In a classification problem there is a target variable $y$ that assumes values in an unordered discrete set.

The goal of a classification procedure is to predict the target value (class label) given a set of input values $x = \{x_1, \dots, x_d\}$ measured on the same object.

An important special case is when there are only two classes, in which case we usually code them as $y \in \{0, 1\}$.

## Examples of Classification Problems

- Credit scoring: will the applicant default or not?
- Spam filter: is an e-mail message spam or not?
- Medical diagnosis: does the patient have breast cancer?
- Handwritten digit recognition.
- Music genre classification.

## Classification Problems

At a particular point $x$ the value of $y$ is not uniquely determined. It can assume both its values with respective probabilities that depend on the location of the point $x$ in the input space. We write

$$p(y = 1 \mid x) = 1 - p(y = 0 \mid x).$$

The goal of a classification procedure is to produce an estimate of $p(y \mid x)$ at every input point $x$.

Two types of models can be used to estimate $p(y \mid x)$:

- Discriminative models (regression).
- Generative models (density estimation).

## Discriminative Models

Discriminative methods only model the conditional distribution of $y$ given $x$. The probability distribution of $x$ itself is not modeled. For the binary classification problem:

$$p(y = 1 \mid x) = f(x, w)$$

where $f(x, w)$ is some deterministic function of $x$.

Note that the linear regression model follows the same strategy.

## logreg.pdf — May 4, 2010 — 1

## Discriminative Models

Examples of discriminative classification methods:

- Linear probability model (this lecture)
- Logistic regression (this lecture)
- Classification trees (Book: section 4.3)
- Feed-forward neural networks (Book: section 5.4)
- ...

## Generative Models

An alternative paradigm for estimating $p(y \mid x)$ is based on density estimation. Here Bayes' theorem

$$p(y = 1 \mid x) = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 1)\, p(x \mid y = 1) + p(y = 0)\, p(x \mid y = 0)}$$

is applied, where $p(x \mid y)$ are the class-conditional probability density functions and $p(y)$ are the unconditional (prior) probabilities of each class.
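As an illustration of the generative approach, the sketch below applies Bayes' theorem to two classes. The Gaussian class-conditional densities, their means and standard deviations, and the priors are assumptions chosen for the example; they do not come from the lecture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_class1(x, prior1, pdf1, pdf0):
    """p(y=1|x) from Bayes' theorem for a two-class problem."""
    numerator = prior1 * pdf1(x)
    denominator = numerator + (1.0 - prior1) * pdf0(x)
    return numerator / denominator

# Hypothetical class-conditional densities p(x|y=1) and p(x|y=0).
pdf1 = lambda x: gaussian_pdf(x, mean=2.0, std=1.0)
pdf0 = lambda x: gaussian_pdf(x, mean=0.0, std=1.0)

# x = 1.0 is equally likely under both classes, so the posterior equals the prior.
p_mid = posterior_class1(1.0, 0.5, pdf1, pdf0)
# x = 3.0 is much more likely under class 1.
p_far = posterior_class1(3.0, 0.5, pdf1, pdf0)
print(p_mid, p_far)
```

Where the two densities are equal, the posterior reduces to the prior; elsewhere the density ratio moves it towards one of the classes.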

## Generative Models

Examples of density estimation based classification methods:

- Linear/Quadratic Discriminant Analysis (not discussed),
- Naive Bayes classifier (Book: section 5.3),
- ...

## Discriminative Models: linear probability model

Consider the linear regression model

$$y = w^T x + \varepsilon, \qquad y \in \{0, 1\}$$

Note:

$$w^T = [\, w_0 \;\; w_1 \;\; \dots \;\; w_d \,], \qquad x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix},$$

so $w^T x = w_0 + \sum_{i=1}^{d} w_i x_i$.

By assumption $E[\varepsilon \mid x] = 0$, so we have

$$E[y \mid x] = w^T x$$

But

$$E[y \mid x] = 1 \cdot p(y = 1 \mid x) + 0 \cdot p(y = 0 \mid x) = p(y = 1 \mid x)$$
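In the linear probability model $w^T x$ is read as a probability, yet nothing constrains a linear function to the interval $[0, 1]$. The sketch below evaluates $w^T x$ for one input variable. The weights are hypothetical values picked only to show predictions falling outside that interval.

```python
def linear_predictor(w, x):
    """Return w^T x, where x is augmented with a leading 1 for the intercept w0."""
    augmented = [1.0] + list(x)
    return sum(wi * xi for wi, xi in zip(w, augmented))

# Hypothetical weights [w0, w1]; not estimated from any data in the lecture.
w = [-0.2, 0.05]

for x1 in (0.0, 10.0, 30.0):
    value = linear_predictor(w, [x1])
    inside = 0.0 <= value <= 1.0
    print(f"x1={x1:5.1f}  w^T x={value:+.2f}  valid probability: {inside}")
```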

[Figure: $E[y \mid x]$ (vertical axis) plotted against $w^T x$ (horizontal axis).]

## Logistic response function

$$E[y \mid x] = \frac{e^{w^T x}}{1 + e^{w^T x}}$$

or (divide numerator and denominator by $e^{w^T x}$)

$$E[y \mid x] = \frac{1}{1 + e^{-w^T x}} = (1 + e^{-w^T x})^{-1}$$
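The two forms of the logistic response function are algebraically identical, which can be checked numerically. The sketch below is one possible implementation; the branch on the sign of `z` is a common way to avoid overflow in `math.exp` and is not part of the lecture.

```python
import math

def logistic(z):
    """Logistic response (1 + e^{-z})^{-1}, evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so e^z < 1
    return ez / (1.0 + ez)

def logistic_ratio_form(z):
    """The first form: e^z / (1 + e^z). Overflows for very large z."""
    return math.exp(z) / (1.0 + math.exp(z))

for z in (-3.0, 0.0, 2.5):
    print(z, logistic(z), logistic_ratio_form(z))
```

Unlike $w^T x$ itself, the output always lies between 0 and 1, so it can be interpreted as a probability.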

## Logistic Response Function

[Figure: the logistic response function; vertical axis $E[y \mid x]$ with ticks at 0.0, 0.5 and 1.0.]

## Linearization: the logit transformation

Write $z = w^T x$:

$$
\begin{aligned}
\ln \frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)}
  &= \ln \frac{(1 + e^{-z})^{-1}}{1 - (1 + e^{-z})^{-1}} \\
  &= \ln \frac{1}{(1 + e^{-z}) - 1} = \ln \frac{1}{e^{-z}} \\
  &= \ln e^{z} = z = w^T x
\end{aligned}
$$

In the second step, we divided the numerator and the denominator by $(1 + e^{-z})^{-1}$.
The ratio $p(y = 1 \mid x) / (1 - p(y = 1 \mid x))$ is called the odds.
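The derivation says that the log odds undo the logistic response function. A minimal numerical check, with arbitrarily chosen values of `z`:

```python
import math

def logistic(z):
    """Logistic response function (1 + e^{-z})^{-1}."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in (-2.0, 0.0, 0.5, 3.0):
    p = logistic(z)
    odds = p / (1.0 - p)
    print(f"z={z:+.1f}  p={p:.4f}  odds={odds:.4f}  logit={logit(p):+.4f}")
```

The last column reproduces `z`, and odds of 1 (equal class probabilities) correspond to `z = 0`.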

## Linear Separation

Assign to class 1 if $p(y = 1 \mid x) > p(y = 0 \mid x)$, i.e. if

$$\frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)} > 1$$

or equivalently

$$\ln \frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)} > 0$$

So assign to class 1 if $w^T x > 0$, and to class 0 otherwise.

## Linear Decision Boundary

[Figure: the line $w^T x = w_0 + w_1 x_1 + w_2 x_2 = 0$ in the $(x_1, x_2)$ plane.]
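The rule "assign to class 1 if $w^T x > 0$" gives the same answer as comparing $p(y = 1 \mid x)$ with 0.5. The sketch below checks this for two inputs; the weights and test points are hypothetical.

```python
import math

def score(w, x):
    """w^T x with w = [w0, w1, ..., wd] and x = [x1, ..., xd]."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def classify(w, x):
    """Class 1 if w^T x > 0, class 0 otherwise."""
    return 1 if score(w, x) > 0 else 0

def prob_class1(w, x):
    """p(y=1|x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-score(w, x)))

# Hypothetical weights: the boundary is the line -1 + 2*x1 - x2 = 0.
w = [-1.0, 2.0, -1.0]
points = [(1.0, 0.5), (0.0, 0.0), (2.0, 4.0), (0.5, 0.0)]

labels = [classify(w, x) for x in points]
print(labels)
```

The last point lies exactly on the boundary, where both classes have probability 0.5; the rule as stated assigns it to class 0.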

## Coin Flips: likelihood function

$y = 1$ if heads, $y = 0$ if tails. Let $\mu = p(y = 1)$.

One coin flip:

$$p(y) = \mu^{y} (1 - \mu)^{1 - y}$$

Note that $p(1) = \mu$, $p(0) = 1 - \mu$ as required.

Sequence of $N$ independent coin flips:

$$p(y) = p(y_1, y_2, \dots, y_N) = \prod_{i=1}^{N} \mu^{y_i} (1 - \mu)^{1 - y_i}$$

which defines the likelihood function when viewed as a function of $\mu$.

## Coin Flips: example

In a sequence of 10 coin flips we observe $y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)$.

The corresponding likelihood function is

$$p(y \mid \mu) = \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot (1 - \mu) \cdot \mu \cdot \mu \cdot \mu \cdot \mu \cdot (1 - \mu) = \mu^{7} (1 - \mu)^{3}$$

The corresponding loglikelihood function is

$$\ln p(y \mid \mu) = \ln\left(\mu^{7} (1 - \mu)^{3}\right) = 7 \ln \mu + 3 \ln(1 - \mu)$$

Note: $\log ab = \log a + \log b$, $\log a^{b} = b \log a$.
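The product and sum forms of the coin-flip likelihood can be compared directly. The sketch below uses the observed sequence from the example; the function names are illustrative.

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0]  # 7 heads, 3 tails

def likelihood(mu, ys):
    """Product of mu^y * (1 - mu)^(1 - y) over all flips."""
    result = 1.0
    for y in ys:
        result *= mu ** y * (1.0 - mu) ** (1 - y)
    return result

def log_likelihood(mu, ys):
    """Sum of y*ln(mu) + (1 - y)*ln(1 - mu) over all flips."""
    return sum(y * math.log(mu) + (1 - y) * math.log(1.0 - mu) for y in ys)

for mu in (0.3, 0.5, 0.7):
    print(mu, likelihood(mu, flips), log_likelihood(mu, flips))
```

Working with the sum of logs rather than the product also avoids numerical underflow when $N$ is large.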

## Computing the maximum

To determine the maximum we take the derivative and equate it to zero

$$\frac{d \ln p(y \mid \mu)}{d\mu} = \frac{7}{\mu} - \frac{3}{1 - \mu} = 0$$

which yields maximum likelihood estimate $\mu_{ML} = 0.7$.

This is just the relative frequency of heads in the sample.

Note:

$$\frac{d \ln x}{dx} = \frac{1}{x}$$

## Loglikelihood function for y = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)

[Figure: loglikelihood (vertical axis, ticks from −30 to −10) plotted against mu (horizontal axis, 0.0 to 1.0).]
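The maximum of the coin-flip loglikelihood can also be located numerically, which is a useful sanity check on the derivative. The sketch below evaluates the loglikelihood on a grid; the grid spacing of 0.001 is an arbitrary choice.

```python
import math

heads, tails = 7, 3

def log_likelihood(mu):
    """7 ln(mu) + 3 ln(1 - mu) for the observed sequence."""
    return heads * math.log(mu) + tails * math.log(1.0 - mu)

def derivative(mu):
    """d/dmu of the loglikelihood: 7/mu - 3/(1 - mu)."""
    return heads / mu - tails / (1.0 - mu)

# Evaluate on an open grid (0, 1); the endpoints are excluded because ln(0) is undefined.
grid = [k / 1000 for k in range(1, 1000)]
mu_best = max(grid, key=log_likelihood)

print("grid maximum at mu =", mu_best)
print("derivative at that point:", derivative(mu_best))
```

The grid maximum agrees with the closed-form estimate, the relative frequency of heads.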

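## Logistic Regression (LR)

Now the probability of success depends on $x_i$:

$$\mu_i = p(y = 1 \mid x_i) = (1 + e^{-w^T x_i})^{-1}$$

$$1 - \mu_i = p(y = 0 \mid x_i) = (1 + e^{w^T x_i})^{-1}$$

We can represent the probability distribution of $y_i$ as follows:

$$p(y_i) = \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}; \; i = 1, \dots, N$$

## Example

| $i$ | $x_i$ | $y_i$ | $p(y_i)$ |
|---|---|---|---|
| 1 | 8 | 0 | $(1 + e^{w_0 + 8 w_1})^{-1}$ |
| 2 | 12 | 0 | $(1 + e^{w_0 + 12 w_1})^{-1}$ |
| 3 | 15 | 1 | $(1 + e^{-w_0 - 15 w_1})^{-1}$ |
| 4 | 10 | 1 | $(1 + e^{-w_0 - 10 w_1})^{-1}$ |

$$p(y \mid w) = (1 + e^{w_0 + 8 w_1})^{-1} \times (1 + e^{w_0 + 12 w_1})^{-1} \times (1 + e^{-w_0 - 15 w_1})^{-1} \times (1 + e^{-w_0 - 10 w_1})^{-1}$$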
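For the four observations in the example, the likelihood can be computed either from the general formula $\mu_i^{y_i}(1 - \mu_i)^{1 - y_i}$ or from the written-out product. The sketch below compares the two; the parameter values passed in are hypothetical, since the example does not fix $w_0$ and $w_1$.

```python
import math

xs = [8, 12, 15, 10]
ys = [0, 0, 1, 1]

def mu(w0, w1, x):
    """p(y=1|x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

def likelihood(w0, w1):
    """General form: product of mu_i if y_i = 1, else 1 - mu_i."""
    result = 1.0
    for x, y in zip(xs, ys):
        m = mu(w0, w1, x)
        result *= m if y == 1 else 1.0 - m
    return result

def likelihood_written_out(w0, w1):
    """The four factors exactly as they appear in the example."""
    return ((1 + math.exp(w0 + 8 * w1)) ** -1
            * (1 + math.exp(w0 + 12 * w1)) ** -1
            * (1 + math.exp(-w0 - 15 * w1)) ** -1
            * (1 + math.exp(-w0 - 10 * w1)) ** -1)

# Hypothetical parameter values, chosen only to exercise the formulas.
for w0, w1 in [(0.0, 0.0), (-3.0, 0.25)]:
    print(w0, w1, likelihood(w0, w1), likelihood_written_out(w0, w1))
```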

## LR: likelihood function

Since the $y_i$ observations are independent:

$$p(y \mid w) = \prod_{i=1}^{N} p(y_i) = \prod_{i=1}^{N} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$$

Or, taking minus the natural log:

$$-\ln p(y \mid w) = -\ln \prod_{i=1}^{N} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = -\sum_{i=1}^{N} \left\{ y_i \ln \mu_i + (1 - y_i) \ln(1 - \mu_i) \right\}$$

This is called the cross-entropy error function.

Comparable to sum of squared errors for regression problems.

## LR: error function

Since for the logistic regression model:

$$\mu_i = (1 + e^{-w^T x_i})^{-1}, \qquad 1 - \mu_i = (1 + e^{w^T x_i})^{-1}$$

we get

$$E(w) = \sum_{i=1}^{N} \left\{ y_i \ln(1 + e^{-w^T x_i}) + (1 - y_i) \ln(1 + e^{w^T x_i}) \right\}$$

- Non-linear function of the parameters.
- Loglikelihood function globally concave (equivalently, $E(w)$ is convex).
- Numerical optimization required.
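The cross-entropy expressed through $\mu_i$ and the error function $E(w)$ expressed through $w^T x_i$ are the same quantity. The sketch below evaluates both on the four-observation example used earlier; the parameter values are hypothetical.

```python
import math

xs = [8, 12, 15, 10]
ys = [0, 0, 1, 1]

def cross_entropy(w0, w1):
    """-sum{ y ln(mu) + (1 - y) ln(1 - mu) } with mu from the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        m = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
        total -= y * math.log(m) + (1 - y) * math.log(1.0 - m)
    return total

def error_function(w0, w1):
    """E(w) = sum{ y ln(1 + e^{-z}) + (1 - y) ln(1 + e^{z}) }, z = w0 + w1*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w0 + w1 * x
        total += y * math.log(1.0 + math.exp(-z)) + (1 - y) * math.log(1.0 + math.exp(z))
    return total

# Hypothetical parameter values.
for w0, w1 in [(0.0, 0.0), (-3.0, 0.25), (-1.0, 0.1)]:
    print(w0, w1, cross_entropy(w0, w1), error_function(w0, w1))
```

At $w = 0$ every $\mu_i$ equals 0.5, so the error is $N \ln 2$; any useful fit should do better than that.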

## Fitted Response Function

Substitute maximum likelihood estimates into the response function to obtain the fitted response function

$$\hat{p}(y = 1 \mid x) = \frac{e^{w_{ML}^T x}}{1 + e^{w_{ML}^T x}}$$

## Example: Programming Assignment

Model the probability of successfully completing a programming assignment.

Explanatory variable: "programming experience".

We find $w_0 = -3.0597$ and $w_1 = 0.1615$, so

$$\hat{p}(y = 1 \mid x_i) = \frac{e^{-3.0597 + 0.1615 x_i}}{1 + e^{-3.0597 + 0.1615 x_i}}$$

14 months of programming experience:

$$\hat{p}(y = 1 \mid x = 14) = \frac{e^{-3.0597 + 0.1615(14)}}{1 + e^{-3.0597 + 0.1615(14)}} \approx 0.31$$
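The fitted probability for 14 months of experience can be reproduced with a few lines. The sketch uses the rounded coefficients quoted above, so later decimals may differ slightly from values computed with unrounded estimates.

```python
import math

W0, W1 = -3.0597, 0.1615  # rounded maximum likelihood estimates from the example

def fitted_probability(months):
    """Fitted response e^z / (1 + e^z) with z = W0 + W1 * months."""
    z = W0 + W1 * months
    return math.exp(z) / (1.0 + math.exp(z))

p14 = fitted_probability(14)
print(f"p(y=1 | x=14) = {p14:.2f}")
```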

## Example: Programming Assignment

| | month.exp | success | fitted |
|---|---|---|---|
| 1 | 14 | 0 | 0.310262 |
| 2 | 29 | 0 | 0.835263 |
| 3 | 6 | 0 | 0.109996 |
| 4 | 25 | 1 | 0.726602 |
| 5 | 18 | 1 | 0.461837 |
| 6 | 4 | 0 | 0.082130 |
| 7 | 18 | 0 | 0.461837 |
| 8 | 12 | 0 | 0.245666 |
| 9 | 22 | 1 | 0.620812 |
| 10 | 6 | 0 | 0.109996 |
| 11 | 30 | 1 | 0.856299 |
| 12 | 11 | 0 | 0.216980 |
| 13 | 30 | 1 | 0.856299 |
| 14 | 5 | 0 | 0.095154 |
| 15 | 20 | 1 | 0.542404 |
| 16 | 13 | 0 | 0.276802 |
| 17 | 9 | 0 | 0.167100 |
| 18 | 32 | 1 | 0.891664 |
| 19 | 24 | 0 | 0.693379 |
| 20 | 13 | 1 | 0.276802 |
| 21 | 19 | 0 | 0.502134 |
| 22 | 4 | 0 | 0.082130 |
| 23 | 28 | 1 | 0.811825 |
| 24 | 22 | 1 | 0.620812 |
| 25 | 8 | 1 | 0.145815 |

## Allocation Rule

Probability of the classes is equal when

$$-3.0597 + 0.1615 x = 0$$

Solving for $x$ we get $x \approx 18.95$.

Allocation Rule:

- $x \geq 19$: assign to class 1
- $x < 19$: assign to class 0
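The estimates and the allocation rule can be reproduced from the 25 observations. The sketch below fits the model with Newton-Raphson, one possible numerical optimizer; the lecture does not say which method was used. The gradient and Hessian expressions are the standard ones for logistic regression and are not taken from the slides. It then applies the rule $x \geq 19$ and cross-tabulates observed against predicted labels.

```python
import math

# (months of experience, success) for the 25 observations in the table.
data = [(14, 0), (29, 0), (6, 0), (25, 1), (18, 1), (4, 0), (18, 0), (12, 0),
        (22, 1), (6, 0), (30, 1), (11, 0), (30, 1), (5, 0), (20, 1), (13, 0),
        (9, 0), (32, 1), (24, 0), (13, 1), (19, 0), (4, 0), (28, 1), (22, 1),
        (8, 1)]

def fit_logistic(observations, iterations=25):
    """Newton-Raphson for a logistic regression with one input (w0, w1)."""
    w0 = w1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in observations:
            mu = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
            g0 += y - mu                 # gradient of the loglikelihood
            g1 += (y - mu) * x
            v = mu * (1.0 - mu)          # entries of the negated Hessian
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * step = g and take the Newton step.
        w0 += (h11 * g0 - h01 * g1) / det
        w1 += (h00 * g1 - h01 * g0) / det
    return w0, w1

w0, w1 = fit_logistic(data)
boundary = -w0 / w1                      # where both classes have probability 0.5

# Rows: observed class, columns: predicted class.
confusion = [[0, 0], [0, 0]]
for x, y in data:
    predicted = 1 if x >= 19 else 0
    confusion[y][predicted] += 1

errors = confusion[0][1] + confusion[1][0]
print(f"w0={w0:.4f}  w1={w1:.4f}  boundary={boundary:.2f}")
print(confusion, f"error rate={errors / len(data):.2f}")
```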

## Programming Assignment: Confusion Matrix

Cross table of observed and predicted class label:

| | 0 | 1 |
|---|---|---|
| 0 | 11 | 3 |
| 1 | 3 | 8 |

Row: observed, Column: predicted

Error rate: 6/25 = 0.24

Default (always predict the majority class 0): 11/25 = 0.44

## Example: Conn's syndrome

Two possible causes:

a) Benign tumor (adenoma) of the adrenal cortex.
b) More diffuse affection of the adrenal glands (bilateral hyperplasia).

Pre-operative diagnosis on basis of

1. Sodium concentration (mmol/l)
2. CO2 concentration (mmol/l)

## Conn's syndrome: the data

Coding of cause: a = 1, b = 0.

| | sodium | co2 | cause |
|---|---|---|---|
| 1 | 140.6 | 30.3 | 0 |
| 2 | 143.0 | 27.1 | 0 |
| 3 | 140.0 | 27.0 | 0 |
| 4 | 146.0 | 33.0 | 0 |
| 5 | 138.7 | 24.1 | 0 |
| 6 | 143.7 | 28.0 | 0 |
| 7 | 137.3 | 29.6 | 0 |
| 8 | 141.0 | 30.0 | 0 |
| 9 | 143.8 | 32.2 | 0 |
| 10 | 144.6 | 29.5 | 0 |
| 11 | 139.5 | 26.0 | 0 |
| 12 | 144.0 | 33.7 | 0 |
| 13 | 145.0 | 33.0 | 0 |
| 14 | 140.2 | 29.1 | 0 |
| 15 | 144.7 | 27.4 | 0 |
| 16 | 139.0 | 31.4 | 0 |
| 17 | 144.8 | 33.5 | 0 |
| 18 | 145.7 | 27.4 | 0 |
| 19 | 144.0 | 33.0 | 0 |
| 20 | 143.5 | 27.5 | 0 |
| 21 | 140.3 | 23.4 | 1 |
| 22 | 141.2 | 25.8 | 1 |
| 23 | 142.0 | 22.0 | 1 |
| 24 | 143.5 | 27.8 | 1 |
| 25 | 139.7 | 28.0 | 1 |
| 26 | 141.1 | 25.0 | 1 |
| 27 | 141.0 | 26.0 | 1 |
| 28 | 140.5 | 27.0 | 1 |
| 29 | 140.0 | 26.0 | 1 |
| 30 | 140.0 | 25.6 | 1 |

## Conn's Syndrome: Plot of Data

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146); each point is labeled a or b.]

## Conn's Syndrome: Maximum Likelihood Estimates

The maximum likelihood estimates are:

$$w_0 = 36.6874320, \qquad w_1 = -0.1164658, \qquad w_2 = -0.7626711$$

Assign to group a if

$$36.69 - 0.12 \times \text{sodium} - 0.76 \times \text{CO2} > 0$$

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146); each point is labeled a or b.]
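Applying the rule "assign to group a if $w_0 + w_1 \cdot \text{sodium} + w_2 \cdot \text{CO2} > 0$" to the 30 patients gives the cross table of observed and predicted labels. The sketch below uses the full-precision estimates and the data as listed; the variable names are illustrative.

```python
# Maximum likelihood estimates quoted in the lecture.
W0, W1, W2 = 36.6874320, -0.1164658, -0.7626711

# (sodium, co2, cause) with cause coded a = 1, b = 0.
patients = [
    (140.6, 30.3, 0), (143.0, 27.1, 0), (140.0, 27.0, 0), (146.0, 33.0, 0),
    (138.7, 24.1, 0), (143.7, 28.0, 0), (137.3, 29.6, 0), (141.0, 30.0, 0),
    (143.8, 32.2, 0), (144.6, 29.5, 0), (139.5, 26.0, 0), (144.0, 33.7, 0),
    (145.0, 33.0, 0), (140.2, 29.1, 0), (144.7, 27.4, 0), (139.0, 31.4, 0),
    (144.8, 33.5, 0), (145.7, 27.4, 0), (144.0, 33.0, 0), (143.5, 27.5, 0),
    (140.3, 23.4, 1), (141.2, 25.8, 1), (142.0, 22.0, 1), (143.5, 27.8, 1),
    (139.7, 28.0, 1), (141.1, 25.0, 1), (141.0, 26.0, 1), (140.5, 27.0, 1),
    (140.0, 26.0, 1), (140.0, 25.6, 1),
]

def predict_a(sodium, co2):
    """True if the linear score is positive, i.e. the patient is assigned to group a."""
    return W0 + W1 * sodium + W2 * co2 > 0

# Rows: observed (a, b); columns: predicted (a, b).
confusion = [[0, 0], [0, 0]]
for sodium, co2, cause in patients:
    row = 0 if cause == 1 else 1
    col = 0 if predict_a(sodium, co2) else 1
    confusion[row][col] += 1

errors = confusion[0][1] + confusion[1][0]
print(confusion, f"error rate = {errors}/{len(patients)}")
```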

## Conn's Syndrome: Confusion Matrix

Cross table of observed and predicted class label:

| | a | b |
|---|---|---|
| a | 7 | 3 |
| b | 2 | 18 |

Row: observed, Column: predicted

Error rate: 5/30 = 1/6

Default (always predict the majority class b): 10/30 = 1/3

## Conn's Syndrome: Line with lower empirical error

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146), points labeled a or b, with a line that has a lower empirical error.]

## Likelihood and Error Rate

Likelihood maximization is not the same as error rate minimization!

| $i$ | $y_i$ | $\hat{p}_1(y_i = 1)$ | $\hat{p}_2(y_i = 1)$ |
|---|---|---|---|
| 1 | 0 | 0.9 | 0.6 |
| 2 | 0 | 0.4 | 0.1 |
| 3 | 1 | 0.6 | 0.9 |
| 4 | 1 | 0.55 | 0.4 |

Which model has the lower error rate?
Which one has the higher likelihood?

## Quadratic Model

| Coefficient | Value |
|---|---|
| (Intercept) | -13100.69 |
| sodium | 177.42 |
| CO2 | 41.36 |
| sodium² | -0.60 |
| CO2² | -0.12 |
| sodium × CO2 | -0.25 |

Cross table of observed (row) and predicted class label:

| | a | b |
|---|---|---|
| a | 8 | 2 |
| b | 2 | 18 |
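Returning to the two questions under "Likelihood and Error Rate": both quantities can be computed directly from the four rows of that table. The sketch below assumes the usual allocation rule (predict class 1 when the estimated probability exceeds 0.5); the function names are illustrative.

```python
ys = [0, 0, 1, 1]
p1 = [0.9, 0.4, 0.6, 0.55]   # model 1: estimated p(y_i = 1)
p2 = [0.6, 0.1, 0.9, 0.4]    # model 2: estimated p(y_i = 1)

def error_rate(labels, probs):
    """Fraction of observations misclassified with a 0.5 threshold."""
    errors = sum(1 for y, p in zip(labels, probs) if (p > 0.5) != (y == 1))
    return errors / len(labels)

def likelihood(labels, probs):
    """Product of p_i if y_i = 1, else 1 - p_i."""
    result = 1.0
    for y, p in zip(labels, probs):
        result *= p if y == 1 else 1.0 - p
    return result

for name, probs in (("model 1", p1), ("model 2", p2)):
    print(name, "error rate:", error_rate(ys, probs),
          "likelihood:", round(likelihood(ys, probs), 4))
```

Model 1 misclassifies one observation but is penalized heavily for its confident wrong prediction on observation 1, so model 2 has the higher likelihood despite making two errors.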

## Conn’s Syndrome: Quadratic Specification

[Figure: scatter plot of co2 (vertical axis, 22 to 34) against sodium (horizontal axis, 138 to 146), points labeled a or b, for the quadratic specification.]