
Probabilistic generative models

Linear models for classification



Probabilistic view of classification


To show how models with linear decision boundaries arise from simple assumptions about the data

In a generative approach to classification, we first model the class-conditional densities p(x|Ck ) and the class priors p(Ck ), and then
◮ we compute the posterior probabilities p(Ck |x) through Bayes' theorem



Probabilistic generative models (cont.)


For the two-class problem, the posterior probability of class C1 is

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ] = 1 / (1 + exp(−a)) = σ(a)    (1)

where the denominator is p(x) = Σ_k p(x, Ck) = Σ_k p(x|Ck)p(Ck), and we have defined

    a = ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ]    (2)

σ(a) is the logistic sigmoid function (plotted in red in the figure)

    σ(a) = 1 / (1 + exp(−a))    (3)

also known as the squashing function, because it maps the whole real axis R into a finite interval

[Figure: the logistic sigmoid σ(a), plotted for a between −5 and 5]
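
A minimal numerical sketch of eqs. (1)-(3), assuming (purely for illustration) one-dimensional Gaussian class-conditional densities and made-up priors; the helper names are my own:

    import numpy as np
    from scipy.stats import norm

    def sigmoid(a):
        # logistic sigmoid, eq. (3)
        return 1.0 / (1.0 + np.exp(-a))

    # hypothetical 1-D class-conditional densities and priors (illustrative only)
    p_x_C1 = lambda x: norm.pdf(x, loc=1.0, scale=1.0)
    p_x_C2 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)
    p_C1, p_C2 = 0.4, 0.6

    x = 0.3
    a = np.log((p_x_C1(x) * p_C1) / (p_x_C2(x) * p_C2))                  # eq. (2)
    direct = p_x_C1(x) * p_C1 / (p_x_C1(x) * p_C1 + p_x_C2(x) * p_C2)    # eq. (1), via Bayes
    assert np.isclose(sigmoid(a), direct)                                # the two forms agree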
Probabilistic generative models (cont.)

The logistic sigmoid plays an important role in many classification algorithms. It satisfies the following symmetry property

    σ(−a) = 1 − σ(a)    (4)

The inverse of the logistic sigmoid is known as the logit function

    a = ln ( σ / (1 − σ) )    (5)

It represents the log of the ratio of probabilities for the two classes, ln (p(C1|x)/p(C2|x)), also known as the log odds
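
A quick numerical check of the symmetry property (4) and of the logit (5) as the inverse of the sigmoid; a short sketch with helper names of my own choosing:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logit(s):
        # inverse of the logistic sigmoid, eq. (5)
        return np.log(s / (1.0 - s))

    a = np.linspace(-5, 5, 11)
    assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))   # symmetry, eq. (4)
    assert np.allclose(logit(sigmoid(a)), a)            # logit undoes the sigmoid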


Probabilistic generative models (cont.)

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ]
            = 1 / ( 1 + exp( − ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ] ) )
            = σ( ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ] )

where the argument of the exponential and of the sigmoid is the quantity a defined in (2)

We have written the posterior probability in an equivalent form that will have significance when a(x) is a linear function of x
◮ in that case, the posterior probability is governed by a generalised linear model
Probabilistic generative models (cont.)

For the case of K > 2 classes, we have

    p(Ck|x) = p(x|Ck)p(Ck) / Σ_j p(x|Cj)p(Cj) = exp(ak) / Σ_j exp(aj)    (6)

known as the normalised exponential¹

We have defined the quantities ak as

    ak = ln ( p(x|Ck)p(Ck) )    (7)

If ak >> aj for all j ≠ k, then p(Ck|x) ≃ 1 and p(Cj|x) ≃ 0

We analyse the consequences of choosing specific forms for class-conditional densities

¹ It is a multiclass generalisation of the logistic sigmoid and it is also known as the softmax function
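
A minimal sketch of the normalised exponential (softmax) of eq. (6), computed from the quantities ak of eq. (7); subtracting the maximum is a standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(a):
        # normalised exponential, eq. (6)
        a = a - np.max(a)            # stability shift; cancels in the ratio
        e = np.exp(a)
        return e / e.sum()

    a = np.array([5.0, 1.0, -2.0])   # hypothetical ak = ln( p(x|Ck) p(Ck) )
    print(softmax(a))                # posteriors p(Ck|x); the largest ak dominates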
Continuous inputs

Let us assume that the class-conditional densities p(x|Ck ) are Gaussian


    p(x|Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ − (1/2) (x − µk)^T Σ^{-1} (x − µk) }    (8)

and we want to explore the form of the posterior probabilities p(Ck |x)

The Gaussians have different means µ k but share the same covariance matrix Σ
Continuous inputs (cont.)
With two classes, using

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ] = 1 / (1 + exp(−a)) = σ(a)

and a = ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ], we have

    p(C1|x) = σ(w^T x + w0)    (9)

where

    w = Σ^{-1} (µ1 − µ2)    (10)

    w0 = − (1/2) µ1^T Σ^{-1} µ1 + (1/2) µ2^T Σ^{-1} µ2 + ln ( p(C1) / p(C2) )    (11)

The quadratic terms in x from the exponents of the Gaussian densities have
cancelled (due to the assumption of common covariance matrices) leading to
◮ a linear function of x in the argument of the logistic sigmoid
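
A minimal sketch of eqs. (9)-(11) for two Gaussian classes sharing a covariance matrix; the means, covariance, priors, and test point below are made-up illustrative values:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # illustrative parameters (not taken from the text)
    mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    p1, p2 = 0.5, 0.5

    Sinv = np.linalg.inv(Sigma)
    w  = Sinv @ (mu1 - mu2)                                 # eq. (10)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2
          + np.log(p1 / p2))                                # eq. (11)

    x = np.array([0.2, -0.5])
    print(sigmoid(w @ x + w0))                              # p(C1|x), eq. (9)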
Continuous inputs (cont.)
The left-hand plot shows the class-conditional densities for two classes over a two-dimensional input space

The posterior probability p(C1|x) is a logistic sigmoid of a linear function of x

The surface in the right-hand plot is coloured using a proportion of red


given by p(C1|x) and a proportion of blue given by p(C2|x) = 1 − p(C1|x)
Continuous inputs (cont.)

Decision boundaries are surfaces with constant posterior probabilities p(Ck |x)
◮ Given by linear functions of x
◮ Decision boundaries are linear in input space

Prior probabilities p(Ck ) enter only through the bias parameter w0, so changes in the priors have the effect of making parallel shifts of the decision boundary
◮ and, more generally, of the parallel contours of constant posterior probability
Continuous inputs (cont.)
For the K-class case, using

    p(Ck|x) = p(x|Ck)p(Ck) / Σ_{j=1}^{K} p(x|Cj)p(Cj) = exp(ak) / Σ_{j=1}^{K} exp(aj)

and ak = ln ( p(x|Ck)p(Ck) ), we have

    ak(x) = wk^T x + wk0    (12)

where

    wk = Σ^{-1} µk    (13)

    wk0 = − (1/2) µk^T Σ^{-1} µk + ln p(Ck)    (14)

The ak (x) are again linear functions of x as a consequence of the cancellation of the quadratic terms due to the shared covariances
The resulting decision boundaries (minimum misclassification rate)
occur when two of the posterior probabilities (the two largest) are
equal, and so they are defined by linear functions of x
◮ again we have a generalised linear model
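
A short sketch of eqs. (12)-(14): each class contributes a linear discriminant ak(x) = wk^T x + wk0, and the class with the largest ak (equivalently, the largest softmax output) is selected. All parameter values here are illustrative:

    import numpy as np

    # illustrative parameters for K = 3 classes in D = 2 dimensions
    mus    = [np.array([1.0, 1.0]), np.array([-1.0, 1.0]), np.array([0.0, -1.5])]
    priors = [0.3, 0.3, 0.4]
    Sigma  = np.eye(2)                               # shared covariance matrix
    Sinv   = np.linalg.inv(Sigma)

    def a_k(x, mu, prior):
        w  = Sinv @ mu                               # eq. (13)
        w0 = -0.5 * mu @ Sinv @ mu + np.log(prior)   # eq. (14)
        return w @ x + w0                            # eq. (12), linear in x

    x   = np.array([0.5, 0.2])
    aks = [a_k(x, mu, p) for mu, p in zip(mus, priors)]
    print("predicted class:", int(np.argmax(aks)))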
Continuous inputs (cont.)

If we relax the assumption of a shared covariance matrix and allow each


class-conditional density p(x|Ck ) to have its own covariance matrix Σ k , then
the earlier cancellations will no longer occur, and we will obtain
quadratic functions of x, giving rise to a quadratic discriminant
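
If each class keeps its own covariance matrix, the quadratic term no longer cancels; a sketch of the resulting quadratic discriminant, using ak(x) = ln p(x|Ck) + ln p(Ck) with made-up per-class parameters:

    import numpy as np
    from scipy.stats import multivariate_normal

    # illustrative per-class parameters: (mu_k, Sigma_k, prior)
    params = [
        (np.array([1.0, 1.0]),  np.eye(2),                           0.5),
        (np.array([-1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]),  0.5),
    ]

    def a_k(x, mu, Sigma, prior):
        # quadratic in x because Sigma_k differs between classes
        return multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(prior)

    x = np.array([0.0, 0.5])
    scores = [a_k(x, *p) for p in params]
    print("predicted class:", int(np.argmax(scores)))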

Continuous inputs (cont.)


Class-conditional densities for three classes each having a Gaussian distribution
◮ red and green classes have the same covariance matrix


The corresponding posterior probabilities and the decision boundaries


◮ Linear boundary between red and green classes, same covariance matrix

◮ Quadratic boundaries between the other pairs, which have different covariance matrices


Maximum likelihood solution

Once we have specified a parametric functional form for the class-conditional densities p(x|Ck ), we can determine their parameters, together with the prior class probabilities p(Ck )
◮ using maximum likelihood

This requires a dataset comprising observations of x together with their corresponding class labels

Maximum likelihood solution (cont.)

Consider first the two-class case, each class having a Gaussian density with a shared covariance matrix Σ, and suppose we have a dataset {xn , tn }, n = 1, . . . , N

    tn = 1 denotes class C1, with prior probability p(C1) = π
    tn = 0 denotes class C2, with prior probability p(C2) = 1 − π

For a data point xn from class C1 (tn = 1), we have

    p(xn , C1) = p(C1) p(xn |C1) = π N(xn |µ1, Σ)

and similarly, for a data point xn from class C2 (tn = 0),

    p(xn , C2) = p(C2) p(xn |C2) = (1 − π) N(xn |µ2, Σ)

For t = (t1, . . . , tN)^T, the likelihood function is given by

    p(t|π, µ1, µ2, Σ) = Π_{n=1}^{N} [ π N(xn |µ1, Σ) ]^{tn} [ (1 − π) N(xn |µ2, Σ) ]^{1−tn}    (15)
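
A small sketch that evaluates the log of the likelihood (15) for given parameter values, using SciPy's multivariate normal; the data points, labels, and parameters below are placeholders:

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_likelihood(X, t, pi, mu1, mu2, Sigma):
        # log of eq. (15): sum_n [ tn ln(pi N(xn|mu1,Sigma)) + (1-tn) ln((1-pi) N(xn|mu2,Sigma)) ]
        ln1 = np.log(pi)       + multivariate_normal.logpdf(X, mean=mu1, cov=Sigma)
        ln2 = np.log(1.0 - pi) + multivariate_normal.logpdf(X, mean=mu2, cov=Sigma)
        return np.sum(t * ln1 + (1 - t) * ln2)

    # placeholder data: N = 4 points in D = 2 dimensions with binary labels
    X = np.array([[1.2, 0.7], [0.8, 1.1], [-0.9, -0.2], [-1.3, 0.1]])
    t = np.array([1, 1, 0, 0])
    print(log_likelihood(X, t, 0.5, np.array([1.0, 1.0]), np.array([-1.0, 0.0]), np.eye(2)))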

Maximum likelihood solution (cont.)

As usual, we maximise the log of the likelihood function with respect to each parameter in turn

    Σ_{n=1}^{N} [ tn ln (π) + (1 − tn ) ln (1 − π) + tn ln N(xn |µ1, Σ) + (1 − tn ) ln N(xn |µ2, Σ) ]

The first two terms depend on π, the third on µ1 and Σ, and the fourth on µ2 and Σ

Maximum likelihood solution (cont.)

Consider first maximisation with respect to π, where the terms depending on π are

    Σ_{n=1}^{N} [ tn ln (π) + (1 − tn ) ln (1 − π) ]    (16)

Setting the derivative with respect to π to zero and rearranging

    π = (1/N) Σ_{n=1}^{N} tn = N1 / N = N1 / (N1 + N2)    (17)

where N1 and N2 denote the number of data points in classes C1 and C2 respectively
The maximum likelihood estimate for π is the fraction of points in C1
This result is easily generalized to the multiclass case where
again the maximum likelihood estimate of the prior probability
associated with class Ck is given by the fraction of the training
set points assigned to that class.
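
The estimate (17) is simply the fraction of training points labelled as C1; a one-line sketch with placeholder labels:

    import numpy as np

    t = np.array([1, 0, 1, 1, 0])   # placeholder labels: tn = 1 for C1, tn = 0 for C2
    pi_ml = t.mean()                # eq. (17): N1 / N
    print(pi_ml)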


Maximum likelihood solution (cont.)

Now consider maximisation with respect to µ1, where the terms depending on µ1 are

    Σ_{n=1}^{N} tn ln N(xn |µ1, Σ) = − (1/2) Σ_{n=1}^{N} tn (xn − µ1)^T Σ^{-1} (xn − µ1) + const    (18)

Setting the derivative with respect to µ1 to zero and rearranging

    µ1 = (1/N1) Σ_{n=1}^{N} tn xn    (19)

The maximum likelihood estimate of µ1 is the mean of the inputs xn in class C1

By a similar argument, the corresponding result for µ2 is

    µ2 = (1/N2) Σ_{n=1}^{N} (1 − tn ) xn    (20)

the mean of the inputs xn in class C2

Maximum likelihood solution (cont.)


Lastly consider maximisation with
respect to Σ , where the terms on Σ
are
—2 t n ln |Σ | − t n (xn − µ 1)T Σ −1
(xn − µ 1)
n=1 n=1

1 N
1Σ N
N
− 1Σ 1
2 n=1 (1 − t n ) ln |Σ | − 2 n=1 (1 − t n )(xn − µ 2 ) Σ (xn − µ 2 )
T −1

ΣN
N N
=− ln |Σ | − (21)
Tr(Σ −1 S) 2 2
Where we have defined,
N1 N2
S = S1 + (22)
S2
1 Σ
N N
S1 = (xn − µ 1)(xn − T
(23)
N
1
n∈C1 µ )
1

1 Σ T
= (24)
S2 N (xn − µ 2 )(xn − µ 2)
2 n∈C2

Maximum likelihood solution (cont.)

Setting the derivative with respect to Σ to zero gives

    Σ = S = (N1/N) (1/N1) Σ_{n∈C1} (xn − µ1)(xn − µ1)^T + (N2/N) (1/N2) Σ_{n∈C2} (xn − µ2)(xn − µ2)^T

It is the weighted average of the covariance matrices associated with each of the two classes separately
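
A minimal end-to-end sketch of the maximum likelihood fit under the shared-covariance Gaussian model, eqs. (17)-(24), followed by prediction through the sigmoid of eqs. (9)-(11); the function and variable names are my own, and the data below are synthetic placeholders:

    import numpy as np

    def fit_gaussian_generative(X, t):
        # ML fit of the shared-covariance two-class Gaussian model
        N, N1 = len(t), t.sum()
        N2 = N - N1
        pi  = N1 / N                                   # eq. (17)
        mu1 = X[t == 1].mean(axis=0)                   # eq. (19)
        mu2 = X[t == 0].mean(axis=0)                   # eq. (20)
        d1, d2 = X[t == 1] - mu1, X[t == 0] - mu2
        S1, S2 = d1.T @ d1 / N1, d2.T @ d2 / N2        # eqs. (23), (24)
        Sigma = (N1 / N) * S1 + (N2 / N) * S2          # eq. (22): Sigma = S
        return pi, mu1, mu2, Sigma

    def posterior_C1(X, pi, mu1, mu2, Sigma):
        # p(C1|x) = sigmoid(w^T x + w0), eqs. (9)-(11)
        Sinv = np.linalg.inv(Sigma)
        w  = Sinv @ (mu1 - mu2)
        w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
        return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

    # synthetic placeholder data: 50 points per class in 2 dimensions
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([1, 1], 1.0, size=(50, 2)),
                   rng.normal([-1, 0], 1.0, size=(50, 2))])
    t = np.hstack([np.ones(50), np.zeros(50)]).astype(int)

    params = fit_gaussian_generative(X, t)
    print(posterior_C1(X[:5], *params))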
