
Probabilistic generative models

Linear models for classification



Probabilistic view of classification


To show how models with linear decision boundaries arise from simple assumptions about the data

In a generative approach to classification, we first model the class-conditional densities p(x|Ck ) and the class priors p(Ck ), and then
◮ we compute the posterior probabilities p(Ck |x) through Bayes' theorem



Probabilistic generative models (cont.)


For the two-class problem, the posterior probability of class C1 is

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ] = 1 / (1 + exp(−a)) = σ(a)    (1)

where the denominator is p(x) = Σ_k p(x, Ck) = Σ_k p(x|Ck)p(Ck), and we have defined

    a = ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ]    (2)

σ(a) is the logistic sigmoid function (plotted in red in the figure)

    σ(a) = 1 / (1 + exp(−a))    (3)

also known as the squashing function, because it maps the whole real axis R into a finite interval

[Figure: the logistic sigmoid σ(a), plotted for a between −5 and 5]
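
A minimal numerical sketch of eqs. (1)-(3), assuming (purely for illustration) one-dimensional Gaussian class-conditional densities and made-up priors; the helper names are my own:

    import numpy as np
    from scipy.stats import norm

    def sigmoid(a):
        # logistic sigmoid, eq. (3)
        return 1.0 / (1.0 + np.exp(-a))

    # hypothetical 1-D class-conditional densities and priors (illustrative only)
    p_x_C1 = lambda x: norm.pdf(x, loc=1.0, scale=1.0)
    p_x_C2 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)
    p_C1, p_C2 = 0.4, 0.6

    x = 0.3
    a = np.log((p_x_C1(x) * p_C1) / (p_x_C2(x) * p_C2))                  # eq. (2)
    direct = p_x_C1(x) * p_C1 / (p_x_C1(x) * p_C1 + p_x_C2(x) * p_C2)    # eq. (1), via Bayes
    assert np.isclose(sigmoid(a), direct)                                # the two forms agree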
Probabilistic generative models (cont.)

The logistic sigmoid plays an important role in many classification algorithms. It satisfies the following symmetry property

    σ(−a) = 1 − σ(a)    (4)

The inverse of the logistic sigmoid is known as the logit function

    a = ln ( σ / (1 − σ) )    (5)

It represents the log of the ratio of probabilities for the two classes, ln (p(C1|x)/p(C2|x)), also known as the log odds
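
A quick numerical check of the symmetry property (4) and of the logit (5) as the inverse of the sigmoid; a short sketch with helper names of my own choosing:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def logit(s):
        # inverse of the logistic sigmoid, eq. (5)
        return np.log(s / (1.0 - s))

    a = np.linspace(-5, 5, 11)
    assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))   # symmetry, eq. (4)
    assert np.allclose(logit(sigmoid(a)), a)            # logit undoes the sigmoid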


Probabilistic generative models (cont.)

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ]
            = 1 / ( 1 + exp( − ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ] ) )
            = σ( ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ] )

where the argument of the exponential and of the sigmoid is the quantity a defined in (2)

We have written the posterior probability in an equivalent form that will have significance when a(x) is a linear function of x
◮ in that case, the posterior probability is governed by a generalised linear model
Probabilistic generative models (cont.)

For the case of K > 2 classes, we have

    p(Ck|x) = p(x|Ck)p(Ck) / Σ_j p(x|Cj)p(Cj) = exp(ak) / Σ_j exp(aj)    (6)

known as the normalised exponential¹

We have defined the quantities ak as

    ak = ln ( p(x|Ck)p(Ck) )    (7)

If ak >> aj for all j ≠ k, then p(Ck|x) ≃ 1 and p(Cj|x) ≃ 0

We analyse the consequences of choosing specific forms for class-conditional densities

¹ It is a multiclass generalisation of the logistic sigmoid and it is also known as the softmax function
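
A minimal sketch of the normalised exponential (softmax) of eq. (6), computed from the quantities ak of eq. (7); subtracting the maximum is a standard numerical-stability trick and does not change the result:

    import numpy as np

    def softmax(a):
        # normalised exponential, eq. (6)
        a = a - np.max(a)            # stability shift; cancels in the ratio
        e = np.exp(a)
        return e / e.sum()

    a = np.array([5.0, 1.0, -2.0])   # hypothetical ak = ln( p(x|Ck) p(Ck) )
    print(softmax(a))                # posteriors p(Ck|x); the largest ak dominates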
Continuous inputs

Let us assume that the class-conditional densities p(x|Ck ) are Gaussian


    p(x|Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ − (1/2) (x − µk)^T Σ^{-1} (x − µk) }    (8)

and we want to explore the form of the posterior probabilities p(Ck |x)

The Gaussians have different means µ k but share the same covariance matrix Σ
Continuous inputs (cont.)
With two classes, using

    p(C1|x) = p(x|C1)p(C1) / [ p(x|C1)p(C1) + p(x|C2)p(C2) ] = 1 / (1 + exp(−a)) = σ(a)

and a = ln [ p(x|C1)p(C1) / ( p(x|C2)p(C2) ) ], we have

    p(C1|x) = σ(w^T x + w0)    (9)

where

    w = Σ^{-1} (µ1 − µ2)    (10)

    w0 = − (1/2) µ1^T Σ^{-1} µ1 + (1/2) µ2^T Σ^{-1} µ2 + ln ( p(C1) / p(C2) )    (11)

The quadratic terms in x from the exponents of the Gaussian densities have
cancelled (due to the assumption of common covariance matrices) leading to
◮ a linear function of x in the argument of the logistic sigmoid
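
A minimal sketch of eqs. (9)-(11) for two Gaussian classes sharing a covariance matrix; the means, covariance, priors, and test point below are made-up illustrative values:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # illustrative parameters (not taken from the text)
    mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    p1, p2 = 0.5, 0.5

    Sinv = np.linalg.inv(Sigma)
    w  = Sinv @ (mu1 - mu2)                                 # eq. (10)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2
          + np.log(p1 / p2))                                # eq. (11)

    x = np.array([0.2, -0.5])
    print(sigmoid(w @ x + w0))                              # p(C1|x), eq. (9)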
Continuous inputs (cont.)
The left-hand plot shows the class-conditional densities for two classes over a two-dimensional input space

The posterior probability p(C1|x) is a logistic sigmoid of a linear function of x

The surface in the right-hand plot is coloured using a proportion of red


given by p(C1|x) and a proportion of blue given by p(C2|x) = 1 − p(C1|x)
Continuous inputs (cont.)

Decision boundaries are surfaces with constant posterior probabilities p(Ck |x)
◮ Given by linear functions of x
◮ Decision boundaries are linear in input space

Prior probabilities p(Ck ) enter only through the bias parameter w0, so changes in the priors have the effect of making parallel shifts of the decision boundary
◮ and, more generally, of the parallel contours of constant posterior probability
Continuous inputs (cont.)
For the K-class case, using

    p(Ck|x) = p(x|Ck)p(Ck) / Σ_{j=1}^{K} p(x|Cj)p(Cj) = exp(ak) / Σ_{j=1}^{K} exp(aj)

and ak = ln ( p(x|Ck)p(Ck) ), we have

    ak(x) = wk^T x + wk0    (12)

where

    wk = Σ^{-1} µk    (13)

    wk0 = − (1/2) µk^T Σ^{-1} µk + ln p(Ck)    (14)

The ak (x) are again linear functions of x as a consequence of the cancellation of the quadratic terms due to the shared covariances
The resulting decision boundaries (minimum misclassification rate)
occur when two of the posterior probabilities (the two largest) are
equal, and so they are defined by linear functions of x
◮ again we have a generalised linear model
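
A short sketch of eqs. (12)-(14): each class contributes a linear discriminant ak(x) = wk^T x + wk0, and the class with the largest ak (equivalently, the largest softmax output) is selected. All parameter values here are illustrative:

    import numpy as np

    # illustrative parameters for K = 3 classes in D = 2 dimensions
    mus    = [np.array([1.0, 1.0]), np.array([-1.0, 1.0]), np.array([0.0, -1.5])]
    priors = [0.3, 0.3, 0.4]
    Sigma  = np.eye(2)                               # shared covariance matrix
    Sinv   = np.linalg.inv(Sigma)

    def a_k(x, mu, prior):
        w  = Sinv @ mu                               # eq. (13)
        w0 = -0.5 * mu @ Sinv @ mu + np.log(prior)   # eq. (14)
        return w @ x + w0                            # eq. (12), linear in x

    x   = np.array([0.5, 0.2])
    aks = [a_k(x, mu, p) for mu, p in zip(mus, priors)]
    print("predicted class:", int(np.argmax(aks)))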
Continuous inputs (cont.)

If we relax the assumption of a shared covariance matrix and allow each


class-conditional density p(x|Ck ) to have its own covariance matrix Σ k , then
the earlier cancellations will no longer occur, and we will obtain
quadratic functions of x, giving rise to a quadratic discriminant
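
If each class keeps its own covariance matrix, the quadratic term no longer cancels; a sketch of the resulting quadratic discriminant, using ak(x) = ln p(x|Ck) + ln p(Ck) with made-up per-class parameters:

    import numpy as np
    from scipy.stats import multivariate_normal

    # illustrative per-class parameters: (mu_k, Sigma_k, prior)
    params = [
        (np.array([1.0, 1.0]),  np.eye(2),                           0.5),
        (np.array([-1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]),  0.5),
    ]

    def a_k(x, mu, Sigma, prior):
        # quadratic in x because Sigma_k differs between classes
        return multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(prior)

    x = np.array([0.0, 0.5])
    scores = [a_k(x, *p) for p in params]
    print("predicted class:", int(np.argmax(scores)))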

Continuous inputs (cont.)


Class-conditional densities for three classes each having a Gaussian distribution
◮ red and green classes have the same covariance matrix


The corresponding posterior probabilities and the decision boundaries


◮ Linear boundary between red and green classes, same covariance matrix

◮ Quadratic boundaries between the other pairs, which have different covariance matrices


Maximum likelihood solution

Once we have specified a parametric functional form for the class-conditional densities p(x|Ck ), we can determine their parameters, together with the prior class probabilities p(Ck )
◮ using maximum likelihood

This requires a dataset comprising observations of x together with their corresponding class labels

Maximum likelihood solution (cont.)

Consider first the two-class case, each class having a Gaussian density with a shared covariance matrix Σ, and suppose we have a dataset {xn , tn }, n = 1, . . . , N

    tn = 1 denotes class C1, with prior probability p(C1) = π
    tn = 0 denotes class C2, with prior probability p(C2) = 1 − π

For a data point xn from class C1 (tn = 1), we have

    p(xn , C1) = p(C1) p(xn |C1) = π N(xn |µ1, Σ)

and similarly, for a data point xn from class C2 (tn = 0),

    p(xn , C2) = p(C2) p(xn |C2) = (1 − π) N(xn |µ2, Σ)

For t = (t1, . . . , tN)^T, the likelihood function is given by

    p(t|π, µ1, µ2, Σ) = Π_{n=1}^{N} [ π N(xn |µ1, Σ) ]^{tn} [ (1 − π) N(xn |µ2, Σ) ]^{1−tn}    (15)
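
A small sketch that evaluates the log of the likelihood (15) for given parameter values, using SciPy's multivariate normal; the data points, labels, and parameters below are placeholders:

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_likelihood(X, t, pi, mu1, mu2, Sigma):
        # log of eq. (15): sum_n [ tn ln(pi N(xn|mu1,Sigma)) + (1-tn) ln((1-pi) N(xn|mu2,Sigma)) ]
        ln1 = np.log(pi)       + multivariate_normal.logpdf(X, mean=mu1, cov=Sigma)
        ln2 = np.log(1.0 - pi) + multivariate_normal.logpdf(X, mean=mu2, cov=Sigma)
        return np.sum(t * ln1 + (1 - t) * ln2)

    # placeholder data: N = 4 points in D = 2 dimensions with binary labels
    X = np.array([[1.2, 0.7], [0.8, 1.1], [-0.9, -0.2], [-1.3, 0.1]])
    t = np.array([1, 1, 0, 0])
    print(log_likelihood(X, t, 0.5, np.array([1.0, 1.0]), np.array([-1.0, 0.0]), np.eye(2)))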

Maximum likelihood solution (cont.)

As usual, we maximise the log of the likelihood function with respect to each parameter in turn

    Σ_{n=1}^{N} [ tn ln (π) + (1 − tn ) ln (1 − π) + tn ln N(xn |µ1, Σ) + (1 − tn ) ln N(xn |µ2, Σ) ]

The first two terms depend on π, the third on µ1 and Σ, and the fourth on µ2 and Σ

Maximum likelihood solution (cont.)

Consider first maximisation with respect to π, where the terms depending on π are

    Σ_{n=1}^{N} [ tn ln (π) + (1 − tn ) ln (1 − π) ]    (16)

Setting the derivative with respect to π to zero and rearranging

    π = (1/N) Σ_{n=1}^{N} tn = N1 / N = N1 / (N1 + N2)    (17)

where N1 and N2 denote the number of data points in classes C1 and C2 respectively
The maximum likelihood estimate for π is the fraction of points in C1
This result is easily generalized to the multiclass case where
again the maximum likelihood estimate of the prior probability
associated with class Ck is given by the fraction of the training
set points assigned to that class.
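
The estimate (17) is simply the fraction of training points labelled as C1; a one-line sketch with placeholder labels:

    import numpy as np

    t = np.array([1, 0, 1, 1, 0])   # placeholder labels: tn = 1 for C1, tn = 0 for C2
    pi_ml = t.mean()                # eq. (17): N1 / N
    print(pi_ml)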


Maximum likelihood solution (cont.)

Now consider maximisation with respect to µ1, where the terms depending on µ1 are

    Σ_{n=1}^{N} tn ln N(xn |µ1, Σ) = − (1/2) Σ_{n=1}^{N} tn (xn − µ1)^T Σ^{-1} (xn − µ1) + const    (18)

Setting the derivative with respect to µ1 to zero and rearranging

    µ1 = (1/N1) Σ_{n=1}^{N} tn xn    (19)

The maximum likelihood estimate of µ1 is the mean of the inputs xn in class C1

By a similar argument, the corresponding result for µ2 is

    µ2 = (1/N2) Σ_{n=1}^{N} (1 − tn ) xn    (20)

the mean of the inputs xn in class C2

Maximum likelihood solution (cont.)


Lastly consider maximisation with
respect to Σ , where the terms on Σ
are
—2 t n ln |Σ | − t n (xn − µ 1)T Σ −1
(xn − µ 1)
n=1 n=1

1 N
1Σ N
N
− 1Σ 1
2 n=1 (1 − t n ) ln |Σ | − 2 n=1 (1 − t n )(xn − µ 2 ) Σ (xn − µ 2 )
T −1

ΣN
N N
=− ln |Σ | − (21)
Tr(Σ −1 S) 2 2
Where we have defined,
N1 N2
S = S1 + (22)
S2
1 Σ
N N
S1 = (xn − µ 1)(xn − T
(23)
N
1
n∈C1 µ )
1

1 Σ T
= (24)
S2 N (xn − µ 2 )(xn − µ 2)
2 n∈C2

Maximum likelihood solution (cont.)

Setting the derivative with respect to Σ to zero gives

    Σ = S = (N1/N) (1/N1) Σ_{n∈C1} (xn − µ1)(xn − µ1)^T + (N2/N) (1/N2) Σ_{n∈C2} (xn − µ2)(xn − µ2)^T

It is the weighted average of the covariance matrices associated with each of the two classes separately
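
A minimal end-to-end sketch of the maximum likelihood fit under the shared-covariance Gaussian model, eqs. (17)-(24), followed by prediction through the sigmoid of eqs. (9)-(11); the function and variable names are my own, and the data below are synthetic placeholders:

    import numpy as np

    def fit_gaussian_generative(X, t):
        # ML fit of the shared-covariance two-class Gaussian model
        N, N1 = len(t), t.sum()
        N2 = N - N1
        pi  = N1 / N                                   # eq. (17)
        mu1 = X[t == 1].mean(axis=0)                   # eq. (19)
        mu2 = X[t == 0].mean(axis=0)                   # eq. (20)
        d1, d2 = X[t == 1] - mu1, X[t == 0] - mu2
        S1, S2 = d1.T @ d1 / N1, d2.T @ d2 / N2        # eqs. (23), (24)
        Sigma = (N1 / N) * S1 + (N2 / N) * S2          # eq. (22): Sigma = S
        return pi, mu1, mu2, Sigma

    def posterior_C1(X, pi, mu1, mu2, Sigma):
        # p(C1|x) = sigmoid(w^T x + w0), eqs. (9)-(11)
        Sinv = np.linalg.inv(Sigma)
        w  = Sinv @ (mu1 - mu2)
        w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
        return 1.0 / (1.0 + np.exp(-(X @ w + w0)))

    # synthetic placeholder data: 50 points per class in 2 dimensions
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([1, 1], 1.0, size=(50, 2)),
                   rng.normal([-1, 0], 1.0, size=(50, 2))])
    t = np.hstack([np.ones(50), np.zeros(50)]).astype(int)

    params = fit_gaussian_generative(X, t)
    print(posterior_C1(X[:5], *params))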
