
CSE 1513 Soft Computing
Dr. Djamel Bouchaffra

NOT PART OF THE FINAL

Unsupervised Learning and Other Neural Networks

• Introduction
• Mixture Densities and Identifiability
• ML Estimates
• Application to Normal Mixtures
• Other Neural Networks

• Introduction

Previously, all our training samples were labeled: these samples were said to be supervised

We now investigate a number of unsupervised procedures which use unlabeled samples

Collecting and labeling a large set of sample patterns can be costly

We can train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This is appropriate for large data-mining applications where the contents of a large database are not known beforehand


This is also appropriate in many applications where the characteristics of the patterns can change slowly with time (such as in automated food classification as the seasons change)

Improved performance can be achieved if classifiers running in an unsupervised mode are used

We can use unsupervised methods to identify features (through clustering) that will then be useful for categorization (or classification)

We gain some insight into the nature (or structure) of the data

• Mixture Densities and Identifiability

We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector

We make the following assumptions:

The samples come from a known number c of classes

The prior probabilities $P(\omega_j)$ for each class are known $(j = 1, \ldots, c)$

The forms of the class-conditional densities $P(x \mid \omega_j, \theta_j)$ $(j = 1, \ldots, c)$ are known, but they may differ from class to class

The values of the c parameter vectors $\theta_1, \theta_2, \ldots, \theta_c$ are unknown


The category labels are unknown


$$P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \; \underbrace{P(\omega_j)}_{\text{mixing parameters}}$$

where $\theta = [\theta_1, \theta_2, \ldots, \theta_c]^t$

This density function is called a mixture density

Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities
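As an added illustration (not part of the original slides), the following minimal Python sketch evaluates a two-component one-dimensional normal mixture and, assuming θ is already known, uses the component densities and mixing parameters for MAP classification; the priors and means are made-up values:

```python
import numpy as np

# Hypothetical parameters of a two-component 1-D normal mixture (unit variances)
priors = np.array([0.4, 0.6])    # P(w_1), P(w_2): mixing parameters
means  = np.array([-2.0, 2.0])   # theta_1, theta_2: component means

def component_density(x, mean):
    """p(x | w_j, theta_j): unit-variance normal component."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def mixture_density(x):
    """p(x | theta) = sum_j p(x | w_j, theta_j) P(w_j)."""
    return sum(P * component_density(x, m) for P, m in zip(priors, means))

def map_classify(x):
    """Once theta is known, pick the class with the largest posterior."""
    posteriors = [P * component_density(x, m) for P, m in zip(priors, means)]
    return int(np.argmax(posteriors)) + 1   # class index 1 or 2

print(mixture_density(0.5))   # mixture density at x = 0.5
print(map_classify(0.5))      # MAP class label for x = 0.5
```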

Definition

A density $P(x \mid \theta)$ is said to be identifiable (or injective) if $\theta \neq \theta'$ implies that there exists an x such that $P(x \mid \theta) \neq P(x \mid \theta')$
As a simple example, consider the case where x is binary and $P(x \mid \theta)$ is the following mixture:

$$P(x \mid \theta) = \frac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x}
= \begin{cases} \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\[2pt] 1 - \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that:

$$P(x = 1 \mid \theta) = 0.6 \qquad P(x = 0 \mid \theta) = 0.4$$

By replacing these probability values, we obtain:

$$\theta_1 + \theta_2 = 1.2$$

but we cannot determine $\theta_1$ and $\theta_2$ individually
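A short numerical check (added here, not in the original slides) makes the problem concrete: any pair $(\theta_1, \theta_2)$ with $\theta_1 + \theta_2 = 1.2$ yields exactly the same distribution over the binary x, so the data cannot distinguish between them:

```python
def p_binary_mixture(x, theta1, theta2):
    # P(x | theta) = 0.5*theta1^x (1-theta1)^(1-x) + 0.5*theta2^x (1-theta2)^(1-x)
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

# Two different parameter vectors, both with theta1 + theta2 = 1.2
for theta in [(0.9, 0.3), (0.7, 0.5)]:
    print(theta, p_binary_mixture(1, *theta), p_binary_mixture(0, *theta))
# Both print P(x=1) = 0.6 and P(x=0) = 0.4: the mixture is unidentifiable
```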


Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible

For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!

While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_2)^2\right]$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$ (we cannot recover a unique $\theta$ even from an infinite amount of data!)

$\theta = (\theta_1, \theta_2)$ and $\theta' = (\theta_2, \theta_1)$ are two possible vectors that can be interchanged without affecting $P(x \mid \theta)$
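A quick check (an added illustration, not from the slides) shows this swap numerically: with equal priors, the two parameter vectors give identical densities at every x:

```python
import numpy as np

def mixture(x, t1, t2, p1=0.5, p2=0.5):
    g = lambda x, t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)
    return p1 * g(x, t1) + p2 * g(x, t2)

xs = np.linspace(-5, 5, 11)
# With P(w1) = P(w2), swapping (theta1, theta2) leaves P(x | theta) unchanged
print(np.allclose(mixture(xs, -1.0, 3.0), mixture(xs, 3.0, -1.0)))   # True
```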

Identifiability can be a problem, so we will always assume that the densities we are dealing with are identifiable!


• ML Estimates

Suppose that we have a set D = {x_1, ..., x_n} of n unlabeled samples drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

(θ is fixed but unknown!)

$$\hat{\theta} = \arg\max_{\theta}\; p(D \mid \theta) \quad \text{with} \quad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

The gradient of the log-likelihood is:

$$\nabla_{\theta_i}\, l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$

Since the gradient must vanish at the value of $\theta_i$ that maximizes $l$ (where $l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$), the ML estimate $\hat{\theta}_i$ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0 \qquad (i = 1, \ldots, c)$$

By including the prior probabilities as unknown variables, we can finally show that:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

and

$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0$$

where

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$

This last equation enables clustering
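The last equation is the one that drives the clustering in practice: given current parameter estimates, it assigns each sample a posterior "responsibility" for every class. A small illustrative sketch (my addition; unit-variance 1-D normal components and hypothetical values):

```python
import numpy as np

def responsibilities(samples, priors, means):
    """P_hat(w_i | x_k, theta_hat) for 1-D unit-variance normal components."""
    # densities[k, i] = p(x_k | w_i, theta_i) * P_hat(w_i)
    densities = np.array([[P * np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
                           for P, m in zip(priors, means)] for x in samples])
    return densities / densities.sum(axis=1, keepdims=True)   # normalize over classes

samples = np.array([-2.3, -1.8, 0.1, 1.9, 2.4])
print(responsibilities(samples, priors=[0.5, 0.5], means=[-2.0, 2.0]))
```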


• Application to Normal Mixtures

$p(x \mid \omega_i, \theta_i) \sim N(\mu_i, \Sigma_i)$

Case   μi   Σi   P(ωi)   c
  1     ?    ×     ×     ×
  2     ?    ?     ?     ×
  3     ?    ?     ?     ?

(? = unknown, × = known)

Case 1 = simplest case

Case 1: Unknown mean vectors

$\theta_i = \mu_i \quad (i = 1, \ldots, c)$

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\!\left[(2\pi)^{d/2}\,|\Sigma_i|^{1/2}\right] - \frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i)$$

The ML estimate of $\mu = (\mu_i)$ is:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \qquad (1)$$

(a weighted average of the samples coming from the i-th class)

$P(\omega_i \mid x_k, \hat{\mu})$ is the fraction of those samples having value $x_k$ that come from the i-th class, and $\hat{\mu}_i$ is the average of the samples coming from the i-th class.


Unfortunately, equation (1) does not give $\hat{\mu}_i$ explicitly

However, if we have some way of obtaining good initial estimates $\hat{\mu}_i(0)$ for the unknown means, then equation (1) can be seen as an iterative process for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$

This is a gradient ascent for maximizing the log-likelihood function
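The iteration above is easy to code. The following sketch (my own illustration, assuming known unit-variance components and hypothetical starting values) repeatedly recomputes the responsibilities and the weighted means until the estimates stop changing:

```python
import numpy as np

def estimate_means(samples, priors, means_init, n_iter=50, tol=1e-6):
    """Iterative ML estimation of the component means (all other parameters known)."""
    means = np.array(means_init, dtype=float)
    for _ in range(n_iter):
        # Responsibilities P(w_i | x_k, mu_hat(j)); the 1/sqrt(2*pi) factor cancels out
        dens = np.array([[P * np.exp(-0.5 * (x - m) ** 2) for P, m in zip(priors, means)]
                         for x in samples])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # mu_hat_i(j+1): weighted average of the samples
        new_means = (resp * samples[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.max(np.abs(new_means - means)) < tol:
            break
        means = new_means
    return means

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 80)])
print(estimate_means(data, priors=[1/3, 2/3], means_init=[-0.1, 0.1]))
```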

Example (Class):
Consider the simple two-component one-dimensional
normal mixture

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_2)^2\right]$$

(2 clusters!)

Let's draw n = 25 samples sequentially from this mixture (see Table p. 524: $x_k$ with k), with $\mu_1 = -2$ and $\mu_2 = 2$. The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)$$


The maximum value of l occurs at:

$\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.668$

(which are not far from the true values: $\mu_1 = -2$ and $\mu_2 = +2$)

There is another peak at $\hat{\mu}_1 = 2.085$ and $\hat{\mu}_2 = -1.257$, which has almost the same height, as can be seen from the corresponding log-likelihood surface figure.
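A small sketch of how such a result can be checked (an addition; the 25 samples from the course table are not reproduced here, so the code draws its own samples and its numbers will differ slightly): draw samples from the mixture and evaluate $l(\mu_1, \mu_2)$ on a grid to locate the peaks.

```python
import numpy as np

def mixture_pdf(x, mu1, mu2):
    g = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
    return (1/3) * g(x, mu1) + (2/3) * g(x, mu2)

rng = np.random.default_rng(1)
# Draw n = 25 samples from the mixture with true means -2 and +2
labels  = rng.random(25) < 1/3
samples = np.where(labels, rng.normal(-2, 1, 25), rng.normal(2, 1, 25))

# Evaluate the log-likelihood l(mu1, mu2) on a grid and report the best grid point
grid = np.linspace(-4, 4, 161)
M1, M2 = np.meshgrid(grid, grid, indexing="ij")
loglik = sum(np.log(mixture_pdf(x, M1, M2)) for x in samples)
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("grid maximum near mu1 =", grid[i], ", mu2 =", grid[j])
```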


This mixture of normal densities is identifiable

When the mixture density is not identifiable, the ML solution is not unique

Case 2: All parameters unknown

No constraints are placed on the covariance matrix

Let $p(x \mid \mu, \sigma^2)$ be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x^{2}\right]$$

Suppose $\mu = x_1$; then:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x_1^{2}\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x_k^{2}\right]$$

Finally,

$$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \geq \underbrace{\left\{\frac{1}{\sigma} + \exp\!\left[-\frac{1}{2} x_1^{2}\right]\right\}}_{\to\,\infty \text{ as } \sigma \to 0} \; \frac{1}{(2\sqrt{2\pi})^{n}} \exp\!\left[-\frac{1}{2} \sum_{k=2}^{n} x_k^{2}\right]$$

The likelihood can therefore be made arbitrarily large (by letting σ → 0), and the maximum-likelihood solution is singular.
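This degenerate behaviour can be seen numerically. In the sketch below (an added illustration with made-up data), the log-likelihood keeps growing without bound as σ shrinks once μ is pinned to one of the samples:

```python
import numpy as np

def log_likelihood(samples, mu, sigma):
    """Log-likelihood of the two-component mixture with unknown mu and sigma."""
    comp1 = np.exp(-0.5 * ((samples - mu) / sigma) ** 2) / sigma   # N(mu, sigma^2) part
    comp2 = np.exp(-0.5 * samples ** 2)                            # N(0, 1) part
    return np.sum(np.log((comp1 + comp2) / (2 * np.sqrt(2 * np.pi))))

samples = np.array([0.6, -1.1, 0.3, 1.8, -0.4])
mu = samples[0]                     # place the first component exactly on x_1
for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, log_likelihood(samples, mu, sigma))
# The log-likelihood increases without bound as sigma -> 0: a singular ML "solution"
```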


Adding an assumption
Consider the largest of the finite local maxima of the
likelihood function and use the ML estimation.
We obtain the following:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\,(x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

(iterative scheme)

Where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\displaystyle\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1}(x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$
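As a concrete sketch (my illustration, not from the slides), one pass of this iterative scheme for d-dimensional data can be coded as follows; `resp[k, i]` stands for $\hat{P}(\omega_i \mid x_k, \hat{\theta})$:

```python
import numpy as np

def update_step(X, priors, means, covs):
    """One iteration: recompute P(w_i), mu_i and Sigma_i from the responsibilities."""
    n, c = X.shape[0], len(priors)
    resp = np.zeros((n, c))
    for i in range(c):
        diff = X - means[i]
        inv_cov = np.linalg.inv(covs[i])
        mahal = np.einsum("ki,ij,kj->k", diff, inv_cov, diff)      # squared Mahalanobis
        resp[:, i] = priors[i] * np.linalg.det(covs[i]) ** -0.5 * np.exp(-0.5 * mahal)
    resp /= resp.sum(axis=1, keepdims=True)                         # P_hat(w_i | x_k)

    new_priors = resp.mean(axis=0)                                  # P_hat(w_i)
    new_means  = (resp.T @ X) / resp.sum(axis=0)[:, None]           # mu_hat_i
    new_covs = []
    for i in range(c):
        diff = X - new_means[i]
        new_covs.append((resp[:, i, None] * diff).T @ diff / resp[:, i].sum())
    return new_priors, new_means, new_covs

# Hypothetical usage with 2-D data and two components:
# X = np.random.default_rng(0).normal(size=(100, 2))
# priors, means, covs = update_step(X, [0.5, 0.5],
#                                   np.array([[-1., 0.], [1., 0.]]),
#                                   [np.eye(2), np.eye(2)])
```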

K-Means Clustering

Goal: find the c mean vectors $\mu_1, \mu_2, \ldots, \mu_c$

Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$

Find the mean $\hat{\mu}_m$ nearest to $x_k$ and approximate $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ as:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$


Use the iterative scheme to find $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_c$

If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

Begin
    initialize n, c, μ1, μ2, ..., μc (randomly selected)
    do
        classify the n samples according to the nearest μi
        recompute μi
    until no change in μi
    return μ1, μ2, ..., μc
End
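A compact Python sketch of this algorithm (my illustration; initializing with c randomly chosen samples is one common choice, not prescribed by the slides):

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=0):
    """Basic k-means: X is an (n, d) array of samples, c the number of clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]     # initialize mu_1..mu_c randomly
    for _ in range(max_iter):
        # Classify each sample according to the nearest mean (squared Euclidean distance)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster lost all its samples
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):                        # until no change in mu_i
            break
        mu = new_mu
    return mu, labels

# Example with two hypothetical 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (40, 2))])
centers, assignments = k_means(X, c=2)
print(centers)
```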

Other Neural Networks:

• Competitive Learning Networks (winner-take-all)

[Diagram: a competitive learning network with three input units x1, x2, x3 fully connected through weights w_ij to four output units; the output unit with the highest activation is selected as the winner.]

Activation value $a_j$ of output j:

$$a_j = \sum_{i=1}^{3} x_i w_{ij} = x^T w_j = w_j^T x$$

Weights are updated only for the winner output k:

$$w_k(t+1) = \frac{w_k(t) + \eta\,(x(t) - w_k(t))}{\left\| w_k(t) + \eta\,(x(t) - w_k(t)) \right\|}$$
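A minimal winner-take-all training loop, as a sketch (my illustration; the normalized update, the learning rate η = 0.1, and the clustered 3-D inputs are assumptions for the example, not values given in the slides):

```python
import numpy as np

def competitive_learning(X, n_outputs=4, eta=0.1, epochs=20, seed=0):
    """Winner-take-all training: only the winning output unit's weights are updated."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_outputs, X.shape[1]))            # one weight vector per output
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for _ in range(epochs):
        for x in X:
            a = W @ x                                        # activations a_j = w_j^T x
            k = int(np.argmax(a))                            # winner output k
            W[k] += eta * (x - W[k])                         # move winner towards the input
            W[k] /= np.linalg.norm(W[k])                     # keep the weight vector normalized
    return W                                                 # rows end up near cluster directions

# Hypothetical 3-D inputs drawn around four directions (3 inputs, 4 outputs as in the diagram)
rng = np.random.default_rng(2)
centers = np.array([[2., 0., 0.], [0., 2., 0.], [0., 0., 2.], [-2., -2., 0.]])
X = np.vstack([rng.normal(c, 0.3, (50, 3)) for c in centers])
print(competitive_learning(X))
```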


• Weight vectors move towards those areas where most inputs appear

• The weight vectors become the cluster centers

The weight update finds the cluster centers

The following topics can be considered by the students for their oral presentations:

• Kohonen Self-Organizing Networks

• Learning Vector Quantization

