
CSE 1513 Soft Computing
Dr. Djamel Bouchaffra

NOT PART OF THE FINAL

Unsupervised Learning and Other Neural Networks

• Introduction
• Mixture Densities and Identifiability
• ML Estimates
• Application to Normal Mixtures
• Other Neural Networks

• Introduction

Previously, all our training samples were labeled: these samples were said to be supervised

We now investigate a number of unsupervised procedures which use unlabeled samples

Collecting and labeling a large set of sample patterns can be costly

We can train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This is appropriate for large data-mining applications where the contents of a large database are not known beforehand


This is also appropriate in many applications where the characteristics of the patterns can change slowly with time (such as in automated food classification as the seasons change)

Improved performance can be achieved if classifiers running in an unsupervised mode are used

We can use unsupervised methods to identify features (through clustering) that will then be useful for categorization (or classification)

We gain some insight into the nature (or structure) of the data

• Mixture Densities and Identifiability

We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector

We make the following assumptions:

The samples come from a known number c of classes

The prior probabilities $P(\omega_j)$ for each class are known $(j = 1, \ldots, c)$

The forms of the class-conditional densities $P(x \mid \omega_j, \theta_j)$ $(j = 1, \ldots, c)$ are known, but they may differ from class to class

The values of the c parameter vectors $\theta_1, \theta_2, \ldots, \theta_c$ are unknown


The category labels are unknown


$$P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \; \underbrace{P(\omega_j)}_{\text{mixing parameters}}$$

where $\theta = [\theta_1, \theta_2, \ldots, \theta_c]^t$

This density function is called a mixture density

Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities
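As an added illustration (not part of the original slides), the following minimal Python sketch evaluates a two-component one-dimensional normal mixture and, assuming θ is already known, uses the component densities and mixing parameters for MAP classification; the priors and means are made-up values:

```python
import numpy as np

# Hypothetical parameters of a two-component 1-D normal mixture (unit variances)
priors = np.array([0.4, 0.6])    # P(w_1), P(w_2): mixing parameters
means  = np.array([-2.0, 2.0])   # theta_1, theta_2: component means

def component_density(x, mean):
    """p(x | w_j, theta_j): unit-variance normal component."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def mixture_density(x):
    """p(x | theta) = sum_j p(x | w_j, theta_j) P(w_j)."""
    return sum(P * component_density(x, m) for P, m in zip(priors, means))

def map_classify(x):
    """Once theta is known, pick the class with the largest posterior."""
    posteriors = [P * component_density(x, m) for P, m in zip(priors, means)]
    return int(np.argmax(posteriors)) + 1   # class index 1 or 2

print(mixture_density(0.5))   # mixture density at x = 0.5
print(map_classify(0.5))      # MAP class label for x = 0.5
```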

Definition

A density $P(x \mid \theta)$ is said to be identifiable (or injective) if $\theta \neq \theta'$ implies that there exists an x such that $P(x \mid \theta) \neq P(x \mid \theta')$
As a simple example, consider the case where x is binary and $P(x \mid \theta)$ is the following mixture:

$$P(x \mid \theta) = \frac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x}
= \begin{cases} \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\[2pt] 1 - \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that:

$$P(x = 1 \mid \theta) = 0.6 \qquad P(x = 0 \mid \theta) = 0.4$$

By replacing these probability values, we obtain:

$$\theta_1 + \theta_2 = 1.2$$

but we cannot determine $\theta_1$ and $\theta_2$ individually
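A short numerical check (added here, not in the original slides) makes the problem concrete: any pair $(\theta_1, \theta_2)$ with $\theta_1 + \theta_2 = 1.2$ yields exactly the same distribution over the binary x, so the data cannot distinguish between them:

```python
def p_binary_mixture(x, theta1, theta2):
    # P(x | theta) = 0.5*theta1^x (1-theta1)^(1-x) + 0.5*theta2^x (1-theta2)^(1-x)
    return 0.5 * theta1**x * (1 - theta1)**(1 - x) + 0.5 * theta2**x * (1 - theta2)**(1 - x)

# Two different parameter vectors, both with theta1 + theta2 = 1.2
for theta in [(0.9, 0.3), (0.7, 0.5)]:
    print(theta, p_binary_mixture(1, *theta), p_binary_mixture(0, *theta))
# Both print P(x=1) = 0.6 and P(x=0) = 0.4: the mixture is unidentifiable
```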


Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible

For discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!

While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_2)^2\right]$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$ (we cannot recover a unique $\theta$ even from an infinite amount of data!)

$\theta = (\theta_1, \theta_2)$ and $\theta' = (\theta_2, \theta_1)$ are two possible vectors that can be interchanged without affecting $P(x \mid \theta)$
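A quick check (an added illustration, not from the slides) shows this swap numerically: with equal priors, the two parameter vectors give identical densities at every x:

```python
import numpy as np

def mixture(x, t1, t2, p1=0.5, p2=0.5):
    g = lambda x, t: np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)
    return p1 * g(x, t1) + p2 * g(x, t2)

xs = np.linspace(-5, 5, 11)
# With P(w1) = P(w2), swapping (theta1, theta2) leaves P(x | theta) unchanged
print(np.allclose(mixture(xs, -1.0, 3.0), mixture(xs, 3.0, -1.0)))   # True
```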

Identifiability can be a problem, so we will always assume that the densities we are dealing with are identifiable!


• ML Estimates

Suppose that we have a set D = {x_1, ..., x_n} of n unlabeled samples drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

(θ is fixed but unknown!)

$$\hat{\theta} = \arg\max_{\theta}\; p(D \mid \theta) \quad \text{with} \quad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

The gradient of the log-likelihood is:

$$\nabla_{\theta_i}\, l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$

Since the gradient must vanish at the value of $\theta_i$ that maximizes $l$ (where $l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$), the ML estimate $\hat{\theta}_i$ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0 \qquad (i = 1, \ldots, c)$$

By including the prior probabilities as unknown variables, we can finally show that:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

and

$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0$$

where

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$

This last equation enables clustering
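The last equation is the one that drives the clustering in practice: given current parameter estimates, it assigns each sample a posterior "responsibility" for every class. A small illustrative sketch (my addition; unit-variance 1-D normal components and hypothetical values):

```python
import numpy as np

def responsibilities(samples, priors, means):
    """P_hat(w_i | x_k, theta_hat) for 1-D unit-variance normal components."""
    # densities[k, i] = p(x_k | w_i, theta_i) * P_hat(w_i)
    densities = np.array([[P * np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
                           for P, m in zip(priors, means)] for x in samples])
    return densities / densities.sum(axis=1, keepdims=True)   # normalize over classes

samples = np.array([-2.3, -1.8, 0.1, 1.9, 2.4])
print(responsibilities(samples, priors=[0.5, 0.5], means=[-2.0, 2.0]))
```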


• Application to Normal Mixtures

$p(x \mid \omega_i, \theta_i) \sim N(\mu_i, \Sigma_i)$

Case   μi   Σi   P(ωi)   c
  1     ?    ×     ×     ×
  2     ?    ?     ?     ×
  3     ?    ?     ?     ?

(? = unknown, × = known)

Case 1 = simplest case

Case 1: Unknown mean vectors

$\theta_i = \mu_i \quad (i = 1, \ldots, c)$

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\!\left[(2\pi)^{d/2}\,|\Sigma_i|^{1/2}\right] - \frac{1}{2}(x - \mu_i)^t \Sigma_i^{-1}(x - \mu_i)$$

The ML estimate of $\mu = (\mu_i)$ is:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \qquad (1)$$

(a weighted average of the samples coming from the i-th class)

$P(\omega_i \mid x_k, \hat{\mu})$ is the fraction of those samples having value $x_k$ that come from the i-th class, and $\hat{\mu}_i$ is the average of the samples coming from the i-th class.


Unfortunately, equation (1) does not give $\hat{\mu}_i$ explicitly

However, if we have some way of obtaining good initial estimates $\hat{\mu}_i(0)$ for the unknown means, then equation (1) can be seen as an iterative process for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$

This is a gradient ascent for maximizing the log-likelihood function
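The iteration above is easy to code. The following sketch (my own illustration, assuming known unit-variance components and hypothetical starting values) repeatedly recomputes the responsibilities and the weighted means until the estimates stop changing:

```python
import numpy as np

def estimate_means(samples, priors, means_init, n_iter=50, tol=1e-6):
    """Iterative ML estimation of the component means (all other parameters known)."""
    means = np.array(means_init, dtype=float)
    for _ in range(n_iter):
        # Responsibilities P(w_i | x_k, mu_hat(j)); the 1/sqrt(2*pi) factor cancels out
        dens = np.array([[P * np.exp(-0.5 * (x - m) ** 2) for P, m in zip(priors, means)]
                         for x in samples])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # mu_hat_i(j+1): weighted average of the samples
        new_means = (resp * samples[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.max(np.abs(new_means - means)) < tol:
            break
        means = new_means
    return means

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 80)])
print(estimate_means(data, priors=[1/3, 2/3], means_init=[-0.1, 0.1]))
```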

Example (Class):
Consider the simple two-component one-dimensional
normal mixture

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x - \mu_2)^2\right]$$

(2 clusters!)

Let's draw n = 25 samples sequentially from this mixture (see Table p. 524: $x_k$ with k), with $\mu_1 = -2$ and $\mu_2 = 2$. The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)$$


The maximum value of l occurs at:

$\hat{\mu}_1 = -2.130$ and $\hat{\mu}_2 = 1.668$

(which are not far from the true values: $\mu_1 = -2$ and $\mu_2 = +2$)

There is another peak at $\hat{\mu}_1 = 2.085$ and $\hat{\mu}_2 = -1.257$, which has almost the same height, as can be seen from the corresponding log-likelihood surface figure.
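A small sketch of how such a result can be checked (an addition; the 25 samples from the course table are not reproduced here, so the code draws its own samples and its numbers will differ slightly): draw samples from the mixture and evaluate $l(\mu_1, \mu_2)$ on a grid to locate the peaks.

```python
import numpy as np

def mixture_pdf(x, mu1, mu2):
    g = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
    return (1/3) * g(x, mu1) + (2/3) * g(x, mu2)

rng = np.random.default_rng(1)
# Draw n = 25 samples from the mixture with true means -2 and +2
labels  = rng.random(25) < 1/3
samples = np.where(labels, rng.normal(-2, 1, 25), rng.normal(2, 1, 25))

# Evaluate the log-likelihood l(mu1, mu2) on a grid and report the best grid point
grid = np.linspace(-4, 4, 161)
M1, M2 = np.meshgrid(grid, grid, indexing="ij")
loglik = sum(np.log(mixture_pdf(x, M1, M2)) for x in samples)
i, j = np.unravel_index(np.argmax(loglik), loglik.shape)
print("grid maximum near mu1 =", grid[i], ", mu2 =", grid[j])
```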


This mixture of normal densities is identifiable

When the mixture density is not identifiable, the ML solution is not unique

Case 2: All parameters unknown

No constraints are placed on the covariance matrix

Let $p(x \mid \mu, \sigma^2)$ be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x^{2}\right]$$

Suppose $\mu = x_1$; then:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x_1^{2}\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2} x_k^{2}\right]$$

Finally,

$$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \geq \underbrace{\left\{\frac{1}{\sigma} + \exp\!\left[-\frac{1}{2} x_1^{2}\right]\right\}}_{\to\,\infty \text{ as } \sigma \to 0} \; \frac{1}{(2\sqrt{2\pi})^{n}} \exp\!\left[-\frac{1}{2} \sum_{k=2}^{n} x_k^{2}\right]$$

The likelihood can therefore be made arbitrarily large (by letting σ → 0), and the maximum-likelihood solution is singular.
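This degenerate behaviour can be seen numerically. In the sketch below (an added illustration with made-up data), the log-likelihood keeps growing without bound as σ shrinks once μ is pinned to one of the samples:

```python
import numpy as np

def log_likelihood(samples, mu, sigma):
    """Log-likelihood of the two-component mixture with unknown mu and sigma."""
    comp1 = np.exp(-0.5 * ((samples - mu) / sigma) ** 2) / sigma   # N(mu, sigma^2) part
    comp2 = np.exp(-0.5 * samples ** 2)                            # N(0, 1) part
    return np.sum(np.log((comp1 + comp2) / (2 * np.sqrt(2 * np.pi))))

samples = np.array([0.6, -1.1, 0.3, 1.8, -0.4])
mu = samples[0]                     # place the first component exactly on x_1
for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, log_likelihood(samples, mu, sigma))
# The log-likelihood increases without bound as sigma -> 0: a singular ML "solution"
```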


Adding an assumption
Consider the largest of the finite local maxima of the
likelihood function and use the ML estimation.
We obtain the following:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\,(x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

(iterative scheme)

Where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\displaystyle\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1}(x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$
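As a concrete sketch (my illustration, not from the slides), one pass of this iterative scheme for d-dimensional data can be coded as follows; `resp[k, i]` stands for $\hat{P}(\omega_i \mid x_k, \hat{\theta})$:

```python
import numpy as np

def update_step(X, priors, means, covs):
    """One iteration: recompute P(w_i), mu_i and Sigma_i from the responsibilities."""
    n, c = X.shape[0], len(priors)
    resp = np.zeros((n, c))
    for i in range(c):
        diff = X - means[i]
        inv_cov = np.linalg.inv(covs[i])
        mahal = np.einsum("ki,ij,kj->k", diff, inv_cov, diff)      # squared Mahalanobis
        resp[:, i] = priors[i] * np.linalg.det(covs[i]) ** -0.5 * np.exp(-0.5 * mahal)
    resp /= resp.sum(axis=1, keepdims=True)                         # P_hat(w_i | x_k)

    new_priors = resp.mean(axis=0)                                  # P_hat(w_i)
    new_means  = (resp.T @ X) / resp.sum(axis=0)[:, None]           # mu_hat_i
    new_covs = []
    for i in range(c):
        diff = X - new_means[i]
        new_covs.append((resp[:, i, None] * diff).T @ diff / resp[:, i].sum())
    return new_priors, new_means, new_covs

# Hypothetical usage with 2-D data and two components:
# X = np.random.default_rng(0).normal(size=(100, 2))
# priors, means, covs = update_step(X, [0.5, 0.5],
#                                   np.array([[-1., 0.], [1., 0.]]),
#                                   [np.eye(2), np.eye(2)])
```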

K-Means Clustering

Goal: find the c mean vectors $\mu_1, \mu_2, \ldots, \mu_c$

Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$

Find the mean $\hat{\mu}_m$ nearest to $x_k$ and approximate $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ as:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$


Use the iterative scheme to find $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_c$

If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

Begin
    initialize n, c, μ1, μ2, ..., μc (randomly selected)
    do
        classify the n samples according to the nearest μi
        recompute μi
    until no change in μi
    return μ1, μ2, ..., μc
End
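A compact Python sketch of this algorithm (my illustration; initializing with c randomly chosen samples is one common choice, not prescribed by the slides):

```python
import numpy as np

def k_means(X, c, max_iter=100, seed=0):
    """Basic k-means: X is an (n, d) array of samples, c the number of clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]     # initialize mu_1..mu_c randomly
    for _ in range(max_iter):
        # Classify each sample according to the nearest mean (squared Euclidean distance)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster lost all its samples
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):                        # until no change in mu_i
            break
        mu = new_mu
    return mu, labels

# Example with two hypothetical 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (40, 2))])
centers, assignments = k_means(X, c=2)
print(centers)
```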

Other Neural Networks:

• Competitive Learning Networks (winner-take-all)

[Diagram: a competitive learning network with three input units x1, x2, x3 fully connected through weights w_ij to four output units; the output unit with the highest activation is selected as the winner.]

Activation value $a_j$ of output j:

$$a_j = \sum_{i=1}^{3} x_i w_{ij} = x^T w_j = w_j^T x$$

Weights are updated only for the winner output k:

$$w_k(t+1) = \frac{w_k(t) + \eta\,(x(t) - w_k(t))}{\left\| w_k(t) + \eta\,(x(t) - w_k(t)) \right\|}$$
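A minimal winner-take-all training loop, as a sketch (my illustration; the normalized update, the learning rate η = 0.1, and the clustered 3-D inputs are assumptions for the example, not values given in the slides):

```python
import numpy as np

def competitive_learning(X, n_outputs=4, eta=0.1, epochs=20, seed=0):
    """Winner-take-all training: only the winning output unit's weights are updated."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_outputs, X.shape[1]))            # one weight vector per output
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for _ in range(epochs):
        for x in X:
            a = W @ x                                        # activations a_j = w_j^T x
            k = int(np.argmax(a))                            # winner output k
            W[k] += eta * (x - W[k])                         # move winner towards the input
            W[k] /= np.linalg.norm(W[k])                     # keep the weight vector normalized
    return W                                                 # rows end up near cluster directions

# Hypothetical 3-D inputs drawn around four directions (3 inputs, 4 outputs as in the diagram)
rng = np.random.default_rng(2)
centers = np.array([[2., 0., 0.], [0., 2., 0.], [0., 0., 2.], [-2., -2., 0.]])
X = np.vstack([rng.normal(c, 0.3, (50, 3)) for c in centers])
print(competitive_learning(X))
```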


• Weight vectors move towards those areas where most inputs appear

• The weight vectors become the cluster centers

The weight update finds the cluster centers

The following topics can be considered by the students for their oral presentations:

• Kohonen Self-Organizing Networks

• Learning Vector Quantization

