
Chapter 5

Unsupervised learning

Introduction
Unsupervised learning
Training samples contain only input patterns
No desired output is given (teacher-less)

Learn to form classes/clusters of sample patterns according to similarities among them
Patterns in a cluster would have similar features
No prior knowledge as to which features are important for classification, or how many classes there are.

Introduction
NN models to be covered
Competitive networks and competitive learning
Winner-takes-all (WTA)
Maxnet
Hamming net

Counterpropagation nets
Adaptive Resonance Theory
Self-organizing map (SOM)

Applications
Clustering
Vector quantization
Feature extraction
Dimensionality reduction
Optimization

NN Based on Competition
Competition is important for NN

Competition between neurons has been observed in biological nerve systems
Competition is important in solving many problems

To classify an input pattern into one of the m classes:

ideal case: one class node has output 1, all others 0
often more than one class node has non-zero output

[Figure: input nodes x_1, ..., x_n (INPUT) connected to class nodes C_1, ..., C_m (CLASSIFICATION)]

If these class nodes compete with each other, maybe only one will win eventually and all others lose (winner-takes-all). The winner represents the computed classification of the input.

Winner-takes-all (WTA):
Among all competing nodes, only one will win and all
others will lose
We mainly deal with single winner WTA, but multiple
winners WTA are possible (and useful in some
applications)
Easiest way to realize WTA: have an external, central
arbitrator (a program) to decide the winner by
comparing the current outputs of the competitors (break
the tie arbitrarily)
This is biologically unsound (no such external
arbitrator exists in biological nerve system).

Ways to realize competition in NN


Lateral inhibition (Maxnet, Mexican hat)
the output of each node x_i feeds to the other nodes through inhibitory connections (with negative weights): w_ij, w_ji < 0

Resource competition
the output of node x_k is distributed to nodes x_i and x_j in proportion to w_ik and w_jk, as well as to x_i and x_j
self decay: w_ii, w_jj < 0
biologically sound

Fixed-weight Competitive Nets


Maxnet

Lateral inhibition between competitors

weights: w_ji = θ if j = i, w_ji = -ε otherwise (ε > 0)

node function:
f(x) = x if x > 0
f(x) = 0 otherwise

Notes:
Competition: iterative process until the net stabilizes (at most one node with positive activation)
0 < ε < 1/m, where m is the # of competitors
ε too small: takes too long to converge
ε too big: may suppress the entire network (no winner)

Fixed-weight Competitive Nets


Example

θ = 1, ε = 1/5 = 0.2
x(0) = (0.5, 0.9,   1,      0.9,   0.9)   initial input
x(1) = (0,   0.24,  0.36,   0.24,  0.24)
x(2) = (0,   0.072, 0.216,  0.072, 0.072)
x(3) = (0,   0,     0.1728, 0,     0)
x(4) = (0,   0,     0.1728, 0,     0) = x(3)
stabilized
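The iteration above can be sketched in a few lines (a minimal sketch, with θ = 1 and ε = 0.2 as in the example; the function and variable names are illustrative):

```python
def maxnet(x, eps=0.2, theta=1.0, max_iter=100):
    # Each node excites itself (weight theta) and inhibits every
    # competitor (weight -eps); iterate until activations stop changing.
    x = list(x)
    for _ in range(max_iter):
        nxt = [max(0.0, theta * xi - eps * (sum(x) - xi)) for xi in x]
        if nxt == x:
            break
        x = nxt
    return x

acts = maxnet([0.5, 0.9, 1.0, 0.9, 0.9])
# the third node (initial activation 1) survives at 0.1728; all others drop to 0
```

With ε above 1/m = 0.2 the mutual inhibition can drive every activation to zero, which is the "no winner" failure mode noted above.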

Mexican Hat
Architecture: For a given node,
close neighbors: cooperative (mutually excitatory , w > 0)
farther away neighbors: competitive (mutually inhibitory,w < 0)
too far away neighbors: irrelevant (w = 0)

Need a definition of distance (neighborhood):


one dimensional: ordering by index (1, 2, ..., n)
two dimensional: lattice

weights:
w_ij = c1 if distance(i, j) <= k1        (c1 > 0, cooperative)
w_ij = c2 if k1 < distance(i, j) <= k2   (c2 < 0, competitive)
w_ij = 0  if distance(i, j) > k2         (irrelevant)

activation function (ramp function):
f(x) = 0      if x < 0
f(x) = x      if 0 <= x <= x_max
f(x) = x_max  if x > x_max

example:
x(0) = (0.0, 0.5, 0.8, 1.0, 0.8, 0.5, 0.0)
x(1) = (0.0, 0.38, 1.06, 1.16, 1.06, 0.38, 0.0)
x(2) = (0.0, 0.39, 1.14, 1.66, 1.14, 0.39, 0.0)
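One update step can be sketched as follows. The slide does not give the constants; c1 = 0.6 (distance <= 1, including self), c2 = -0.4 (distance 2), and x_max = 2 are assumptions chosen because they reproduce the example's numbers to rounding:

```python
def mexican_hat_step(x, c1=0.6, c2=-0.4, k1=1, k2=2, x_max=2.0):
    # One synchronous update of a 1-D Mexican Hat net: cooperative weight c1
    # within distance k1, competitive weight c2 out to distance k2, 0 beyond,
    # followed by the ramp activation clipped to [0, x_max].
    n = len(x)
    out = []
    for i in range(n):
        s = 0.0
        for j in range(n):
            d = abs(i - j)
            if d <= k1:
                s += c1 * x[j]
            elif d <= k2:
                s += c2 * x[j]
        out.append(min(max(s, 0.0), x_max))
    return out

x0 = [0.0, 0.5, 0.8, 1.0, 0.8, 0.5, 0.0]
x1 = mexican_hat_step(x0)   # (0.0, 0.38, 1.06, 1.16, 1.06, 0.38, 0.0)
x2 = mexican_hat_step(x1)   # the center value rises toward 1.66, as on the slide
```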

Equilibrium:

negative input = positive input for all nodes


winner has the highest activation;
its cooperative neighbors also have positive activation;
its competitive neighbors have negative (or zero)
activations.

Hamming Network
Hamming distance of two vectors x and y of dimension n: the number of bits in disagreement.
In bipolar representation:
x^T y = Σ_i x_i y_i = a - d
where a is the number of bits in agreement in x and y,
and d is the number of bits differing in x and y.
d = n - a  (Hamming distance)
x^T y = 2a - n
a = 0.5(x^T y + n)
d = n - a = n - 0.5(x^T y + n) = -0.5(x^T y) + 0.5n
The (negative) distance between x and y can be determined from x^T y and n.

Hamming Network
Hamming network: the net computes -d between an input i and each of the P stored vectors i_1, ..., i_P of dimension n
n input nodes, P output nodes, one for each stored vector i_p, whose output = -d(i, i_p)
Weights and biases:

W = 0.5 (i_1, i_2, ..., i_P)^T,   b = -0.5 (n, n, ..., n)^T

Output of the net:

o = W i + b = 0.5 (i_1^T i - n, ..., i_P^T i - n)^T

where o_k = 0.5(i_k^T i - n) is the negative distance between i and i_k

Example:

Three stored vectors:
i1 = (-1, 1, 1, -1, -1)
i2 = (1, -1, -1, -1, -1)
i3 = (-1, 1, -1, 1, -1)
Input vector: i = (-1, -1, -1, 1, 1)
Distance: (4, 3, 2)
Output vector:
o1 = 0.5[i1^T i - 5] = 0.5[(1 - 1 - 1 - 1 - 1) - 5] = -4
o2 = 0.5[i2^T i - 5] = 0.5[(-1 + 1 + 1 - 1 - 1) - 5] = -3
o3 = 0.5[i3^T i - 5] = 0.5[(1 - 1 + 1 + 1 - 1) - 5] = -2

If we want the vector with the smallest distance to i to win, put a Maxnet on top of the Hamming net (for WTA).
We then have an associative memory: an input pattern recalls the stored vector that is closest to it (more on AM later).
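The computation can be sketched directly from the formula o_k = 0.5(i_k^T i - n); the stored and input vectors below are the example's, as reconstructed above:

```python
def hamming_net(stored, x):
    # o_k = 0.5 * (i_k . x - n) = -(Hamming distance between x and i_k)
    n = len(x)
    return [0.5 * (sum(a * b for a, b in zip(ik, x)) - n) for ik in stored]

stored = [(-1, 1, 1, -1, -1),
          (1, -1, -1, -1, -1),
          (-1, 1, -1, 1, -1)]
o = hamming_net(stored, (-1, -1, -1, 1, 1))   # [-4.0, -3.0, -2.0]
```

Feeding o into a Maxnet, as suggested next, selects the stored vector with the smallest distance, here i3.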

Simple Competitive Learning


Unsupervised learning
Goal:
Learn to form classes/clusters of exemplars/sample patterns according to similarities among these exemplars.
Patterns in a cluster would have similar features.
No prior knowledge as to which features are important for classification, or how many classes there are.

Architecture:
Output nodes:
Y_1, ..., Y_m,
representing the m classes
They are competitors
(WTA realized either by
an external procedure or
by lateral inhibition as in Maxnet)

Training:
Train the network such that the weight vector wj associated
with jth output node becomes the representative vector of a
class of similar input patterns.
Initially all weights are randomly assigned
Two phase unsupervised learning

competing phase:

apply an input vector i_l randomly chosen from the sample set
compute the output for all output nodes: o_j = i_l . w_j
determine the winner j* among all output nodes (the winner is not given in training samples, so this is unsupervised)

rewarding phase:
the winner is rewarded by updating its weights w_j* to be closer to i_l (weights associated with all other output nodes are not updated: a kind of WTA)

repeat the two phases many times (and gradually reduce the learning rate) until all weights are stabilized.
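The two phases can be sketched as follows (a minimal sketch; the two-cluster sample set, m = 2, the seed, and the decaying learning rate schedule are illustrative assumptions; the winner update is Method 1 of the next slide, w_j* += η(i_l - w_j*)):

```python
import random

def competitive_train(samples, m, eta=0.5, epochs=20, seed=0):
    rng = random.Random(seed)
    # initialize weights at randomly chosen samples (an assumption)
    w = [list(rng.choice(samples)) for _ in range(m)]
    for _ in range(epochs):
        for x in samples:
            # competing phase: winner = node whose weight vector is closest to x
            j = min(range(m), key=lambda k: sum((a - b) ** 2 for a, b in zip(x, w[k])))
            # rewarding phase: move only the winner toward x
            w[j] = [a + eta * (b - a) for a, b in zip(w[j], x)]
        eta *= 0.8  # gradually reduce the learning rate
    return w

samples = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
w = competitive_train(samples, m=2)  # one weight vector settles near each cluster
```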

Weight update: w_j ← w_j + Δw_j
Method 1: Δw_j = η (i_l - w_j)
Method 2: Δw_j = η i_l

[Figure: Method 1 moves w_j to w_j + η(i_l - w_j), along the direction of (i_l - w_j); Method 2 moves w_j to w_j + η i_l, along the direction of i_l]

In each method, w_j is moved closer to i_l.

Normalize the weight vector to unit length after it is updated: w_j ← w_j / ||w_j||
Sample input vectors are also normalized: i_l ← i_l / ||i_l||
Distance: ||i_l - w_j|| = sqrt(Σ_i (i_{l,i} - w_{j,i})^2)

w_j moves to the center of a cluster of sample vectors after repeated weight updates.
Node j wins for three training samples: i1, i2 and i3.
Initial weight vector wj(0).
After successive training by i1, i2 and i3, the weight vector changes to wj(1), wj(2), and wj(3).

[Figure: wj(0) is pulled toward i1, i2, i3 in turn, tracing wj(1), wj(2), wj(3)]

Examples
A simple example of competitive learning (pp. 168-170)
6 vectors of dimension 3 in 3 classes (6 input nodes, 3 output nodes)

Weight matrices:

Node A: for class {i2, i4, i5}


Node B: for class {i3}
Node C: for class {i1, i6}

Comments
1. Ideally, when learning stops, each w j is close to the
centroid of a group/cluster of sample input vectors.
2. To stabilize w_j, the learning rate may be reduced slowly toward zero during learning, e.g., η(t+1) = β η(t) with 0 < β < 1
3. # of output nodes:
too few: several clusters may be combined into one class
too many: over classification
ART model (later) allows dynamic add/remove output nodes

4. Initial w_j: results depend on the initial weights (node positions)
initialization options:
training samples known to be in distinct classes, provided such info is available
random (bad choices may cause anomaly)

5. Results also depend on the sequence of sample presentation

Example

[Figure: w1 lies between the two clusters, w2 far away from both]

w1 will always win, no matter which class the sample is from
w2 is stuck and will not participate in learning
unstuck:
let output nodes have some conscience
temporarily shut off nodes which have had a very high winning rate (hard to determine what rate should be considered very high)

Example

[Figure: different presentation sequences leave w1 and w2 in different final positions]

Results depend on the sequence of sample presentation
Solution:
Initialize wj to randomly selected input vectors that are far away from each other

Self-Organizing Maps (SOM) ( 5.5)


Competitive learning (Kohonen 1982) is a special case of
SOM (Kohonen 1989)
In competitive learning,
the network is trained to organize input vector space into
subspaces/classes/clusters
each output node corresponds to one class
the output nodes are not ordered: random map
[Figure: three clusters (cluster_1, cluster_2, cluster_3) in input space, with weight vectors w_2, w_3, w_1 mapped to them]

The topological order of the three clusters is 1, 2, 3
The order of their maps at the output nodes is 2, 3, 1
The map does not preserve the topological order of the training vectors

Topographic map
a mapping that preserves neighborhood relations
between input vectors, (topology preserving or feature
preserving).
if i1 and i2 are two neighboring input vectors (by some distance metric),
their corresponding winning output nodes (classes) i and j must also be close to each other in some fashion
one dimensional: line or ring; node i has neighbors i + 1 and i - 1 (mod n for a ring)

two dimensional: grid
rectangular: node (i, j) has neighbors (i, j ± 1), (i ± 1, j) (or additionally (i ± 1, j ± 1))
hexagonal: 6 neighbors

Biological motivation
Mapping two dimensional continuous inputs from
sensory organ (eyes, ears, skin, etc) to two dimensional
discrete outputs in the nerve system.
Retinotopic map: from eye (retina) to the visual cortex.
Tonotopic map: from the ear to the auditory cortex

These maps preserve topographic orders of input.


Biological evidence shows that the connections in these
maps are not entirely pre-programmed or pre-wired
at birth. Learning must occur after the birth to create
the necessary connections for appropriate topographic
mapping.

SOM Architecture
Two layer network:
Output layer:
Each node represents a class (of inputs)
Node function: o_j = i_l . w_j = Σ_k w_{j,k} i_{l,k}
Neighborhood relation is defined over these nodes
Nj(t): set of nodes within distance D(t) to node j.

Each node cooperates with all its neighbors and


competes with all other output nodes.
Cooperation and competition of these nodes can be
realized by Mexican Hat model
D = 0: all nodes are competitors (no cooperation): random map
D > 0: topology preserving map

Notes
1. Initial weights: small random values from (-e, e)
2. Reduction of η:
Linear: η(t+1) = η(t) - Δ
Geometric: η(t+1) = β η(t), where 0 < β < 1
3. Reduction of D: D(t + ΔT) = D(t) - 1 while D(t) > 0
Reduction of D should be much slower than reduction of η.
D can be a constant throughout the learning.
4. Effect of learning:
For each input i, not only is the weight vector of the winner j* pulled closer to i, but so are the weights of j*'s close neighbors (within the radius D).
5. Eventually, w_j becomes close (similar) to w_{j±1}. The classes they represent are also similar.
6. May need a large initial D in order to establish the topological order of all nodes

Notes
7. Find j* for a given input i_l: the node with minimum distance between w_j and i_l.
Distance: dist(w_j, i_l)^2 = Σ_{k=1}^n (i_{l,k} - w_{j,k})^2

Minimizing dist(w_j, i_l) can be realized by maximizing o_j = i_l . w_j = Σ_k w_{j,k} i_{l,k}, because
dist(w_j, i_l)^2 = Σ_{k=1}^n (i_{l,k}^2 + w_{j,k}^2 - 2 i_{l,k} w_{j,k})
= Σ_k i_{l,k}^2 + Σ_k w_{j,k}^2 - 2 Σ_k i_{l,k} w_{j,k}
= ||i_l||^2 + ||w_j||^2 - 2 i_l . w_j
The first two terms are fixed for normalized vectors, so minimizing the distance amounts to maximizing i_l . w_j.

Examples
A simple example of competitive learning (pp. 172-175)
6 vectors of dimension 3 in 3 classes, node ordering: B - A - C
Initialization: η = 0.5, weight matrix:
W(0) = [wA: (0.2, 0.7, 0.3); wB: (0.1, 0.1, 0.9); wC: (1, 1, 1)]
D(t) = 1 for the first epoch, = 0 afterwards

Training with i1 = (1.1, 1.7, 1.8):
determine the winner: squared Euclidean distance between i1 and each w_j:
d_{A,1}^2 = (1.1 - 0.2)^2 + (1.7 - 0.7)^2 + (1.8 - 0.3)^2 = 4.1
d_{B,1}^2 = 4.4,  d_{C,1}^2 = 1.1
C wins; since D(t) = 1, the weights of node C and its neighbor A are updated, but not wB
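This step can be checked directly (a minimal sketch; note the slide rounds d_{A,1}^2 = 4.06 to 4.1):

```python
W = {'A': [0.2, 0.7, 0.3], 'B': [0.1, 0.1, 0.9], 'C': [1.0, 1.0, 1.0]}
x = [1.1, 1.7, 1.8]
eta = 0.5

# squared Euclidean distance from x to each weight vector
d2 = {j: sum((xi - wi) ** 2 for xi, wi in zip(x, W[j])) for j in W}
winner = min(d2, key=d2.get)  # 'C'

# D(t) = 1 and node ordering B-A-C, so the winner C and its neighbor A move
for j in (winner, 'A'):
    W[j] = [wi + eta * (xi - wi) for wi, xi in zip(W[j], x)]
```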

Examples
W(0)  = [wA: (0.2, 0.7, 0.3);   wB: (0.1, 0.1, 0.9);    wC: (1, 1, 1)]
W(1)  = [wA: (0.65, 1.2, 1.05); wB: (0.1, 0.1, 0.9);    wC: (1.05, 1.35, 1.4)]
W(15) = [wA: (0.83, 0.77, 0.81); wB: (0.47, 0.23, 0.30); wC: (0.61, 0.95, 1.34)]

Observations:
Distances between weights of non-neighboring nodes (B, C) increase:
            W(0)   W(15)
|wA - wB|   0.85   1.28
|wB - wC|   1.28   1.75
|wC - wA|   1.22   0.80
Input vectors switch allegiance between nodes, especially in the early stage of training:
t        A        B            C
1-6      (3, 5)   (2, 4)       (1, 6)
7-12     (6)      (2, 4, 5)    (1, 3)
13-16    (6)      (2, 4, 5)    (1, 3)

How to illustrate a Kohonen map (for 2 dimensional patterns)

Input vectors: 2 dimensional
Output nodes: 1 dimensional line/ring, or 2 dimensional grid
Weight vectors are also 2 dimensional
Represent the topology of the output nodes by points on a 2 dimensional plane: plot each output node on the plane with its weight vector as its coordinates.
Connect neighboring output nodes by a line.
Example: output nodes C(1, 1), C(2, 1), C(1, 2) with weight vectors (0.5, 0.5), (0.7, 0.2), (0.9, 0.9) are plotted at those coordinates and connected by lines.

Illustration
examples

Input vectors are uniformly distributed in the region,


and randomly drawn from the region
Weight vectors are initially drawn from the same
region randomly (not necessarily uniformly)
Weight vectors become ordered according to the
given topology (neighborhood), at the end of training

Traveling Salesman Problem (TSP)


Given a road map of n cities, find the shortest tour which visits every city on the map exactly once and then returns to the starting city (a Hamiltonian circuit)
(Geometric version):
A complete graph of n vertices on a unit square.
Each city is represented by its coordinates (x_i, y_i)
n!/(2n) legal tours
Find one legal tour that is shortest

Approximating TSP by SOM


Each city is represented as a 2 dimensional input vector (its coordinates (x, y)).
Output nodes C_j form a SOM of a one dimensional ring: (C_1, C_2, ..., C_n, C_1).
Initially, C_1, ..., C_n have random weight vectors, so we don't know how these nodes correspond to individual cities.
During learning, a winner C_j on an input (x_i, y_i) of city i not only moves its w_j toward (x_i, y_i), but also those of its neighbors (w_(j+1), w_(j-1)).
As a result, C_(j-1) and C_(j+1) will later be more likely to win with input vectors similar to (x_i, y_i), i.e., those cities closer to city i.
At the end, if a node j represents city i, its neighbors j+1 and j-1 would end up representing cities similar to city i (i.e., cities close to city i).
This can be viewed as a concurrent greedy algorithm
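A minimal ring-SOM sketch of this idea (every parameter here — the learning rate, its decay, the fixed neighbor radius, and the four-city instance — is an illustrative assumption, not from the slides):

```python
import random

def som_tsp(cities, n_epochs=2000, seed=1):
    # one ring node per city; the winner and its two ring neighbors move toward
    # each presented city, so neighboring nodes end up at neighboring cities
    rng = random.Random(seed)
    n = len(cities)
    nodes = [[rng.random(), rng.random()] for _ in range(n)]
    eta = 0.8
    for _ in range(n_epochs):
        x, y = cities[rng.randrange(n)]
        j = min(range(n), key=lambda k: (nodes[k][0] - x) ** 2 + (nodes[k][1] - y) ** 2)
        for d in (-1, 0, 1):                 # winner and its ring neighbors
            k = (j + d) % n
            nodes[k][0] += eta * (x - nodes[k][0])
            nodes[k][1] += eta * (y - nodes[k][1])
        eta *= 0.999                         # decay the learning rate

    # read a tour off the ring: sort cities by their nearest node's ring index
    def nearest(c):
        return min(range(n), key=lambda k: (nodes[k][0] - c[0]) ** 2 + (nodes[k][1] - c[1]) ** 2)
    return sorted(range(n), key=lambda c: nearest(cities[c]))

cities = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9)]
tour = som_tsp(cities)
```

With one node per city and a fixed radius this is a toy; practical versions use more ring nodes than cities and shrink the neighborhood over time.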

Initial position

Two candidate solutions:


ADFGHIJBC
ADFGHIJCB

Convergence of SOM Learning


Objective of SOM: converge to an ordered map
Nodes are ordered if, for all nodes r, s, q

One-dimensional SOM
If the neighborhood relation satisfies certain properties, then there exists a sequence of input patterns that will lead the learning to converge to an ordered map
When another sequence is used, it may converge, but not necessarily to an ordered map

SOM learning can be viewed as having two phases:

Volatile phase: search for niches to move into
Sober phase: nodes converge to the centroids of their classes of inputs
Whether the right order can be established depends on the volatile phase

Convergence of SOM Learning


For multi-dimensional SOM
More complicated
No theoretical results

Example
4 nodes located at the 4 corners
Inputs are drawn from the region that is near the center of the square but slightly closer to w1
Node 1 will always win; w1, w0, and w2 will be pulled toward the inputs, but w3 will remain at the far corner
Nodes 0 and 2 are adjacent to node 3, but not to each other. However, this is not reflected in the distances of the weight vectors:
|w0 - w2| < |w3 - w2|

Extensions to SOM
Hierarchical maps:
Hierarchical clustering algorithm

Tree of clusters:
Each node corresponds to a cluster
Children of a node correspond to subclusters
Bottom-up: smaller clusters merged to higher level clusters

Simple SOM not adequate


When clusters have arbitrary shapes
It treats every dimension equally (spherical shape clusters)

Hierarchical maps
The first layer clusters similar training inputs
The second level combines these clusters into arbitrary shapes

Growing Cell Structure (GCS):


Dynamically changing the size of the network
Insert/delete nodes according to the signal count (# of
inputs associated with a node)

Node insertion
Let l be the node with the largest signal count.
Add a new node l_new when the signal count of l exceeds an upper bound.
Place l_new between l and l_far, where l_far is the farthest neighbor of l.
Neighbors of l_new include both l and l_far (and possibly other existing neighbors of l and l_far).

Node deletion
Delete a node (and its incident edges) if its signal count falls below a lower bound.
Nodes left with no neighbors are also deleted.

Example

Distance-Based Learning ( 5.6)


Which nodes will have the weights updated when an
input i is applied
Simple competitive learning: winner node only
SOM: winner and its neighbors
Distance-based learning: all nodes within a given distance
to i

Maximum entropy procedure

depends on the Euclidean distance |i - w_j|

Neural gas algorithm

depends on the distance rank

Maximum entropy procedure

T: artificial temperature, monotonically decreasing

Every node may have its weight vector updated
The learning rate for each node depends on its distance to the input:
Δw_j = η (exp(-|i - w_j|^2 / T) / Σ_l exp(-|i - w_l|^2 / T)) (i - w_j)
where the denominator is a normalization factor.
When T → 0, only the winner's weight vector w_j* is updated, because for any other node l
exp(-|i - w_l|^2 / T) / exp(-|i - w_j*|^2 / T) = exp(-(|i - w_l|^2 - |i - w_j*|^2) / T) → 0

Neural gas algorithm

Rank k_j(i, W): the # of nodes that have their weight vectors closer to the input vector i than w_j
Weight update depends on the rank: Δw_j = η h(k_j(i, W)) (i - w_j),
where h(x) is a monotonically decreasing function
for the highest ranking node: k_j*(i, W) = 0, h(0) = 1
for others: k_j(i, W) > 0, h < 1
e.g., the decay function h(x) = exp(-x/λ)
when λ → 0, winner takes all!
Better clustering results than many others (e.g., SOM, k-means, max entropy)
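One neural-gas step with the decay function h(k) = exp(-k/λ) can be sketched as follows (the weights, input, η and λ are illustrative):

```python
import math

def neural_gas_step(W, x, eta=0.5, lam=1.0):
    # rank k_j = number of weight vectors strictly closer to x than w_j
    d2 = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in W]
    ranks = [sum(1 for other in d2 if other < d) for d in d2]
    for w, k in zip(W, ranks):
        h = math.exp(-k / lam)   # h(0) = 1 for the winner, h < 1 for the rest
        for i, xi in enumerate(x):
            w[i] += eta * h * (xi - w[i])
    return ranks

W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ranks = neural_gas_step(W, [0.2, 0.1])   # ranks == [0, 1, 2]
```

Unlike simple competitive learning, every node moves on every step, just by rank-discounted amounts; as λ shrinks, the update degenerates to winner-takes-all.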

Example

Counter propagation network (CPN) ( 5.3)


Basic idea of CPN

Purpose: fast and coarse approximation of a vector mapping y = φ(x)

not to map any given x to its φ(x) with given precision;
input vectors x are divided into clusters/classes;
each cluster of x has one output y, which is (hopefully) the average of φ(x) for all x in that class.
Architecture: Simple case: FORWARD ONLY CPN

[Figure: input nodes x_1, ..., x_i, ..., x_n connect to hidden (class) nodes z_1, ..., z_k, ..., z_p through weights w_{k,i} (from input to hidden), and the hidden nodes connect to output nodes y_1, ..., y_j, ..., y_m through weights v_{j,k} (from hidden to output)]

Learning in two phases:

training sample (x, d), where d = φ(x) is the desired precise mapping

Phase 1: weights w_k coming into hidden nodes z_k are trained by competitive learning to become the representative vector of a cluster of input vectors x (use only x, the input part of (x, d)):
1. For a chosen x, feed forward to determine the winning z_k*
2. w_{k*,i}(new) = w_{k*,i}(old) + η (x_i - w_{k*,i}(old))
3. Reduce η, then repeat steps 1 and 2 until the stop condition is met
Phase 2: weights v_k going out of hidden nodes z_k are trained by the delta rule to become the average output of φ(x), where x is an input vector that causes z_k to win (use both x and d):
1. For a chosen x, feed forward to determine the winning z_k*
2. w_{k*,i}(new) = w_{k*,i}(old) + η (x_i - w_{k*,i}(old))  (optional)
3. v_{j,k*}(new) = v_{j,k*}(old) + α (d_j - v_{j,k*}(old))
4. Repeat steps 1-3 until the stop condition is met
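A minimal forward-only CPN sketch following the two phases (the sample mapping, node counts, learning rates, and the initialization at the first p sample inputs are illustrative assumptions):

```python
def train_cpn(samples, p, eta=0.3, alpha=0.3, epochs=50):
    # samples: list of (x, d) pairs, with d the desired output for x
    w = [list(samples[k][0]) for k in range(p)]          # hidden (class) weights
    v = [[0.0] * len(samples[0][1]) for _ in range(p)]   # hidden-to-output weights

    def winner(x):
        return min(range(p), key=lambda k: sum((a - b) ** 2 for a, b in zip(x, w[k])))

    for _ in range(epochs):          # phase 1: competitive learning on x only
        for x, _d in samples:
            k = winner(x)
            w[k] = [wi + eta * (xi - wi) for wi, xi in zip(w[k], x)]
    for _ in range(epochs):          # phase 2: delta rule, using d as well
        for x, d in samples:
            k = winner(x)
            v[k] = [vi + alpha * (di - vi) for vi, di in zip(v[k], d)]
    return w, v, winner

# two input clusters, each mapped to one constant output value
samples = [([0.0, 0.1], [1.0]), ([0.1, 0.0], [1.0]),
           ([1.0, 0.9], [0.0]), ([0.9, 1.0], [0.0])]
w, v, winner = train_cpn(samples, p=2)
```

After training, evaluating a new x means finding its winning hidden node and reading off that node's v vector, which is exactly the look-up-table behavior described in the notes below.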

Notes
A combination of unsupervised learning (for w_k in phase 1) and supervised learning (for v_k in phase 2).
After phase 1, clusters are formed among the sample inputs x; each w_k is a representative (average) of a cluster.
After phase 2, each cluster k maps to an output vector y, which is the average of φ(x) over x ∈ cluster_k.
Phase 2 learning follows the delta rule Δv_{j,k*} = α (d_j - v_{j,k*}), because with error
E = (d_j - v_{j,k*} z_{k*})^2,
∂E/∂v_{j,k*} = -2 (d_j - v_{j,k*} z_{k*}) z_{k*} = -2 (d_j - v_{j,k*})
when the winner's output z_{k*} = 1.
It can be shown that, when t → ∞,
w_k(t) → x̄ and v_k(t) → the average of φ(x) over the cluster,
where x̄ is the mean of all training samples that make k* win.

Shown only for w_k* (the proof for v_k* is similar). The weight update rule can be rewritten as
w_{k*,i}(t+1) = (1 - η) w_{k*,i}(t) + η x_i(t+1)
= (1 - η) [(1 - η) w_{k*,i}(t-1) + η x_i(t)] + η x_i(t+1)
= (1 - η)^2 w_{k*,i}(t-1) + η (1 - η) x_i(t) + η x_i(t+1)
= ...
= η [x_i(t+1) + x_i(t)(1 - η) + x_i(t-1)(1 - η)^2 + ... + x_i(1)(1 - η)^t]
(ignoring the vanishing initial-weight term).

If the x are drawn randomly from the training set, then
E[w_{k*,i}(t+1)] = η [E(x_i(t+1)) + (1 - η) E(x_i(t)) + ... + (1 - η)^t E(x_i(1))]
= η x̄_i [1 + (1 - η) + ... + (1 - η)^t]
→ η x̄_i (1 / (1 - (1 - η))) = x̄_i

After training, the network works like the look-up of a math table.

For any input x, find the region where x falls (represented by the winning z node);
use the region as the index to look up the table for the function value.
CPN works in multi-dimensional input space.
More cluster nodes (z): more accurate mapping.
Training is much faster than BP.
May have linear separability problem.

Full CPN

If both y = φ(x) and its inverse function x = φ^{-1}(y) exist, we can establish bi-directional approximation.
Two pairs of weight matrices:
W (x to z) and V (z to y) for the approximate map x to y = φ(x)
U (y to z) and T (z to x) for the approximate map y to x = φ^{-1}(y)
When a training sample (x, y) is applied (x on X and y on Y), they can jointly determine the winner z_k*, or determine z_k*(x) and z_k*(y) separately.

Adaptive Resonance Theory (ART) ( 5.4)


ART1: for binary patterns; ART2: for continuous patterns
Motivations: Previous methods have the following problems:
1. The number of class nodes is pre-determined and fixed:
under- and over-classification may result from training
some nodes may have empty classes
no control of the degree of similarity of inputs grouped in one class

2. Training is non-incremental:
with a fixed set of samples,
adding new samples often requires re-training the network with the enlarged training set until a new stable state is reached.

Ideas of ART model:


suppose the input samples have been appropriately classified
into k clusters (say by some fashion of competitive learning).
each weight vector w j is a representative (average) of all
samples in that cluster.
when a new input vector x arrives:
1. Find the winner j* among all k cluster nodes
2. Compare w_j* with x:
if they are sufficiently similar (x resonates with class j*), then update w_j* based on the overlap of x and w_j*
else, find/create a free class node and make x its first member.

To achieve these, we need:


a mechanism for testing and determining (dis)similarity
between x and w j*.
a control for finding/creating new class nodes.
need to have all operations implemented by units of
local computation.

Only the basic ideas are presented


Simplified from the original ART model
Some of the control mechanisms realized by various
specialized neurons are done by logic statements of the
algorithm

ART1 Architecture

x: input (input vectors)
y: output (classes)
b_{j,i}: bottom-up weights from x_i to y_j (real values)
t_{i,j}: top-down weights from y_j to x_i (binary/bipolar)
ρ: vigilance parameter for similarity comparison (0 < ρ <= 1)

Working of ART1
3 phases after each input vector x is applied
Recognition phase: determine the winner cluster
for x
Using bottom-up weights b
Winner j* with maximum y_j* = b_j* . x
x is tentatively classified to cluster j*
the winner may still be far away from x (e.g., the mismatch between t_j* and x may be unacceptably large)

Working of ART1 (3 phases)


Comparison phase:
Compute the similarity using top-down weights t:
vector s* = (s*_1, ..., s*_n), where s*_l = 1 if both t_{l,j*} and x_l are 1, and s*_l = 0 otherwise

If (# of 1s in s*) / (# of 1s in x) > ρ, accept the classification and update b_j* and t_j*
else: remove j* from further consideration, and look for another potential winner or create a new node with x as its first pattern.

Weight update/adaptation phase

Initial weights: (no bias)
bottom-up: b_{j,l}(0) = 1/(n+1);  top-down: t_{l,j}(0) = 1
When a resonance occurs with node j*, update b_{j*} and t_{j*}:
t_{j*,l}(new) = s*_l = t_{l,j*}(old) x_l
b_{j*,l}(new) = s*_l / (0.5 + Σ_{i=1}^n s*_i) = t_{l,j*}(old) x_l / (0.5 + Σ_{i=1}^n t_{i,j*}(old) x_i)

If k sample patterns are clustered to node j, then
t_j = the pattern whose 1s are common to all these k samples:
t_j(new) = t_j(0) ∧ x(1) ∧ x(2) ∧ ... ∧ x(k) = x(1) ∧ x(2) ∧ ... ∧ x(k)
b_{j,l}(new) = 0 iff s*_l = 0, i.e., iff x_l(i) = 0 for some sample i
b_j is a normalized t_j

Example

ρ = 0.7, n = 7
Input patterns:
x(1) = (1, 1, 0, 0, 0, 0, 1)
x(2) = (0, 0, 1, 1, 1, 1, 0)
x(3) = (1, 0, 1, 1, 1, 1, 0)
x(4) = (0, 0, 0, 1, 1, 1, 0)
x(5) = (1, 1, 0, 1, 1, 1, 0)
initially: t_{l,1}(0) = 1, b_{1,l}(0) = 1/8

for input x(1): node 1 wins
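The whole procedure can be sketched algorithmically (a simplified sketch with the control units replaced by program logic, as the slides suggest; an uncommitted node is created directly with t = x, which is equivalent to resonating with an all-ones t):

```python
def art1(patterns, rho):
    nodes = []        # per class node: [bottom-up weights b, top-down weights t]
    classes = []
    for x in patterns:
        accepted = None
        # recognition: try committed nodes in decreasing order of b . x
        order = sorted(range(len(nodes)),
                       key=lambda j: -sum(b * xl for b, xl in zip(nodes[j][0], x)))
        for j in order:
            t = nodes[j][1]
            s = [tl & xl for tl, xl in zip(t, x)]     # s = t AND x
            if sum(s) / sum(x) > rho:                 # comparison: vigilance test
                nodes[j] = [[sl / (0.5 + sum(s)) for sl in s], s]
                accepted = j
                break
        if accepted is None:                          # no resonance: new class node
            s = list(x)
            nodes.append([[sl / (0.5 + sum(s)) for sl in s], s])
            accepted = len(nodes) - 1
        classes.append(accepted)
    return classes, nodes

patterns = [(1, 1, 0, 0, 0, 0, 1), (0, 0, 1, 1, 1, 1, 0), (1, 0, 1, 1, 1, 1, 0),
            (0, 0, 0, 1, 1, 1, 0), (1, 1, 0, 1, 1, 1, 0)]
classes, nodes = art1(patterns, rho=0.7)   # classes == [0, 1, 1, 1, 2]
```

Note how x(4) shrinks node 1's top-down pattern to (0, 0, 0, 1, 1, 1, 0), after which x(5) fails the vigilance test against every node and is given its own class, the outlier behavior noted below.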

Notes
1. Classification as a search process
2. No two classes have the same b and t
3. Outliers that do not belong to any cluster will be assigned
separate nodes
4. Different ordering of sample input presentations may result
in different classification.
5. Increasing ρ increases the # of classes learned, and decreases the average class size.
6. Classification may shift during search, will reach stability
eventually.
7. There are different versions of ART1 with minor variations
8. ART2 is the same in spirit but different in details.

ART1 Architecture
[Figure: cluster layer F2 (y_1, ..., y_j, ..., y_m) fully connected to interface layer F1(b) (x_1, ..., x_i, ..., x_n) through top-down weights t_ji and bottom-up weights b_ij; input layer F1(a) (s_1, ..., s_i, ..., s_n) feeds F1(b) pair-wise; control units G1, G2, and R connect to the layers with excitatory (+) and inhibitory (-) links]

F1(a): input units
F1(b): interface units
F2: cluster units (unit y_j represents class j)
b_ij: bottom-up weights from x_i to y_j (real values)
t_ji: top-down weights from y_j to x_i (binary/bipolar)
F1(a) to F1(b): pair-wise connections
Between F2 and F1(b): full connection
R, G1, G2: control units

F2 cluster units: competitive; receive the input vector x through weights b to determine the winner j.
F1(a) input units: placeholders for external inputs
F1(b) interface units:
pass s to x as the input vector for classification by F2
compare x and t_j (the projection from winner y_j)
controlled by gain control unit G1
Nodes in both F1(b) and F2 obey the 2/3 rule (output 1 if two of the three inputs are 1)
Input to F1(b): s_i, G1, t_ji.  Input to F2: the signal from F1(b) through b_ij, G2, R.

The three phases need to be sequenced (by control units G1, G2, and R):

G1 = 1 if s ≠ 0 and y = 0, G1 = 0 otherwise
G1 = 1: F1(b) is open to receive s
G1 = 0: F1(b) is open for t_J
G2 = 1 if s ≠ 0, G2 = 0 otherwise
G2 = 1 signals the start of a new classification for a new input
R = 0 if the similarity ratio |x| / |s| >= ρ, R = 1 otherwise  (ρ: vigilance parameter)
R = 0: resonance occurs, update b_J and t_J
R = 1: fails the similarity test, inhibits J from further computation

Principal Component Analysis (PCA) Networks ( 5.8)


PCA: a statistical procedure
Reduce the dimensionality of input vectors
Too many features, some of them dependent on others
Extract important (new) features of the data which are functions of the original features
Minimize information loss in the process

This is done by forming new interesting features


As linear combinations of original features (first order of
approximation)
New features are required to be linearly independent (to
avoid redundancy)
New features are desired to be different from each other
as much as possible (maximum variability)

Linear Algebra
Two vectors x = (x_1, ..., x_n) and y = (y_1, ..., y_n) are said to be orthogonal to each other if
x . y = Σ_{i=1}^n x_i y_i = 0.
A set of vectors x(1), ..., x(k) of dimension n are said to be linearly independent of each other if there does not exist a set of real numbers a_1, ..., a_k, not all zero, such that
a_1 x(1) + ... + a_k x(k) = 0;
otherwise, these vectors are linearly dependent, and each one can be expressed as a linear combination of the others:
x(i) = -(a_1/a_i) x(1) - ... - (a_j/a_i) x(j) - ... - (a_k/a_i) x(k),  j ≠ i

Vector x ≠ 0 is an eigenvector of matrix A if there exists a constant λ such that Ax = λx
λ is called an eigenvalue of A (w.r.t. x)
A matrix A may have more than one eigenvector, each with its own eigenvalue
Eigenvectors of a matrix corresponding to distinct eigenvalues are linearly independent of each other

Matrix B is called the inverse matrix of matrix A if AB = I

I is the identity matrix
Denote B as A^{-1}
Not every matrix has an inverse (e.g., when one of the rows/columns can be expressed as a linear combination of the other rows/columns)

Every matrix A has a unique pseudo-inverse A*, which satisfies the following properties:

A A* A = A;   A* A A* = A*;   A* A = (A* A)^T;   A A* = (A A*)^T

Example of PCA: 3-dim x is transformed to 2-dim y

[Figure: a 3-d feature vector x, multiplied by the transformation matrix W, gives a 2-d feature vector y]

If the rows of W have unit length and are orthogonal (e.g., w1 . w2 = ap + bq + cr = 0), then W^T is the pseudo-inverse of W.

Generalization
Transform n-dim x to m-dim y (m < n); the transformation matrix W is an m x n matrix
Transformation: y = Wx
Opposite transformation: x' = W^T y = W^T W x
If W minimizes information loss in the transformation, then ||x - x'|| = ||x - W^T W x|| should also be minimized
If W^T is the pseudo-inverse of W, then x' = x: perfect transformation (no information loss)

How to find such a W for a given set of input vectors:

Let T = {x_1, ..., x_k} be a set of input vectors
Make them zero-mean vectors by subtracting the mean vector (Σ x_i)/k from each x_i
Compute the correlation matrix S(T) of these zero-mean vectors, which is an n x n matrix (the book calls it the covariance-variance matrix)

Find the m eigenvectors of S(T): w_1, ..., w_m, corresponding to the m largest eigenvalues λ_1, ..., λ_m
w_1, ..., w_m are the first m principal components of T
W, whose rows are w_1^T, ..., w_m^T, is the transformation matrix we are looking for
The m new features extracted by the transformation with W are linearly independent and have maximum variability
This is based on the spectral decomposition of the symmetric matrix S(T).
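The first principal component can be approximated without a full eigensolver by power iteration on S(T) (a minimal sketch; the 2-D data set is illustrative):

```python
def first_pc(data, iters=200):
    # zero-mean the data, form S = (1/k) Z^T Z, then power-iterate:
    # repeated multiplication by S converges to the eigenvector of the
    # largest eigenvalue, i.e. the first principal component
    k, n = len(data), len(data[0])
    mean = [sum(x[i] for x in data) / k for i in range(n)]
    z = [[x[i] - mean[i] for i in range(n)] for x in data]
    S = [[sum(row[i] * row[j] for row in z) / k for j in range(n)] for i in range(n)]
    w = [1.0] * n
    for _ in range(iters):
        w = [sum(S[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w

# points lying roughly along the direction (1, 1)
data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.05)]
w1 = first_pc(data)   # approximately (0.707, 0.707)
```

Further components can be found the same way after deflating S by the components already extracted; the PCA network below learns them directly from the samples instead.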

Example

Original 3-dimensional vectors transformed into 1 dimension:
y_1 = W_1 x_1 = (0.823, 0.541, 0.169) (0, 0.2, 0.7)^T = 0.101
y_2 = W_1 x_2 = 0.0677

Original 3-dimensional vectors transformed into 2 dimensions:
y_1 = W_2 x_1 = (0.1099, 0.1462)^T
y_2 = W_2 x_2 = (0.0677, 0.2295)^T

PCA network architecture

Input: vector x of n-dim
Output: vector y of m-dim
W: transformation matrix: y = Wx, x' = W^T y
Train W so that it can transform sample input vector x_l from an n-dim input to an m-dim output vector y_l.
The transformation should minimize information loss:
find W which minimizes
Σ_l ||x_l - x'_l|| = Σ_l ||x_l - W^T W x_l|| = Σ_l ||x_l - W^T y_l||
where x'_l is the opposite transformation of y_l = W x_l via W^T

Training W for a PCA net

Unsupervised learning: depends only on the input samples x_l
Error driven: ΔW depends on ||x_l - x'_l|| = ||x_l - W^T W x_l||
Start with randomly selected weights, then change W according to
ΔW = η (y_l x_l^T - y_l y_l^T W) = η y_l (x_l^T - y_l^T W) = η y_l (x_l - W^T y_l)^T
where y_l is a column vector, (x_l - W^T y_l)^T a row vector, and (x_l - W^T y_l) the transformation error.
This is only one of a number of suggested update rules (Williams).

Example (same sample inputs as in the previous example)

[Table: W after presenting x3, x4, x5, and after the second epoch]

eventually converging to the 1st PC (-0.823, -0.542, -0.169)

Notes

The PCA net approximates principal components (error may exist)
It obtains PCs by learning, without using statistical methods
Forced stabilization by gradually reducing η
Some suggestions to improve learning results:
instead of using the identity function for the output y = Wx, use a non-linear function S, then try to minimize the resulting reconstruction error
If S is differentiable, use a gradient descent approach
For example: S can be a monotonically increasing odd function, S(-x) = -S(x) (e.g., S(x) = x^3)
