Unsupervised learning
Introduction
Unsupervised learning
Training samples contain only input patterns
No desired output is given (teacher-less)
Introduction
NN models to be covered
Competitive networks and competitive learning
Winner-takes-all (WTA)
Maxnet
Hamming net
Counterpropagation nets
Adaptive Resonance Theory
Self-organizing map (SOM)
Applications
Clustering
Vector quantization
Feature extraction
Dimensionality reduction
Optimization
NN Based on Competition
Competition is important for NN
[Figure: input nodes x_1 … x_n (INPUT) fully connected to competing nodes C_1 … C_m (CLASSIFICATION)]
Winner-takes-all (WTA):
Among all competing nodes, only one will win and all
others will lose
We mainly deal with single winner WTA, but multiple
winners WTA are possible (and useful in some
applications)
Easiest way to realize WTA: have an external, central arbitrator (a program) decide the winner by comparing the current outputs of the competitors (breaking ties arbitrarily)
This is biologically unsound (no such external arbitrator exists in a biological nervous system).
Resource competition
the output of node k is distributed to nodes i and j in proportion to w_ik and w_jk, as well as to x_i and x_j
self decay
biologically sound
[Figure: nodes x_i and x_j compete for the output of node x_k via connections w_ik and w_jk; the lateral weights w_ij, w_ji and the self weights w_ii, w_jj are also shown]
Maxnet
weights: w_ji = θ if j = i, −ε otherwise (θ > 0, ε > 0)
node function: f(x) = x if x > 0, 0 otherwise
Notes:
Competition: iterative process until the net stabilizes (at most
one node with positive activation)
0 < ε < 1/m, where m is the # of competitors
ε too small: takes too long to converge
ε too big: may suppress the entire network (no winner)
θ = 1, ε = 1/5 = 0.2
x(0) = (0.5, 0.9, 1, 0.9, 0.9)
x(1) = (0, 0.24, 0.36, 0.24, 0.24)
x(2) = (0, 0.072, 0.216, 0.072, 0.072)
x(3) = (0, 0, 0.1728, 0, 0)
x(4) = (0, 0, 0.1728, 0, 0)
stabilized: node 3 is the only remaining winner
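A minimal Python/NumPy sketch of the Maxnet iteration used in the example above; the function name maxnet and the convergence check are illustrative, not from the slides:

```python
import numpy as np

def maxnet(x, eps=0.2, theta=1.0, max_iter=100):
    """Iterate the Maxnet until the activations stop changing.

    Each node keeps theta times its own activation and is inhibited
    by eps times every other node; f() clips negatives to zero.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        new_x = np.maximum(0.0, theta * x - eps * (x.sum() - x))
        if np.allclose(new_x, x):
            break
        x = new_x
    return x

# The example above: node 3 (initial activation 1.0) ends up the only winner.
print(maxnet([0.5, 0.9, 1.0, 0.9, 0.9]))   # -> [0. 0. 0.1728 0. 0.]
```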
Mexican Hat
Architecture: For a given node,
close neighbors: cooperative (mutually excitatory , w > 0)
farther away neighbors: competitive (mutually inhibitory,w < 0)
too far away neighbors: irrelevant (w = 0)
weights:
w_ij = c1  if distance(i, j) ≤ k1  (c1 > 0)
w_ij = c2  if k1 < distance(i, j) ≤ k2  (c2 < 0)
w_ij = c3 = 0  if distance(i, j) > k2
activation function (ramp function):
f(x) = 0  if x < 0
f(x) = x  if 0 ≤ x ≤ max
f(x) = max  if x > max
example:
x(0) = (0.0, 0.5, 0.8, 1.0, 0.8, 0.5, 0.0)
x(1) = (0.0, 0.38, 1.06, 1.16, 1.06, 0.38, 0.0)
x(2) = (0.0, 0.39, 1.14, 1.66, 1.14, 0.39, 0.0)
Equilibrium:
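Below is a hedged Python sketch of one Mexican Hat update for the example above. The slides do not list the constants; c1 = 0.6, c2 = -0.4, k1 = 1, k2 = 2 and the ramp ceiling x_max = 2 are assumed values that reproduce x(1) from x(0):

```python
import numpy as np

def mexican_hat_step(x, c1=0.6, c2=-0.4, k1=1, k2=2, x_max=2.0):
    """One Mexican Hat update on a line of nodes: cooperation (c1) from
    nodes within distance k1, competition (c2) from nodes within k2,
    then the ramp function clips the result to [0, x_max]."""
    n = len(x)
    new_x = np.zeros(n)
    for i in range(n):
        s = 0.0
        for j in range(n):
            d = abs(i - j)
            if d <= k1:
                s += c1 * x[j]
            elif d <= k2:
                s += c2 * x[j]
        new_x[i] = min(max(s, 0.0), x_max)   # ramp function
    return new_x

x = np.array([0.0, 0.5, 0.8, 1.0, 0.8, 0.5, 0.0])
print(mexican_hat_step(x))   # ~ (0.0, 0.38, 1.06, 1.16, 1.06, 0.38, 0.0)
```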
Hamming Network
Hamming distance of two vectors x and y of dimension n: the number of bits in disagreement.
In bipolar representation:
x^T y = Σ_i x_i y_i = a − d
where: a is the number of bits in agreement in x and y
d is the number of bits differing in x and y
d = n − a  (Hamming distance)
x^T y = a − d = 2a − n, so a = 0.5 (x^T y + n)
d = n − a = n − 0.5 (x^T y + n) = 0.5 n − 0.5 (x^T y)
so the (negative) distance between x and y can be determined from x^T y and n
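A quick numeric check of the relation d = 0.5 n − 0.5 x^T y; the two bipolar vectors are made-up illustrative values:

```python
import numpy as np

# Illustrative bipolar vectors (not from the slides).
x = np.array([ 1, -1,  1,  1, -1])
y = np.array([ 1,  1,  1, -1, -1])
n = len(x)

d_direct  = int(np.sum(x != y))        # bits in disagreement
d_formula = 0.5 * n - 0.5 * (x @ y)    # d = 0.5*n - 0.5*x^T y
print(d_direct, d_formula)             # 2 2.0
```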
Hamming Network
Hamming network: the net computes the (negative) Hamming distance between an input i and each of the P stored vectors i_1, …, i_P of dimension n
n input nodes, P output nodes, one for each stored vector i_p, whose output = −d(i, i_p)
Weights and bias:
W = (1/2) [ i_1^T ; i_2^T ; … ; i_P^T ]   (a P × n matrix),    b = −(n/2) (1, 1, …, 1)^T
Output: o = W i + b, i.e., o_j = 0.5 (i_j^T i − n) = −d(i, i_j)
Example:
Three stored vectors:
i_1 = (−1, 1, 1, −1, −1)
i_2 = (1, −1, 1, −1, 1)
i_3 = (−1, 1, −1, 1, −1)
Input vector: i = (−1, −1, −1, 1, 1)
Distances: (4, 3, 2)
Output vector:
o_1 = 0.5[(−1, 1, 1, −1, −1)·(−1, −1, −1, 1, 1) − 5] = 0.5[(1 − 1 − 1 − 1 − 1) − 5] = −4
o_2 = 0.5[(1, −1, 1, −1, 1)·(−1, −1, −1, 1, 1) − 5] = 0.5[(−1 + 1 − 1 − 1 + 1) − 5] = −3
o_3 = 0.5[(−1, 1, −1, 1, −1)·(−1, −1, −1, 1, 1) − 5] = 0.5[(1 − 1 + 1 + 1 − 1) − 5] = −2
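A small Python sketch of this Hamming net; the stored vectors and input are taken from the example above, and the variable names are illustrative:

```python
import numpy as np

# Stored vectors and input from the example above.
stored = np.array([[-1,  1,  1, -1, -1],
                   [ 1, -1,  1, -1,  1],
                   [-1,  1, -1,  1, -1]])
n = stored.shape[1]

W = 0.5 * stored                       # one weight row per stored vector
b = -0.5 * n * np.ones(len(stored))    # bias -n/2 at every output node

i = np.array([-1, -1, -1, 1, 1])
o = W @ i + b                          # o_j = 0.5*(i_j . i - n) = -d(i, i_j)
print(o)                               # [-4. -3. -2.]
```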
Architecture:
Output nodes: Y_1, …, Y_m, representing the m classes
They are competitors
(WTA realized either by
an external procedure or
by lateral inhibition as in Maxnet)
Training:
Train the network such that the weight vector wj associated
with jth output node becomes the representative vector of a
class of similar input patterns.
Initially all weights are randomly assigned
Two-phase unsupervised learning
competing phase: apply an input i_l; the output node j* whose weight vector is closest to i_l wins the competition
rewarding phase: the winner is rewarded by updating its weights w_j* to be closer to i_l (weights associated with all other output nodes are not updated: a kind of WTA)
repeat the two phases many times (and gradually reduce the learning rate) until all weights stabilize.
Weight update: w_j ← w_j + Δw_j
Method 1: Δw_j = η (i_l − w_j)
Method 2: Δw_j = η i_l, followed by normalization w_j ← w_j / ||w_j||
[Figure: vector diagrams of the two methods — Method 1 moves w_j to w_j + η(i_l − w_j) along the difference i_l − w_j; Method 2 moves w_j toward i_l + w_j and then normalizes]
In each method, w_j is moved closer to i_l
Distance used to determine the winner: Σ_i (i_{l,i} − w_{j,i})²
[Figure: successive updates move w_j(0) toward the inputs i_1, i_2, i_3, producing w_j(1), w_j(2), w_j(3)]
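A rough Python sketch of one competing + rewarding step using Method 1; the function name, learning rate, and random data are illustrative assumptions:

```python
import numpy as np

def competitive_step(W, x, eta=0.1):
    """One competing + rewarding step (Method 1).

    W holds one weight row per output node; only the winner
    (row closest to the input x) is moved toward x.
    """
    j_star = np.argmin(np.sum((W - x) ** 2, axis=1))   # competing phase
    W[j_star] += eta * (x - W[j_star])                 # rewarding phase
    return j_star

# Illustrative run: 3 output nodes, 3-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.random((3, 3))
for x in rng.random((20, 3)):
    competitive_step(W, x)
print(W)
```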
Examples
A simple example of competitive learning (pp. 168-170)
6 vectors of dimension 3 in 3 classes (3 input nodes, 3 output nodes)
Weight matrices:
Comments
1. Ideally, when learning stops, each w j is close to the
centroid of a group/cluster of sample input vectors.
2. To stabilize w_j, the learning rate η may be reduced slowly toward zero during learning, e.g., η(t+1) = β η(t) with 0 < β < 1
3. # of output nodes:
too few: several clusters may be combined into one class
too many: over-classification
ART model (later) allows dynamic add/remove output nodes
4. Initial weights also affect the result (see the examples below)
Example
[Figure: sample inputs and the final positions of w1 and w2 for one presentation order]
Example
Results depend on the sequence of sample presentation
[Figure: sample inputs and the final positions of w1 and w2 for a different presentation order]
Solution:
Initialize wj to randomly
selected input vectors that
are far away from each other
[Figure: weight vectors w_1, w_2, w_3 initialized near input vectors from well-separated clusters (cluster_1, cluster_2, cluster_3)]
Topographic map
a mapping that preserves neighborhood relations
between input vectors, (topology preserving or feature
preserving).
if i1 and i2 are two neighboring input vectors (by some distance metric), their corresponding winning output nodes (classes), i and j, must also be close to each other in some fashion
one-dimensional: line or ring; node i has neighbors i ± 1 (or (i ± 1) mod n for a ring)
hexagonal: 6 neighbors
Biological motivation
Mapping two-dimensional continuous inputs from sensory organs (eyes, ears, skin, etc.) to two-dimensional discrete outputs in the nervous system.
Retinotopic map: from the eye (retina) to the visual cortex
Tonotopic map: from the ear to the auditory cortex
SOM Architecture
Two layer network:
Output layer:
Each node represents a class (of inputs)
Node function: o_j = i_l · w_j = Σ_k w_{j,k} i_{l,k}
Neighborhood relation is defined over these nodes
N_j(t): set of nodes within distance D(t) of node j
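A hedged Python sketch of one SOM update on a one-dimensional (line) map, assuming the distance-based winner rule and the N_j(t) neighborhood described above; all names and constants are illustrative:

```python
import numpy as np

def som_step(W, x, eta, D):
    """One SOM update on a one-dimensional (line) map.

    The winner j* and every node within map distance D of it
    are pulled toward the input x.
    """
    j_star = np.argmin(np.sum((W - x) ** 2, axis=1))   # winner
    for j in range(len(W)):
        if abs(j - j_star) <= D:                       # j in N_{j*}(t)
            W[j] += eta * (x - W[j])
    return j_star

rng = np.random.default_rng(1)
W = rng.uniform(-0.1, 0.1, size=(5, 3))   # small random initial weights
eta, D = 0.5, 2
for t, x in enumerate(rng.random((100, 3))):
    som_step(W, x, eta, D)
    eta *= 0.98                           # geometric reduction of eta
    if t in (30, 60):                     # much slower reduction of D
        D = max(D - 1, 0)
```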
Notes
1. Initial weights: small random values from (-e, e)
2. Reduction of η:
Linear: η(t+1) = η(t) − Δη
Geometric: η(t+1) = β η(t), where 0 < β < 1
3. Reduction of D: D(t + ΔT) = D(t) − 1 while D(t) > 0
D's reduction should be much slower than η's reduction
D can be a constant throughout the learning
4. Effect of learning
For each input i, not only the weight vector of the winner j* is pulled closer to i, but also the weights of j*'s close neighbors (within the radius D).
5. Eventually, w_j becomes close (similar) to w_{j±1}. The classes they represent are also similar.
6. May need a large initial D in order to establish the topological order of all nodes
Notes
7. Find j* for a given input i_l: the node with minimum distance between w_j and i_l
Distance: dist²(w_j, i_l) = Σ_{k=1}^{n} (i_{l,k} − w_{j,k})²
= Σ_{k=1}^{n} i_{l,k}² + Σ_{k=1}^{n} w_{j,k}² − 2 Σ_{k=1}^{n} i_{l,k} w_{j,k}
= Σ_{k=1}^{n} i_{l,k}² + Σ_{k=1}^{n} w_{j,k}² − 2 i_l · w_j
so, if all weight vectors have the same length (e.g., are normalized), minimizing this distance is equivalent to maximizing the node output o_j = i_l · w_j = Σ_k w_{j,k} i_{l,k}
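A small numeric check of this equivalence (illustrative random data; weight rows normalized to unit length):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-length weight rows
i_l = rng.normal(size=3)

j_by_dist = np.argmin(np.sum((W - i_l) ** 2, axis=1))   # minimum distance
j_by_dot  = np.argmax(W @ i_l)                          # maximum o_j
print(j_by_dist == j_by_dot)                            # True
```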
Examples
A simple example of competitive learning (pp. 172-175)
6 vectors of dimension 3 in 3 classes, node ordering: B A C
Initialization: η = 0.5, weight matrix:
W(0):  wA = (0.2, 0.7, 0.3),  wB = (0.1, 0.1, 0.9),  wC = (1, 1, 1)
Distances between i1 and the weight vectors: d²_{B,1} = 4.4, d²_{C,1} = 1.1
C wins; since D(t) = 1, the weights of node C and its neighbor A are updated, but not wB
Examples
W(0):  wA = (0.2, 0.7, 0.3),  wB = (0.1, 0.1, 0.9),  wC = (1, 1, 1)
Observations:
Distance between weights of non-neighboring nodes (B, C) increases
Input vectors switch
allegiance between nodes,
especially in the early stage
of training
Distances between weight vectors:
              W(0)    W(15)
|wA − wB|     0.85    1.28
|wB − wC|     1.28    1.75
|wC − wA|     1.22    0.80

Winning node for each input vector over time:
        t = 1-6    t = 7-12    t = 13-16
A       (3, 5)     (6)         (6)
B       (2, 4)     (2, 4, 5)   (2, 4, 5)
C       (1, 6)     (1, 3)      (1, 3)
Illustration examples
[Figure: initial position of the map nodes]
One-dimensional SOM
If the neighborhood relation satisfies certain properties, then there exists a sequence of input patterns that will lead the learning to converge to an ordered map
When another sequence is used, it may converge, but not necessarily to an ordered map
Example
4 nodes located at 4 corners
Inputs are drawn from the region that is near the
center of the square but slightly closer to w1
Node 1 will always win, w1, w0, and w2 will be
pulled toward inputs, but w3 will remain at the
far corner
Nodes 0 and 2 are adjacent to node 3, but not to
each other. However, this is not reflected in the
distances of the weight vectors:
|w0 − w2| < |w3 − w2|
Extensions to SOM
Hierarchical maps:
Hierarchical clustering algorithm
Tree of clusters:
Each node corresponds to a cluster
Children of a node correspond to subclusters
Bottom-up: smaller clusters merged to higher level clusters
Hierarchical maps
First layer clusters similar training inputs
Second level combines these clusters into clusters of arbitrary shape
Node insertion
Let l be the node with the largest .
Add new node lnew when l > upper bound
Place lnew in between l and lfar, where lfar is the farthest
neighbor of l,
Neighbors of lnew include both l and lfar (and possibly other
existing neighbors of l and lfar ).
Node deletion
Delete a node (and its incident edges) if its < lower bound
Nodes with no other neighbors are also deleted.
Example
Every w_j is updated for each input i, in proportion to exp(−|i − w_j|²/T) / Σ_l exp(−|i − w_l|²/T), where the denominator is a normalization factor
When T → 0, only the winner's weight vector w_j* is updated, because, for any other node l,
exp(−|i − w_l|²/T) / exp(−|i − w_j*|²/T) = exp(−(|i − w_l|² − |i − w_j*|²)/T) → 0
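A rough Python sketch of this soft, temperature-controlled competition; the function name, eta, and T are illustrative assumptions:

```python
import numpy as np

def soft_competitive_step(W, x, eta=0.1, T=0.5):
    """Every node moves toward x, weighted by a normalized
    exp(-|x - w_l|^2 / T) factor; small T approaches pure WTA."""
    d2 = np.sum((W - x) ** 2, axis=1)
    h = np.exp(-d2 / T)
    h /= h.sum()                      # normalization factor
    W += eta * h[:, None] * (x - W)
    return h

rng = np.random.default_rng(3)
W = rng.random((4, 2))
for x in rng.random((50, 2)):
    soft_competitive_step(W, x, eta=0.1, T=0.2)
```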
Counterpropagation network (CPN)
[Figure: input nodes x_1, …, x_i, …, x_n feed hidden (class) nodes z_1, …, z_k, …, z_p through weights w_k,i (from input to hidden); hidden nodes feed output nodes y_1, …, y_j, …, y_m through weights v_j,k (from hidden (class) to output)]
Notes
A combination of both unsupervised learning (for wk in phase 1) and
supervised learning (for vk in phase 2).
After phase 1, clusters are formed among the sample inputs x; each w_k is a representative (average) of a cluster.
After phase 2, each cluster k maps to an output vector y, which is the average of the target outputs {d(x) : x ∈ cluster_k}
View phase 2 learning as following the delta rule:
Δv_{j,k*} = η (d_j − v_{j,k*}), where d_j is the desired output, because
∂E/∂v_{j,k*} = ∂(d_j − v_{j,k*} z_{k*})² / ∂v_{j,k*} = −2 (d_j − v_{j,k*} z_{k*}) z_{k*}
which reduces to −2 (d_j − v_{j,k*}) since the winner's output z_{k*} = 1
It can be shown that, when t → ∞,
w_k(t) → the mean of x, and v_k(t) → the mean of d(x),
taken over all training samples x that make k* win
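A rough Python sketch of the two CPN training phases described above; the data, layer sizes, and learning rate are illustrative assumptions, and the winner is found by minimum distance:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((50, 4))          # sample inputs
D = rng.random((50, 2))          # desired outputs
W = rng.random((3, 4))           # input -> hidden (class) weights w_k
V = np.zeros((2, 3))             # hidden -> output weights v_{j,k}
eta = 0.3

# Phase 1: unsupervised competitive learning on W.
for x in X:
    k = np.argmin(np.sum((W - x) ** 2, axis=1))
    W[k] += eta * (x - W[k])

# Phase 2: supervised delta-rule learning on the winner's column of V.
for x, d in zip(X, D):
    k = np.argmin(np.sum((W - x) ** 2, axis=1))
    V[:, k] += eta * (d - V[:, k])
```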
Full CPN
1
2. Training is non-incremental: it works with a fixed set of samples; adding new samples often requires re-training the network on the enlarged training set until a new stable state is reached.
ART1 Architecture
Working of ART1
3 phases after each input vector x is applied
Recognition phase: determine the winner cluster
for x
Using bottom-up weights b
Winner j*: the node with maximum y_j = b_j · x
x is tentatively classified to cluster j*
the winner may be far away from x (e.g., |tj* - x| is
unacceptably large)
Weight update for the winner j* (on resonance):
t_{j*,l}(new) = t_{j*,l}(old) x_l
b_{j*,l}(new) = t_{j*,l}(old) x_l / (0.5 + Σ_{i=1}^{n} t_{j*,i}(old) x_i)
Example
ρ = 0.7, n = 7
Input patterns:
x(1) = (1, 1, 0, 0, 0, 0, 1)
x(2) = (0, 0, 1, 1, 1, 1, 0)
x(3) = (1, 0, 1, 1, 1, 1, 0)
x(4) = (0, 0, 0, 1, 1, 1, 0)
x(5) = (1, 1, 0, 1, 1, 1, 0)
initially: t_{l,1}(0) = 1, b_{1,l}(0) = 1/8
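A rough Python sketch of ART1 as described above (recognition by bottom-up weights, a vigilance/similarity test, fast-learning update with the 0.5 denominator); the function name and control flow are illustrative, not the slides' algorithm verbatim:

```python
import numpy as np

def art1(patterns, rho=0.7):
    n = len(patterns[0])
    T = [np.ones(n)]                    # top-down weights, one row per cluster
    B = [np.ones(n) / (1 + n)]          # bottom-up weights, initially 1/(1+n)
    labels = []
    for x in patterns:
        candidates = list(range(len(T)))
        while True:
            j = max(candidates, key=lambda k: B[k] @ x)   # recognition phase
            match = (T[j] * x).sum() / x.sum()            # similarity test
            if match >= rho:                              # resonance: update
                T[j] = T[j] * x
                B[j] = T[j] / (0.5 + T[j].sum())
                labels.append(j)
                break
            candidates.remove(j)                          # search continues
            if not candidates:                            # outlier: new node
                T.append(x.copy())
                B.append(x / (0.5 + x.sum()))
                labels.append(len(T) - 1)
                break
    return labels

X = np.array([[1, 1, 0, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1, 0],
              [1, 0, 1, 1, 1, 1, 0], [0, 0, 0, 1, 1, 1, 0],
              [1, 1, 0, 1, 1, 1, 0]], dtype=float)
print(art1(X, rho=0.7))
```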
Notes
1. Classification as a search process
2. No two classes have the same b and t
3. Outliers that do not belong to any cluster will be assigned
separate nodes
4. Different ordering of sample input presentations may result
in different classification.
5. Increasing the vigilance ρ increases the # of classes learned and decreases the average class size.
6. Classification may shift during search, will reach stability
eventually.
7. There are different versions of ART1 with minor variations
8. ART2 is the same in spirit but different in details.
ART1 Architecture
[Figure: ART1 architecture — input layer F1(a) with nodes s_1 … s_i … s_n, interface layer F1(b) with nodes x_1 … x_i … x_n, cluster layer F2 with nodes y_1 … y_j … y_m, bottom-up weights b_ij and top-down weights t_ji between F1(b) and F2, plus gain-control units G1 and G2 with excitatory (+) connections]
G1 = 1 if s ≠ 0 and y = 0, 0 otherwise
G1 = 1: F1(b) open to receive the input s
G1 = 0: F1(b) open to receive the top-down signal t_J
G2 = 1 if s ≠ 0, 0 otherwise
G2 = 1 signals the start of a new classification for a new input
R = 0 if ||x|| / ||s|| ≥ ρ, 1 otherwise, where 0 < ρ ≤ 1 is the vigilance parameter
R = 0: resonance occurs, update b_J and t_J
R = 1: fails the similarity test, inhibit J from further computation
Linear Algebra
Two vectors x = (x_1, …, x_n) and y = (y_1, …, y_n) are said to be orthogonal to each other if
x · y = Σ_{i=1}^{n} x_i y_i = 0.
A set of vectors x^(1), …, x^(k) of dimension n are said to be linearly independent of each other if there does not exist a set of real numbers a_1, …, a_k, not all zero, such that
a_1 x^(1) + … + a_k x^(k) = 0
otherwise, these vectors are linearly dependent and each one can be expressed as a linear combination of the others:
x^(i) = −(a_1/a_i) x^(1) − … − (a_j/a_i) x^(j) − … − (a_k/a_i) x^(k),  j ≠ i
A*A = (A*A)T;
AA* = (AA*)T
[Figure: a 3-d feature vector is transformed by matrix W into a 2-d feature vector]
If the rows of W have unit length and are orthogonal (e.g., w1 · w2 = ap + bq + cr = 0), then
W^T is a pseudo-inverse of W
Generalization
Transform n-dim x to m-dim y (m < n); the transformation matrix W is an m × n matrix
Transformation: y = W x
Opposite transformation: x' = W^T y = W^T W x
If W minimizes information loss in the transformation, then ||x − x'|| = ||x − W^T W x|| should also be minimized
If W^T is the pseudo-inverse of W, then x' = x: perfect transformation (no information loss)
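A hedged Python sketch of this transform-and-reconstruct idea; choosing W from the top principal directions (via SVD) is an assumption made here only to obtain orthonormal rows that keep ||x − W^T W x|| small:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))                 # illustrative 3-d samples

# Rows of W = the two leading principal directions (orthonormal).
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
W = Vt[:2]                                    # 2 x 3 transformation matrix

Y = X @ W.T                                   # y = W x for every sample
X_back = Y @ W                                # x' = W^T y = W^T W x
err = np.linalg.norm(X - X_back, axis=1)      # ||x - W^T W x||
print(np.round(W @ W.T, 6))                   # identity: rows are orthonormal
print(err.mean())
```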
Example
[Table: the weight row vector W, the transformed value, and the error after presenting x3, x4, x5, and after the second epoch]
eventually converging to the 1st PC (−0.823, −0.542, −0.169)
Notes