
Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009

Analysis of Hyperspectral Data with Diffusion Maps and Fuzzy ART
Rui Xu, Louis du Plessis, Steven Damelin, Michael Sears, and Donald C. Wunsch II
Abstract-The presence of large amounts of data in hyperspectral images makes further tractable analyses very difficult. Here, we present a method for analyzing real hyperspectral data through dimensionality reduction with diffusion maps. Diffusion maps interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set in order to obtain an efficient representation of the data's geometric descriptions. A neural network clustering algorithm, Fuzzy ART, is then applied to the reduced data to form clusters of the potential minerals. Experimental results on a subset of hyperspectral core imager data show that the proposed methods are promising in addressing complicated hyperspectral data and identifying the minerals in core samples.
I. INTRODUCTION
THE advent of new spectral imaging techniques has made hyperspectral data more commonly accessible in recent decades. Spectral imaging refers to the process of
decades. Spectral imaging refers to the process of
sampling an image at several different frequencies [1]. Digital
color photography is a form of spectral imaging. The picture
is sampled at three different frequencies, one in the blue range
of the spectrum and the other two in the red and green ranges,
respectively. Every sample gives a matrix of intensity values,
which could be plotted to give a gray-scale image. When all
three matrices are blended together a color image is produced.
Multispectral imaging refers to the process of sampling an
image at more frequencies. Images are commonly sampled in
the frequency range between 400 and 2,500 nm, since this is
the range of the optical spectrum where the sun provides
useful illumination [1]. In hyperspectral imaging the image is
sampled at hundreds of frequencies, compared to only a few
for multispectral imaging. Furthermore, the different
frequencies in multispectral imaging are usually distributed in
an irregular fashion, whereas the bands in hyperspectral
imaging are regularly spaced [1].
Manuscript received January 5, 2009.
R. Xu is with the Applied Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (phone: 573-341-6811; fax: 573-341-4521; e-mail: rxu@mst.edu).
L. du Plessis is with the School of Computational and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa (e-mail: Laduplessis@gmail.com).
S. Damelin is with the Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460 USA, and the School of Computational and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa (e-mail: damelin@georgiasouthern.edu).
M. Sears is with the School of Computer Science, University of the Witwatersrand, Johannesburg, South Africa (e-mail: michael.sears@wits.ac.za).
D. C. Wunsch II is with the Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO 65409 USA (e-mail: dwunsch@mst.edu).
978-1-4244-3553-1/09/$25.00 ©2009 IEEE
Fig. 1. The hyperspectral data cube. For every one of the m frequencies sampled, there is a p × q image of intensity values. Similarly, for every one of the pixels in the image, there is a complete spectrum of values. This cube image is generated with the data used in our studies.
Because of the regular spacing of narrow bands, a
continuous spectrum can be drawn for every pixel in the
image. Instead of ending up with a flat two-dimensional
matrix of values, we obtain a "hypercube" of data, as shown
in Fig. 1. This is where the problem in analyzing and storing
hyperspectral data comes in. Having more than a hundred
bands for every pixel means having enormous amounts of
data. If the hyperspectral imager scans in m bands, then every
pixel of the hyperspectral image can be seen as an m
dimensional vector.
Currently, hyperspectral imaging is mainly used in airborne surveillance techniques [2]. Some uses for hyperspectral imaging include crop assessment, environmental applications, and mineral exploitation [3].
Here, we are interested in the identification of minerals in
core samples. Core samples are long pieces of rock that are drilled in areas suspected of being rich in minerals. Sites for
new mines are identified using data collected from core
samples. It would be advantageous to automatically identify
the different minerals resident in a piece of core. AngloGold Ashanti has constructed the Hyperspectral Core Imager (HCI) to scan in core samples. The HCI scans in 5 meters of core
every hour, at 400 bands, producing 2 gigabytes of data every
hour [2]. With this amount of data being produced every hour,
be the degree of x_i; the Markov, or transition, matrix P is then constructed by calculating each entry as

p(x_i, x_j) = w(x_i, x_j) / d(x_i).   (3)
From the definition of the weight function, p(x_i, x_j) can be interpreted as the transition probability from x_i to x_j in one time step. From the definition of the Gaussian kernel, it can be seen that the transition probability will be high for similar elements. This idea can be further extended by considering p_t(x_i, x_j), the entries of the t-th power P^t of P, as the probabilities of transition from x_i to x_j in t time steps [4]. Therefore, the parameter t defines the granularity of the analysis. As the value of t increases, local geometric information of the data is integrated over larger scales. Varying t thus makes it possible to control the generation of more specific or broader clusters.
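As an illustration of the construction above, the matrices W, d, and P of Eqs. (1)-(3) and the t-step transition matrix P^t can be sketched in a few lines of NumPy. This is a generic sketch, not the authors' code; the kernel exponent form and the values of σ and t are placeholders:

```python
import numpy as np

def transition_matrix(X, sigma):
    """Markov matrix P of Eq. (3) from a Gaussian affinity, Eqs. (1)-(2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma ** 2)        # affinity matrix W, Eq. (1)
    d = W.sum(axis=1)                   # node degrees d(x_i), Eq. (2)
    return W / d[:, None]               # row-normalized P, Eq. (3)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))           # toy data: N = 50 points in 10 dimensions
P = transition_matrix(X, sigma=5.0)
Pt = np.linalg.matrix_power(P, 3)       # t = 3: three-step transition probabilities
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution
assert np.allclose(Pt.sum(axis=1), 1.0) # powers of P remain row-stochastic
```

Because P is row-stochastic, so is every power P^t, which is what licenses the t-step transition-probability reading of its entries.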
Because of the symmetry property of the kernel function, for each t ≥ 1, we may obtain a sequence of N eigenvalues of P, 1 = λ_0 ≥ λ_1 ≥ ... ≥ λ_N, with the corresponding eigenvectors {φ_j, j = 1, ..., N}, satisfying

P^t φ_j = λ_j^t φ_j.   (4)
Using the eigenvectors as a new set of coordinates on the data set, the mapping from the original data space to an L-dimensional (L < m) Euclidean space ℜ^L can be defined as

Ψ_t : x_i → (λ_1^t φ_1(x_i), ..., λ_L^t φ_L(x_i))^T.   (5)
Correspondingly, the diffusion distance between a pair of points x_i and x_j,

D_t(x_i, x_j) = ||p_t(x_i, ·) − p_t(x_j, ·)||_{1/φ_0},   (6)

where φ_0 is the unique stationary distribution

φ_0(x) = d(x) / Σ_{x_i ∈ X} d(x_i),   x ∈ X,   (7)

is approximated with the Euclidean distance in ℜ^L, written as

D_t(x_i, x_j) = ||Ψ_t(x_i) − Ψ_t(x_j)||.   (8)
It can be seen that the more paths that connect two points in
the graph, the smaller the diffusion distance is.
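Under these definitions, computing the embedding of Eq. (5) amounts to an eigendecomposition of P. The following NumPy sketch (our illustration, with placeholder data and kernel width, not the authors' implementation) embeds a toy two-cluster data set and checks that the embedded Euclidean distances of Eq. (8) separate the clusters:

```python
import numpy as np

def diffusion_embed(P, L, t=1):
    """Map x_i to (lambda_1^t phi_1(x_i), ..., lambda_L^t phi_L(x_i)), Eq. (5)."""
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)                  # 1 = lambda_0 >= lambda_1 >= ...
    vals, vecs = vals.real[order], vecs.real[:, order]
    return (vals[1:L + 1] ** t) * vecs[:, 1:L + 1]  # drop the trivial pair (lambda_0, phi_0)

# Two well-separated blobs: diffusion distance within a blob should be small.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.5, size=(30, 5))
B = rng.normal(8.0, 0.5, size=(30, 5))
X = np.vstack([A, B])
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
W = np.exp(-sq / 4.0 ** 2)
P = W / W.sum(axis=1, keepdims=True)
psi = diffusion_embed(P, L=2, t=1)
within = np.linalg.norm(psi[0] - psi[1])            # two points in the same blob
across = np.linalg.norm(psi[0] - psi[30])           # points in different blobs
assert across > within
```

Points connected by many high-probability paths end up close in the embedded space, which is exactly the intuition stated above.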
The kernel width parameter σ represents the rate at which the similarity between two points decays. There is no good theory to guide the choice of σ. Several heuristics have been proposed, and they boil down to trading off the sparseness of the kernel matrix (small σ) against an adequate characterization of the true affinity of two points. One of the main reasons for using spectral clustering methods is that, with sparse kernel matrices, long-range affinities are accommodated through the chaining of many local interactions, as opposed to standard Euclidean distance methods, e.g., correlation, which impute global influence into each pairwise affinity metric, making long-range interactions dominate local interactions.
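One widely used heuristic, offered here only as an example of such a trade-off rather than as the choice made in this paper, sets σ to the median pairwise distance, so that a typical pair of points has affinity exp(−1):

```python
import numpy as np

def median_sigma(X):
    """Median pairwise Euclidean distance, a common default for the kernel width."""
    d = np.sqrt(np.sum((X[:, None] - X[None]) ** 2, axis=-1))
    i, j = np.triu_indices(len(X), k=1)    # distinct pairs only
    return np.median(d[i, j])

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
sigma = median_sigma(X)
assert sigma > 0
assert np.isclose(median_sigma(2 * X), 2 * sigma)  # scales with the data
```

Smaller quantiles of the pairwise distances give sparser kernel matrices; larger quantiles give denser ones, tracing out the trade-off described above.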
It is apparent from the preceding discussion that the most costly part of the diffusion map is the construction of the affinity matrix, since its size is quadratic in the amount of data. However, this matrix is symmetric, and all the diagonal entries are equal to one. This means that only (N² − N)/2 entries of the matrix
storage and analysis become a major problem. The sheer amount of data produced motivates the need for an automated analysis technique.
Here, we address high-dimensional hyperspectral data
using diffusion maps, which consider the eigenfunctions of
Markov matrices as a system of coordinates on the original
data set in order to obtain efficient representation of data
geometric descriptions [4-6]. The major difference between diffusion maps and methods like principal component analysis (PCA) is that in diffusion maps, a kernel is chosen before the procedure. This kernel is chosen according to our prior definition of the geometry of the data [4]. In PCA, all
correlations between values are considered, while only high
correlation values are considered in diffusion maps.
Diffusion maps have already been applied in the analyses of
protein data [7], gene expression data [8], video sequences
[9], and so on, and have achieved attractive performance.
The assumption is that every core sample contains only a few
different kinds of minerals so that there is a lot of redundant
data. It should therefore be sufficient to have only a few key
values per pixel to identify different materials. The reduced
data obtained are then clustered with a neural network clustering algorithm, Fuzzy ART (FA) [10], to generate clusters of the
potential minerals. FA is based on Adaptive Resonance
Theory (ART) [11-12], which was inspired by neural
modeling research and was developed as a solution to the
plasticity-stability dilemma: how adaptable (plastic) should a
learning system be so that it does not suffer from catastrophic
forgetting of previously-learned rules (stability)? ART can
learn arbitrary input patterns in a stable, fast, and
self-organizing way, thus overcoming the effect of learning
instability that plagues many other competitive networks.
Experimental results on a subset of hyperspectral data show
that the proposed methods are promising in addressing the
complicated hyperspectral data and identifying the minerals
in core samples.
The remainder of this paper is organized as follows.
Sections II and III briefly introduce diffusion maps and FA.
The experimental results are presented and discussed in
section IV, and section V concludes the paper.
II. DIFFUSION MAPS
Given a data set X = {x_i, i = 1, ..., N} in an m-dimensional data space, a finite graph with N nodes corresponding to the N data points can be constructed on X as follows. Every two nodes in the graph are connected by an edge weighted through a non-negative, symmetric, positive definite kernel w: X × X → (0, ∞). Typically, a Gaussian kernel is defined as

w(x_i, x_j) = exp(−||x_i − x_j||² / σ²),   (1)

where σ is the kernel width parameter. The kernel reflects the degree of similarity between x_i and x_j, and ||·|| is the Euclidean norm in ℜ^m. The resulting symmetric semi-positive definite matrix W = {w(x_i, x_j)}_{N×N} is called the affinity matrix.
Let

d(x_i) = Σ_{x_j ∈ X} w(x_i, x_j)   (2)
Fig. 3. The original RGB image of the data (left) and the RGB image of the masked hyperspectral image (right). The core tray and the blurred edges have been cut out for the masked image.
uncommitted neuron is selected for coding, a new
uncommitted neuron is created to represent a potential new
cluster.
FA displays many attractive characteristics that are also
inherent and general in the ART family. FA is capable of both
on-line (incremental) and off-line (batch) learning. The
computational cost of FA is O(N log N), or O(N) for the one-pass variant [13], and it can cope with large amounts of multidimensional data while maintaining efficiency. In comparison, the commonly used standard agglomerative hierarchical clustering algorithms run in at least O(N²) time [14]. FA dynamically
generates clusters without the requirement of specifying the
number of clusters in advance as in the classical K-means
algorithm. Another important feature of FA is the capability
Fig. 2. Topological structure of Fuzzy ART. Layers F_1 and F_2 are connected via adaptive weights W. The orienting subsystem is controlled by the vigilance parameter ρ.
The winning neuron J then becomes activated, and an expectation is reflected in layer F_1 and compared with the input pattern. The orienting subsystem with the pre-specified vigilance parameter ρ (0 ≤ ρ ≤ 1) determines whether the expectation and the input pattern are closely matched. If the match meets the vigilance criterion,

|x ∧ w_J| / |x| ≥ ρ,   (12)

weight adaptation occurs, where learning starts and the weights are updated using the following learning rule,

w_J(new) = β(x ∧ w_J(old)) + (1 − β) w_J(old),   (13)

where β ∈ [0, 1] is the learning rate parameter. This procedure is called resonance, which suggests the name of ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F_2 to shut off the current winning neuron, which remains disabled for the entire duration of the presentation of this input pattern, and a new competition is performed among the rest of the neurons. The new expectation is then projected into layer F_1, and this process repeats until the vigilance criterion is met. In the case that an
need to be calculated. Calculating this matrix could, however, be very easily parallelized, since any two entries are completely independent of each other. The transition matrix can be obtained from W by dividing every row element-wise by d(x_i). This could also be done in parallel. Experimental results show that the resulting matrix is sparse, and since we only need to find the first few eigenvectors and eigenvalues, this should not pose too much of a problem.
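The saving from symmetry can be sketched directly: only the rows above the diagonal need to be computed before mirroring, since w(x_i, x_j) = w(x_j, x_i) and the diagonal is known to be 1 in advance. This is our own illustration with a placeholder Gaussian kernel, not the authors' code:

```python
import numpy as np

def affinity_from_upper(X, sigma):
    """Compute only the (N^2 - N)/2 upper-triangle affinities, then mirror.

    The diagonal needs no computation: w(x_i, x_i) = exp(0) = 1.
    """
    N = len(X)
    W = np.eye(N)
    for i in range(N - 1):
        diff = X[i + 1:] - X[i]                      # pairs (i, j) with j > i
        w = np.exp(-np.sum(diff ** 2, axis=1) / sigma ** 2)
        W[i, i + 1:] = w                             # fill the upper triangle
        W[i + 1:, i] = w                             # mirror into the lower triangle
    return W

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 6))
W = affinity_from_upper(X, sigma=2.0)
full = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1) / 2.0 ** 2)
assert np.allclose(W, full)                          # same matrix, half the work
assert np.allclose(W, W.T)
```

Each iteration of the loop is independent of the others, which is what makes the row-wise parallelization described above straightforward.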
III. FUZZY ART
Fuzzy ART (FA) incorporates fuzzy set theory into ART
and extends the ART family by allowing stable recognition of
clusters in response to both binary and real-valued input
patterns with either fast or slow learning [10]. The basic FA
architecture consists of two layers of nodes, or neurons: the feature representation field F_1 and the category representation field F_2, as illustrated in Fig. 2. The neurons in layer F_1 are activated by the input pattern, while the prototypes of the formed clusters are stored in layer F_2. The neurons in layer F_2 that are already being used as representations of input patterns are said to be committed. Correspondingly, an uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights w_j, emanating from node j in layer F_2. After an input pattern is presented, the neurons (including a certain number of committed neurons and one uncommitted neuron) in layer F_2 compete by calculating the category choice function

T_j = |x ∧ w_j| / (α + |w_j|),   (9)

where ∧ is the fuzzy AND operator defined by

(x ∧ y)_i = min(x_i, y_i),   (10)

and α > 0 is the choice parameter to break the tie when more than one prototype vector is a fuzzy subset of the input pattern. The winner is selected based on the winner-take-all rule,

T_J = max{T_j : for all F_2 nodes j}.   (11)
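The competitive dynamics of Eqs. (9)-(13) can be sketched in a few lines of Python. This is a minimal one-pass illustration under simplifying assumptions (inputs already scaled to [0, 1], no complement coding, fast learning with β = 1), not the implementation used in this paper:

```python
import numpy as np

def fuzzy_art(inputs, rho=0.8, alpha=1e-3, beta=1.0):
    """One-pass Fuzzy ART sketch: choice (9)-(11), vigilance (12), learning (13)."""
    weights = []                          # prototypes of committed F2 neurons
    labels = []
    for x in inputs:
        order = []
        if weights:
            W = np.array(weights)
            T = np.minimum(x, W).sum(axis=1) / (alpha + W.sum(axis=1))  # Eq. (9)
            order = np.argsort(-T)        # winner-take-all search order, Eq. (11)
        for j in order:                   # try winners until one resonates
            if np.minimum(x, weights[j]).sum() / x.sum() >= rho:        # Eq. (12)
                weights[j] = beta * np.minimum(x, weights[j]) \
                    + (1 - beta) * weights[j]                           # Eq. (13)
                labels.append(int(j))
                break
        else:                             # every committed neuron was reset:
            weights.append(np.array(x, dtype=float))  # commit a new neuron
            labels.append(len(weights) - 1)
    return labels

patterns = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
assert fuzzy_art(patterns, rho=0.9) == [0, 0, 1, 1]
```

In the sketch, shutting off a reset neuron for the rest of the presentation is realized by simply moving on to the next candidate in the sorted order; new categories are created on demand, which is the property that later lets the number of clusters grow with the vigilance ρ.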
Fig. 4. The first four components obtained in the diffusion map for the middle image. These components interpret the major clusters in the image.
of detecting atypical patterns, or outliers, during its learning. The detection of such patterns is accomplished via the employment of a match-based criterion that decides to what degree a particular pattern matches the characteristics of an already formed category in layer F_2. Finally, FA is far simpler
to implement, for example, than kernel-based clustering or
clustering algorithms based on mixture densities. More
discussions on the properties of FA in terms of prototype,
access, reset, and the number of learning epochs required for
weight stabilization can also be found in [15].
IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
We applied the proposed method to the hyperspectral data from AngloGold Ashanti. The original image (Fig. 3) is of a core sample on the imaging tray of the HCI; due to its large size, it is further split into three overlapping subimages, denoted as the upper, middle, and lower images.
For these images, the sides have been trimmed to remove
almost all traces of the core tray. The original image contains
640 bands that are produced by three spectrometers that measure overlapping parts of the spectrum, which are then reduced to 400 unique bands. Among the 400 bands, we used only 100 bands from the middle of the third spectrometer, the part that is most sensitive to mineral features, in our further
analyses. It is impossible to use spectral values from all three
spectrometers because the spectrometers are not perfectly
Fig. 5. Clustering result of FA on the middle image, with the dimensionality reduced to 20. Four major clusters are observed.
Fig. 6. The first four principal components obtained in the PCA for the middle image.
spatially aligned. Furthermore, the extremes of the ranges of
each spectrometer are particularly noisy and not of much use.
The Gaussian kernel was applied in our further analyses, with the kernel width parameter σ set to 10³. The time step parameter was set to 1. All values in P less than 10⁻³ were set to 0. According to our experiments, we found that the clustering results are not sensitive to the category choice parameter α, which was therefore set to 10⁻³ for all analyses. We also fixed the learning rate β at 1 for fast learning. It is clear from the previous discussions that the vigilance parameter ρ plays an important role in determining the number of categories formed in layer F_2. The larger the value of ρ, the fewer mismatches will be tolerated; therefore, more clusters are likely to be generated. In our study, we start with a relatively small value of ρ and gradually increase it to observe its effect on the formed clusters.
Fig. 4 shows the first four components of the diffusion map on the middle image. The i-th component corresponds to the i-th dimension of the diffusion map. Compared to the clusters produced by FA with the dimensionality reduced to 20 (Fig. 5), it can be seen that the first four components represent the major clusters (top right) existing in the data. In comparison, the first four principal components of PCA on the middle image are depicted in Fig. 6, where clusters are not shown as clearly as in the case when diffusion maps are used. As discussed in [3], a major concern in the
Fig. 7. Clustering result of FA on the lower image, before (left) and after (right) the dimensionality is reduced from 100 to 20. The vigilance parameter is set at 0.8.
Fig. 8. Effect of the vigilance parameter ρ on the resulting clusters. Clusters of the lower image when (a) ρ = 0.6, (b) ρ = 0.7, (c) ρ = 0.8, and (d) ρ = 0.9. More clusters are generated as ρ increases.
F_2 layer of FA is 5, 9, 15, and 21, respectively. Clusters become clearer as ρ gradually increases. There is an obvious spatial correlation between bands of the unprocessed data and the clustering results. The results are also consistent with the analyses using the K-means algorithm [2]; however, it is not a trivial task for K-means to determine the number of clusters in advance.
Fig. 9 displays the clustering result of the upper image with 20 dimensions. The clusters formed at the lower right of the image, which are shown in amber, correspond to the part of the tray that was not removed. Hence, these clusters should not occur anywhere else on the image, as the figure shows. The composite image of the clustering result is obtained by merging the three subimages based on the overlapping parts, as shown in Fig. 10, although a representative spectrum for each cluster could also be used to decide the merging of the clusters. It is encouraging that the overlaps between the pictures are consistent with each other.
As we discussed in Section II, one way to speed up the computation of the diffusion maps is parallelization. However, a major bottleneck in the performance of parallel computing is communication. The speed of communication media is not currently comparable to that of processors [16]. In our particular case, most of the overhead stems from having to distribute data to the different processors and then reassemble the results efficiently. We believe that one of the most efficient algorithms is to simply send the complete set of data points to every processor and use each processor to calculate (N² − N)/(2r) weights, where r is the number of processors. A possible approach to implementing such an
Fig. 9. Clustering results of the upper image with FA. The clusters in amber are the part of the tray.
application of PCA to hyperspectral data is the number and shape of the principal eigenvectors. The transform has a blurring effect, and the eigenvectors would most likely only describe the most common characteristics of the data. The existence of rare spectra in the data could go unnoticed because of this. Whether such a weakness carries over to diffusion maps is worthy of further investigation.
The clustering results of FA on the lower image before and after the dimension is reduced from 100 to 20 are shown in Fig. 7. It is obvious that before the diffusion map is applied, it is very difficult to observe any clear clusters. However, when the dimension is reduced to 20, the major clusters become much clearer and can be easily identified.
Fig. 8 depicts the effect of the vigilance parameter ρ on the number of clusters generated. The value of ρ varies from 0.6 to 0.9 with a step of 0.1. The number of clusters created in the
Fig. 10. Clustering results of the composite image with FA. The
vigilance parameter was set at 0.8.
algorithm would be to use Google's MapReduce framework, which is specifically designed to make it easy to implement concurrent programs that use large sets of data [17]. The data are mapped to many computers, and after processing, the results are reduced to a manageable scale. Speed-up can also be achieved with graphics processing units (GPUs). Although GPUs are mostly familiar to us from computer games and 3-D graphics, they are increasingly becoming a complementary platform for general-purpose computation, as in computational intelligence, which usually involves highly expensive computations. The stream processing capability of GPUs, where the same instruction is performed on different data streams, makes them ideally suited to the computation of the transition matrix. A GPU implementation accelerating ART can produce a speed-up 22 times greater than a CPU implementation [18]. However, it is necessary to understand the limitations of the graphics processing hardware and to take these limitations into account when developing algorithms targeted at the GPU. Further investigation of the speed-up of the diffusion maps and FA, based on the above discussions, and possible pre-processing steps for noise reduction are important topics for further research.
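A minimal map/reduce-style sketch of the weight distribution scheme described above can be written with Python's standard multiprocessing pool standing in for MapReduce. All function names here are our own illustration, not part of the paper:

```python
import numpy as np
from multiprocessing import Pool

def weights_for_rows(args):
    """Map step: one worker computes the upper-triangle weights for its rows."""
    X, rows, sigma = args
    return [np.exp(-np.sum((X[i + 1:] - X[i]) ** 2, axis=1) / sigma ** 2)
            for i in rows]

def parallel_affinity(X, sigma, r=4):
    """Send the full data set to each of r workers; reduce by concatenation."""
    chunks = np.array_split(np.arange(len(X) - 1), r)   # split the N - 1 rows
    with Pool(r) as pool:
        parts = pool.map(weights_for_rows, [(X, c, sigma) for c in chunks])
    return [w for part in parts for w in part]           # rows of the upper triangle

if __name__ == "__main__":
    X = np.random.default_rng(4).normal(size=(40, 8))
    rows = parallel_affinity(X, sigma=2.0)
    assert len(rows) == 39                               # one weight row per i < N - 1
```

Note that contiguous row chunks are unbalanced, since row i holds N − 1 − i pairs; interleaving rows across workers would balance the load better, which is one of the reassembly overheads mentioned above.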
V. CONCLUSIONS
The large amounts of hyperspectral data being produced bring important challenges for storage and processing. Here, we investigated the performance of diffusion maps and Fuzzy ART on real hyperspectral image data from core samples provided by AngloGold Ashanti. The experimental results are very encouraging and promising, with clearly defined clusters found in the data that correlate well with features of the original image.
ACKNOWLEDGMENT
Partial support for this research from the National Science
Foundation, the Missouri University of Science &
Technology Intelligent Systems Center, and the M.K. Finley
Missouri endowment is gratefully acknowledged.
We are grateful to the Anglo Technical Division Geosciences Resource Group and AngloGold Ashanti for allowing us access to the HCI data.
REFERENCES
[1] J. Kerekes and J. Schott, "Hyperspectral imaging systems," in C. Chang, Ed., Hyperspectral Data Exploitation: Theory and Applications, pp. 19-45, Wiley, 2007.
[2] K. Cawse, S. Damelin, L. du Plessis, R. McIntyre, M. Mitchley, and M. Sears, "An investigation of data compression techniques for hyperspectral core imager data," in Proceedings of the Mathematics in Industry Study Group - MISG2008, South Africa, to appear, 2008.
[3] J. Bowles and D. Gillis, "An optical real-time adaptive spectral identification system (ORASIS)," in C. Chang, Ed., Hyperspectral Data Exploitation: Theory and Applications, pp. 77-106, Wiley, 2007.
[4] R. Coifman and S. Lafon, "Diffusion maps," Applied and Computational Harmonic Analysis, vol. 21, pp. 5-30, 2006.
[5] S. Lafon and A. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, 2006.
[6] S. Lafon, Y. Keller, and R. Coifman, "Data fusion and multicue data matching by diffusion maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1784-1797, 2006.
[7] Y. Keller, S. Lafon, and M. Krauthammer, "Protein cluster analysis via directed diffusion," The Fifth Georgia Tech International Conference on Bioinformatics, Georgia, USA, November 2005.
[8] R. Xu, S. Damelin, and D. Wunsch II, "Clustering of cancer tissues using diffusion maps and fuzzy ART with gene expression data," in Proceedings of the World Congress on Computational Intelligence 2008, pp. 183-188, Hong Kong, China, June 2008.
[9] Y. Ma, S. Damelin, O. Masoud, and N. Papanikolopoulos, "Activity recognition via classification constrained diffusion maps," International Symposium of Computer Vision, pp. 1-8, 2006.
[10] G. Carpenter, S. Grossberg, and D. Rosen, "Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, pp. 759-771, 1991.
[11] G. Carpenter and S. Grossberg, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[12] S. Grossberg, "Adaptive pattern recognition and universal encoding II: feedback, expectation, olfaction, and illusions," Biological Cybernetics, vol. 23, pp. 187-202, 1976.
[13] S. Mulder and D. Wunsch II, "Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks," Neural Networks, vol. 16, pp. 827-832, 2003.
[14] R. Xu and D. Wunsch II, Clustering, IEEE/Wiley, 2008.
[15] J. Huang, M. Georgiopoulos, and G. Heileman, "Fuzzy ART properties," Neural Networks, vol. 8, no. 2, pp. 203-213, 1995.
[16] G. Bell, "Bell's law for the birth and death of computer classes: A theory of the computer's evolution," Communications of the ACM, vol. 51, no. 1, pp. 86-94, 2008.
[17] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[18] R. Meuth, "GPUs surpass computers at repetitive calculations," IEEE Potentials, pp. 12-15, 2007.