Abstract— This paper presents a tutorial overview of neural networks as signal processing tools for image compression. They are well suited to the problem of image compression due to their massively parallel and distributed architecture. Their characteristics are analogous to some of the features of our own visual system, which allow us to process visual information with much ease. For example, multilayer perceptrons can be used as nonlinear predictors in differential pulse-code modulation (DPCM). Such predictors have been shown to increase the predictive gain relative to a linear predictor. Another active area of research is in the application of Hebbian learning to the extraction of principal components, which are the basis vectors for the optimal linear Karhunen-Loève transform (KLT). These learning algorithms are iterative, have some computational advantages over standard eigendecomposition techniques, and can be made to adapt to changes in the input signal. Yet another model, the self-organizing feature map (SOFM), has been used with a great deal of success in the design of codebooks for vector quantization (VQ). The resulting codebooks are less sensitive to initial conditions than the standard LBG algorithm, and the topological ordering of the entries can be exploited to further increase coding efficiency and reduce computational complexity.

Manuscript received March 8, 1994; revised September 29, 1994. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada through a Postgraduate Scholarship. The authors are with the Communications Research Laboratory, Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada L8S 4K1. IEEE Log Number 9407213.

I. INTRODUCTION

"A picture is worth a thousand words." This axiom expresses the essential difference between our ability to perceive linguistic information and visual information. For the same message, a visual representation tends to be perceived as being more efficient than the spoken or written word. This is hardly surprising when viewed from an evolutionary perspective. Language is but an instantaneous development in the course of evolutionary history and is manifested in only a single species. In contrast, vision, in one form or another, has existed for hundreds of millions of years and is shared by a countless number of organisms. But why should there be such a difference between the two representations in such an advanced being as ourselves?

The processing of language is inherently serial. Words and their meanings are recorded or perceived one at a time in a causal manner. Visual information, on the other hand, is processed by massively parallel interconnected networks of processing units. In the mammalian visual system, this parallelism is evident from the retina right through to the higher-order structures in the visual cortex and beyond. The efficiency of such parallel architectures over serial processing is reflected by the efficiency with which we process images over language.

For artificial information processing systems, there are also these two approaches. The most common is the serial computer. Its serial nature has made it well suited for the implementation of mathematical formulas and algorithms, which tend to be expressed in a linguistic form. Recently, there has been an increased interest in the investigation of parallel information processing systems [1], [2], [3], [4], [5]. These so-called "artificial neural networks," or simply "neural networks," are networks of simple computational units operating in a parallel and highly interconnected fashion.

The analogy between artificial and natural parallel processing systems is not without some validity. In many instances, the characteristics of natural networks have been the inspiration for artificial networks. As well, the performance of some natural systems has been successfully modelled by artificial systems. There are many tasks which the brain performs very well, but which have eluded efficient implementation in a serial architecture. It would seem logical that, to emulate in an artificial system such tasks at which the brain is very efficient, the architectural characteristics of the artificial system should reflect those of the corresponding natural system. For image processing applications, this suggests that the most efficient computational model would be a parallel, highly interconnected neural network.

Recently there has been a tremendous growth of interest in the field of neural networks. Yet the wide range of applications and architectures which fall under this category make a specific definition of the term "neural network" difficult. A neural network can be defined as a "massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use" [5]. Generally, neural networks refer to a computational paradigm in which a large number of simple computational units, interconnected to form a network, perform complex computational tasks. There is an analogy between such a computational model and the functioning of complex neurobiological systems. Higher-order neurobiological systems, of which the human brain is the ultimate example, perform extremely complex tasks using a highly connected network of neurons in which each neuron performs a relatively simple information processing task.

This model contrasts sharply with the classical serial computational model found in most general-purpose computers in use today. In the serial model, a highly complex processor performs computations in a rigidly serial manner. The way in
The study of image compression methods has been an active area of research since the inception of digital image processing. Since images can be regarded as two-dimensional signals with the independent variables being the coordinates of a two-dimensional space, many digital compression techniques for one-dimensional signals can be extended to images with relative ease. As a result, a number of approaches to the problem are well established [6], [7], [8], [9], [10], [11], [12], [13], [14]. Most current approaches fall into one of three major categories: predictive coding, transform coding, or vector quantization. Alternatively, a combination of these techniques may be applied in a hybrid approach.

A. Predictive Coding

In this approach, the signal is modelled as an autoregressive (AR) process of order p,

x(n) = \sum_{j=1}^{p} w_j x(n-j) + \epsilon_n    (1)

where {w_j} is a set of AR coefficients, and {\epsilon_n} is a set of zero-mean independent and identically distributed (i.i.d.) random variables. In this case, the predicted value is a linear sum of the neighboring samples (pixels) as shown by

\hat{x}(n) = \sum_{j=1}^{p} w_j x(n-j)    (2)

Equation (2) is the basis of linear predictive coding (LPC). To minimize the mean squared error E[(\hat{x} - x)^2], the following relationship must be satisfied

Rw = d    (3)

where [R]_{ij} = E[x(i)x(j)] is the ijth element of the autocovariance matrix and d_j = E[\hat{x}(n)x(j)] is the jth element of the cross-covariance vector d. Knowing R and d, the unknown coefficient vector w can be computed, and the AR model (i.e., predictor) is thereby determined.
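For concreteness, the design procedure above can be sketched as follows. This is an assumed, minimal numpy illustration (the function names and the one-dimensional toy signal are not from the paper): it estimates R and d from sample data and solves the normal equations Rw = d of (3) for the predictor coefficients.

```python
import numpy as np

def design_linear_predictor(x, p):
    """Solve the normal equations R w = d for a pth-order linear predictor.

    x : 1-D array of samples (e.g. pixels scanned into a sequence)
    p : predictor order
    """
    # Matrix of predictor inputs: row n holds x(n-1), ..., x(n-p)
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    target = x[p:]                           # the samples to be predicted, x(n)

    R = X.T @ X / len(target)                # sample autocovariance matrix, [R]_ij
    d = X.T @ target / len(target)           # sample cross-covariance vector, d_j
    return np.linalg.solve(R, d)             # predictor coefficients w

def predict(x, w):
    """Form the linear prediction x_hat(n) = sum_j w_j x(n-j)."""
    p = len(w)
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    return X @ w

# Example: predict a smooth signal from its previous 3 samples
x = np.cos(np.linspace(0, 10, 500)) + 0.01 * np.random.randn(500)
w = design_linear_predictor(x, p=3)
e = x[3:] - predict(x, w)                    # prediction error e(n)
print("prediction gain (dB):", 10 * np.log10(x[3:].var() / e.var()))
```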
B. Transform Coding

Another approach to image compression is the use of transformations that operate on an image to produce a set of coefficients. A subset of these coefficients is chosen and quantized for transmission across a channel or for storage. The goal of this technique is to choose a transformation for which such a subset of coefficients is adequate to reconstruct an image with a minimum of discernible distortion.

A simple, yet powerful, class of transform coding techniques is linear block transform coding. An image is subdivided into non-overlapping blocks of n × n pixels which can be considered as N-dimensional vectors x with N = n × n. A linear transformation, which can be written as an M × N-dimensional matrix W with M ≤ N, is performed on each block, with the M rows of W, w_i, being the basis vectors of the transformation. The resulting M-dimensional coefficient vector y is calculated as

y = Wx    (4)

If the basis vectors w_i are orthonormal, that is,

w_i^T w_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}    (5)

then the inverse transformation is given by the transpose of the forward transformation matrix, resulting in the reconstructed vector

\hat{x} = W^T y    (6)

The optimal linear transformation with respect to minimizing the mean squared error (MSE) is the Karhunen-Loève transformation (KLT). The transformation matrix W consists of M rows which are the eigenvectors corresponding to the M largest eigenvalues of the sample autocovariance matrix

\Sigma = E[x x^T]    (7)

The KLT also produces uncorrelated coefficients and therefore results in the most efficient coding of the data, since the redundancy due to the high degree of correlation between neighboring pixels is removed. The KLT is related to principal components analysis (PCA), since the basis vectors are also the M principal components of the data. Because the KLT is an orthonormal transformation, its inverse is simply its transpose.

A number of practical difficulties exist when trying to implement the above approach. The calculation of the estimate of the covariance of an image may be unwieldy and may require a large amount of memory. In addition, the solution of the eigenvectors and eigenvalues is computationally intensive. Finally, the calculation of the forward and inverse transforms is of order O(MN) for each image block. Due to these difficulties, fixed-basis transforms such as the discrete cosine transform (DCT) [15], which can be computed in order O(N log N), are typically used when implementing block transform schemes. The Joint Photographic Experts Group (JPEG) has adopted the linear block transform coding approach for its standard, using the DCT as the transformation [16].
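The KLT-based block transform coding of (4)–(7) can be sketched as follows. The block size n = 8, the number of retained coefficients M = 16, and the function names are illustrative assumptions, and the basis is estimated from the image's own blocks.

```python
import numpy as np

def klt_code(image, n=8, M=16):
    """Karhunen-Loeve transform coding of n x n blocks, keeping M coefficients."""
    h, w = image.shape
    # Collect non-overlapping n x n blocks as N-dimensional vectors (N = n*n)
    blocks = np.array([image[r:r + n, c:c + n].ravel()
                       for r in range(0, h - n + 1, n)
                       for c in range(0, w - n + 1, n)], dtype=float)

    # Sample autocovariance matrix (Eq. 7) and its eigendecomposition
    Sigma = blocks.T @ blocks / len(blocks)
    _, eigvec = np.linalg.eigh(Sigma)                 # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :M].T                      # M rows = leading eigenvectors

    y = blocks @ W.T                                  # forward transform, y = Wx (Eq. 4)
    reconstructed = y @ W                             # inverse transform, x_hat = W^T y (Eq. 6)
    mse = np.mean((blocks - reconstructed) ** 2)
    return W, y, mse

# Example with a random "image"; a real image array would be used in practice
img = np.random.rand(64, 64)
W, coeffs, mse = klt_code(img, n=8, M=16)
print("MSE keeping 16 of 64 coefficients:", mse)
```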
C. Vector Quantization

The process of quantization maps a signal x(n) into a series of K discrete messages. For the kth message, there exists a pair of thresholds t_k and t_{k+1}, and an output value q_k such that t_k < q_k ≤ t_{k+1}. For a given set of quantization values, the optimal thresholds are equidistant from the values. The concept of quantizing data can be extended from scalar or one-dimensional data to vector data of arbitrary dimension. Instead of output levels, vector quantization (VQ) employs a set of representation vectors (for the one-dimensional case) or matrices (for the two-dimensional case) [17], [18], [19], [20], [12]. The set is referred to as the "codebook" and the entries as "codewords". The thresholds are replaced by decision surfaces defined by a distance metric. Typically, Euclidean distance from the codeword is used. The advantage of vector quantization over scalar quantization is that the high degree of correlation between neighboring pixels can be exploited. Even for a memoryless system, the coding of vectors instead of scalars can theoretically improve performance.

In the coding phase, the image is subdivided into blocks, typically of a fixed size of n × n pixels. For each block, the nearest codebook entry under the distance metric is found and the ordinal number of the entry is transmitted. On reconstruction, the same codebook is used and a simple look-up operation is performed to produce the reconstructed image. The standard approach to calculate the codebook is by way of the Linde, Buzo, and Gray (LBG) algorithm [17]. Initially, K codebook entries are set to random values. On each iteration, each block in the input space is classified, based on its nearest codeword. Each codeword is then replaced by the mean of its resulting class. The iterations continue until a minimum acceptable error is achieved. This algorithm minimizes the mean squared error over the training set.

While the LBG algorithm converges to a local minimum, it is not guaranteed to reach the global minimum. In addition, the algorithm is very sensitive to the initial codebook. Furthermore, the algorithm is slow since it requires an exhaustive search through the entire codebook on each iteration.
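The LBG iteration described above can be sketched as follows. It is a simplified illustration that runs a fixed number of iterations rather than testing an explicit error threshold, and the helper name is assumed.

```python
import numpy as np

def lbg_codebook(blocks, K, iterations=20, seed=0):
    """Design a K-entry VQ codebook with the LBG (generalized Lloyd) iteration.

    blocks : array of shape (num_blocks, N), each row a vectorized n x n block
    """
    rng = np.random.default_rng(seed)
    codebook = rng.random((K, blocks.shape[1]))      # initial random codewords

    for _ in range(iterations):
        # Classify every block by its nearest codeword (Euclidean distance)
        dists = np.linalg.norm(blocks[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each codeword by the mean of its class (empty classes stay put)
        for k in range(K):
            members = blocks[labels == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook, labels

# Example: random 4 x 4 blocks stand in for training image blocks
training_blocks = np.random.rand(500, 16)
codebook, labels = lbg_codebook(training_blocks, K=32)
print("codebook shape:", codebook.shape)
```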
With this brief review of conventional image compression techniques at hand, we are ready to consider the role of neural networks as an image compression tool.

III. PREDICTIVE CODING USING NEURAL NETWORKS

As indicated above, optimal predictors based on a linear weighted sum of the neighboring pixels are relatively easy to design using the statistics of the image. However, if a nonlinear model is more appropriate, the use of a linear predictor will clearly result in a suboptimal solution. Unfortunately, the design of nonlinear predictors is generally not as mathematically tractable as that of linear predictors.
Neural networks may be well suited to the design of such nonlinear predictors.

(Figure: model of a single neuron — the inputs x1, …, xn are weighted by wj1, …, wjn, summed together with a bias wj0, and passed through a nonlinear output function to produce the output yj.)
A. Multilayer Perceptrons
When designing a nonlinear predictor, the goal is to find the optimal parameter set W_o for a given nonlinear function of the previous p inputs, as shown by

\hat{x}(n) = f(x(n-1), \ldots, x(n-p), W)    (8)

such that the mean squared value of the prediction error, e(n) = x(n) - \hat{x}(n), is minimized.
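As an illustration of such a nonlinear predictor, the sketch below trains a single-hidden-layer perceptron by gradient descent to minimize the mean squared prediction error. The network size, learning rate, and test signal are assumptions for illustration, not the configuration of any particular system reviewed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_mlp_predictor(x, p=4, hidden=8, epochs=200, lr=0.05):
    """Train x_hat(n) = f(x(n-1), ..., x(n-p); W) to minimize E[(x - x_hat)^2]."""
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    t = x[p:]                                        # targets x(n)

    W1 = rng.normal(0, 0.5, (p, hidden))             # input-to-hidden weights
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden)                  # hidden-to-output weights
    b2 = 0.0

    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                     # hidden-layer activations
        y = h @ W2 + b2                              # predictions x_hat(n)
        e = y - t                                    # prediction error
        # Backpropagate the mean squared error
        gW2 = h.T @ e / len(t)
        gb2 = e.mean()
        gh = np.outer(e, W2) * (1 - h ** 2)
        gW1 = X.T @ gh / len(t)
        gb1 = gh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return (W1, b1, W2, b2), np.mean(e ** 2)

x = np.sin(np.linspace(0, 20, 1000)) ** 3            # a simple nonlinear test signal
params, mse = train_mlp_predictor(x)
print("final mean squared prediction error:", mse)
```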
Fig. 8. Simplified linear neuron.
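The simplified linear neuron of Fig. 8 underlies the Hebbian principal-component learning rules discussed in this section. As one standard member of that family, the sketch below implements Sanger's generalized Hebbian algorithm [32], in which the orthogonalization is built into the learning rule; it is an assumed illustration rather than the APEX network described below, and the learning rate and data are arbitrary.

```python
import numpy as np

def gha(X, M, epochs=50, lr=0.01, seed=0):
    """Sanger's generalized Hebbian algorithm: learn the first M principal components.

    X : array of shape (num_samples, N), assumed zero-mean
    Returns W of shape (M, N); row m converges to the mth principal component.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = rng.normal(0, 0.1, (M, N))

    for _ in range(epochs):
        for x in X:
            y = W @ x                                   # outputs of the M linear neurons
            # Hebbian term y x^T minus a lower-triangular term that
            # orthogonalizes against the components already extracted
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Example: correlated 2-D data; the first row of W should align with the
# dominant eigenvector of the data covariance
data = np.random.randn(2000, 2) @ np.array([[3.0, 0.0], [1.0, 0.5]])
data -= data.mean(axis=0)
W = gha(data, M=2)
print("learned components:\n", W / np.linalg.norm(W, axis=1, keepdims=True))
```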
and W_{m-1} is an (m−1) × N matrix whose m−1 rows are the previous m−1 principal components. The orthogonalization is incorporated into the learning rule to form connections to decorrelate the components. The justification of this approach was based on earlier work by Barlow and Földiák [36] on the visual cortex. Building on this approach and the work of Oja [28], Kung and Diamantaras [37], [38], [39] have developed a sequential solution, called adaptive principal component extraction (APEX), in which the output of the mth principal component y_m can be calculated based on the previous m−1 components through

where n is the number of input patterns. This choice of learning parameters allows the network to adapt to slowly varying changes in the signal statistics. Further, the calculation of an additional principal component requires only a linear order, O(n), of multiplications per iteration. For testing,
subspace classifier. The selection is based on the class whose projected vector norm ‖x̂_i‖ is maximum. The projected vector x̂_i is calculated by taking the inverse transformation of the coefficient vector. The image is decoded using the same set of transformations. The class index is used to choose the class for the inverse transformation and the resulting reconstructed image block x̂ is calculated.

It was found that for the same coding rate, the use of the network decreases the MSE by 40-50% over the KLT. Alternatively, for fixed distortion, the compression ratio can almost be doubled [55]. In addition, the network can be used as a segmentor which has the property of illumination invariance, and produces class representations that are analogous to the arrangements of the directionally sensitive columns in the visual cortex [55], [56].
V. VECTOR QUANTIZATION USING NEURAL NETWORKS

A. Self-Organizing Feature Map Algorithm

Kohonen's self-organizing feature map (SOFM) [57] has formed a basis for a great deal of research into applying network models to the problem of codebook design in vector quantization. Kohonen introduced the concept of classes ordered in a "topological map" of features. In many clustering algorithms, such as K-means, each input vector x is classified and only the "winning" class is modified during each iteration [58]. In the SOFM algorithm, the vector x is used to update not only the winning class, but also its neighboring classes, according to the following rule.

For each vector x in the training set:

1) Classify x according to

x \in C_i \ \text{if} \ \|x - w_i\| = \min_j \|x - w_j\|    (35)

2) Update the features w_j according to

w_j(t+1) = \begin{cases} w_j(t) + \alpha(t)[x - w_j(t)], & C_j \in N(C_i, t) \\ w_j(t), & C_j \notin N(C_i, t) \end{cases}    (36)

where w is the feature vector, α is a learning parameter in the range 0 < α < 1, and N(C_i, t) is the set of classes which are in the neighborhood of the winning class C_i at time t. The class features w_i converge to the class means. The neighborhood of a class is defined according to some distance measure on a topological ordering of the classes. For example, if the classes were ordered on a two-dimensional square grid, the neighborhood of a class could be defined as the set of classes whose Euclidean distances from the class are less than some specified threshold. Initially, the neighborhood may be quite large during training, e.g., half the number of classes or more. As the training progresses, the size of the neighborhood shrinks until, eventually, it includes only the one class. During training, the learning parameter α also shrinks down to a small value (e.g., 0.01) for the fine-tuning (convergence) phase of the algorithm.
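A sketch of the SOFM rule of (35)–(36), for a codebook ordered on a one-dimensional chain of classes; the particular shrinking schedules for α(t) and for the neighborhood size are illustrative assumptions.

```python
import numpy as np

def sofm_codebook(blocks, K, epochs=30, seed=0):
    """Train a 1-D topologically ordered codebook with Kohonen's SOFM rule."""
    rng = np.random.default_rng(seed)
    w = rng.random((K, blocks.shape[1]))               # feature (code) vectors

    for epoch in range(epochs):
        # Learning rate shrinks from 0.5 to 0.01; neighborhood shrinks until
        # only the winning class remains
        alpha = 0.5 * (0.01 / 0.5) ** (epoch / (epochs - 1))
        radius = int(round((K / 2) * (1 - epoch / (epochs - 1))))
        for x in blocks[rng.permutation(len(blocks))]:
            i = np.argmin(np.linalg.norm(x - w, axis=1))        # winner, Eq. (35)
            neighbors = np.abs(np.arange(K) - i) <= radius      # N(C_i, t)
            w[neighbors] += alpha * (x - w[neighbors])          # update, Eq. (36)
    return w

blocks = np.random.rand(500, 16)                        # stand-in for 4 x 4 training blocks
codebook = sofm_codebook(blocks, K=32)
print("ordered codebook shape:", codebook.shape)
```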
B. Properties of the SOFM Algorithm

The SOFM algorithm has a number of important properties which make it suitable for use as a codebook generator for vector quantization [5]:

1) The set of feature vectors are a good approximation to the original input space.
2) The feature vectors are topologically ordered in the feature map such that the correlation between the feature vectors increases as the distance between them decreases.
3) The density of the feature map corresponds to the density of the input distribution, so that regions with a higher probability density have better resolution than areas with a lower density.

C. Comparison of the SOFM and LBG Algorithms

The LBG algorithm for codebook design and the SOFM algorithm are closely related [59]. In fact, the LBG algorithm is the batch equivalent of the SOFM algorithm for a neighborhood size of one, N(C_i) = {C_i}. The LBG algorithm minimizes the mean squared error (MSE) distortion within a class

D_{MSE}(x, w_i) = E[\|x - w_i\|^2], \quad x \in C_i    (37)

by setting the vector w_i to the mean of the class. The means are calculated in batch mode, i.e., after n training samples. To update the vector w_i after every training sample, a gradient descent approach based on minimizing (37) can be employed. The updating rule becomes

w_i(t+1) = w_i(t) - \tfrac{1}{2}\alpha \nabla D_{MSE}(x, w_i(t)) = w_i(t) + \alpha(t)[x - w_i(t)]    (38)

which is equivalent to (36) for N(C_i) = {C_i}. Therefore, both compute the class means, resulting in a minimum MSE codebook.

A number of researchers [60], [61], [62], [63] have successfully used the SOFM algorithm to generate VQ codebooks and have demonstrated a number of advantages of using the SOFM algorithm over the classical LBG algorithm. The advantages include less sensitivity to initialization of the codebook, better rate-distortion performance, and faster convergence. It has also been shown [64], [65] that the codewords, or weights, converge to the mean of the classes. The resulting codewords are optimal in the sense that the average distortion is minimized.

D. Address Predictive Vector Quantization

Another advantage of using the SOFM algorithm for codebook design has been demonstrated for a variation on VQ called Address Predictive Vector Quantization (APVQ) [66]. This technique uses an ordered codebook in which adjacent entries are in some sense correlated. The correlation between adjacent codewords and the correlation between neighboring blocks in an image can be exploited to construct a DPCM encoder in which the input signal is the codeword address. This technique can improve the coding gain and allow for lossy address encoding. When the SOFM algorithm was used to compute a codebook in conjunction with APVQ, 37% fewer bits were required to code an image compared to standard VQ [66].
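The APVQ idea can be sketched as follows under simple assumptions — a one-dimensional ordered codebook and plain differencing of the winning addresses; the entropy coder that would exploit the small differences is omitted.

```python
import numpy as np

def apvq_encode(blocks, ordered_codebook):
    """Encode each block's codeword address differentially.

    With a topologically ordered codebook, neighbouring image blocks tend to map
    to nearby addresses, so the address differences cluster around zero and can
    be entropy coded with fewer bits than the raw addresses.
    """
    dists = np.linalg.norm(blocks[:, None, :] - ordered_codebook[None, :, :], axis=2)
    addresses = dists.argmin(axis=1)
    diffs = np.diff(addresses, prepend=addresses[0])    # DPCM on the address sequence
    return addresses, diffs

blocks = np.random.rand(200, 16)
codebook = np.sort(np.random.rand(32, 16), axis=0)      # stand-in for an ordered codebook
addresses, diffs = apvq_encode(blocks, codebook)
print("address range:", addresses.min(), addresses.max(),
      "| mean |diff|:", np.abs(diffs).mean())
```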
Fig. 12. Modular system architecture of OIAL. The inputs are blocks of n × n pixels. The K transformations W_i consist of M basis images of size n × n and output an M-dimensional vector y_i. The coefficient vector to be sent is chosen by the subspace classifier based on the maximum norm of the projected vector ‖x̂_i‖.
(Figure: hierarchical vector quantizer — the inputs x1, …, x4 are combined by layers of two-input vector quantizers, Layer 1 through Layer 3 (the base layer), to produce the output.)
that while this technique dramatically reduced the training and encoding time relative to a full search, the increase in distortion was minor, at most a 0.5 dB decrease in SNR.

Lee and Peterson [64] have developed an approach which dynamically changes the structure and neighborhood relationships between the weights by allowing creation of new weights, merging and deletion of existing weights, and merging of axes in the topological map.

H. Frequency Sensitive Competitive Learning

One of the problems with the LBG algorithm is that the frequency of use of the entries in the codebook can be quite uneven, with some codewords being underutilized. In frequency sensitive competitive learning (FSCL) [72], [73], [74], the distance to the codewords during training is weighted

the fairness function has a significant effect on the performance of the resulting codebook. The use of (42) resulted in an increase in MSE relative to the LBG codebook, while (43) resulted in MSE approximately the same as for the LBG codebook.
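The fairness functions (42) and (43) referred to above are not reproduced here; the sketch below assumes one common form of FSCL in which the distance is scaled by the number of times each codeword has already won, so that underutilized codewords become more attractive.

```python
import numpy as np

def fscl_codebook(blocks, K, epochs=10, lr=0.05, seed=0):
    """Frequency sensitive competitive learning: win counts bias the competition."""
    rng = np.random.default_rng(seed)
    w = rng.random((K, blocks.shape[1]))
    counts = np.ones(K)                                # how often each codeword has won

    for _ in range(epochs):
        for x in blocks[rng.permutation(len(blocks))]:
            d = np.linalg.norm(x - w, axis=1)
            i = np.argmin(counts * d)                  # distance weighted by usage frequency
            counts[i] += 1
            w[i] += lr * (x - w[i])                    # move only the (fairly chosen) winner
    return w, counts

blocks = np.random.rand(500, 16)
codebook, counts = fscl_codebook(blocks, K=32)
print("codeword usage spread:", counts.min(), counts.max())
```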
VI. EVALUATION ISSUES

While the scope of this paper has been to review the current research into applying neural network techniques to the task of image compression and summarize the various approaches, we have included, where possible, the claims as to the performance of the various techniques by their respective authors. However, there is still a need for an objective, unified evaluation of these new approaches.
A common set of training images and common testing images from outside the training set would allow the performance of the different algorithms to be compared with each other. In addition, the performance of the state-of-the-art in the "classical" approaches such as linear predictive coding, JPEG using the DCT, and vector quantization on the same test images would provide a set of benchmarks for comparison.

Such an evaluation must address the four issues of training, distortion, bit rate, and complexity.

A. Training

Because these techniques require training to calculate the parameters of the encoder, care must be taken in selecting the common set of training images. They must be representative of the class of images for which the network is to be used. For example, if natural images are to be processed, the training set should include a variety of naturally occurring scenes. On the other hand, if artificial images such as synthetic aperture radar (SAR) or magnetic resonance imaging (MRI) are used, then the training set must be representative of that class of images. The images used for evaluation should also be representative of the class of images used in training but they should not be from the training set. The use of images from outside of the training set allows the generalization properties of the networks to be evaluated. For many classical techniques, an image model is assumed. For example, linear predictive coding assumes a pth order AR process. For images for which the assumed model is valid and whose statistics are similar, the encoder performs well. When these assumptions are not valid, the performance is degraded. Similarly for neural networks, if a test image has similar attributes to those of the training set, the encoder should perform well and the network is said to generalize well. The degree to which the attributes of an image may deviate from those in the training set before there is a significant degradation in performance provides a measure of the generalization ability of the network.

B. Distortion

With lossy compression, the reconstructed image differs from the original. This difference may result in some visible distortion which may be measured in a number of ways.

a) Mean Squared Error (MSE): A common measure of distortion is the mean squared error (MSE). Its wide-spread use is due mainly to its ease of calculation. Generally, an image with a high MSE has more visible distortion than that with a low MSE. In addition, the minimization of the MSE as an optimality criterion has formed the basis for the development of a large number of successful image compression algorithms.
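For reference, the MSE, and the SNR figure in decibels in which coding error is often quoted, can be computed as in the following sketch (assumed helper functions for 8-bit images, using one common SNR convention; they are not part of the paper).

```python
import numpy as np

def mse(original, reconstructed):
    """Mean squared error between an original and a reconstructed image."""
    diff = original.astype(float) - reconstructed.astype(float)
    return np.mean(diff ** 2)

def snr_db(original, reconstructed):
    """Signal-to-noise ratio in dB, with the coding error treated as noise."""
    err = mse(original, reconstructed)
    return 10.0 * np.log10(np.mean(original.astype(float) ** 2) / err)

# Example with a synthetic image and a noisy "reconstruction"
img = np.random.randint(0, 256, (64, 64))
rec = np.clip(img + np.random.randn(64, 64) * 5, 0, 255)
print("MSE:", mse(img, rec), " SNR (dB):", snr_db(img, rec))
```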
b) Mean Opinion Score: However, as many researchers have rightly pointed out, despite its popularity as a distortion measure, MSE can be a poor indicator of the subjective quality of reconstruction [14], [11]. As a result, perceptually based criteria may be more appropriate. One example is the mean opinion score [11]. A number of subjects view an image and rate its quality on a five-point scale of "bad," "poor," "fair," "good," or "excellent." The mean opinion score is simply the average rating assigned by all the subjects.

c) Weighted Mean Squared Error (WMSE): An alternative to the MSE is the weighted mean squared error (WMSE). In this method, the error is weighted by a subjective function which measures the local visibility of distortion. The weighting function can be derived through subjective experiments or models of the human visual system (HVS). The visibility of error can depend on the modulation transfer function (MTF) (the spatial frequency response) of the HVS, the local contrast, and the local texture content [14].

d) End-use Measures: If the end-use of a class of images is well defined, then the performance under the end-use can be measured. For example, if feature detection is performed on a remote sensing image, then receiver operating characteristics (ROC) curves can be employed [75]. ROC curves plot the relationship between the probability of detection (P_d) and the probability of false alarm (P_fa). As the distortion of the compression system increases, P_d will decrease for a given P_fa. For medical imaging, the effect of the distortion on the accuracy of diagnosis can be similarly measured [76].

C. Bit Rate

Bit rate can be measured as the average number of bits per pixel for the encoded image. Both the bit rate and the distortion measure must be quoted together for any meaningful comparison of performance.

e) Entropy and Perceptual Entropy: Shannon [77] introduced the concept of entropy as a measure of the average information content of a source. For lossless coding, the lower bound for the minimum number of bits required is given by the entropy of the image. Recently, Jayant introduced the concept of perceptual entropy, defined as the "fundamental limit to which we can compress the signal with zero (perceived) distortion" [14]. If some distortion is permissible, then the bit rate can be less than the entropy bound. In this case, the rate-distortion relationship must be derived. While this relationship can be analytically derived for the mean squared error distortion measure, for the reasons stated above, the resulting relationship may not give an accurate indication of the performance of an image compression algorithm. For perceptually based distortion measures, the relationship between bit rate and distortion is not so mathematically well defined, and therefore would have to be found experimentally.
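A first-order estimate of this entropy bound can be computed from the pixel histogram, as in the sketch below; treating pixels as independent is an assumption that ignores spatial redundancy, so the true bound is lower.

```python
import numpy as np

def first_order_entropy(image):
    """Estimate the entropy (bits/pixel) of an 8-bit image from its histogram."""
    counts = np.bincount(image.ravel().astype(np.uint8), minlength=256)
    p = counts[counts > 0] / counts.sum()               # probabilities of occurring levels
    return -np.sum(p * np.log2(p))

img = np.random.randint(0, 256, (64, 64))               # stand-in for a real image
print("entropy lower bound for lossless coding:", first_order_entropy(img), "bits/pixel")
```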
f) Overhead: Since a number of neural network approaches to compression are adaptive in nature, the bit rate measure must include any overhead associated with the adaptation mechanism. For example, if a new set of network weights were required at the decoder during processing, the number of bits in the message notifying the decoder of the change, as well as the bits required for the transmission of the new weights, if required, must be accounted for in the average bit rate. In the adaptive transform coding scheme of Section IV-G, the number of bits for the coding of the class index associated with each image block is accounted for in the final bit rate measure.
[7] A. K. Jain, "Image data compression: A review," Proc. IEEE, vol. 69, no. 3, pp. 349–389, March 1981.
[8] A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed., vol. I & II. San Diego, CA: Academic Press, 1982.
[9] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[10] H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proc. IEEE, vol. 73, no. 4, pp. 523–548, April 1985.
[11] A. N. Netravali and B. G. Haskell, Digital Pictures: Representation and Compression. New York, NY: Plenum Press, 1988.
[12] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Norwell, MA: Kluwer Academic Publishers, 1992.
[13] J. A. Storer and M. Cohn, Eds., Proc. Data Compression Conference. Snowbird, UT: IEEE Computer Society Press, March 30 – April 2, 1993.
[14] N. Jayant, J. Johnston, and R. Safranek, "Signal compression based on models of human perception," Proc. IEEE, vol. 81, no. 10, pp. 1385–1421, October 1993.
[15] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. New York, NY: Academic Press, 1990.
[16] G. K. Wallace, "Overview of the JPEG (ISO/CCITT) still image compression standard," in Proceedings of the SPIE, vol. 1244, Feb. 1990, pp. 220–233.
[17] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, vol. 28, no. 1, pp. 84–95, January 1980.
[18] R. M. Gray, "Vector quantization," IEEE ASSP Magazine, vol. 1, pp. 4–29, 1984.
[19] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: A review," IEEE Trans. Communications, vol. 36, no. 8, pp. 957–971, August 1988.
[20] H. Abut, Ed., Vector Quantization. New York, NY: IEEE Press, 1990.
[21] S. A. Dianat, N. M. Nasrabadi, and S. Venkataraman, "A non-linear predictor for differential pulse-code encoder (DPCM) using artificial neural networks," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '91, Toronto, Canada, May 14-17, 1991, pp. 2793–2796.
[22] J. Li and C. N. Manikopoulos, "Nonlinear predictor in image coding with DPCM," Electronics Letters, vol. 26, no. 17, pp. 1357–1359, August 16, 1990.
[23] C. N. Manikopoulos, "Neural network approach to DPCM system design for image coding," IEE Proceedings-I, vol. 139, no. 5, pp. 501–507, October 1992.
[24] Y.-H. Pao, Adaptive Pattern Recognition and Neural Networks. Reading, MA: Addison-Wesley, 1989.
[25] S. Haykin and L. Li, "Nonlinear adaptive prediction of nonstationary signals," IEEE Trans. Signal Processing, 1995.
[26] S. Haykin, Adaptive Filter Theory, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1991.
[27] D. O. Hebb, The Organization of Behavior. Wiley, 1949.
[28] E. Oja, "A simplified neuron model as a principal component analyzer," J. Math. Biology, vol. 15, pp. 267–273, 1982.
[29] ——, "Neural networks, principal components, and subspaces," Int. J. Neural Systems, vol. 1, no. 1, pp. 61–68, 1989.
[30] ——, "Principal components, minor components, and linear neural networks," Neural Networks, vol. 5, pp. 927–935, 1992.
[31] E. Oja and J. Karhunen, "On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix," J. Math. Analysis and Applications, vol. 106, pp. 69–84, 1985.
[32] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network," Neural Networks, vol. 2, pp. 459–473, 1989.
[33] ——, "An optimality principle for unsupervised learning," in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed., 1989, pp. 11–19.
[34] ——, "Analysis of the two-dimensional receptive fields learned by the generalized Hebbian algorithm in response to random input," Biol. Cybern., vol. 63, pp. 221–228, 1990.
[35] P. Földiák, "Adaptive network for optimal linear feature extraction," in Int. Joint Conf. on Neural Networks, vol. 1, Washington, DC, 1989, pp. 401–405.
[36] H. Barlow and P. Földiák, "Adaptation and decorrelation in the cortex," in The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, Eds. Reading, MA: Addison-Wesley, 1989, pp. 54–72.
[37] S. Y. Kung and K. I. Diamantaras, "A neural network learning algorithm for adaptive principal component extraction (APEX)," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '90, Albuquerque, NM, April 3-6, 1990, pp. 861–864.
[38] K. I. Diamantaras, "Principal component learning networks and applications," Ph.D. dissertation, Princeton University, October 1992.
[39] S. Y. Kung, K. I. Diamantaras, and J. S. Taur, "Adaptive principal component extraction (APEX) and applications," IEEE Trans. Signal Processing, vol. 42, no. 5, pp. 1202–1217, May 1994.
[40] H. Chen and R. Liu, "Adaptive distributed orthogonalization process for principal components analysis," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing '92, San Francisco, CA, March 23-26, 1992, pp. II 283–296.
[41] L. Xu and A. Yuille, "Robust principal component analysis by self-organizing rules based on statistical physics approach," Harvard Robotics Laboratory, Tech. Rep. 92-3, February 1992.
[42] E. Oja, Subspace Methods of Pattern Recognition. Letchworth, U.K.: Research Studies Press Ltd., 1983.
[43] S. Becker and M. Plumbley, "Unsupervised neural network learning procedures for feature extraction and classification," to appear in Int. J. Applied Intelligence, 1996.
[44] G. W. Cottrell and P. Munro, "Principal components analysis of images via back propagation," in SPIE Vol. 1001 Visual Communications and Image Processing '88, 1988, pp. 1070–1077.
[45] G. W. Cottrell, P. Munro, and D. Zipser, "Learning internal representations from gray-scale images: An example of extensional programming," in Ninth Annual Conf. of the Cognitive Science Society, July 16-18, 1987, pp. 462–473.
[46] G. W. Cottrell and J. Metcalfe, "EMPATH: Face, emotion, and gender recognition using holons," in Advances in Neural Information Processing Systems 3, D. S. Touretzky, Ed., 1991, pp. 564–571.
[47] B. A. Golomb, D. T. Lawrence, and T. J. Sejnowski, "SEXNET: A neural network identifies sex from human faces," in Advances in Neural Information Processing Systems 3, D. S. Touretzky, Ed., 1991, pp. 572–577.
[48] G. L. Sicuranzi, G. Ramponi, and S. Marsi, "Artificial neural network for image compression," Electronics Letters, vol. 26, no. 7, pp. 477–479, March 29, 1990.
[49] D. Anthony, E. Hines, D. Taylor, and J. Barham, "A study of data compression using neural networks and principal component analysis," in Colloquium on Biomedical Applications of Digital Signal Processing, 1989, pp. 1–5.
[50] R. Kohno, M. Arai, and H. Imai, "Image compression using a neural network with learning capability of variable function of a neural unit," in SPIE Vol. 1360 Visual Communications and Image Processing '90, 1990, pp. 69–75.
[51] A. Namphol, M. Arozullah, and S. Chin, "Higher-order data compression with neural networks," in Proc. Int. Joint Conf. on Neural Networks '91, 1991, pp. I 55–59.
[52] D. DeMers and G. W. Cottrell, "Non-linear dimensionality reduction," in Advances in Neural Information Processing Systems 5, 1993.
[53] L. E. Russo and E. C. Real, "Image compression using an outer product neural network," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '92, San Francisco, CA, March 23-26, 1992, pp. II 377–380.
[54] R. D. Dony and S. Haykin, "Optimally integrated adaptive learning," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '93, Minneapolis, MN, April 27-30, 1993, pp. I 609–612.
[55] ——, "Optimally adaptive transform coding," to appear in IEEE Trans. Image Processing, 1995.
[56] ——, "Self-organizing segmentor and feature extractor," in Proc. IEEE Int. Conf. on Image Processing, Austin, TX, November 13-16, 1994, pp. III 898–902.
[57] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, September 1990.
[58] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[59] S. P. Luttrell, "Self-organization: A derivation from first principle of a class of learning algorithms," in IEEE Conference on Neural Networks, Washington, DC, 1989, pp. 495–498.
[60] N. M. Nasrabadi and Y. Feng, "Vector quantization of image based upon a neural-network clustering algorithm," in SPIE Vol. 1001 Visual Communications and Image Processing '88, 1988, pp. 207–213.
[61] ——, "Vector quantization of image based upon the Kohonen self-organizing feature map," in Proc. IEEE Int. Conf. on Neural Networks, San Diego, CA, July 24-27, 1988, pp. I 101–105.
[62] J. D. McAuliffe, L. E. Atlas, and C. Rivera, "A comparison of the LBG algorithm and Kohonen neural network paradigm for image vector quantization," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing '90, Albuquerque, NM, April 3-6, 1990, pp. 2293–2296.