
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 42, NO. 4, JULY 1996

Book Reviews

Fundamentals of Artificial Neural Networks, by M. H. Hassoun (Cambridge, MA: MIT Press, 1995)

Reviewer: Terrence L. Fine, Fellow, IEEE

Artificial neural networks are systems motivated by the distributed, massively parallel computation in the brain that enables it to be so successful at complex control and recognition/classification tasks. The biological neural network that accomplishes this can be mathematically modeled/caricatured by a weighted, directed graph of highly interconnected nodes (neurons). The artificial nodes are almost always simple transcendental functions whose arguments are the weighted summation of the inputs to the node; early work on neural networks and some current work uses node functions taking on only binary values. After a period of active development in the 1950's and 1960's, that slowed in the face of the limitations of the networks then being explored, neural networks experienced a renaissance in the 1980's with the work of Hopfield [7] on the use of networks with feedback (graphs with cycles) as associative memories, and that of Rumelhart et al. [13] on backpropagation training and feedforward (acyclic graphs) networks that could "learn" from input-output examples provided in a training set. Learning in this context is carried out by a descent-based algorithm that adjusts the network weights so that the network response closely approximates the desired responses specified by the training set. This ability to learn from training data, rather than needing to be explicitly (heuristically) programmed, was important both for an understanding of the functioning of brains and for progress in a great variety of applications in which practitioners had been unable to embed their qualitative understanding in successful programs. The capabilities of neural networks were quickly exploited in a great number of applications to pattern classification, control, and time-series forecasting. Hopfield's work on associative memories excited the interest of his fellow statistical physicists who felt that their methods of analysis would be applicable and productive in studying the asymptotic behavior of neural networks. Unfortunately, many of the applications and studies were either trivial or misguided and earned the field the sobriquet of "hype" for the common appearance of exaggerated claims. Nonetheless, information theory distinguished itself with such solid papers as that of McEliece et al. [8] in providing mathematically sophisticated analyses of network capabilities. The 1990's saw a significant maturation both in applications and in theoretical understanding of performance and limitations. In particular, neural networks provided a wide spectrum of applied statisticians with a new and powerful class of regression and classification functions that, for the first time, allowed them to make successful truly nonlinear models involving hundreds of variables. The problem of "feature" or "regressor" selection becomes less critical when you do not need to narrow your choices among input variables. A new regime in statistics became accessible, and applied statisticians were no longer restricted in practice either to very simple nonlinear models in a few variables or to larger, but linear, models based solely upon second-order properties.
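The description above, of a node as a smooth function applied to a weighted sum of its inputs, trained by a descent rule that nudges the weights so the outputs track the desired responses, can be made concrete in a few lines. The following sketch is purely illustrative and is not drawn from Hassoun's text: a single sigmoid node fit to a toy training set by plain gradient descent on squared error.

```python
import numpy as np

def sigmoid(z):
    # Smooth "transcendental" node function applied to the weighted input sum.
    return 1.0 / (1.0 + np.exp(-z))

def train_single_node(X, d, lr=0.5, epochs=1000):
    """Gradient descent on squared error for one sigmoid node with a bias weight."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # append constant input for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = sigmoid(X @ w)                           # network response
        err = y - d                                  # deviation from desired response
        grad = X.T @ (err * y * (1.0 - y)) / len(d)  # gradient of the (halved) mean squared error
        w -= lr * grad                               # descent step on the weights
    return w

if __name__ == "__main__":
    # Toy training set: learn the OR function from input-output examples.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d = np.array([0, 1, 1, 1], dtype=float)
    w = train_single_node(X, d)
    print(np.round(sigmoid(np.hstack([X, np.ones((4, 1))]) @ w), 2))
```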
The initial generation of books on artificial neural networks that appeared in the late 1980's tended to be either highly simplified overviews, with a significant emphasis on neurobiological issues, or edited collections of papers, frequently with a physics orientation and focus on Hopfield/recurrent/feedback networks. In the last couple of years we have been fortunate to see the emergence of several engineering-oriented texts written by capable authors with systems or statistics backgrounds. There continues to be a proliferation of edited, and I use this term loosely, collections of papers serving the ego-gratifying purpose of having conferences and workshops appear in hard cover. Among the new generation of texts worth study, the first was perhaps Hertz et al. [6], and those of Haykin [5] and Zurada [16] are comparable to the current text by Hassoun as being appropriate for electrical engineers. Haykin's is the most comprehensive of these books, while Hassoun's [4] is somewhat more mathematical while also attempting to be comprehensive. Unfortunately, a consequence of the attempts by all of these authors to be comprehensive is that on many important topics their treatment is too superficial for a good senior/first-year-graduate course on this subject, and too superficial to please the readers of these Transactions. Typically, mathematical results are quoted from other sources, with little or no supporting argument, let alone proofs, provided. An attempt at wider communication that argues for a de-emphasis on rigor becomes corrupted by a de-emphasis on precision and a frequent absence of significant explanation and development. Those seeking a deeper level of explanation will be interested in a newer neural network literature represented by Ripley [11], Siu et al. [14], and Vapnik [15] that either makes fewer compromises with mathematical theory or explains mathematical issues more soundly. Each of these monographs is more focused in its treatment of neural networks.

Hassoun's goals are outlined in his Preface as, "emphasizes fundamental theoretical aspects of the computational capabilities and learning abilities of artificial neural networks . . . to present a unified framework that makes the subject more accessible to students and practitioners. . . . The main audience is first-year graduate students in electrical engineering, computer engineering, and computer science. . . . The theory and techniques . . . are fairly mathematical, although the level of mathematical rigor is relatively low. . . . The operation of artificial neural networks is viewed as that of nonlinear systems." The author, in a personal communication, has asserted that, "The unifying framework is that one may treat all neural networks in terms of an approximation model and an associated learning rule . . . ." This claim of a unified framework is further amplified in the Preface by the assertion, "a continuous-time learning rule is viewed as a first-order stochastic differential equation/dynamical system whereby the state of the system evolves so as to minimize an associated instantaneous criterion function." While this viewpoint explains much of the order of presentation in this text, it encompasses so much that it does not define accurately the choice of topics. This is perhaps to be expected, given the attempt to be comprehensive and the weight given to various issues that are reflective of the author's research interests. The low level of mathematical rigor is less problematic than is the fact that many mathematical results are simply quoted from the literature, a failing common to this generation of neural network texts. Finally, the target audience is an appropriate one for this text.

Chapter 1 and the first part of Chapter 2 contain about 45 pages devoted to neural networks with a binary-valued signum function element called a linear threshold gate (LTG) and its easy generalization to a polynomial threshold gate (PTG) in which the inputs to the LTG are polynomial functions of the original inputs rather than just the inputs themselves. These networks were the ones originally envisioned by McCullough and Pitts in 1943 and studied in some detail by Frank Rosenblatt [12]. Chapter 1 concentrates on the properties of an individual linear threshold gate (step function) and follows the seminal earlier work of Cover [3] and can be compared to Nilsson [10]. While Weierstrass' Theorem on uniform approximation to continuous functions by polynomials is cited in Chapter 1, it is out of place in this discussion of approximation by LTG's and PTG's; the synthesis of Boolean functions by PTG's requires far less than the Weierstrass Theorem. Results from older studies on switching function realization are recalled, sometimes to advantage and sometimes, as in the case of Karnaugh maps with their applicability only to very small numbers of inputs, to little advantage. However, it is of value to embed LTG-based networks in their historical switching circuit predecessor. A good selection of exercises rounds out Chapter 1, although several problems (e.g., ones on polynomial approximation) have been given little foundation and do not support the developments of the chapter.

Chapter 2 starts by analyzing the Boolean function representation properties of LTG-based networks in terms of complexity of implementations. More extensive discussion of such networks can be found in the detailed treatment given by Siu et al. [14] that is written at the mathematical level of these Transactions. The approximation properties of neural networks with real-valued nodes are noted, albeit briefly, in Section 2.3. This is surprising given the detailed attention given to LTG-based networks and the far greater importance in practice of networks with real-valued nodes. Kolmogorov's theorem on representations of continuous functions of several variables is presented in some detail even though it is largely irrelevant to representations by neural networks. The essential results on universal approximation properties of single hidden layer neural networks are indicated in Section 2.3.2. Section 2.4.2 contains some speculations on the energy required to perform computations that are of interest but seem out of place. Chapter 2 closes with a small selection of exercises that focus on the LTG case and not on the more important real-valued case.

The major focus of this text, 220 pages of Chapters 3, 4, and 5, is the issue of learning in neural networks. More than 20 supervised (learning with classified data) and unsupervised learning algorithms are introduced in Chapter 3. Chapter 3 opens with the perceptron training algorithm (PTA) of Rosenblatt, used to train a single binary-valued node. Convergence is proven when the training set is learnable (linearly separable) and some generalizations of the PTA are given. The PTA is also derived as a gradient descent rule. There is a brief discussion of the nonconvergent behavior of the PTA when the training set is not linearly separable. More is known now about this case and about the intrinsic difficulty of the nonseparable case (e.g., Amaldi [1]). The widely discussed LMS rule is presented in some detail. In all, there are about 30 pages of discussion of learning/training for a single-node neural network. There is an extensive presentation of unsupervised learning rules including Hebbian and Linsker, principal components analysis, Kohonen vector quantization, and Kohonen self-organizing feature maps. Hassoun notes that, "The presentation of these rules is unified in the sense that they may all be viewed as realizing incremental steepest-gradient-descent search on a suitable criterion function." Chapter 3 concludes with a nice table summarizing characteristics of the wide variety of learning rules discussed.
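For readers who have not met the rule that opens Chapter 3, a minimal version of Rosenblatt's perceptron training algorithm is sketched below. It is the generic textbook formulation, not Hassoun's presentation; convergence is guaranteed only for linearly separable training sets, so a pass limit is imposed to cope with the nonconvergent behavior mentioned above.

```python
import numpy as np

def perceptron_train(X, d, max_passes=100):
    """Rosenblatt's rule for a single binary-valued (signum) node.

    X: examples as rows; d: desired labels in {-1, +1}.
    Each misclassified example nudges the weights toward correct classification;
    convergence is guaranteed only for linearly separable training sets.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias input
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, target in zip(X, d):
            if np.sign(x @ w) != target:          # signum node output disagrees
                w += target * x                   # error-correction update
                mistakes += 1
        if mistakes == 0:                         # separating hyperplane found
            return w
    return w                                      # nonseparable case or pass limit reached

# Example: a linearly separable toy set.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
d = np.array([1, 1, -1, -1])
print(perceptron_train(X, d))
```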
Chapter 4 is an ambitious attempt to provide mathematical analyses of several of the preceding learning rules that address asymptotic behavior as well as complexity and generalization performance. One can question the importance of asymptotic analyses that address stability rather than analyses of statistical performance. Chapter 4 uses approximations of learning algorithms by ordinary differential equations to study stability, and this is an area of research interest to the author. I suspect that this approach has little practical importance in terms of guidelines useful in deploying and controlling these learning algorithms with finite training sets, limited computational resources, and the prevalent use of training algorithms controlled to limit the amount of training. Generalization is addressed on the bases of both average and worst case performance analyses. The exposition of average case generalization (e.g., eq. 4.8.4) does not meet the standards of rigor expected by the statistical and information-theoretic communities. The discussion of worst case analysis is very brief and amounts to a page of quotation primarily from Baum and Haussler [2]. The important issue of generalization ability of a trained network is given short shrift compared to the less important issue of an idealized asymptotic behavior of the training algorithm. The attached exercises underscore this imbalance of treatment.

Chapter 5 treats the central topic of the backpropagation formulation of steepest descent for training multilayer neural networks in a somewhat thrown-together fashion. Curiously, Hassoun never introduces a systematic notation for describing multilayer neural networks. One is left with a diagram to indicate that the weight symbols denote weights connected to different layers. There is no thorough discussion setting backpropagation in the context of steepest descent algorithms. Indeed, Hassoun allows different learning rates for the network output weights and for the hidden layer weights, and thereby moves away from steepest descent without formally facing this consequence. There is only a one-sided discussion of the relative merits of on-line (called here "incremental") versus batch processing, with the recommendation solidly on the side of on-line processing for its claimed advantage of being noisier and therefore possessing a greater likelihood of not becoming trapped in a poor local minimum. On-line processing can indeed be construed as more random in its results than is batch processing. However, batch processing is the only version that can implement true steepest descent. The choice of fixed learning rate is addressed in the earlier Section 4.2.2, where bounds necessary for convergence are noted, and an "optimal learning rate" is suggested in Section 5.2.2.
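The distinction between batch and on-line (incremental) processing is easy to state in code: only the batch update follows the gradient of the total training-set error and so implements true steepest descent, while the on-line update follows a noisy single-example gradient. The linear least-squares sketch below is a generic illustration and is not taken from the book.

```python
import numpy as np

def batch_step(w, X, d, lr):
    # True steepest descent: one step along the gradient of the total squared error.
    grad = X.T @ (X @ w - d) / len(d)
    return w - lr * grad

def online_step(w, X, d, lr, rng):
    # Incremental (on-line) update: a noisy step based on a single random example.
    i = rng.integers(len(d))
    x, target = X[i], d[i]
    return w - lr * (x @ w - target) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w_batch = w_online = np.zeros(3)
for _ in range(2000):
    w_batch = batch_step(w_batch, X, d, lr=0.1)
    w_online = online_step(w_online, X, d, lr=0.01, rng=rng)
print(np.round(w_batch, 2), np.round(w_online, 2))  # both near the least-squares solution
```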
Section 5.2 contains about 25 pages on improvements to the basic backpropagation algorithm presented in Section 5.1. Several important training algorithms are touched on only briefly in Section 5.2.3, entitled "Momentum." Levenberg-Marquardt, the standard adopted by the Matlab Neural Networks Toolbox, is inaccurately referred to in one sentence on p. 218. Industry standard quasi-Newton methods are not noted, although some approximations to the Hessian appearing in Newton methods are discussed briefly. The important idea of moving from steepest descent search to conjugate gradient directions search is not given the treatment it deserves. Conjugate gradient methods have been quite successful in neural network applications (e.g., Moller [9]). Weight decay and regularization methods are also deserving of more discussion than that provided in Section 5.2.5. Cross-validation is treated in only a qualitative fashion in Section 5.2.6. The chapter concludes with a section containing a good variety of applications culled from the literature, and a section that includes such topics as time-delay neural networks (TDNN) and so-called backpropagation through time used in controller design.

Hassoun reaches several general conclusions about the relative values of networks with two hidden layers versus ones with a single hidden layer, preferring networks with a single hidden layer to
multiple-layer networks having the same number of weights. While these conclusions are worth keeping in mind, too little is known analytically about the relative merits of such networks to place much confidence in them. Global optimization is an issue that Hassoun examines in Chapter 8 as well as in Chapter 5. The suggestion made in Chapter 5 is a revision of the loss or error function that is only guaranteed to find a global minimum for networks with a single input variable, an application far from typical for neural networks.

Chapter 6 treats a variety of other networks such as those based on radial basis functions, the CMAC model, and so-called adaptive resonance theory for clustering networks. Radial basis functions will be familiar to some as the kernel methods used for estimating probability density functions and for interpolation. The CMAC model, while it has adherents, is peculiar in this setting for being a linear mapping operating on features selected through Boolean operations on the network inputs.
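To make the remark about radial basis functions concrete: an RBF network computes a weighted sum of radially symmetric bumps centered at stored points, which is the same computation carried out by kernel smoothing and interpolation. The sketch below is a generic illustration (Gaussian basis functions with fixed centers and least-squares output weights), not the book's formulation.

```python
import numpy as np

def rbf_design(X, centers, width):
    # Matrix of Gaussian radial basis responses, one column per center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

# Fit a one-dimensional RBF network to noisy samples of sin(x).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2 * np.pi, 40)[:, None]
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=40)

centers = np.linspace(0.0, 2 * np.pi, 10)[:, None]    # fixed centers on a grid
Phi = rbf_design(x, centers, width=0.7)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # linear output weights

x_test = np.array([[np.pi / 2]])
print(rbf_design(x_test, centers, width=0.7) @ w)     # close to sin(pi/2) = 1
```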
Chapter 7 treats associative memories in some detail. Models include both linear systems and neural networks with feedback known as Hopfield nets. While the analysis of the memory storage abilities (capacity) of Hopfield nets had early contributions from information theorists (e.g., McEliece et al. [8]), the treatment provided herein will disappoint readers of these Transactions for its brevity. Results in this area are cited but neither derived nor discussed in detail. The chapter concludes with a good variety of problems.
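As a reminder of what the capacity analyses cited here address, the following is a minimal, generic Hopfield-style associative memory: patterns are stored in a symmetric weight matrix by a Hebbian outer-product rule, and a corrupted probe is driven back to a stored pattern by repeated signum updates. This is a standard textbook construction, not the treatment given in Chapter 7.

```python
import numpy as np

def store(patterns):
    # Hebbian outer-product rule; zero the diagonal so units do not feed themselves.
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, probe, sweeps=10):
    # Asynchronous signum updates until the state stops changing.
    s = probe.copy()
    for _ in range(sweeps):
        prev = s.copy()
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
        if np.array_equal(s, prev):
            break
    return s

# Store two +/-1 patterns, then recover one from a corrupted version of itself.
rng = np.random.default_rng(2)
patterns = rng.choice([-1, 1], size=(2, 64))
W = store(patterns)
probe = patterns[0].copy()
probe[rng.choice(64, size=8, replace=False)] *= -1    # flip a few bits
print(np.array_equal(recall(W, probe), patterns[0]))  # typically True, well below capacity
```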
Chapter 8 devotes about 60 pages to such global optimization methods for neural networks as simulated annealing and genetic algorithms. While this is a topic worth treating, one should not make much of such approaches. Global optimization of functions of hundreds or thousands of variables is an unrealistic demand that can only be achieved if one devotes unlikely computational resources to the optimization. Neural networks work in practice because a judicious selection among several local minima turns out to be sufficient.

There is an extensive bibliography of about 700 items covering the period ending in 1993, and a good index. Each chapter closes with a useful summary. In sum, Hassoun has written a worthy comprehensive text on a wide variety of neural networks and approaches to their design that can, as he hoped, be the basis of a senior/first-year graduate level course. However, in a number of areas those seeking deeper understanding either for complex applications or through an interest in research will have to seek out either one of the newer texts or follow their interests through the references provided to other sources.

REFERENCES

[1] E. Amaldi, "From finding maximum feasible subsystems of linear systems to feedforward neural network design," Sc.D. dissertation, Dept. Math., ETH Lausanne, EPFL, 1994.
[2] E. Baum and D. Haussler, "What size net gives valid generalization?," in D. Touretzky, Ed., Advances in Neural Information Processing Systems 1. San Mateo, CA: Morgan Kaufmann, 1989, pp. 81-90.
[3] T. M. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Trans. Electron. Comput., pp. 326-334, 1965.
[4] M. Hassoun, Fundamentals of Artificial Neural Networks. Cambridge, MA: MIT Press, 1995.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[6] J. Hertz, A. Krogh, and R. Palmer, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[7] J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proc. Nat. Acad. Sciences, vol. 79, pp. 2554-2558, 1982.
[8] R. McEliece, E. Posner, E. Rodemich, and S. Venkatesh, "The capacity of the Hopfield associative memory," IEEE Trans. Inform. Theory, vol. IT-33, pp. 461-482, 1987.
[9] M. Moeller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525-533, 1993.
[10] N. Nilsson, The Mathematical Foundations of Learning Machines (1965). San Mateo, CA: Morgan Kaufmann, 1990, reprint.
[11] B. Ripley, Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge Univ. Press, 1996.
[12] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (a version of a 1961 report of the same title done for Cornell Aeronautical Labs.). Washington, DC: Spartan Books, 1962.
[13] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
[14] K.-Y. Siu, V. Roychowdhury, and T. Kailath, Discrete Neural Computation: A Theoretical Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[15] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[16] J. Zurada, Introduction to Artificial Neural Systems. St. Paul, MN: West Publishing, 1992.

Terrence L. Fine (S'62-M'63-SM'81-F'82) received the Ph.D. degree in applied physics from Harvard University, Cambridge, MA, in 1963.
After a post-doctoral appointment at Harvard and serving as a Miller Fellow at the University of California at Berkeley, he joined the School of Electrical Engineering at Cornell University, Ithaca, NY, in 1966. His research interests are in the foundations of probability, particularly when probability is not represented by the reals, and in statistical aspects of the performance and design of neural networks.
Dr. Fine has been an Associate Editor for Detection and Estimation and for book reviews of the IEEE TRANSACTIONS ON INFORMATION THEORY and is a Past President of the IEEE Information Theory Society (actually, the last President of the Information Theory Group). He is also a founding member of the Board of Directors of the Neural Information Processing Systems Foundation.
