Understanding Neural Networks as Statistical Tools
Brad WARNER and Manavendra MISRA

Brad Warner is Assistant Professor, United States Air Force Academy, Colorado Springs, CO 80840. Manavendra Misra is Assistant Professor, Colorado School of Mines, Golden, CO 80401. The authors thank Guillermo Marshall, the referees, and editors for the helpful comments that have improved this paper. The authors also thank Dr. Karl Hammermeister and the Department of Veterans Affairs for the use of their data. Please direct all correspondence to the second author.

Neural networks have received a great deal of attention over the last few years. They are being used in the areas of prediction and classification, areas where regression models and other related statistical techniques have traditionally been used. In this paper we discuss neural networks and compare them to regression models. We start by exploring the history of neural networks. This includes a review of relevant literature on the topic of neural networks. Neural network nomenclature is then introduced, and the backpropagation algorithm, the most widely used learning algorithm, is derived and explained in detail. A comparison between regression analysis and neural networks in terms of notation and implementation is conducted to aid the reader in understanding neural networks. We compare the performance of regression analysis with that of neural networks on two simulated examples and one example on a large dataset. We show that neural networks act as a type of nonparametric regression model, enabling us to model complex functional forms. We discuss when it is advantageous to use this type of model in place of a parametric regression model, as well as some of the difficulties in implementation.

KEY WORDS: Artificial intelligence; Backpropagation; Generalized linear model; Nonparametric regression.

1. INTRODUCTION

Neural networks have recently received a great deal of attention in many fields of study. The excitement stems from the fact that these networks are attempts to model the capabilities of the human brain. People are naturally attracted by attempts to create human-like machines, a Frankenstein obsession, if you will. On a practical level the human brain has many features that are desirable in an electronic computer. The human brain has the ability to generalize from abstract ideas, recognize patterns in the presence of noise, quickly recall memories, and withstand localized damage. From a statistical perspective neural networks are interesting because of their potential use in prediction and classification problems.

Neural networks have been used for a wide variety of applications where statistical methods are traditionally employed. They have been used in classification problems such as identifying underwater sonar contacts (Gorman and Sejnowski 1988), and predicting heart problems in patients (Baxt 1990, 1991; Fujita, Katafuchi, Uehara, and Nishimura 1992). They have also been used in such diverse areas as diagnosing hypertension (Poli, Cagnoni, Livi, Coppini, and Valli 1991), playing backgammon (Tesauro 1990), and recognizing speech (Lippmann 1989). In time series applications they have been used in predicting stock market performance (Hutchinson 1994). Neural networks are currently the preferred tool in predicting protein secondary structures (Qian and Sejnowski 1988). As statisticians or users of statistics, we would normally solve these problems through classical statistical models such as discriminant analysis (Flury and Riedwyl 1990), logistic regression (Studenmund 1992), Bayes and other types of classifiers (Duda and Hart 1973), multiple regression (Neter, Wasserman, and Kutner 1990), and time series models such as ARIMA and other forecasting methods (Studenmund 1992). It is therefore time to recognize neural networks as a potential tool for data analysis.

Several authors have done comparison studies between statistical methods and neural networks (Hruschka 1993; Wu and Yen 1992). These works tend to focus on performance comparisons and use specific problems as examples. There are a number of good introductory articles on neural networks, usually located in various trade journals. For instance, Lippmann (1987) provides an excellent overview of neural networks for the signal processing community. There are also a number of good introductory books on neural networks, with Hertz, Krogh, and Palmer (1991) providing a good mathematical description, Smith (1993) explaining backpropagation in an applied setting, and Freeman (1994) using examples and code to explain neural networks. There have also been papers relating neural networks and statistical methods (Buntine and Weigend 1991; Ripley 1992; Sarle 1994; Werbos 1991). One of the best for a general overview is Ripley (1993).

This article intends to provide a short, basic introduction of neural networks to scientists, statisticians, engineers, and professionals with a mathematical and statistical background. We achieve this by contrasting regression models with the most popular neural network tool, a feedforward multilayered network trained using backpropagation. This paper provides an easy to understand introduction to neural networks, avoiding the overwhelming complexities of many other papers comparing these techniques.

Section 2 discusses the history of neural networks. Section 3 explains the nomenclature unique to the neural network community, and provides a detailed derivation of the backpropagation learning algorithm. Section 4 shows an equivalence between regression and neural networks. It demonstrates the methods on three examples. Two examples are simulated data where the underlying functions are known, and the third is on data from the Department of Veterans
Affairs Continuous Improvement in Cardiac Surgery Program (Hammermeister, Johnson, Marshall, and Grover 1994). These examples demonstrate the ideas in the paper, and clarify when one method would be preferred over the other.

2. HISTORY

Computers are extremely fast at numerical computations, far exceeding human capabilities. However, the human brain has many abilities that would be desirable in a computer. These include: the ability to quickly identify features, even in the presence of noise; to understand, interpret, and act on probabilistic or fuzzy notions (such as "Maybe it will rain tomorrow"); to make inferences and judgments based on past experiences and relate them to situations that have never been encountered before; and to suffer localized damage without losing complete functionality (fault tolerance). So even though the computer is faster than the human brain in numeric computations, the brain far outperforms the computer in other tasks. This is the underlying motivation for trying to understand and model the human brain.

The neuron is the basic computational unit of the brain (see Fig. 1). A human brain has approximately 10^11 neurons acting in parallel. The neurons are highly interconnected, with a typical neuron being connected to several thousand other neurons. [For more details of the biology of neurons see Thompson (1985).] Early work on modeling the brain started with models of the neuron. The McCulloch-Pitts model (McCulloch and Pitts 1943) of the neuron was one of the first attempts in this area. The McCulloch-Pitts model is a simple binary threshold unit (Fig. 2). The neuron receives a weighted sum of inputs from connected units, and outputs a value of one (fires) if this sum is greater than a threshold. If the sum is less than the threshold, the model neuron outputs a zero value. Mathematically we can represent this model as

y_i = \Theta\Big( \sum_j w_{ij} x_j - \mu_i \Big)   (1)

where y_i is the output of neuron i, w_{ij} is the weight of the connection from neuron j to neuron i, x_j is the output of neuron j, \mu_i is the threshold for neuron i, and \Theta is the activation function, defined as

\Theta(netinput) = \begin{cases} 1 & \text{if } netinput > 0 \\ 0 & \text{otherwise.} \end{cases}   (2)

Although this model is simple, it has been demonstrated that computationally it is equivalent to a digital computer. This means that any of the computations carried out on conventional digital computers can be accomplished with a set of interconnected McCulloch-Pitts neurons (Abu-Mostafa 1986).

Figure 1. Schematic Representation of a Biological Neuron.

Figure 2. McCulloch-Pitts Neuron. The inputs are multiplied by weights and summed. If the net input exceeds a threshold level, the neuron fires (the resultant output is 1).
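As a concrete illustration of Equations (1) and (2), the following short Python sketch (not from the original paper; the weights, inputs, and threshold are arbitrary values chosen for illustration) implements a single McCulloch-Pitts unit.

```python
import numpy as np

def mcculloch_pitts(x, w, mu):
    """McCulloch-Pitts unit: output 1 (fire) if the weighted sum of the
    inputs exceeds the threshold mu, otherwise output 0 (Eqs. 1-2)."""
    net_input = np.dot(w, x) - mu
    return 1 if net_input > 0 else 0

# Example: a unit that fires only when both binary inputs are on (logical AND).
w = np.array([1.0, 1.0])   # illustrative weights
mu = 1.5                   # illustrative threshold
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mcculloch_pitts(np.array(x), w, mu))
```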
In the early 1960s Rosenblatt developed a learning algorithm for a model he called the simple perceptron (Rosenblatt 1962). The simple perceptron consists of McCulloch-Pitts model neurons that form two layers, input and output. The input neurons receive data from the external world, and the output neurons send information from the network to the outside world. Each input neuron is unidirectionally connected to all the output neurons. The model uses binary (-1 or 1) input and output units. Rosenblatt was able to demonstrate that if a solution to a classification problem "existed," his model would converge, or learn, the solution in a finite number of steps. For the problems in which he was interested a solution existed if the problem was linearly separable. (Linear separability means that a hyperplane, which is simply a line in two dimensions, exists that can completely delineate the classes that the classifier attempts to identify. Problems that are linearly separable are only a special case of all possible classification problems.) A major blow to the early development of neural networks occurred when Minsky and Papert picked up on the linear separability limitation of the simple perceptron and published
results demonstrating this limitation (Minsky and Papert 1969). Although Rosenblatt knew of these limitations, he had not yet found a way to train other models to overcome this problem. As a result, interest and funding in neural networks waned. [It is interesting to note that while Rosenblatt, a psychologist, was interested in modeling the brain, Widrow, an engineer, was developing a similar model for signal processing applications called the Adaline (Widrow 1962).]

In the 1970s there was still a limited amount of research activity in the area of neural networks. Modeling the memory was the common thread of most of this work. (Anderson (1970) and Willshaw, Buneman, and Longuet-Higgins (1969) discuss some of this work.) Grossberg (1976) and von der Malsburg (1973) were developing ideas on competitive learning, while Kohonen (1982) was developing feature maps. Grossberg (1983) was also developing his Adaptive Resonance Theory. Obviously, there was a great deal of work done during this period, with many important papers and ideas that are not presented in this paper. [For a more detailed description of the history see Cowan and Sharp (1988).]

Interest in neural networks renewed with the Hopfield model (Hopfield 1982) of a content-addressable memory. In contrast to the human brain a computer stores data as a look-up table. Access to this memory is made using addresses. The human brain does not go through this look-up process; it "settles" to the closest match based on the information content presented to it. This is the idea of a content-addressable memory. The Hopfield model retrieves a stored pattern by "relaxing" to the closest match to an input pattern. Hopfield, however, did not use the network as a memory as it is prone to getting stuck in local minima, as well as being limited in the number of stored patterns (the network could reliably store a total number of patterns equal to approximately one tenth the number of inputs). Instead, Hopfield used his model for solving optimization problems such as the traveling salesperson problem (Hopfield and Tank 1985).

One of the most important developments during this period was the development of a method to train multilayered networks. This new learning algorithm was called backpropagation (McClelland, Rumelhart, and the PDP Research Group 1986). The idea was explored in earlier works (Werbos 1974), but was not fully appreciated at the time. Backpropagation overcame the earlier problems of the simple perceptron and renewed interest in neural networks. A network trained using backpropagation can solve a problem that is not linearly separable. Many of the current uses of neural networks in applied settings involve a multilayered feedforward network trained using backpropagation or a modification of the algorithm. Details of the backpropagation algorithm will be presented in Section 3.

Neural network research incorporates many other architectures besides the multilayered feedforward network. Boltzmann machines have been developed based on stochastic units, and have been used for tasks such as pattern completion (Hinton and Sejnowski 1986). Time series problems have been attacked with recurrent networks such as the Elman network (Elman 1990), the Jordan network (Jordan 1989), and real-time recurrent learning (Williams and Zipser 1989), to mention a few. There are neural networks that perform principal component analysis, prototyping, encoding, and clustering. Examples of neural network implementations include linear vector quantization (Kohonen 1989), adaptive resonance theory (Moore 1988), feature mapping (Willshaw and von der Malsburg 1976), and counterpropagation networks (Hecht-Nielsen 1987). There are, of course, many more equally important contributions that have been omitted here in the interest of time and space.

Figure 3. Schematic Representation of a Multilayer Feedforward Neural Network. The hidden units receive a weighted sum of the inputs and apply an activation function to the sum. The output units then receive a weighted sum of the hidden unit's output and apply an activation function to this sum. Information is passed only from one layer to the next. There are no connections within layers and no connections to pass information back to a previous layer.

3. NEURAL NETWORK THEORY

The nomenclature used in the neural network literature is different from that used in statistical literature. This section introduces the nomenclature, and explains in detail the backpropagation algorithm (the algorithm used for estimation of model coefficients).

3.1 Nomenclature

Although the original motivation for the development of neural networks was to model the human brain, most neural networks as they are currently being used bear little resemblance to a biological brain. (It must be pointed out that there is research in the areas of accurately modeling biological neurons and the processes of the brain, but these areas will not be discussed further here because this paper is concerned with the use of neural networks in prediction and function estimation.) A neural network is a set of simple computational units that are highly interconnected. The units are also called nodes, and loosely represent the biological neuron. The networks discussed in this paper resemble
the network in Figure 3. The neurons are represented by circles in Figure 3; the connections between units are unidirectional and are represented by arrows in the figure. These connections model the synaptic connections in the brain. Each connection has a weight, called the synaptic weight, denoted as w_{ij}, associated with it. The synaptic weight, w_{ij}, is interpreted as the strength of the connection from the jth unit to the ith unit.

The input into a node is a weighted sum of the outputs from nodes connected to it. Thus the net input into a node is

netinput_i = \sum_j w_{ij} \cdot output_j + \mu_i,   (3)

where w_{ij} are the weights connecting neuron j to neuron i, output_j is the output from unit j, and \mu_i is a threshold for neuron i. The threshold term is the baseline input to a node in the absence of any other inputs. (The term threshold comes from the activation function used in the McCulloch-Pitts neuron; see Equation (2), where the threshold term set the level that the other weighted inputs had to exceed for the neuron to fire.) If a weight w_{ij} is negative, it is termed inhibitory because it decreases the net input. If the weight is positive, the contribution is excitatory because it increases the net input.

Each unit takes its net input and applies an activation function to it. For example, the output of the jth unit, also called the activation value of the unit, is g(\sum_i w_{ji} x_i), where g(\cdot) is the activation function and x_i is the output of the ith unit connected to unit j. A number of nonlinear functions have been used by researchers as activation functions; the two common choices for activation functions are the threshold function in Equation (2) (mentioned in Section 2) and sigmoid functions such as

g(netinput) = \frac{1}{1 + e^{-netinput}}   (4)

or

g(netinput) = \tanh(netinput).

The threshold function is useful in situations where the inputs and outputs are binary encoded, such as the exclusive or problem (McClelland et al. 1986) and the parity problem (Minsky and Papert 1969). The sigmoid functions are the common activation functions used in current neural network modeling, and were used in many of the application papers discussed in Section 1. The only practical requirement for the activation function for use in the backpropagation algorithm, as will become clear in Section 3.2, is that it be differentiable.
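As a small illustration of Equations (3) and (4), the following Python sketch (not part of the original paper; the names and numerical values are illustrative) computes the net input of a unit and applies the two sigmoid-type activations mentioned above.

```python
import numpy as np

def net_input(weights, outputs, mu):
    """Equation (3): weighted sum of incoming unit outputs plus threshold."""
    return np.dot(weights, outputs) + mu

def logistic(z):
    """Equation (4): logistic sigmoid, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: three incoming units feeding one node.
w = np.array([0.4, -1.2, 0.7])     # the negative weight is inhibitory
out = np.array([0.9, 0.3, 0.5])
mu = 0.1

z = net_input(w, out, mu)
print(logistic(z), np.tanh(z))     # two possible activation values
```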
The brain learns by adapting the strength of the synaptic connections. Likewise, the synaptic weights in neural networks, similar to coefficients in regression models, are adjusted to solve the problem presented to the network. Learning or training is the term used to describe the process of finding the values of these weights. The two types of learning associated with neural networks are supervised and unsupervised learning. Supervised learning, also called learning with a teacher, occurs when there is a known target value associated with each input in the training set. The output of the network is compared with the target value, and this difference is used to train the network (alter the weights). There are many different algorithms for training neural networks using supervised learning; backpropagation is one of the more common ones, and will be explored in detail in Section 3. A biological example of supervised learning is when you teach a child the alphabet. You show him or her a letter, and based on his or her response, you provide feedback to the child. This process is repeated for each letter until the child knows the alphabet.

Unsupervised learning is needed when the training data lack target output values corresponding to input patterns. The network must learn to group or cluster the input patterns based on some common features, similar to factor analysis (Harman 1976) and principal components (Morrison 1976). This type of training is also called learning without a teacher because there is no source of feedback in the training process. A biological example would be when a child touches a hot heating coil. He or she soon learns, without any external teaching, not to touch it. In fact, the child may associate a bright red glow with hot, and learn to avoid touching objects with this feature.

The networks discussed in this paper are constructed with layers of units, and thus are termed multilayered networks. A layer of units in a multilayer network is composed of units that perform similar tasks. A feedforward network is one where units in one layer are connected only to units in the next layer, and not to units in a preceding layer or units in the same layer. Figure 3 shows a multilayered feedforward network. Networks where the units are connected to other units in the same layer, to units in the preceding layer, or even to themselves are termed recurrent networks. Feedforward networks can be viewed as a special case of recurrent networks.

The first layer of a multilayer network consists of the input units, denoted by x_i. These units are known as independent variables in statistical literature. The last layer contains the output units, denoted by y_k. In statistical nomenclature these units are known as the dependent or response variables. (Note that Figure 3 has more than one output unit. This configuration is common in neural network classification applications where there are more than two classes. The outputs represent membership in one of the k classes. The multiple outputs could represent a multivariate response function, but this is not common in practice.) All other units in the model are called hidden units, h_j, and constitute the hidden layers. The feedforward network can have any number of hidden layers with a variable number of hidden units per layer. When counting layers it is common practice not to count the input layer because it does not perform any computation, but simply passes data onto the next layer. So a network with an input layer, one hidden layer, and an output layer is termed a two-layer network.

3.2 Backpropagation Derivation

The backpropagation algorithm is a method to find weights for a multilayered feedforward network. The
development of the backpropagation algorithm is primarily responsible for the recent resurgence of interest in neural networks. One of the reasons for this is that it has been shown that a two-layer feedforward neural network with a sufficient number of hidden units can approximate any continuous function to any degree of accuracy (Cybenko 1989). This makes multilayered feedforward neural networks a powerful modeling tool.

As mentioned, Figure 3 shows a schematic of a feedforward, two-layered neural network. Given a set of input patterns (observations) with associated known outputs (responses), the objective is to train the network, using supervised learning, to estimate the functional relationship between the inputs and outputs. The network can then be used to model or predict a response corresponding to a new input pattern. This is similar to the regression problem where we have a set of independent variables (inputs) and dependent variables (output), and we want to find the relationship between the two.

To accomplish the learning some form of an objective function or performance metric is required. The goal is to use the objective function to optimize the weights. The most common performance metric used in neural networks [although not the only one; see Solla, Levin, and Fleisher (1988)] is the sum of squared errors defined as

E = \sum_{p=1}^{n} \sum_{k=1}^{O} (y_{pk} - \hat{y}_{pk})^2,   (5)

where the subscript p refers to the patterns (observations) with a total of n patterns, the subscript k to the output unit with a total of O output units, y is the observed response, and \hat{y} is the model (predicted) response. This is a sum of the squared difference between the predicted response and the observed response averaged over all outputs and observations (patterns). In the simple case of predicting a single outcome, k = 1 and Equation (5) reduces to

E = \sum_{p=1}^{n} (y_p - \hat{y}_p)^2,

the usual function to minimize in least squares regression.
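A minimal Python sketch of the objective in Equation (5) (not from the original paper; the array names and values are illustrative) makes the double sum over patterns and output units explicit.

```python
import numpy as np

def sum_squared_errors(y, y_hat):
    """Equation (5): sum over patterns p and output units k of (y - y_hat)^2.
    y and y_hat are arrays of shape (n_patterns, n_outputs)."""
    return np.sum((y - y_hat) ** 2)

# Single-output case: reduces to the usual least squares criterion.
y = np.array([[1.0], [0.0], [1.0]])
y_hat = np.array([[0.8], [0.2], [0.6]])
print(sum_squared_errors(y, y_hat))   # 0.2^2 + 0.2^2 + 0.4^2 = 0.24
```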
To understand backpropagation learning we will start by examining how information is first passed forward through the network. The process starts with the input values being presented to the input layer. The input units perform no operation on this information, but simply pass it onto the hidden units. Recalling the simple computational structure of a unit expressed in Equation (3), the input into the jth hidden unit is

h_{pj} = \sum_{i=1}^{N} w_{ji} x_{pi}.   (6)

Here, N is the total number of input nodes, w_{ji} is the weight from input unit i to hidden unit j, and x_{pi} is the value of the ith input for pattern p. The jth hidden unit applies an activation function to its net input and outputs

v_{pj} = g(h_{pj}) = \frac{1}{1 + e^{-h_{pj}}}   (7)

[assuming g(\cdot) is the sigmoid function defined in Equation (4)]. Similarly, output unit k receives a net input of

f_{pk} = \sum_{j=1}^{M} W_{kj} v_{pj},   (8)

where M is the number of hidden units, and W_{kj} represents the weight from hidden unit j to output k. The unit then outputs the quantity

\hat{y}_{pk} = g(f_{pk}) = \frac{1}{1 + e^{-f_{pk}}}.   (9)

(Notice that the threshold value has been excluded from the equations. This is because the threshold can be accounted for by adding an extra unit to the layer and fixing its value at 1. This is similar to adding a column of ones to the design matrix in regression problems to account for the intercept.)
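The forward pass in Equations (6)-(9), including the device of appending a constant unit fixed at 1 to absorb the thresholds, can be sketched in a few lines of Python (an illustrative sketch, not code from the paper; the matrix names and dimensions are assumptions).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_p, w, W):
    """Forward pass for one pattern.
    x_p : inputs for pattern p, shape (N,)
    w   : input-to-hidden weights, shape (M, N + 1)  (extra column = threshold)
    W   : hidden-to-output weights, shape (O, M + 1) (extra column = threshold)
    """
    x = np.append(x_p, 1.0)          # constant unit fixed at 1
    v = sigmoid(w @ x)               # Eqs. (6)-(7): hidden unit outputs
    v = np.append(v, 1.0)            # constant unit for the output layer
    y_hat = sigmoid(W @ v)           # Eqs. (8)-(9): network outputs
    return v, y_hat

# Illustrative dimensions: 3 inputs, 2 hidden units, 1 output.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 4))
W = rng.normal(scale=0.1, size=(1, 3))
print(forward(np.array([0.2, -0.5, 1.3]), w, W))
```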
Recall that the goal is to find the set of weights w_{ji}, the weights connecting the input units to the hidden units, and W_{kj}, the weights connecting the hidden units to the output units, that minimize our objective function, the sum of squared errors in Equation (5). Equations (6)-(9) demonstrate that the objective function, Equation (5), is a function of the unknown weights w_{ji} and W_{kj}. Therefore, the partial derivative of the objective function with respect to a weight represents the rate of change of the objective function with respect to that weight (it is the slope of the objective function). Moving the weights in a direction down this slope will result in a decrease in the objective function. This intuitively suggests a method to iteratively find values for the weights. We evaluate the partial derivative of the objective function with respect to the weights, and then move the weights in a direction down the slope, continuing until the error function no longer decreases. Mathematically this is represented as

\Delta W_{kj} = -\eta \frac{\partial E}{\partial W_{kj}}.   (10)

(The \eta term is known as the learning rate and determines the step size. The common practice in neural networks is to have the user enter a fixed value for the learning rate \eta at the beginning of the problem.)

We will first derive an expression for calculating the adjustment for the weights connecting the hidden units to the outputs, W_{kj}. Substituting Equations (6)-(9) into Equation (5) yields

E = \sum_{p=1}^{n} \sum_{k=1}^{O} \Big( y_{pk} - g\Big( \sum_{j=1}^{M} W_{kj}\, g\Big( \sum_{i=1}^{N} w_{ji} x_{pi} \Big) \Big) \Big)^2

and then expanding Equation (10) using the chain rule we get

\frac{\partial E}{\partial W_{kj}} = \frac{\partial E}{\partial \hat{y}_{pk}} \frac{\partial \hat{y}_{pk}}{\partial f_{pk}} \frac{\partial f_{pk}}{\partial W_{kj}},

but

\frac{\partial E}{\partial \hat{y}_{pk}} = -(y_{pk} - \hat{y}_{pk}),
\frac{\partial \hat{y}_{pk}}{\partial f_{pk}} = g'(f_{pk}) = \hat{y}_{pk}(1 - \hat{y}_{pk})   (11)

[for the sigmoid in Equation (4)], and

\frac{\partial f_{pk}}{\partial W_{kj}} = v_{pj}.

Substituting these results back into Equation (10), the change in weights from the hidden units to the output units, \Delta W_{kj}, is given by

\Delta W_{kj} = -\eta(-1)(y_{pk} - \hat{y}_{pk})\hat{y}_{pk}(1 - \hat{y}_{pk})v_{pj}.   (12)

This gives us a formula to update the weights from the hidden units to the output units. The weights are updated as

W_{kj}^{t+1} = W_{kj}^{t} + \Delta W_{kj}.

This equation implies that we take the weight adjustment in Equation (12) and add it to our current estimate of the weight, W_{kj}^{t}, to obtain an updated weight estimate, W_{kj}^{t+1}.

Before moving on to the calculations for the weights from the inputs to the hidden units, there are several interesting points to be made about Equation (12). (1) Given that the range of the sigmoid function is 0 < g(\cdot) < 1, from Equation (11) we can see that the maximum change of the weight will occur when the output \hat{y}_{pk} is .5. In classification problems the objective is to have the output be either 1 or 0, so an output of .5 represents an undecided case. If the output is at saturation (0 or 1), then the weight will not change. Conceptually this means that units that are undecided will receive the greatest change to their weights. (2) As in other function approximation problems, the (y_{pk} - \hat{y}_{pk}) term influences the weight change. If the predicted response matches the desired response, then no weight changes occur. (3) If k = 1, one outcome, then Equation (12) simplifies to

\Delta W_{j} = -\eta(-1)(y_{p} - \hat{y}_{p})\hat{y}_{p}(1 - \hat{y}_{p})v_{pj}.
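A quick numerical check of point (1) (an illustration added here, not from the paper): the derivative factor \hat{y}(1 - \hat{y}) from Equation (11) peaks at \hat{y} = .5 and vanishes as the output saturates at 0 or 1.

```python
import numpy as np

y_hat = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
print(y_hat * (1 - y_hat))   # roughly [0.0099 0.1875 0.25 0.1875 0.0099]
```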
To update the weights, w_{ji}, connecting the inputs to the hidden units, we follow similar logic as for Equation (12). Thus

\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}.

Expanding using the chain rule,

\frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{O} \frac{\partial E}{\partial \hat{y}_{pk}} \frac{\partial \hat{y}_{pk}}{\partial f_{pk}} \frac{\partial f_{pk}}{\partial v_{pj}} \frac{\partial v_{pj}}{\partial h_{pj}} \frac{\partial h_{pj}}{\partial w_{ji}},   (13)

where \partial E / \partial \hat{y}_{pk} and \partial \hat{y}_{pk} / \partial f_{pk} are given in Equation (11). Also

\frac{\partial f_{pk}}{\partial v_{pj}} = W_{kj},

\frac{\partial v_{pj}}{\partial h_{pj}} = g'(h_{pj}) = v_{pj}(1 - v_{pj}),

and

\frac{\partial h_{pj}}{\partial w_{ji}} = x_{pi}.

Substituting back into Equation (13) reduces to

\Delta w_{ji} = \eta \sum_{k=1}^{O} (y_{pk} - \hat{y}_{pk})\hat{y}_{pk}(1 - \hat{y}_{pk})W_{kj} v_{pj}(1 - v_{pj}) x_{pi}.   (14)

Note that there is a summation over the number of output units. This is because each hidden unit is connected to all the output units. So if the weight connecting an input unit to a hidden unit changes, it will affect all the outputs. Again, notice that if the number of output units equals one, then

\Delta w_{ji} = \eta (y_{p} - \hat{y}_{p})\hat{y}_{p}(1 - \hat{y}_{p})W_{j} v_{pj}(1 - v_{pj}) x_{pi}.

3.3 Backpropagation Algorithm

Given the above equations we proceed to put down the processing steps needed to compute the change in network weights using backpropagation learning (this algorithm is adapted from Hertz et al. 1991):

1. Initialize the weights to small random values. This puts the output of each unit around .5.
2. Choose a pattern p and propagate it forward. This yields values for v_{pj} and \hat{y}_{pk}, the outputs from the hidden layer and output layer.
3. Compute the output errors: \delta_{pk} = (y_{pk} - \hat{y}_{pk}) g'(f_{pk}).
4. Compute the hidden layer errors: \delta_{pj} = \sum_{k=1}^{O} \delta_{pk} W_{kj} v_{pj}(1 - v_{pj}).
5. Compute

\Delta W_{kj} = \eta \delta_{pk} v_{pj}

and

\Delta w_{ji} = \eta \delta_{pj} x_{pi}

to update the weights.
6. Repeat the steps for each pattern.

It is easy to see how this could be implemented in a computer program. (Note that there are many commercial and shareware products that implement the multilayered feedforward neural network. Check the frequently asked questions, FAQ, posted monthly to the comp.ai.neural-nets newsgroup for a listing of these products. The web site http://wwwipd.ira.uka.de/~prechelt/FAQ/neural-net-faq.html also maintains this list.)
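Pulling steps 1-6 together, the following self-contained Python sketch trains a one-hidden-layer network with backpropagation on a tiny synthetic problem. It is an illustration of the algorithm as derived above, not code from the paper: the learning rate, network size, random seed, and data are arbitrary choices, and the thresholds are handled with the extra constant unit described in Section 3.2. As noted later in Section 4, whether training converges to a good fit depends on the starting weights and learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, Y, n_hidden=4, eta=0.5, n_epochs=5000, seed=0):
    """Steps 1-6 of Section 3.3 for a two-layer (one hidden layer) network.
    X: (n_patterns, N) inputs; Y: (n_patterns, O) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    O = Y.shape[1]
    # Step 1: small random starting weights (extra column = threshold unit).
    w = rng.normal(scale=0.5, size=(n_hidden, N + 1))
    W = rng.normal(scale=0.5, size=(O, n_hidden + 1))
    for _ in range(n_epochs):
        for p in range(n):                      # Step 6: loop over the patterns
            x = np.append(X[p], 1.0)
            # Step 2: forward pass, Eqs. (6)-(9).
            v = np.append(sigmoid(w @ x), 1.0)
            y_hat = sigmoid(W @ v)
            # Step 3: output errors, delta_pk = (y - y_hat) g'(f).
            delta_out = (Y[p] - y_hat) * y_hat * (1.0 - y_hat)
            # Step 4: hidden errors (drop the constant unit's column of W).
            delta_hid = (W[:, :-1].T @ delta_out) * v[:-1] * (1.0 - v[:-1])
            # Step 5: gradient-descent weight updates, Eqs. (12) and (14).
            W += eta * np.outer(delta_out, v)
            w += eta * np.outer(delta_hid, x)
    return w, W

# Tiny illustration: the exclusive-or problem, which is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
w, W = train_backprop(X, Y)
for x in X:
    v = np.append(sigmoid(w @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W @ v))
```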
4. REGRESSION AND NEURAL NETWORKS

In this section we compare and contrast neural network models with regression models. Most scientists have some experience using regression models, and by explaining neural networks in relation to regression analysis, some deeper understanding can be achieved.

4.1 Discussion

Regression is used to model a relationship between variables. The covariates (independent variables, stimuli) are denoted as x_i. These variables are either under the experimenter's control or are observed by the experimenter. The
response (outcome, dependent) variable is denoted as y. The objective of regression is to predict or classify the response, y, from the covariates, x_i. Sometimes the investigator also uses regression to test hypotheses about the functional relationship between the response and the stimulus.

The general form of the regression model is [this is adopted from McCullagh and Nelder (1989)]

\eta = \sum_{i=0}^{N} \beta_i x_i

with

\eta = h(\mu), \qquad E(y) = \mu.

Here h(\cdot) is the link function, \beta_i are the coefficients, N is the number of covariate variables, and \beta_0 is the intercept. This model has three components:

1. a random component of the response variable y, with mean \mu and variance \sigma^2;
2. a systematic component that relates the stimuli x_i to a linear predictor \eta = \sum_{i=0}^{N} \beta_i x_i;
3. a link function that relates the mean to the linear predictor, \eta = h(\mu).

The generalized linear model reduces to the familiar multiple linear regression if we believe that the random component has a normal distribution with mean zero and variance \sigma^2 and we specify the link function h(\cdot) as the identity function. The model is then

y_p = \beta_0 + \sum_{i=1}^{N} \beta_i x_{pi} + \epsilon_p,

where \epsilon_p \sim N(0, \sigma^2). The objective of this regression problem is to find the coefficients \beta_i that minimize the sum of squared errors,

E = \sum_{p=1}^{n} \Big( y_p - \sum_{i=0}^{N} \beta_i x_{pi} \Big)^2.

To find the coefficients we must have a dataset that includes the independent variables and associated known values of the dependent variable (akin to a training set in supervised learning in neural networks).

This problem is equivalent to a single-layer feedforward neural network (Fig. 4). The independent variables correspond to the inputs of the neural network and the response variable y to the output. The coefficients, \beta_i, correspond to the weights in the neural network. The activation function is the identity function. To find the weights in the neural network we would use backpropagation and a cost function similar to Equation (5). A difference in the two approaches is that multiple linear regression has a closed form solution for the coefficients, while the neural network uses an iterative process.

Figure 4. Regression Model Configured as a Neural Network. The figure is a single-layer feedforward neural network with the identity activation function. This is equivalent to a multiple linear regression model.
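The equivalence can be checked numerically. The short Python sketch below (illustrative, not from the paper; the simulated data and settings are arbitrary) fits the same linear model two ways: by the closed-form least squares solution, and by iteratively training a single-layer network with an identity activation, as in Figure 4. Both routes arrive at essentially the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])  # column of ones = intercept
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form multiple linear regression (least squares solution).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Single-layer network with identity activation, trained iteratively by
# gradient descent on the sum of squared errors.
w = np.zeros(3)
eta = 0.002
for _ in range(5000):
    residual = y - X @ w
    w += eta * X.T @ residual

print(beta_ols)
print(w)   # converges to essentially the same coefficients
```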
In general, any generalized linear model can be mapped onto an equivalent single-layer neural network. The activation function is selected to match the inverse of the link function (h = g^{-1}), and the cost function is selected to match the deviance, in the language of McCullagh and Nelder (1989). Deviance is based on maximum likelihood theory, and is determined by the distribution of the random component. The generalized linear model attempts to find coefficients to minimize the deviance. Obviously, the theory for these problems already exists, and neural networks as presented in this section only produce similar results, adding nothing to the theory. These examples only give some insight and understanding of neural networks.

In regression models a functional form is imposed on the data. In the case of multiple linear regression this assumption is that the outcome is related to a linear combination of the independent variables. If this assumed model is not correct, it will lead to error in the prediction. An alternate approach is to not assume any functional relationship and let the data define the functional form; in a sense we let the data speak for itself. This is the basis of the power of the neural networks. As mentioned in Section 3, a two-layer feedforward network with sigmoid activation functions is a universal approximator because it can approximate any continuous function to any degree of accuracy. Thus a neural network is extremely useful when you do not have any idea of the functional relationship between the dependent and independent variables. If you had an idea of the functional relationship, you are better off using a regression model.

An advantage of assuming a functional form is that it allows different hypothesis tests. Regression, for example, allows you to test the functional relationships by testing the individual coefficients for statistical significance. Also, because regression models tend to be nested, two different models can be tested to determine which one models the data better. A neural network never reveals the functional relations; they are buried in the summing of the sigmoidal functions.

Other difficulties with neural networks involve choosing the parameters, such as the number of hidden units, the learning parameter \eta, the initial starting weights, the cost function, and deciding when to stop training. The process of determining appropriate values for these variables is often
an experimental process where different values are used and evaluated. The problem with this is that it can be very time consuming, especially when you consider the fact that neural networks typically have slow convergence rates.

4.2 Examples

To demonstrate the use of neural networks, two simulated examples and one real example are presented. Both of the simulated examples involve one independent variable and a continuous valued output variable. The first example is a simple linear problem. The true relationship is

y = 30 + 10x.

Fifty samples were obtained by generating 50 random error terms from a normal distribution with mean zero and standard deviation of 50 and adding these to the y values of the data set. The values of x were randomly selected from the range of 20-100. Linear regression was applied to the problem of the form

E(y|x) = \alpha + \beta x

and yielded coefficients of \hat{\alpha} = -10.69 and \hat{\beta} = 10.52. (Note the error in the intercept term; remember that we were interested in modeling the data between x values of 20 and 100, so an accurate intercept was not important.) The dotted line in Figure 5 shows the regression line.

The problem was also solved with a two-layer, feedforward neural network with four hidden units and sigmoid activation functions. The network was trained using backpropagation. This network configuration is too general for this problem; however, we wanted to set up the problem with the notion that the underlying functional relationship is unknown. This gives us an idea how the network will perform when we think the problem is complex when in fact it is simple. The solid line in Figure 5 shows the results from the neural network; the dashed line is the true functional relationship. Both the regression curve and the neural network curve are close to the true curve. A separate validation set of 100 values was generated and applied to the regression model and the neural network model. The sum of squared errors on this validation set was .290 and .303 for the regression model and neural network model, respectively. So the predictive performance of both models was essentially equal. The linear regression model was much faster and easier to develop and is easy to interpret. The neural network did not assume any functional form for the relationship between dependent and independent variable, and was still able to derive an accurate curve.

Figure 5. Comparison Between a Linear Regression Model and a Neural Network Model where the Underlying Functional Relationship was Linear. The filled dots represent the actual data, the solid line is the predicted function from the neural network, the dotted line is the result from the regression model, and the dashed line is the true function. This example demonstrates that the neural network approximated the linear function without any assumption about the underlying functional form.
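The first simulated example can be reproduced in outline with a few lines of Python (an illustrative sketch, not the authors' code: the random seed, and hence the exact coefficient estimates, will differ from those reported above).

```python
import numpy as np

rng = np.random.default_rng(42)          # seed is arbitrary; estimates will vary
x = rng.uniform(20, 100, size=50)
y = 30 + 10 * x + rng.normal(scale=50, size=50)   # true line plus N(0, 50^2) noise

# Linear regression E(y|x) = alpha + beta * x via least squares.
X = np.column_stack([np.ones_like(x), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
print(alpha_hat, beta_hat)   # slope should be near 10; intercept is poorly determined
```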
binary variable indicating operative death status 30 days
y = 30 + lOx. after a coronary artery bypass grafting surgery. If a patient
is still alive 30 days after surgery, he or she is coded as a 0;
Fifty samples were obtained by generating 50 random er-
otherwise as a 1. The objective is to obtain predictions of a
ror terms from a normal distribution with mean zero and
patient's probability of death given their individual risk fac-
standard deviation of 50 and adding these to the y values of
tors. The 12 independent variables in the study are patient
the data set. The values of x were randomly selected from
risk factors, and include variables such as age, priority of
the range of 20-100. Linear regression was applied to the
surgery, and history of prior heart surgery. Both neural net-
problem of the form

E(ylx) = ac+Ox

and yielded coefficients of &z = -10.69 and 3 - 10.52.


(Note the error in the intercept term; remember that we
were interested in modeling the data between x values of
20 and 100, so an accurate intercept was not important.)
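A data-generating sketch for this second example in Python (illustrative only; the seed is arbitrary, so the simulated points will differ in detail from those plotted in Figure 6):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=50)
y_true = 20 * np.exp(-0.85 * x) * (np.log(0.9 * x + 0.2) + 1.5)
y = y_true + rng.normal(scale=0.05, size=50)   # add N(0, .05^2) noise

# To fit these data with the Section 3.3 sketch (using eight hidden units),
# y would first be rescaled to (0, 1) because that sketch uses a sigmoid output unit.
print(np.column_stack([x, y])[:5])
```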
The third example is from the Department of Veterans Affairs Continuous Improvement in Cardiac Surgery Study (Hammermeister et al. 1994). The outcome variable is a binary variable indicating operative death status 30 days after a coronary artery bypass grafting surgery. If a patient is still alive 30 days after surgery, he or she is coded as a 0; otherwise as a 1. The objective is to obtain predictions of a patient's probability of death given their individual risk factors. The 12 independent variables in the study are patient risk factors, and include variables such as age, priority of surgery, and history of prior heart surgery. Both neural
network and logistic regression models were built on 21,435 observations (2/3 learning sample) and validated on the remaining 10,657 observations (1/3 testing sample). The neural network was a feedforward network with one hidden layer comprised of four hidden units. It was trained using the backpropagation algorithm.

The discrimination and calibration of these two models were compared on the validation set. The c index was used to measure discrimination [how well the predicted binary outcomes can be separated (Hanley and McNeil 1982)]. No statistically significant difference was found between the two c indices (the neural network c index = .7168 and the logistic regression c index = .7162 with p < .05). The Hosmer-Lemeshow test (Hosmer and Lemeshow 1989) was applied to the validation data to test calibration of the models (calibration measures how close the predicted values are to the observed values). The p value for the logistic regression model was .34, indicating a good fit to the data, while the p value for the neural network was .08, indicating a lack of fit. In summary, the logistic regression model had comparable predictive power and better calibration in comparison to the neural network.
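The c index used here is the probability that a randomly chosen patient who died is assigned a higher predicted risk than a randomly chosen survivor. A small Python sketch of this calculation (illustrative only, with made-up predictions rather than the study data):

```python
import numpy as np

def c_index(y, risk):
    """Concordance (c) index: fraction of death/survivor pairs in which the
    death received the higher predicted risk; ties count one half."""
    pos = risk[y == 1]
    neg = risk[y == 0]
    diffs = pos[:, None] - neg[None, :]
    return (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

# Made-up outcomes (1 = operative death) and predicted probabilities.
y = np.array([0, 0, 1, 0, 1, 0])
risk = np.array([0.02, 0.10, 0.40, 0.05, 0.08, 0.30])
print(c_index(y, risk))   # 0.75: 6 of the 8 death/survivor pairs are concordant
```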
The reason this occurs is that the majority of the independent variables are binary. This means that their contribution to the model must be on a linear scale. Trying to model them in a nonlinear manner will not contribute to the predictive performance of the model. In addition, a simple check of all two-variable interactions revealed nothing of significance. Thus with no interactions or nonlinearities the linear additive structure of logistic regression is appropriate for this data. As indicated in Section 1, the literature is full of examples illustrating the improved performance of neural networks over traditional techniques. But as the last example illustrates, this is not always true, and the practitioner must be aware of the appropriate model for his or her problem.

Neural networks can be valuable when we do not know the functional relationship between independent and dependent variables. They use the data to determine the functional relationship between the dependent and independent variables. Since they are data dependent, their performance improves with sample size. Regression performs better when theory or experience indicates an underlying relationship. Regression may also be a better alternative for extremely small sample sizes.

5. CONCLUSIONS

Neural networks originally developed out of an interest in modeling the human brain. They have, however, found applications in many different fields of study. This paper has focused on the use of neural networks for prediction and classification problems. We specifically restricted our discussion to multilayered feedforward neural networks.

Parallels with statistical terminology used in regression models were developed to aid the reader in understanding neural networks. Neural network notation is different from that of statistical regression analysis, but most of the underlying ideas are the same. For example, instead of coefficients, the neural network community uses the term weights, and instead of observations they use patterns.

Backpropagation is an algorithm that can be used to determine the weights of a network designed to solve a given problem. It is an iterative procedure that uses a gradient descent method. The cost function used is normally a squared error criterion, but functions based on maximum likelihood are also used. This paper gives a detailed derivation of the backpropagation algorithm based on existing bodies of work and gives an outline of how to implement it on a computer.

Two simple synthetic problems were presented to demonstrate the advantages and disadvantages of multilayered feedforward neural networks. These networks do not impose a functional relationship between the independent and dependent variables. Instead, the functional relationship is determined by the data in the process of finding values for the weights. The advantage of this process is that the network is able to approximate any continuous function, and we do not have to guess the functional form. The disadvantage is that it is difficult to interpret the network. In linear regression models we can interpret the coefficients in relation to the problem. Another disadvantage of the neural network is that convergence to a solution can be slow and depends on the network's initial conditions. A third example on real data revealed that traditional statistical tools still have a role in analysis, and that the use of any tool must be thought about carefully.

Neural networks can be viewed as a nonparametric regression method. A large number of claims have been made about the modeling capabilities of neural networks, some exaggerated and some justified. As statisticians it is important to understand the capabilities and potential of neural networks. This paper is intended to build a bridge of understanding for the practitioner and interested reader.

[Received August 1994. Revised December 1995.]

REFERENCES

Abu-Mostafa, Y. S. (1986), "Neural Networks for Computing," in Proceedings of the American Institute of Physics Meeting, pp. 1-6.
Anderson, J. A. (1970), "Two Models for Memory Organization," Mathematical Biosciences, 8, 137-160.
Baxt, W. G. (1990), "Use of an Artificial Neural Network for Data Analysis in Clinical Decision-Making: The Diagnosis of Acute Coronary Occlusion," Neural Computation, 2, 480-489.
Baxt, W. G. (1991), "Use of an Artificial Neural Network for the Diagnosis of Myocardial Infarction," Annals of Internal Medicine, 115, 843-848.
Buntine, W. L., and Weigend, A. S. (1991), "Bayesian Back-Propagation," Complex Systems, 5, 603-643.
Cowan, J. D., and Sharp, D. H. (1988), "Neural Nets," Quarterly Reviews of Biophysics, 21, 365-427.
Cybenko, G. (1989), "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, 2, 303-314.
Duda, R. O., and Hart, P. E. (1973), Pattern Classification and Scene Analysis, New York: John Wiley.
Elman, J. L. (1990), "Finding Structure in Time," Cognitive Science, 14, 179-211.
Flury, B., and Riedwyl, H. (1990), Multivariate Statistics: A Practical Approach, London: Chapman & Hall.
Freeman, J. A. (1994), Simulating Neural Networks with Mathematica, Reading, MA: Addison-Wesley.
Fujita, H., Katafuchi, T., Uehara, T., and Nishimura, T. (1992), "Application of Artificial Neural Network to Computer-Aided Diagnosis of Coronary Artery Disease in Myocardial SPECT Bull's-Eye Images," Journal of Nuclear Medicine, 33(2), 272-276.
Gorman, R. P., and Sejnowski, T. J. (1988), "Analysis of Hidden Units in a Layered Network to Classify Sonar Targets," Neural Networks, 1, 75-89.
Grossberg, S. (1976), "Adaptive Pattern Classification and Universal Recoding. I: Parallel Development and Coding of Neural Feature Detectors," Biological Cybernetics, 23, 121-134.
Grossberg, S., and Carpenter, G. A. (1983), "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine," Computer Vision, Graphics, and Image Processing, 37, 54-115.
Hammermeister, K. E., Johnson, R., Marshall, G., and Grover, F. L. (1994), "Continuous Assessment and Improvement in Quality of Care: A Model from the Department of Veterans Affairs Cardiac Surgery," Annals of Surgery, 219, 281-290.
Hanley, J. A., and McNeil, B. J. (1982), "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve," Radiology, 143, 29-36.
Harman, H. H. (1976), Modern Factor Analysis (3rd ed.), Chicago: University of Chicago Press.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London and New York: Chapman & Hall.
Hecht-Nielsen, R. (1987), "Counterpropagation Networks," Applied Optics, 26, 4979-4984.
Hertz, J., Krogh, A., and Palmer, R. G. (1991), Introduction to the Theory of Neural Computation, Santa Fe Institute Studies in the Sciences of Complexity (Vol. 1), Redwood City, CA: Addison-Wesley.
Hinton, G. E., and Sejnowski, T. J. (1986), "Learning and Relearning in Boltzmann Machines," in Parallel Distributed Processing (Vol. 1), chap. 7.
Hopfield, J. J. (1982), "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proceedings of the National Academy of Sciences USA, 81, 2554-2558.
Hopfield, J. J., and Tank, D. W. (1985), "'Neural' Computation of Decisions in Optimization Problems," Biological Cybernetics, 52, 141-152.
Hosmer, D. W., Jr., and Lemeshow, S. (1989), Applied Logistic Regression, New York: John Wiley.
Hruschka, H. (1993), "Determining Market Response Functions by Neural Network Modeling: A Comparison to Econometric Techniques," European Journal of Operational Research, 66, 27-35.
Hutchinson, J. M. (1994), "A Radial Basis Function Approach to Financial Time Series Analysis," Ph.D. dissertation, Massachusetts Institute of Technology.
Jordan, M. I. (1989), "Serial Order: A Parallel, Distributed Processing Approach," in Advances in Connectionist Theory: Speech, eds. J. Elman and D. Rumelhart, Hillsdale, NJ: Erlbaum.
Kohonen, T. (1982), "Self-Organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, 43, 59-69.
Kohonen, T. (1989), Self-Organization and Associative Memory (3rd ed.), Berlin: Springer-Verlag.
Lippmann, R. P. (1987), "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, 4-22.
Lippmann, R. P. (1989), "Review of Neural Networks for Speech Recognition," Neural Computation, 1, 1-38.
McClelland, J. L., Rumelhart, D. E., and the PDP Research Group (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models, Cambridge, MA: MIT Press.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models, London: Chapman & Hall.
McCulloch, W. S., and Pitts, W. (1943), "A Logical Calculus of Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics, 5, 115-133.
Minsky, M. L., and Papert, S. A. (1969), Perceptrons, Cambridge, MA: MIT Press.
Moore, B. (1988), "ART1 and Pattern Clustering," in Proceedings of the 1988 Connectionist Models Summer School, eds. D. Touretzky, G. Hinton, and T. Sejnowski, San Mateo, CA: Morgan Kaufmann.
Morrison, D. F. (1976), Multivariate Statistical Methods (2nd ed.), New York: McGraw-Hill.
Neter, J., Wasserman, W., and Kutner, M. H. (1990), Applied Linear Statistical Models, Homewood, IL: Richard D. Irwin.
Poli, R., Cagnoni, S., Livi, R., Coppini, G., and Valli, G. (1991), "A Neural Network Expert System for Diagnosing and Treating Hypertension," Computer, 64-71.
Qian, N., and Sejnowski, T. J. (1988), "Predicting the Secondary Structure of Globular Proteins Using Neural Network Models," Journal of Molecular Biology, 202, 865-884.
Ripley, B. D. (1994), "Neural Networks and Related Methods for Classification," Journal of the Royal Statistical Society, Ser. B, 56(3), 409-456.
Ripley, B. D. (1993), "Statistical Aspects of Neural Networks," in Networks and Chaos: Statistical and Probabilistic Aspects, eds. O. Barndorff-Nielsen, J. Jensen, and W. Kendall, London: Chapman & Hall, pp. 40-123.
Rosenblatt, F. (1962), Principles of Neurodynamics, Washington, DC: Spartan.
Sarle, W. S. (1994), "Neural Networks and Statistical Methods," in Proceedings of the 19th Annual SAS Users Group International Conference.
Smith, M. (1993), Neural Networks for Statistical Modeling, New York: Van Nostrand Reinhold.
Solla, S. A., Levin, E., and Fleisher, M. (1988), "Accelerated Learning in Layered Neural Networks," Complex Systems, 2, 625-639.
Studenmund, A. H. (1992), Using Econometrics: A Practical Guide, New York: Harper Collins.
Tesauro, G. (1990), "Neurogammon Wins Computer Olympiad," Neural Computation, 1, 321-323.
Thompson, R. F. (1985), The Brain: A Neuroscience Primer, New York: Freeman.
von der Malsburg, C. (1973), "Self-Organizing of Orientation Sensitive Cells in the Striate Cortex," Kybernetik, 14, 85-100.
Werbos, P. J. (1991), "Links Between Artificial Neural Networks (ANN) and Statistical Pattern Recognition," in Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections, eds. I. Sethi and A. Jain, Elsevier Science, pp. 11-31.
Werbos, P. J. (1974), "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Ph.D. dissertation, Harvard University.
Widrow, B. (1962), "Generalization and Information Storage in Networks of Adaline Neurons," in Self-Organizing Systems, eds. M. Yovitz, G. Jacobi, and G. Goldstein, Washington, DC: Spartan, pp. 435-461.
Williams, R. J., and Zipser, D. (1989), "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation, 1, 270-280.
Willshaw, D. J., and von der Malsburg, C. (1976), "How Patterned Neural Connections Can Be Set Up by Self-Organization," Proceedings of the Royal Society of London B, 194, 431-445.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. (1969), "Non-Holographic Associative Memory," Nature, 222, 960-962.
Wu, F. Y., and Yen, K. K. (1992), "Application of Neural Network in Regression Analysis," in Proceedings of the 14th Annual Conference on Computers and Industrial Engineering.