This action might not be possible to undo. Are you sure you want to continue?
4.1 Introduction
4
L
l
The Perceptron
The perceptron is the simplest form of a neural network used for the classification of a special type of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with adjustable synaptic weights and threshold, as shown in Fig. 4.1. The algorithm used to adjust the free parameters of this neural network first appeared in a learning procedure developed by Rosenblatt (1958, 1962) for his perceptron brain model.’ Indeed, Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly separable classes, then the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes. The proof of convergence of the algorithm is known as the perceptron convergence theorem. The singlelayer perceptron depicted in Fig. 4.1 has a single neuron. Such a perceptron is limited to performing pattern classification with only two classes. By expanding the output (computation) layer of the perceptron to include more than one neuron, we may correspondingly form classification with more than two classes. However, the classes would have to be linearly separable for the perceptron to work properly. Accordingly, insofar as the basic theory of a singlelayer perceptron as a pattern classifier is concerned, we need consider only the singleneuron configuration shown in Fig. 4.1. The extension of the theory so developed to the case of more than one neuron is a trivial matter.
Organization of the Chapter
This chapter on the singlelayer perceptron is relatively short, but it is important for historical and theoretical reasons. In Section 4.2 we present some basic considerations involved in the operation of the perceptron. Then, in Section 4.3 we describe Rosenblatt’s original algorithm for adjusting the synaptic weight vector of the perceptron for pattern classification of linearly separable classes, and demonstrate convergence of the algorithm. This is followed by the description of a performance measure for the singlelayerperceptron in Section 4.4. In Section 4.5 we consider the relationship between the singlelayer perceptron and the maximumlikelihood Gaussian classifier. The chapter concludes with general discussion in Section 4.6.
The network organization of the original version of the perceptron, as envisioned by Rosenblatt (1962), has three types of units: sensory units, association units, and response units. The connections from the sensory units to the association units have fixed weights, and the connections from the association units to the response units have variable weights. The association units act as preprocessors designed to extract a pattern from the environmental input. Insofar as the variable weights are concerned, the operation of Rosenblatt’s original perceptron is essentially the same as that depicted in Fig. 4.1 for the case of a single response unit.
106
4.2 / Basic Considerations 107
Inputs
FIGURE 4.1
Singlelayer perceptron.
4.2 Basic Considerations
Basic to the operation of Rosenblatt's perceptron is the McCullochPitts model of a neuron. From Chapter 1, we recall that such a neuron model consists of a linear combiner followed by a hard limiter, as depicted in Fig. 4.2. The summing node of the neuron model computes a linear combination of the inputs applied to its synapses, and also accounts for an externally applied threshold. The resulting sum is applied to a hard limiter. Accordingly, the neuron produces an output equal to +1 if the hard limiter input is positive, and 1 if it is negative. In the signalflow graph model of Fig. 4.2, the synaptic weights of the singlelayer perceptron are denoted by wlr w2, . . , w,. (We have simplified the notation here by not . including an additional subscript to identify the neuron, since we only have to deal with a single neuron.) Correspondingly, the inputs applied to the perceptron are denoted by xlrxz,. . . , xp.The externally applied threshold is denoted by 8. From the model of Fig. 4.2 we thus find that the linear combiner output (i.e., hard limiter input) is
u = C wiXi e
i=l
(4.1)
The purpose of the perceptron is to classify the set of externally applied stimuli xl,xz, . . . , xp into one of two classes, Cel or %z, say. The decision rule for the classification is to assign the point represented by the inputs xl,xz, . . . , xp to class (e if the perceptron output y is + 1 and to class CeZ if it is  1. To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions in the pdimensional signal space spanned by the p input variables xl, xz,. . . , xp. In the case of an elementary perceptron, there are two decision regions separated by a hyperplane defined by
i wixi  o = o :
i=l
(4.2)
Inputs
FIGURE 4.2 Signalflow graph of the perceptron.
1.
X1
x2 4
\
w.
/'
Threkld
P
108 4 / The Perceptron
This is illustrated in Fig. 4.3 for the case of two input variables x, and xz, for which the decision boundary takes the form of a straight line. A point (xl,xz)that lies above the boundary line is assigned to class %1, and a point (xI,xz)that lies below the boundary line is assigned to class (e2. Note also that the effect of the threshold 8 is merely to shift the decision boundary away from the origin. . The synaptic weights wlr wz,. . , w pof the perceptron can be fixed or adapted on an iterationbyiterationbasis. For the adaptation, we may use an errorcorrectionrule known as the perceptron convergence algorithm, developed in the next section.
4.3 The Perceptron Convergence Theorem
For the development of the errorcorrection learning algorithm for a singlelayer perceptron, we find it more convenient to work with the modified signalflow graph model of Fig. 4.4. In this second model, which is equivalent to that of Fig. 4.2, the threshold e(n) is treated as a synaptic weight connected to a fixed input equal to  1. We may thus define the ( p + 1)by1 input vector
Correspondingly, we define the ( p w(n)
=
+ 1)by1 weight vector
w ) wh),. . . , wp(n)lr &,
(4.4)
Accordingly, the linear combiner output is written in the compact form
u(n) = wT(n)x(n)
(4.5)
For fixed n, the equation wTx = 0, plotted in a pdimensional space with coordinates xl, x2, . . . , x,, defines a hyperplane as the decision surface between two different classes of inputs. Suppose then the input variables of the singlelayer perceptron originate from two linearly separable classes that fall on the opposite sides of some hyperplane. Let XI be the subset of training vectors xl(l), x1(2), . . . that belong to class Gel, and let X 2 be the
x2
I
X1
Decision boundary wlxl + w2x2  8 = 0
FIGURE 4.3 Illustrating linear reparability for a twodimensional, twoclass patternclassification problem.
4.3 I The Perceptron Convergence Theorem 109
Fixed x,,=Io, input
Inputs
W
Y
Y
Hard limiter
. .xo
Linear combiner
FIGURE 4.4
Equivalent signalflow graph of the perceptron.
subset of training vectors x2(1), x2(2), . . . that belong to class S2. The union of XI and X , is the complete training set X. Given the sets of vectors XI and Xz train the classifier, to the training process involves the adjustment of the weight vector w in such a way that the two classes and % are separable. These two classes are said to be linearly separable , if a realizable setting of the weight vector w exists. Conversely, if the two classes % I and %, are known to be linearly separable, then there exists a weight vector w such that we may state
wTx 2 0 for every input vector x belonging to class
and
(4.6)
wrx < 0 for every input vector x belonging to class %,
Given the subsets of training vectors XI and X 2 , the training problem for the elementary perceptron is then to find a weight vector w such that the two inequalities of Eq. (4.6) are satisfied. The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1. If the nth member of the training vector, x(n), is correctly classified by the weight vector w(n)computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, as shown by
w(n
and
+ 1) = w(n)
+ 1) = w(n)
if w'(n>x(n) 2 0 and x(n) belongs to class
(4.7)
w(n
if w ( ) ( ) < 0 and x(n)belongs to class %, 'nxn
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule
w(n + 1) = w(n)  ~(n)x(n)
and
2 if wT(n)x(n) 0 and x(n) belongs to class sz
(4.8)
w(n + 1) = w(n) + 7(n)x(n)
if w'(n)x(n)
< 0 and x(n) belongs to class
where the learningrate parameter ~ ( ncontrols the adjustment applied to the weight ) vector at iteration n.
If q(n) = 7 > 0, where 7 is a constant independent of the iteration number n, we have a f i e d increment adaptation rule for the perceptron. In the sequel, we first prove the convergence of a fixed increment adaptation rule for which 7 = 1. Clearly, the value of 77 is unimportant, so long as it is positive. A value
110 4 / The Perceptron
of q f: 1 merely scales the pattern vectors without affecting their separability. The case of a variable q(n) is considered later in the section. The proof is presented for the initial condition w(0) = 0. Suppose that wT(n)x(n)< 0 for n = 1, 2, . . . and an input vector x(n) that belongs to the subset X , . That is, the perceptron incorrectly classifies the vectors x(l), x(2), . . . , since the second condition of ) Eq. (4.6) is violated. Then, with the constant ~ ( n = 1, we may use the second line of Eq. (4.8) to write
w(n
+ 1) = w(n) + x(n)
w(n + 1)
=
for x(n) belonging to class
(4.9)
Given the initial condition w(0) = 0, we may iteratively solve this equation for w(n obtaining the result
+ l),
(4.10)
x(1)
+ x(2) +   + x(n)
*
Since the classes (el and %2 are assumed to be linearly separable, there exists a solution wo for which wTx(n) > 0 for the vectors x( I), . . . , x(n) belonging to the subset X,. For a fixed solution wo, we may then define a positive number a by the relation
(4.11)
where x(n) E Xl stands for “x(n) belongs to subset XI.’’ Hence, multiplying both sides of Eq. (4.10) by the row vector wi, we get
wiw(n + 1) = WtX(1)
+ WtX(2) +
*
. . + wtx(n>
(4.12)
Accordingly, in light of the definition given in Eq. (4.1l), we have
wiw(n
+ 1) 2 na
(4.13)
Next, we make use of an inequality known as the CauchySchwarz inequality. Given two vectors wo and w(n + l), the CauchySchwarz inequality states that
llwol1211w(n+ 1)Ip 2 [wiw(n + 1)12
(4.14)
where /(*I1denotes the Euclidean norm of the enclosed argument vector, and the inner product w;w(n 1) is a scalar quantity. We now note from Fiq. (4.13) that [wiw(n 1)12 is equal to or greater than n2a2.Moreover, from Eq. (4.14) we note that Ilwol1211w(n 1)112 is equal to or greater than [w;w(n 1)12. It follows therefore that
+
+
+
+
Ilwo(1211w(n+ 1)y
or, equivalently,
2 n2a2
(4.15)
Next, we follow another development route. In particular, we rewrite Eq. (4.9) in the form
w(k
+ 1) = w(k) + x(k)
fork
=
1,
. . . , n and x(k) E XI
(4.16)
Hence, taking the squared Euclidean norm of both sides of Eq. (4.16), we get
Ilw<k + 1)112 = lw(k>112 + llX(k)1l2 + 2wT(k)x(k)
(4.17)
But, under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset X,, we have wT(k)x(k)< 0. We therefore deduce from Eq. (4.17) that
Ilw@ + 1)IP
5
llW(W + Ilx(k)l12
4.3 / The Perceptron Convergence Theorem 111
or, equivalently, Ilw<k +
 l l ~ ( k ) (5~llx(k)112, (
k
=
1, . . . ,
(4.18)
Adding these inequalities for k = 1, . . . , n, and assuming that the initial condition w(0) = 0, we get the following condition: lIw(n + 1)1125
c llx(k>ll"
n
k=l
5 np
(4.19) (4.20)
where P is a positive number defined by
P=
I1 w2 l1
Equation (4.19) states that the squared Euclidean norm of the weight vector w(n + 1) grows at most linearly with the number of iterations n. Clearly, the second result of Eq. (4.19) is in conflict with the earlier result of Eq. (4.15) for sufficiently large values of n. Indeed, we can state that n cannot be larger than some value n,, for which Eqs. (4.15) and (4.19) are both satisfied with the equality sign. That is, n,, is the solution of the equation
Solving for nmx for a solution vector wo, we find that (4.21) We have thus proved that for q(n) = I for all n, and w(0) = 0, and given that a solution vector wo exists, the rule for adapting the synaptic weights connecting the associator units to the response unit of the perceptron must terminate after at most n,, iterations. Note also from Eqs. (4.11), (4.20), and (4.21) that there is no unique solution for wo or nm,. We may now state the Jixedincrement convergence theorem for a singlelayer perceptron as follows (Rosenblatt, 1962):
Let the subsets o training vectors X , and X , be linearly separable. Let the inputs f presented to the singlelayer perceptron originate from these two subsets. The perceptron converges after some no iterations, in the sense that
w(no) = w(no + 1) = w(no + 2) = . . .
i a solution vector for no 5 n,, s
.
Consider next the absolute errorcorrection procedure for the adaptation of a singlelayer perceptron, for which q(n) is variable. In particular, let q(n) be the smallest integer for which rp(W(n)x(n> > lwWx(n)l With this procedure we find that if the inner product w'(n)x(n) at iteration n has an incorrect sign, then wr(n + l x n at iteration n + 1 would have the correct sign. This )() suggests that if w'(n)x(n) has an incorrect sign, we may modify the training sequence at 1) = x(n). In other words, each pattern is presented iteration n + 1 by setting x(n repeatedly to the perceptron until that pattern is classified correctly. Note also that the use of an initial value w(0) different from the null condition merely results in a decrease or increase in the number of iterations required to converge, depending
+
112 4 / The Perceptron
on how w(0) relates to the solution wo. Hence, regardless of the value assigned to w(O), the singlelayer perceptron of Fig. 4.1 is assured of convergence.
Summary
In Table 4.1 we present a summary of the perceptron convergence algorithm (Lippmann, 1987). The symbol sgn(), used in step 3 of the table for computing the actual response of the perceptron, stands for the signum finction:
sgn(u) = +1 1 ifu>0 ifu<O
(4.22)
TABLE 4.1 Summary of the Perceptron Convergence Algorithm
Variables and Parameters
x(n) = ( p + 1)by1 input vector
= [ 1, xdn), xz(n),
. ..
1
x,(n)l'
w(n) = ( p
+ 1)by1weight vector
. . . ,wp(n>lT
= [B(n), w1(n), w&),
B(n) = threshold y(n) = actual response (quantized)
d(n) = desired response
77 = learningrate parameter, a positive constant less than unity
Step 7: Initialization
Set w(0) = 0. Then perform the following computations for time n
=
1, 2, . . . .
Step 2: Activation
At time n, activate the perceptron by applying continuousvalued input vector x(n) and desired response d(n).
Step 3: Computation of Actual Response
Compute the actual response of the perceptron:
y(n) = sgn[w'(n)x(n)l
where sgn() is the signum function.
Step 4: Adaptation of Weight Vector
Update the weight vector of the perceptron: w(n where d(n) =
+
1) = w(n)
+ rl[d(n) y(n)lx(n)
{
+1
1
if x(n) belongs to class if x(n) belongs to class
Step 5
Increment time n by one unit, and go back to step 2.
4.4 I Performance Measure 113
In effect, the signum function is a shorthand notation for the asymmetric form of hard limiting. Accordingly, the quantized response y(n) of the perceptron is written in the compact form
An)
=
sgn(w'(n)x(n))
(4.23)
Note that the input vector x(n) is a ( p + 1)by1 vector whose first element is fixed at  1 throughout the computation in accordance with the definition of Eq. (4.3). Correspondingly, the weight vector w(n) is a ( p + 1)by1 vector whose first element equals the threshold q n ) , as defined in Eq. (4.4). In Table 4.1 we have also introduced a quantized desired response d(n), defined by d(n) =
+1
 1
if x(n) belongs to class if x(n) belongs to class %*
(4.24)
Thus, the adaptation of the weight vector w(n) is summed up nicely in the form of an errorcorrection learning rule, as shown by
w(n
+ 1) = w(n) + q[d(n)

y(n)]x(n)
(4.25)
where q is the Zeamingrate parameter, and the difference d(n)  y(n) plays the role of an error signal. The learningrate parameter is a positive constant limited to the range 0 < q 5 1. In assigning a value to it inside this range, we have to keep in mind two conflicting requirements (Lippmann, 1987): Averaging of past inputs to provide stable weight estimates, which requires a small q Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector x, which requires a large 77
4.4 Performance Measure
In the previous section we derived the perceptron convergence algorithm without any reference to a performance measure, yet in Chapter 2 we emphasized that improvement in the performance of a learning machine takes place over time in accordance with a performance measure. What then is a suitable performance measure for the singlelayer perceptron? An obvious choice would be the auerage probability o classijication error, f defined as the average probability of the perceptron making a decision in favor of a particular class when the input vector is drawn from the other class. Unfortunately, such a performance measure does not lend itself readily to the analytic derivation of a learning algorithm. Shynk (1990) has proposed the use of a performance function that befits the operation of a perceptron, as shown by the statistical expectation
J
=
E[e(n)u(n)]
(4.26)
where e(n) is the error signal defined as the difference between the desired response d(n) and the actual response y(n) of the perceptron, and v(n) is the linear combiner output (Le., hardlimiter input). Note that the error signal e(n) is not a quantized signal, but rather the difference between two quantized signals. The instantaneous estimate of the performance function is the timevarying function S(n) = e(n>v(n)
=
 y(n>lv(n)
(4.27)
114 4 I The Perceptron
The instantaneous gradient vector is defined as the derivative of the estimate &n) with respect to the weight vector w(n), as shown by (4.28) Hence, recognizing that the desired response d(n) is independent of w(n) and the fact that the actual response y(n) has a constant value equal to  1 or + 1, we find that the use of Eq. (4.27) in (4.28) yields2 (4.29)
For the defining equation (4.5), we readily find that3
(4.30)
Thus the use of Eq. (4.30) in (4.29) yields the result
vw.m
=
[d(n)  y(n)lx(n)
(4.31)
Hence, according to the errorcorrection rule described in Chapter 2 we may now express the change applied to the weight vector as
A w ( ~ )=  qV,.T(n)
= q[d(n) y(n>lx(n)
(4.32)
To be precise, we should also include a Dirac delta function (Le., impulse function) whenever the perceptron output y(n) goes through a transition at u(n) = 0. Averaged over a long sequence of iterations, however, such a transition has a negligible effect. Consider a scalar quantity u defined as the inner product of two realvalued pby1 vectors w and x, as shown by
v
= WTX =
2 wix,
i=l
P
where wi and xi are the ith elements of the vectors w and x, respectively. The partial derivative of the quantity v with respect to the vector w is defined by (Haykin, 1991)
From the given definition of u , we note that
I::[
av = aw
auiaw,
av    x,,
awi
We therefore have
i = 1 , 2 , . . . ,p
4.5 / MaximumLikelihoodGaussian Classifier 115
where r ) is the learningrate parameter. This is precisely the correction made to the weight vector as we progress from iteration n to n I 1, as described in Eq. (4.25). We have thus demonstrated that the instantaneous performance function j ( n ) defined in Eq. (4.27) is indeed the correct performance measure for a singlelayer perceptron (Shynk, 1990).
4.5 MaximumLikelihood Gaussian Classifier
A singlelayer perceptron bears a close relationship to a classical pattern classifier known as the maximumlikelihood Gaussian ClassiJier,in that they are both examples of linear classiJiers (Duda and H r ,1973). at The maximumlikelihood method is a classical parameterestimation procedure that views the parameters as quantities whose values are@ed but unknown. The best estimate is defined to be the one that maximizes the probability of obtaining the samples actually observed. To be specific, suppose that we separate a set of samples according to class, so that we have M subsets of samples denoted by XI, X,, . . . , X,. Let the samples in subset Xjbe drawn independently according to a probability law of known parametric form. Let this probability law be described by the conditional probability densityfunction f x w )where x is an observation vector and wjrepresents an unknown parameter vector (lj, associated with class T j ,j = 1, 2, . . . ,M . We further assume that the samples in X , give no information about wj if i # j , that is, the parameters for the different classes are functionally independent. Now, viewed as a function of w, f(x)w) called the likelihood is of w with respect to the observation vector x. The maximumlikelihood estimate of w is, by definition, that value ii. which maximizes the likelihood f(x)w). To proceed with the issue at hand, consider a pby1 observation vector x defined by
x=
[XI, x2,
.. . ,Xp]T
(4.33)
Note that the vector x as defined here is confined exclusively to the externally applied inputs. The observation vector x is described in terms of a mean vector p and covariance matrix C,which are defined by, respectively, p = E[xl and (4.34)
c=

p( )x

PYI
(4.35)
where E is the expectation operator. Assuming that the vector x is Gaussiandistributed, we may express the jointprobability density function of the elements of x as follows (Van Trees, 1968): (4.36) where det C is the determinant of the covariance matrix C,the matrix C'is the inverse of C,and p is the dimension of the vector x. To be more specific, suppose that we have a twoclass problem, with the vector x characterized as follows, depending on whether it belongs to class or class %,:
x EX,: x EX,:
meanvector = p1 covariance matrix = C mean vector = p2 covariance matrix = C
116 4 / The Perceptron
In other words, the vector x has a mean vector the value of which depends on whether it belongs to class %1 or class (e2,and a covariance matrix that is the same for both classes. It is further assumed that:
rn The classes % l and (e2 are equiprobable. rn The samples of class
or (e2 are correlated, so that the covariance matrix C is nondiagonal. rn The covariance matrix C is nonsingular, so that the inverse matrix C' exists. We may then use Eq. (4.36) to express the jointprobability density function of the input vector x as follows: (4.37) where the index i = 1, 2. Taking the natural logarithm of both sides of Eq. (4.37) and then expanding terms, we get P 1 1 lnf(xl%J =   ln(2n)   ln(det C)   xTC'x 2 2 2
1 + prC'x   pTC'pl 2
(4.38)
.
The first three terms on the righthand side of Eq. (4.38) are independent of the index i. Hence, for the purpose of pattern classification considered here, we may define a loglikelihood as follows:
1 &(X)= pTClx   pTC'pi 2
(4.39)
where i
=
1, 2. For class TI, the loglikelihood is
1 ll(X) = p p x   pfC'pl 2
For class %2, the loglikelihood is
1 &(x)= p p x  p;c1p,
(4.40)
5
(4.41)
Hence, subtracting Eq. (4.41) from (4.40), we get
1 = &(x)

Iz(x)
(4.42)
= (p1
1 p2)Tc1x  ( p p  l p '  p2TC'pz) 2
which is linearly related to the input vector x. To emphasize this property of the composite loglikelihood 1, we rewrite Eq. (4.42) in the equivalent form 1 = fiTx  8
=
i :
i= 1
*jxj 
e
I
(4.43)
where fi is the maximumlikelihood estimate of the unknown parameter vector w, defined by w = cyp.1 p*) (4.44) and
8 is a constant threshold, defined by
8 =  (pfC1p12
1
p;c'p2)
(4.45)
4.5 I MaximumLikelihood Gaussian Classifier 117
The maximumlikelihood Gaussian classiJer or Gaussian classijier (for short) may thus be implemented by a linear combiner as shown in the signalflow graph of Fig. 4.5. Equations (4.44) and (4.45) simplify considerably for the following special case. All the samples of the vector x are uncorrelated with each other, and also have a common regardless of the associated class; that is, the covariance matrix C is a diagonal variance uz, matrix defined by
c = u2I
where I is the pbyp identity matrix. We now have
1 C'= 1
u 2
(4.46)
(4.47)
Accordingly, Eqs. (4.44) and (4.45) simplify as, respectively,
% =  (1 11.1 (+2

11.2)
(4.48)
and (4.49) where )).I) denotes the Euclidean norm of the enclosed vector. In any case, treating 1 as the overall classifier output, the Gaussian classifier, characterized by the parameter vector % and threshold 8, solves the patternclassification problem in accordance with the following rule: If 1 2 0, then 1, 2 lZand therefore assign x to class (el If 1 < 0, then lZ> lIand therefore assign x to class (ez (4.50)
The operation of the Gaussian classifier is analogous to that of the singlelayer perceptron in that they are both linear classifiers; see Eqs. (4.2) and (4.43). There are, however, some subtle and important differences between them, which should be carefully noted as discussed here:
w
The singlelayer perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distribution of the two patterns assumed in the derivation of the maximumlikelihood Gaussian classifier do certainly overlap each other and are therefore not exactly separable; the extent of the overlap is determined by the mean vectors pl and p2,and the covariance matrices C1 and Cz. The nature of this overlap is illustrated in Fig. 4.6 for the special case of a scalar input x (i.e., input dimension p = 1). When the inputs are nonseparable and their distributions overlap as described here, the perceptron convergence algorithm develops a problem
FIGURE 4.5
Signalflow graph of Gaussian classifier.
118 4 I The Perceptron
Decision boundary
X
%
FIGURE 4.6
%
Two overlapping, onedimensional Gaussian distributions.
in that the decision boundaries between the different classes may oscillate continuously (Lippmann, 1987). The maximumlikelihood Gaussian classifier minimizes the average probability of classification error. This minimization is independent of the overlap between the underlying Gaussian distributions of the two classes. For example, in the special case illustrated in Fig. 4.6, the maximumlikelihood Gaussian classifier always positions the decision boundary at the point where the two Gaussian dstributions cross each other. This positioning of the decision boundary is perfectly in accord with the optimum Bayes 's hypothesistesting procedure on the assumption that the two classes are equiprobable (ie., there is no prior information to be incorporated into the decision making). The perceptron convergence algorithm is nonparametric in the sense that it makes no assumptions concerning the form of the underlying distributions; it operates by concentrating on errors that occur where the distributions overlap. It may thus be more robust than classical techniques, and work well when the inputs are generated by nonlinear physical mechanisms, and whose distributions are heavily skewed and nonGaussian (Lippmann, 1987). In contrast, the maximumlikelihood Gaussian classifier is parametric; its derivation is contingent on the assumption that the underlying distributions are Gaussian, which may limit its area of application. The perceptron convergence algorithm is both adaptive and simple to implement; its storage requirement is confined to the set of synaptic weights and threshold. On the other hand, the design of the maximumlikelihood Gaussian classifier is fixed; it can be made adaptive, but at the expense of increased storage requirement and more complex computations (Lippmann, 1987).
4 6 Discussion .
The study of the singlelayer perceptron presented in this chapter has been formulated in the context of the McCullochPitts formal model of a neuron. The nonlinear element of this model is represented by a hard limiter at its output end. It would be tempting to think that we can do better if we were to use a sigmoidal nonlinear element in place of the hard limiter. Well, it turns out that the steadystate, decisionmaking characteristics of a singlelayer perceptron are basically the same, regardless of whether we use hardlimiting or softlimiting as the source of nonlinearity in the neural model (Shynk and Bershad,
4.6 I Discussion 119
1991, 1992; Shynk, 1990). We may therefore state formally that so long as we limit ourselves to the model of a neuron that consists of a linear combiner followed by a nonlinear element, then regardless of the form of nonlinearity used, a singlelayer perceptron can perform pattern classification only on linearly separable patterns. Linear separability requires that the patterns to be classified must be sufficiently separated from each other to ensure that the decision surfaces consist of hyperplanes. This requirement is illustrated in Fig. 4.7a for the case of a twodimensional, singlelayer perceptron. If now the two patterns TI and (e2 are allowed to move too close to each other, as in Fig. 4.7b, they become nonlinearly separable and the elementary perceptron fails to classify them. The first real critique of Rosenblatt’s perceptron was presented by Minsky and Selfridge (1961). Minsky and Selfridge pointed out that the perceptron as defined by Rosenblatt could not even generalize toward the notion of binary parity, let alone make general abstractions. They also suggested the roles that connectionist networks might play in the implementation of larger systems. The computational limitations of Rosenblatt’ s perceptron were subsequently put on a solid mathematical foundation in the famous book by Minsky and Papert (1969, 1988). This is a thorough and wellwritten book. After the presentation of some brilliant and highly detailed mathematical analysis of the perceptron, Minsky and Papert proved that the perceptron as defined by Rosenblatt is inherently incapable of making some global generalizations on the basis of locally learned examples. In the last chapter of their book, Minsky and Papert go on to make the conjecture that the limitations of the kind they had discovered for Rosenblatt’s perceptron would also hold true for its variants, more specifically, multilayer neural networks. Quoting from Section 13.2 of the book by Minsky and Papert (1969):
( b)
FIGURE 4.7 patterns.
(a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable
120 4 / The Perceptron
The perceptron has shown itself worthy of study despite (and even because of !) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the manylayered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.
This conjecture cast serious doubt on the computational capabilities of not only the perceptron but neural networks in general up to the mid1980s. History has shown, however, that the conjecture made by Minsky and Papert seems to be unjustified in that we now have several advanced forms of neural networks that are computationally more powerful than Rosenblatt’s perceptron. For example, multilayer perceptrons trained with the backpropagation algorithm discussed in Chapter 6 and the radial basisfunction networks discussed in Chapter 7 overcome the computational limitations of the singlelayer perceptron.
4.1 The singlelayer perceptron is a linear pattern classifier. Justify this statement.
42 Consider two onedimensional, Gaussiandistributed classes .
common variance equal to 1. Their mean values are as follows:
p = 10 1
p = 410 z
and %z that have a
These two classes are essentially linearly separable. Design a classifier that separates these two classes.
4.3 Verify that Eqs. (4.22)(4.25), summarizing the perceptron convergence algorithm, are consistent with Eqs. (4.7) and (4.8).
4.4 Equations (4.44) and (4.45) define the weight vector and threshold of a maximumor %z are likelihood Gaussian classifier, assuming that the samples of classes correlated. Suppose now that these samples are uncorrelated but have different variances, so that for both classes we may write C = diag[cr:,
( ; T,
. . . ,cr3
Find the weight vector and threshold of the classifier so defined.
4.5 In Eq. (4.27) we defined one form of an instantaneous performance function for a I singlelayer perceptron. Another way of expressing this performance measure is as follows (Shynk, 1990):
j(n)
=
Iu(n)l  d(n) u(n)
where u(n) is the hardlimiter input, and d(n) is the desired response. Using this performance measure, find the instantaneous gradient vector V,j(n), and show that the expression so derived is identical with that in Eq. (4.31). 4.6 The perceptron may be used to perform numerous logic functions. Demonstrate the implementation of the binary logic functions AND, OR, and COMPLEMENT. However, a basic limitation of the perceptron is that it cannot implement the EXCLUSIVE OR function. Explain the reason for this limitation.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue listening from where you left off, or restart the preview.