
Introduction To Perceptron Networks

Jan Jantzen <jj@iau.dtu.dk>

Abstract
When it is time-consuming or expensive to model a plant using the basic laws of physics, a neural network approach can be an alternative. From a control engineer's viewpoint a two-layer perceptron network is sufficient. It is indicated how to model a dynamic plant using a perceptron network.

Contents
1 Introduction
2 The perceptron
  2.1 Perceptron training
  2.2 Single layer perceptron
  2.3 Gradient descent learning
  2.4 Multi-layer perceptron
  2.5 Back-propagation
  2.6 A general model
3 Practical issues
  3.1 Rate of learning
  3.2 Pattern and batch modes of training
  3.3 Initialisation
  3.4 Scaling
  3.5 Stopping criteria
  3.6 The number of training examples
  3.7 The number of layers and neurons
4 How to model a dynamic plant
5 Summary

Technical University of Denmark, Department of Automation, Bldg 326, DK-2800 Lyngby, DENMARK. Tech. report no 98-H 873 (nnet), 25 Oct 1998.

Figure 1: A neural network models a plant in a forward learning manner. (The network is placed in parallel with the plant; the plant output y^p is compared with the network output y^m.)



1 Introduction

A neural network is basically a model structure and an algorithm for fitting the model to some given data. The network approach to modelling a plant uses a generic nonlinearity and allows all the parameters to be adjusted. In this way it can deal with a wide range of nonlinearities. Learning is the procedure of training a neural network to represent the dynamics of a plant, for instance in accordance with Fig. 1. The neural network is placed in parallel with the plant, and the error e between the output of the system and the output of the network, the prediction error, is used as the training signal. Neural networks have a potential for intelligent control systems because they can learn and adapt, they can approximate nonlinear functions, they are suited for parallel and distributed processing, and they naturally model multivariable systems. If a physical model is unavailable or too expensive to develop, a neural network model might be an alternative.

Networks are also used for classification. An example of an industrial application concerns acoustic quality control (Meier, Weber & Zimmermann, 1994). A factory produces ceramic tiles, and an experienced operator is able to tell a bad tile from a good one by hitting it with a hammer; if there are cracks inside, it makes an unusual sound. In the quality control system the tiles are hit automatically and the sound is recorded with a microphone. A neural network (a so-called Kohonen network) tells the bad tiles from the good ones, with an acceptable rate of success, after being presented with a number of examples.

The objective here is to present the subject from a control engineer's viewpoint. Thus, there are two immediate application areas: models of (industrial) processes, and controllers for (industrial) processes.

In the early 1940s McCulloch and Pitts studied the connection of several basic elements based on a model of a neuron, and Hebb studied the adaptation in neural systems. Rosenblatt devised in the late 50s the perceptron, now widely used. Then in 1969 Minsky and Papert pointed to several limitations of the perceptron, and as a consequence research in the field slowed down for lack of funding. The catalyst for today's level of research was a

series of results and algorithms published in 1986 by Rumelhart and his co-workers. In the 90s neural networks and fuzzy logic came together in neurofuzzy systems, since both techniques are applied where there is uncertainty. There are now many real-world applications, ranging from finance to aerospace.

There are many neural network architectures, such as the perceptron, multi-layer perceptrons, networks with feedback loops, self-organising systems, and dynamical networks, together with several different learning methods, such as error-correction learning, competitive learning, supervised and unsupervised learning (see the textbook by Haykin, 1994). Neural network types and learning methods have been organised into a brief classification scheme, a taxonomy (Lippmann, 1987). Neural networks have already been examined from a control engineer's viewpoint (Miller, Sutton & Werbos in Hunt, Sbarbaro, Zbikowski & Gawthrop, 1992). Neural networks can be used for system identification (forward modelling, inverse modelling) and for control, such as supervised control, direct inverse control, model reference control, internal model control, and predictive control (see the overview article by Hunt et al., 1992). Within the realm of modelling, identification, and control of nonlinear systems there are applications to pattern recognition, information processing, design, planning, and diagnosis (see the overview article by Fukuda & Shibata, 1992). Hybrid systems using neural networks, fuzzy sets, and artificial intelligence technologies exist, and these are surveyed also in that article. A systematic investigation of neural networks in control confirms that neural networks can be trained to control dynamic, nonlinear, multivariable, and noisy processes (see the PhD thesis by Sørensen, 1994). Somewhat related is Nørgaard's investigation (1996) of their application to system identification, and he also proposes improvements to specific control designs. A comprehensive systematic classification of the control schemes proposed in the literature has been attempted by Agarwal (1997). His taxonomy is a tree with control schemes using neural networks at the top node, broken down into two classes: neural network only as aid, and neural network as controller. These are then further refined.

There are many commercial tools for building and using neural networks, either alone or together with fuzzy logic tools; for an overview, see the database CITE (MIT, 1995). Neural network computations are naturally expressed in matrix notation, and there are several toolboxes in the matrix language Matlab, for example the commercial neural network toolbox (Demuth & Beale, 1992), and a university developed toolbox for identification and control, downloadable from the World Wide Web (Nørgaard, NNSYSID with NNCTRL).

In the wider perspective of supervisory control, there are other application areas, such as robotic vision, planning, diagnosis, quality control, and data analysis (data mining). The strategy in this lecture note is to aim at all these application areas, but only present the necessary and sufficient neural network material for understanding the basics.



2 The perceptron

Given a classification problem, the set of data X is to be classified into the classes C_1 and C_2. A neural network can learn from data and improve its performance through learning, and that is the ability we are interested in. In a learning phase we wish to present a subset of examples to the network in order to train it. In a following generalisation phase we wish the trained network to make the correct classification with data it has never seen before.

Example 1 (tiles). Tiles are made from clay, moulded into the right shape, brushed, glazed, and baked. Unfortunately, the baking may produce invisible cracks. Operators can detect the cracks by hitting the tiles with a hammer, and in an automated system the response is recorded with a microphone, filtered, Fourier transformed, and normalised. A small set of data, adapted from MIT (1997), is given in Table 1.

    475 Hz    557 Hz    Quality OK?
    0.958     0.003     Yes
    1.043     0.001     Yes
    1.907     0.003     Yes
    0.780     0.002     Yes
    0.579     0.001     Yes
    0.003     0.105     No
    0.001     1.748     No
    0.014     1.839     No
    0.007     1.021     No
    0.004     0.214     No

Table 1: Frequency intensities for ten tiles.

The perceptron, the simplest form of a neural network, is able to classify data into two classes. Basically it consists of a single neuron with a number of adjustable weights. The neuron is the fundamental processor of a neural network (Fig. 2). It has three basic elements:

1. A set of connecting links (or synapses); each link carries a weight (or gain) w_0, w_1, w_2.

2. A summation (or adder) sums the input signals after they are multiplied by their respective weights.

3. An activation function f(x) limits the output of the neuron. Typically the output is limited to the interval [0, 1] or alternatively [-1, 1].

The summation in the neuron also includes an offset w_0 for lowering or raising the net input to the activation function. Mathematically the input to the neuron is represented by a vector U = (1, u_1, u_2, ..., u_n)^T, and the output is a scalar y = f(x). The weights of the connections are represented by the vector W = (w_0, w_1, ..., w_n)^T, where w_0 is the offset. The output is calculated as

    y = f(W^T U)    (1)

Fig. 2a is a perceptron with two inputs and an offset. With a hard limiter as activation function (b), the neuron produces an output equal to +1 or -1, which we can associate with C_1 and C_2, respectively.

Figure 2: Perceptron consisting of a neuron (a) with an offset w_0 and an activation function f(x), which is a hard limiter (b).
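As an illustration of (1), the neuron of Fig. 2 can be written in a few lines of Python. This is a minimal sketch; the weight values are arbitrary placeholders, not results from the text:

```python
def hard_limiter(x):
    """Hard limiter activation: +1 for x >= 0, otherwise -1 (Fig. 2b)."""
    return 1.0 if x >= 0 else -1.0

def neuron(weights, inputs, f=hard_limiter):
    """Single neuron, eq. (1): y = f(W^T U) with U = (1, u1, ..., un)."""
    u = [1.0] + list(inputs)              # prepend 1 for the offset w0
    x = sum(w * ui for w, ui in zip(weights, u))
    return f(x)

# Two inputs and an offset, as in Fig. 2a (weights are arbitrary examples)
print(neuron([0.5, 0.5, 0.5], [0.958, 0.003]))   # +1, i.e. class C1
```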



2.1 Perceptron training

The weights W are adjusted using an adaptive learning rule. One such learning rule is the perceptron convergence algorithm. If the two classes C_1 and C_2 are linearly separable (i.e., they lie on opposite sides of a straight line or, in general, a hyperplane), then there exists a weight vector W with the properties

    W^T U >= 0  for every input vector U belonging to class C_1
    W^T U < 0   for every input vector U belonging to class C_2    (2)

Assuming, to be general, that the perceptron has m inputs, the equation W^T U = 0, in an m-dimensional space with coordinates u_1, u_2, ..., u_m, defines a hyperplane as the switching surface between the two classes of input. The training process adjusts the weights W to satisfy the two inequalities (2). A training set consists of, say, K samples of the input vector U together with each sample's class membership (0 or 1). A presentation of the complete training set to the perceptron is called an epoch. The learning is continued epoch by epoch until the weights stabilise.

Algorithm. The core of the perceptron convergence algorithm for adapting the weights of the elementary perceptron has two steps.

(a) If the kth member of the training set, U_k (k = 1, 2, ..., K), is correctly classified by the weight vector W_k computed at the kth iteration of the algorithm, no correction is made to W_k, as shown by

    W_{k+1} = W_k  if W_k^T U_k >= 0 and U_k belongs to class C_1    (3)
    W_{k+1} = W_k  if W_k^T U_k < 0 and U_k belongs to class C_2     (4)

(b) Otherwise the perceptron weights are updated according to the rule

    W_{k+1} = W_k - η_k U_k  if W_k^T U_k >= 0 and U_k belongs to class C_2    (5)
    W_{k+1} = W_k + η_k U_k  if W_k^T U_k < 0 and U_k belongs to class C_1     (6)

Notice the interchange of the class numbers from step (a) to step (b). The learning rate η_k controls the adjustment applied to the weight vector at iteration k. If η is a constant, independent of the iteration number k, we have a fixed increment adaptation rule for the perceptron. The algorithm has been proved to converge (Haykin, 1994; Lippmann, 1987).

Figure 3: After two updates the angle θ is larger than 90 degrees, and W^T U = |W||U| cos(θ) changes sign.

The perceptron learning rule is illustrated in Fig. 3. The input vector U points at some point in the m-dimensional space marked by a star. The update rule is based on the definition of the dot product of two vectors, which relates the angle θ between the input vector and the weight vector W,

    W^T U = |W||U| cos(θ)

The figure shows a situation where the weight vector W(1) needs to be changed. The angle θ is less than 90 degrees, cos(θ) is positive, and the update rule changes the weight vector into W(2) by the amount ηU in the direction opposite of U. The weight vector turned, but not enough, so another pass is necessary to bring it to W(3). Now the angle is larger than 90 degrees, and the sign of W^T U is correct. The vector W is orthogonal to the hyperplane that separates the classes (Fig. 4). When W turns, the hyperplane turns with it, until the hyperplane separates the classes correctly.

Example 2 (tiles). We will train the perceptron in Fig. 2 to detect bad tiles. The inputs to the neuron are the first two columns in Table 1. Let us associate the good tiles with class C_1 and the bad ones with C_2. The first test input to the perceptron, including a 1 for the offset weight, is then

    U^T = (1, 0.958, 0.003)    (7)

This corresponds to a tile from C_1. The weights, including the offset weight, have to be initialised, and let us arbitrarily start with

    W^T = (0.5, 0.5, 0.5)    (8)

Figure 4: The weight vector defines the line (hyperplane) that separates the white class from the shaded class.

The learning rate is fixed at

    η = 0.5    (9)

According to the perceptron convergence algorithm we have to test the product

    W^T U = (1, 0.958, 0.003)(0.5, 0.5, 0.5)^T    (10)
          = 0.981    (11)

The sign is positive, corresponding to the case in (3); therefore we leave the weights of the perceptron as they are. We repeat the procedure with the next set of inputs, and the next, and so on. The results from a hand calculation of two passes through the whole set of data are collected in Table 2. There are no updates to the weights before we reach the first bad tile; then we apply (5), and the weights change. After the first pass it looks like the perceptron can recognise the bad tiles, but as the weights have changed since the initial run, we have to go back and check the good tiles again. In the second pass (Table 2, second pass) the weights are updated in the first iteration, but after that there are no further changes. A third pass (not shown) will show that the adaptation has stopped. The perceptron is indeed able to distinguish between good and bad tiles. Figure 5 shows a logarithmic plot of the data and a separating line between the two classes. The final weights are

    W^T = (0, 0.977, -0.425)    (12)

That is, no offset, and the line where the sign switches is defined by

    0.977 u_1 - 0.425 u_2 = 0    (13)

This is equivalent to the straight line u_2 = (0.977/0.425) u_1 drawn in the figure. With a hard limiter (signum function) as the activation function, the perceptron produces an output according to

    y = sgn(W^T U)    (14)

Any new data, drawn from beyond the training set, but above the line, will result in y = -1 (bad), and any data below the line will result in y = +1 (good).

First pass:

    U^T                W^T                    OK?  W^T U    Eq.  Updated W^T
    (1,0.958,0.003)    (0.5,0.5,0.5)          1     0.981   (3)  (0.5,0.5,0.5)
    (1,1.043,0.001)    (0.5,0.5,0.5)          1     1.022   (3)  (0.5,0.5,0.5)
    (1,1.907,0.003)    (0.5,0.5,0.5)          1     1.455   (3)  (0.5,0.5,0.5)
    (1,0.780,0.002)    (0.5,0.5,0.5)          1     0.891   (3)  (0.5,0.5,0.5)
    (1,0.579,0.001)    (0.5,0.5,0.5)          1     0.790   (3)  (0.5,0.5,0.5)
    (1,0.003,0.105)    (0.5,0.5,0.5)          0     0.554   (5)  (0,0.499,0.448)
    (1,0.001,1.748)    (0,0.499,0.448)        0     0.783   (5)  (-0.5,0.498,-0.427)
    (1,0.014,1.839)    (-0.5,0.498,-0.427)    0    -1.277   (4)  (-0.5,0.498,-0.427)
    (1,0.007,1.021)    (-0.5,0.498,-0.427)    0    -0.932   (4)  (-0.5,0.498,-0.427)
    (1,0.004,0.214)    (-0.5,0.498,-0.427)    0    -0.589   (4)  (-0.5,0.498,-0.427)

Second pass:

    U^T                W^T                    OK?  W^T U    Eq.  Updated W^T
    (1,0.958,0.003)    (-0.5,0.498,-0.427)    1    -0.024   (6)  (0,0.977,-0.425)
    (1,1.043,0.001)    (0,0.977,-0.425)       1     1.019   (3)  (0,0.977,-0.425)
    (1,1.907,0.003)    (0,0.977,-0.425)       1     1.862   (3)  (0,0.977,-0.425)
    (1,0.780,0.002)    (0,0.977,-0.425)       1     0.761   (3)  (0,0.977,-0.425)
    (1,0.579,0.001)    (0,0.977,-0.425)       1     0.565   (3)  (0,0.977,-0.425)
    (1,0.003,0.105)    (0,0.977,-0.425)       0    -0.042   (4)  (0,0.977,-0.425)
    (1,0.001,1.748)    (0,0.977,-0.425)       0    -0.742   (4)  (0,0.977,-0.425)
    (1,0.014,1.839)    (0,0.977,-0.425)       0    -0.768   (4)  (0,0.977,-0.425)
    (1,0.007,1.021)    (0,0.977,-0.425)       0    -0.427   (4)  (0,0.977,-0.425)
    (1,0.004,0.214)    (0,0.977,-0.425)       0    -0.087   (4)  (0,0.977,-0.425)

Table 2: After two passes the perceptron algorithm converges.
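The hand calculation in Table 2 can be reproduced mechanically. The following Python sketch implements the convergence algorithm (3)-(6) on the tile data; the stopping test (an epoch with no updates) is one reasonable choice, not prescribed by the text:

```python
# Tile data from Table 1: (475 Hz, 557 Hz) intensities; first five are good (C1).
data = [(0.958, 0.003), (1.043, 0.001), (1.907, 0.003), (0.780, 0.002),
        (0.579, 0.001), (0.003, 0.105), (0.001, 1.748), (0.014, 1.839),
        (0.007, 1.021), (0.004, 0.214)]
good = [True] * 5 + [False] * 5
w = [0.5, 0.5, 0.5]          # initial weights (w0, w1, w2), eq. (8)
eta = 0.5                    # fixed learning rate, eq. (9)

changed = True
while changed:               # one loop iteration = one epoch (pass)
    changed = False
    for (u1, u2), is_good in zip(data, good):
        u = (1.0, u1, u2)                         # include offset input
        s = sum(wi * ui for wi, ui in zip(w, u))  # W^T U
        if s >= 0 and not is_good:                # eq. (5): subtract eta*U
            w = [wi - eta * ui for wi, ui in zip(w, u)]
            changed = True
        elif s < 0 and is_good:                   # eq. (6): add eta*U
            w = [wi + eta * ui for wi, ui in zip(w, u)]
            changed = True

print(w)   # approx (0, 0.977, -0.425), cf. eq. (12)
```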

Figure 5: Classification of tiles in the input space (logarithmic axes, 475 Hz versus 557 Hz; bad tiles above the separating line, good tiles below).



2.2 Single layer perceptron

Given a problem which calls for more than two classes, several perceptrons can be combined into a network. The simplest form of a layered network just has an input layer of source nodes that connect to an output layer of neurons, but not vice versa. Fig. 6 shows a single layer perceptron with five nodes, capable of recognising three linearly separable classes (three output nodes) by means of two features (two input nodes). Each output neuron defines a weight vector, which in turn defines a hyperplane. The hyperplanes separate the classes as in Fig. 7.

The activation function could have various shapes depending on the application; Fig. 8 shows six different types. The mathematical expressions for the three in the bottom row are, respectively,

    f(x) = 1 / (1 + exp(-ax))    (15)
    f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))    (16)
    f(x) = exp(-x^2)    (17)

Here (15) is a logistic function, an example of the widely used sigmoid function; in general a sigmoid is an s-shaped function, strictly increasing and asymptotic. The parameter a is used to adjust the slope; in the limit, as a approaches infinity, the sigmoid becomes a step. Equation (16) is a hyperbolic tangent; it is symmetric about the origin, which can be convenient. The third function (17) is a Gaussian function.
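The three bottom-row functions (15)-(17) translate directly into code; a minimal sketch:

```python
import math

def logistic(x, a=1.0):
    """Eq. (15): logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * x))

def tanh_act(x):
    """Eq. (16): hyperbolic tangent, symmetric about the origin."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def gaussian(x):
    """Eq. (17): Gaussian bump centred at zero."""
    return math.exp(-x * x)

print(logistic(0.0), tanh_act(0.0), gaussian(0.0))   # 0.5 0.0 1.0
```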



2.3 Gradient descent learning

If we introduce the desired response d of the network,

    d = +1  if U belongs to class C_1
        -1  if U belongs to class C_2    (18)

the learning rules (3)-(6) can be expressed in a more convenient way. They can be summed up nicely in the form of an error-correction learning rule, the so-called delta rule, as shown by

    W_{k+1} = W_k + ΔW_k    (19)
            = W_k + η(d_k - y_k)U_k    (20)

The difference d_k - y_k plays the role of an error signal, and k corresponds to the current learning sample. The learning rate parameter η is a positive constant limited to the range 0 < η <= 1. The learning rule increases the output y by increasing W when the error e = d - y is positive; therefore w_i increases if u_i is positive, and it decreases if u_i is negative (and similarly when the error is negative). Features of the delta rule are as follows:

 - It is simple,
 - learning can be performed locally at each neuron, and
 - weights are updated on-line after presentation of each pattern.

A one-step sketch of the rule in code is given after this list.
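A minimal sketch of the delta rule (19)-(20) for a linear neuron; the data pair and the learning rate are arbitrary illustrations:

```python
def delta_rule_step(w, u, d, eta=0.1):
    """One delta-rule update, eqs. (19)-(20): W <- W + eta*(d - y)*U.

    w: weights (w0, w1, ...); u: inputs with leading 1 for the offset;
    d: desired output. A linear activation is assumed, so y = W^T U.
    """
    y = sum(wi * ui for wi, ui in zip(w, u))
    e = d - y                                  # error signal
    return [wi + eta * e * ui for wi, ui in zip(w, u)]

w = [0.0, 0.0]
for _ in range(100):                           # repeated presentation
    w = delta_rule_step(w, [1.0, 1.0], 0.5)    # example pair: u = 1, d = 0.5
print(w)   # w0 + w1 approaches 0.5
```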

Figure 6: Single layer perceptron network with three output neurons (inputs u_1, u_2; outputs y_1, y_2, y_3).

Figure 7: Three linearly separable classes (C_1, C_2, C_3, separated by the lines v_1 = 0, v_2 = 0, v_3 = 0).


Figure 8: Examples of six activation functions (top row: linear, limiter, hard limiter; bottom row: logistic, tanh, Gaussian).

The ultimate purpose of error-correction learning is to minimise a cost function based on the error signal e. Then the response of each output neuron approaches the target response for that neuron in some statistical sense. Indeed, once a cost function is selected, the learning is strictly an optimisation problem. A common cost function is the sum of the squared errors,

    E(W) = (1/2) Σ_r e_r^2    (21)

The summation is over all the output neurons of the network (cf. index r). The network is trained by minimising E with respect to the weights, and this leads to the gradient descent method. The factor 1/2 is used to simplify differentiation when minimising E. A plot of the cost function E versus the weights is a multidimensional error surface. Depending on the type of activation functions used in the network, we may encounter two situations:

1. The network has entirely linear neurons. The error surface is a quadratic function of the weights, and it is bowl-shaped with a unique minimum point.

2. The network has nonlinear neurons. The surface has a global minimum (perhaps multiple) as well as local minima.

In both cases, the objective of the learning is to start from an arbitrary point on the surface, determined by the initial weights and the initial inputs, and move toward a global minimum in a step-by-step fashion. This is feasible in the first case, whereas in the second case the algorithm may get trapped in a local minimum, never reaching a global minimum.


Figure 9: Example of error surface for a simple perceptron.

Example 3 (error surface). To see how the error surface is formed, we will examine a simple perceptron with one neuron (r = 1), one input (m = 1), an offset, and a linear activation function. The network output is

    y = f(W^T U) = W^T U = (w_0, w_1)(1, u)^T = w_0 + w_1 u    (22)

Given a particular input u and the corresponding desired output d, we can plot the error surface. Assume

    u = 1,  d = 0.5    (23)

then the error function becomes

    E(W) = (1/2) e^2    (24)
         = (1/2) (d - y)^2    (25)
         = (1/2) (d - (w_0 + w_1 u))^2    (26)
         = (1/2) (0.5 - (w_0 + w_1))^2    (27)

This is a parabolic bowl, as the plot indicates in Fig. 9. In case there are K examples in the training set, then u, y, and d become row vectors of length K, and the squared errors are summed for each instance of w_0 and w_1,

    E(W) = (1/2) Σ (D - (w_0 + w_1 U))^2    (28)

The squaring operation is element-by-element, and we have a vector under the summation symbol. The summation is over the elements of that vector. The result is a scalar for each combination of the scalars w_0 and w_1.

To sum up, in order to make E(W) small, the idea is to change the weights W in the direction of the negative gradient of E, that is

    ΔW = -η ∂E/∂W = -η (∂e/∂W) e    (29)

Here e^2 has been substituted for Σ_r e_r^2. The partial derivative ∂e/∂W tells how much the error is influenced by changes in the weights.

Figure 10: Fully connected multi-layer perceptron.

Example 4 (alternative cost function). There are alternatives to the cost function (21). If it is chosen to be E(W) = |e|, the gradient method gives

    ΔW = -η (∂e/∂W) sign(e)
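The bowl of Example 3 can be inspected numerically by evaluating (27) on a grid of weights; a small sketch (the grid values are arbitrary):

```python
# Error surface of Example 3: E(w0, w1) = 0.5*(0.5 - (w0 + w1))**2
def E(w0, w1, u=1.0, d=0.5):
    y = w0 + w1 * u              # linear neuron, eq. (22)
    return 0.5 * (d - y) ** 2    # eqs. (24)-(27)

# Evaluate on a coarse grid; the minimum lies on the line w0 + w1 = 0.5.
for w0 in [-1.0, 0.0, 1.0]:
    row = [round(E(w0, w1), 3) for w1 in [-1.0, -0.5, 0.0, 0.5, 1.0]]
    print(w0, row)
```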



2.4 Multi-layer perceptron

The single-layer perceptron can only classify linearly separable problems. For non-separable problems it is necessary to use more layers. A multi-layer (feedforward) network has one or more hidden layers, whose neurons are called hidden neurons. The graph in Fig. 10 illustrates a multi-layer network with one hidden layer. The network is fully connected, because every node in a layer is connected to all nodes in the next layer. If some of the links are missing, the network is partially connected. When we say that a network has n layers, we only count the hidden layers and the output layer; the input layer of source nodes does not count, because those nodes do not perform any computations. A single layer network thus refers to a network with just an output layer.

Example 5 (XOR). In Fig. 11 (left) the points marked by an x belong to class C_1, and the points marked by an o belong to C_2. The relation between (u_1, u_2) and the class membership y can be described in the table

    u_1:  0  0  1  1
    u_2:  0  1  0  1
    y:    0  1  1  0    (30)

Figure 11: Two classes x and o that cannot be separated by a line (left) require a network with more than one layer (right).

The table also expresses the logical exclusive-or (XOR) operation; it was used many years ago to demonstrate that the single layer perceptron is unable to classify a very simple set of data. In symbols, a two-input perceptron with one neuron, an offset, and a linear activation function has the following function

    y = f(W^T U) = W^T U = (w_0, w_1, w_2)(1, u_1, u_2)^T = w_0 + w_1 u_1 + w_2 u_2    (31)

It is required to satisfy the following four inequalities:

    0 w_1 + 0 w_2 + w_0 < 0   =>  w_0 < 0             (32)
    0 w_1 + 1 w_2 + w_0 >= 0  =>  w_0 >= -w_2         (33)
    1 w_1 + 0 w_2 + w_0 >= 0  =>  w_0 >= -w_1         (34)
    1 w_1 + 1 w_2 + w_0 < 0   =>  w_0 < -w_1 - w_2    (35)

Using (32) in (33) and (34) implies

    w_2 > 0    (36)
    w_1 > 0    (37)

while (32) and (35) require

    w_1 + w_2 < -w_0    (38)

It is of course impossible to satisfy (32)-(38) all at the same time.

Example 6 (XOR solved). It is possible, nevertheless, to solve the problem with the two-layer perceptron

in Fig. 11 (right), if neuron 1 in the hidden layer has the weights W_1 = (-0.5, 1, -1), neuron 2 in the hidden layer W_2 = (0.5, 1, -1), and the output neuron W_3 = (0.5, 1, -1). Assuming a hard limiter activation function in each neuron, and representing the bits 0 and 1 by -1 and +1 respectively, the overall function of the network is

    y_3 = sgn(W_3^T (1, y_1, y_2)^T) = sgn(0.5 + y_1 - y_2)     (39)
    y_1 = sgn(W_1^T (1, u_1, u_2)^T) = sgn(-0.5 + u_1 - u_2)    (40)
    y_2 = sgn(W_2^T (1, u_1, u_2)^T) = sgn(0.5 + u_1 - u_2)     (41)

By insertion, the input-output relation is

    y_3 = sgn(0.5 + sgn(-0.5 + u_1 - u_2) - sgn(0.5 + u_1 - u_2))    (42)

With U_1 = (-1, -1, 1, 1)^T and U_2 = (-1, 1, -1, 1)^T the output becomes Y = (-1, 1, 1, -1)^T, as desired, cf. (30).
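The claim is easy to verify mechanically; a Python sketch of the network (39)-(41):

```python
def sgn(x):
    return 1 if x >= 0 else -1

def xor_net(u1, u2):
    """Two-layer perceptron of Fig. 11 (right), eqs. (39)-(41)."""
    y1 = sgn(-0.5 + u1 - u2)       # hidden neuron 1, W1 = (-0.5, 1, -1)
    y2 = sgn( 0.5 + u1 - u2)       # hidden neuron 2, W2 = (0.5, 1, -1)
    return sgn(0.5 + y1 - y2)      # output neuron,   W3 = (0.5, 1, -1)

# Bits 0 and 1 are represented by -1 and +1, respectively.
for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(u1, u2, xor_net(u1, u2))   # outputs -1, 1, 1, -1, cf. (30)
```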



2.5 Back-propagation

The error back-propagation algorithm is a popular learning rule for multi-layer perceptrons. It is based on the delta rule, and it uses the squared error measure (21) for output nodes. A perceptron weight w_ji, the weight on the connection from neuron i to neuron j, is updated according to the generalised delta rule

    w_ji = w_ji + Δw_ji = w_ji - η ∂E/∂w_ji    (43)

To make the notation clearer, the index of the training instance k has been omitted from the expression; it is understood that the equation is recursive, such that the w_ji on the left hand side is the new weight, while the w_ji on the right hand side is the old weight. In the generalised delta rule the correction to the weight is proportional to the gradient of the error or, in other words, to the sensitivity of the error to changes in the weight.

To apply the algorithm two passes of computation are necessary, a forward pass and a backward pass. In the forward pass the weights remain unchanged. The forward pass begins at the first hidden layer by presenting it with the input vector, and terminates at the output layer by computing the error signal for each output neuron. The backward pass starts at the output layer by passing the error signals backwards through the network, layer by layer.

For the backward pass, assume that neuron j is an output neuron. The derivative may be expressed using the chain rule,

    ∂E/∂w_ji = (∂E/∂e_j)(∂e_j/∂y_j)(∂y_j/∂v_j)(∂v_j/∂w_ji) = e_j (-1) f_j'(·) y_i    (44)

Here e_j is the error of neuron j, y_j is the output of the neuron, and v_j is the internal input to neuron j after summation, but before the activation function, i.e., y_j = f(v_j). The f' notation means differentiation of f with respect to its argument. Insertion into (43) yields

    Δw_ji = η e_j f_j'(·) y_i,  j is an output neuron    (45)

The error e_j of neuron j is associated with a desired output d_j, and it is simply e_j = d_j - y_j. For a hidden node i we have to back-propagate the error recursively, since it occurs in the first two partial derivatives in (44). But there is a problem, in that the error is unknown for a hidden node. We will instead compute directly

    ∂E/∂y_i    (46)

The partial derivative of E with respect to the output of hidden neuron i, connected directly to output neuron j, is

    ∂E/∂y_i = Σ_j e_j ∂e_j/∂y_i    (47)
            = Σ_j e_j (∂e_j/∂v_j)(∂v_j/∂y_i)    (48)
            = Σ_j e_j (∂(d_j - f_j(v_j))/∂v_j)(∂v_j/∂y_i)    (49)
            = -Σ_j e_j f_j'(·) w_ji    (50)

In this case the update rule becomes

    Δw_ih = η [Σ_j e_j f_j'(·) w_ji] f_i'(·) y_h    (51)

Here i is a hidden neuron, h is an immediate predecessor of i, and j is an output neuron. The factor f_i'(·) depends solely on the activation function of hidden neuron i. The summation over j requires knowledge of the error signals e_j for all the output neurons. The term w_ji consists of the weights associated with the output connections. Finally, y_h is the input to neuron i, and it stems from the partial derivative ∂v_i/∂w_ih = y_h. For a neuron in the next layer, going towards the input layer, the update rule is applied recursively.

In summary, the correction to the weight w_qp on the connection from neuron p to neuron q is defined by

    Δw_qp = η δ_q y_p    (52)

Here η is the learning rate parameter, δ_q is called the local gradient, cp. (29) or (43), and y_p is the input signal to neuron q. The local gradient is computed recursively for each neuron, and it depends on whether the neuron is an output neuron or a hidden neuron:

1. If neuron q is an output neuron, then

    δ_q = e_q f_q'(·)    (53)

Both factors in the product are associated with neuron q.

2. If neuron q is a hidden neuron, then

    δ_q = f_q'(·) Σ_r δ_r w_rq    (54)

The summation is a weighted sum of the δ_r's of the immediate successor neurons.

The weights on the connections feeding the output layer are updated using the delta rule (52), where the local gradient δ_q is as in (53). Given the δ's for the neurons of the output layer, we next use (54) to compute the δ's for all the neurons in the next layer, and the changes to all the weights of the connections feeding it. To compute the δ for each neuron, the activation function f(·) of that neuron must be differentiable.

Algorithm (after Lippmann, 1987). The back-propagation algorithm has five steps.

(a) Initialise weights. Set all weights to small random values.

(b) Present inputs and desired outputs (training pairs). Present an input vector U and specify the desired outputs D. If the net is used as a classifier, then all desired outputs are typically set to zero, except one (set to 1). The input could be new on each trial, or samples from a training set could be presented cyclically until the weights stabilise.

(c) Calculate actual outputs. Calculate the outputs by successive use of Y = F(W^T U), where F is a vector of activation functions.

(d) Adapt weights. Start at the output neurons and work back to the first hidden layer. Adjust weights by

    w_ji = w_ji + η δ_j y_i    (55)

In this equation w_ji is the weight from hidden neuron i (or an input node) to neuron j, y_i is the output of neuron i (or an input), η is the learning rate parameter, and δ_j is the gradient; if neuron j is an output neuron then it is defined by (53), and if neuron j is hidden then it is defined by (54). Convergence is sometimes faster if a momentum term is added according to (78).

(e) Go to step (b).

Example 7 (local gradient). The logistic function f(x) = 1/(1 + exp(-x)) is differentiable, with the derivative f'(x) = exp(-x)/(1 + exp(-x))^2. We can eliminate the exponential term and express it as

    f'(x) = f(x)[1 - f(x)]    (56)

For a neuron j in the output layer we may then express the local gradient as

    δ_j = e_j f_j'(·) = (d_j - y_j) y_j (1 - y_j)    (57)

For a hidden neuron i the local gradient is

    δ_i = f_i'(·) Σ_j δ_j w_ji = y_i (1 - y_i) Σ_j δ_j w_ji    (58)
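Equations (56)-(58) in code, together with a numerical check of the derivative identity; a minimal sketch:

```python
import math

def f(x):
    """Logistic activation, f(x) = 1/(1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def delta_output(d, y):
    """Eq. (57): local gradient of an output neuron, using (56)."""
    return (d - y) * y * (1.0 - y)

def delta_hidden(y_i, successors):
    """Eq. (58): local gradient of hidden neuron i.

    successors: pairs (delta_j, w_ji) over the neurons j fed by neuron i.
    """
    return y_i * (1.0 - y_i) * sum(dj * wji for dj, wji in successors)

# Numerical check of the identity (56): f'(x) = f(x)*(1 - f(x))
x = 0.3
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
print(abs(numeric - f(x) * (1.0 - f(x))) < 1e-9)   # True
```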

Example 8 (back-propagation, after Demuth & Beale, 1992). To study back-propagation we will consider a simple example with the training set

    u:  -3    2
    y:  0.4   0.8    (59)

Figure 12: Nonlinear error surface for back-propagation (left: sum squared error versus w_0 and w_1; right: error contours).

Thus the network will have one input, there will be no hidden layer, and the output layer will have one output neuron. To make the example more interesting, we choose a nonlinear activation function (the logistic function). The overall function is then

    y = f((w_0, w_1)(1, u)^T) = f(w_0 + w_1 u)    (60)

The error is

    E = (1/2) Σ e^2 = (1/2) Σ (d - y)^2 = (1/2) Σ ((0.4, 0.8) - f(w_0 + w_1 (-3, 2)))^2    (61)

With some help from the neural network toolbox (Demuth & Beale, 1992) we can plot the error surface, see Fig. 12 (the toolbox omits the factor 1/2 in (61)). As the learning progresses, the weights are adjusted to move the network down the gradient. The error contour graph on the right side of Fig. 12 shows how the network moved from its original values to a solution. Notice that it more or less chooses the steepest path. The network error is saved throughout the training, and the error per epoch is plotted in Fig. 13. The error decreases throughout the training, and the training stops after about 80 epochs (cf. Fig. 13), when the error drops below the preset limit of 0.001. At this point the final values of the weights are (w_0, w_1) = (0.5497, 0.3350).
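A plain Python sketch of this training run, using batch gradient descent on (61); the learning rate and the all-zero initial weights are assumptions, since the toolbox's actual settings are not stated in the text:

```python
import math

pairs = [(-3.0, 0.4), (2.0, 0.8)]       # training set (59)
w0, w1 = 0.0, 0.0                       # assumed starting point
eta = 0.5                               # assumed learning rate

sse, epochs = 1.0, 0
while sse >= 0.001:                     # preset error limit from the text
    sse, g0, g1 = 0.0, 0.0, 0.0
    for u, d in pairs:
        y = 1.0 / (1.0 + math.exp(-(w0 + w1 * u)))   # eq. (60)
        e = d - y
        sse += e * e                    # toolbox convention: no factor 1/2
        delta = e * y * (1.0 - y)       # local gradient, eq. (57)
        g0 += delta                     # gradient contribution for w0
        g1 += delta * u                 # gradient contribution for w1
    w0 += eta * g0                      # batch update after the epoch
    w1 += eta * g1
    epochs += 1

# Stops once the sum squared error drops below 0.001; the text reports
# weights (0.5497, 0.3350) at its own stopping point.
print(epochs, round(w0, 4), round(w1, 4))
```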



2.6 A general model

A standard, matrix oriented model for neural networks has been developed (Hunt, Sbarbaro, Zbikowski and Gawthrop, 1992), and many well-known structures fall within that model.

Figure 13: Network error (sum-squared error per epoch, logarithmic scale).

Figure 14: Generalised neuron model (a weighted summer with weights a_i1, ..., a_iN and b_i1, ..., b_iM, a linear dynamics block, and an activation function f).


The basic processing element is considered to have three elements (see Fig. 14): a weighted summer, a linear dynamic single-input single-output function, and a non-dynamic nonlinear function.

 - The weighted summer is described by

    v_i(t) = Σ_{j=1..N} a_ij f(x_j)(t) + Σ_{k=1..M} b_ik u_k(t)    (62)

giving a weighted sum v_i in terms of the outputs of all elements f(x_j), external inputs u_k, and corresponding weights a_ij and b_ik, with offsets included in b_ik. To simplify the notation the time index will be omitted in the following, and the equation may be written in matrix notation as

    V = A F(X) + B U    (63)

where A is an N x N matrix of weights a_ij, and B is an N x M matrix of weights b_ik. The neurons are enumerated consecutively from top to bottom, layer by layer in the forward direction. The offsets are incorporated with the inputs U, but it could be useful to represent them explicitly.

 - The linear dynamic function has input v_i and output x_i. In transfer function form it is described by

    x_i(s) = H(s) v_i(s)    (64)

Five common choices are

    H(s) = 1    (65)
    H(s) = 1/s    (66)
    H(s) = 1/(1 + s)    (67)
    H(s) = 1/(p_1 s + p_2)    (68)
    H(s) = exp(-s τ)    (69)

Clearly, the first three are special cases of the fourth.

 - The non-dynamic nonlinear function returns the output

    f(x_i)    (70)

in terms of the input x_i. The function f(·) corresponds to the previously mentioned activation functions (see for example Fig. 8).

The three components of the neuron can be combined in various ways. For example, if the neurons are all non-dynamic (H(s) = 1), then an array of neurons can be written as the set of algebraic equations obtained by combining (62), (64), and (70):

    X = A F(X) + B U    (71)

where X is an N x 1 vector. If, on the other hand, each neuron has integrator dynamics (H(s) = 1/s), then an array of neurons can be written as the set of differential equations

    Ẋ = A F(X) + B U    (72)

Clearly, the solutions of (71) form possible steady state solutions of (72). The behaviour of such a network depends on the interconnection matrix A and on the form of H(s). We shall use the model now to give an alternative formulation of the back-propagation algorithm. In a static two-layer feedforward network, H(s) = 1. In order to study more closely how the calculations proceed, Fig. 15 shows a signal flow graph of a simple network with one input, two nodes, and one output. It is a two layer network with offsets. The arrows are weighted (sometimes nonlinearly) and the nodes are summation points. Generalising to any number of nodes,

    X = A F(X) + B U
    Y = C F(X)    (73)

Keeping the two-layer structure, the X, Y, and U vectors are partitioned according to the layers, and the forward pass of the back-propagation algorithm can be written in detail as

    [X_1; X_2] = [0, 0; A_21, 0] [f(X_1); f(X_2)] + [B_11, B_12; 0, B_22] [U_1; u_2],  u_2 = 1    (74)

    Y = [0, I] [f(X_1); f(X_2)]    (75)

(rows of a block matrix are separated by semicolons). The subscripts denote the layers in the network, such that the hidden layer has subscript 1 and the output layer has subscript 2. The A and B matrices have a block structure, and the matrix A_21 holds the weights from layer 1 to layer 2; I is the identity matrix. The inputs are partitioned according to the real inputs U_1 and one additional input u_2 = 1, which is used for the offsets of all nodes. The input matrix B_11 holds the weights from the input nodes U_1 to the hidden layer, B_12 is a vector of offset weights for the first layer, and B_22 is for the second layer. The forward pass proceeds in three iterations, after the vector X is initialised to 0.

    Iteration 1:  X_1 = B_11 U_1 + B_12,  X_2 = B_22
    Iteration 2:  X_1 = B_11 U_1 + B_12,  X_2 = A_21 f(B_11 U_1 + B_12) + B_22
    Iteration 3:  Y = f(A_21 f(B_11 U_1 + B_12) + B_22)

Each iteration progresses one level farther into the flow graph, and after three iterations the outputs are reached. The calculation is straightforward to implement. The backward pass is somewhat tricky, at least regarding the understanding of the difference between errors and gradients. Figure 16 is a flow graph of a backward pass in our simple network.


Figure 15: Signal flow graph of a forward pass in a two node network.

Figure 16: Signal flow graph of a backward pass showing the difference between errors and gradients (δ_1 = f'(x_1) e_1 with e_1 = a_21 δ_2; δ_2 = f'(x_2) e_2 with e_2 = d - y).


The network is the same as in Fig. 15, except the arrows are reversed to emphasise the direction of the calculations. The input is now the error, which is the difference between the desired value and the network output, e_2 = d - y. The error is multiplied by the derivative of the activation function in the current operating point to produce the first local gradient δ_2. The error corresponding to node 1 is then e_1 = a_21 δ_2. The errors are used in the following matrix calculations, similar to state variables in the state space model. Reversing the arrows corresponds to transposing the matrices. The backward pass can thus be expressed as

    E = A^T [f'(X) E] + C^T (D - Y)    (76)
    E_u = B^T [f'(X) E]

The backward pass also proceeds in three iterations. The vector E is initialised to 0.

    Iteration 1:  E_1 = 0,  E_2 = D - Y
    Iteration 2:  E_1 = A_21^T [f'(X_2) (D - Y)],  E_2 = D - Y
    Iteration 3:  E_u1 = B_11^T [f'(X_1) E_1]
                  E_u2 = B_12^T [f'(X_1) E_1] + B_22^T [f'(X_2) E_2]

The expression in square brackets [·] is a product of two vectors, but it is element-by-element, rather than a vector dot product, such that f'(X(1)) is multiplied by E(1), f'(X(2)) by E(2), and so on, producing a column vector. The bottom system of equations is not used in practice, but is included for completeness; it expresses the back-propagation of the error all the way to the inputs. Note that f'(X) E is the vector of local gradients δ. For implementation purposes (76) can be written in more detail,

    [E_1; E_2] = [0, A_21^T; 0, 0] [f'(X_1) E_1; f'(X_2) E_2] + [0; I] (D - Y)    (77)
    E_u = [B_11^T, 0; B_12^T, B_22^T] [f'(X_1) E_1; f'(X_2) E_2]

Once the backward pass is over, the weights can be updated. Both matrices A and B have to be updated, according to the delta rule,

    A := A + ΔA
    B := B + ΔB

Or, in more detail, with the local gradients δ_1 = f'(X_1) E_1 and δ_2 = f'(X_2) E_2,

    ΔA = [0, 0; η δ_2 f(X_1)^T, 0]
    ΔB = [η δ_1 U_1^T, η δ_1; 0, η δ_2]

The vectors δ_1 and δ_2 are column vectors, and the products δ_2 f(X_1)^T and δ_1 U_1^T are outer products, i.e., the results are matrices of the same shape as A_21 and B_11 respectively. The model is for a two-layer, fully connected network. It should be fairly easy to extend it to more layers than two.
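A numpy sketch of one forward and one backward pass in this matrix form, for an arbitrary two-layer network; the logistic activations, the random initial matrices, the target values, and the learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n1, n2 = 3, 4, 2                     # inputs, hidden neurons, outputs

# Blocks of eq. (74): X1 = B11 U1 + B12,  X2 = A21 f(X1) + B22
B11 = rng.normal(scale=0.5, size=(n1, M))
B12 = rng.normal(scale=0.5, size=(n1, 1))    # offsets, layer 1
A21 = rng.normal(scale=0.5, size=(n2, n1))
B22 = rng.normal(scale=0.5, size=(n2, 1))    # offsets, layer 2

f  = lambda x: 1.0 / (1.0 + np.exp(-x))      # activation function
df = lambda x: f(x) * (1.0 - f(x))           # its derivative

U1 = rng.normal(size=(M, 1))                 # one input pattern
D  = np.array([[0.2], [0.9]])                # desired outputs (example values)

# Forward pass, iterations 1-3
X1 = B11 @ U1 + B12
X2 = A21 @ f(X1) + B22
Y  = f(X2)                                   # eq. (75)

# Backward pass, eqs. (76)-(77)
E2 = D - Y
d2 = df(X2) * E2                             # local gradients, layer 2
E1 = A21.T @ d2
d1 = df(X1) * E1                             # local gradients, layer 1

# Delta-rule updates; the products below are outer products
eta = 0.5
A21 += eta * d2 @ f(X1).T
B22 += eta * d2
B11 += eta * d1 @ U1.T
B12 += eta * d1
```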



3 Practical issues

It is attractive that neural networks can model nonlinear relationships. From the treatment above, it may even seem easy to obtain a model of an unknown plant. In practice, however, several difficulties jeopardise the development, such as deciding the necessary and sufficient number of nodes in a network. There is no rigid mathematics behind such design choices, but this section provides a few rules of thumb.



3.1 Rate of learning

The smaller we make the learning parameter η, the smaller the changes to the weights will be, but then the learning takes longer. If η is too large, the learning may become unstable. According to Sørensen (1994) it is normal practice to gradually decrease η from 0.1 to 0.001.

The back-propagation algorithm may reach only a local minimum of the error function, and the search may be slow, especially near the minimum. A simple way to improve the situation, and yet avoid instability, is to modify the delta rule by including a momentum term

    Δw_ji(k) = η δ_j y_i + α Δw_ji(k - 1)    (78)

where α is usually a positive constant called the momentum constant. The effect is a lowpass filtering of the increments of the weight updates, thus reducing the risk of getting stuck in a local minimum. The following rules of thumb apply.

 - For the learning to be stable, the momentum constant must be in the range 0 <= |α| < 1. When α is zero, the back-propagation algorithm operates without momentum. Note that α can be negative, although it is unlikely to be used in practice. According to Sørensen (1994) α is typically chosen to be 0.2, 0.3, ..., 0.9.

 - When the partial derivative ∂E/∂w_ji has the same sign on consecutive iterations, the sum Δw_ji grows in magnitude, and so the weight is adjusted by a large amount. Hence the momentum term tends to accelerate the descent.

 - When the partial derivative ∂E/∂w_ji has opposite signs on consecutive iterations, the sum Δw_ji shrinks in magnitude, and the weight is adjusted by a small amount. Hence the momentum term has a stabilising effect in directions that alternate in sign.

The momentum term may also prevent the learning from terminating in a shallow local minimum on the error surface; a code sketch of the update (78) is given below. The learning parameter η has been assumed constant, but in reality it should be connection dependent. We may in fact constrain any number of weights to remain fixed by simply making the learning rate η_ji for weight w_ji equal to zero.
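A minimal sketch of the momentum update (78), written with the gradient ∂E/∂w in place of the local gradient (η δ_j y_i corresponds to -η ∂E/∂w_ji); the toy problem is an invented illustration:

```python
def momentum_step(w, dw_prev, grad, eta=0.1, alpha=0.5):
    """One update with momentum, cf. (78): dw = -eta*grad + alpha*dw_prev."""
    dw = [-eta * g + alpha * d for g, d in zip(grad, dw_prev)]
    return [wi + di for wi, di in zip(w, dw)], dw

# Toy quadratic error E = 1/2 * sum((w_i - 0.5)^2); its gradient is (w_i - 0.5).
w, dw = [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    grad = [wi - 0.5 for wi in w]
    w, dw = momentum_step(w, dw, grad)
print([round(wi, 3) for wi in w])   # both weights approach 0.5
```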



3.2 Pattern and batch modes of training

Back-propagation may proceed in one of two basic ways:

 - Pattern mode. Weights are updated after each training example; that is how it was presented here. To be precise, consider an epoch consisting of K training examples (patterns) arranged in the order [U_1, D_1], ..., [U_K, D_K]. The first example [U_1, D_1] is presented to the network, and the forward and backward computations are performed. Then the second example [U_2, D_2] is presented, and the forward and backward computations are repeated. This continues until the last example.

 - Batch mode. Weights are updated after presenting all the training examples; that constitutes an epoch.

From an on-line point of view, where the data are presented as time series, the pattern mode uses less storage. Moreover, given a random presentation of the patterns, the search in weight space becomes stochastic in nature, and the back-propagation is less likely to get trapped in a local minimum. The batch mode, however, is appropriate off-line, as it provides a more accurate estimate of the gradient vector. The choice between the two methods therefore depends on the problem at hand.



3.3 Initialisation

A good choice for the initial values of the weights can be important. The customary practice is to initialise all weights randomly, uniformly distributed within a small range of values. The wrong choice of initial weights can lead to saturation, where the value of E remains constant for some period of time during the learning process, corresponding to a saddle point in the error surface. Assuming sigmoidal activation functions, this happens if the input to one (or several) activation functions is numerically so large that it operates on the tails of the sigmoids. Here the slope is rather small; the neuron is in saturation. Consequently the adjustments to the weights of the neuron will be small, and the network may take a long time to escape from it. The following observations have been made (Lee, Oh & Kim in Haykin, 1994):

 - Incorrect saturation is avoided by choosing the initial values of the weights inside a small range of values.
 - Incorrect saturation is less likely to occur when the number of hidden neurons is low.
 - Incorrect saturation rarely occurs when the neurons operate in their linear regions.



3.4 Scaling

The training data most often have to be scaled. The data are measured in physical units, which are often of quite different orders of magnitude. This is inconvenient when the activation functions are defined on standard universes. In the case of s-shaped activation functions, there is a risk that neurons will operate on the tails of the activation function, causing saturation phenomena and increased training time. Another problem is that the error function E favours elements accidentally measured in a unit which implies a large magnitude. Scaling may avoid these problems. One way is to map the data such that the scaled signals have a mean value m of zero and a standard deviation s of one, i.e.,

    X' = (X - m) / s

where X is a set of physical measurements, and X' is the corresponding vector in the scaled world of the network.
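The scaling in code; a minimal numpy sketch (the measurement values are invented for illustration):

```python
import numpy as np

def standardise(X):
    """Scale each signal (column) to zero mean and unit standard deviation."""
    m = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - m) / s, m, s        # keep m and s to undo the scaling later

X = np.array([[1000.0, 0.01], [1200.0, 0.03], [900.0, 0.02]])
Xs, m, s = standardise(X)
print(Xs.mean(axis=0).round(12), Xs.std(axis=0))   # approx 0 and exactly 1
```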



3.5 Stopping criteria

In principle, training continues until the weights stabilise and the average squared error over the entire training set converges to some minimum value. There are actually no well-defined criteria for stopping the algorithm's operation, but there are some reasonable criteria. It is logical to look for a zero gradient, since the gradient of the error surface is zero in a minimum. A stopping criterion could therefore be to stop when the Euclidean norm of the gradient vector is sufficiently small. Learning may take a long time, however, and it requires computing the gradient vector g(W). Another possibility is to exploit the fact that the error function is stationary at a minimum. The stopping criterion could then be to stop when the absolute rate of change in the average squared error per epoch is sufficiently small. A typical value is 0.1 to 1 percent per epoch; sometimes a value as small as 0.01 percent per epoch is used.



3.6 The number of training examples

Back-propagation learning starts with presenting a training set to the network. Hopefully the network is designed to generalise, that is, it is able to perform a nonlinear interpolation. The network generalises well when its output is nearly correct when presented with a test data set never used in training. The network can interpolate mainly because continuous activation functions lead to continuous output functions.

If a network learns too many input-output relations in the presence of noise, i.e., there is overfitting, it is probably because it uses too many hidden neurons. When this happens the error is low; it looks like a desirable situation, but it is not. In the presence of noise it is undesirable to fit the data points perfectly; a smoother approximation is more correct. A high error goal stops a network from overtraining, because it gives the network some tolerance in the fit. The opposite effect, underfitting, is also possible. It occurs when there are too few weights; then the network cannot produce outputs reasonably close to the desired outputs. An example is illustrated in Fig. 17.

Figure 17: Due to underfitting, a network (solid line) has not been able to model the variations in the training data (+).

Overfitting and underfitting are affected by the size and the efficiency of the training set, and by the architecture of the network. If the architecture is fixed, then the right size of the training set must be determined. For the case of a network containing a single hidden layer, used as a binary classifier, some guidelines are available (Baum & Haussler in Haykin, 1994). The network will almost certainly provide generalisation if

1. the fraction of errors made on the training set is less than ε/2, and

2. the number of examples, K, used in training is

    K >= (32 W / ε) ln(32 M / ε)    (79)

Here M is the number of hidden neurons, W is the total number of weights in the network, ε is the ratio of the permissible error to the desired output, and ln is the natural logarithm. The equation provides a worst case formula for estimating the training set size for a single-layer neural network; in practice there can be a huge gap between the actual size of the training set needed and that predicted by (79). In practice, all we need is to satisfy the condition

    K >= W / ε    (80)

Thus, with an error ratio of, say, 0.1 (10 percent), the number of training examples should be approximately 10 times the number of weights in the network.



3.7 The number of layers and neurons

A single layer perceptron forms half-plane decision regions (Lippmann, 1987). A two-layer perceptron can form any, possibly unbounded, convex region in the input space. Such regions are formed from intersections of the half-plane regions formed by each node in the first layer of the multi-layer perceptron. These convex regions have at most as many sides as there are nodes in the first layer. The number of nodes must be large enough to form a decision region that is as complex as required. It must not, however, be so large that the weights required cannot be readily estimated from the available training data. A three layer perceptron (with two hidden layers) can form arbitrarily complex decision regions. Thus no more than three layers are required in perceptron networks.

The number of nodes in the second (hidden) layer must be greater than one when decision regions are disconnected or meshed and cannot be formed from one convex region. The number of second layer nodes required in the worst case is equal to the number of disconnected regions in the input space. The number of nodes in the first layer must typically be sufficient to provide three or more edges for each convex area generated by every second layer node. There should thus typically be more than three times as many nodes in the first layer as in the second.

The above concerns primarily multi-layer perceptrons with one output, when hard limiters (signum functions) are used as activation functions. Similar rules apply to multi-output networks, where the resulting class corresponds to the output node with the largest output. For control applications, as opposed to classification, Sørensen (1994) chooses to consider only multi-layer perceptrons with one hidden layer, since that is sufficient.



4 How to model a dynamic plant

To return to the learning problem posed in the introduction, forward learning is the procedure of training a neural network to represent the dynamics of a plant (Fig. 1). The neural network is placed in parallel with the plant, and the error between the system and the network outputs, the prediction error, is used as the training signal. In the case of a multi-layer perceptron, the network can be trained by back-propagating the prediction error. But how does the network allow for the dynamics of the plant? One possibility is to introduce dynamics into the network itself. This can be done using internal feedback in the network (a recurrent network) or by introducing transfer functions into the neurons. A straightforward approach is to augment the network input with signals corresponding to past inputs and outputs. Assume that the plant is governed by a nonlinear discrete-time difference equation

    y^p(t + 1) = f[y^p(t), ..., y^p(t - n + 1), u(t), ..., u(t - m + 1)]    (81)

Thus the system output y^p at time t + 1 is a nonlinear function f of the past n output values and of the past m values of the input u. The superscript p refers to the plant. An immediate idea is to choose the input-output structure of the network equal to the believed structure of the system. Denoting the output of the network as y^m, where the superscript m refers to the model, we then have

    y^m(t + 1) = f̂[y^p(t), ..., y^p(t - n + 1), u(t), ..., u(t - m + 1)]    (82)

Figure 18: Experimental step response data for training.

Here f̂ represents the nonlinear input-output map of the network (i.e. the approximation f̂ of f). Notice that the input to the network includes the past values of the plant output (the network has no memory). This dependence is not included in Fig. 1, for simplicity. If we assume that after a suitable training period the network gives a good representation of the plant (i.e. y^m ≈ y^p), then for subsequent post-training purposes the network output itself can be fed back and used as part of the network input. In this way the network can be used independently of the plant. This may be preferred when there is noise, since it avoids problems of bias caused by noise in the plant. A pure delay of n_d samples in the plant can be directly incorporated. In the case of (81) the equation becomes

    y^p(t + 1) = f[y^p(t), ..., y^p(t - n + 1), u(t - n_d + 1), ..., u(t - n_d - m + 2)]

Example 9 (first order plant). Given a dynamic plant with one input u and one output y, we wish to model it with a neurofuzzy network using forward learning. Its step response to a unit step input u(t) = 1 (0 <= t <= 100) will be used as training data; it is plotted in Fig. 18. Assume the plant can be modelled using a continuous time Laplace transfer function with one time constant τ,

    y = 1/(1 + sτ) u    (83)

The discrete time zero-order-hold equivalent is

    y(k + 1) = a y(k) + (1 - a) u(k)    (84)

where a is a constant that depends on τ and the sampling time. The network inputs and output are, cf. (82),

    Input 1 = y(k)    (85)
    Input 2 = u(k)    (86)
    Output = y(k + 1)    (87)

Figure 19: Parallel coupling with a matrix (the plant input U feeds both a constant gain matrix C and a network N; their outputs y_C and y_N are summed to form ŷ).

2QH PD\ DUJXH DVVXPLQJ WKDW  LV DQ DFFXUDWH PRGHO WKDW WKH ILUVW URZ GHWHUPLQHV d DQG WKH SUREOHP LV VROYHG ZLWKRXW QHXUDO QHWV +RZHYHU LQ FDVH WKH SODQW FRQWDLQV D  QRQOLQHDULW\ QRW DFFRXQWHG IRU LQ   D VLQJOH QHXURQ ZLWK WZR LQSXWV DQG RQH RXWSXW PLJKW PRGHO WKH SODQW EHWWHU   If a nonlinear plant is known to have partly linear, partly nonlinear gains from the inputs to the outputs, a network can be trained to model just the nonlinear behaviour of the plant (Srensen, 1994). The network is coupled in parallel with a known constant gain matrix & and the network and & are both fed with the data X(k), the input data to the plant (Fig. 19). The outputs from the network are \Q +n, and from the matrix \F +n,> and \Q +n, @ I +X+n,> w, \F +n, @ &X+n, (89) (90)

If a nonlinear plant is known to have partly linear, partly nonlinear gains from the inputs to the outputs, a network can be trained to model just the nonlinear behaviour of the plant (Sørensen, 1994). The network is coupled in parallel with a known constant gain matrix C, and the network and C are both fed with the data U(k), the input data to the plant (Fig. 19). The outputs from the network are Y_N(k) and from the matrix Y_C(k), with

    Y_N(k) = F(U(k), w)    (89)
    Y_C(k) = C U(k)    (90)

The vector w contains all the weights and offsets in the network, ordered in some way. The total output of the arrangement is

    ŷ(k) = Y_N(k) + Y_C(k)    (91)
         = F(U(k), w) + C U(k)    (92)

The total gain matrix is

    N(k) = dŷ_i(k) / du_j(k)    (93)

where index i runs over all outputs and index j over all inputs. The gain matrix N, evaluated at a particular input U(k) at time instance k, is called the Jacobian matrix of the function ŷ from U to y. Each element is the gain on a small change of an input u_j, and the gain in position (i, j) refers to the signal path from input u_j to output y_i. A further computation shows that

    N(k) = d F(U(k), w) / dU^T(k) + C    (94)
         = N_N(k) + C    (95)

where N_N(k) is the gain matrix of the network. Equations (92) and (95) simply show that the combined parallel structure has a total output equal to the sum of the outputs from the nonlinear network and the linear matrix, and that the combined gain matrix is the sum of the two individual gain matrices. The configuration inside the dotted box in Fig. 19 behaves as an ordinary network, and can be trained as a network.



5 Summary

Back-propagation is an example of supervised learning. That means the network is shown (by a teacher) a desired output for each example from a training set. A performance measure is the sum of the squared errors, visualised as an error surface. For the system to learn (from the teacher), the operating point is forced to move toward a minimum point, local or global; the negative gradient points in the direction of steepest descent. Given an adequate training set and enough time, the network is usually able to perform classification and function approximation satisfactorily. A network can be described in terms of matrices, resulting in a model similar in form to the state-space model known from control engineering. A dynamic plant can be modelled using a static nonlinear regression in the discrete time domain.

References
Agarwal, M. (1997). A systematic classification of neural-network-based control, IEEE Control Systems 17(2): 75-93.

Demuth, H. and Beale, M. (1992). Neural Network Toolbox. For Use with Matlab, The MathWorks, Inc., Natick, MA, USA.

Fukuda, T. and Shibata, T. (1992). Theory and applications of neural networks for industrial control systems, IEEE Transactions on Industrial Electronics 39(6): 472-489.

Haykin, S. (1994). Neural Networks. A Comprehensive Foundation, Macmillan College Publishing Company, Inc., 866 Third Ave, New York, NY 10022.

Hunt, K., Sbarbaro, D., Zbikowski, R. and Gawthrop, P. (1992). Neural networks for control systems - a survey, Automatica 28(6): 1083-1112.

Lippmann, R. (1987). An introduction to computing with neural nets, IEEE ASSP Magazine, pp. 4-22.

Meier, W., Weber, R. and Zimmermann, H.-J. (1994). Fuzzy data analysis - methods and industrial applications, Fuzzy Sets and Systems 61: 19-28.

MIT (1995). CITE. Literature and Products Database, MIT GmbH / ELITE, Promenade 9, D-52076 Aachen, Germany.

MIT (1997). DataEngine. Part II: Tutorials, MIT GmbH, Promenade 9, D-52076 Aachen, Germany.

Nørgaard, P. M. (1996). System Identification and Control with Neural Networks, PhD thesis, Technical University of Denmark, Dept. of Automation, Denmark.

Nørgaard, P. M. (n.d.a). NNCTRL Toolkit, Technical University of Denmark: Dept. of Automation, http://www.iau.dtu.dk/research/control/nnctrl.html.

Nørgaard, P. M. (n.d.b). NNSYSID Toolbox, Technical University of Denmark: Dept. of Automation, http://www.iau.dtu.dk/Projects/proj/nnhtml.html.

Sørensen, O. (1994). Neural Networks in Control Applications, PhD thesis, Aalborg University, Institute of Electronic Systems, Dept. of Control Engineering, Denmark. ISSN 0908-1208.
