Multilayer Perceptron and Uppercase Handwritten Characters Recognition

Ir. Gosselin Bernard
Service de Théorie des Circuits et de Traitement du Signal,
Faculté Polytechnique de Mons, Belgium
Abstract:

After an introduction on the problem of automatic character
recognition and on the multilayer perceptron used for
classification, we describe what one can hope to get from a
multilayer perceptron. We also describe some of the problems
that can occur during the training and present a fast learning
algorithm. This algorithm was tested to train a multilayer
perceptron to recognise multiscriptor uppercase handwritten
characters. The system has reached a recognition rate of
88.1 %, without any contextual analysis, which is still
indispensable but will be easier because the multilayer
perceptron provides the probability of each class being the
unknown character.

1. Introduction

An automatic character recognition system must be able to
analyse a set of points that represent a scanned character in
order to identify the character and to associate the
corresponding ASCII code.

Once a character is scanned and isolated from the other ones,
it is associated with a vector in the Euclidean space $R^n$.
These vectors can be collected in a finite number of sets, each
one corresponding to a particular class. Sometimes, for some of
these classes, there is overlapping (figure 1.1., classes U and
V, for example).

figure 1.1.

In this case, the best classification is performed by assigning
the class $S_i$ to a character represented by a vector $X$ if
the a posteriori probability of the class $S_i$ given the
vector $X$ is higher than for any other class:

$$P(S_i \mid X) > P(S_j \mid X) \qquad \forall j \neq i$$

It is the Bayes classifier which provides the optimal decision
frontiers, but the analytic forms of these frontiers are too
difficult to compute. That is why we try to find these
frontiers by using a neural network.

2. The multilayer perceptron (MLP)
2.1. The artificial neuron model

Proposed in 1943 by McCulloch and Pitts, this neuron model is a
very simple processor which computes a weighted sum of its
inputs (figure 2.1.). This sum is compared to a threshold, and
the output is a non-linear function of the result [1]. The most
often used function is probably the sigmoid function, because
it is differentiable (figure 2.2.).

figure 2.1.

figure 2.2.
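As a minimal sketch of the neuron model described in section 2.1., the following Python code computes the weighted sum of the inputs, compares it to the threshold and applies the sigmoid. The function and parameter names (`neuron_output`, `weights`, `threshold`) are illustrative assumptions and do not come from the paper.

```python
import math


def sigmoid(a):
    """Logistic sigmoid: smooth and differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(y):
    """Derivative of the sigmoid, expressed from its output y."""
    return y * (1.0 - y)


def neuron_output(inputs, weights, threshold):
    """Weighted sum of the inputs, compared to a threshold, then squashed."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return sigmoid(activation)
```

When the weighted sum equals the threshold, the output sits exactly in the middle of the sigmoid range (0.5), which is also where the derivative `y * (1 - y)` is largest.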


0-8186-4960-7/93 $3.00 © 1993 IEEE

The output $y$ of the neuron is:

$$y = f\left(\sum_{i=1}^{N+1} w_i x_i\right), \qquad f(a) = \frac{1}{1 + e^{-a}}$$

where:
• $N$ is the number of inputs of the neuron,
• $x_i$, $i = 1, \dots, N$ are the inputs,
• $w_i$, $i = 1, \dots, N$ are the weights,
• $x_{N+1}$ has always the value -1,
• $w_{N+1}$ is the value of the threshold.

2.2. The multilayer perceptron

In this model of neural network, neurons are organised in
layers and the ones which belong to the same layer have the
same inputs. The outputs of the neurons of a layer are the
inputs of the neurons of the next layer and there is no
feedback in the network (figure 2.3.). A multilayer perceptron
may contain one or more hidden layers. In our case, as the
network is used as a classifier, the inputs of the neural
network will be the components of the vector which represents a
character, and the output layer will contain one neuron for
each class.

figure 2.3.

2.3. The backpropagation training algorithm

The training of the multilayer perceptron is supervised and
consists of repeating the following steps:
• present a character and its class to the network,
• compute the actual outputs of the network and compare them to
  the desired ones, which are 1 for the neuron associated to the
  class of the presented character, and 0 for all the other
  neurons,
• update the weights to bring the actual outputs closer to the
  desired ones.

The training of the neural network consists of minimising the
error according to the weights connecting the neurons together.
The error, or cost function, can be defined as follows:

$$E = \frac{1}{2} \sum_{k=1}^{M} \sum_{i=1}^{C} (t_{ik} - y_{ik})^2$$

where $C$ is the number of classes, $M$ is the total number of
characters in the training set, and $y_{ik}$ and $t_{ik}$ are
respectively the actual output and the desired one of the
$i$-th neuron when the $k$-th character is presented to the
network.

This minimisation is done by using the steepest gradient
method, which provides the Back Propagation training algorithm.
The modification of a weight connecting the $j$-th neuron of
layer $l-1$ to the $i$-th neuron of layer $l$ is:

$$\Delta w_{ij}^{l} = \alpha \, \delta_i^{l} \, y_j^{l-1}$$

where $\alpha$ is the updating factor, and $\delta_i^{l}$ is:

if $l$ is the last layer:

$$\delta_i^{l} = y_i (1 - y_i)(t_i - y_i)$$

elsewhere:

$$\delta_i^{l} = y_i^{l} (1 - y_i^{l}) \sum_{k} \delta_k^{l+1} w_{ki}^{l+1}$$

When a momentum term is included, the weights are updated at
the time $t$ by:

$$\Delta w_t = -\alpha \left(\frac{\partial E}{\partial w}\right)_t + \beta \, \Delta w_{t-1}, \qquad 0 < \beta < 1$$

This often increases the speed of convergence.

2.4. Interpretation of the hidden units

But what is the interest of the hidden layer(s)? The sigmoid
function of a neuron (figure 2.2.) can be written as a function
of its inputs:

$$y = \frac{1}{1 + \exp\left(-\sum_{i=1}^{N+1} w_i x_i\right)}$$

The Taylor series expansion of this expression is:

$$y = a_0 + \sum_{i=1}^{N+1} a_i x_i + \sum_{i=1}^{N+1} \sum_{j=1}^{N+1} a_{ij} x_i x_j + \sum_{i=1}^{N+1} \sum_{j=1}^{N+1} \sum_{k=1}^{N+1} a_{ijk} x_i x_j x_k + \dots$$

The non-linear function thus generates a linear combination of
all the possible cross-products of the inputs and, in the case
of binary inputs ($x_i^k = x_i$, $k > 0$), every power of an
input reduces to the input itself. The coefficients $a$ of each
cross-product depend on the values of the weights which connect
the neurons together [2]. It is thus possible to take high
order moments into account.
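The training steps of section 2.3. (present a character, compute the outputs, update the weights) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the squared-error cost with a factor 1/2, the helper names (`init_layers`, `zeros_like`, `forward`, `train_step`) and the default values of `alpha` and `beta` are not taken from the paper.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_layers(sizes, rng):
    """Small random weights; sizes = [inputs, hidden..., outputs] (assumed layout)."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def zeros_like(layers):
    """Storage for the previous weight modifications (momentum term)."""
    return [[[0.0] * len(row) for row in weights] for weights in layers]


def forward(x, layers):
    """Outputs of every layer; the last weight of each neuron is its threshold."""
    outputs = [list(x)]
    for weights in layers:
        inputs = outputs[-1] + [-1.0]  # extra input fixed at -1 carries the threshold
        outputs.append([sigmoid(sum(w * v for w, v in zip(row, inputs)))
                        for row in weights])
    return outputs


def train_step(x, target, layers, previous, alpha=0.5, beta=0.9):
    """Present one character; return the squared error measured before the update."""
    outputs = forward(x, layers)
    # Last layer: delta_i = y_i (1 - y_i) (t_i - y_i)
    all_deltas = [[y * (1.0 - y) * (t - y) for y, t in zip(outputs[-1], target)]]
    # Other layers: delta_i = y_i (1 - y_i) sum_k delta_k w_ki (next layer)
    for l in range(len(layers) - 1, 0, -1):
        ys = outputs[l]
        deltas = [ys[i] * (1.0 - ys[i]) *
                  sum(d * layers[l][k][i] for k, d in enumerate(all_deltas[0]))
                  for i in range(len(ys))]
        all_deltas.insert(0, deltas)
    # Update with momentum: dw(t) = alpha * delta_i * input_j + beta * dw(t-1)
    for l, weights in enumerate(layers):
        inputs = outputs[l] + [-1.0]
        for i, row in enumerate(weights):
            for j in range(len(row)):
                step = alpha * all_deltas[l][i] * inputs[j] + beta * previous[l][i][j]
                row[j] += step
                previous[l][i][j] = step
    return 0.5 * sum((t - y) ** 2 for y, t in zip(outputs[-1], target))
```

A typical use would create the network with `init_layers([100, n_hidden, 26], random.Random(0))`, allocate the momentum storage with `zeros_like`, and call `train_step` once per character of the training set, with a target of 1 for the correct class and 0 elsewhere.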

3. The training of the MLP
3.1. Introduction

For an MLP used for classification, with one output for each
class, with all targets set to zero except for the correct
class where it is unity, with one hidden layer containing
enough neurons, and if the training does not get stuck at a
local minimum, the output values of the MLP will approximate
the a posteriori probabilities of each class [2]. But in
practice, there is always a limitation due to the number of
parameters, and there is no guarantee that the optimal output
values can be reached. Moreover, the training is the
minimisation of a cost function according to a great number of
variables: training is then often critical.

3.2. Local minima

The first problem of training is the one due to local minima.
For example, for a neuron of the output layer:

$$\delta_i = y_i (1 - y_i)(t_i - y_i)$$

Figure 3.1. represents the evolution of the two first terms of
$\delta_i$. We see that if the target for neuron $i$ is 1 and
the actual output is very low, the value of $\delta_i$ will be
very small and there will be too little updating of the
weights. If the actual output is zero, there will be no more
updating and the algorithm gets stuck at a local minimum.

figure 3.1.

To avoid this problem, another cost function can be used
(relative entropy criteria [3]), and then:

$$\delta_i = t_i - y_i$$

3.3. Influence of the updating factor

If the absolute value of the derivative of the error according
to a weight $w_i$ is high, the error is then very sensitive to
a variation of $w_i$. It is then advised not to update the
weight $w_i$ too much (and to use a small value of updating
factor). Conversely, if the absolute value of the derivative of
the error is small, it is advised to use a high value of
updating factor. In the classical BP algorithm, the updating
factor results from a compromise between these two situations.
If we want to improve the speed of convergence of the
algorithm, we have to use a different value of updating factor
for each weight [4].

But how to choose these values? Assuming we have the following
differential equation:

$$\frac{dx}{dt} = -x^{k}, \qquad 0 < k < 1$$

the variable $x$ will reach the value zero at a finite time,
from any initial condition. If we update the weights so that
the error follows such a law, it should also reach zero at a
finite time. We have:

$$\Delta E = E_{t+1} - E_t \approx \sum_{i=1}^{N} \left(\frac{\partial E}{\partial w_i}\right)_t \Delta w_i$$

where $N$ is the total number of weights. As we would like to
have:

$$\Delta E = -E^{k}$$

by setting:

$$\Delta w_i = -\alpha_i \left(\frac{\partial E}{\partial w_i}\right)_t$$

it turns into:

$$\sum_{i=1}^{N} \alpha_i \left(\frac{\partial E}{\partial w_i}\right)_t^{2} = E^{k}$$

and we have to choose:

$$\alpha_i = \frac{E^{k}}{N \left(\frac{\partial E}{\partial w_i}\right)_t^{2}}$$

We still train the MLP by following the gradient, but now each
weight has its own updating factor. Moreover, as shown in
figure 3.2., we now get a decreasing value of the updating
factor when the sensitivity of the error to the weight
increases. If the absolute value of
$(\partial E / \partial w_i)_t$ is too small, the updating of
the weight will be very high, and a limitation must be
introduced.

figure 3.2.

This algorithm has shown its ability to train multilayer
perceptrons by being more than 10 times faster than the
classical backpropagation algorithm.
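As a rough sketch of the per-weight updating factor of section 3.3., the function below computes the weight modifications for $\alpha_i = E^k / (N (\partial E / \partial w_i)^2)$. Several points are assumptions made only for illustration: the exponent name `k` and its default value, the clipping bound `max_step` used as the limitation (the paper does not give its form), and the choice of leaving a weight unchanged when its gradient is numerically zero.

```python
def accelerated_steps(error, gradients, k=0.5, max_step=1.0, eps=1e-12):
    """Weight modifications dw_i = -alpha_i * g_i, alpha_i = E^k / (N * g_i^2)."""
    n = len(gradients)
    steps = []
    for g in gradients:
        if abs(g) < eps:
            # Assumption: a (numerically) zero gradient leaves the weight unchanged.
            steps.append(0.0)
            continue
        alpha = error ** k / (n * g * g)  # decreases when the sensitivity |g| grows
        step = -alpha * g                 # equals -E^k / (N * g)
        # Limitation (assumed form): clip the huge steps caused by tiny gradients.
        steps.append(max(-max_step, min(max_step, step)))
    return steps


def predicted_error_change(gradients, steps):
    """First-order estimate dE = sum_i g_i * dw_i."""
    return sum(g * s for g, s in zip(gradients, steps))
```

When no step is clipped, the first-order change of the error predicted by `predicted_error_change` is exactly `-error ** k`, which is the law the error is asked to follow. The clipping trades this property for stability when a derivative is close to zero.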

3.4. The cross-validation

When the evolution of the recognition rate as a function of the
number of iterations is observed, the two curves in figure 3.3.
are obtained, where the validation set contains characters
which are not involved in the training. While the recognition
rate on the training set always increases, the one observed on
the validation set decreases if the number of iterations is too
high. This is due to the fact that at the beginning of the
training, the MLP is learning the characteristics of each
class. Later, it begins to model the characters of the training
set themselves, with their own noise. The learning must thus be
stopped when the recognition rate on the validation set is
maximum [2].

figure 3.3. (curves: training set and validation set, as a
function of the number of iterations)

4. Results

The training was made with a database containing 250 sets of
the 26 uppercase letters, hand-written by 50 persons. The tests
were performed by using 100 sets of the 26 uppercase letters,
hand-written by 20 other persons. After scanning, the
characters were normalised in size and represented by a 10 by
10 matrix of floating values between 0 and 1. There are then
100 neurons in the input layer. As there are 26 classes, there
are also 26 neurons in the output layer.

Several tests were made to find convenient values for the
number of hidden neurons (graph 4.1.). From the graph 4.1., we
can see that when the number of hidden units increases, the
recognition rate increases too. But when this number is too
high, there are too many parameters in the MLP, and the
training set becomes too small to perform the learning. The
recognition rate on the test set then decreases.

Graph 4.1.: Recognition rate as a function of the number of
hidden units.

The best recognition rate on the test set is 88.1 %. It is
worth noticing that this recognition rate is obtained without
any contextual analysis, which is still indispensable to get a
convenient recognition system. We have made tests on humans,
presenting them some isolated characters, so that they couldn't
use contextual analysis either, and they got a recognition rate
of only 94 % on the same test set!

5. Conclusions

Comparisons between the performances of the multilayer
perceptron and the ones of humans have shown that the
multilayer perceptron has a good ability for hand-written
character recognition. And, as the outputs of the MLP are an
image of the a posteriori probability of each class, it will
make further contextual analysis easier.

A drawback of the MLP is that the training phase still takes a
long time (about 50 hours on a 486/33 PC for the examples in
graph 4.1.). But, by using a very large database, one can get a
multiscriptor recognition system, and then training is only
made once.

Bibliography:

[1] Lippmann R., "An Introduction To Computing with Neural
    Nets", IEEE ASSP Magazine, volume 4, number 2, April 1987,
    pp 4-22.
[2] Bourlard H., "Neural Networks - Theory and Parallels with
    Conventional Algorithms", Manuscript, Lernout & Hauspie
    Speechproducts, 1991.
[3] Solla S. A., Levin E. & Fleisher M., "Accelerated Learning
    in Layered Neural Networks", Manuscript, AT&T Bell Labs,
    1988.
[4] Sanossian H. & Evans D., "An Acceleration Method for the
    Backpropagation Learning Algorithm", Proc. of the 4th
    International Conference On Neural Networks & Their
    Applications, Nîmes, France, November 1991.