SCE4101

Multi-layer Perceptrons
Computational Intelligence

Warren Gauci (529190(M))

Multi-layer perceptrons: Selection of a multi-layer perceptron for a specific data classification task

Warren Gauci

Abstract – This paper considers the different parameters of multi-layer perceptron architectures and suggests a suitable architecture to complete a specific data set classification task. Rigorous testing with variable hidden layer size, learning rate and test sets leads to a neural architecture with the best performance measures. Performance measures are evaluated by the use of the confusion matrix, the mean square error plot and the receiver operating characteristic chart. Results obtained in this paper are based on the MathWorks Neural Network Toolbox software, and all testing was performed on an Iris data set. This paper will contribute to further advancements in the field of neuron training and in the field of distinguishing and classifying linearly and non-linearly separable data.

1. INTRODUCTION

An artificial neural network (ANN) is an information-processing system that has certain performance characteristics in common with biological neural networks, and which can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques; for a good overview refer to [3]. In most cases the structure and size of an ANN are chosen a priori or after the ANN 'learns' a given task. In general, ANNs are scalable, i.e. they have parameters that can be changed. Two questions that are relevant in this case are:

• What size and structure is necessary to solve a given task?
• What size and structure is sufficient to solve a given task?

This paper gives an answer to the first question. The second question is also relevant but is not in the scope of this paper. This paper deals with the structure of perceptron architectures. The objective of this paper is to find the MLP parameters that lead to the best MLP architecture for the classification of a given data set.

2. BACKGROUND THEORY

A simple neuron is a device with many inputs and one output. The neuron has two modes of operation, the training mode and the using mode. A more sophisticated neuron is presented in the McCulloch and Pitts model (MCP). The difference from the simple neuron model is that the inputs are 'weighted': the contribution of each input to the decision depends on the weight of that particular input. All weighted inputs are added together and, if the sum exceeds a pre-set threshold value, the neuron fires. An adjustment to the MCP model led to the formulation of the 'perceptron', a term coined by Frank Rosenblatt. A perceptron is an MCP model with additional, fixed, preprocessing, and this kind of neuron is best suited for pattern recognition (see [1]). Perceptrons may be grouped in single-layer or multi-layer architectures. Single-layer architectures are restricted to classifying only linearly separable data; thus in this paper only multi-layer perceptron (MLP) networks are used, as the connections from one layer to the next allow for the recognition and classification of non-linearly separable data. In fact an MLP model can approximate functions to any given accuracy. For a comprehensive overview of other kinds of networks refer to [2].
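As an illustration of the threshold behaviour just described, the following minimal MATLAB sketch implements a single MCP-style unit. The weights, threshold and input pattern are arbitrary values chosen purely for illustration and do not come from this paper.

    % Minimal sketch of a McCulloch-Pitts style threshold unit.
    % All numeric values below are arbitrary illustrative choices.
    w     = [0.4 -0.2 0.7];    % one weight per input
    theta = 0.5;               % pre-set firing threshold
    x     = [1 0 1];           % example input pattern

    a = sum(w .* x);           % weighted sum of the inputs
    if a > theta
        y = 1;                 % the neuron 'fires'
    else
        y = 0;                 % the neuron stays inactive
    end
    fprintf('Weighted sum = %.2f, output = %d\n', a, y);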

The ANN must be trained using a learning process. This process involves the memorisation of patterns and the subsequent response of the neural network, and it can be categorised into two paradigms, namely associative mapping and regularity detection. Learning is performed by updating the value of the weights associated with each input. In order to train the ANN to perform a classification task, some kind of weight-adjusting technique must be set. The methodology used in this paper makes use of an adaptive network, in which the neurons found in the input layer are capable of changing their weights. The adaptive network is introduced to a supervised learning procedure, where each neuron knows the target output and adjusts the weights of its input signals to minimise the error. The error is stipulated using a least mean square convergence technique. This paper considers a back-propagation technique in which the error derivative with respect to the weights (EW) is computed; in this way the network is able to calculate how the error will change as each weight is increased or decreased slightly. The behaviour of an ANN also depends on the input-output transfer function, which is specified for the units. This paper makes use of sigmoid units, where the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater resemblance to real neurons than linear units do.

3. MATERIALS

3.1 NEURAL NETWORK TOOLBOX

The Neural Network Toolbox provides functions that allow the modelling of complex nonlinear systems. The toolbox was used in the development of this paper to design, train, simulate and assess different ANN architectures, and it allowed for the division of the data set into training, validation and test sets. The performance of the different ANNs was assessed using the performance plots provided by this software. The pattern recognition tool was applied to an Iris dataset (see 3.2).

3.2 DATASET

The dataset used for classification is the Iris data set. Created by R. A. Fisher, this data set is a classic in the field of pattern recognition. It contains 3 classes of 50 instances each, where each class refers to a type of Iris plant: Setosa, Versicolour and Virginica. Each instance has four attributes: sepal length, sepal width, petal length and petal width. One class is linearly separable from the other two, while the latter are not linearly separable from each other.

4. METHOD

The best structure of MLP to perform the given classification task was determined using an empirical procedure. The method allowed for the variation of all the variable parameters: input weights, epoch limit, learning rate and hidden layer size. The function that changes the number of neurons in the hidden layer was used to change the MLP architecture. This method enumerated a total of 80 samples in the selection process of the best ANN architecture. The method may be divided into three sections: collection of data, running the traingd algorithm, and choice of best data. The specific steps performed are presented in a flow chart in Appendix A.
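Before the three sections of the method are described, the listing below sketches how such a pattern-recognition MLP could be configured in the Neural Network Toolbox. The 70% : 15% : 15% division, the traingd training function, the learning rate of 5, the epoch limit of 2000 and the hidden layer size are the values quoted in this paper; the use of patternnet and iris_dataset, and all variable names, are assumptions made for illustration, since the original scripts are not reproduced here.

    % Illustrative reconstruction of the network set-up (not the original script).
    [x, t] = iris_dataset;                   % 4 attributes x 150 samples, 3 target classes

    hiddenSize = 35;                         % hidden layer size (varied between 5 and 100 in the method)
    net = patternnet(hiddenSize, 'traingd'); % MLP with sigmoid-type (tansig) hidden units, gradient descent training

    net.divideFcn = 'dividerand';            % random division into training/validation/test sets
    net.divideParam.trainRatio = 0.70;
    net.divideParam.valRatio   = 0.15;
    net.divideParam.testRatio  = 0.15;

    net.trainParam.lr     = 5;               % learning rate quoted in the paper
    net.trainParam.epochs = 2000;            % epoch limit quoted in the paper

    [net, tr] = train(net, x, t);            % training stops when the validation error stops improving
    y = net(x);                              % network outputs used for the performance plots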

4.1 COLLECTION OF DATA

• The Iris dataset was divided into training, validation and test data [70% : 15% : 15%].
• A hidden layer size was chosen.
• The traingd algorithm was run 10 times, saving and assessing the performance plots of each sample.
• The procedure was repeated for 5 different hidden layer sizes [5, 10, 35, 60, 100].

The best learning rate and epoch limit values were determined in a pre-test and remained fixed during this method (learning rate of 5 and epoch limit of 2000). This section of the method had the following outcomes: the production of 10 samples for each of the 5 hidden layer sizes (50 samples in total) and the determination of the best data for each hidden layer size.

4.2 RUNNING THE TRAINGD ALGORITHM

• Assign and initialise the weights of the input data.
• Define the learning rate and epoch limit.
• Train the network using the pre-defined training set.
• Evaluate the network using the validation data.
• Update the weights and terminate when the validation error is at a minimum.

All of these steps were performed by the traingd function in the software. The outcome of this section of the method was the determination of the minimum mean square error of the validation data set for each sample taken in the previously defined method.

Table 1: Mean square error of different hidden layer sizes (validation data set). Rows: samples 1–10, minimum mse, average mse and standard deviation; columns: hidden layer sizes 5, 10, 35, 60 and 100. Choice of best data: HL size 35, lr = 5, epoch limit = 2000 (Iris and Thyroid datasets).
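The statistics summarised in Table 1 could be gathered with a loop of the following form. This is a hedged sketch of the sampling procedure of sections 4.1 and 4.2 under the parameter values stated above; the array names and the use of patternnet are assumptions, as the original scripts are not reproduced in this paper.

    % Illustrative sketch of the sampling loop behind Table 1 (not the original script).
    [x, t]   = iris_dataset;
    hlSizes  = [5 10 35 60 100];             % hidden layer sizes tested
    nSamples = 10;                           % traingd runs per hidden layer size
    valMSE   = zeros(nSamples, numel(hlSizes));

    for j = 1:numel(hlSizes)
        for i = 1:nSamples
            net = patternnet(hlSizes(j), 'traingd');
            net.divideParam.trainRatio = 0.70;
            net.divideParam.valRatio   = 0.15;
            net.divideParam.testRatio  = 0.15;
            net.trainParam.lr          = 5;
            net.trainParam.epochs      = 2000;
            net.trainParam.showWindow  = false;   % suppress the training window
            [net, tr] = train(net, x, t);
            valMSE(i, j) = min(tr.vperf);         % minimum validation mse of this run
        end
    end

    minMSE  = min(valMSE);                   % minimum mse per hidden layer size
    meanMSE = mean(valMSE);                  % average mse per hidden layer size
    stdMSE  = std(valMSE);                   % standard deviation per hidden layer size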

4.3 CHOICE OF BEST DATA

• Upload the best sample for each hidden layer size from the collection-of-data method.
• Determine the best overall sample and the corresponding hidden layer size (using the average and standard deviation values).
• Work out another 10 samples using the best determined hidden layer size.
• Select the best sample overall using the validation and test performance plots.
• Save the parameters of the best sample and try this ANN architecture on a new set of data.

The sample with the minimum mse was chosen considering the validation and test data plots. This section of the method allowed for the determination of the overall best ANN architecture using another set of samples.

5. RESULTS

The most relevant results are tabulated in Table 1. All results were obtained using the following percentage ratios for the training, validation and test sets: 70% : 15% : 15%. Results are also based on a learning rate of 5 and an epoch limit of 2000; these values were justified by a pre-testing procedure using the same dataset. The best ANN architecture, taking into account the mean square error of both the validation and test data, is that containing 35 neurons in the hidden layer. This choice is based on the average and standard deviation values of the mean square error. Figure 1 and Figure 2 show the mse versus epoch plot and the confusion matrix for the best data sample. Figure 1 shows that, for the best data sample, validation converged with an mse of 2.0241e-08 at 18 epochs. The confusion matrix in Figure 2 shows perfect classification in the training and validation data but a misclassification of 4.3% in the test data. This shows that an error of classification still occurred even though training was executed perfectly. The latter is reinforced in the ROC plots, which show true positives and no false positives for the training and validation data, and a few false positives in the test data. Results also show this architecture applied to a different data set that has the same number of classes but more attributes.

Figure 1 – Mse vs epoch plot
Figure 2 – Confusion matrix

6. DISCUSSION

The samples with a hidden layer size of 35 were initially not those with the minimum average and standard deviation values. Further samples were taken and more concrete results were obtained, which was proof of good and consistent data. This architecture also gave a consistent mse value when tested on the Thyroid dataset, with a test error of 9e-09.
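The cross-dataset check mentioned in the discussion can be reproduced along the following lines: the chosen 35-neuron architecture is retrained on the Thyroid data supplied with the toolbox, which also has three classes but more attributes. This is again an illustrative sketch under the paper's stated parameter values, not the original script.

    % Illustrative sketch: re-using the chosen 35-neuron architecture on the Thyroid data.
    [x, t] = thyroid_dataset;                % 21 attributes, 3 classes
    net = patternnet(35, 'traingd');
    net.divideParam.trainRatio = 0.70;
    net.divideParam.valRatio   = 0.15;
    net.divideParam.testRatio  = 0.15;
    net.trainParam.lr     = 5;
    net.trainParam.epochs = 2000;
    [net, tr] = train(net, x, t);

    y = net(x);
    plotconfusion(t, y);                     % confusion matrix, as in Figure 2
    plotroc(t, y);                           % receiver operating characteristic chart
    fprintf('Minimum validation mse: %g\n', min(tr.vperf));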

7. CONCLUSION

It may be concluded that although the results are not always satisfactory, consistency is present only in considerably small-sized hidden layer networks. Furthermore, the results show that classes 2 and 3 are the classes containing non-linearly separable data, but their classification is random and not always consistent. It may also be concluded that a specific MLP architecture can be chosen for a particular classification task.

REFERENCES

[1] F. Lewis, T. Parisini, D. Prokhorov, D. Wunsch, Huang, and Griffin, IEEE Trans. Neural Networks, vol. 18, no. 4, pp. 969–972, July 2007.
[2] S. Haykin, Neural Networks, New York: Macmillan College Publishing Company, 1994.
[3] M. Nørgaard, O. Ravn, N. K. Poulsen, and L. K. Hansen, Neural Networks for Modelling and Control of Dynamic Systems, M. J. Grimble and M. A. Johnson, Eds. London: Springer-Verlag, 2000.

Appendix A – Flow chart of the method steps