
Lecture: 5

Dr. Tahani Abdalla Attia

Dr. Tahani Abdalla 2010 NN and FL -5th Software Engg. 1

Supervised Learning

[Figure: supervised learning with a teacher]

Supervised Learning

The figure above depicts supervised learning with a teacher. In this paradigm the learning algorithm is given a set of input/output pattern pairs, and the weights are adjusted so that the network will produce the required output in the future. In this example the algorithm would be given a set of pictures of animals that the teacher classifies as spider, insect, lizard, or other. If the network is shown a spider but classifies it as a lizard, the weights are adjusted to make the network respond "spider".
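The adjust-on-error scheme described above can be sketched as a minimal supervised update loop. Everything concrete below (the feature vectors, the learning rate, the linear scoring) is an illustrative assumption, not part of the lecture:

```python
CLASSES = ["spider", "insect", "lizard", "other"]
N_FEATURES = 4

# One weight vector per class, all starting at zero.
weights = {c: [0.0] * N_FEATURES for c in CLASSES}

def classify(x):
    """Pick the class whose weight vector scores the input highest."""
    scores = {c: sum(w * xi for w, xi in zip(weights[c], x)) for c in CLASSES}
    return max(scores, key=scores.get)

def train_step(x, label, lr=0.1):
    """Teacher-driven update: if the network answers 'lizard' when the
    teacher says 'spider', move the weights toward the correct class."""
    guess = classify(x)
    if guess != label:
        for j in range(N_FEATURES):
            weights[label][j] += lr * x[j]   # strengthen the right answer
            weights[guess][j] -= lr * x[j]   # weaken the wrong answer

# Made-up feature vectors standing in for pictures of animals.
spider_features = [1.0, 0.2, 0.0, 0.5]
insect_features = [0.1, 1.0, 0.6, 0.0]
for _ in range(10):
    train_step(spider_features, "spider")
    train_step(insect_features, "insect")
```

After a few teacher-corrected passes the network answers both examples correctly; only wrong answers trigger a weight change, exactly as the paragraph describes.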



Learning Rules:

Mean Square Error: Like the perceptron learning rule, the least mean square (LMS) algorithm is an example of supervised training, in which the learning rule is provided with a set of examples of desired network behavior:

{p1, t1}, {p2, t2}, ..., {pQ, tQ}


Here pq is an input to the network and tq is the corresponding target output. As each input is applied to the network, the network output is compared to the target. The error is calculated as the difference between the target output and the network output. We want to minimize the average of the sum of these squared errors:

mse = (1/Q) Σ_{k=1..Q} e(k)^2 = (1/Q) Σ_{k=1..Q} (t(k) - a(k))^2

The LMS algorithm adjusts the weights and biases of the linear network so as to minimize this mean square error. Fortunately, the mean square error performance index for the linear network is a quadratic function. Thus the performance index will either have one global minimum, a weak minimum, or no minimum, depending on the characteristics of the input vectors. Specifically, the characteristics of the input vectors determine whether or not a unique solution exists.

LMS Algorithm (Delta Rule, Adaline Rule, Widrow-Hoff Rule): The LMS or Widrow-Hoff learning algorithm is based on an approximate steepest descent procedure. Here again, linear networks are trained on examples of correct behavior. Widrow and Hoff had the insight that they could estimate the mean square error by using the squared error at each iteration.

If we take the partial derivative of the squared error with respect to the weights and biases at the kth iteration, we have:
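The derivation equations on these slides were images and did not survive extraction. A reconstruction of the standard LMS derivation they follow (same symbols as above; the vector form is an assumption about the lost slides):

```latex
\[
\frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\,\frac{\partial e(k)}{\partial w_{1,j}},
\qquad
\frac{\partial e^2(k)}{\partial b} = 2e(k)\,\frac{\partial e(k)}{\partial b}.
\]
Since $e(k) = t(k) - a(k) = t(k) - \bigl(\mathbf{w}^{T}\mathbf{p}(k) + b\bigr)$,
\[
\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k),
\qquad
\frac{\partial e(k)}{\partial b} = -1,
\]
which gives the LMS (Widrow--Hoff) updates with learning rate $\alpha$:
\[
\mathbf{w}(k+1) = \mathbf{w}(k) + 2\alpha\, e(k)\,\mathbf{p}(k),
\qquad
b(k+1) = b(k) + 2\alpha\, e(k).
\]
```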

Next, look at the partial derivative of the error:



Here the error e and the bias b are vectors, and α is the learning rate. If α is large, learning occurs quickly, but if it is too large it may lead to instability, and the errors may even increase. To ensure stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix pTp of the input vectors.
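As a concrete sketch, the delta-rule updates above can be implemented for a single linear neuron. The data and learning rate below are made-up illustrations, not from the lecture:

```python
def lms_train(samples, lr=0.05, epochs=500):
    """Train a single linear neuron a = w.p + b with the LMS
    (Widrow-Hoff) delta rule:
        e = t - a;  w <- w + 2*lr*e*p;  b <- b + 2*lr*e
    """
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for p, t in samples:
            a = sum(wj * pj for wj, pj in zip(w, p)) + b  # network output
            e = t - a                                     # error
            for j in range(n):
                w[j] += 2 * lr * e * p[j]
            b += 2 * lr * e
    return w, b

# Made-up pairs {p, t} generated from t = 2*p1 - p2 + 0.5.
data = [([1.0, 0.0], 2.5), ([0.0, 1.0], -0.5),
        ([1.0, 1.0], 1.5), ([2.0, 1.0], 3.5)]
w, b = lms_train(data)
```

Because the targets here are exactly linear in the inputs, the weights settle near w = [2, -1], b = 0.5; with a learning rate above the stability bound in the text, the same loop would diverge.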

Finding the minimum of a function: gradient descent

Gradient descent on an error surface


Backpropagation: Once the network weights and biases have been initialized, the network is ready for training. The network can be trained for function approximation (nonlinear regression), pattern association, or pattern classification. The training process requires a set of examples of proper network behavior: network inputs p and target outputs t. The default performance function for feedforward networks is the mean square error (MSE), the average squared error between the network outputs a and the target outputs t.

20 . which involves performing computations backwards through the network.Several different training algorithms for feedforward networks use the gradient of the performance function to determine how to adjust the weights to minimize performance. Tahani Abdalla 2010 NN and FL -5th Software Engg. The gradient is determined using a technique called Backpropagation. Dr.


Backpropagation Algorithms: There are many variations of the backpropagation algorithm. The simplest implementation of backpropagation learning updates the network weights and biases in the direction in which the performance function decreases most rapidly, the negative of the gradient. One iteration of this algorithm can be written as

x(k+1) = x(k) - α(k) g(k)

where x(k) is a vector of current weights and biases, g(k) is the current gradient, and α(k) is the learning rate.

Backpropagation Algorithms: There are two different ways in which this gradient descent algorithm can be implemented:
Incremental mode: the gradient is computed and the weights are updated after each input is applied to the network.
Batch mode: all of the inputs are applied to the network before the weights are updated.

Batch Training: In batch mode the weights and biases of the network are updated only after the entire training set has been applied to the network. The gradients calculated at each training example are added together to determine the change in the weights and biases.

Batch Gradient Descent: In the batch steepest descent training function the weights and biases are updated in the direction of the negative gradient of the performance function. There is only one training function associated with a given network.

Batch Gradient Descent with Momentum: steepest descent with momentum often provides faster convergence. Momentum allows a network to respond not only to the local gradient, but also to recent trends in the error surface. Acting like a low-pass filter, momentum allows the network to ignore small features in the error surface. Without momentum a network may get stuck in a shallow local minimum; with momentum a network can slide through such a minimum.
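The momentum update can be sketched in a few lines. The toy quadratic surface and the constants lr and mc below are illustrative assumptions, not from the lecture:

```python
def gd_momentum(grad, x0, lr=0.02, mc=0.9, steps=200):
    """Batch gradient descent with momentum:
        dx <- mc*dx - lr*grad(x);  x <- x + dx
    The momentum constant mc low-pass filters the updates, so the
    trajectory keeps moving through small features of the surface."""
    x = list(x0)
    dx = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            dx[i] = mc * dx[i] - lr * g[i]
            x[i] += dx[i]
    return x

# Toy quadratic "error surface" f(w) = w1^2 + 10*w2^2 (illustrative).
grad = lambda w: [2 * w[0], 20 * w[1]]
w = gd_momentum(grad, [3.0, 1.0])
```

With mc = 0 this reduces to plain batch gradient descent; with mc near 1 each update is mostly the previous update plus a small gradient correction, which is exactly the low-pass filtering described above.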

Question: Compare the backpropagation algorithm with the Hill Climbing search method, given the following definition: Hill Climbing search is a simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima.

Faster Training: The previous two backpropagation training methods are often too slow for practical problems. There are several high-performance algorithms that can converge from ten to one hundred times faster than the algorithms mentioned earlier. All of the algorithms in this section operate in batch mode. These faster algorithms fall into two main categories:

Faster Training: The first category uses heuristic techniques, which were developed from an analysis of the performance of the standard steepest descent algorithm. One heuristic modification is the momentum technique, which was presented in the previous section. This section discusses two more heuristic techniques: variable learning rate backpropagation and resilient backpropagation.

Faster Training: The second category of fast algorithms uses standard numerical optimization techniques. There are three types of numerical optimization techniques for neural network training: conjugate gradient, quasi-Newton, and Levenberg-Marquardt.

Variable Learning Rate: With standard steepest descent, the learning rate is held constant throughout training. The performance of the algorithm is very sensitive to the proper setting of the learning rate. If the learning rate is set too high, the algorithm may oscillate and become unstable; if it is too small, the algorithm will take too long to converge. It is not practical to determine the optimal setting for the learning rate before training, and, in fact, the optimal learning rate changes during the training process as the algorithm moves across the performance surface.

Variable Learning Rate: The performance of the steepest descent algorithm can be improved if we allow the learning rate to change during the training process. An adaptive learning rate attempts to keep the learning step size as large as possible while keeping learning stable. The learning rate is made responsive to the complexity of the local error surface. An adaptive learning rate requires some changes in the training procedure.
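A sketch of one common adaptive rule: accept a step and grow the rate while the error behaves, and discard the step and shrink the rate when the error jumps. The increase/decrease factors and the 4% allowed error growth are typical defaults assumed here, not given in the lecture:

```python
def adaptive_gd(f, grad, x, lr=0.01, lr_inc=1.05, lr_dec=0.7,
                max_inc=1.04, steps=300):
    """Steepest descent with a variable learning rate: keep the step
    size as large as possible while learning stays stable."""
    err = f(x)
    for _ in range(steps):
        g = grad(x)
        trial = [xi - lr * gi for xi, gi in zip(x, g)]
        new_err = f(trial)
        if new_err > err * max_inc:
            lr *= lr_dec              # unstable: reject the step, slow down
        else:
            if new_err < err:
                lr *= lr_inc          # stable and improving: speed up
            x, err = trial, new_err   # accept the step
    return x, lr

f = lambda w: w[0] ** 2 + 10 * w[1] ** 2     # toy performance surface
grad = lambda w: [2 * w[0], 20 * w[1]]
w, final_lr = adaptive_gd(f, grad, [3.0, 1.0])
```

The rate ratchets up until a step overshoots, gets cut back, and climbs again, so it hovers near the largest stable value for the local error surface.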

Resilient Backpropagation: Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called "squashing" functions, since they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slope must approach zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions, since the gradient can have a very small magnitude and therefore cause only small changes in the weights and biases, even though the weights and biases are far from their optimal values.

Resilient Backpropagation: The purpose of the resilient backpropagation training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor (inc) whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations.

Resilient Backpropagation: The update value is decreased by a factor (dec) whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, the update value remains the same. Whenever the weights are oscillating, the weight change is reduced; if the weight continues to change in the same direction for several iterations, the magnitude of the weight change is increased.
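The sign-only rule above can be sketched directly. The factors 1.2 and 0.5 are common Rprop defaults assumed here, and the toy surface is illustrative:

```python
def rprop_step(grads, prev_grads, deltas, w,
               inc=1.2, dec=0.5, dmax=50.0, dmin=1e-6):
    """One Rprop iteration: only the sign of each partial derivative is
    used. The per-weight update value delta grows by `inc` while the
    derivative keeps its sign, and shrinks by `dec` when the sign flips
    (i.e. the weight is oscillating)."""
    for i, g in enumerate(grads):
        s = g * prev_grads[i]
        if s > 0:                 # same sign for two successive iterations
            deltas[i] = min(deltas[i] * inc, dmax)
        elif s < 0:               # sign change: reduce the step
            deltas[i] = max(deltas[i] * dec, dmin)
        if g > 0:
            w[i] -= deltas[i]     # step against the gradient sign
        elif g < 0:
            w[i] += deltas[i]
    return w, deltas

# Badly scaled toy surface f(w) = w1^2 + 1000*w2^2: the huge gradient
# magnitudes in the second coordinate are irrelevant to Rprop.
grad = lambda w: [2 * w[0], 2000 * w[1]]
w, deltas, prev = [3.0, 1.0], [0.1, 0.1], [0.0, 0.0]
for _ in range(100):
    g = grad(w)
    w, deltas = rprop_step(g, prev, deltas, w)
    prev = g
```

This is exactly the remedy for the vanishing-sigmoid-slope problem: a tiny derivative still produces a full-sized step as long as its sign is right.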

Conjugate Gradient Algorithms: The basic backpropagation algorithm adjusts the weights in the steepest descent direction (the negative of the gradient). This is the direction in which the performance function decreases most rapidly. It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence.

Conjugate Gradient Algorithms: In the conjugate gradient algorithms a search is performed along conjugate directions, which generally produces faster convergence than the steepest descent directions.
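A sketch of the general form of these searches; the Fletcher-Reeves choice of β shown here is one common variant, and the specific variant is not named in the lecture:

```latex
\[
\mathbf{p}_0 = -\mathbf{g}_0,
\qquad
\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k,
\qquad
\mathbf{p}_k = -\mathbf{g}_k + \beta_k \mathbf{p}_{k-1},
\]
where $\alpha_k$ is chosen by a line search along $\mathbf{p}_k$ and, for
Fletcher--Reeves,
\[
\beta_k = \frac{\mathbf{g}_k^{T}\mathbf{g}_k}{\mathbf{g}_{k-1}^{T}\mathbf{g}_{k-1}}.
\]
```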

Quasi-Newton Algorithm: Newton's method is an alternative to the conjugate gradient methods for fast optimization. The basic step of Newton's method is

x(k+1) = x(k) - A(k)^-1 g(k)

where A(k) is the Hessian matrix (second derivatives) of the performance index at the current values of the weights and biases.

Quasi-Newton Algorithms: Newton's method often converges faster than conjugate gradient methods. Unfortunately, it is complex and expensive to compute the Hessian matrix for feedforward neural networks. There is a class of algorithms that is based on Newton's method but does not require the calculation of second derivatives. These are called quasi-Newton (or secant) methods. They update an approximate Hessian matrix at each iteration of the algorithm; the update is computed as a function of the gradient.
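The general shape of these methods, written out; the symbol B for the Hessian approximation and the secant condition (which updates such as BFGS satisfy) are our additions, not from the slides:

```latex
\[
\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{B}_k^{-1}\,\mathbf{g}_k,
\]
where $\mathbf{B}_k$ approximates the Hessian using gradients only, and each
quasi-Newton (secant) update is chosen so that
\[
\mathbf{B}_{k+1}\,\mathbf{s}_k = \mathbf{y}_k,
\qquad
\mathbf{s}_k = \mathbf{x}_{k+1}-\mathbf{x}_k,
\quad
\mathbf{y}_k = \mathbf{g}_{k+1}-\mathbf{g}_k .
\]
```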

Levenberg-Marquardt: Like the quasi-Newton methods, the Levenberg-Marquardt algorithm was designed to approach second-order training speed without having to compute the Hessian matrix. When the performance function has the form of a sum of squares (as is typical in training feedforward networks), the Hessian matrix can be approximated as

H = J^T J

Levenberg-Marquardt: and the gradient can be computed as

g = J^T e

where J is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights and biases, and e is a vector of network errors.

Levenberg-Marquardt: The Jacobian matrix can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix. The Levenberg-Marquardt algorithm uses this approximation to the Hessian matrix in the following Newton-like update:

x(k+1) = x(k) - [J^T J + μI]^-1 J^T e

When the scalar μ is zero, this is just Newton's method, using the approximate Hessian matrix.

Levenberg-Marquardt: When the scalar μ is zero, this is just Newton's method, using the approximate Hessian matrix. When μ is large, this becomes gradient descent with a small step size. Newton's method is faster and more accurate near an error minimum, so the aim is to shift towards Newton's method as quickly as possible. Thus, μ is decreased after each successful step (reduction in the performance function) and is increased only when a tentative step would increase the performance function.

Levenberg-Marquardt: In this way, the performance function is always reduced at each iteration of the algorithm. The main drawback of the Levenberg-Marquardt algorithm is that it requires the storage of some matrices that can be quite large for certain problems. The size of the Jacobian matrix is Q×n, where Q is the number of training sets and n is the number of weights and biases in the network. It turns out that this matrix does not have to be computed and stored as a whole.

Levenberg-Marquardt: For example, if we were to divide the Jacobian into two equal submatrices, we could compute the approximate Hessian matrix by summing a series of subterms. Once one subterm has been computed, the corresponding submatrix of the Jacobian can be cleared. This is called reduced-memory Levenberg-Marquardt.
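A runnable sketch of the full μ-adjustment loop on a tiny two-parameter fit. The model y = a·exp(b·x), the synthetic data, and the factors of 10 used to adjust μ are illustrative assumptions:

```python
import math

def lm_fit(xs, ts, params, mu=0.01, iters=50):
    """Levenberg-Marquardt for the toy model y = a*exp(b*x): use
    H ~ J^T J and the step d = -(J^T J + mu*I)^-1 J^T e, shrinking mu
    after a successful step and growing it after a failed one."""
    def errors(p):
        a, b = p
        return [t - a * math.exp(b * x) for x, t in zip(xs, ts)]

    def sse(e):
        return sum(ei * ei for ei in e)

    e = errors(params)
    for _ in range(iters):
        a, b = params
        # Jacobian of the errors: de/da = -exp(b*x), de/db = -a*x*exp(b*x)
        J = [(-math.exp(b * x), -a * x * math.exp(b * x)) for x in xs]
        # Form J^T J + mu*I (a 2x2 system) and the gradient J^T e.
        h11 = sum(r[0] * r[0] for r in J) + mu
        h12 = sum(r[0] * r[1] for r in J)
        h22 = sum(r[1] * r[1] for r in J) + mu
        g1 = sum(r[0] * ei for r, ei in zip(J, e))
        g2 = sum(r[1] * ei for r, ei in zip(J, e))
        # Solve (J^T J + mu*I) d = -J^T e by the 2x2 inverse.
        det = h11 * h22 - h12 * h12
        d1 = -(h22 * g1 - h12 * g2) / det
        d2 = -(h11 * g2 - h12 * g1) / det
        trial = (a + d1, b + d2)
        e_trial = errors(trial)
        if sse(e_trial) < sse(e):      # successful step: act more like Newton
            params, e, mu = trial, e_trial, mu * 0.1
        else:                          # failed step: act more like gradient descent
            mu *= 10
    return params

# Synthetic data generated from y = 2*exp(0.5*x) (made-up example).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ts = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = lm_fit(xs, ts, (1.0, 1.0))
```

The accept/reject test on the sum of squared errors is what guarantees the "performance function is always reduced" property stated in the text.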

Summary of Backpropagation Algorithms: There are several algorithm characteristics that can be deduced from the experiments. In general, on function approximation problems, for networks that contain up to a few hundred weights, the Levenberg-Marquardt (LM) algorithm will have the fastest convergence. This advantage is especially noticeable if very accurate training is required. In many cases, the LM algorithm is able to obtain lower mean square errors than any of the other algorithms tested. However, as the number of weights in the network increases, the advantage of LM decreases.

Summary of Backpropagation Algorithms: In addition, LM performance is relatively poor on pattern recognition problems, and its storage requirements are larger than those of the other algorithms tested. By adjusting the training parameters, the storage requirements can be reduced, but at a cost of increased execution time. Resilient backpropagation is the fastest algorithm on pattern recognition problems. However, it does not perform well on function approximation problems, and its performance degrades as the error goal is reduced.

Summary of Backpropagation Algorithms: The memory requirements of the resilient backpropagation algorithm are relatively small in comparison to the other algorithms considered. The conjugate gradient algorithms seem to perform well over a wide variety of problems, particularly for networks with a large number of weights. The scaled conjugate gradient (SCG) algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and is almost as fast as resilient backpropagation on pattern recognition problems.

Summary of Backpropagation Algorithms: The conjugate gradient algorithms have relatively modest memory requirements. The quasi-Newton backpropagation performance is similar to that of the LM algorithm. It does not require as much storage as LM, but the computation required increases geometrically with the size of the network, since the equivalent of a matrix inverse must be computed at each iteration. The variable learning rate algorithm is usually much slower than the other methods, but it can still be useful for some problems.

Question: What are the recent backpropagation algorithms used with neural networks?

