
3. Least Mean Square (LMS) Algorithm

3.1 Spatial Filtering

The LMS algorithm uses a single linear neuron and can be understood as adaptive filtering:

  output:         y = ∑k wk xk   for k = 1 to p
  error:          e = d − y      where d = desired value
  cost function:  mean squared error J = ½ E[e²]; LMS works with the instantaneous estimate ½ e²(n)

[Figure: single linear neuron with inputs x1 ... xp, weights w1 ... wp, fixed input −1 with weight w0 = θ, and output y]

3.2 Steepest descent

• Setting ∂J/∂wk = 0 determines the optimum weights.
• Adjust the weights iteratively and move along the error surface towards the optimum value:

  wk(n+1) = wk(n) − η (∂J(n)/∂wk)

  i.e. the updated value is proportional to the negative of the gradient of the error surface.
• Substituting the instantaneous gradient estimate gives the LMS update:

  ∴ wk(n+1) = wk(n) + η e(n) xk(n)

[Figure: error surface J plotted against a single weight, showing the gradient ∂J/∂w and the minimum Jmin at the optimum weight w0]

3.3 Properties of LMS
• a stochastic gradient algorithm, in that the gradient vector is 'random', in contrast to steepest descent
• on average, improves in accuracy for increasing values of n
• reduces the storage requirement to the information present in its current set of weights
• can operate in a nonstationary environment

Convergence (proof not given):
1. in the mean: the weight vector → optimum value as n → ∞; requires 0 < η < 2/λmax, where λmax is the max eigenvalue of the autocorrelation matrix Rx = E[x xT]
2. in the mean square: the mean-square of the error signal → constant as n → ∞; requires 0 < η < 2/tr[Rx], where tr[Rx] = ∑k λk ≥ λmax
• Faster convergence is usually obtained by making η a function of n, for example η(n) = c/n for some constant c.
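To make the update rule concrete, here is a minimal LMS sketch in Python/NumPy. The three-tap system, noise level and step size are invented for the illustration; they are not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: identify an unknown 3-tap filter from (input, desired) pairs.
w_true = np.array([0.5, -0.3, 0.2])           # unknown system (assumed for the demo)
X = rng.normal(size=(500, 3))                 # input vectors x(n)
d = X @ w_true + 0.01 * rng.normal(size=500)  # desired response with a little noise

eta = 0.05        # step size; must satisfy 0 < eta < 2/lambda_max for convergence in the mean
w = np.zeros(3)   # initial weights

for x_n, d_n in zip(X, d):
    y_n = w @ x_n               # filter output  y = sum_k w_k x_k
    e_n = d_n - y_n             # error          e = d - y
    w = w + eta * e_n * x_n     # LMS update     w(n+1) = w(n) + eta * e(n) * x(n)

print("estimated weights:", w)  # should approach w_true on average
```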

4. Multilayer Feedforward Perceptron Training

4.1 Back-propagation Algorithm

• Let wji be the weight connecting neuron i (in the previous layer, with output yi) to neuron j.

  error signal:      ej(n) = dj(n) − yj(n)
  net internal sum:  υj(n) = ∑i wji(n) yi(n)   for i = 0 to p
  output:            yj(n) = ϕj(υj(n))
  instantaneous sum of squared errors:  E(n) = ½ ∑j ej²(n)   over all j in the o/p layer

[Figure: signal-flow graph of neuron j with inputs yi, weights wji, activation ϕ(•), output yj, desired response dj and error ej]

• The learning goal is to minimise Eav by adjusting the weights. After calculating the network output in a forward pass, the error is computed and recursively back-propagated through the network in a backward pass.

• For N patterns, the average squared error is Eav = (1/N) ∑n E(n) for n = 1 to N, but instead of Eav the instantaneous estimate E(n) is used on a pattern-by-pattern basis.

• Weight correction (steepest descent):

  ∆wji(n) = − η ∂E(n)/∂wji(n)

  From the chain rule:

  ∂E(n)/∂wji(n) = [∂E(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂υj(n)] [∂υj(n)/∂wji(n)]

  so that

  ∆wji(n) = η δj(n) yi(n)
  weight correction = (learning rate) × (local gradient) × (i/p signal of neuron)

  where the local gradient is δj(n) = − ∂E(n)/∂υj(n).

• Case 1: output node — the local gradient is easily calculated: δj(n) = ej(n) ϕj′(υj(n)).

• Case 2: hidden node — more complex; need to consider neuron j feeding neuron k, where the inputs to neuron j are the yi:

  δj(n) = − [∂E(n)/∂yj(n)] ϕj′(υj(n)) = − ϕj′(υj(n)) ∑k ek(n) [∂ek(n)/∂yj(n)]
        = − ϕj′(υj(n)) ∑k ek(n) [∂ek(n)/∂υk(n)] [∂υk(n)/∂yj(n)]
        = ϕj′(υj(n)) ∑k δk(n) wkj(n)

• Thus δj(n) is computed in terms of the δk(n) of the neurons closer to the output.
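The forward and backward passes above can be sketched for a small two-layer network. This is a minimal illustration, not the full procedure from the notes: the layer sizes, data and learning rate are assumptions, and the logistic activation of Section 4.2 is used so that ϕ′(υ) = y(1 − y).

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Assumed toy sizes: 2 inputs, 3 hidden neurons, 1 output neuron.
W1 = rng.normal(scale=0.5, size=(3, 2)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(1, 3)); b2 = np.zeros(1)
eta = 0.5

def train_step(x, d):
    global W1, b1, W2, b2
    # forward pass
    y1 = logistic(W1 @ x + b1)                  # hidden outputs y_i
    y2 = logistic(W2 @ y1 + b2)                 # network output y_j
    e = d - y2                                  # error signal e_j = d_j - y_j
    # backward pass: local gradients
    delta2 = e * y2 * (1 - y2)                  # output node: delta_j = e_j * phi'(v_j)
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)    # hidden node: phi' * sum_k delta_k * w_kj
    # weight corrections: eta * (local gradient) * (input signal)
    W2 += eta * np.outer(delta2, y1); b2 += eta * delta2
    W1 += eta * np.outer(delta1, x);  b1 += eta * delta1

# example: one pattern-by-pattern update
train_step(np.array([0.0, 1.0]), np.array([1.0]))
```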

4.2 Back-propagation training

Activation function (logistic):

  yj(n) = ϕj(υj(n)) = 1 / (1 + exp(−υj(n)))

  ϕj′(υj(n)) = exp(−υj(n)) / [1 + exp(−υj(n))]² = yj(n) [1 − yj(n)]

Note that the max value of ϕj′(υj(n)) occurs at yj(n) = 0.5 and the min value of 0 occurs at yj(n) = 0 or 1.

Momentum term:

  ∆wji(n) = η δj(n) yi(n) + α ∆wji(n − 1),   0 ≤ |α| < 1

helps locate a more desirable local minimum in a complex error surface:
• no change in the sign of the error gradient ⇒ ∆wji(n) increases and descent is accelerated
• changes in the sign of the error gradient ⇒ ∆wji(n) decreases, which stabilises oscillations
• a large enough α can stop the process terminating in a shallow local minimum
• with momentum, η can be larger

[Figure: example error surface for a single weight]
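A tiny numerical sketch of the momentum term for a single weight, using an invented gradient sequence, just to show that same-sign gradients accumulate while alternating signs are damped.

```python
# Momentum update for a single weight: dw(n) = eta*g(n) + alpha*dw(n-1),
# where g(n) stands for the gradient term delta_j(n)*y_i(n).
# The gradient sequences below are invented purely to show the effect.
eta, alpha = 0.1, 0.9

for label, grads in [("same sign", [1.0, 1.0, 1.0, 1.0]),
                     ("alternating", [1.0, -1.0, 1.0, -1.0])]:
    dw = 0.0
    for g in grads:
        dw = eta * g + alpha * dw   # momentum accumulates or cancels past steps
    print(label, "-> final step", round(dw, 4))
# same sign   -> final step 0.3439   (descent accelerated)
# alternating -> final step -0.0181  (small magnitude: oscillation damped)
```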

4.3 Other perspectives for improving generalisation

4.3.1 Pattern vs Batch Mode
Choice depends on the particular problem:
• randomly updating weights after each pattern requires very little storage and leads to a stochastic search which is less likely to get stuck in local minima
• updating after presentation of all training samples (an epoch) provides a more accurate estimate of the gradient vector since it is based on the average squared error Eav

4.3.2 Stopping criteria
e.g. gradient vector threshold and/or change in average squared error per epoch

4.3.3 Initialisation
• default is a uniform distribution inside a small range of values
• too large values can lead to premature saturation (neuron outputs close to their limits), which gives small weight adjustments even though the error is large

4.3.4 Training Set Size
worst-case formula: N > W/ε, where N = no. of examples, W = no. of synaptic weights, ε = fraction of errors permitted on test

4.3.5 Cross-Validation
• measures generalisation on a test set
• various parameters, including no. of hidden nodes, learning rate and training set size, can be set based on cross-validation performance
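As a quick worked illustration of the worst-case formula, with made-up numbers: a hypothetical 2-10-1 network has W = (2+1)×10 + (10+1)×1 = 41 synaptic weights, so permitting ε = 0.1 on test suggests N > 41/0.1 = 410 examples. A trivial helper:

```python
def worst_case_examples(num_weights: int, error_fraction: float) -> float:
    """Worst-case bound on training-set size: N > W / epsilon."""
    return num_weights / error_fraction

# hypothetical 2-10-1 network: (2+1)*10 + (10+1)*1 = 41 weights, 10% errors permitted
print(round(worst_case_examples(41, 0.1)))   # 410 -> need more than ~410 examples
```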

4.3.6 Network Pruning
by complexity regularisation (two possibilities: network growing and network pruning)
• goal is to find the weight vector that minimises R(w) = Es(w) + λ Ec(w), where
  Es(w) is a standard error measure, e.g. mean square error
  λ is the regularisation parameter
  Ec(w) is the complexity penalty that depends on the network, e.g. ||w||²
• the regularisation term allows identification of weights having an insignificant effect

4.3.7 Other ways of minimising the cost function
• Back-propagation uses a relatively simple, quick approach to minimising the cost function by obtaining an instantaneous estimate of the gradient.
• Methods and techniques from nonlinear optimum filtering and nonlinear function optimisation have been used to provide a more sophisticated approach to minimising the cost function, e.g. Kalman filtering, conjugate-gradient method.

4.4 Universal Approximation Theorem
• a single hidden layer with suitable ϕ gets arbitrarily close to any continuous function
• the logistic function satisfies the ϕ(⋅) definition
• a single hidden layer is sufficient, but the theorem gives no clue on synthesis
• a single hidden layer is restrictive in that hierarchical features are not supported
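A sketch of how the complexity penalty enters a gradient-based weight update, assuming Es(w) is the mean square error and Ec(w) = ||w||² (weight decay); the data, λ and learning rate are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
d = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

lam, eta = 0.1, 0.05
w = np.zeros(5)

for _ in range(200):
    e = d - X @ w
    grad_Es = -2.0 / len(d) * X.T @ e     # gradient of mean square error Es(w)
    grad_Ec = 2.0 * w                     # gradient of complexity penalty Ec(w) = ||w||^2
    w -= eta * (grad_Es + lam * grad_Ec)  # minimise R(w) = Es(w) + lambda * Ec(w)

print(np.round(w, 3))  # weights with an insignificant effect are driven towards zero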

4.5 Example of learning XOR Problem

  x1  x2 | a  b | target
  0   0  | 0  0 | 0
  0   1  | 0  1 | 1
  1   0  | 0  1 | 1
  1   1  | 1  1 | 0

• hidden neuron a: weights 1, 1 from x1, x2 and threshold 1.5 (fires only for input (1,1))
• hidden neuron b: weights 1, 1 from x1, x2 and threshold 0.5 (fires for every input except (0,0))
• output neuron c: weights −2 from a and 1 from b, threshold 0.5, so out = 1 only when b fires and a does not

[Figure: two-layer network realising XOR, and the decision boundaries of neurons a, b and c in the (x1, x2) plane, with the regions labelled out = 0 and out = 1]
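A quick check of the hand-designed network above, using step (threshold) units with the weights and thresholds read off the figure.

```python
def step(v):
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    a = step(1 * x1 + 1 * x2 - 1.5)   # neuron a: fires only for (1,1), threshold 1.5
    b = step(1 * x1 + 1 * x2 - 0.5)   # neuron b: fires except for (0,0), threshold 0.5
    c = step(-2 * a + 1 * b - 0.5)    # neuron c: fires when b fires and a does not
    return c

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0
```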

4.6 Example: vehicle navigation

• video input retina, fully connected to 9 hidden units, fully connected to 45 output units ranging from sharp left to sharp right
• the network computes the steering angle
• training examples are taken from a human driver
• obstacles are detected by a laser range finder

5. Associative Memories

5.1 Linear Associative Memory

  stimulus:  ak = [ak1, ak2, ..., akp]T        response:  bk = [bk1, bk2, ..., bkp]T

           | w11(k)  w12(k)  ...  w1p(k) |
  W(k) =   | w21(k)  w22(k)  ...  w2p(k) |
           |   ...                       |
           | wp1(k)  wp2(k)  ...  wpp(k) |

  response:  bk = W(k) ak

[Figure: single-layer linear network with inputs ak1 ... akp, weights wij(k) and outputs bk1 ... bkp]

Design of the weight matrix for storing q pattern associations ak → bk:

  estimate of the weight matrix:  W = ∑k bk akT   for k = 1 to q   (Hebbian learning principle)

  where bk akT is the outer product of bk = [bk1, bk2, ..., bkp]T and ak = [ak1, ak2, ..., akp]T.

Pattern recall: for recall with a stimulus pattern aj:

  b = W aj = ∑k (akT aj) bk   for k = 1 to q
    = bj + vj,   where vj = ∑k≠j (akT aj) bk

assuming that the key patterns have been normalised, i.e. ajT aj = 1. The term vj results from interference from all other stimulus patterns.
∴ (akT aj) = 0 for j ≠ k (orthonormal patterns) → perfect recall

Main features:
• distributed memory
• auto- and hetero-associative
• content addressable and resistant to noise and damage
• interaction between stored patterns may lead to error on recall
• the max. no. of patterns reliably stored is p, the dimension of the input space, which is also the rank (no. of independent columns or rows) of W
• for an auto-associative memory, ideally W ak = ak, showing that the stimulus patterns are eigenvectors of W with all unity eigenvalues

Example: a1 = [1 0 0 0]T, a2 = [0 1 0 0]T, a3 = [0 0 1 0]T
         b1 = [5 1 0]T, b2 = [-2 1 6]T, b3 = [-2 4 3]T

  memory weight matrix  W = | 5  -2  -2  0 |
                            | 1   1   4  0 |
                            | 0   6   3  0 |

giving perfect recall, since the stimulus patterns are orthonormal.

A noisy stimulus, e.g. [0.8 -0.15 0.15 -0.2]T, gives [4 1.25 -0.45]T, which is closer to b1 than to b2 or b3.
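The worked example can be verified directly; this sketch builds W from the outer products and repeats the recall with the noisy stimulus given above.

```python
import numpy as np

# key (stimulus) patterns and stored responses from the example
A = [np.array([1, 0, 0, 0]), np.array([0, 1, 0, 0]), np.array([0, 0, 1, 0])]
B = [np.array([5, 1, 0]), np.array([-2, 1, 6]), np.array([-2, 4, 3])]

# Hebbian estimate of the memory: W = sum_k b_k a_k^T
W = sum(np.outer(b, a) for a, b in zip(A, B))
print(W)             # [[ 5 -2 -2  0], [ 1  1  4  0], [ 0  6  3  0]]

print(W @ A[0])      # perfect recall of b1 = [5 1 0]

noisy = np.array([0.8, -0.15, 0.15, -0.2])
print(W @ noisy)     # [ 4.    1.25 -0.45], closest to b1
```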

6. Radial Basis Functions

6.1 Separability of patterns

The separability theorem (Cover) states that if the mapping ϕ(x) is nonlinear and the dimension of the hidden-unit space is high relative to the input space, then the patterns are more likely to be separable.

• example of an RBF is a Gaussian: ϕ(x) = exp(−||x − t||²), where t = centre of the Gaussian
• the output neuron is a linear weighted sum of the hidden-unit outputs
• ϕ(x) is nonlinear and the hidden-unit space [ϕ1(x), ϕ2(x), ..., ϕp(x)] is usually of high dimension relative to the input space, so the patterns are more likely to be separable
• a difficult nonlinear optimisation problem has been converted to a linear optimisation problem that can be solved by the LMS algorithm
• if a different RBF is centred on each training pattern, then the training set can be learned perfectly

[Figure: RBF network with inputs x1 ... xp, hidden units ϕ1 ... ϕp, bias weight w0 and output weights w1 ... wp feeding a linear output neuron]

6.2 Example: XOR

Use two hidden Gaussian functions:

  ϕ1(x) = exp(−||x − t1||²),  t1 = [1, 1]T
  ϕ2(x) = exp(−||x − t2||²),  t2 = [0, 0]T

  x1  x2  ϕ1(x)  ϕ2(x)
  0   0   e^−2   1
  0   1   e^−1   e^−1
  1   0   e^−1   e^−1
  1   1   1      e^−2

[Figure: in the (ϕ1, ϕ2) plane the pattern (1,1) maps to (1, e^−2), the pattern (0,0) maps to (e^−2, 1), and the patterns (0,1) and (1,0) both map to (e^−1, e^−1), so a straight decision boundary now separates the two classes]
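A numerical check of the table above. The linear output weights are found here by least squares over the four patterns, which is an assumption for the sketch; the notes only specify the hidden-layer mapping.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([0, 1, 1, 0], dtype=float)
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

def phi(x, t):
    return np.exp(-np.sum((x - t) ** 2))     # Gaussian RBF exp(-||x - t||^2)

# hidden-unit space [phi1(x), phi2(x)] plus a bias term
Phi = np.array([[phi(x, t1), phi(x, t2), 1.0] for x in X])
print(np.round(Phi[:, :2], 4))               # matches the table: e^-2 ~ 0.1353, e^-1 ~ 0.3679

# linear output neuron: solve for the weights by least squares
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print(np.round(Phi @ w, 3))                  # close to the XOR targets 0, 1, 1, 0
```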

6.3 Ill-posed Hypersurface Reconstruction

The inverse problem of finding an unknown mapping F from domain X to range Y is well-posed if:
1. for every x ∈ X there exists y ∈ Y (existence)
2. F(x) = F(t) iff x = t, for every pair of inputs x, t ∈ X (uniqueness)
3. the mapping is continuous (continuity)

[Figure: mapping F from the domain X to the range Y]

Learning is ill-posed because of the sparsity of information and the noise in the training set.

Regularisation theory for solving ill-posed problems (Tikhonov) uses a modified cost functional that includes a complexity term:

  E(F) = Es(F) + λ Ec(F)

where Es(F) is the standard error term and Ec(F) is the regularising term.

• one regularised solution is given by a linear superposition of multivariate Gaussian basis functions with centres xi and widths σi:

  F(x) = ∑i wi exp( −||x − xi||² / (2σi²) )   for i = 1 to N

• practical ways of regularising: reduce the number of RBFs; change the σ of the RBFs

6.4 RBF Networks vs. MLP
• single hidden layer (RBF) vs. possibly multiple hidden layers (MLP)
• computation nodes fundamentally different in the hidden & o/p layers (RBF) vs. a common computation node model (MLP)
• nonlinear hidden layer but linear output layer (RBF) vs. all layers usually nonlinear (MLP)
• hidden units compute the Euclidean norm between the i/p vector and the centre of the appropriate unit (RBF) vs. the inner product of the i/p vector & weight vector (MLP)
• local approximation with fast learning but poor extrapolation (RBF) vs. global approximation and therefore good at extrapolation (MLP)

6.5 Learning Strategies
A variety of possibilities exist, since a nonlinear optimisation strategy for the hidden layer is combined with a linear optimisation strategy in the output layer. For the hidden layer the main choice involves how the centres are learned:
• Fixed centres selected at random, e.g. choose the Gaussian exp(−(M/d²) ||x − ti||²), where M = no. of centres and d = distance between them
• Self-organised selection of centres, e.g. k-nearest-neighbour or a self-organising NN to choose the position of the centres
• Supervised selection of centres, e.g. error-correction learning with a suitable cost function using modified gradient descent

6.6 Example: curve fitting
• RBF network for approximating (x−2)(2x+1)(1+x²)⁻¹ from 15 noise-free examples
• 15 Gaussian hidden units with the same σ
• three designs are generated, for σ = 0.5, σ = 1.0 and σ = 1.5
• output shown for 200 inputs uniformly sampled in the range [−8, 12]
• best compromise is σ = 1.0

[Figure: the 15 training points (×) and the fitted curves for σ = 0.5, σ = 1.0 and σ = 1.5]
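A sketch reproducing the curve-fitting setup, under the assumption that the 15 training inputs (and hence the centres) are equally spaced over [−8, 12]; the notes do not say where the samples lie, so the numbers produced are only indicative. The output weights are fitted by least squares.

```python
import numpy as np

def f(x):
    return (x - 2) * (2 * x + 1) / (1 + x ** 2)

centres = np.linspace(-8, 12, 15)     # assumed sample/centre locations
y_train = f(centres)                  # 15 noise-free examples

def design_matrix(x, sigma):
    # Gaussian hidden units, one centred on each training point
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * sigma ** 2))

x_test = np.linspace(-8, 12, 200)     # 200 uniformly sampled inputs
for sigma in (0.5, 1.0, 1.5):
    w, *_ = np.linalg.lstsq(design_matrix(centres, sigma), y_train, rcond=None)
    y_hat = design_matrix(x_test, sigma) @ w
    err = np.mean((y_hat - f(x_test)) ** 2)
    print(f"sigma = {sigma}: mean squared error on the 200 test inputs = {err:.4f}")
```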