
1430/10/28

Ch 10: Widrow-Hoff Learning

(LMS Algorithm)
In this chapter we apply the principles of performance
learning to a single-layer linear neural network.
Widrow-Hoff learning is an approximate steepest
descent algorithm, in which the performance index is
mean square error.

Bernard Widrow began working in neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.
In 1960 Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network.
Its learning rule is called the LMS (Least Mean Square) algorithm.
ADALINE is similar to the perceptron, except that its transfer function is linear rather than hard-limiting.

Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96-104.
Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation, Proc. IEEE, 78:1415-1441.
Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.

The ADALINE and the perceptron have the same limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.
The LMS algorithm has found many more practical uses than the perceptron learning rule (for example, most long-distance phone lines use ADALINE networks for echo cancellation).

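$$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$$

The ith row of W is written as the vector

$${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$$

For a single neuron with two inputs:

$$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b$$

$$a = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}\,p_1 + w_{1,2}\,p_2 + b$$

The ADALINE, like the perceptron, has a decision boundary, which is determined by the input vectors for which the net input n is zero.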

Mean Square Error

The LMS algorithm is an example of supervised training.
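Training set:

$$\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$$

Input: $\mathbf{p}_q$   Target: $t_q$

Notation:

$$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \qquad a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$$

Mean square error:

$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

The expectation is taken over all sets of input/target pairs.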

Error Analysis
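$$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$$

$$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$$

$$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\,\mathbf{x}$$

This can be written in the following convenient form:

$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

where

$$c = E[t^2], \qquad \mathbf{h} = E[t\mathbf{z}], \qquad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$$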

The vector h gives the cross-correlation between the input vector and its associated target.
R is the input correlation matrix.
The diagonal elements of this matrix are equal to the mean square values of the elements of the input vectors.
The mean square error for the ADALINE network is a quadratic function:

$$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}, \qquad \mathbf{d} = -2\mathbf{h}, \qquad \mathbf{A} = 2\mathbf{R}$$
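
As a quick numerical check of the quadratic form, the sketch below estimates c, h and R as sample averages over a small made-up training set and compares c - 2x^T h + x^T R x with the directly averaged squared error. The three (p, t) pairs, the test point x and all function names are assumptions for illustration, not values from these slides.

```python
# Hypothetical training set for a one-input ADALINE: equally likely (p, t) pairs.
samples = [(1.0, 1.0), (2.0, 3.0), (3.0, 4.0)]


def augment(p):
    """Build z = [p, 1] so that a = x^T z with x = [w, b]."""
    return [p, 1.0]


def statistics(samples):
    """Return c = E[t^2], h = E[t z], R = E[z z^T] as sample averages."""
    q = len(samples)
    c = sum(t * t for _, t in samples) / q
    h = [sum(t * augment(p)[i] for p, t in samples) / q for i in range(2)]
    R = [[sum(augment(p)[i] * augment(p)[j] for p, _ in samples) / q
          for j in range(2)] for i in range(2)]
    return c, h, R


def mse_direct(x, samples):
    """F(x) = E[(t - x^T z)^2], averaged over the training set."""
    total = 0.0
    for p, t in samples:
        z = augment(p)
        a = x[0] * z[0] + x[1] * z[1]
        total += (t - a) ** 2
    return total / len(samples)


def mse_quadratic(x, c, h, R):
    """F(x) = c - 2 x^T h + x^T R x."""
    xh = x[0] * h[0] + x[1] * h[1]
    xRx = sum(x[i] * R[i][j] * x[j] for i in range(2) for j in range(2))
    return c - 2.0 * xh + xRx


c, h, R = statistics(samples)
x = [0.5, -0.25]  # arbitrary weight and bias
print(mse_direct(x, samples), mse_quadratic(x, c, h, R))  # the two values agree
```

Note that the diagonal of R holds the mean square values of the elements of z, as stated above: R[0][0] is the average of p^2 and R[1][1] is 1.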

Stationary Point
Hessian matrix:

$$\mathbf{A} = 2\mathbf{R}$$

The correlation matrix R must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point (depending on d = -2h); otherwise there will be a unique global minimum x* (see Ch 8).

$$\nabla F(\mathbf{x}) = \nabla\left(c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}\right) = \mathbf{d} + \mathbf{A}\mathbf{x} = -2\mathbf{h} + 2\mathbf{R}\mathbf{x}$$

Stationary point:

$$-2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0}$$

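If the correlation matrix R is positive definite, there is a unique stationary point, which is a strong minimum:

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

If we could calculate the statistical quantities h and R, we could find the minimum point directly from the above equation.
But it is usually not desirable or convenient to calculate h and R, so we use an approximate steepest descent algorithm instead.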
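
When h and R are known, the minimum can be computed directly. The following is a minimal sketch for a two-parameter case (one weight plus a bias), solving R x = h by Cramer's rule and confirming that the gradient -2h + 2Rx vanishes there. The statistics are hypothetical: they correspond to inputs p in {1, 2, 3} with targets t = 2p - 1, which are not from these slides.

```python
def solve_2x2(R, h):
    """Solve R x = h for a 2x2 matrix by Cramer's rule (R must be nonsingular)."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    if abs(det) < 1e-12:
        raise ValueError("R is singular: no unique minimum")
    x0 = (h[0] * R[1][1] - R[0][1] * h[1]) / det
    x1 = (R[0][0] * h[1] - R[1][0] * h[0]) / det
    return [x0, x1]


# Hypothetical statistics for z = [p, 1], p in {1, 2, 3}, t = 2p - 1.
R = [[14.0 / 3.0, 2.0], [2.0, 1.0]]   # E[z z^T]
h = [22.0 / 3.0, 3.0]                 # E[t z]

x_star = solve_2x2(R, h)

# The gradient -2h + 2Rx should be (numerically) zero at the stationary point.
gradient = [-2.0 * h[i] + 2.0 * (R[i][0] * x_star[0] + R[i][1] * x_star[1])
            for i in range(2)]
print(x_star, gradient)
```

For these made-up statistics the solution is the weight 2 and bias -1 that generated the targets, up to rounding.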

Approximate Steepest Descent

Approximate mean square error (one sample):

$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

The expectation of the squared error has been replaced by the squared error at iteration k.

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$$

$$[\nabla e^2(k)]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\frac{\partial e(k)}{\partial w_{1,j}}, \qquad j = 1, 2, \ldots, R$$

$$[\nabla e^2(k)]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\frac{\partial e(k)}{\partial b}$$

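$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial [t(k) - a(k)]}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left({}_1\mathbf{w}^T\mathbf{p}(k) + b\right)\right]$$

$$\frac{\partial e(k)}{\partial w_{1,j}} = \frac{\partial}{\partial w_{1,j}}\left[t(k) - \left(\sum_{i=1}^{R} w_{1,i}\,p_i(k) + b\right)\right]$$

where $p_i(k)$ is the ith element of the input vector at the kth iteration.

$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \qquad \frac{\partial e(k)}{\partial b} = -1$$

$$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$$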
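
The result that the gradient of e^2(k) is -2 e(k) z(k) can be checked numerically. The sketch below compares that analytic expression with a central-difference approximation for one sample. The weights, input and target are arbitrary assumed values, and the function names are illustrative only.

```python
def error(x, z, t):
    """e = t - x^T z for a single sample."""
    return t - sum(xi * zi for xi, zi in zip(x, z))


def lms_gradient(x, z, t):
    """Analytic gradient estimate: grad e^2 = -2 e z."""
    e = error(x, z, t)
    return [-2.0 * e * zi for zi in z]


def numeric_gradient(x, z, t, step=1e-6):
    """Central-difference approximation of the gradient of e^2 with respect to x."""
    grad = []
    for j in range(len(x)):
        up = list(x)
        up[j] += step
        down = list(x)
        down[j] -= step
        grad.append((error(up, z, t) ** 2 - error(down, z, t) ** 2) / (2.0 * step))
    return grad


x = [0.3, -0.7, 0.1]   # assumed [w_11, w_12, b]
z = [1.5, -2.0, 1.0]   # assumed [p_1, p_2, 1]
t = 0.5                # assumed target

analytic = lms_gradient(x, z, t)
numeric = numeric_gradient(x, z, t)
print(analytic)
print(numeric)
```

Because e^2 is quadratic in x, the central difference is exact apart from rounding, so the two gradients should agree closely.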

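Now we can see the beauty of approximating the mean square error by the single squared error at iteration k, as in:

$$\hat{F}(\mathbf{x}) = (t(k) - a(k))^2 = e^2(k)$$

To calculate the approximate gradient we only need to multiply the error by the input.
This approximation to $\nabla F(\mathbf{x})$ can now be used in the steepest descent algorithm.

LMS Algorithm

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\Big|_{\mathbf{x} = \mathbf{x}_k}$$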

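If we substitute $\hat{\nabla}F(\mathbf{x})$ for $\nabla F(\mathbf{x})$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$

$${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha\,e(k)\,\mathbf{p}(k)$$

$$b(k+1) = b(k) + 2\alpha\,e(k)$$

These last two equations make up the LMS algorithm, also called the delta rule or the Widrow-Hoff learning algorithm.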
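
A minimal sketch of the single-neuron LMS update, written directly from the two update equations for the weights and the bias. The training data (targets generated by t = 2p - 1), the learning rate 0.05 and the number of passes are assumptions chosen for illustration.

```python
def lms_step(w, b, p, t, alpha):
    """One LMS update for a single linear neuron.

    Returns the new weight vector, the new bias, and the error e(k).
    """
    a = sum(wi * pi for wi, pi in zip(w, p)) + b     # linear output a = w^T p + b
    e = t - a                                        # error e(k) = t(k) - a(k)
    w_new = [wi + 2.0 * alpha * e * pi for wi, pi in zip(w, p)]
    b_new = b + 2.0 * alpha * e
    return w_new, b_new, e


# Hypothetical data: one input, targets generated by t = 2p - 1.
data = [([1.0], 1.0), ([2.0], 3.0), ([3.0], 5.0)]

w, b = [0.0], 0.0
alpha = 0.05              # assumed learning rate, small enough to be stable here
for _ in range(2000):     # repeated passes over the data
    for p, t in data:
        w, b, e = lms_step(w, b, p, t, alpha)

print(w, b)
```

Since these targets can be fit exactly by a linear neuron, the error shrinks toward zero and the weight and bias approach 2 and -1.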

Multiple-Neuron Case
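$${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha\,e_i(k)\,\mathbf{p}(k)$$

$$b_i(k+1) = b_i(k) + 2\alpha\,e_i(k)$$

Matrix form:

$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$$

$$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$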

Analysis of Convergence
Note that $\mathbf{x}_k$ is a function only of z(k-1), z(k-2), ..., z(0). If we assume that successive input vectors are statistically independent, then $\mathbf{x}_k$ is independent of z(k).
We will show that for stationary input processes meeting this condition, the expected value of the weight vector will converge to:

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$$

This is the minimum mean square error ($E[e_k^2]$) solution, as we saw before.

Recall the LMS Algorithm:

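$$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha\,e(k)\,\mathbf{z}(k)$$

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\,E[e(k)\,\mathbf{z}(k)]$$

Substitute $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ for the error:

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[(\mathbf{x}_k^T\mathbf{z}(k))\,\mathbf{z}(k)]\right\}$$

Since $\mathbf{x}_k^T\mathbf{z}(k) = \mathbf{z}^T(k)\,\mathbf{x}_k$:

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[\mathbf{z}(k)\,\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$$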

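Since $\mathbf{x}_k$ is independent of z(k):

$$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\right\}$$

$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\,\mathbf{h}$$

For stability, the eigenvalues of the matrix $[\mathbf{I} - 2\alpha\mathbf{R}]$ must fall inside the unit circle.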

Conditions for Stability

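$$\mathrm{eig}([\mathbf{I} - 2\alpha\mathbf{R}]) = 1 - 2\alpha\lambda_i, \qquad |1 - 2\alpha\lambda_i| < 1$$

(where $\lambda_i$ is an eigenvalue of R)

Since $\lambda_i > 0$, we always have $1 - 2\alpha\lambda_i < 1$. Therefore the stability condition simplifies to

$$1 - 2\alpha\lambda_i > -1 \quad\Rightarrow\quad \alpha < \frac{1}{\lambda_i} \ \text{for all } i \quad\Rightarrow\quad 0 < \alpha < \frac{1}{\lambda_{max}}$$

Note: we have the same condition as the steepest descent (SD) algorithm. In SD we use the Hessian matrix A; here we use the input correlation matrix R (recall that A = 2R).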

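$$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\,\mathbf{h}$$

If the system is stable, then a steady state condition will be reached:

$$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\,\mathbf{h}$$

The solution to this equation is

$$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$$

This is also the strong minimum of the performance index.

Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.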
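
The recursion for the expected weight vector can be iterated directly to see both the stability bound and the steady state. In this sketch R and h are hypothetical: R has eigenvalues 1 and 3, so the bound is alpha < 1/3, and R^{-1} h = [2/3, -1/3]. The function name and the step counts are also assumptions.

```python
def mean_weight_recursion(R, h, alpha, steps):
    """Iterate E[x_{k+1}] = (I - 2*alpha*R) E[x_k] + 2*alpha*h, starting from zero."""
    n = len(h)
    x = [0.0] * n
    for _ in range(steps):
        Rx = [sum(R[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] - 2.0 * alpha * Rx[i] + 2.0 * alpha * h[i] for i in range(n)]
    return x


# Hypothetical statistics: eigenvalues of R are 1 and 3, so alpha must be below 1/3.
R = [[2.0, 1.0], [1.0, 2.0]]
h = [1.0, 0.0]

stable = mean_weight_recursion(R, h, alpha=0.1, steps=200)    # inside the bound
unstable = mean_weight_recursion(R, h, alpha=0.4, steps=50)   # outside the bound

print(stable)     # approaches R^{-1} h
print(unstable)   # grows in magnitude
```

With alpha = 0.4 the mode associated with the eigenvalue 3 is multiplied by 1 - 2(0.4)(3) = -1.4 at every step, which is why the second run diverges.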

Example
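Banana: $\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}, \ t_1 = -1$   Apple: $\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, \ t_2 = 1$

If inputs are generated randomly with equal probability, the input correlation matrix is:

$$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \frac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \frac{1}{2}\mathbf{p}_2\mathbf{p}_2^T$$

$$\mathbf{R} = \frac{1}{2}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$$

$$\lambda_1 = 1.0, \qquad \lambda_2 = 0.0, \qquad \lambda_3 = 2.0$$

$$\alpha < \frac{1}{\lambda_{max}} = \frac{1}{2.0} = 0.5$$

We take α = 0.2. (Note: in practice it is difficult to calculate R, and hence the bound on α, so α is usually chosen by trial and error.)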
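
The correlation matrix and its eigenvalues for this example can be checked with a short sketch. It builds R as the equally weighted average of the outer products of the banana and apple prototypes (p1 = [-1, 1, -1], p2 = [1, 1, -1]) and verifies that det(R - λI) = 0 for λ = 1, 0 and 2. The helper names are illustrative assumptions.

```python
def outer(p):
    """Outer product p p^T as a nested list."""
    return [[pi * pj for pj in p] for pi in p]


def correlation_matrix(patterns, probs):
    """R = sum over patterns of prob * p p^T."""
    n = len(patterns[0])
    R = [[0.0] * n for _ in range(n)]
    for p, prob in zip(patterns, probs):
        pp = outer(p)
        for i in range(n):
            for j in range(n):
                R[i][j] += prob * pp[i][j]
    return R


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def char_poly(R, lam):
    """det(R - lam*I); zero exactly when lam is an eigenvalue of R."""
    shifted = [[R[i][j] - (lam if i == j else 0.0) for j in range(3)]
               for i in range(3)]
    return det3(shifted)


p1 = [-1.0, 1.0, -1.0]   # banana
p2 = [1.0, 1.0, -1.0]    # apple
R = correlation_matrix([p1, p2], [0.5, 0.5])

residuals = [char_poly(R, lam) for lam in (1.0, 0.0, 2.0)]
alpha_max = 1.0 / 2.0    # 1 / lambda_max

print(R)
print(residuals)
```

The zero eigenvalue means R is only positive semidefinite, so this example has a weak minimum rather than a unique strong one.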

Iteration One
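Banana:

$$a(0) = \mathbf{W}(0)\,\mathbf{p}(0) = \mathbf{W}(0)\,\mathbf{p}_1 = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$$

(W(0) is selected arbitrarily.)

$$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$$

$$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha\,e(0)\,\mathbf{p}^T(0)$$

$$\mathbf{W}(1) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$$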

Iteration Two
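Apple:

$$a(1) = \mathbf{W}(1)\,\mathbf{p}(1) = \mathbf{W}(1)\,\mathbf{p}_2 = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$$

$$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$$

$$\mathbf{W}(2) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}^T = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$$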

Iteration Three
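$$a(2) = \mathbf{W}(2)\,\mathbf{p}(2) = \mathbf{W}(2)\,\mathbf{p}_1 = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$$

$$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$$

$$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha\,e(2)\,\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$$

$$\mathbf{W}(\infty) = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$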
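
The three iterations can be replayed with a few lines of code. This sketch alternates the banana (p1 = [-1, 1, -1], t1 = -1) and apple (p2 = [1, 1, -1], t2 = 1) patterns with α = 0.2, no bias, and W(0) = 0, then keeps alternating to see where the weights settle. The function name and the total of 200 iterations are assumptions.

```python
def lms_update(W, p, t, alpha):
    """One LMS step for a single neuron without bias: W <- W + 2*alpha*e*p^T."""
    a = sum(wi * pi for wi, pi in zip(W, p))
    e = t - a
    W_new = [wi + 2.0 * alpha * e * pi for wi, pi in zip(W, p)]
    return W_new, a, e


banana = ([-1.0, 1.0, -1.0], -1.0)
apple = ([1.0, 1.0, -1.0], 1.0)
alpha = 0.2

W = [0.0, 0.0, 0.0]          # W(0), chosen arbitrarily
history = []                 # (a(k), e(k), W(k+1)) for the first three iterations
for k in range(3):
    p, t = (banana, apple)[k % 2]
    W, a, e = lms_update(W, p, t, alpha)
    history.append((a, e, list(W)))
    print(k, a, e, W)

# Keep presenting the two patterns alternately to approach the limit.
for k in range(3, 200):
    p, t = (banana, apple)[k % 2]
    W, _, _ = lms_update(W, p, t, alpha)

print(W)
```

Starting from zero, the weights stay in the span of the two input vectors, which is why they settle at [1, 0, 0] even though R has a zero eigenvalue.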

Learning process:
Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.
The convergence process can be monitored with a plot of the mean-squared error function F(W(k)).

The popular stopping criteria are:
The mean-squared error is sufficiently small: F(W(k)) falls below a chosen threshold.
The rate of change of the mean-squared error is sufficiently small.
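
A minimal sketch of epoch-based training with both stopping criteria, using the banana/apple patterns (p1 = [-1, 1, -1], t1 = -1; p2 = [1, 1, -1], t2 = 1) and α = 0.2. The threshold names eps and delta, their values, and the cap on the number of epochs are assumptions, not values given in these slides.

```python
def mean_squared_error(W, data):
    """F(W): squared error averaged over the whole training set."""
    return sum((t - sum(wi * pi for wi, pi in zip(W, p))) ** 2
               for p, t in data) / len(data)


def train(data, alpha, eps=1e-6, delta=1e-9, max_epochs=1000):
    """Run LMS epoch by epoch until a stopping criterion is met.

    eps:   stop when the mean-squared error is below this value (assumed).
    delta: stop when the error changes by less than this per epoch (assumed).
    """
    W = [0.0] * len(data[0][0])
    previous = mean_squared_error(W, data)
    curve = [previous]                      # error per epoch, for plotting
    for epoch in range(1, max_epochs + 1):
        for p, t in data:                   # one epoch = one pass over the data
            e = t - sum(wi * pi for wi, pi in zip(W, p))
            W = [wi + 2.0 * alpha * e * pi for wi, pi in zip(W, p)]
        current = mean_squared_error(W, data)
        curve.append(current)
        if current < eps:
            return W, epoch, "mse below eps", curve
        if abs(previous - current) < delta:
            return W, epoch, "mse change below delta", curve
        previous = current
    return W, max_epochs, "max epochs reached", curve


data = [([-1.0, 1.0, -1.0], -1.0), ([1.0, 1.0, -1.0], 1.0)]
W, epochs, reason, curve = train(data, alpha=0.2)
print(W, epochs, reason)
```

The returned curve holds F(W(k)) after each epoch, which is the quantity one would plot to monitor convergence.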

ADALINE is one of the most widely used neural networks in practical applications. One of the major application areas has been adaptive filtering.

Tapped Delay Line

$$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k - i + 1) + b$$

In digital signal processing (DSP) language we recognize this network as a finite impulse response (FIR) filter.
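
The tapped-delay-line ADALINE can be sketched as a plain FIR filter. In the code below the output at time k is the weighted sum of the current and the R - 1 previous input samples plus the bias. The tap weights, the zero initial conditions (y(k) = 0 for k < 0) and the function name are assumptions.

```python
def adaline_fir(y, w, b=0.0):
    """a(k) = sum_{i=1}^{R} w[i-1] * y(k - i + 1) + b, with y(k) = 0 for k < 0."""
    R = len(w)
    out = []
    for k in range(len(y)):
        a = b
        for i in range(1, R + 1):
            idx = k - i + 1              # delayed sample index
            if idx >= 0:
                a += w[i - 1] * y[idx]
        out.append(a)
    return out


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
w = [0.5, 0.3, 0.2]                      # hypothetical tap weights
response = adaline_fir(impulse, w)
print(response)
```

Feeding in a unit impulse returns the tap weights followed by zeros, which is the finite impulse response that gives the filter its name.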

Example: Noise Cancellation

A two-input filter can attenuate and phase-shift the noise in the desired way.

Correlation Matrix
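To analyze this system we need to find the input correlation matrix R and the input/target cross-correlation vector h.

$$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \qquad \mathbf{h} = E[t\,\mathbf{z}]$$

$$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \qquad t(k) = s(k) + m(k)$$

$$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)\,v(k-1)] \\ E[v(k-1)\,v(k)] & E[v^2(k-1)] \end{bmatrix}$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix}$$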

We must define the noise signal v, the EEG signal s, and the filtered noise m, to be able to obtain specific values.
We assume: the EEG signal is a white (uncorrelated from one time step to the next) random signal uniformly distributed between the values -0.2 and +0.2, and the noise source (a 60 Hz sine wave sampled at 180 Hz) is given by

$$v(k) = 1.2\sin\left(\frac{2\pi k}{3}\right)$$

The filtered noise that contaminates the EEG is the noise attenuated by a factor of 1.0 and shifted in phase by $-3\pi/4$:

$$m(k) = 1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)$$

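$$E[v^2(k)] = (1.2)^2\,\frac{1}{3}\sum_{k=1}^{3}\sin^2\left(\frac{2\pi k}{3}\right) = (1.2)^2(0.5) = 0.72$$

$$E[v^2(k-1)] = E[v^2(k)] = 0.72$$

$$E[v(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\frac{2\pi k}{3}\right)\left(1.2\sin\frac{2\pi(k-1)}{3}\right) = (1.2)^2(0.5)\cos\left(\frac{2\pi}{3}\right) = -0.36$$

$$\mathbf{R} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}$$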

Stationary Point
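$$E[(s(k) + m(k))\,v(k)] = E[s(k)\,v(k)] + E[m(k)\,v(k)]$$

The first term is zero because s(k) and v(k) are independent and zero mean.

$$E[m(k)\,v(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi k}{3}\right) = -0.51$$

Now we find the second element of h:

$$E[(s(k) + m(k))\,v(k-1)] = E[s(k)\,v(k-1)] + E[m(k)\,v(k-1)]$$

where the first term is again zero.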

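$$E[m(k)\,v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right)\left(1.2\sin\frac{2\pi(k-1)}{3}\right) = 0.70$$

$$\mathbf{h} = \begin{bmatrix} E[(s(k) + m(k))\,v(k)] \\ E[(s(k) + m(k))\,v(k-1)] \end{bmatrix} = \begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix}$$

$$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}^{-1}\begin{bmatrix} -0.51 \\ 0.70 \end{bmatrix} = \begin{bmatrix} -0.30 \\ 0.82 \end{bmatrix}$$

Now, what kind of error will we have at the minimum solution?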
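
The statistics of the noise cancellation example can be reproduced numerically. This sketch averages the required products over one period (k = 1, 2, 3) of v(k) = 1.2 sin(2πk/3) and m(k) = 1.2 sin(2πk/3 - 3π/4), then solves the 2x2 system for x*. The terms involving s(k) are omitted because they average to zero. Function names are illustrative assumptions.

```python
import math


def v(k):
    """Noise source: 60 Hz sine sampled at 180 Hz."""
    return 1.2 * math.sin(2.0 * math.pi * k / 3.0)


def m(k):
    """Filtered noise that contaminates the EEG signal."""
    return 1.2 * math.sin(2.0 * math.pi * k / 3.0 - 3.0 * math.pi / 4.0)


def period_average(f):
    """Average f(k) over one period of the three-sample sinusoid."""
    return sum(f(k) for k in (1, 2, 3)) / 3.0


R = [[period_average(lambda k: v(k) * v(k)),
      period_average(lambda k: v(k) * v(k - 1))],
     [period_average(lambda k: v(k - 1) * v(k)),
      period_average(lambda k: v(k - 1) * v(k - 1))]]

# E[s v] = 0, so only the filtered-noise terms contribute to h.
h = [period_average(lambda k: m(k) * v(k)),
     period_average(lambda k: m(k) * v(k - 1))]

# Solve R x = h by Cramer's rule.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
x_star = [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
          (R[0][0] * h[1] - R[1][0] * h[0]) / det]

print(R)
print(h)
print(x_star)
```

Rounded to two decimals, the printed values match the R, h and x* given in the example.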

Performance Index
$$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$$

We have just found x*, R and h. To find c we have

$$c = E[t^2(k)] = E[(s(k) + m(k))^2]$$

$$c = E[s^2(k)] + 2E[s(k)\,m(k)] + E[m^2(k)]$$

The middle term is zero because s(k) and m(k) are independent and zero mean.

$$E[s^2(k)] = \frac{1}{0.4}\int_{-0.2}^{0.2} s^2\,ds = \frac{1}{3(0.4)}\,s^3\Big|_{-0.2}^{0.2} = 0.0133$$

$$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left\{1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right\}^2 = 0.72$$

$$c = 0.0133 + 0.72 = 0.7333$$

$$F(\mathbf{x}^*) = 0.7333 - 2(0.72) + 0.72 = 0.0133$$

The minimum mean square error is the same as the mean square value of the EEG signal. This is what we expected, since the error of this adaptive noise canceller is in fact the reconstructed EEG signal.
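
The minimum value of the performance index can be checked with unrounded quantities. In this sketch E[s^2] is estimated by a midpoint-rule integral, and h is written in closed form as 0.72 cos(3π/4) and 0.72 cos(π/12). Those closed forms come from the product-to-sum identity and are an assumption of this sketch (the slides only give the rounded values -0.51 and 0.70). Function names are illustrative.

```python
import math


def uniform_mean_square(a, n=100000):
    """Midpoint-rule estimate of (1/(2a)) * integral of s^2 over [-a, a]."""
    width = 2.0 * a / n
    total = 0.0
    for i in range(n):
        s = -a + (i + 0.5) * width
        total += s * s * width
    return total / (2.0 * a)


def performance_index(x, c, h, R):
    """F(x) = c - 2 x^T h + x^T R x for a two-element x."""
    xh = x[0] * h[0] + x[1] * h[1]
    xRx = sum(x[i] * R[i][j] * x[j] for i in range(2) for j in range(2))
    return c - 2.0 * xh + xRx


E_s2 = uniform_mean_square(0.2)     # mean square of the EEG signal
E_m2 = 0.5 * 1.2 ** 2               # mean square of a sinusoid of amplitude 1.2
c = E_s2 + E_m2

R = [[0.72, -0.36], [-0.36, 0.72]]
# Unrounded cross-correlations (assumed closed forms behind -0.51 and 0.70).
h = [0.72 * math.cos(3.0 * math.pi / 4.0), 0.72 * math.cos(math.pi / 12.0)]

det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
x_star = [(R[1][1] * h[0] - R[0][1] * h[1]) / det,
          (R[0][0] * h[1] - R[1][0] * h[0]) / det]

F_min = performance_index(x_star, c, h, R)
print(c, F_min)
```

Using unrounded values matters here: plugging the two-decimal values of h and x* into the quadratic form gives a noticeably different number, because F(x*) is a small difference of much larger terms.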

1430/10/28

[Figure: LMS trajectory on the contour plot of the performance index; axes W1,1 and W1,2.]

The LMS trajectory looks like a noisy version of steepest descent.

Note that the contours in this figure reflect the fact that the eigenvalues and the eigenvectors of the Hessian matrix A = 2R are

$$\lambda_1 = 2.16, \ \mathbf{z}_1 = \begin{bmatrix} -0.7071 \\ 0.7071 \end{bmatrix}, \qquad \lambda_2 = 0.72, \ \mathbf{z}_2 = \begin{bmatrix} -0.7071 \\ -0.7071 \end{bmatrix}$$

If the learning rate is decreased, the LMS trajectory is smoother, but the learning proceeds more slowly.
Note that the maximum α for stability is 2/2.16 = 0.926.

Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient, to update the weights.
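
The behaviour of the adaptive noise canceller can be illustrated with a small simulation. This sketch runs LMS on the two-tap filter with inputs v(k) and v(k-1), where v(k) = 1.2 sin(2πk/3), the target is s(k) + m(k) with m(k) = 1.2 sin(2πk/3 - 3π/4), and s(k) is uniform on [-0.2, 0.2]. The learning rate 0.1, the number of steps, the random seed and the zero initial weights are assumptions, not values stated in these slides.

```python
import math
import random


def simulate_noise_canceller(alpha=0.1, steps=6000, seed=0):
    """Run LMS on the two-tap noise canceller; return final weights and tail MSE."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    errors = []
    for k in range(1, steps + 1):
        v_k = 1.2 * math.sin(2.0 * math.pi * k / 3.0)
        v_prev = 1.2 * math.sin(2.0 * math.pi * (k - 1) / 3.0)
        s = rng.uniform(-0.2, 0.2)                                       # EEG signal
        m = 1.2 * math.sin(2.0 * math.pi * k / 3.0 - 0.75 * math.pi)     # filtered noise
        t = s + m                                                        # contaminated signal
        a = w[0] * v_k + w[1] * v_prev                                   # filter output
        e = t - a                                                        # restored signal
        w[0] += 2.0 * alpha * e * v_k
        w[1] += 2.0 * alpha * e * v_prev
        errors.append(e)
    tail = errors[steps // 2:]                    # discard the initial transient
    tail_mse = sum(e * e for e in tail) / len(tail)
    return w, tail_mse


w, tail_mse = simulate_noise_canceller()
print(w, tail_mse)
```

The weights hover around x* = [-0.30, 0.82] without settling exactly, and the mean squared error stays near the mean square value of the EEG signal (0.0133) rather than going to zero.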

Echo Cancellation


HW

Ch 4: E 2, 4, 6, 7
Ch 5: 5, 7, 9
Ch 6: 4, 5, 8, 10
Ch 7: 1, 5, 6, 7
Ch 8: 2, 4, 5
Ch 9: 2, 5, 6
Ch 10: 3, 6, 7