The DSP Primer 6

Adaptive DSP

August 2005, University of Strathclyde, Scotland, UK. For Academic Use Only.

Introduction 6.1

• In this session we will consider an overview of adaptive signal processing:

• Architectures

• Applications

• Algorithms

• DSP Processor Implementation

• History: Least squares - 19th Century mathematician Gauss.

• Least Squares is widely used off-line in practically every branch of science, engineering and business.

• Least mean squares - first suggested for DSP in 1960 by Widrow.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
An adaptive digital filter (FIR or IIR) will “adapt” to its environment. The environment will be defined by the input
signals x(k) and d(k) to the adaptive digital filter.
Intuitive Adaptive DSP 6.2

• General Closed Loop Adaptive Signal Processor:

“The aim is to adapt the digital filter such that the input signal x(k) is
filtered to produce y(k) which when subtracted from desired signal d(k),
will minimise the power of the error signal e(k).”

[Figure: general closed loop adaptive signal processor - the input signal x(k) drives an adaptive FIR digital filter whose output y(k) is subtracted from the desired signal d(k) to form the error signal e(k); the error drives the adaptive algorithm that updates the filter weights.]

e(k) = d(k) - y(k)

y(k) = Filter(x(k))

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Naming Conventions and Notation:

The naming of the signals as input, desired, output and error and denoted as x(k), d(k), y(k) and e(k)
respectively is standard in most textbooks and papers and will be used in this presentation.

The arrow through the adaptive filter is standard notation to indicate that the filter is adaptive and that all of the
digital filter weights can be updated as some function of the error signal. As with any feedback system, stability
is a concern; hence care must be taken to ensure that any adaptive algorithm is stable.

At this stage of abstraction no information is given on the actual input signals, which could be anything: speech,
music, digital data streams, vibration signals, predefined noise and so on.

Filter Type

The adaptive filter could be FIR (non-recursive), IIR (recursive) or even a non-linear filter. Most adaptive filters
are FIR for reasons of algorithm stability and mathematical tractability. In the last few years however, adaptive
IIR filters have become increasingly used in stable forms and in a number of real world applications (notably
active noise control, and ADPCM techniques). Current research has highlighted a number of useful non-linear
adaptive filters such as Volterra filters, and some forms of simple artificial neural networks.

Adaptive Filter Performance

Obviously the key aim of the adaptive filter is to minimise the error signal e(k). The success of this minimisation
will clearly depend on the nature of the input signals, the length of the adaptive filter, and the adaptive algorithm
used.
Architectures.... 6.3

[Figures: the four standard adaptive filter architectures - noise cancellation, inverse system identification, system identification, and prediction. In each case the adaptive filter produces y(k), which is subtracted from the desired signal d(k) to form the error e(k) that drives the adaptation.]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
In each of the above architectures the general adaptive signal processor can be clearly seen.

To simplify the figures, the ADCs and DACs are not explicitly shown.

In each architecture the aim of the general adaptive signal processor is the same - to minimise the error signal
e(k).

A particular application may have elements of more than one single architecture. For example the set up below
includes elements of system identification, inverse system identification, and noise cancellation. If the adaptive
filter is successful in modelling "Unknown System 1" and inverse modelling "Unknown System 2", and s(k)
is uncorrelated with r(k), then the error signal is likely to be e(k) ≈ s(k):
[Figure: combined example - the signals s(k) and r(k), Unknown System 1, Unknown System 2, a delay, and the adaptive filter together form the signals x(k), d(k), y(k) and e(k).]
Application Examples 6.4

• System Identification:

• Channel identification; Echo Cancellation

• Inverse System Identification:

• Digital communications equalisation.

• Noise Cancellation:

• Active Noise Cancellation; Interference cancellation for CDMA

• Prediction:

• Periodic noise suppression; Periodic signal extraction;
Speech coders; CDMA interference suppression.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The sampling rate will vary depending on the particular application and the range of signal frequencies.

• Hi fidelity audio ~ 48kHz.

• Voiceband telecommunications ~ 8 kHz.

• Teleconferencing type applications ~ 16 kHz.

• Biomedical DSP~ 500 to 2000Hz.

• Low frequency active noise control ~ 1000Hz.

• Ultrasonic applications ~ MHz.

• Sonar ~ 50 - 100 kHz

• Radar ~ MHz.

Adaptive filtering has a tremendous range of applications from everyday equalisers on modems to less obvious
applications such as adaptive tracking filters for predicting the movement of human eyes when following a
moving stimulus! (research undertaken by S.Goodbody, MRC, London).
Channel Identification 6.5

• Applying a broadband input signal, the adaptive filter will adapt to
minimise the error, and therefore produce a digital filter model of the
channel (for example, the acoustic response of a room).

[Figure: system identification of a communications channel. A broadband (white noise) signal x(k) is applied via a DAC to the channel and, after an ADC, forms the desired signal d(k); the adaptive filter output y(k) is subtracted from d(k) to give the error e(k) in the general adaptive signal processor.]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
To intuitively appreciate the above example, consider that the channel is a simple acoustic channel (from a
loudspeaker to a microphone). When an impulse is generated in a room, it will travel to a specific point by the
direct path, and also by many (first) echo or reflection paths and then by echoes of echoes (reverberation).
Clearly the dimensions and the walls of the room will influence the impulse response.

One traditional method of finding the impulse response of a room, is to apply an impulse using “clappers” or a
starting pistol, and record the impulse with a microphone and tape recorder. Improved impulse responses can
be found by taking an ensemble average of around 20 impulses. To find the frequency response of the room,
the Fourier transform of the impulse response is taken. This technique can however be difficult and time
consuming in practice, and white noise correlation techniques are more likely to be used.

[Figure: the room impulse response h(t) (time domain) and the frequency response H(f) are related by the DFT and IDFT.]
Calculating the impulse response is important to audio engineers working in applications such as architectural
acoustics, car interior acoustics, loudspeaker design, echo control systems, public address system design and
so on.

Adaptive System Identification: Room Acoustic Identification

If the architecture in the previous slide does indeed adapt, then the error will become very small. Because the
adaptive filter and the room were excited by the same signal, then over the frequency range of the input signal,
x(k), the adaptive filter will have the same impulse (and frequency) response as the room.
Echo Cancellation 6.6

• Local line echo cancellation is widely used in data modems (V-series)
and in telephone exchanges for echo reduction.
[Figure: local line echo cancellation between London and Paris. Speaker A's input signal x(k) ("..morning") drives the adaptive filter, whose output y(k) is a simulated echo of A. This is subtracted from the return signal d(k) from the hybrid/echo path (speaker B's "Hello" plus the echo of A), leaving the error signal e(k), which ideally contains only speaker B.]
August 2005, For Academic Use Only, All Rights Reserved


Notes:
When speaker A (or data source A) sends information down the telephone line, mismatches in the telephone
hybrids can cause echoes to occur. Therefore speaker A will hear an echo of their own voice which can be
particularly annoying if the echo path from the near and far end hybrids is particularly long. (Some echo to the
earpiece is often desirable for telephone conversation, and the local hybrid is deliberately mismatched, however
for data transmission echo is very undesirable and must be removed.)

If we can suitably model the echo generating path with the adaptive filter, then a negative simulated echo can
be added to cancel out the speaker A echo. At the other end of the line, telephone user B can also have an
echo canceller.

In general local echo cancellation (where the adaptive echo canceller is inside the consumer’s telephone/data
communication equipment) is only used for data transmission and not speech. Minimum specifications for the
modem V series of recommendations can be found in the ITU (formerly CCITT) Blue Book. For V32 modems
(9600 bits/sec with Trellis code modulation) an echo reduction ratio of 52dB is required. This is power reduction
of around 160,000 in the echo. Hence the requirement for a powerful DSP processor implementing an adaptive
echo cancelling filter.

For long distance telephone calls where the round trip echo delay is more than 0.1 seconds and suppressed by
less than 40dB (this is typical via satellite or undersea cables) line echo on speech can be a particularly
annoying problem. Before adaptive echo cancellers this problem would be solved by setting up speech
detectors and allowing speech to be half duplex. This is inconvenient for speakers who must take it in turn to
speak. Adaptive echo cancellers at telephone exchanges have however helped to solve this problem.

To cancel both near-end and far-end echoes, an echo canceller is often implemented in two sections, one for the near end
echo, and one for the far end echo. Further information on telecommunication echo cancellers can be found in the
textbooks referenced earlier.
Channel Equalisation 6.7

• To improve the bandwidth of a communication channel we can attempt to equalise it:
[Figure: channel equalisation between New York, USA and Glasgow, UK. A training sequence is transmitted (DAC) through the telephone channel and received (ADC) to form x(k); the same training sequence s(k), suitably delayed by ∆, forms the desired signal d(k). The adaptive filter output y(k) is subtracted from d(k) to give e(k), so that channel plus equaliser approximate a "virtual wire".]

• Training sequence could be a PRBS standard.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
If the above architecture successfully adapts (error is minimised) then the adaptive filter will produce an
approximately inverse transfer function of the telephone channel.

Data Channel Equalisation

The process of data channel equalisation is one of the most exploited areas of adaptive signal processing. Most
digital data communications (V32 modems for example) use some form of data channel equaliser. In the last
few years the availability of fast adaptive equalisers has led to modems capable of more than 28800 bits/s
communication with 115200 bits/s (also using data compression) modems on the horizon.

If the above telephone channel is a (stationary) communication channel with a continuous time impulse
response, then when symbols are transmitted the impulse response will cause a symbol to spread over many
time intervals, thus introducing intersymbol interference (ISI). The aim of a data equaliser is to remove this ISI.
Compared to simple channel equalisation, it should be noted that a data equaliser only requires to equalise the
channel at the symbol sampling instants rather than over all time. Hence the problem can be posed with data
symbols as inputs, rather than the raw stochastic data (as in the slide).

In general for channels where the impulse response changes slowly, a decision directed adaptive data
equaliser is used, whereby a slicer is used to produce a retraining signal. It is also worth noting that for many
data transmission systems, the data is complex, and hence a complex adaptive algorithm is required.

Further Information

R. Lucky’s paper on adaptive equalisation (Bell Syst Tech. J., Vol. 45, 1966) defined the LMS algorithm for
equalisation of digital communications and still is very relevant today. A more recent paper is: S. Qureshi.
Adaptive Equalisation. Proceedings of the IEEE, Vol. 73, pp. 1349-1387, 1985. A useful introduction can be
found in: H.M. Ahmed. Recent advances in DSP Systems. IEEE Comms. Mag., Vol. 29, No. 5, pp 32-45, May
1991. See also the general adaptive DSP textbooks for more information.
Minimising the Mean Squared Error 6.8

• If the statistics of x(k) and d(k) are wide sense stationary and ergodic
then we can choose to minimise the mean squared error signal:

[Figure: as before, the adaptive FIR digital filter produces the output y(k) from the input x(k); y(k) is subtracted from the desired signal d(k) to give the error e(k), which drives the adaptive algorithm.]

e(k) = d(k) - y(k)

y(k) = Linear Filter(x(k)), chosen to minimise E[e^2(k)]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
For a truly stationary signal all statistical moments are constant and therefore time invariant. For wide sense or
second order stationarity only the signal mean and variance are assumed to be constant.

“Traditional” adaptive filtering does not consider statistical information above the second order statistics. Some
recent higher order statistics techniques however do consider higher order moments in the development of
adaptive algorithms. The advantages, if any, of this approach are not clear as yet.

For mean squared analysis and derivation of suitable adaptive algorithms, the assumption of wide sense
stationarity for x(k) and d(k) has proven to be sufficient.
Adaptive Algorithm for FIR Filter 6.9


[Figure: adaptive FIR filter with weights w0, w1, w2, ..., wN-2, wN-1 operating on x(k); the output y(k) is subtracted from d(k) to give the error e(k).]

Adaptive Filter Weight Optimisation - Find wopt

y(k) = Σ_{n=0}^{N-1} wn x(k-n) = w^T x(k)

e(k) = d(k) - y(k)

wopt = function of ( x(k), d(k) )

August 2005, For Academic Use Only, All Rights Reserved


Notes:
When deriving adaptive algorithms it is useful to use a vector notation for the input vector and the weight vector.
The output of the filter y(k), is the convolution of the weight vector and the input vector:

y(k) = w^T x(k) = Σ_{n=0}^{N-1} x(k-n) wn

where

w = [ w0  w1  w2  ...  wN-2  wN-1 ]^T

and

x(k) = [ x(k)  x(k-1)  x(k-2)  ...  x(k-N+2)  x(k-N+1) ]^T

Wopt is the weight vector which minimises the error signal in the mean squared sense.

Wopt is often called the minimum MSE (MMSE) solution.

Note that the algorithm we are currently deriving is an open loop (i.e. no feedback) technique. Once a large
amount of data has been collected we then perform a single calculation. Hence this type of technique is
sometimes called single step.
Mean Squared Error 6.10

• Consider the squared error:

e^2(k) = ( d(k) - w^T x(k) )^2
       = d^2(k) + w^T [ x(k) x^T(k) ] w - 2 d(k) w^T x(k)

• Taking expected (or mean) values (and dropping "(k)" for notational
convenience):

E[e^2(k)] = E[d^2] + w^T E[x x^T] w - 2 w^T E[d x]

• Writing in terms of the correlation matrix, R, and the cross correlation
vector, p, gives:

E[e^2(k)] = E[d^2(k)] + w^T R w - 2 w^T p

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Correlation Matrix: Assuming that xk (or x(k)) and dk (or d(k)) are wide sense stationary ergodic processes (i.e.
mean and variance are constant), the correlation matrix R for a 3 weight adaptive FIR filter example is:

                              ( xk^2 )       ( xk xk-1 )    ( xk xk-2 )
R = E[ x x^T ] =          E   ( xk-1 xk )    ( xk-1^2 )     ( xk-1 xk-2 )
                              ( xk-2 xk )    ( xk-2 xk-1 )  ( xk-2^2 )

The correlation matrix R is Toeplitz symmetric, and for a filter with N weights the matrix will be N x N in dimension:

       r0     r1     r2    ...  rN-1
       r1     r0     r1    ...  rN-2
R =    r2     r1     r0    ...  rN-3
       :      :      :     ...  :
       rN-1   rN-2   rN-3  ...  r0

Cross Correlation Vector: The cross correlation vector, p, for a 3 weight adaptive filter is:

                   dk xk           p0
p = E[ dk xk ] = E dk xk-1    =    p1
                   dk xk-2         p2

Hence for a general N weight filter:

p = [ p0, p1, ..., pN-1 ]^T

Note that above we have used xk for x(k) and dk for d(k) purely for notational clarity. Both quantities are the same.
Minimum Mean Squared Error Solution 6.11

• Consider the MSE equation defining the so called MSE performance
surface, ζ = E[e^2(k)]:

E[e^2(k)] = E[d^2(k)] + w^T R w - 2 w^T p

• This equation is quadratic in the vector w. Hence there is only one
minimum value of ζ, denoted MMSE (minimum mean square error),
which occurs at wopt.

• The MMSE is found from setting the (partial derivative) gradient vector,
∇, to zero:

∇ = ∂ζ/∂w = 2Rw - 2p = 0

⇒ wopt = R^-1 p

August 2005, For Academic Use Only, All Rights Reserved


Notes:

Consider a (trivial) one weight filter case, then:

ζ = E[d^2(k)] + w^2 r - 2wp

E[dk^2] is a constant, and w, r, and p are all scalars. Hence the performance surface ζ is a parabola
(up-facing). The minimum point is where the surface has gradient = 0, i.e.

dζ/dw = 2rw - 2p = 0   ⇒   wopt = r^-1 p

[Figure: the one weight performance surface is an up-facing parabola in (w, ζ), with the MMSE at w = wopt.]

If the filter has two weights the performance surface is a paraboloid in 3 dimensions:

[Figure: the two weight performance surface is an up-facing paraboloid in (w1, w2, ζ), with the MMSE at (w1(opt), w2(opt)).]

If the filter has more than two weights then we cannot draw the performance surface in three dimensions;
however, mathematically there is still only one minimum point, which occurs when the gradient vector (with
respect to w) is zero. A performance surface with more than three dimensions is often called a hyperparaboloid.
Wiener-Hopf Solution 6.12

• This solution is termed the Wiener-Hopf solution (and is the optimum
solution for the mean squared error minimisation):

[Figure: the adaptive FIR digital filter processes the input signal x(k) to give the output y(k); the error e(k) = d(k) - y(k) is formed, and the weights are calculated directly as w = R^-1 p.]

• The Wiener-Hopf solution is NOT however a useful real time algorithm due to
the heavy computation required, and if the statistics of x(k) and d(k)
change then the wopt vector must be recalculated.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Practical Wiener-Hopf Implementation

The Wiener-Hopf adaptive DSP computation is a single step algorithm that does not use feedback, and could
be used to solve any of the previous problems in system identification, inverse system identification, noise
cancellation, etc. However there are number of practical reasons why this is not often done.

If we assume that the statistical averages are equal to the time averages (i.e. x(k) and d(k) are ergodic), then
we can calculate all elements of R and p from:

r(n) = (1/M) Σ_{i=0}^{M-1} x(i) x(i+n)

p_n = (1/M) Σ_{i=0}^{M-1} d(i) x(i+n)

Calculation of R and p requires approx. 2MN multiply and accumulate (MAC) operations, where M is the number
of samples in a "suitably" representative data sequence and N is the adaptive filter length. The inversion of R
requires around N^3 MACs, and the matrix-vector multiplication N^2 MACs. Therefore the total number of
computations in performing this one step algorithm is 2MN + N^3 + N^2 MACs. The computation is therefore very
high, e.g. an N = 100 weight filter, calculating the r and p values with M = 10000 samples, will require more than
3,000,000 MACs to find the MMSE solution.

More importantly if the statistics of signals x(k) or d(k) change (which is very likely), then the filter weights will
require to be recalculated, i.e. the algorithm has no tracking capabilities. Hence direct implementation of the
Wiener-Hopf solution is not practical for real time DSP implementation because of the high computation load,
and the need to recalculate when the statistics change.
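To make the computation just described concrete, the following Python/NumPy sketch (an illustration, not part of the original notes) estimates R and p by time averaging and then solves w = R^-1 p; the "unknown system" h, the signal lengths, and the indexing convention (p_n ≈ E[d(k)x(k-n)], matching the definition of p in Section 6.10) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4                                   # adaptive filter length
M = 10000                               # number of data samples
h = np.array([0.8, -0.4, 0.2, 0.1])     # hypothetical "unknown system" to identify
x = rng.standard_normal(M)              # broadband (white noise) input x(k)
d = np.convolve(x, h)[:M]               # desired signal d(k)

# Time-average estimates of the correlation quantities
r = np.array([np.mean(x[:M - n] * x[n:]) for n in range(N)])            # r(n)
R = np.array([[r[abs(i - j)] for j in range(N)] for i in range(N)])     # Toeplitz symmetric R
p = np.array([np.mean(d[n:] * x[:M - n]) for n in range(N)])            # p_n ~ E[d(k)x(k-n)]

# Wiener-Hopf solution w_opt = R^-1 p (solve() avoids forming the inverse explicitly)
w_opt = np.linalg.solve(R, p)
print(w_opt)                            # close to h for this example
```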
Gradient Techniques 1 6.13

• An iterative equation to find the MMSE can be performed by "jumping"
down the inside of the performance surface in the direction of steepest
(negative) gradient, -∇(k):

w(k+1) = w(k) + µ( -∇(k) )

[Figure: the MSE performance surface ζ over the weight vector w = [w0, w1]^T, showing a small step size µ taking many small jumps towards the minimum, and a large step size µ taking fewer, larger jumps.]
August 2005, For Academic Use Only, All Rights Reserved


Notes:
The step size, µ, controls the speed of adaption and also the stability of the (feedback) algorithm. If µ is too large
then the algorithm will climb the inside of the parabola and hence be unstable (diverge).

[Figure: one weight example - with a suitable step size the algorithm converges down the parabola to the MMSE at wopt; with too large a step size it diverges.]
Gradient Techniques 2 6.14

• The initial value w(0) is an initial “guess”, and then at each new discrete
time, k, a new weight vector value is calculated.

• For an N weight FIR filter, the steepest descent technique will jump
down the inside of the surface to the MMSE point.

• Although this algorithm is simpler than the Wiener-Hopf, the calculation
of the gradient:

∇(k) = ∂ζ/∂w(k) = ∂E[e^2(k)]/∂w(k) = 2Rw(k) - 2p

is computation intensive as both R and p are still required.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
There are few real time algorithms that will actually attempt to calculate the true gradient because of the need
to calculate R and p.
The LMS Algorithm 6.15

• Widrow suggested that instead of calculating the derivative of the mean
squared error, we use the instantaneous squared error, i.e.

w(k+1) = w(k) + µ( -∂e^2(k)/∂w(k) )

• Calculating this gradient estimate, ∇̂(k), gives:

∇̂(k) = ∂e^2(k)/∂w(k) = 2e(k) ∂e(k)/∂w(k) = -2e(k)x(k)

and therefore the LMS iterative weight update algorithm is:

w(k+1) = w(k) + 2µe(k)x(k)

August 2005, For Academic Use Only, All Rights Reserved


Notes:
It can be shown that the gradient estimate is indeed an unbiased estimate of the true gradient:

E[∇̂(k)] = E[-2e(k)x(k)] = E[-2( d(k) - w^T(k)x(k) )x(k)] = 2Rw(k) - 2p = ∇(k)

where we have assumed that w(k) and x(k) are independent.

In the mean the LMS will converge to the Wiener-Hopf solution if the step size, µ, is limited by the inverse of the
largest eigenvalue. Taking the expectation of both sides of the LMS equation gives:

E[w(k+1)] = E[w(k)] + 2µE[e(k)x(k)] = E[w(k)] + 2µ( E[d(k)x(k)] - E[x(k)x^T(k)w(k)] )

and again assuming that w(k) and x(k) are independent:

E[w(k+1)] = E[w(k)] + 2µ( p - R E[w(k)] ) = ( I - 2µR ) E[w(k)] + 2µRwopt

Now if we let v(k) = w(k) - wopt then we can rewrite the above in the form:

E[v(k+1)] = ( I - 2µR ) E[v(k)]

For convergence of the LMS to the Wiener-Hopf solution we require that w(k) → wopt as k → ∞, and therefore
v(k) → 0 as k → ∞. If the eigenvalue decomposition of R is given by Q^T Λ Q, where Q^T Q = I and Λ is a
diagonal matrix, then writing the vector v(k) in terms of the linear transformation Q, such that
E[v(k)] = Q^T E[u(k)], and multiplying both sides of the above equation, we realise the decoupled equations:
.......continued overleaf
The LMS Algorithm 6.16

[Figure: adaptive FIR filter with weights w0, w1, ..., wN-2, wN-1; the output y(k) is subtracted from d(k) to give e(k), which drives the LMS weight update.]

w(k+1) = w(k) + 2µe(k)x(k)

• The FIR filter requires N MACs (multiply-accumulates).
The LMS update requires N MACs.

• 2N MACs to implement each LMS algorithm iteration.
Hence 2Nfs MACs per second (where fs is the application sampling
frequency).
August 2005, For Academic Use Only, All Rights Reserved
Notes:
.......continued from above

E[u(k+1)] = ( I - 2µΛ ) E[u(k)]

and therefore:

E[u(k)] = ( I - 2µΛ )^k E[u(0)]

where ( I - 2µΛ ) is a diagonal matrix with diagonal elements (1 - 2µλ0), (1 - 2µλ1), ..., (1 - 2µλN-1).

For convergence of this equation to the zero vector, we require that

( 1 - 2µλn )^k → 0 as k → ∞, for all n = 0, 1, 2, ..., N-1

Therefore the step size, µ, must cater for the largest eigenvalue, λmax = max( λ0, λ1, λ2, ..., λN-1 ), such
that | 1 - 2µλmax | < 1, and therefore:

0 < µ < 1/λmax
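As a minimal sketch of the LMS loop just derived (Python/NumPy, not from the original notes; the unknown system h, the step size and the signal length are arbitrary assumptions), identifying an FIR system from white noise:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 4                                   # adaptive filter length
h = np.array([0.8, -0.4, 0.2, 0.1])     # hypothetical unknown system
K = 5000                                # number of iterations
x = rng.standard_normal(K)              # input x(k), white noise
d = np.convolve(x, h)[:K]               # desired signal d(k)

mu = 0.01                               # step size (chosen small; see the stability bound on the next slide)
w = np.zeros(N)                         # w(0): initial guess

for k in range(N - 1, K):
    xk = x[k - N + 1:k + 1][::-1]       # x(k) = [x(k), x(k-1), ..., x(k-N+1)]
    y = w @ xk                          # filter output y(k) = w^T x(k)   (N MACs)
    e = d[k] - y                        # error e(k) = d(k) - y(k)
    w = w + 2 * mu * e * xk             # LMS update w(k+1) = w(k) + 2 mu e(k) x(k)   (N MACs)

print(w)                                # converges close to h
```

Each iteration costs roughly 2N multiply-accumulates (N for the filter, N for the update), matching the count on the slide.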
LMS Stability 6.17

• The stability of the LMS is dependent on the magnitude of the step size
parameter, µ.

• For stability it can be shown that the step size should be:

0 < µ < 1 / ( N E[x^2(k)] )

where E[x^2(k)] is effectively a measure of the power in the input signal.

• Outwith these bounds it is likely that the LMS algorithm will go
unstable, and therefore fail to adapt to minimise the error.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The previously derived bound of 0 < µ < 1/λmax is not convenient to calculate, and hence not particularly useful
for practical purposes. However, using the linear algebraic result that:

trace[R] = Σ_{n=0}^{N-1} λn

i.e. the sum of the diagonal elements of the correlation matrix R is equal to the sum of the eigenvalues, the
inequality:

λmax ≤ trace[R]

will hold. If the signal xk is wide sense stationary, then the diagonal elements of the correlation matrix R are all
E[x^2(k)], which is a measure of the signal power. Hence:

trace[R] = N E[x^2(k)] = N <Signal Power>

and the well known LMS stability bound follows:

0 < µ < 1 / ( N E[x^2(k)] )

Some recent publications have further tightened these bounds. In the real DSP algorithm design world, the
bound is useful, although trial and error also provides good feedback!
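A quick numerical check of this bound, as a hedged sketch continuing the hypothetical NumPy examples above (the signal and filter length are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

N = 4                                 # adaptive filter length
x = rng.standard_normal(5000)         # example input signal

signal_power = np.mean(x ** 2)        # estimate of E[x^2(k)]
mu_max = 1.0 / (N * signal_power)     # upper bound from the slide
mu = 0.1 * mu_max                     # in practice choose mu well inside the bound
print(mu_max, mu)
```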
Kalman Filtering I 6.18

• Consider the linear discrete-time dynamic system below.

• It consists of two components:

• A process with a state q ( n ) which evolves through time

• A measurement of the process at each point in time


[Figure: process and measurement model - the process noise v1(n) drives the state q(n+1) through a delay z^-1 with feedback F(n+1, n) to give q(n); the measurement matrix C(n) and additive noise v2(n) produce the measurement y(n).]

q(n+1) = F(n+1, n) q(n) + v1(n)            y(n) = C(n) q(n) + v2(n)

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The system shown in the previous slide is formed as a state-space representation. That is, it is described in
terms of a set of state variables which is sufficient to represent the entire state of the system and how
those variables evolve through time.

The state vector of the system under consideration is denoted as q(n) and is a column vector containing M
state variables. The transition from the current state q(n) to the next state q(n+1) is described by the following
equation:

q(n+1) = F(n+1, n) q(n) + v1(n)

(In dimensions: M = (M x M)(M) + (M). Note that for column and row vectors only the length is given.)

where F(n+1, n) is called the state transition matrix and v1(n) is the process noise (assumed to be white).

The observation vector y(n) is an N-by-1 column vector containing measurements of the process at time n.
y(n) is related to the state of the process by the following equation:

y(n) = C(n) q(n) + v2(n)

(In dimensions: N = (N x M)(M) + (N).)

where C(n) is called the measurement matrix and v2(n) is the measurement noise (also assumed to be white).
Kalman Filtering II 6.19

• Using the entire set of observed data y(1), y(2), ..., y(n), obtain the
minimum mean-square estimate of the state vector q(n).

• It is assumed that the following are known:

• state transition matrix F(n+1, n)

• measurement matrix C(n)

• process noise corr. matrix Q1(n) = E{ v1(n) v1^H(n) }

• measurement noise corr. matrix Q2(n) = E{ v2(n) v2^H(n) }

• The minimum mean-square estimate of q(n) given the observation
vectors from y(1) to y(k) is denoted as q̂(n|k).

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The Kalman filter is summarised by the following equations.

G(n) = F(n+1, n) K(n, n-1) C^H(n) [ C(n) K(n, n-1) C^H(n) + Q2(n) ]^-1

α(n) = y(n) - C(n) q̂(n|n-1)

q̂(n+1|n) = F(n+1, n) q̂(n|n-1) + G(n) α(n)

K(n) = K(n, n-1) - F(n, n+1) G(n) C(n) K(n, n-1)

K(n+1, n) = F(n+1, n) K(n) F^H(n+1, n) + Q1(n)

q̂(n|n) = F(n, n+1) q̂(n+1|n)

For a more in depth description and derivation of the algorithm please see:

S. Haykin. Adaptive Filter Theory, 3rd Edition. Prentice Hall, 1996.
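A minimal sketch of one iteration of these equations in Python/NumPy is given below (illustrative only, not from the notes). It assumes a time-invariant, invertible transition matrix, so that F(n, n+1) can be taken as F^-1; all matrices are supplied by the caller.

```python
import numpy as np

def kalman_step(q_pred, K_pred, y, F, C, Q1, Q2):
    """One iteration of the Kalman recursion on the slide.

    q_pred : predicted state estimate q_hat(n | n-1)
    K_pred : predicted error correlation matrix K(n, n-1)
    Returns (q_hat(n | n), q_hat(n+1 | n), K(n+1, n)).
    """
    S = C @ K_pred @ C.conj().T + Q2                     # innovation correlation
    G = F @ K_pred @ C.conj().T @ np.linalg.inv(S)       # gain G(n)
    alpha = y - C @ q_pred                               # innovation alpha(n)
    q_next = F @ q_pred + G @ alpha                      # q_hat(n+1 | n)
    K_filt = K_pred - np.linalg.inv(F) @ G @ C @ K_pred  # K(n), with F(n, n+1) taken as F^-1
    K_next = F @ K_filt @ F.conj().T + Q1                # K(n+1, n)
    q_filt = np.linalg.inv(F) @ q_next                   # q_hat(n | n)
    return q_filt, q_next, K_next
```

In use, the function is called once per measurement y(n), feeding q_next and K_next back in as the predictions for the next time step.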


State-space Formulation of RLS I 6.20

• RLS may be expressed from a Kalman filtering point of view

• This provides a way to formulate RLS as a state-space problem

• The first step is to model the process which we are measuring; that is
we must model the unknown system.

• A tapped delay line model is appropriate irrespective of the filter
configuration:

[Figure: tapped delay line model of the unknown system - x(n), x(n-1), ..., x(n-N+1) are weighted by wopt,0, wopt,1, ..., wopt,N-1 and summed, together with the additive noise eopt(n), to form d(n).]

• { wopt,k } are the optimal filter coefficients and eopt(n) is additive noise.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The tapped delay line model shown in the previous slide appears to treat the filter as if it is operating in system
identification mode. However this model applies to any configuration of an adaptive filter. The model simply
describes the relationship between the input signal to the adaptive filter and the desired signal in terms of the
optimal filter weights. Any discrepancies between the tapped delay line model and the actual system under
consideration are represented by the noise term eopt(n).

This is illustrated in the diagram below which shows an adaptive filter configured for inverse system
identification. The relationship between x(k) and d(k) may be modelled using the tapped delay line with an
additive noise term.

[Figure: the inverse system identification configuration (unknown system followed by the adaptive filter, with a delayed s(k) forming d(k)) is equivalent to a tapped delay line "model of unknown system" between x(k) and d(k).]
State-space Formulation of RLS II 6.21

• The linear model and state space models are shown below.

[Figure: linear model - x(n) passes through wopt and the additive noise eopt(n) to give d(n). State-space model - the state q(n) evolves through a delay with gain λ^(-1/2); the measurement is y(n) = x^H(n) q(n) + v(n).]

q(n+1) = λ^(-1/2) q(n),   q(0) = wopt,   ∴ q(n) = λ^(-n/2) wopt

v(n) = λ^(-n/2) eopt*(n)

y(n) = x^H(n) q(n) + v(n) = λ^(-n/2) d*(n)

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Adaptive DSP mapped to FPGAs 6.22

• This section explores the use of adaptive filters in FPGAs.

• A review of the signal flow graphs and update equations for the simple
LMS is first given.

• The non-canonical or transpose LMS and pipelined versions of the
LMS will be considered.

• The (often ignored!) differences between standard and transpose
LMS filters will be investigated and highlighted.

• Complex LMS algorithms and implementations will be presented for
adaptive equalisation architectures.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
LMS Hardware Implementation 6.23

• Considering the LMS weight update equation, the hardware LMS
implementation is:

w(k) = w(k-1) + 2µe(k)x(k)            e(k) = d(k) - w^T(k-1)x(k)

[Figure: LMS hardware structure - the tapped delay line on x(k) feeds both the FIR filter (weights w0(k-1), w1(k-1), w2(k-1), w3(k-1), summed to give y(k)) and the weight update datapath driven by the error formed against d(k).]

• Retiming/pipelining this structure is desired but not simple.


August 2005, For Academic Use Only, All Rights Reserved
Notes:
The hardware implementation shown here is based on the LMS weight update equation. It is common to find
this structure also represented in a slightly different format, where the FIR filtering and weight update operations
are represented as different blocks. This structure is often called the serial LMS (SLMS).

The standard weight update can be written element by element as:

   w0            w0                   x(k)
   w1       =    w1       + 2µe(k)    x(k-1)
   w2            w2                   x(k-2)
   w3  k         w3  k-1              x(k-3)  k

or in vector form:

w(k) = w(k-1) + 2µe(k)x(k)

where

w(k) = [ w0  w1  w2  w3 ] at time k    and    x(k) = [ x(k)  x(k-1)  x(k-2)  x(k-3) ] at time k
Complex LMS 6.24

• In many digital communications systems we require a complex
arithmetic LMS algorithm:

w(k) = w(k-1) + 2µe*(k)x(k)

e(k) = d(k) - w^H(k-1)x(k)

• e*(k) is the complex conjugate of e(k)

• w^H(k) is the Hermitian (transpose conjugate) of w(k)

• A complex LMS might be required for applications such as equalisers
for quadrature communication systems

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The equations for the complex LMS algorithm are very similar to the real case. The only differences are the
complex conjugate introduced on the error signal when updating the filter weights:

w(k) = w(k-1) + 2µe*(k)x(k)

and the conjugate transpose of the weight vector when calculating the filter output:

e(k) = d(k) - w^H(k-1)x(k)

The conjugate of the transpose of a vector is known as the Hermitian of the vector:

( w^T(k) )* = w^H(k)
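As an illustrative sketch only (the complex channel taps and step size are assumptions), the real LMS loop shown earlier carries over to the complex case with exactly the two changes above:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 4
h = np.array([0.9 + 0.2j, -0.3 + 0.1j, 0.1 - 0.05j, 0.05j])   # hypothetical complex channel
K = 5000
x = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)   # complex input x(k)
d = np.convolve(x, h)[:K]                                      # desired signal d(k)

mu = 0.01
w = np.zeros(N, dtype=complex)

for k in range(N - 1, K):
    xk = x[k - N + 1:k + 1][::-1]          # x(k) vector
    e = d[k] - np.conj(w) @ xk             # e(k) = d(k) - w^H(k-1) x(k)
    w = w + 2 * mu * np.conj(e) * xk       # w(k) = w(k-1) + 2 mu e*(k) x(k)

# Because the output is formed as y(k) = w^H x(k), the weights converge to conj(h)
print(w, np.conj(h))
```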
Complex Baseband Modelling 6.25

[Figure: a quadrature (I/Q) modulator with carriers cos(2πfc t) and sin(2πfc t), a multipath channel, and a demodulator with carrier phase error θ (carriers cos(2πfc t + θ) and sin(2πfc t + θ)) followed by lowpass filters, mapping f1(t) + jf2(t) to z1(t) + jz2(t). The "equivalent" baseband model is a complex FIR filter with weights a1+jb1, a2+jb2, ..., an+jbn mapping f1(t) + jf2(t) to z1(t) + jz2(t).]
August 2005, For Academic Use Only, All Rights Reserved


Notes:

Assuming the channel characteristic is multipath, and there is a constant phase error in the local oscillator, then
the QAM scheme can be modelled at baseband with a complex FIR filter, i.e. complex inputs and complex
outputs, and complex weights.

For more “advanced” channels we can extend the model to introduce time varying complex filter weights and
therefore produce models for slow and fast fading channels.
Complex System Model 6.26

• Baseband equivalent of a quadrature system:

[Figure: baseband equivalent of a quadrature system - the I and Q data streams pass through real tx filters, the complex channel filter (with additive noise on each rail), real rx filters, and finally the complex equaliser filter to recover data I and data Q. The whole path can be treated as a complex signal path.]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The model presented in this slide shows the use of a complex equaliser in a communication system. As it has
been shown in previous slides, the quadrature modulator, the channel and the demodulator can be represented
as a baseband complex equivalent filter. The effect of this channel will be to change the transmitted signal. Two
main effects of this channel can be appreciated: one is the introduction of multipath, the other the changes in
phase and amplitude of the transmitted signal in each of the multipath components.

Consider a channel with a single multipath component. This can be represented as a complex FIR filter with a
single coefficient, wr + jwi:

input signal:     xr + jxi  =  Ax e^(jφx)
coefficient:      wr + jwi  =  Aw e^(jφw)
output signal:    yr + jyi  =  Ay e^(jφy)  =  ( xr + jxi )( wr + jwi )*  =  ( Ax e^(jφx) )( Aw e^(-jφw) )  =  ( Ax Aw ) e^(j(φx - φw))

i.e. the channel tap produces a change in amplitude and a change in phase of the transmitted signal.

Note that the complex signals used (input, output, coefficient) can be represented in two formats: rectangular
and polar. With the latter it is easy to observe the change in amplitude and phase produced in the channel. A
complex equaliser should be able to compensate for these modifications introduced in the channel.
Complex Equaliser in Real Systems 6.27

• Use of equaliser in a real system:

[Figure: use of the complex equaliser in a real (passband) system - the I and Q data pass through real tx filters, are modulated onto cos(2πfc t) and sin(2πfc t), pass through the channel with additive noise, are demodulated with carriers cos(2πfc t + θ) and sin(2πfc t + θ), rx filtered, and then processed by the complex equaliser filter to recover data I and data Q.]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The use of a complex equaliser in a real system is shown in this slide. This system is equivalent to the one
presented in Slide 6.26.

A number of SystemView equalisation examples will be presented to demonstrate the complex FIR in action
for a genuine multipath channel.
RLS Techniques 6.28

• As discussed before, RLS (recursive least squares) adaptive filters
are known to adapt faster, and to a smaller final error, than LMS.

.....so why are we still using LMS?

• RLS based techniques are computationally expensive; for filter length N:

LMS: O(N) MACs/s        RLS: O(N^2) MACs/s

• RLS also requires DIVIDES! Divides are, by fundamental calculation,
more expensive and slower than multiplies. DSP processors
traditionally are slow at implementing divides.

• RLS computation can be susceptible to numerical overflow and
underflow. If an underflow occurs (i.e. a small number rounds down to zero)
and this result is used for a divide, we have a fatal divide by zero.

• However the QR version of RLS addresses some of these problems!

August 2005, For Academic Use Only, All Rights Reserved


Notes:
In the next few slides we will derive LS then RLS. Generally the RLS comes with some numerical problems and,
where possible, the implementation technique of choice is the QR-RLS. This has better numerical
properties; however it has a much higher level of computation, requiring both square roots and divides.

QR is a computation based on using a data matrix rather than a correlation matrix. Hence the dynamic range,
and therefore the computational wordlength, is smaller.

QR is a numerically robust algorithm.

But QR requires not only O(N^2) MACs/s, it also requires many divides AND square roots.

However we must take note that computational processing power in DSP processors increases at the
same rate as Moore's Law (doubling every two years), and if you are working on FPGAs then square roots and
divides can be dealt with adequately. They are expensive, but no longer prohibitive.

So we will see a lot more adaptive systems using QR and similar techniques.

Some SystemView examples...


Least Squares Solution 6.29

• Consider the least squares solution for x(k) and d(k):

[Figure: the adaptive FIR digital filter with weights w produces y(k) from the input signal x(k); the error signal is e(k) = d(k) - y(k).]

• Minimise the total sum of squared errors for all input signals up to and
including time, k. The total squared error, v(k), is:

v(k) = Σ_{s=0}^{k} [ e(s) ]^2 = e^2(0) + e^2(1) + e^2(2) + ... + e^2(k) = ek^T ek

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Using vector notation:

       e(0)        d(0)        y(0)
       e(1)        d(1)        y(1)
ek  =  e(2)    =   d(2)    -   y(2)     =  dk - yk
       :           :           :
       e(k-1)      d(k-1)      y(k-1)
       e(k)        d(k)        y(k)

Noting that the output of the N weight adaptive FIR digital filter is given by:

y(k) = Σ_{n=0}^{N-1} wn x(k-n) = w^T xk = xk^T w

where

w = [ w0, w1, w2, ..., wN-1 ]   and

xk = [ x(k), x(k-1), x(k-2), ..., x(k-N+1) ]
Least Squares Gradient Vector 6.30

• The total sum of squared errors can then be written as:

v(k) = ek^T ek = || ek ||_2^2
     = [ dk - Xk w ]^T [ dk - Xk w ]
     = dk^T dk + w^T Xk^T Xk w - 2 dk^T Xk w

• This equation is quadratic in w and we can find the minimum value of v(k)
by finding where the gradient vector is zero:

∂v(k)/∂w = 0

• We want a "good" solution such that the 2-norm of the error vector, ek, is
minimised. The function v(k) is an up-facing hyperparaboloid when
plotted in N+1 dimensional space, and there exists exactly one
minimum point at the bottom of the hyperparaboloid.

August 2005, For Academic Use Only, All Rights Reserved


Notes:

       e(0)          x0^T w               x0^T
       e(1)          x1^T w               x1^T
ek  =  e(2)   = dk - x2^T w   =   dk  -   x2^T    w
       :             :                    :
       e(k-1)        xk-1^T w             xk-1^T
       e(k)          xk^T w               xk^T

            x(0)     0        0      ...  0                w0
            x(1)     x(0)     0      ...  0                w1
    = dk -  x(2)     x(1)     x(0)   ...  0                w2
            :        :        :      ...  :                :
            x(k-1)   x(k-2)   x(k-3) ...  x(k-N)           wN-1
            x(k)     x(k-1)   x(k-2) ...  x(k-N+1)

i.e. ek = dk - Xk w

where Xk is a (k+1) x N data matrix made up from input signal samples. Note that the first N rows of Xk are
sparse.
Least Squares Equation 6.31

• The gradient vector is therefore:

∂v(k)/∂w = 2 Xk^T Xk w - 2 Xk^T dk = -2 Xk^T [ dk - Xk w ]

and therefore:

-2 Xk^T [ dk - Xk wLS ] = 0
⇒ Xk^T Xk wLS = Xk^T dk

• The least squares solution, denoted as wLS and based on data
received up to and including time, k, is given as:

wLS = [ Xk^T Xk ]^-1 Xk^T dk

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Note that because [ Xk^T Xk ] is a symmetric square matrix, then [ Xk^T Xk ]^-1 is also a symmetric square matrix. As
with any linear algebraic manipulation a useful check is to confirm that the matrix dimensions are compatible,
thus ensuring that wLS is an N x 1 vector:

[Figure: dimension check - w (N x 1) = [ X^T X ]^-1 (N x N) times X^T (N x (k+1)) times d ((k+1) x 1).]

Note that in the special case where Xk is a square non-singular matrix, the above simplifies to:

wLS = Xk^-1 Xk^-T Xk^T dk = Xk^-1 dk                                                                    (1)

The computation to calculate least squares requires about O(N^3) MACs (multiply/accumulates) and O(N)
divides for the matrix inversion, and O( (k+1) x N^2 ) MACs for the matrix multiplications. Clearly therefore, the
more data that is available, the more computation is required.
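A small sketch of this block least squares solution (Python/NumPy, illustrative only; the data matrix construction and the test system are assumptions), using a least squares solver rather than forming [Xk^T Xk]^-1 explicitly:

```python
import numpy as np

rng = np.random.default_rng(4)

N = 4                                   # filter length
K = 200                                 # number of samples
h = np.array([0.8, -0.4, 0.2, 0.1])     # hypothetical unknown system
x = rng.standard_normal(K)
d = np.convolve(x, h)[:K]

# Build the (K x N) data matrix Xk: row k is [x(k), x(k-1), ..., x(k-N+1)], with leading zeros
X = np.zeros((K, N))
for n in range(N):
    X[n:, n] = x[:K - n]

# w_LS = [X^T X]^-1 X^T d, computed via a numerically safer least squares solve
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
print(w_ls)                             # close to h
```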
Least Squares Computation 6.32

• At time iteration k+1, the least squares weight vector solution is:

wk+1 = [ Xk^T Xk ]^-1 Xk^T dk

[Figure: adaptive FIR filter with weights w0 ... wN-1 producing y(k) from x(k), with the error e(k) formed against d(k).]

• To calculate this we require a matrix inversion, which requires divisions.

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Note however that at time k+1, when a new data sample arrives at both the input, x(k+1), and the desired
input, d(k+1), this new information should ideally be incorporated in the least squares solution with a view
to obtaining an improved solution. The new least squares filter weight vector to use at time k+2 (denoted as
wk+2) is clearly given by:

wk+2 = [ Xk+1^T Xk+1 ]^-1 Xk+1^T dk+1

This equation requires that another full matrix inversion is performed, [ Xk+1^T Xk+1 ]^-1, followed by the
appropriate matrix multiplications. This very high level of computation for every new data sample provides the
motivation for deriving the recursive least squares (RLS) algorithm. RLS has a much lower level of computation,
calculating wk+1 from the previous estimate wk.

Consider the situation where we have calculated wk from:

wk = [ Xk-1^T Xk-1 ]^-1 Xk-1^T dk-1 = Pk-1 Xk-1^T dk-1

where

Pk-1 = [ Xk-1^T Xk-1 ]^-1

When the new data samples, x(k) and d(k), arrive we have to recalculate the matrix equation (including
recalculating the matrix inverse):

wk+1 = [ Xk^T Xk ]^-1 Xk^T dk = Pk Xk^T dk

However note that Pk can be written in terms of the previous data matrix Xk-1 and the data vector xk by
partitioning the matrix Xk:
Recursive Least Squares 6.33

• It is unreasonable/impossible to recalculate the full least squares solution
every time the next samples d(k) and x(k) arrive.

• Hence we need to find a simplified set of equations where the
previous matrix inverse can be "updated":

wk+1 = [ Xk^T Xk ]^-1 Xk^T dk = Pk Xk^T dk

• Note that Pk can be written in terms of the previous data matrix Xk-1
and the data vector xk by partitioning the matrix Xk:

Pk = [ Xk^T Xk ]^-1 = [ [ Xk-1^T  xk ] [ Xk-1 ; xk^T ] ]^-1 = [ Xk-1^T Xk-1 + xk xk^T ]^-1
   = [ Pk-1^-1 + xk xk^T ]^-1

i.e. we have a recursion to calculate Xk^T Xk from Xk-1^T Xk-1.


August 2005, For Academic Use Only, All Rights Reserved
Notes:
Where, of course, xk = [ x(k), x(k-1), ..., x(k-N+1) ]. In order to write this equation in a more
suitable and calculable form we use the matrix inversion lemma, which states that:

[ A^-1 + BCD ]^-1 = A - AB[ C^-1 + DAB ]^-1 DA

where A is a non-singular matrix and B, C and D are appropriately dimensioned matrices. Using the matrix
inversion lemma with Pk-1 = A, xk = B, xk^T = D and C the 1 x 1 identity matrix (i.e. the scalar 1), then:

Pk = Pk-1 - Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 xk^T Pk-1

This equation implies that if we know the matrix [ Xk-1^T Xk-1 ]^-1 then the matrix [ Xk^T Xk ]^-1 can be computed
without explicitly performing a complete matrix inversion from first principles. This, of course, saves
computation effort. The above two equations are one form of the RLS algorithm. By additional algebraic
manipulation, the computational complexity can be simplified even further.

Substituting, partitioning the vector dk, and simplifying gives:

wk+1 = [ Pk-1 - Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 xk^T Pk-1 ] Xk^T dk

     = [ Pk-1 - Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 xk^T Pk-1 ] [ Xk-1 ; xk^T ]^T [ dk-1 ; d(k) ]           (2)

     = [ Pk-1 - Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 xk^T Pk-1 ] ( Xk-1^T dk-1 + xk d(k) )
Recursive Least Squares 6.34

• The RLS requires O(N^2) MACs and one divide on each iteration.

[Figure: adaptive FIR filter with weights w0 ... wN-1 producing y(k) from x(k), with the error e(k) formed against d(k) and used in the RLS update.]

wk+1 = wk + mk e(k)

mk = Pk-1 xk / [ 1 + xk^T Pk-1 xk ]

Pk = Pk-1 - mk xk^T Pk-1

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Using the substitution that wk = Pk-1 Xk-1^T dk-1, and also dropping the time subscripts for notational
convenience (i.e. P = Pk-1, x = xk, the vector d = dk-1, and the scalar d = d(k)), further simplification can be
performed:

wk+1 = [ P - Px[ 1 + x^T Px ]^-1 x^T P ] ( X^T d + xd )

     = PX^T d + Pxd - Px[ 1 + x^T Px ]^-1 x^T PX^T d - Px[ 1 + x^T Px ]^-1 x^T Pxd
     = wk - Px[ 1 + x^T Px ]^-1 x^T wk + Pxd - Px[ 1 + x^T Px ]^-1 x^T Pxd
     = wk - Px[ 1 + x^T Px ]^-1 x^T wk + Pxd[ 1 - [ 1 + x^T Px ]^-1 x^T Px ]                            (3)
     = wk - Px[ 1 + x^T Px ]^-1 x^T wk + Px[ 1 + x^T Px ]^-1 [ [ 1 + x^T Px ] - x^T Px ]d
     = wk - Px[ 1 + x^T Px ]^-1 x^T wk + Px[ 1 + x^T Px ]^-1 d
     = wk + Px[ 1 + x^T Px ]^-1 ( d - x^T wk )

and reintroducing the subscripts, and noting that y(k) = xk^T wk:

wk+1 = wk + Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 ( d(k) - y(k) )
     = wk + mk ( d(k) - y(k) )                                                                          (4)
     = wk + mk e(k)

where mk = Pk-1 xk [ 1 + xk^T Pk-1 xk ]^-1 and is called the gain vector.

The RLS adaptive filtering algorithm therefore requires that at each time step the vector mk and the matrix Pk
are computed. The filter weights are then updated using the error output, e(k). The block diagram for the closed
loop RLS adaptive FIR filter is as shown in the slide above.
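The recursion is compact enough to state directly in code. The following Python/NumPy sketch of the update on the slide is illustrative only (the initialisation P(0) = δI and the test signals are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

N = 4
h = np.array([0.8, -0.4, 0.2, 0.1])               # hypothetical unknown system
K = 500
x = rng.standard_normal(K)
d = np.convolve(x, h)[:K]

delta = 100.0                                     # assumed large initial value
P = delta * np.eye(N)                             # P(0): stands in for [X^T X]^-1 before data arrives
w = np.zeros(N)

for k in range(N - 1, K):
    xk = x[k - N + 1:k + 1][::-1]                 # x_k = [x(k), ..., x(k-N+1)]
    e = d[k] - w @ xk                             # e(k) = d(k) - y(k)
    m = P @ xk / (1.0 + xk @ P @ xk)              # gain vector m_k (the single divide)
    w = w + m * e                                 # w_{k+1} = w_k + m_k e(k)
    P = P - np.outer(m, xk) @ P                   # P_k = P_{k-1} - m_k x_k^T P_{k-1}

print(w)                                          # converges quickly towards h
```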
Exponentially Weighted RLS 6.35

• The RLS algorithm calculates the least squares vector at time k based
on all previous data, i.e. data from long ago is given as much relevance
as recently received data.

• Therefore the RLS algorithm has infinite memory.

• To overcome this, each error sample is weighted using a forgetting
factor constant λ which is just less than 1:

v(k) = Σ_{s=0}^{k} λ^(k-s) [ e(s) ]^2 = λ^k e^2(0) + λ^(k-1) e^2(1) + ... + e^2(k)

• The exponentially weighted RLS is then derived from:

wLS = [ Xk^T Λk Xk ]^-1 Xk^T Λk dk

August 2005, For Academic Use Only, All Rights Reserved


Notes:
For example, if a forgetting factor of 0.9 was chosen, then data which is 100 time iterations old is pre-multiplied
by 0.9^100 = 2.6561 x 10^-5 and thus considerably de-emphasised compared to the current data. In dB terms,
data that is more than 100 time iterations old is attenuated by 10 log10( 2.6561 x 10^-5 ) ≈ -46 dB. Data that is
more than 200 time iterations old is therefore attenuated by around 92 dB, and if the input data were 16 bit fixed
point, corresponding to a dynamic range of 96 dB, then the old data is on the verge of being completely forgotten
about. The forgetting factor is typically a value of between 0.9 and 0.9999.

Noting the form of the above equation we can rewrite it as:

v(k) = ek^T Λk ek

where Λk is a (k+1) x (k+1) diagonal matrix, Λk = diag[ λ^k, λ^(k-1), λ^(k-2), ..., λ, 1 ].

Therefore:

v(k) = [ dk - Xk w ]^T Λk [ dk - Xk w ]
     = dk^T Λk dk + w^T Xk^T Λk Xk w - 2 dk^T Λk Xk w

Following the same procedure as before, the least squares solution is easily found to be:

wLS = [ Xk^T Λk Xk ]^-1 Xk^T Λk dk
Exponentially Weighted Recursive Least Squares 6.36

• The exponentially weighted RLS also requires O(N^2) MACs and one
divide on each iteration.

[Figure: adaptive FIR filter with weights w0 ... wN-1 producing y(k) from x(k), with the error e(k) formed against d(k) and used in the update.]

wk+1 = wk + mk e(k)

mk = Pk-1 xk / [ λ + xk^T Pk-1 xk ]

Pk = ( Pk-1 - mk xk^T Pk-1 ) / λ

August 2005, For Academic Use Only, All Rights Reserved


Notes:
The RLS algorithm is well known to have numerical integrity problems, and will invariably require floating point
arithmetic.

There exist many "fast" forms of the RLS, such as the FTF (fast transversal filter); however, in general these "fast"
implementations, while reducing computation, invariably have numerical integrity and stability problems.

There are two key differences from LMS: first, the RLS requires a higher level of computation, and second, it
also requires divisions.

But just to recap - why do we want RLS? It can train faster than LMS, and can invariably produce better final
MMSEs.
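Relative to the RLS sketch given earlier, only the gain and P updates change for the exponentially weighted version; a hedged one-step illustration (λ here is an assumed value):

```python
import numpy as np

def ewrls_step(w, P, xk, dk, lam=0.99):
    """One exponentially weighted RLS update; lam is an assumed forgetting factor."""
    e = dk - w @ xk                               # a priori error e(k)
    m = P @ xk / (lam + xk @ P @ xk)              # gain vector, with lambda now in the denominator
    w = w + m * e                                 # weight update
    P = (P - np.outer(m, xk) @ P) / lam           # P update, divided by lambda
    return w, P, e
```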
QR-RLS Algorithm 6.37

• The aim of LS is to find the set of filter weights w which minimise the total sum
of squared errors:

v(k) = Σ_{s=0}^{k} λ^(k-s) [ e(s) ]^2,        0 < λ ≤ 1

• QR is an alternative way of calculating the least squares solution, wLS,
for an N weight FIR filter.

[Figure: the adaptive FIR digital filter with weights w produces y(k) from the input signal x(k); the error signal is e(k) = d(k) - y(k).]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
Different algorithms have different performance criteria to be minimised. The LMS algorithm minimises the
mean squared error:

E[ e^2[k] ] ≈ ( 1 / (M2 - M1) ) Σ_{k=M1}^{M2-1} e^2[k]        for large ( M2 - M1 )

while RLS type algorithms (like the QR-RLS) attempt to minimise the sum of the squares of past errors:

v[k] = Σ_{s=0}^{k} e^2[s]

The previous equation applies in the real case (working with real numbers). When complex values are involved
the equation above can be expressed as:

v[k] = Σ_{s=0}^{k} e[s] e*[s]

where ( . )* denotes the complex conjugate of a complex value. Therefore e[k-n] e*[k-n] = | e[k-n] |^2,
which is the norm (square of the modulus) of e[k-n].
QR-LS Solution 6.38

• As for RLS, find the set of filter weights w which minimise:

v[k] = || e[k] ||^2 = || dk - Xk w ||^2

• Recall the least squares solution:

wLS = [ Xk^T Xk ]^-1 Xk^T dk

[Figure: dimension check - w (N x 1) = [ X^T X ]^-1 (N x N) times X^T (N x (k+1)) times d ((k+1) x 1).]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
QR Decomposition 6.39

• The QR matrix decomposition is an extremely useful technique in least
squares signal processing systems, where a full rank m x n data matrix
Xk ( m > n ) is decomposed into an orthogonal matrix Q and an upper
triangular matrix R:

Xk = QR

[Figure: the k x N data matrix X is factored into a k x k orthogonal matrix Q (with Q^T Q = I) and a k x N matrix R whose top N x N block is upper triangular and whose remaining rows are zero.]

August 2005, For Academic Use Only, All Rights Reserved


Notes:
If the least squares solution is required for the overdetermined linear set of equations:

Ax = b                                                                                                  (5)

where A is an m x n matrix, b is a known m element vector, and x is an unknown n element vector, then the
minimum norm solution is required, i.e. minimise ε, where ε = || Ax - b ||_2. This can be found by the least
squares solution:

xLS = ( A^T A )^-1 A^T b                                                                                (6)
QR Solution 6.40

• Recall the least squares solution:

wLS = [ Xk^T Xk ]^-1 Xk^T dk

• Substituting Xk = QR (and dropping the subscript k for convenience):

wLS = [ (QR)^T (QR) ]^-1 (QR)^T dk
    = [ R^T Q^T Q R ]^-1 R^T Q^T dk
    = [ R^T R ]^-1 R^T Q^T dk
    = R^-1 R^-T R^T Q^T dk
    = R^-1 Q^T dk
wLS = R^-1 dk'
August 2005, For Academic Use Only, All Rights Reserved


Notes:
This last equation can be solved by backsubstitution. Suppose an upper triangular system of linear equations:

          r11  ...  r1,N-2    r1,N-1    r1N        w1          d1
          :         :         :         :          :           :
Rw = d ⇒  0    ...  rN-2,N-2  rN-2,N-1  rN-2,N     wN-2    =   dN-2
          0    ...  0         rN-1,N-1  rN-1,N     wN-1        dN-1
          0    ...  0         0         rNN        wN          dN

has to be solved for the weight vector w, where R is an N x N non-singular upper triangular matrix. The last
element of the unknown vector w can be calculated from multiplication of the last row of R with the vector w:

rNN wN = dN   ⇒   wN = dN / rNN

The second last element can therefore be calculated from multiplication of the second last row of R with the
vector w and substitution of wN:

rN-1,N-1 wN-1 + rN-1,N wN = dN-1   ⇒   wN-1 = ( dN-1 - rN-1,N ( dN / rNN ) ) / rN-1,N-1

In general it can be shown that all elements of w can be calculated recursively from:

wi = ( di - Σ_{j=i+1}^{N} rij wj ) / rii
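A short Python/NumPy sketch of this backsubstitution recursion (illustrative; the triangular system is an arbitrary example):

```python
import numpy as np

def back_substitute(R, d):
    """Solve R w = d for w, where R is upper triangular and non-singular."""
    N = R.shape[0]
    w = np.zeros(N)
    for i in range(N - 1, -1, -1):                     # work upwards from the last row
        w[i] = (d[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

# small example
R = np.array([[2.0, 1.0, -1.0],
              [0.0, 3.0,  0.5],
              [0.0, 0.0,  4.0]])
d = np.array([1.0, 2.0, 8.0])
print(back_substitute(R, d))          # matches np.linalg.solve(R, d)
```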
QR Decomposition: Givens Rotations I 6.41

• The QR matrix decomposition can be calculated using Givens rotations.

• Consider the example:

    1  5   9           -0.27  -0.51  -0.82      -3.74    1.07  -16.57
X = 2  6  10  = QR  =   -0.53  -0.63   0.56          0  -10.43   -4.38
    3 -7  11            -0.80   0.57  -0.10          0       0   -2.87

• Since Q is orthogonal, Q^T X = R. The aim is to find Q^T: a series of
transformations which zero the lower half of X to produce R:

    1  5   9       2.24  7.60  12.97       3.74   -1.07  16.57       3.74    -1.07  16.57
    2  6  10   →      0 -1.79  -3.58   →      0   -1.79  -3.58   →      0   -10.43  -4.38
    3 -7  11          3    -7     11          0  -10.28  -3.82          0        0   2.87

(each arrow is one Givens rotation; at each step one element below the diagonal is zeroed, and other elements change their values)
August 2005, For Academic Use Only, All Rights Reserved


Notes:
A Givens rotation is a technique to zero selected elements in a matrix. Other methods like Householder or Gram-
Schmidt are also available, but the order in which the Givens rotations zero the elements of a matrix makes them
quite attractive for the type of application considered here. A Givens rotation based QR decomposition leads to a
simple implementation. Moreover it is a numerically stable method.

Consider the following 2 x 2 example where matrix B is to be made upper triangular, so element B10 must be
zeroed:

B = [ B00  B01 ]
    [ B10  B11 ]

The Givens method applies a rotation by θ radians in the following way:

[  cos θ   sin θ ] [ B00  B01 ]   =   [ B'00  B'01 ]
[ -sin θ   cos θ ] [ B10  B11 ]       [  0    B'11 ]

This is satisfied if -B00 sin θ + B10 cos θ = 0, which can be achieved if tan θ = B10 / B00. This solution can also
be expressed as:

cos θ = 1 / sqrt( 1 + ( B10 / B00 )^2 )

sin θ = ( B10 / B00 ) / sqrt( 1 + ( B10 / B00 )^2 )
QR Decomposition: Givens Rotations II 6.42

• A series of Givens rotations is applied to zero the required elements
and produce the upper triangular matrix R.

• With the previous example:

    3.74   -1.07  16.57                    1  5   9
R =    0  -10.43  -4.38   =   G3 G2 G1     2  6  10    =   Q^T X
       0       0   2.87                    3 -7  11

• Q^T is composed of a series of Givens rotations G1, G2, G3, etc., each of which
zeroes one of the desired elements.

• Each Givens rotation requires divisions and square roots.
August 2005, For Academic Use Only, All Rights Reserved


Notes:
Previously we saw Givens rotations for the 2 x 2 case, but they can also be applied to any order. For the
example under consideration the following rotations are calculated:

      cos θ   sin θ   0         0.45   0.89   0
G1 = -sin θ   cos θ   0    =   -0.89   0.45   0
        0       0     1            0      0   1

      cos θ   0   sin θ         0.60   0   0.80
G2 =    0     1     0      =       0   1      0
     -sin θ   0   cos θ        -0.80   0   0.60

      1     0       0           1      0      0
G3 =  0   cos θ   sin θ    =    0   0.17   0.98
      0  -sin θ   cos θ         0  -0.98   0.17

        0.45   0.89   0      1  5   9         2.24   7.60  12.97
G1 X = -0.89   0.45   0      2  6  10    =       0  -1.79  -3.58
           0      0   1      3 -7  11            3     -7     11

           0.60   0   0.80      2.24   7.60  12.97         3.74   -1.07  16.57
G2 G1 X =     0   1      0         0  -1.79  -3.58    =       0   -1.79  -3.58
          -0.80   0   0.60         3     -7     11            0  -10.28  -3.82

              1      0      0      3.74   -1.07  16.57         3.74    -1.07  16.57
G3 G2 G1 X =  0   0.17   0.98         0   -1.79  -3.58    =       0   -10.43  -4.38   =   R
              0  -0.98   0.17         0  -10.28  -3.82            0        0   2.87
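For reference, an illustrative Python/NumPy sketch of QR by Givens rotations (not from the notes), which follows the same ordering as the worked example - zeroing element (1,0), then (2,0), then (2,1):

```python
import numpy as np

def givens_qr(X):
    """QR decomposition of X (m x n, m >= n) by Givens rotations; returns Q, R with X = Q R."""
    m, n = X.shape
    R = X.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):                       # zero the subdiagonal, column by column
        for i in range(j + 1, m):
            if R[i, j] == 0.0:
                continue
            r = np.hypot(R[j, j], R[i, j])   # sqrt(R[j,j]^2 + R[i,j]^2)
            c, s = R[j, j] / r, R[i, j] / r  # cos(theta), sin(theta)
            G = np.eye(m)
            G[[j, i], [j, i]] = c            # embed the 2x2 rotation in rows/cols j and i
            G[j, i], G[i, j] = s, -s
            R = G @ R                        # apply the rotation
            Q = Q @ G.T                      # accumulate Q = G1^T G2^T G3^T ...
    return Q, R

X = np.array([[1.0, 5.0, 9.0],
              [2.0, 6.0, 10.0],
              [3.0, -7.0, 11.0]])
Q, R = givens_qr(X)
print(np.round(R, 2))                        # upper triangular, matching the worked example
print(np.allclose(Q @ R, X))                 # True
```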
QR Algorithm: Hardware Implementation 6.43

• The QR-RLS algorithm can be implemented iteratively with a systolic
type array.

[Figure: triangular systolic array - boundary cells (BC) hold the diagonal elements R00, R11, R22 and generate the rotation angles; internal cells hold R01, R02, R12 and the column d0, d1, d2, and apply the rotations as x(k) and d(k) enter from the top. A small sub-figure shows the input pair (x, y) being rotated by θ onto the x axis, giving the new magnitude x'.]

Boundary cell operation (rotating the incoming y against the stored value x so that y' = 0):

x' = x cos θ + y sin θ = sqrt( x^2 + y^2 )
y' = y cos θ - x sin θ = 0

cos θ = x / sqrt( x^2 + y^2 )
sin θ = y / sqrt( x^2 + y^2 )
August 2005, For Academic Use Only, All Rights Reserved


Notes:
The QR-RLS algorithm can be formulated iteratively, i.e. it is updated with every received sample. In this case,
for convenience, the QR decomposition is formulated as X = Q [ R ; 0 ] (R stacked above a block of zeros).

Remembering that Q is orthogonal, i.e. Q^T Q = Q Q^T = I, we can write the minimisation criterion ξ = || e ||^2 as:

ξ = e^T e = e^T Q Q^T e = || Q^T e ||^2 = || Q^T d - Q^T X w ||^2

Considering the following equivalences:

Q^T d = [ p ; v ],        Q^T X w = [ R ; 0 ] w

the problem of minimising ξ is equivalent to minimising ξv = || p - R w ||^2 (the term involving v does not depend
on w), which implies solving the equation p = R w by backsubstitution. In the iterative case we introduce time
dependence with the subindex k. It can then be shown that the algorithm is composed of the following steps:

Step 1:   [ R[k] ; 0 ] = Q [ λ^(1/2) R[k-1] ; x^T[k] ]

Step 2:   [ p[k] ; γ ] = Q [ λ^(1/2) p[k-1] ; d[k] ]

Step 3:   p[k] = R[k] w[k]
QR-RLS Tri-Array 6.44

• The fully parallel QR array is essentially a triangular array of dimension
N, where N is the number of weights in the FIR filter.

• Typical operation would be to run the QR algorithm on a large number
of samples of x(k) and d(k) and then calculate w using backsubstitution:

[Figure: triangular QR-RLS array fed by x(k) and d(k), producing e(k). Each Givens generation (boundary) cell requires 2 multiplies/adds, 1 square root and 1 divide; each Givens rotation (internal) cell requires 2 multiplies/adds.]
August 2005, For Academic Use Only, All Rights Reserved


Notes:
McWhirter’s well known paper allows the “direct residual extraction” to produce e(k) from the diagonal boundary
cells. Note that although y(k) is not explicitly formed it can be calculated from:

y(k) = d(k) – e(k)

Note that the cost of the diagonal boundary cells means that the Givens array is somewhat imbalanced
computationally. The square root and divide will have a higher cost than the multiplies and adds.

After the QR array has operated on the incoming x(k) and d(k) data, the next step is to perform the
backsubstitution using the R matrix which is essentially held inside the QR array.

Note that for infinite precision arithmetic, both the direct RLS and the QR-RLS give exactly the same results.

The QR-RLS has better numerical integrity than the direct RLS when using fixed point numbers. One simple
way to see this is to consider having an N bit processor available. When performing RLS or direct least squares
we are using squared quantities of x(k), hence the significant wordlength of x(k) should be less than N/2 bits.
Whereas in the QR we are using the data quantities directly and working with orthogonal normalised transforms,
and could therefore have closer to N bits of resolution in the data x(k).
Conclusions 6.45

• In this section we have reviewed:

• LMS adaptive filtering signal flow graphs

• The transpose LMS algorithm

• Pipelining techniques for LMS

• RLS based processing using the QR

August 2005, For Academic Use Only, All Rights Reserved


Notes:
