Adaptive DSP
• Architectures
• Applications
• Algorithms
“The aim is to adapt the digital filter such that the input signal x(k) is
filtered to produce y(k) which when subtracted from desired signal d(k),
will minimise the power of the error signal e(k).”
[Figure: adaptive filter block diagram. The input signal x(k) drives an adaptive FIR digital filter; its output y(k) is subtracted from the desired signal d(k) to form the error signal e(k), which drives the adaptive algorithm.]
The naming of the signals as input, desired, output and error, denoted x(k), d(k), y(k) and e(k) respectively, is standard in most textbooks and papers and will be used in this presentation.
The arrow through the adaptive filter is standard notation to indicate that the filter is adaptive and that all of the
digital filter weights can be updated as some function of the error signal. As with any feedback system, stability
is a concern; hence care must be taken to ensure that any adaptive algorithm is stable.
At this stage of abstraction no information is given on the actual input signals, which could be anything: speech, music, digital data streams, vibration signals, predefined noise and so on.
Filter Type
The adaptive filter could be FIR (non-recursive), IIR (recursive) or even a non-linear filter. Most adaptive filters are FIR for reasons of algorithm stability and mathematical tractability. In the last few years, however, adaptive IIR filters have seen increasing use in stable forms in a number of real world applications (notably active noise control and ADPCM). Current research has highlighted a number of useful non-linear adaptive filters, such as Volterra filters and some forms of simple artificial neural networks.
Obviously the key aim of the adaptive filter is to minimise the error signal e(k). The success of this minimisation
will clearly depend on the nature of the input signals, the length of the adaptive filter, and the adaptive algorithm
used.
Architectures 6.3
[Figure: the four standard adaptive filter architectures: system identification (adaptive filter in parallel with an unknown system), inverse system identification (adaptive filter in series with an unknown system, with a delayed version of the input as d(k)), noise cancellation (a reference noise n'(k) adapted to cancel the noise n(k) corrupting s(k), so d(k) = s(k) + n(k)) and prediction (a delayed input as x(k)). In each case e(k) = d(k) - y(k).]
To simplify the figures, the ADCs and DACs are not explicitly shown.
In each architecture the aim of the general adaptive signal processor is the same - to minimise the error signal
e(k).
A particular application may have elements of more than one single architecture. For example, the set up below includes elements of system identification, inverse system identification, and noise cancellation. If the adaptive filter is successful in modelling "Unknown System 1" and inverse modelling "Unknown System 2", and s(k) is uncorrelated with r(k), then the error signal is likely to be e(k) ≈ s(k):
[Figure: combined architecture. The signal s(k) and the output of "Unknown System 1" (fed through a delay) combine to form d(k); r(k) passes through "Unknown System 2" to give x(k), the adaptive filter input. The adaptive filter output y(k) is subtracted from d(k), leaving e(k) ≈ s(k).]
Application Examples 6.4
• System Identification
• Noise Cancellation
• Prediction
• Radar (sample rates of the order of MHz)
Adaptive filtering has a tremendous range of applications from everyday equalisers on modems to less obvious
applications such as adaptive tracking filters for predicting the movement of human eyes when following a
moving stimulus! (research undertaken by S. Goodbody, MRC, London).
Channel Identification 6.5
[Figure: general adaptive signal processor for channel identification. A broadband (white noise) signal x(k) drives both the channel under test and the adaptive filter; the channel output is the desired signal d(k) and e(k) = d(k) - y(k).]
One traditional method of finding the impulse response of a room is to apply an impulse using "clappers" or a starting pistol, and record the result with a microphone and tape recorder. Improved impulse responses can be found by taking an ensemble average of around 20 impulses. To find the frequency response of the room, the Fourier transform of the impulse response is taken. This technique can however be difficult and time consuming in practice, and white noise correlation techniques are more likely to be used.
[Figure: the impulse response h(t) (time domain) and the frequency response H(f) (frequency domain) of the room are related by the DFT and IDFT.]
Calculating the impulse response is important to audio engineers working in applications such as architectural
acoustics, car interior acoustics, loudspeaker design, echo control systems, public address system design and
so on.
If the architecture in the previous slide does indeed adapt, then the error will become very small. Because the adaptive filter and the room were excited by the same signal, then over the frequency range of the input signal x(k), the adaptive filter will have the same impulse (and frequency) response as the room.
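To make the white noise correlation idea concrete, the following minimal Python/numpy sketch (the "room" impulse response h_room and the signal length M are purely illustrative choices, not from the notes) shows that, for a zero mean unit variance white noise input, a time averaged cross-correlation of input and output recovers the impulse response directly:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "room": a short FIR impulse response (illustration only).
h_room = np.array([1.0, 0.6, 0.3, -0.2, 0.1])

# Excite the room with white noise and record the output.
M = 100_000
x = rng.standard_normal(M)
d = np.convolve(x, h_room)[:M]

# For zero mean, unit variance white noise, E[x(k) d(k+n)] = h(n),
# so a time averaged cross-correlation estimates the impulse response.
N = len(h_room)
h_est = np.array([np.mean(x[:M - n] * d[n:]) for n in range(N)])
print(np.round(h_est, 2))   # approximately [1.0, 0.6, 0.3, -0.2, 0.1]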
Echo Cancellation 6.6
[Figure: adaptive line echo cancellation between London (speaker A, "Hello") and Paris (speaker B, "..morning"). The adaptive filter produces a simulated echo of A, y(k), which is subtracted (via DAC/ADC) from the returned signal (B plus echo of A, the desired signal d(k)), so that e(k) carries B's speech with the echo removed.]
If we can suitably model the echo generating path with the adaptive filter, then a negative simulated echo can
be added to cancel out the speaker A echo. At the other end of the line, telephone user B can also have an
echo canceller.
In general local echo cancellation (where the adaptive echo canceller is inside the consumer's telephone/data communication equipment) is only used for data transmission and not speech. Minimum specifications for the V series of modem recommendations can be found in the ITU (formerly CCITT) Blue Book. For V32 modems (9600 bits/sec with Trellis coded modulation) an echo reduction ratio of 52 dB is required; this is a power reduction of around 160,000 in the echo. Hence the requirement for a powerful DSP processor implementing an adaptive echo cancelling filter.
For long distance telephone calls where the round trip echo delay is more than 0.1 seconds and the echo is suppressed by less than 40 dB (typical via satellite or undersea cables), line echo on speech can be a particularly annoying problem. Before adaptive echo cancellers this problem was solved by setting up speech detectors and making the speech half duplex. This is inconvenient for speakers, who must take it in turns to speak. Adaptive echo cancellers at telephone exchanges have however helped to solve this problem.
To cancel both near-end and far-end echoes, an echo canceller is often implemented in two sections, one for the near end echo and one for the far end echo. Further information on telecommunication echo cancellers can be found in the textbooks referenced earlier.
Channel Equalisation 6.7
[Figure: channel equalisation. A training sequence s(k) is transmitted over the telephone channel (DAC, channel, ADC) to give x(k), which is passed through the adaptive filter; a delayed copy of the training sequence forms d(k), so that the channel plus equaliser approximates a "virtual wire", with e(k) = d(k) - y(k).]
The process of data channel equalisation is one of the most exploited areas of adaptive signal processing. Most
digital data communications (V32 modems, for example) use some form of data channel equaliser. In the last few years the availability of fast adaptive equalisers has led to modems capable of more than 28800 bits/s, with 115200 bits/s modems (also using data compression) on the horizon.
If the above telephone channel is a (stationary) communication channel with a continuous time impulse
response, then when symbols are transmitted the impulse response will cause a symbol to spread over many
time intervals, thus introducing intersymbol interference (ISI). The aim of a data equaliser is to remove this ISI.
Compared to simple channel equalisation, note that a data equaliser only needs to equalise the channel at the symbol sampling instants rather than over all time. Hence the problem can be posed with data symbols as inputs, rather than the raw stochastic data (as in the slide).
In general for channels where the impulse response changes slowly, a decision directed adaptive data
equaliser is used, whereby a slicer is used to produce a retraining signal. It is also worth noting that for many
data transmission systems, the data is complex, and hence a complex adaptive algorithm is required.
Further Information
R. Lucky's paper on adaptive equalisation (Bell Syst. Tech. J., Vol. 45, 1966) defined the LMS algorithm for equalisation of digital communications and is still very relevant today. A more recent paper is: S. Qureshi, "Adaptive Equalisation", Proceedings of the IEEE, Vol. 73, pp. 1349-1387, 1985. A useful introduction can be found in: H.M. Ahmed, "Recent Advances in DSP Systems", IEEE Comms. Mag., Vol. 29, No. 5, pp. 32-45, May 1991. See also the general adaptive DSP textbooks for more information.
Minimising the Mean Squared Error 6.8
• If the statistics of x(k) and d(k) are wide sense stationary and ergodic
then we can choose to minimise the mean squared error signal:
[Figure: adaptive filter block diagram, as before. x(k) drives the adaptive FIR digital filter, y(k) is subtracted from d(k) to give e(k), which drives the adaptive algorithm.]
"Traditional" adaptive filtering does not consider statistical information beyond the second order statistics. Some recent higher order statistics techniques do, however, consider higher order moments in the development of adaptive algorithms. The advantages, if any, of this approach are not yet clear.
For mean squared analysis and derivation of suitable adaptive algorithms, the assumption of wide sense stationarity for x(k) and d(k) has proven to be sufficient.
Adaptive Algorithm for FIR Filter 6.9
[Figure: N weight adaptive FIR filter. A tapped delay line on x(k) with weights w0, w1, w2, ..., wN-2, wN-1 is summed to give y(k), which is subtracted from d(k) to give e(k).]
y(k) = Σ_{n=0}^{N-1} w_n x(k-n) = w^T x(k)
where
w = [w_0 w_1 w_2 ... w_{N-2} w_{N-1}]^T
and
x(k) = [x(k) x(k-1) x(k-2) ... x(k-N+2) x(k-N+1)]^T
w_opt is the weight vector which minimises the error signal in the mean squared sense.
Note that the algorithm we are currently deriving is an open loop (i.e. no feedback) technique. Once a large amount of data has been collected, a single calculation is performed; hence this type of technique is sometimes called single step.
Mean Squared Error 6.10
e^2(k) = (d(k) - w^T x(k))^2
= d^2(k) + w^T [x(k) x^T(k)] w - 2 d(k) w^T x(k)
• Taking expected (or mean) values (and dropping "(k)" for notational convenience):
E[e^2(k)] = E[d^2] + w^T E[xx^T] w - 2 w^T E[dx]
E[e^2(k)] = E[d^2(k)] + w^T R w - 2 w^T p
For a 3 weight example, the correlation matrix R = E[x(k) x^T(k)] and the cross-correlation vector p = E[d(k) x(k)] are:
R = E[ x_k^2, x_k x_{k-1}, x_k x_{k-2} ;
       x_{k-1} x_k, x_{k-1}^2, x_{k-1} x_{k-2} ;
       x_{k-2} x_k, x_{k-2} x_{k-1}, x_{k-2}^2 ],   p = E[ d_k x_k ; d_k x_{k-1} ; d_k x_{k-2} ]
The correlation matrix R is Toeplitz symmetric, and for a filter with N weights the matrix will be N x N in dimension, with first row [r_0 r_1 r_2 ... r_{N-1}]; similarly p = [p_0 p_1 ... p_{N-1}]^T. Hence for a general N weight filter:
E[e^2(k)] = E[d^2(k)] + w^T R w - 2 w^T p
• The MMSE is found from setting the (partial derivative) gradient vector, ∇, to zero:
∇ = ∂ζ/∂w = 2Rw - 2p = 0
⇒ w_opt = R^{-1} p
[Figure: open loop Wiener-Hopf processor. The input signal x(k) feeds the adaptive FIR digital filter to give the output signal y(k), and e(k) = d(k) - y(k); a side computation calculates w = R^{-1} p from x(k) and the desired signal d(k).]
The Wiener-Hopf adaptive DSP computation is a single step algorithm that does not use feedback, and could be used to solve any of the previous problems in system identification, inverse system identification, noise cancellation, and so on. However there are a number of practical reasons why this is not often done.
If we assume that the statistical averages are equal to the time averages (i.e. x(k) and d(k) are ergodic), then
we can calculate all elements of R and p from:
r(n) = (1/M) Σ_{i=0}^{M-1} x(i) x(i+n)
p_n = (1/M) Σ_{i=0}^{M-1} d(i) x(i-n)
Calculation of R and p requires approximately 2MN multiply and accumulate (MAC) operations, where M is the number of samples in a "suitably" representative data sequence and N is the adaptive filter length. The inversion of R requires around N^3 MACs, and the matrix-vector multiplication N^2 MACs. Therefore the total number of computations in performing this one step algorithm is 2MN + N^3 + N^2 MACs. The computation is therefore very high: an N = 100 weight filter, calculating the r and p values with M = 10000 samples, will require more than 3,000,000 MACs to find the MMSE solution.
More importantly, if the statistics of signals x(k) or d(k) change (which is very likely), then the filter weights will need to be recalculated, i.e. the algorithm has no tracking capabilities. Hence direct implementation of the Wiener-Hopf solution is not practical for real time DSP implementation because of the high computation load, and the need to recalculate when the statistics change.
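As a sketch of this single step computation in Python/numpy (the three weight test system, the signal lengths and the use of scipy are illustrative choices, not from the notes):

import numpy as np
from scipy.linalg import toeplitz, solve

def wiener_hopf(x, d, N):
    """Single step Wiener-Hopf solution w_opt = R^-1 p, with R and p
    estimated by time averages (x and d assumed ergodic)."""
    M = len(x)
    # Autocorrelation r(n) = (1/M) sum_i x(i) x(i+n), n = 0..N-1.
    r = np.array([np.dot(x[:M - n], x[n:]) / M for n in range(N)])
    # Cross-correlation p_n = E[d(k) x(k-n)], n = 0..N-1.
    p = np.array([np.dot(d[n:], x[:M - n]) / M for n in range(N)])
    R = toeplitz(r)          # Toeplitz symmetric correlation matrix
    return solve(R, p)       # w_opt = R^-1 p (solve rather than invert)

# Identify a hypothetical 3 weight system driven by white noise.
rng = np.random.default_rng(1)
x = rng.standard_normal(50_000)
d = np.convolve(x, [0.5, -0.4, 0.2])[:len(x)]
print(np.round(wiener_hopf(x, d, 3), 2))   # approx. [0.5, -0.4, 0.2]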
Gradient Techniques 1 6.13
w(k+1) = w(k) + µ(-∇(k))
[Figure: the MSE surface ζ plotted against the weight(s) w. With a small step size µ the algorithm converges slowly; with a suitable step size it converges to the MMSE at w_opt; with too large a step size it diverges.]
Gradient Techniques 2 6.14
• The initial value w(0) is an initial “guess”, and then at each new discrete
time, k, a new weight vector value is calculated.
• For an N weight FIR filter, the steepest descent technique will jump
down the inside of the surface to the MMSE point.
∇(k) = ∂ζ/∂w(k) = ∂E[e^2(k)]/∂w(k) = 2Rw(k) - 2p

w(k+1) = w(k) + µ(-∂e^2(k)/∂w(k))

where the LMS uses the instantaneous gradient estimate:

∇̂(k) = ∂e^2(k)/∂w(k) = 2e(k) ∂e(k)/∂w(k) = -2e(k)x(k)

giving:

w(k+1) = w(k) + 2µe(k)x(k)

E[∇̂(k)] = E[-2e(k)x(k)] = E[-2(d(k) - w^T(k)x(k))x(k)] = 2Rw(k) - 2p = ∇(k)
where we have assumed that w(k) and x(k) are independent.
In the mean the LMS will converge to the Wiener-Hopf solution if the step size, µ, is limited by the inverse of the
largest eigenvalue. Taking the expectation of both sides of the LMS equation gives:
E[w(k+1)] = E[w(k)] + 2µE[e(k)x(k)] = E[w(k)] + 2µ(E[d(k)x(k)] - E[x(k)x^T(k)w(k)])
and again assuming that w(k) and x(k) are independent:
E[w(k+1)] = E[w(k)] + 2µ(p - R E[w(k)])
= (I - 2µR) E[w(k)] + 2µR w_opt
Now if we let v(k) = w(k) - w_opt then we can rewrite the above in the form:
E[v(k+1)] = (I - 2µR) E[v(k)]
For convergence of the LMS to the Wiener-Hopf solution, we require that w(k) → w_opt as k → ∞, and therefore v(k) → 0 as k → ∞. If the eigenvalue decomposition of R is given by Q^T Λ Q, where Q^T Q = I and Λ is a diagonal matrix, then writing the vector v(k) in terms of the linear transformation Q, such that E[v(k)] = Q^T E[u(k)], and multiplying both sides of the above equation, we obtain the decoupled equations:
The LMS Algorithm 6.16
[Figure: N weight adaptive FIR filter. A tapped delay line on x(k) with weights w0, w1, ..., wN-2, wN-1 gives y(k), and e(k) = d(k) - y(k).]
w(k+1) = w(k) + 2µe(k)x(k)

E[u(k+1)] = (I - 2µΛ) E[u(k)]
and therefore:
E[u(k)] = (I - 2µΛ)^k E[u(0)]
where (I - 2µΛ) is a diagonal matrix:
(I - 2µΛ) = diag(1 - 2µλ_0, 1 - 2µλ_1, 1 - 2µλ_2, ..., 1 - 2µλ_{N-1})
For convergence of this equation to the zero vector, we require that:
0 < µ < 1/λ_max
LMS Stability 6.17
• The stability of the LMS is dependent on the magnitude of the step size
parameter, µ.
• For stability it can be shown that the step size should be:
0 < µ < 1 / (N E[x^2(k)])
trace[R] = Σ_{n=0}^{N-1} λ_n
i.e. the sum of the diagonal elements of the correlation matrix R, is equal to the sum of the eigenvalues.
Therefore the inequality:
λ_max ≤ trace[R]
will hold. However if the signal x(k) is wide sense stationary, then the diagonal elements of the correlation matrix R are E[x^2(k)], which is a measure of the signal power. Hence:
trace[R] = N E[x^2(k)] = N × (signal power)
and the well known LMS stability bound of:
0 < µ < 1 / (N E[x^2(k)])
follows.
Some recent publications have further tightened the above bounds. In the real DSP algorithm design world the bound is useful, although trial and error also provides good feedback!
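A minimal LMS sketch in Python/numpy (the test system and the step size safety factor of 0.1 are illustrative choices, not from the notes):

import numpy as np

def lms(x, d, N, mu=None):
    """LMS adaptive FIR filter: w(k+1) = w(k) + 2*mu*e(k)*x(k).
    If mu is not given, take a fraction of the stability bound
    1 / (N * E[x^2(k)])."""
    if mu is None:
        mu = 0.1 / (N * np.mean(x ** 2))
    w = np.zeros(N)
    xk = np.zeros(N)                 # tap vector [x(k), ..., x(k-N+1)]
    e = np.zeros(len(x))
    for k in range(len(x)):
        xk = np.roll(xk, 1)
        xk[0] = x[k]
        e[k] = d[k] - w @ xk         # e(k) = d(k) - w^T x(k)
        w += 2 * mu * e[k] * xk      # weight update
    return w, e

# Identify a hypothetical unknown FIR system.
rng = np.random.default_rng(2)
x = rng.standard_normal(20_000)
d = np.convolve(x, [0.5, -0.4, 0.2])[:len(x)]
w, e = lms(x, d, N=3)
print(np.round(w, 2))                # converges towards [0.5, -0.4, 0.2]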
Kalman Filtering I 6.18
[Figure: state-space signal flow graph. The state q(n) is advanced by the state transition matrix F(n+1, n) with process noise v_1(n), and is observed through the measurement matrix C(n) with measurement noise v_2(n).]
q(n+1) = F(n+1, n) q(n) + v_1(n),   y(n) = C(n) q(n) + v_2(n)
The state vector of the system under consideration is denoted as q ( n ) and is a column vector containing M
state variables. The transition from the current state q ( n ) to the next state q ( n + 1 ) is described by the following
equation
q(n+1) = F(n+1, n) q(n) + v_1(n)
where F ( n + 1, n ) is called the state transition matrix and v 1 ( n ) is the process noise (assumed to be white).
The observation vector y ( n ) is an N -by-1 column vector containing measurements of the process at time n .
y ( n ) is related to the state of the process by the following equation
y(n) = C(n) q(n) + v_2(n)    (dimensions: (N×1) = (N×M)(M×1) + (N×1))
where C ( n ) is called the measurement matrix and v 2 ( n ) is the measurement noise (also assumed to be white).
Kalman Filtering II 6.19
• Given the state transition matrix F(n+1, n), the measurement matrix C(n) and the noise covariances Q_1(n) and Q_2(n), the Kalman recursion is:
G(n) = F(n+1, n) K(n, n-1) C^H(n) [C(n) K(n, n-1) C^H(n) + Q_2(n)]^{-1}
α(n) = y(n) - C(n) q̂(n|n-1)
q̂(n+1|n) = F(n+1, n) q̂(n|n-1) + G(n) α(n)
K(n) = K(n, n-1) - F(n, n+1) G(n) C(n) K(n, n-1)
K(n+1, n) = F(n+1, n) K(n) F^H(n+1, n) + Q_1(n)
q̂(n|n) = F(n, n+1) q̂(n+1|n)
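A direct transcription of this recursion into Python/numpy (a sketch only: F and C are assumed time invariant, so that F(n, n+1) is taken as the inverse of F(n+1, n); all names are illustrative):

import numpy as np

def kalman_step(q_hat, K_pred, y, F, C, Q1, Q2):
    """One iteration of the Kalman recursion above.
    q_hat  : predicted state estimate q^(n|n-1)   (M vector)
    K_pred : predicted error covariance K(n, n-1) (M x M)
    Returns q^(n+1|n) and K(n+1, n)."""
    S = C @ K_pred @ C.conj().T + Q2                 # innovation covariance
    G = F @ K_pred @ C.conj().T @ np.linalg.inv(S)   # Kalman gain G(n)
    alpha = y - C @ q_hat                            # innovation alpha(n)
    q_next = F @ q_hat + G @ alpha                   # q^(n+1|n)
    Finv = np.linalg.inv(F)                          # F(n, n+1) = F(n+1, n)^-1
    K = K_pred - Finv @ G @ C @ K_pred               # filtered covariance K(n)
    K_next = F @ K @ F.conj().T + Q1                 # K(n+1, n)
    return q_next, K_next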
For a more in depth description and derivation of the algorithm please see the adaptive DSP textbooks referenced earlier.
• The first step is to model the process which we are measuring; that is
we must model the unknown system.
• { w opt, k } are the optimal filter coefficients and e opt ( n ) is additive noise.
This is illustrated in the diagram below which shows an adaptive filter configured for inverse system
identification. The relationship between x ( k ) and d ( k ) may be modelled using the tapped delay line with an
additive noise term
[Figure: the relationship between x(k) and d(k) modelled as a tapped delay line ("model of unknown system") with weights {w_opt,k} plus the additive noise term, together with the delay that makes d(k) equivalent.]
• The linear model and state space models are shown below.
[Figure: process and measurement state-space model. The state q(n) recirculates through λ^{-1/2} and a unit delay z^{-1}, and the measurement y(n) is formed as x^H(n) q(n) plus the noise v(n).]
q(n+1) = λ^{-1/2} q(n),   q(0) = w_opt
v(n) = λ^{-n/2} e_opt*(n)
y(n) = x^H(n) q(n) + v(n)
∴ q(n) = λ^{-n/2} w_opt, so that y(n) = λ^{-n/2} d*(n)
• A review of the signal flow graphs and update equations for the simple
LMS is first given.
[w_0; w_1; w_2; w_3]_k = [w_0; w_1; w_2; w_3]_{k-1} + 2µe(k) [x(k); x(k-1); x(k-2); x(k-3)]_k
or in vector form:
w(k) = w(k-1) + 2µe(k) x(k)
where
w(k) = [w_0; w_1; w_2; w_3]_k and x(k) = [x(k); x(k-1); x(k-2); x(k-3)]
Complex LMS 6.24
• The complex LMS uses the conjugate of the input vector in the weight update:
w(k) = w(k-1) + 2µe(k) x*(k)
and the conjugate of the weight vector when calculating the filter output:
e(k) = d(k) - w^H(k-1) x(k)
The conjugate of the transpose of a vector is known as the Hermitian of the vector:
(w*(k))^T = w^H(k)
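A complex LMS sketch in Python/numpy. Note the conjugation convention: here the filter output is computed as w^T x(k) (as in the real case) with the conjugate placed in the weight update, which is one common, self-consistent pairing; the single tap test channel is purely illustrative:

import numpy as np

def complex_lms(x, d, N, mu):
    """Complex LMS: y(k) = w^T x(k), e(k) = d(k) - y(k),
    w(k) = w(k-1) + 2*mu*e(k)*conj(x(k))."""
    w = np.zeros(N, dtype=complex)
    xk = np.zeros(N, dtype=complex)
    e = np.zeros(len(x), dtype=complex)
    for k in range(len(x)):
        xk = np.roll(xk, 1)
        xk[0] = x[k]
        e[k] = d[k] - w @ xk
        w += 2 * mu * e[k] * np.conj(xk)
    return w, e

# Identify a hypothetical single tap complex channel 0.8*exp(j*pi/4).
rng = np.random.default_rng(3)
x = rng.standard_normal(5_000) + 1j * rng.standard_normal(5_000)
d = 0.8 * np.exp(1j * np.pi / 4) * x
w, _ = complex_lms(x, d, N=1, mu=0.01)
print(np.round(w, 2))                # approaches 0.57 + 0.57j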
Complex Baseband Modelling 6.25
[Figure: QAM modulation and demodulation over a multipath channel. The in-phase (I) and quadrature (Q) paths are mixed with cos(2πf_c t) and sin(2πf_c t) at the transmitter, and with cos(2πf_c t + θ) and sin(2πf_c t + θ) at the receiver, followed by lowpass filters (LPF); this is "equivalent" to a complex baseband model mapping f_1(t) + jf_2(t) to z_1(t) + jz_2(t).]
Assuming the channel characteristic is multipath, and there is a constant phase error in the local oscillator, then
the QAM scheme can be modelled at baseband with a complex FIR filter, i.e. complex inputs and complex
outputs, and complex weights.
For more “advanced” channels we can extend the model to introduce time varying complex filter weights and
therefore produce models for slow and fast fading channels.
Complex System Model 6.26
[Figure: complex system model. The I and Q data streams pass through real tx filters, then through the channel (four real filters acting as one complex filter) with additive noise on each rail, then through real rx filters, and finally through the complex equaliser filter to recover data I and data Q; the channel and equaliser operate on the complex signal.]
Consider a channel with a single multipath component. This can be represented as a complex FIR filter with a
single coefficient as shown below:
input signal: x_r + jx_i = A_x e^{jφ_x}
channel weight: w_r + jw_i = A_w e^{jφ_w}
output signal: y_r + jy_i = A_y e^{jφ_y} = (x_r + jx_i)(w_r + jw_i)*
so that:
A_y e^{jφ_y} = A_x e^{jφ_x} · A_w e^{-jφ_w} = (A_x A_w) e^{j(φ_x - φ_w)}
i.e. the channel changes the amplitude by the factor A_w and the phase by -φ_w.
Note that the complex signals used (input, output, coefficient) can be represented in two formats: rectangular
and polar. With the latter it is easy to observe the change in amplitude and phase produced in the channel. A
complex equaliser should be able to compensate for these modifications introduced in the channel.
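A few lines of Python make the polar view explicit (the channel gain, phases and input value are illustrative):

import numpy as np

w = 0.5 * np.exp(1j * np.deg2rad(30))   # channel weight: A_w = 0.5, phi_w = 30 deg
x = 2.0 * np.exp(1j * np.deg2rad(45))   # input sample:   A_x = 2.0, phi_x = 45 deg

y = x * np.conj(w)                      # channel output y = x w*
print(abs(y), np.rad2deg(np.angle(y)))  # 1.0 and 15.0: A_x*A_w and phi_x - phi_w

w_eq = 1.0 / np.conj(w)                 # a single tap equaliser inverts the tap
print(np.allclose(y * w_eq, x))         # True: amplitude and phase restored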
Complex Equaliser in Real Systems 6.27
[Figure: complex equaliser in a real system. The I and Q data pass through real tx filters, are mixed to passband with cos(2πf_c t) and sin(2πf_c t), traverse the noisy channel, are mixed back down with cos(2πf_c t + θ) and sin(2πf_c t + θ) through real rx filters, and are then equalised by a single complex filter to recover data I and data Q.]
A number of SystemView equalisation examples will be presented to demonstrate the complex FIR in action for a genuine multipath channel.
RLS Techniques 6.28
QR is a computation based on using a data matrix rather than a correlation matrix; hence the dynamic range, and therefore the computational wordlength, is smaller.
But QR requires not only O(N^2) MACs/s, it also requires many divides and square roots.
However, computational processing power in DSP processors increases at the same rate as Moore's Law (doubling every two years), and if you are working on FPGAs then square roots and divides can be dealt with adequately. They are expensive, but they are no longer prohibitive.
[Figure: adaptive FIR digital filter with weight vector w. The input signal x(k) gives the output y(k), and the error signal e(k) = d(k) - y(k).]
• Minimising the total sum of squared errors for all input signals up to and including time k. The total squared error, v(k), is:
v(k) = Σ_{s=0}^{k} [e(s)]^2 = e^2(0) + e^2(1) + e^2(2) + ... + e^2(k) = e_k^T e_k
Noting that the output of the N weight adaptive FIR digital filter is given by:
y(k) = Σ_{n=0}^{N-1} w_n x(k-n) = w^T x_k = x_k^T w
where
w = [w_0, w_1, w_2, ..., w_{N-1}]^T and x_k = [x(k), x(k-1), x(k-2), ..., x(k-N+1)]^T
Least Squares Gradient Vector 6.30
v(k) = e_k^T e_k = ‖e_k‖^2
= [d_k - X_k w]^T [d_k - X_k w]
= d_k^T d_k + w^T X_k^T X_k w - 2 d_k^T X_k w
∂v(k)/∂w = 0
• The aim is to find a "good" solution such that the 2-norm of the error vector, e_k, is minimised. The function v(k) is an up-facing hyperparaboloid when plotted in N+1 dimensional space, and there exists exactly one minimum point at the bottom of the hyperparaboloid.
e_k = [e(0); e(1); e(2); ...; e(k-1); e(k)] = d_k - [x_0^T; x_1^T; x_2^T; ...; x_{k-1}^T; x_k^T] w

= d_k - [ x(0)   0      0      ...  0 ;
          x(1)   x(0)   0      ...  0 ;
          x(2)   x(1)   x(0)   ...  0 ;
          :      :      :      ...  : ;
          x(k-1) x(k-2) x(k-3) ...  x(k-N) ;
          x(k)   x(k-1) x(k-2) ...  x(k-N+1) ] [w_0; w_1; w_2; ...; w_{N-1}]

i.e. e_k = d_k - X_k w
where X_k is a (k+1) × N data matrix made up from input signal samples. Note that the first N rows of X_k are sparse.
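The structure of X_k is easy to build directly; a small Python/numpy helper (illustrative):

import numpy as np

def data_matrix(x, N):
    """Build the (k+1) x N data matrix X_k whose rows are
    x_s^T = [x(s), x(s-1), ..., x(s-N+1)], with zeros before time zero."""
    rows = len(x)                    # k + 1 rows
    X = np.zeros((rows, N))
    for n in range(N):
        X[n:, n] = x[:rows - n]      # column n is x delayed by n samples
    return X

print(data_matrix(np.array([1.0, 2.0, 3.0, 4.0]), 3))
# [[1. 0. 0.]
#  [2. 1. 0.]
#  [3. 2. 1.]
#  [4. 3. 2.]]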
Least Squares Equation 6.31
∂v(k)/∂w = 2 X_k^T X_k w - 2 X_k^T d_k = -2 X_k^T [d_k - X_k w]
and therefore:
-2 X_k^T [d_k - X_k w_LS] = 0
⇒ X_k^T X_k w_LS = X_k^T d_k
w_LS = [X_k^T X_k]^{-1} X_k^T d_k
[Figure: matrix dimensions of w_LS = (X^T X)^{-1} X^T d. X is (k+1) × N, so X^T X is N × N, its inverse is N × N, X^T d is N × 1, and w is N × 1.]
Note that in the special case where X_k is a square non-singular matrix, the above simplifies to:
w_LS = X_k^{-1} X_k^{-T} X_k^T d_k = X_k^{-1} d_k    (1)
The computation to calculate least squares requires about O(N^3) MACs (multiply/accumulates) and O(N) divides for the matrix inversion, and O((k+1) × N^2) MACs for the matrix multiplications. Clearly, therefore, the more data that is available, the more computation is required.
Least Squares Computation 6.32
• At time iteration k+1, the least squares weight vector solution is:
[Figure: adaptive FIR filter with weights w0, w1, ..., wN-2, wN-1, output y(k) and error e(k) = d(k) - y(k).]
w_{k+1} = [X_k^T X_k]^{-1} X_k^T d_k
and at the next time iteration:
w_{k+2} = [X_{k+1}^T X_{k+1}]^{-1} X_{k+1}^T d_{k+1}
This equation requires that another full matrix inversion, [X_{k+1}^T X_{k+1}]^{-1}, is performed, followed by the appropriate matrix multiplications. This very high level of computation for every new data sample provides the motivation for deriving the recursive least squares (RLS) algorithm, which reduces the computation by calculating w_{k+1} from the previous estimate w_k.
w_k = [X_{k-1}^T X_{k-1}]^{-1} X_{k-1}^T d_{k-1} = P_{k-1} X_{k-1}^T d_{k-1}
where
P_{k-1} = [X_{k-1}^T X_{k-1}]^{-1}
When the new data samples, x(k) and d(k), arrive we have to recalculate the matrix equation (including recalculating the matrix inverse):
w_{k+1} = [X_k^T X_k]^{-1} X_k^T d_k = P_k X_k^T d_k
However note that P_k can be written in terms of the previous data matrix X_{k-1} and the data vector x_k by partitioning the matrix X_k:
Recursive Least Squares 6.33
w_{k+1} = [X_k^T X_k]^{-1} X_k^T d_k = P_k X_k^T d_k

P_k = [X_k^T X_k]^{-1} = [ [X_{k-1}; x_k^T]^T [X_{k-1}; x_k^T] ]^{-1} = [X_{k-1}^T X_{k-1} + x_k x_k^T]^{-1}
= [P_{k-1}^{-1} + x_k x_k^T]^{-1}
Recall the matrix inversion lemma:
[A + BCD]^{-1} = A^{-1} - A^{-1} B [C^{-1} + D A^{-1} B]^{-1} D A^{-1}
where A is a non-singular matrix and B, C and D are appropriately dimensioned matrices. Using the matrix inversion lemma with A = P_{k-1}^{-1}, B = x_k, D = x_k^T and C the 1 × 1 identity matrix, i.e. the scalar 1, then:
P_k = P_{k-1} - P_{k-1} x_k [1 + x_k^T P_{k-1} x_k]^{-1} x_k^T P_{k-1}
This equation implies that if we know the matrix [X_{k-1}^T X_{k-1}]^{-1} then the matrix [X_k^T X_k]^{-1} can be computed without explicitly performing a complete matrix inversion from first principles. This, of course, saves computational effort. These two equations are one form of the RLS algorithm. By additional algebraic manipulation, the computational complexity can be simplified even further.
w_{k+1} = [P_{k-1} - P_{k-1} x_k [1 + x_k^T P_{k-1} x_k]^{-1} x_k^T P_{k-1}] X_k^T d_k
= [P_{k-1} - P_{k-1} x_k [1 + x_k^T P_{k-1} x_k]^{-1} x_k^T P_{k-1}] [X_{k-1}^T  x_k] [d_{k-1}; d(k)]    (2)
= [P_{k-1} - P_{k-1} x_k [1 + x_k^T P_{k-1} x_k]^{-1} x_k^T P_{k-1}] [X_{k-1}^T d_{k-1} + x_k d(k)]
Recursive Least Squares 6.34
[Figure: closed loop RLS adaptive FIR filter. A tapped delay line on x(k) with weights w0, w1, ..., wN-2, wN-1 gives y(k), and the error e(k) = d(k) - y(k) feeds the RLS update.]
w_{k+1} = w_k + m_k e(k)
m_k = P_{k-1} x_k / [1 + x_k^T P_{k-1} x_k]
P_k = P_{k-1} - m_k x_k^T P_{k-1}
Dropping the time subscripts on P_{k-1}, x_k and d(k) for clarity:
w_{k+1} = [P - Px[1 + x^T Px]^{-1} x^T P][X^T d + xd]
= PX^T d + Pxd - Px[1 + x^T Px]^{-1} x^T PX^T d - Px[1 + x^T Px]^{-1} x^T Pxd
= w_k - Px[1 + x^T Px]^{-1} x^T w_k + Pxd - Px[1 + x^T Px]^{-1} x^T Pxd
= w_k - Px[1 + x^T Px]^{-1} x^T w_k + Pxd[1 - [1 + x^T Px]^{-1} x^T Px]    (3)
= w_k - Px[1 + x^T Px]^{-1} x^T w_k + Px[1 + x^T Px]^{-1}[[1 + x^T Px] - x^T Px]d
= w_k - Px[1 + x^T Px]^{-1} x^T w_k + Px[1 + x^T Px]^{-1} d
= w_k + Px[1 + x^T Px]^{-1}(d - x^T w_k)
Restoring the time indices:
w_{k+1} = w_k + P_{k-1} x_k [1 + x_k^T P_{k-1} x_k]^{-1}(d(k) - y(k))
= w_k + m_k(d(k) - y(k))    (4)
= w_k + m_k e(k)
The RLS adaptive filtering algorithm therefore requires that at each time step the vector m_k and the matrix P_k are computed. The filter weights are then updated using the error output, e(k). The block diagram for the closed loop RLS adaptive FIR filter is shown at the top of this slide.
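The three update equations map directly into a Python/numpy sketch (the delta*I initialisation of P is a common start-up choice that avoids forming X^T X explicitly; delta and the test system are illustrative, not from the notes):

import numpy as np

def rls(x, d, N, delta=100.0):
    """RLS adaptive FIR filter:
    m_k = P_{k-1} x_k / (1 + x_k^T P_{k-1} x_k),
    w_{k+1} = w_k + m_k e(k),  P_k = P_{k-1} - m_k x_k^T P_{k-1}."""
    w = np.zeros(N)
    P = delta * np.eye(N)            # initial "inverse correlation" matrix
    xk = np.zeros(N)
    e = np.zeros(len(x))
    for k in range(len(x)):
        xk = np.roll(xk, 1)
        xk[0] = x[k]
        e[k] = d[k] - w @ xk
        Px = P @ xk
        m = Px / (1.0 + xk @ Px)     # gain vector m_k
        w = w + m * e[k]
        P = P - np.outer(m, Px)      # P_k = P_{k-1} - m_k x_k^T P_{k-1}
    return w, e

rng = np.random.default_rng(4)
x = rng.standard_normal(2_000)
d = np.convolve(x, [0.5, -0.4, 0.2])[:len(x)]
w, _ = rls(x, d, N=3)
print(np.round(w, 2))                # approx. [0.5, -0.4, 0.2]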
Exponentially Weighted RLS 6.35
• The RLS algorithm calculates the least squares vector at time k based
on all previous data, i.e. data from long ago is given as much relevance
as recently received data.
w_LS = [X_k^T Λ_k X_k]^{-1} X_k^T Λ_k d_k

v(k) = e_k^T Λ_k e_k
Therefore:
v(k) = [d_k - X_k w]^T Λ_k [d_k - X_k w]
= d_k^T Λ_k d_k + w^T X_k^T Λ_k X_k w - 2 d_k^T Λ_k X_k w
Following the same procedure as before, the least squares solution is easily found to be:
w_LS = [X_k^T Λ_k X_k]^{-1} X_k^T Λ_k d_k
Exponentially Weighted Recursive Least Squares 6.36
[Figure: closed loop exponentially weighted RLS adaptive FIR filter, as before, with weights w0, w1, ..., wN-2, wN-1, output y(k) and error e(k) = d(k) - y(k).]
w_{k+1} = w_k + m_k e(k)
m_k = P_{k-1} x_k / [λ + x_k^T P_{k-1} x_k]
P_k = [P_{k-1} - m_k x_k^T P_{k-1}] / λ
There exist many "fast" forms of the RLS, such as the FTF (fast transversal filter); however, while reducing computation, these "fast" implementations invariably have numerical integrity and stability problems.
There are two key differences from the LMS: first, the RLS requires a higher level of computation, and second, it also requires divisions. But just to recap: why do we want RLS? It can train faster than the LMS, and will usually produce better final MMSEs.
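The λ modifications drop straight into the earlier RLS sketch; only the gain and covariance updates change (lam = 1 recovers the standard RLS, and the default values are illustrative):

import numpy as np

def ewrls(x, d, N, lam=0.99, delta=100.0):
    """Exponentially weighted RLS; lam is the forgetting factor, 0 < lam <= 1."""
    w, P, xk = np.zeros(N), delta * np.eye(N), np.zeros(N)
    e = np.zeros(len(x))
    for k in range(len(x)):
        xk = np.roll(xk, 1)
        xk[0] = x[k]
        e[k] = d[k] - w @ xk
        Px = P @ xk
        m = Px / (lam + xk @ Px)         # lambda replaces the 1 in the gain
        w = w + m * e[k]
        P = (P - np.outer(m, Px)) / lam  # covariance update divided by lambda
    return w, e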
QR-RLS Algorithm 6.37
• The aim of LS is to find the set of filter weights w which minimises the total sum of squared errors:
v(k) = Σ_{s=0}^{k} λ^{k-s} [e(s)]^2,   0 < λ ≤ 1
[Figure: adaptive FIR digital filter with weight vector w, as before: input x(k), output y(k), error e(k) = d(k) - y(k).]
LMS type algorithms attempt to minimise the mean squared error, which can be estimated by the time average:
E[e^2[k]] ≈ (1 / (M_2 - M_1)) Σ_{k=M_1}^{M_2-1} e^2[k]   for large (M_2 - M_1)
while RLS type algorithms (like the QR-RLS) attempt to minimise the sum of the squares of past errors:
v[k] = Σ_{s=0}^{k} e^2[s]
The previous equation applies in the real case (working with real numbers). When complex values are involved the equation above can be expressed as:
v[k] = Σ_{s=0}^{k} e[s] e*[s]
In matrix-vector form:
v[k] = ‖e_k‖^2 = ‖d_k - X_k w‖^2
w_LS = [X_k^T X_k]^{-1} X_k^T d_k
[Figure: matrix dimensions of w_LS = (X^T X)^{-1} X^T d, repeated from slide 6.31.]
X_k = QR
[Figure: the (k+1) × N data matrix X is factored as an orthogonal (k+1) × (k+1) matrix Q (Q^T Q = I) times a (k+1) × N matrix R whose top N × N block is upper triangular and whose remaining rows are zero.]
Consider the generic linear system:
Ax = b    (5)
where A is an m × n matrix, b is a known m element vector, and x is an unknown n element vector; the minimum norm solution is required, i.e. minimise ε, where ε = ‖Ax - b‖_2. This can be found by the least squares solution:
x_LS = (A^T A)^{-1} A^T b    (6)
QR Solution 6.40
w_LS = [X_k^T X_k]^{-1} X_k^T d_k
= [(QR)^T (QR)]^{-1} (QR)^T d_k
= [R^T Q^T Q R]^{-1} R^T Q^T d_k
= [R^T R]^{-1} R^T Q^T d_k
= R^{-1} R^{-T} R^T Q^T d_k
= R^{-1} Q^T d_k
i.e. w_LS = R^{-1} d_k',  where d_k' = Q^T d_k
Rw = d, i.e.
[ r_11 ... r_{1,N-2}   r_{1,N-1}   r_{1N} ;
  :    :   :           :           : ;
  0    ... r_{N-2,N-2} r_{N-2,N-1} r_{N-2,N} ;
  0    ... 0           r_{N-1,N-1} r_{N-1,N} ;
  0    ... 0           0           r_NN ] [w_1; ...; w_{N-2}; w_{N-1}; w_N] = [d_1; ...; d_{N-2}; d_{N-1}; d_N]
has to be solved for the weight vector w, where R is an N × N non-singular upper triangular matrix. The last element of the unknown vector w can be calculated from multiplication of the last row of R with the vector w:
r_NN w_N = d_N  ⇒  w_N = d_N / r_NN
In general it can be shown that all elements of w can be calculated recursively from:
w_i = (d_i - Σ_{j=i+1}^{N} r_ij w_j) / r_ii
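Backsubstitution and the QR least squares solution are a few lines of Python/numpy (the random test data is illustrative; np.linalg.qr returns the "economy" factorisation with an N × N upper triangular R):

import numpy as np

def back_substitute(R, d):
    """Solve the upper triangular system R w = d:
    w_i = (d_i - sum_{j>i} R_ij w_j) / R_ii."""
    N = len(d)
    w = np.zeros(N)
    for i in range(N - 1, -1, -1):
        w[i] = (d[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 4))
d = rng.standard_normal(100)
Q, R = np.linalg.qr(X)               # Q: 100 x 4, R: 4 x 4 upper triangular
w_qr = back_substitute(R, Q.T @ d)   # w_LS = R^-1 Q^T d
print(np.allclose(w_qr, np.linalg.lstsq(X, d, rcond=None)[0]))   # True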
QR Decomposition: Givens Rotations I 6.41
[Figure: sequence of Givens rotations forming R. At each step one marked element is zeroed, and the other marked elements change their values.]
Consider the following 2 × 2 example where matrix B is to be made upper triangular; element B_10 must be zeroed:
B = [ B_00 B_01 ; B_10 B_11 ]
The rotation angle θ is chosen such that:
cos θ = 1 / √(1 + (B_10/B_00)^2)
sin θ = (B_10/B_00) / √(1 + (B_10/B_00)^2)
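In Python/numpy (the 2 × 2 test matrix is illustrative):

import numpy as np

def givens_zero(B):
    """Zero element B[1,0] of a 2 x 2 matrix with a Givens rotation,
    using cos(theta) = 1/sqrt(1 + t^2), sin(theta) = t/sqrt(1 + t^2),
    where t = B10/B00."""
    t = B[1, 0] / B[0, 0]
    c = 1.0 / np.sqrt(1.0 + t * t)
    s = t * c
    G = np.array([[c, s],
                  [-s, c]])
    return G, G @ B

B = np.array([[3.0, 1.0],
              [4.0, 2.0]])
G, R = givens_zero(B)
print(np.round(R, 2))   # [[5. 2.2] [0. 0.4]]: upper triangular, R[0,0] = sqrt(3^2 + 4^2)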
QR Decomposition: Givens Rotations II 6.42
• Q^T is composed of a series of Givens rotations G_1, G_2, G_3, etc. which zero the desired elements, e.g.:
G_3 = [ 1 0 0 ; 0 cos θ sin θ ; 0 -sin θ cos θ ] = [ 1 0 0 ; 0 0.17 0.98 ; 0 -0.98 0.17 ]
[Figure: a Givens rotation viewed as a coordinate rotation. The vector (x, y), of magnitude √(x^2 + y^2), is rotated through θ onto the x axis, giving (x', 0); applied across the matrix, successive rotations build up the elements R_00, R_01, R_02, R_11, R_12 and R_22.]
x' = x cos θ + y sin θ = √(x^2 + y^2)
y' = y cos θ - x sin θ = 0
cos θ = x / √(x^2 + y^2),   sin θ = y / √(x^2 + y^2)
T T
Remembering that Q is orthogonal, i.e. Q Q = QQ = I , we can write the minimisation criterion as:
2 T T T T 2 T T 2
e = e e = e QQ e = Q e = Q d – Q Aw . Considering the following equivalencies
T T
Q d = p , Q Xw = Rw
v 0
2
the problem of minimising ξ is equivalent to minimising ξ = p – Rw , which implies solving the equation
v
p = Rw with backsubstitution. In the iterative case we introduce time dependence with the subindex k . It can
then be proved that the algorithm is then composed of the following steps.
Step 1: [R[k]; 0^T] = Q_k [λ^{1/2} R[k-1]; x^T[k]]
Step 2: [p[k]; γ] = Q_k [λ^{1/2} p[k-1]; d[k]]
Step 3: solve p[k] = R[k] w[k] by backsubstitution
QR-RLS Tri-Array 6.44
[Figure: QR-RLS triangular systolic array. Boundary cells compute the Givens rotation parameters (requiring a square root and a divide) and internal cells apply the rotations, with the error e(k) emerging from the final cell.]
Note that the cost of the diagonal boundary cells means that the Givens array is somewhat imbalanced computationally: the square root and divide have a higher cost than the multiplies and adds.
After the QR array has operated on the incoming x(k) and d(k) data, the next step is to perform the backsubstitution using the R matrix, which is essentially held inside the QR array.
Note that for infinite precision arithmetic, both the direct RLS and the QR-RLS give exactly the same results.
The QR-RLS has better numerical integrity than the direct RLS when using fixed point numbers. One simple way to see this is to consider having an N bit processor available. When performing RLS or direct least squares we are using squared quantities of x(k), hence the significant wordlength of x(k) should be less than N/2 bits. Whereas in the QR we are using the data quantities directly and working with orthogonal normalised transforms, and could therefore have closer to N bits of resolution in the data x(k).
Conclusions 6.45