ADAPTIVE FILTERS
PRELIMINARY VERSION
2 Fundamentals of Stochastics  18
2.1 Least-Mean-Squares Estimators  18
2.2 Linear Least-Mean-Squares Estimators  23
2.3 Different Interpretations of the Wiener Solutions  27
2.4 Complexity of the Exact Wiener Solution  27
2.4.1 Durbin Algorithm  28
2.4.2 Levinson Algorithm  30
2.4.3 Trench Algorithm  31
2.5 The Steepest Descent Algorithm  32
2.6 Literature  37
Univ.Prof. DI Dr.-Ing. Markus Rupp 3
4.1.2 LS Estimation  62
4.1.3 Conditions on Excitation  63
4.1.4 Generalizations and special cases  63
4.1.5 Summary  64
4.2 Classic RLS Derivation  66
4.2.1 Underdetermined Forms  68
4.3 Stationary Behavior  70
4.4 Alternative Forms of LS Solutions  72
4.5 Literature  74
6 Generalized LS Methods  82
6.1 Recursive Algorithm  84
6.2 Robustness  85
E More General Derivation for the Mean Squared Parameter Error Vector  137
E.1 Decomposition of Matrices  137
E.2 Modal Space of LMS  138
E.3 Influence of Noise  145
E.4 Complete LMS Solution  147
E.5 Very Long Adaptive Filters  148
[Figure 1.1: Block diagram of a hybrid: loudspeaker, microphone, the echo path between them, the balancing network N, and the two-wire line to the switching board.]
During the past 30 years, adaptive filter algorithms have made their way into many
electronic products. In most cases, the user is not even aware of their existence,
which demonstrates their excellent performance. Adaptive filters adapt to a
permanently changing environment, by which their behaviour is kept optimal. In
the following, we will review various applications of adaptive filters. Figure 1.1 shows the block
diagram of a so-called hybrid (Ger.: Gabelschaltung (Wien-Brücke)). It allows for
connecting a two-wire line from the switching board to the two wires of a microphone on the
one hand, and to the two wires of a loudspeaker on the other hand. Perfect balancing is
achieved if the user does not hear his own voice in the loudspeaker. Such an optimal situation
is typically not given, as the far-end load (Ger.: Nachbildung N) is in general unknown.
Typically, a small leakage from the microphone to the loudspeaker is even considered good, since
then, the user is convinced that the device is working properly. However, in the context of
hands-free telephones (Ger.: Freisprechanlagen), this can cause acoustic feedback and thus
needs to be attenuated. The attenuation or even a complete cancellation of this echo signal
can be achieved if the leaking signal is subtracted at the loudspeaker input. The leaking
signal can be estimated by a convolution of the impulse response from the microphone to
the loudspeaker with the user’s voice signal. Since this impulse response is unknown, an
adaptive filter is required to estimate it.
[Figure 1.2: Two far-end speakers A and B connected via their hybrids (with balancing networks N) and a landline; each hybrid reflects an echo of the far-end signal back towards its origin.]
Figure 1.2 schematically depicts two far-end users, connected via their local hybrids and
a landline. At each point along the landline where a change from two wires to four wires
(and vice versa) is required, such hybrids are in use. For connections over long distances,
many such hybrids may occur. In this case, the hybrids outside the user equipment
do not pass on the local signals from microphone to loudspeaker; instead, far-end
signals are reflected back to their origin. For long-distance connections, time delays of 500 ms
and more lead to very disturbing echoes which force the end users to speak against their
own voices. If such an echo signal is additionally amplified along the transmission path,
under the assumption that the hybrids have an attenuation of 6 dB, it can happen that the
closed-loop connection (established via the acoustic coupling from the loudspeaker to the
microphone) has an attenuation of zero decibels or even less, and thus the system becomes
unstable. Adaptive filters on both ends can compensate for such effects and make the long-distance
connection stable. While local echo compensation typically deals with few filter
coefficients (<100), far-end echo compensation may require many coefficients (500-4000).
[Figure 1.3: Sound propagation in a room: the loudspeaker signal reaches the microphone via multiple reflections at the walls.]
Nowadays, almost all telephone sets feature hands-free equipment (Ger.: Freisprecheinrichtungen).
They are applied in offices as well as in video conferences, and even in
cars while driving. Figure 1.3 schematically depicts the sound propagation in a room (or
some other reflecting environment). The far-end speaker signal enters the room via the
loudspeaker. The corresponding sound is reflected at the walls and adds up with the voice
of the local speaker at the microphone. The resulting electric signal is transferred to the far
end, where the far-end speaker's own signal appears delayed as an echo. If the far-end speaker is also
using a hands-free telephone, the loop is closed and, in the worst case, a strong sinusoidal
whistling is audible (Ger.: Rückkopplungspfeifen). In such scenarios, an adaptive filter can
estimate the impulse response of the two-port system loudspeaker-room-microphone and
estimate the echo signal in order to subtract it. It is important to note that, depending on
the room size, the occurring impulse responses may be very long (typically several thousand
taps). In this application, the local speaker is seen as a disturbance for the adaptive filter
estimation. Nevertheless, it is the signal of interest which has to be transmitted, and this
necessitates a special treatment.
All applications discussed so far fall into the category of system identification. As shown
in Figure 1.4, the path of the echo is a linear system with an unknown impulse response.
By observing the input and output signals of this system, the adaptive filter “learns” (i.e.,
identifies) its impulse response. With the identified impulse response and the known input
signal, the filter can reconstruct the echo excluding the signal of interest, and by subtraction
the clean, echo-free signal can be obtained.
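The identification principle just described can be sketched in a few lines of code. The following minimal example is illustrative only (it is not part of these notes): a normalized LMS update, one of the standard adaptive algorithms, learns the impulse response of an unknown FIR echo path from the observed input and output.

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.array([0.5, -0.4, 0.2, 0.1])    # unknown echo path (illustrative values)
M = len(h)

x = rng.standard_normal(5000)          # excitation (far-end signal)
d = np.convolve(x, h)[:len(x)]         # echo observed at the microphone

w = np.zeros(M)                        # adaptive filter coefficients
mu = 0.5                               # step size, 0 < mu < 2

for k in range(M, len(x)):
    xk = x[k - M + 1:k + 1][::-1]      # regression vector [x(k), ..., x(k-M+1)]
    e = d[k] - w @ xk                  # a priori error e(k)
    w += mu * e * xk / (xk @ xk)       # normalized LMS update

print(np.round(w, 3))                  # w approaches the unknown h
```

In the noise-free case the coefficient vector converges to the true impulse response; with the disturbance v(k) of Figure 1.4 present, it fluctuates around it.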
[Figure 1.4: System identification structure: the input x(k) drives both the unknown system h and the adaptive filter ŵ; the system output y(k), disturbed by v(k), minus the filter output ŷ(k) forms the error e(k).]
An entirely different problem is given in active noise control. Figure 1.5a depicts the
situation. A primary noise source (engine, hair dryer, ...) causes an undesired disturbance
at the position of the microphone. By a second, controlled source, the disturbing noise is
reduced at the microphone position by destructive interference. Often, direct access to
the primary source is not feasible. In this case, a second sensor captures a correlated signal
instead. Figure 1.5b exhibits the corresponding block diagram including the adaptive filter.
The adaptive filter does not estimate the impulse response corresponding to the transfer
function P̃, representing the path from the primary source to the microphone. Instead,
it only estimates the impulse response corresponding to this path reduced by H, which
represents the path between the secondary source and the microphone. This can cause non-causal
parts in the solution. We will later classify such a problem as a reference model with a linear
filter in the error path.
[Figure 1.5b block diagram: the reference x(k) drives the primary path P̃(z) (output d(k)) and the adaptive filter W; the filter output y(k) passes through the secondary path H(z) before being subtracted to form the error e(k).]
Figure 1.5: (a) Active noise suppression. (b) Equivalent block diagram of the control algo-
rithm.
In speech processing, adaptive filters are used in the context of linear prediction as
required for the vocoder principle. Figure 1.6 depicts the basic structure. The speech
signal is first delayed and then serves as input signal for the adaptive filter. In contrast, the
reference signal is the original speech signal without delay. Thus, the adaptive filter will aim
to approximate the speech signal based on past observations. Typical applications for linear
prediction are voice coders, which are used to reduce the data rates in speech communications.
Here, in the simplest case, only the prediction error signal is transmitted. Since this
signal carries less energy than the original signal, it can be quantized with fewer bits per
sample. More sophisticated is the so-called vocoder principle. As the spectrum of speech
signals remains approximately constant for about 10 ms, it is possible to train the adaptive filter
during this time period, and to transmit only the resulting filter coefficients (prediction
coefficients). The speech signal is then artificially synthesized applying these prediction
coefficients. With this technique, an enormous reduction in data rate can be achieved.
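The prediction structure can be illustrated with a short, hypothetical experiment (not from these notes): an adaptive predictor is trained on a synthetic AR(2) signal standing in for speech. The prediction error indeed carries far less power than the signal itself, and the filter coefficients approach the AR parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "speech-like" signal: an AR(2) process driven by white noise
N = 20000
u = rng.standard_normal(N)
d = np.zeros(N)
for k in range(2, N):
    d[k] = 1.5 * d[k - 1] - 0.7 * d[k - 2] + u[k]

M, mu = 2, 0.1                     # predictor order and NLMS step size
w = np.zeros(M)
e = np.zeros(N)
for k in range(M, N):
    xk = d[k - M:k][::-1]          # delayed samples [d(k-1), d(k-2)]
    e[k] = d[k] - w @ xk           # prediction error
    w += mu * e[k] * xk / (xk @ xk + 1e-12)

# the predictor removes the predictable part: var(e) << var(d)
print(np.var(d[1000:]), np.var(e[1000:]), w)
```

Transmitting e(k) (or, following the vocoder idea, only the two coefficients in w) instead of d(k) is exactly the data-rate reduction described above.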
[Figure 1.6: Linear prediction structure: the speech signal d(k), delayed by q⁻¹, is the input x(k) to the adaptive filter w; the prediction y(k) is subtracted from d(k) to form the error e(k).]
A further application of adaptive filters can be found in data transmission over frequency-selective
(time-dispersive) channels. Figure 1.7 shows the block diagram of such a data
transmission when employing an adaptive filter for equalization. A digital signal is
transmitted via a linear channel c and disturbed by additive noise v(k). At the receiver, a
digital filter needs to be adjusted such that a nonlinear one-to-one mapping can be applied
to unambiguously map the receive symbols (extracted from the filtered receive signal) to
the transmit symbol alphabet. Often, the choice of a suitable filter structure as well as the
possibly high number of filter coefficients are secondary problems. Then, the predominant
problem is posed by finding an adequate algorithm that allows for rapid tracking of
channel alterations (frequency-dispersive or time-variant channel) and at the same time
does not place high demands on numerical precision. In contrast to the previously
described system identification problem, here we do not have a reference signal available.
Yet, it can be obtained by introducing a training sequence which is known by both the
transmitter and the receiver. Of course, the purpose of data transmission is the transmission of unknown
data, and thus it is necessary to keep the training sequence very small compared to the
unknown data. Alternatively, the transmission of training sequences can be circumvented
if the decoded signals can be used as reference signals. However, this only works as long as
the errors remain small.
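A minimal training-based equalizer sketch (illustrative assumptions, not prescribed by these notes: BPSK symbols, a short minimum-phase channel, a normalized LMS update):

```python
import numpy as np

rng = np.random.default_rng(2)

N = 20000
s = rng.choice([-1.0, 1.0], size=N)      # BPSK training symbols
c = np.array([1.0, 0.5, 0.25])           # illustrative minimum-phase channel
x = np.convolve(s, c)[:N]                # receive signal ...
x += 0.01 * rng.standard_normal(N)       # ... plus additive noise v(k)

M, delay, mu = 11, 5, 0.2                # equalizer length, decision delay, step size
w = np.zeros(M)
for k in range(M, N):
    xk = x[k - M + 1:k + 1][::-1]        # receive samples inside the filter
    e = s[k - delay] - w @ xk            # error against the delayed training symbol
    w += mu * e * xk / (xk @ xk)         # normalized LMS update

z = np.convolve(x, w)[:N]                # equalized signal
ser = np.mean(np.sign(z[M:]) != s[M - delay:N - delay])
print(ser)                               # symbol error rate after training
```

The hard decision sign(·) plays the role of the nonlinear one-to-one mapping f in Figure 1.7; after training, it recovers the transmit alphabet almost without error. Replacing s(k − delay) by the decided symbols gives the decision-directed mode mentioned above.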
[Figure 1.7: Adaptive equalization: the transmit signal x(k) passes through the channel c; with the additive noise v(k) it forms the receive signal y(k); the equalizer w produces z(k), from which the decision device f delivers the symbol estimates x̂(k).]
Both applications, linear prediction and adaptive equalization, follow the same reference
structure, shown in Figure 1.8. In the case of linear prediction, we set c = q⁻¹ and ∆ = 0,
while in the case of adaptive channel equalization, c denotes the channel and the training
sequence is represented by the reference signal, delayed by ∆.
[Figure 1.8: Common reference structure: d(k) passes through c and the adaptive filter w to produce y(k), which is subtracted from d(k) delayed by z^{−∆} to form the error e(k).]
A similar but even more complicated problem is the nonlinear pre-distortion of power
amplifiers, which is commonly used in wireless communications. In order to achieve a very
high power efficiency, such amplifiers are operated in the nonlinear C or F mode. For
bandwidths of less than 1 MHz the occurring nonlinearity is typically memoryless (Saleh
model),

    A(r) = α_A r / (1 + β_A r²)
    Φ(r) = α_Φ r² / (1 + β_Φ r²),        (1.1)

and can be corrected by nonlinear one-to-one mappings in amplitude and phase. For larger
bandwidths however, memory effects emerge and become more and more pronounced
with increasing bandwidth. So-called Volterra series are one possibility to describe such a
nonlinear system with memory. For example, a Volterra series truncated at order P maps
an input signal u(n) to an output y(n) according to

    y(n) = Σ_{p=0}^{P} h_{p,n}[u(n)].        (1.2)
The advantage of this representation is that the output is linear in the coefficients. The
nonlinear effect is achieved by delaying the input signal and combining different delayed
versions in higher-order polynomials. A further advantage of a truncated Volterra series is
that its inverse can be derived symbolically, which is very important for pre-distortion. In
practice, however, Volterra series require a large number of coefficients. Figure 1.9 shows a
typical pre-distortion scenario. After the identification of the power amplifier, the system
is inverted and the so-obtained pre-distorter is placed before the power amplifier. In the
ideal case, the cascade of the pre-distorter followed by the power amplifier behaves linearly
(up to the saturation point).
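The memoryless Saleh characteristics of (1.1) are straightforward to evaluate. The sketch below uses illustrative parameter values (α_A, β_A, α_Φ, β_Φ are assumptions, not values from these notes):

```python
import numpy as np

# parameter values below are illustrative assumptions, not from these notes
alpha_A, beta_A = 2.0, 1.0        # AM/AM parameters
alpha_P, beta_P = np.pi / 3, 1.0  # AM/PM parameters

def saleh(u):
    """Memoryless Saleh model (1.1) applied to a complex baseband signal."""
    r = np.abs(u)
    gain = alpha_A / (1.0 + beta_A * r**2)           # A(r) / r
    phase = alpha_P * r**2 / (1.0 + beta_P * r**2)   # Phi(r)
    return u * gain * np.exp(1j * phase)

amps = np.linspace(0.0, 3.0, 301)            # input amplitude sweep
out = np.abs(saleh(amps.astype(complex)))
print(out.max())                             # saturates at alpha_A / (2 sqrt(beta_A))
```

The AM/AM curve A(r) peaks at r = 1/√β_A, which is the saturation behavior the pre-distorter has to invert (and beyond which no memoryless inverse exists).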
In order to reduce the complexity of the problem, other adaptive filter structures are
being employed. Most suitable are two-block models consisting of a linear dispersive block
and a memoryless nonlinear block. Depending on the order of these two blocks, the systems
are called Wiener model or Hammerstein model. As depicted in Figure 1.10, the structure
of the Wiener model is given by a linear filter followed by a memoryless nonlinearity. For
the Hammerstein model, the order of the blocks is reversed. Currently, it is not known how
to guarantee convergence of such adaptive filter structures.
[Figure: adaptive two-block structure with coefficient blocks b and a, the memoryless nonlinearity f(z), the delay q⁻¹, and the adaptive laws driving the coefficients.]
As a second categorization, adaptive algorithms can be divided into online and offline
algorithms. If all required data are available before the algorithm is started, it can run a
certain number of iterations until it stops, resulting in a (sub-)optimal solution. Hence,
algorithms operated in this offline mode are often called iterative algorithms. In
contrast, if new data become available with each iteration step, the algorithm operates in
online mode, which is also called recursive mode. An adaptive filter typically works in
recursive mode, since it tries to adapt to a permanently changing environment.
Example 1.1 (Iterative Algorithm) Consider the following update equation, starting
with w₀ = 0:

    w_k = w_{k−1} + (1/k!) a ;    k = 1, 2, ..., N

Example 1.2 (Recursive Algorithm) Consider the following update equation, starting
with w₀ = 0:

    w_k = w_{k−1} + (1/k!) a_k ;    k = 1, 2, ..., N

In the second (recursive) algorithm the data a_k change from step to step, while in the first (iterative)
algorithm, a remains constant. Question: where does the first algorithm end?¹
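Example 1.1 is easy to verify numerically: with the factorial step size 1/k!, the iterates are the partial sums of a·Σ 1/k!, which explains the limit (e − 1)a stated in the footnote.

```python
import math

a = 2.0                        # the constant datum of Example 1.1
w = 0.0                        # w_0 = 0
for k in range(1, 21):         # N = 20 iterations suffice
    w += a / math.factorial(k)

print(w, (math.e - 1) * a)     # the iteration ends near (e - 1) * a ~= 1.71 a
```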
A third distinction of adaptive algorithms is made with respect to their cost functions. The
reference model defines a desired signal, the reference signal, that is to be approximated by
the adaptive filter. In general, the error between this reference signal and its approximation
is assessed by a specific cost function. Typically, the larger the difference, the larger are the
costs. Depending on the class of signals considered, the cost function may contain
expectations for stochastic signals, or simple norms respectively metrics for deterministic
signals. Often, in the stochastic case, the expectation E[|e(n)|^p] has to be minimized; in
the deterministic case, the sum Σ_{i=0}^{n} |e(i)|^p. These two cost functions are similar but not
the same. Also minimax formulations are common and will be presented in later chapters.
1.3 Nomenclature
The following nomenclature is applied throughout these lecture notes:

∗ denotes the complex conjugate of a scalar, a vector, or a matrix.
T as a superscript indicates the transpose of a vector or a matrix.
H as a superscript indicates the Hermitian transpose, i.e., the conjugate
transpose of a vector or a matrix.
a, b, c, ... denote (deterministic) scalars.
a, b, c, ... denote scalar random variables.
f_a(a) denotes the probability density function of the random variable a.
E(·) or E[·] denotes the expectation of a random variable. If the argument is a vector
or a matrix, the expectation is taken element by element.
σ_a² denotes the variance of the random variable a.
ā denotes the mean of the random variable a.
a, b, c, ... denote deterministic (column-)vectors.
1 is a (column-)vector (of adequate dimension) with all elements equal to one.
a, b, c, ... denote (column-)vectors whose entries are all random variables.
f_a(a) denotes the joint probability density function of the random vector a.
R_aa denotes the autocorrelation matrix of the random vector a, R_aa = E[aa^H].
R_ab denotes the cross-correlation matrix of the two random vectors a and b,
R_ab = E[ab^H].
r_ab or r_ba denotes the cross-correlation vector of the random vector a with the random
variable b, or vice versa; r_ab = E[ab*] is a column vector, r_ba = E[ba^H] is
a row vector, and obviously it holds that r_ab = r_ba^H.
r_ab denotes the cross-correlation of the random variables a and b, r_ab = E[ab*].
¹The algorithm converges to (e − 1)a ≈ 1.71a.
In these lecture notes, in some contexts, linear filters are represented by capital letters
followed by the argument q⁻¹. Here, q⁻¹ denotes the backward-shift operator, and it is added
to stress that the capital letter has the meaning of an operator as well. For example, we write
B(q⁻¹)[v(k)] = B[v(k)] for a sequence v(k), filtered by an FIR filter with the coefficients
b(0), b(1), ..., b(M − 1). Using this operator notation, also recursive filter structures (IIR)
can easily be described. To see this, consider the expression y(k) = B(q⁻¹)/(1 − A(q⁻¹)) [v(k)], which
corresponds to the difference equation

    y(k) = Σ_{l=1}^{N} a(l) y(k − l) + Σ_{l=0}^{M−1} b(l) v(k − l).        (1.4)

We prefer this notation to the z-transform since, by this, we do not need to state any
conditions regarding the input sequence (Dirichlet conditions).
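The difference equation (1.4) translates directly into code; the following minimal sketch (an illustration, not part of the notes) assumes the filter is at rest, i.e., all samples with negative index are zero:

```python
def iir_filter(v, a, b):
    """Evaluate the difference equation (1.4),
    y(k) = sum_{l=1}^{N} a(l) y(k-l) + sum_{l=0}^{M-1} b(l) v(k-l),
    for a finite input sequence v; samples with negative index are zero."""
    y = []
    for k in range(len(v)):
        acc = sum(a[l - 1] * y[k - l] for l in range(1, len(a) + 1) if k - l >= 0)
        acc += sum(b[l] * v[k - l] for l in range(len(b)) if k - l >= 0)
        y.append(acc)
    return y

# first-order example y(k) = 0.5 y(k-1) + v(k): geometric impulse response
print(iir_filter([1.0, 0.0, 0.0, 0.0], [0.5], [1.0]))
```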
As we operate a lot with matrices and vectors, we will introduce some short notation.
In particular, for the differentiation with respect to a real- or a complex-valued vector, we
will use the following rules (for w ∈ ℝ and z ∈ ℂ; see also Appendix A for a more detailed
discussion):

    ∂(Rw)/∂w = R
    ∂(wᵀRw)/∂w = wᵀ[R + Rᵀ]
    ∂(Rz)/∂z = R
    ∂(z^H Rz)/∂z = z^H R.

²Reminder: a Hermitian matrix is positive (semi-)definite if all of its eigenvalues are greater than (or equal to) zero.
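The rule for the quadratic form can be checked numerically with finite differences. The sketch below (an illustration for the real-valued case) compares wᵀ(R + Rᵀ) against a central-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.standard_normal((4, 4))            # a general (non-symmetric) matrix
w = rng.standard_normal(4)

f = lambda v: v @ R @ v                    # the quadratic form w^T R w
grad = w @ (R + R.T)                       # claimed gradient: w^T (R + R^T)

eps = 1e-6                                 # central finite differences
num = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.max(np.abs(grad - num)))          # agreement up to rounding
```

Since the cost is quadratic, the central difference is exact up to floating-point rounding, so the two gradients coincide to high precision.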
Exercise 1.1 Consider the adaptive structure in Figure 1.12. The blocks indicate transversal
filters with the weights collected in the corresponding vectors. Find an expression for
the error signal e(k) as a function of the input signal x(k) and the disturbance v(k) (hint:
make use of the operator notation for linear systems!). What solution is to be expected for
w_k if perfect adaptation is achieved? Which class of adaptive schemes does this structure
belong to? Which class would it belong to if the blocks a and w_k were interchanged?
[Figure 1.12: Adaptive structure with the blocks b, c, a and the adaptive filter w_k; signals d(k), v(k), x(k), y(k), and error e(k).]
Exercise 1.2 Show under which conditions the equalizer in Figure 1.7 becomes a system
identification. Assume that the blocks c and w are linear systems.
Chapter 2
Fundamentals of Stochastics
In this chapter, we briefly summarize a few fundamental methods which are commonly used
in stochastics. These methods will form the basis for many analyses presented throughout
these lecture notes. We will present least-mean-squares (LMS) estimation in the general case as well
as for a linear system model, which will lead us to the so-called linear least-mean-squares
(LLMS) estimator together with the Wiener solution. As such parabolic problems in most
cases lead to systems of linear equations, we discuss efficient methods (Durbin, Levinson,
and Trench) to solve them. Furthermore, we will discuss the so-called steepest-descent
algorithm, which is an iterative approach for solving parabolic problems.
Consequently, for zero mean random variables, we find σ_x² = E[x²]. Intuitively, a low
variance indicates that a desired value is near to the mean. More precisely, with a high
probability, the desired value will be found in the vicinity of the mean, and the size of
the corresponding interval around the mean can be deduced from the variance σ_x²:
• A small variance σ_x² indicates that the desired value is (likely to be) close to the mean.
• A large variance σ_x² indicates that the desired value may fall into a large interval
around the mean.
Thus, the variance provides a measure for the uncertainty of the estimated (desired) value.
The just-described meaning of the variance can be formulated quantitatively for some
random (desired) variable x by the so-called Chebyshev inequality¹, which states

    P(|x − x̄| ≥ δ) ≤ σ_x²/δ².        (2.2)

Accordingly, the probability that the desired value lies outside the interval [x̄ − δ, x̄ + δ] is bounded
by σ_x²/δ². For example, the probability that the desired value lies outside of x̄ ± σ_x is bounded by
100%. Thus, in this case, there is no effective bound on the probability. But the probability
that the desired value is not contained in the interval [x̄ − 5σ_x, x̄ + 5σ_x] is bounded by 4%.
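A quick Monte Carlo check of Chebyshev's inequality (2.2): the Gaussian distribution and the parameter values below are illustrative assumptions, while the bound itself holds for any distribution with finite variance.

```python
import numpy as np

rng = np.random.default_rng(4)
mean, sigma = 1.0, 2.0                        # illustrative parameter values
x = mean + sigma * rng.standard_normal(200000)

delta = 5 * sigma
p_emp = np.mean(np.abs(x - mean) >= delta)    # empirical tail probability
p_cheb = sigma**2 / delta**2                  # Chebyshev bound: 1/25 = 4%
print(p_emp, p_cheb)
```

For the Gaussian case the empirical 5σ tail probability is far below the distribution-free 4% bound, illustrating that Chebyshev's inequality is loose but universal.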
We assume now that (only) mean and variance of the random variable x are known,
and ask the question how to obtain an adequate estimate for the (unknown) value of x.
Of course, which estimate is adequate depends on the chosen method. Such a method is
generally called an estimator and relies on some measure of quality. An estimator suitable
with respect to one quality measure may be unacceptable if evaluated based on another
quality measure. A suitable quality measure could be the distance of the estimate from the
true value:
E[x − x̂]. (2.3)
Here, negative and positive values are obviously canceling each other out, and thus, an
estimator based on this quality measure may deliver a wrong picture. Recognizing that,
a more suitable measure or metric for the quality of estimation is given by the quadratic
measure E[(x − x̂)2 ]. Having agreed on a metric, the best estimator is the one which leads
to a minimal value of the metric. Consequently, for the quadratic metric, we want to know
the estimator x̂ which is optimal according to the minimization problem:
Quadratic metrics are very suitable as they are simple to manipulate, and they typically
lead to explicit analytical results. Also other metrics like the l1 -norm (absolute norm) are
used, because sometimes they allow for implementations with lower complexity.
Lemma 2.1 Given the mean x̄ and variance σ_x² of a random variable x, the least-mean-squares
(LMS) estimate x̂ is optimal if x̂ = x̄.
Proof: E[(x − x̂)²] = E[([x − x̄] + [x̄ − x̂])²] = σ_x² + (x̄ − x̂)², since the cross term vanishes due to E[x − x̄] = 0; the expression is obviously minimized by x̂ = x̄.
¹Note that this inequality does not rely on the actual probability density function of x.
Obviously, in the absence of further information, the best estimate is given by
the mean. Consider the estimation error e:

    e = x − x̂ = x − x̄.

At this point it is noteworthy that the variance of the estimation error is as large
as the variance of the random variable itself². We thus learn that this estimator has not
changed the uncertainty regarding x.
Let us now imagine that we observe a second random variable y that is correlated to
the first one x, meaning that y increases our knowledge about x. It should be possible
to formulate the estimate x̂ for x such that the quality of it is improved in comparison to
the estimator which does not incorporate the additional information provided by y. If the
estimator for x can be expressed by a function (mapping) of y such that
x̂ = h[y],
then, this function h[·] itself is called an estimator (Ger.: Schätzverfahren oder Schätzer).
Once an argument is provided, we obtain an estimate (Ger.: Schätzwert). Note that now
the estimate x̂ is a random variable itself, because it is obtained by the mapping of the
random variable correlated to x.
Lemma 2.2 The least-mean-squares estimator (LMSE) of x given y is E[x|y]. (The estimate
is given by E[x|y = y].) The minimum mean-square error (MMSE, Ger.: minimales
Fehlerquadrat) is given by:

    min_x̂ E[(x − x̂)²] = E[x²] − E[x̂²].

Due to f_y(y) ≥ 0, the first term is positive. Therefore, it can simply be interpreted as a
positive weighting term. The second term can be differentiated with respect to x̂, which
provides us with the solution for the minimum:

    x̂ = E[x|y = y] = ∫ x f_{x|y}(x|y) dx.        (2.8)

²Check this!
Example 2.1 Consider a random variable z = x + y. Let the two random variables x and
y be statistically independent. Moreover assume that x takes on the values ±1 with equal
probability, and that y is zero mean Gaussian distributed with variance σ_y². We now want
to identify the LMS estimator for x given z = z.
Solution:
The LMS estimator is given by

    x̂ = E[x|z = z] = ∫ x f_{x|z}(x|z) dx.        (2.9)

In order to obtain the conditional density function, we first compute the density f_z(z).
As x takes on only two different values with equal probability, we find:

    f_z(z) = (1/2) f_y(z + 1) + (1/2) f_y(z − 1).

In the next step we compute the conditional density function f_{x|z}(x|z):

    f_{x|z}(x|z) = [f_y(z − 1) δ(x − 1) + f_y(z + 1) δ(x + 1)] / [f_y(z + 1) + f_y(z − 1)]

and thus

    x̂ = [f_y(z − 1) − f_y(z + 1)] / [f_y(z + 1) + f_y(z − 1)]        (2.12)
      = tanh(z / σ_y²).        (2.13)

Obviously, if y has unit variance, the best estimator for x given z is x̂ = tanh(z). Often,
it is not as easy to derive an explicit expression. However, if x and z are jointly Gaussian
distributed, this is usually possible.
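The estimator of Example 2.1 can be validated by simulation. The sketch below (illustrative, not from the notes) compares the empirical mean-square error of x̂ = tanh(z/σ_y²) with that of a plausible competitor, the hard decision sign(z); the conditional-mean estimator must perform at least as well:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400000
sigma_y = 1.0
x = rng.choice([-1.0, 1.0], size=N)        # x = +/-1 with equal probability
y = sigma_y * rng.standard_normal(N)       # independent Gaussian noise
z = x + y                                  # the observation of Example 2.1

xhat = np.tanh(z / sigma_y**2)             # LMS estimator (2.13)
mse_tanh = np.mean((x - xhat)**2)
mse_sign = np.mean((x - np.sign(z))**2)    # hard decision as a competitor
print(mse_tanh, mse_sign)                  # the conditional mean wins
```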
There is also a geometric interpretation of the LMS estimator. Consider a function g(·)
operating on the random variable y. Since, by the law of total expectation, we have
E[x g(y)] = E[E[x|y] g(y)] = E[x̂ g(y)], we find:

    E[(x − x̂)g(y)] = 0.        (2.16)

Consequently, the estimation error e = x − x̂ is uncorrelated to
any arbitrary function of the second (correlated) random variable. The latter equation
actually states the orthogonality of e and g(y); that they are also uncorrelated is a direct
consequence of e being zero mean. Hence, we briefly say: “the error is orthogonal”.
Exercise 2.1 Let x and y be two complex-valued (second-order circular) jointly Gaussian
and zero mean random vectors with dimensions p × 1 and q × 1, respectively. Their individual
probability density functions are given by

    f_x(x) = 1/(π^p |det R_xx|) · exp{−x^H R_xx⁻¹ x}
    f_y(y) = 1/(π^q |det R_yy|) · exp{−y^H R_yy⁻¹ y}.

Moreover, assume that the cross-correlation matrix does not vanish, i.e., R_xy = R_yx^H ≠ 0.
1.) Determine the joint probability density function f_{x,y}(x, y).
3.) Arrange the terms in f_{x|y}(x|y) such that one term only depends on y.
Hint:

    [ R_xx  R_xy ]⁻¹   [ I             0 ] [ Σ⁻¹  0      ] [ I  −R_xy R_yy⁻¹ ]
    [ R_yx  R_yy ]    = [ −R_yy⁻¹ R_yx  I ] [ 0    R_yy⁻¹ ] [ 0   I           ],

where Σ = R_xx − R_xy R_yy⁻¹ R_yx.
4.) Derive the optimum estimator h(y) for x (in the least-mean-squares sense).
5.) Determine the (minimum) mean-square-error which is achieved by the estimator h(y).
6.) Now, allow the random vectors to have some non vanishing mean. Modify the results
of 4.) and 5.) according to this case.
    x̂ = E[x|y₁, y₂, ..., yₙ].

Now we will consider particular solutions for special cases. We have already seen that jointly
Gaussian processes typically lead to estimators which can be handled much more easily, since
they lead to linear expressions. Due to this insight, in this section we will consider linear
estimators even if the joint probability densities are not Gaussian. Such estimators are of
particular interest if the considered random variables are zero mean.
Consider two correlated zero mean random vectors x and y which may have different
dimensions. Let an estimator for x be given by x̂ = Ky. The dimensions of the matrix K
are inherently given by the sizes of x and y. We now want to find K such that the error
metric becomes minimal, that is,

    min_K E[(x − x̂)(x − x̂)^H].
Here, the cost function is not a scalar but a matrix! Consequently, we search for the matrix
K which minimizes the error-covariance matrix (which can actually be done based on any
matrix norm). Using the fact that both random vectors were assumed to be zero mean, we
find:

    E[x̂] = E[Ky] = K E[y] = 0 = E[x].        (2.17)

Obviously, in this case, the estimator x̂ is bias-free (Ger.: erwartungstreu). The following
lemma generalizes the just-considered example.

Lemma 2.3 The best linear LMS estimator (LLMSE) K_o for two correlated random vectors
x and y which are both zero mean is given by:

    K_o = R_xy R_yy⁻¹.        (2.18)
We are interested in the MMSE; hence, this expression needs to be minimized with respect
to K. To achieve this, we compare the above expression with (K − K_o)B(K − K_o)^H and
identify the various terms. As B = R_yy is positive definite, the minimum is achieved for
K = K_o = R_xy R_yy⁻¹. The MMSE is finally found by substituting K in the expression on
the right-hand side of (2.19) with the optimal estimator.
Both equations, the optimal linear estimator and the corresponding MMSE, can be
unified in one single expression:

    [ R_xx  R_xy ] [ I      ]   [ G_o ]
    [ R_yx  R_yy ] [ −K_o^H ] = [ 0   ].        (2.21)

This set of equations is called the normal equations (Ger.: Normalengleichungen), and the
estimator is called the Wiener solution. The estimated signal x is often called the
desired signal (Ger.: Wunschsignal). We thus find the optimal linear estimator by
minimizing the covariance matrix between the desired signal and the estimator.
Theorem 2.1 Consider two zero mean random vectors x and y. The linear estimator Ky
is an LLMSE for x if and only if:

    E[(x − Ky)y^H] = 0.        (2.22)

If R_yy is invertible (Ger.: regulär), then a unique K = K_o exists with this property.
Proof: From (2.18) it directly follows that K_o R_yy − R_xy = 0, which is equivalent to
E[xy^H] − K_o E[yy^H] = 0 (note the similarity to (2.16)). Accordingly, given the LLMSE,
orthogonality in the sense of (2.22) is ensured. However, we still have to show that the
orthogonality is also necessary to obtain the LLMSE. For an arbitrary K, the MSE is

    E[(x − Ky)(x − Ky)^H] = E[(x − Ky)x^H] − E[(x − Ky)y^H] K^H.        (2.23)

Due to the orthogonality condition (2.22), the second term is zero. We see that then
(2.23) is equal to the minimum G_o in Lemma 2.3 if K = K_o. Since it is also known that
the cost function reaches its minimum in exactly one point, which is K_o, it is shown that
the orthogonality condition (2.22) has to be satisfied to achieve the MMSE, respectively
to obtain the LLMSE.
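The Wiener solution (2.18) and the orthogonality condition (2.22) can be checked with sample statistics; the construction of the correlated vectors below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)

# correlated zero mean random vectors x (2x1) and y (3x1), generated as
# linear mixtures of a common source vector (an illustrative construction)
n = 100000
s = rng.standard_normal((5, n))
A = rng.standard_normal((2, 5))
B = rng.standard_normal((3, 5))
x, y = A @ s, B @ s

Rxy = x @ y.T / n                  # sample cross-correlation E[x y^H]
Ryy = y @ y.T / n                  # sample autocorrelation  E[y y^H]
Ko = Rxy @ np.linalg.inv(Ryy)      # Wiener solution (2.18)

e = x - Ko @ y                     # estimation error
orth = e @ y.T / n                 # sample version of E[e y^H], cf. (2.22)
print(np.max(np.abs(orth)))        # numerically zero: the error is orthogonal
```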
Example 2.2 Consider the linear observation model

    y = W x + v,

where the zero mean random vector x has the autocorrelation matrix R_xx. We furthermore assume that R_xv = 0. We want to optimally estimate x based on the
observation y. We obtain:

    R_yy = W R_xx W^H + R_vv        (2.24)
    R_xy = R_xx W^H.        (2.25)

Thus, we find the LLMSE of x given by:

    x̂ = R_xx W^H [W R_xx W^H + R_vv]⁻¹ y.        (2.26)

If both matrices R_xx and R_vv are invertible, we can reformulate this as:

    x̂ = [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹ W^H R_vv⁻¹ y.        (2.27)
We thus find the corresponding MMSE:

    min_x̂ E[(x − x̂)(x − x̂)^H] = min_x̂ E[(x − x̂)x^H]        (2.28)
    = R_xx − [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹ W^H R_vv⁻¹ W R_xx        (2.29)
    = [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹.        (2.30)

In the above example, (2.27) and (2.29) are derived from (2.26) and (2.28), respectively,
using the matrix inversion lemma:

    [A + BCD]⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹.
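That (2.26) and (2.27) describe the same estimator can be confirmed numerically; the covariance choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 5
W = rng.standard_normal((n, m))    # observation matrix (real, so W^H = W^T)
Rxx = 2.0 * np.eye(m)              # illustrative covariance of x
Rvv = 0.5 * np.eye(n)              # illustrative covariance of v
y = rng.standard_normal(n)         # an arbitrary observation

# estimator form (2.26)
x1 = Rxx @ W.T @ np.linalg.inv(W @ Rxx @ W.T + Rvv) @ y
# estimator form (2.27), related to (2.26) by the matrix inversion lemma
x2 = np.linalg.inv(W.T @ np.linalg.inv(Rvv) @ W + np.linalg.inv(Rxx)) \
     @ W.T @ np.linalg.inv(Rvv) @ y

print(np.max(np.abs(x1 - x2)))     # both forms give the same estimate
```

Form (2.26) requires inverting an n × n matrix, form (2.27) an m × m matrix, so which one is cheaper depends on the dimensions of x and y.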
Example 2.3 Finally, we want to consider a special case of Example 2.2, where y and v
reduce to scalars. Assuming that the vector x has dimension m × 1, the matrix W reduces
to a row vector wᵀ, where w has the same dimension as x. We then find the observation
equation to be

    y = wᵀx + v.

Based on the results of Example 2.2, the estimator can be shown to be

    x̂ = (y R_xx w*) / (wᵀ R_xx w* + σ_v²).

If x is moreover a white random process, its autocorrelation simplifies to R_xx = σ_x² I, and
thus,

    x̂ = (y w*) / (‖w‖₂² + σ_v²/σ_x²).
26 Adaptive Filters (preliminary)
Table 2.2: Comparison of the linear LMS estimator for a minimal parameter error and a minimal observation error.

Given                                          | LLMSE of x
-----------------------------------------------|--------------------------------------------------------------
{x, y}, {Ryy, Rxy, Rxx},                       | x̂ = Rxy Ryy^{-1} y
E x = E y = 0                                  | MMSE = Rxx − Rxy Ryy^{-1} Ryx
-----------------------------------------------|--------------------------------------------------------------
y = W x + v, {Rxx, Rvv, W},                    | x̂ = Rxx W^H [W Rxx W^H + Rvv]^{-1} y, or
E x = E y = E v = 0,                           | x̂ = [W^H Rvv^{-1} W + Rxx^{-1}]^{-1} W^H Rvv^{-1} y
E[xv^H] = 0                                    | MMSE = [W^H Rvv^{-1} W + Rxx^{-1}]^{-1}
-----------------------------------------------|--------------------------------------------------------------
y = w^T x + v, {σx², σv², w},                  | x̂ = y Rxx w* / (w^T Rxx w* + σv²)
E x = 0, E y = E v = 0,                        | MMSE = Rxx − Rxx w* w^T Rxx / (w^T Rxx w* + σv²)
E[xv*] = 0                                     |
Exercise 2.2 Consider a linear estimator K for the random vector x based on the obser-
vation y. Show that the error G(K) satisfies for arbitrary v and K:
v H Go v ≤ v H G(K)v,
where Go denotes the minimum mean-square-error (MMSE), which is achieved by the op-
timal linear estimator.
Let further a be a zero-mean random variable with variance σa2 , and v(k) be zero-mean white
noise with variance σv2 . Assume that a and v(k) are uncorrelated for all k = 1, 2, . . . , N ,
and that the frequency f0 is constant and known. Derive the best linear estimator for a,
based on the observations y(1), y(2), . . . , y(N ) and determine the achieved MMSE.
Univ.Prof. DI Dr.-Ing. Markus Rupp 27
d = w^T x + v = x^T w + v. (2.32)

In contrast to the previous analyses we now assume that the output d as well as the input
x can be observed. Thus, the autocorrelation matrix³ Rxx and the cross-correlation vector
rxd = E[x d*] are known. Moreover, it is assumed that the additive noise v is statistically
independent of the input x.

³ Note the identities Rxx = E[xx^H] and R*xx = E[x* x^T] = Rxx^T!
To find the optimal w which minimizes the observation error in the LMS sense, it is obviously
sufficient to know Rxx and r*xd. If Rxx is invertible the solution is uniquely given by:

wo = (R*xx)^{-1} r*xd.
In Section 2.2, it has already been shown that this is the linear LMS estimator, respectively
the Wiener solution. The required matrix inversion can be problematic and challenging. On
the one hand, the matrix may not be well conditioned, which leads to numerical problems
(this problem commonly occurs if speech signals are involved). On the other hand, for a
matrix with dimensions M × M, the numerical inversion generally has a complexity of order
O(M³), which especially for large matrices leads to high demands on processing power.
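For real-valued signals the conjugates drop and the Wiener solution reduces to wo = Rxx^{-1} rxd; the O(M³) cost mentioned above is exactly that of a general linear solve. A minimal NumPy sketch (the function name and test signals are ours, not the book's):

```python
import numpy as np

def wiener_solution(x, d, M):
    """Estimate the order-M Wiener solution w_o = R_xx^{-1} r_xd from sample
    averages (real-valued sketch; a general solve still costs O(M^3))."""
    N = len(d)
    # regressor rows x_k = [x(k), x(k-1), ..., x(k-M+1)]
    X = np.array([x[k - M + 1:k + 1][::-1] for k in range(M - 1, N)])
    Rxx = X.T @ X / len(X)             # sample autocorrelation matrix
    rxd = X.T @ d[M - 1:N] / len(X)    # sample cross-correlation vector
    return np.linalg.solve(Rxx, rxd)   # preferable to an explicit inverse

# identify a short FIR system from noisy observations d(k) = w_o^T x_k + v(k)
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])
x = rng.standard_normal(5000)
d = np.convolve(x, w_true)[:len(x)] + 0.01 * rng.standard_normal(len(x))
w_hat = wiener_solution(x, d, 3)
```

With enough data the sample averages approach Rxx and rxd, and w_hat approaches the true coefficients.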
The essential observation is that we can derive the solution for order M as soon as the
solution of order M − 1 is available. To achieve this, we reformulate the matrix in a block
structure:

Rxx,M wM = rxx,M (2.36)

[ Rxx,M−1       r^B_xx,M−1 ] [ w̃M−1 ]   [ rxx,M−1 ]
[ r^BH_xx,M−1   rxx(0)     ] [ w(M)  ] = [ rxx(M)  ]. (2.37)
Here, the index indicates the dimension of the solution (as well as the dimension of the
square matrices and the vectors). The notation r^B_xx,M means that the vector is applied in
a backward form:

[rxx(1), rxx(2), rxx(3), . . . , rxx(M)]^B = [rxx(M), rxx(M−1), rxx(M−2), . . . , rxx(1)]. (2.38)
Equivalently, such a backward notation can be achieved by a Hankel matrix B, the exchange
matrix with ones on the antidiagonal and zeros elsewhere:

r^B_xx,M = B rxx,M. (2.39)
For reasons of symmetry, we find: Rxx,M B = B R*xx,M and R*xx,M B = B Rxx,M. Without
loss of generality, we now assume that rxx(0) = 1, which simply normalizes the entire
equation. We thus obtain⁴:

Rxx,M yM = rxx,M (2.40)

[ Rxx,M−1       B rxx,M−1 ] [ zM−1 ]   [ rxx,M−1 ]
[ r^H_xx,M−1 B  1         ] [ αM   ] = [ rxx(M)  ]. (2.41)
Note that we deliberately introduced the vector zM−1 to circumvent the symbol yM−1, which
would denote the solution of order M − 1. The Durbin algorithm can now be derived as
follows. Assume that the solution yM−1 of order M − 1 is available.
For the remaining element αM, we find an expression from the second line of (2.41).
Reordering all relevant equations, we obtain the well known Durbin algorithm.
Note that the dimensions of the vectors rxx,k, wk, and zk increase according to the index k.
If we count the number of required MAC (Multiply and Accumulate, or simply Mult/Add)
operations, we find a complexity of 3M². We can furthermore show that

βk = (1 − α²k−1) βk−1. (2.49)
Rxx,M yM = b (2.50)

[ Rxx,M−1       r^B_xx,M−1 ] [ zM−1 ]   [ bM−1 ]
[ r^BH_xx,M−1   rxx(0)     ] [ δ    ] = [ b(M) ]. (2.51)
As before, the right hand side is partitioned into a vector of dimension M − 1 and
a scalar b(M). The entire Levinson algorithm reads:
Analogous to the Durbin algorithm, the dimensions of the vectors rxx,k, wk, ηk, and zk
grow corresponding to the index k. The complexity of the Levinson algorithm is 4M². Thus,
it is just twice as large as the complexity of the Durbin algorithm.
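The order-recursive idea can be sketched as follows; this is a standard real-valued formulation of the Durbin recursion in our own notation (not a transcription of the book's update equations):

```python
import numpy as np

def durbin(r):
    """Durbin recursion: solve the Toeplitz system R_M y_M = [r(1), ..., r(M)]^T,
    where R_M has first row [r(0), ..., r(M-1)], in O(M^2) operations
    (real-valued sketch)."""
    r = np.asarray(r, dtype=float)
    M = len(r) - 1
    y = np.array([r[1] / r[0]])          # order-1 solution
    beta = r[0] * (1.0 - y[0] ** 2)      # beta_k = (1 - alpha_{k-1}^2) beta_{k-1}
    for k in range(2, M + 1):
        alpha = (r[k] - r[1:k][::-1] @ y) / beta
        y = np.concatenate([y - alpha * y[::-1], [alpha]])   # order update
        beta *= 1.0 - alpha ** 2
    return y

# autocorrelation of an AR(1) process, r(m) = 0.7^m: the exact solution is
# the one-tap predictor [0.7, 0, 0, 0]
y = durbin([1.0, 0.7, 0.49, 0.343, 0.2401])
```

The quantity beta tracks the recursion (2.49); each order update costs O(M) operations, giving the quadratic overall complexity quoted above.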
γ = (1 + r^T_xx,M−1 yM−1)^{-1};
zM−1 = γ y^B_M−1;
L(1,1) = γ;
L(1,2:M) = γ y^T_M−1;
for m = 2 : (M−1)/2 + 1
    for n = 2 : M − m + 1
        L(m,n) = L(m−1,n−1) + (1/γ)[z(M+1−m) z(M+1−n) − z(m−1) z(n−1)];
    end
end                                                          (2.52)

The result is found in the matrix L; note that only the upper triangular entries of the
matrix are calculated by the algorithm, since the matrix is symmetric. The complexity is
roughly 3M² (exactly: (13/4)M²).
Exercise 2.4 Show that for the Durbin algorithm (2.49), we find:

βk = (1 − α²k−1) βk−1.
Exercise 2.5 Formulate the Levinson algorithm, without conditions of the kind (k < M −
1). Program a RISC processor (for example TI-C6x) and compare the number of operations
with the predicted complexity.
Exercise 2.6 Reformulate the Durbin algorithm to solve Rxx,M −1 y M −1 = −rxx,M −1 . Show
that the computation of γ, or actually of 1/γ, is already solved in the Trench algorithm.
Let us consider one more time the observation equation (2.32). Again, the input x and
the output d are assumed to be known, and the final goal is to find a vector w such
that the linear combination of x and w optimally resembles the original output. Similar to
(2.33), this optimum is given by the ŵ which minimizes the quadratic cost function

g(ŵ) = E[|d − ŵ^T x|²]. (2.54)

The cost function has the shape of a multi-dimensional paraboloid, with the minimum

go = g(wo) = min_ŵ E[|d − x^T ŵ|²]. (2.56)
The optimum solution ŵ = wo is already known from the Wiener solution, and the
corresponding (minimum) value of the cost function can be easily calculated using the well
known orthogonality relation (2.16), leading to

go = σd² − r^H_xd Rxx^{-1} rxd. (2.57)
By substitution of (2.57) in (2.55), we obtain the description (2.58), which again
demonstrates that the cost function is quadratic. Knowing that, we can expand it into a
Taylor series at some point ŵk−1:

g(ŵ) = g(ŵk−1) + ∇g(ŵk−1)(ŵ − ŵk−1) + ½ (ŵ − ŵk−1)^H ∇²g(ŵk−1)(ŵ − ŵk−1). (2.59)
However, we still need to calculate the first and second order derivatives.
The gradient is obtained by differentiating the cost function in (2.58) with respect to ŵ:

∇g(ŵk−1) = ∂g(ŵ)/∂ŵ |_{ŵ=ŵk−1} = [ŵk−1 − wo]^H R*xx. (2.60)

Note that the gradient is a row vector! Expanding the result in (2.60) leads to an alternative
representation of the gradient:

∂g(ŵ)/∂ŵ = ŵ^H R*xx − r^T_xd. (2.61)
In (2.61), the orthogonality relation was applied to directly see that after the expansion,
the second term coincides with the cross-correlation between x and d, i.e., wo^H R*xx = r^T_xd.
The second derivative is obtained by differentiating the gradient:

∇²g(ŵk−1) = R*xx. (2.62)
The function g(ŵ) is sufficiently smooth so that every point can be associated with
its gradient. In each point, the gradient points in the direction of the (locally) steepest
ascent. Thus, it points away from the minimum. On the other hand, then, its negative
(and complex conjugate) value has to point in the direction of (locally) steepest descent,
which (at least roughly) aims at the global minimum. According to this insight, in the
update equation (2.53), we take the negative (complex conjugate, transposed) gradient as
the direction of improvement: zk = (r*xd − R*xx ŵk−1).
A typical shape of the cost function (2.58) is depicted in Figure 2.1 for a 2-dimensional
real-valued parameter vector (ŵ^T = [ŵ₁, ŵ₂]). On the right-hand side, the corresponding
contour plot is given, which shows some lines of constant cost. Additionally, for a few
points, the negative gradients are included. We observe that for points far away from the
minimum, the negative gradient does not directly point to the minimum. Nevertheless, it
gives a hint where the minimum will be found. Therefore, it becomes clear that only a
step-wise iteration can ensure the localization of the minimum.
Considering again the initial general update equation (2.53), we can express the cost of
ŵk with respect to the cost of ŵk−1 by substituting (2.53) in the Taylor series expansion
(2.59), leading to:
then, (2.63) provides us with the insight that the following inequality has to be satisfied:
Here, the index i denotes the i-th entry of the vector u and λi are the diagonal entries of
the diagonal matrix Λ. From the diagonalized form (2.72), we can immediately identify the
convergence conditions of this iterative algorithm:
We see that by means of the diagonalization and coordinate transformation, the solution
of the equivalent system of homogeneous linear difference equations can be written in the
form ∏_{l=1}^{k} (1 − µ(l)λi). If the above discussed convergence condition for the step-size is
satisfied, these products decay exponentially and we can determine the adaptation
rate. The adaptation rate is the speed at which these products decay, and therefore, it
is also the speed of the learning process performed by the filter when it adapts towards
the correct value. The nearer the expressions (1 − µ(k)λi) are to zero, the faster the error
value approaches zero. At first glance, the choice µ(k) = 1/λi may seem to be optimal;
however, this is only optimal with respect to one eigenvalue.
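The iteration just described — repeatedly stepping along the direction zk = rxd − Rxx ŵk−1 — can be sketched for the real-valued case as follows (a minimal NumPy illustration; the matrices and names are ours):

```python
import numpy as np

def steepest_descent(Rxx, rxd, mu, iterations):
    """Steepest descent on the quadratic cost: each step moves along the
    negative gradient z_k = r_xd - R_xx w_{k-1} (real-valued sketch)."""
    w = np.zeros(len(rxd))
    for _ in range(iterations):
        w = w + mu * (rxd - Rxx @ w)
    return w

Rxx = np.array([[1.0, 0.5],
                [0.5, 1.0]])           # eigenvalues 0.5 and 1.5
wo = np.array([1.0, -2.0])
rxd = Rxx @ wo                         # cross-correlation of the reference model
w_hat = steepest_descent(Rxx, rxd, mu=1.0 / 1.5, iterations=200)
```

With µ = 1/λmax the error mode belonging to λmax is removed in a single step, while the mode belonging to λmin decays only as (1 − λmin/λmax)^k — exactly the "optimal with respect to one eigenvalue" effect noted above.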
The considerations in this section were focused on quadratic cost functions. Nevertheless,
this does not mean that the method of steepest descent is restricted to such cost functions.
If an arbitrary smooth cost function is given, we can always derive a Taylor series as done
in (2.59), leading to:
g(ŵ) = g(ŵk−1) + ∇g(ŵk−1)(ŵ − ŵk−1) + ½ (ŵ − ŵk−1)^H ∇²g(ŵk−1)(ŵ − ŵk−1) + . . . (2.75)
Of course, if the cost function has higher than quadratic order, the Taylor series
expansion contains not only the constant, the linear, and the quadratic term, but also
terms of higher order. Then, the condition (2.65) for finding the global minimum is no
longer suitable. On the other hand, the quadratic terms can be seen as a hint that
there exist one or more points in whose vicinity the cost function has approximately the
shape of a parabola, and thus may show one or more (local) minima. If the steepest descent
method is applied, we will find one of these local minima. However, (in most cases) this
may not be the desired global minimum. Only for a quadratic cost function is it ensured
that the found minimum is the global one.
Exercise 2.7 Let x be the excitation signal of the steepest descent algorithm with the update
direction chosen to be the negative conjugate gradient of the quadratic cost function. Assume
that the eigenvalues λi of the autocorrelation matrix Rxx are known. Find the optimum fixed
step-size µopt in the sense that

µopt = arg min_µ max_{λi} |1 − µλi|.
Exercise 2.9 Derive the steepest descent algorithm (2.58)-(2.67) for real-valued signals.
Compare the results to the complex-valued case. What are the differences?
Exercise 2.10 Assume the matrix R to be positive definite. Under which conditions does
the following series converge, and which limit does it converge to?

Σ_{k=0}^{∞} (I − µR)^k
Exercise 2.11 Consider the quadratic cost function (2.59) for the standard steepest descent
algorithm (2.67). Show that the costs decrease fastest for the time-variant step-size:

µopt(k) = ‖∇g(ŵk−1)‖² / (∇g(ŵk−1) Rxx ∇^H g(ŵk−1)).
2.6 Literature
A good overview on estimation methods can be found in [34]. An introduction to the
steepest descent algorithm is given in [29], respectively the more recent edition [30], and in
[71].
Chapter 3: The Least-Mean-Squares Algorithm
The least-mean-squares (LMS) algorithm is by far the most frequently applied adaptive
algorithm. Its advantages are its numerical stability, its low computational complexity, as
well as its robustness. Almost all adaptive algorithms employed in practice are
LMS algorithms or derivatives of it. In this chapter, we introduce the LMS algorithm start-
ing with its classic interpretation as an approximation of the steepest descent algorithm. Its
most important properties like convergence bounds, convergence speed, and steady-state
error will be derived based on stochastic analyses, that is, the driving signals will be as-
sumed to be random processes. Additionally, we will also investigate the behavior of the
algorithm under deterministic sinusoidal excitation. The chapter will close with application
examples.
R̂*xx = x*k x^T_k (3.2)
r̂xd = xk d*(k). (3.3)
In (2.67), the gradient is a fixed direction which is given by the signal statistics of xk
and d(k). In contrast, the approximation of the gradient in (3.4) is itself a stochastically
changing direction, since it varies with the instantaneous values of the input vector xk .
Therefore, the name LMS is somewhat misleading. More precisely, it is a stochastic gradient
method. Nevertheless, the name LMS has been used extensively in literature, and thus,
will be used throughout this text as well. Note however that the original LMS estimator
from Lemma 2.2, in general, is an estimator given by a nonlinear function as was shown
in Section 2.1. We will see later in the context of robustness (see Chapter 7) that even
the name ‘stochastic gradient method’ is not entirely correct, since the algorithm works
perfectly well in the absence of any randomness.
Returning to (3.4), note that in contrast to (2.67), the estimated parameter vector ŵk is
also random, just like d(k) and xk. The error term [d(k) − x^T_k ŵk−1] is called the disturbed
a-priori error ẽa, as it is constructed from the a-priori estimate ŵk−1. Analogously, there
exists also a disturbed a-posteriori error constructed from the a-posteriori estimate ŵk:
ẽp = d(k) − x^T_k ŵk.
We obtain first variants of this algorithm by choosing different step-sizes. Time variant
step-sizes µ(k) appear to be practical, in particular when they are related to the power of
the input process. The following algorithms are common:
• general time variant step-size µ(k): stochastic gradient type algorithm.
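The resulting update ŵk = ŵk−1 + µ x*k ẽa(k) can be sketched for real-valued signals as follows (a minimal NumPy illustration; the reference model and names are ours, not the book's):

```python
import numpy as np

def lms(x, d, M, mu):
    """LMS adaptation of an M-tap transversal filter (real-valued sketch).
    Each step uses the instantaneous gradient estimate x_k * e_a(k)."""
    w = np.zeros(M)
    e = np.zeros(len(d))
    for k in range(M - 1, len(d)):
        xk = x[k - M + 1:k + 1][::-1]    # regressor [x(k), ..., x(k-M+1)]
        e[k] = d[k] - xk @ w             # disturbed a-priori error
        w = w + mu * e[k] * xk           # stochastic-gradient update
    return w, e

# identification of a reference model d(k) = w_o^T x_k + v(k)
rng = np.random.default_rng(1)
wo = np.array([1.0, 0.5, -0.25])
x = rng.standard_normal(20000)
d = np.convolve(x, wo)[:len(x)] + 0.001 * rng.standard_normal(len(x))
w_hat, e = lms(x, d, M=3, mu=0.01)
```

Unlike the steepest descent recursion, no statistics are required in advance; the price is a stochastically fluctuating update direction.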
Differentiating a cost function with respect to the parameters ŵk−1 and writing down a
gradient method following the idea "new estimate equals old estimate plus a step along the
negative gradient" is in most cases a successful approach. Strictly speaking, the algorithm
needs to be analyzed first; with statistical methods this is often not feasible. Finally we
show a typical example. We minimize E[|ẽa(k)|^K]. Differentiating with respect to ŵk−1
leads to the gradient −E[(K/2) |ẽa(k)|^{K−2} xk ẽa(k)]. The expectation is substituted by
its instantaneous value and we obtain:
Exercise 3.1 Derive an adaptive algorithm to minimize the cost function E[|ẽa (k)|] and
distinguish here complex-valued as well as real-valued signals.
Exercise 3.2 Consider an undisturbed, nonlinear system: y(k) = x^T_k w₁ + xx^T_k w₂ with
xx(k − i) = x(k)x(k − i), i = 0, 1, . . . , M₂ − 1. The parameter vectors w₁ and w₂ have the
dimensions M₁ × 1 and M₂ × 1. Derive an adaptive algorithm to minimize the additively
disturbed squared error signal. What is the acf matrix of the input process if x(k) is a
white Gaussian process?
3.2.1 Assumptions
Assumptions: Independence Assumption (Ger.: Unabhängigkeitsannahme)
• The observed desired signal d(k) originates from a reference model d(k) = w^T_o xk + v(k), with
zero mean processes x(k) and v(k).
• The vectors xk of the input process are statistically independent of each other, that
is, fxx(xk, xl) = fx(xk) fx(xl) for k ≠ l.
• The driving input process xk is of zero mean and circular (spherically invariant[5])
Gaussian distributed.
• The additive noise v(k) is statistically independent of the input process xk .
Note that by such conditions the vectors ŵk are statistically independent of xl, l > k.
Note that due to the independence assumption we can write E[Pk−1 x*k x^T_k] = Pk−1 R*xx
and E[x*k x^T_k |v(k)|²] = R*xx σv². Equation (3.13) can thus be reformulated to:

Pk = Pk−1 − µ Pk−1 R*xx − µ R*xx Pk−1 + µ² E[x*k x^T_k Pk−1 x*k x^T_k] + µ² R*xx σv². (3.14)
Furthermore we have for complex-valued spherically invariant Gaussian processes (see
Appendix D):

E[x*k x^T_k Pk−1 x*k x^T_k] = E[x*k x^T_k] Pk−1 E[x*k x^T_k] + trace[Pk−1 E[x*k x^T_k]] E[x*k x^T_k]
                           = R*xx Pk−1 R*xx + trace[Pk−1 R*xx] R*xx.

Hint 1: For real-valued spherically invariant Gaussian processes we have:

E[xk x^T_k Pk−1 xk x^T_k] = 2 E[xk x^T_k] Pk−1 E[xk x^T_k] + trace[Pk−1 E[xk x^T_k]] E[xk x^T_k]
                         = 2 Rxx Pk−1 Rxx + trace[Pk−1 Rxx] Rxx.

Hint 2: The same statements are also true for the larger class of spherically invariant
complex-valued processes.
Knowing the matrix B we find sufficient conditions for convergence in the mean square
sense.
Theorem 3.1 The LMS algorithm is convergent in the mean square sense if it satisfies
the given assumptions and the condition:

0 < µ < 2 / (2λmax + trace[Λ]). (3.21)
Proof: Convergence of Equation (3.19) is given if the eigenvalues of matrix B are smaller
than one in magnitude. Note that B is positive definite, that is, all eigenvalues are positive.
A sufficient condition for convergence is thus that the largest eigenvalue is smaller than
one. The largest eigenvalue is given by the 2-induced norm. It can further be bounded by
the 1-induced norm, that is,

λmax = ‖B‖2,ind ≤ ‖B‖1,ind.

We can take an arbitrary row of B:
in which the eigenvalues λl of matrix B are weighted. In this form we assumed that
all eigenvalues are different. The weighting factors depend also on the initial values
of the parameter error vector. It can thus happen that particular eigenvalues do not
appear at all. If we consider the worst case, then the largest eigenvalue of B will dominate
the convergence speed. By Equation (3.18) we can describe the temporal evolution of
the adaptation process. If we, on the other hand, consider a single realization, the process
can look very different. The reason for this is that (3.18) describes the learning in the
mean. Only if we ensemble-average many realizations will we find a good agreement with
the theoretical prediction. The ensemble-averaged adaptation curves are called learning
curves. Figure 3.1 displays learning curves of the relative system mismatch for various
step-sizes.
[Figure: theory, steady-state limit, and simulation curves for µ = µOPT/2, µ = µOPT, and µ = 0.8 µg; vertical axis: relative system distance in dB, horizontal axis: iterations.]
Figure 3.1: Learning curves: relative system distance for various step-sizes.
Besides the convergence speed, the remaining parameter error vector, also called the
mismatch, is of interest. This steady-state value is theoretically reached for k → ∞. Let
us consider again Equation (3.18). For k → ∞, the mismatch is given by
Even more interesting than the mismatch is the distorted a-priori error

ẽa(k) := d(k) − ŵ^T_{k−1} xk (3.27)

for k → ∞. We obtain:

E[|ẽa(k)|²] = E[|d(k) − ŵ^T_{k−1} xk|²] (3.28)
            = E[|v(k) + w̃^T_{k−1} xk|²] (3.29)
            = σv² + E[|w̃^T_{k−1} xk|²] (3.30)
            = σv² + E[w̃^T_{k−1} xk x^H_k w̃*_{k−1}] (3.31)
            = σv² + E[w̃^T_{k−1} R*xx w̃*_{k−1}] (3.32)
            = σv² + trace{Pk−1 R*xx} (3.33)
            = σv² + λ^T ck−1. (3.34)
Here, we can recognize the part of the LMS approximation. The Wiener solution only
has the noise part σv², while the LMS algorithm produces an additional error term gex called
the excess mean square error. By applying the matrix inversion lemma we can also give a
simple expression for this error:

gex = λ^T c∞ = µ σv² · [Σ_{l=1}^{M} λl/(2 − 2µλl)] / [1 − µ Σ_{l=1}^{M} λl/(2 − 2µλl)]. (3.35)
A further parameter that is often used is the so-called misadjustment (Ger.: Fehlanpassung).
It is the excess mean square error relative to the Wiener solution:

mLMS = gex/go = µ · [Σ_{l=1}^{M} λl/(2 − 2µλl)] / [1 − µ Σ_{l=1}^{M} λl/(2 − 2µλl)]. (3.36)
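Formula (3.36) is easy to evaluate numerically. The following sketch (function name ours) computes the misadjustment from the eigenvalues λl of Rxx and the step-size µ; for small µ it reduces to the familiar approximation m ≈ µ Σ λl / 2:

```python
import numpy as np

def misadjustment(eigenvalues, mu):
    """Misadjustment m_LMS = g_ex / g_o according to (3.36):
    mu * S / (1 - mu * S), with S = sum_l lambda_l / (2 - 2 mu lambda_l)."""
    lam = np.asarray(eigenvalues, dtype=float)
    S = np.sum(lam / (2.0 - 2.0 * mu * lam))
    return mu * S / (1.0 - mu * S)

# white input with unit power and M = 8 taps: all eigenvalues equal 1,
# so for mu = 0.01 the small-step approximation gives m ~ mu * M / 2 = 0.04
m = misadjustment([1.0] * 8, mu=0.01)
```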
Figure 3.2: Learning curves: mean squared a-priori error for various step-sizes.
Due to the independence assumption we can interpret the convergence in the mean square
sense as if we had isolated terms (I − µx*l x^T_l), although in reality the entire product in
(3.37) is of importance.
Due to the independence assumption we find the convergence condition in the mean square
sense by applying the expectation on both sides:
E[|w̃(k)|²] = E[ ∏_{l=1}^{k} (1 − µ|x(l)|²)² ] E[|w̃(0)|²] (3.42)
           = ∏_{l=1}^{k} E[(1 − µ|x(l)|²)²] · E[|w̃(0)|²]. (3.43)

It is thus sufficient to consider the term E[(1 − µ|x(l)|²)²] to guarantee convergence in the
mean square sense. For this the following needs to be true:

E[(1 − µ|x(l)|²)²] < 1 (3.44)

ln |w̃(k)|² = Σ_{l=1}^{k} ln (1 − µ|x(l)|²)² + ln |w̃(0)|². (3.46)
l=1
Dividing this term by the number of iterations k and letting this number grow, we obtain
for ergodic random processes x(k):

lim_{k→∞} ln(|w̃(k)|²)/k = E[ ln (1 − µ|x(l)|²)² ]. (3.47)

Remark: This can also be argued by the law of large numbers, as the elements x(k) are i.i.d.
with bounded variance.
The condition for convergence is now, after we made use of the logarithm:

E[ ln (1 − µ|x(l)|²)² ] < 0. (3.48)

If we compare this with our previous condition on the convergence in the mean square
sense, we can formulate the latter equivalently as:

ln E[ (1 − µ|x(l)|²)² ] < 0. (3.49)
In Figure 3.3 both functions are plotted for the case of a uniform distribution of x(k) in
the range [−1, +1]. The convergence condition in the mean square sense delivers a stability
bound of µg = 10/3 = 3.33, while the new condition allows step-sizes up to µg = 6.1. As the
new condition was found via a stochastic limit, we call it almost sure convergence or
convergence with probability one¹.
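The two conditions can be checked numerically for the uniform example. Below is a sketch using a sample-average approximation of the expectations; the exact mean-square bound follows from E[x²] = 1/3 and E[x⁴] = 1/5:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 2_000_000)   # uniform excitation as in the example

def ms_moment(mu):
    """E[(1 - mu x^2)^2]; the mean-square condition (3.44) requires this < 1.
    For x ~ U[-1,1] it equals 1 - 2 mu/3 + mu^2/5, giving the bound mu_g = 10/3."""
    return float(np.mean((1.0 - mu * x ** 2) ** 2))

def as_moment(mu):
    """E[ln(1 - mu x^2)^2]; the almost-sure condition (3.48) requires this < 0."""
    return float(np.mean(np.log((1.0 - mu * x ** 2) ** 2)))
```

Evaluating both moments on a grid of µ values reproduces the two crossing points of Figure 3.3: the mean-square moment crosses 1 at µ = 10/3, while the logarithmic moment stays negative well beyond that.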
[Figure 3.3: E[log(u)] and log(E[u]) plotted over the range 0 ≤ u ≤ 7.]
Exercise 3.5 Compute the step-size so that for a white random process the largest
eigenvalue of B becomes minimal. How fast does the algorithm converge as a function of
the filter length M?
¹ See also Appendix B for more details.
Exercise 3.7 Provide the adaptation for a parameter error vector w̃k of length M = 1
and draw a signal flow graph for it. Which conditions on the step-size µ and the pdf of the
driving process are required to obtain stability?
Exercise 3.8 Provide the adaptation for the parameter error vector w̃k of length M = 1.
Assume the driving process to be bipolar noise of zero mean with σx² = 1. Compute the pdf
of the error vector. Now compute the mean and the covariance matrix of the parameter error
vector for arbitrary length M.
Exercise 3.9 Substitute the a-priori error ẽa (k) = d(k) − xTk ŵk−1 by the a-posteriori
error ẽp (k) = d(k) − xTk ŵk . Reformulate the gradient method so that only a-priori error
terms occur.
Exercise 3.10 Consider the matrix equation

K = XKX + R

with K, X, R Hermitian matrices. Given X and R, solve the equation by using Kronecker
properties and vectorization of K.
Apply the same method to solve the recursive equation of the parameter covariance
matrix Kk in the LMS algorithm. What step-size condition for stability in the mean square
sense can be derived?
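A sketch of the vectorization mechanics for the first part (assuming the relevant inverse exists; it rests on the identity vec(AKB) = (Bᵀ ⊗ A) vec(K) with column-major vec):

```python
import numpy as np

def solve_kxk(X, R):
    """Solve K = X K X + R by vectorization:
    vec(X K X) = (X^T kron X) vec(K), hence (I - X^T kron X) vec(K) = vec(R).
    Sketch; assumes I - X^T kron X is invertible, e.g. when all eigenvalues
    of X have magnitude smaller than one."""
    n = X.shape[0]
    A = np.eye(n * n) - np.kron(X.T, X)
    vecK = np.linalg.solve(A, R.flatten(order='F'))   # column-major vec(.)
    return vecK.reshape((n, n), order='F')

X = np.array([[0.5, 0.1],
              [0.1, 0.3]])     # Hermitian (here real symmetric), eigenvalues < 1
R = np.eye(2)
K = solve_kxk(X, R)
```

The solution can be verified by substituting K back into the original equation.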
Exercise 3.11 Show that the stability limit for convergence with probability one after
Example 3.4 is indeed µg = 6.1.
Matlab Experiment 3.1 Write a Matlab Program for parameter identification. The
driving process is a real-valued, zero-mean Gaussian process with σx2 = 1. Let the unknown
system have M = 32 coefficients all different from zero. Run the LMS adaptation of a
transversal filter for various step-sizes and plot the relative system mismatch as well as the
a-priori error energy over time. Compare with theoretical results on fastest convergence
and stability limit.
In a second experiment, realize 50 independent runs for each of the step-sizes and plot the
ensemble-averaged value. Discuss the differences.
Matlab Experiment 3.2 Rerun the previous experiment, however with a colored driving
process, obtained by filtering the white process with the filter

F(z) = √(1 − b²) / (1 − b z⁻¹), b = −0.7.

Compute the autocorrelation function and provide the acf matrix for M = 32. Repeat the
previous experiment and compare to the theoretical values. Discuss the results.
Repeat the experiments with an NLMS algorithm. What is different now?
x(k) = A exp(−jΩo k)
with A ∈ C, a complex-valued amplitude. Thus we have for the vector of the driving
process:
for the vector ŵk−1 . Staying with the reference model d(k) = v(k) + wTo xk , we now obtain:
The equation has obviously simplified, and temporal changes are now only visible in the
parameter error vector w̃k and in the noise ṽ(k). If we first consider only the homogeneous
difference equation, we recognize that only one directional component of the vector
is changing. The error vector at time instant k can be decomposed into two components:
one parallel to x and one orthogonal to it, thus
with x^T z* = 0. The homogeneous part in Equation (3.53) only changes γ(k − 1),
while z remains unchanged. We can thus write for the homogeneous part alone:
The component γ(k − 1) is thus multiplied by the factor (1 − µ|A|² ‖x‖₂²). With this we can
formulate the convergence condition immediately:
0 < µ < 2 / (|A|² ‖x‖₂²). (3.55)

Under this condition the reduction of γ(k − 1) continues until it reaches zero (asymptotically).
This can also be obtained within a single adaptation step, if

µ = 1 / (|A|² ‖x‖₂²).
The steady-state γ(k) = 0 is equivalent to the complete suppression of the signal at
frequency Ωo. We call this a signal adaptation, rather than a system adaptation,
in which the entire error vector tends to zero.
Let us continue with the inhomogeneous Equation (3.53). The disturbance ṽ(k) causes
the component γ(k) not to remain constant at zero. As the excitation is only in
the direction x, no component of z is changed by the inhomogeneous equation. In other
words, the inhomogeneous equation changes only γ(k). We thus obtain:
If we consider the energy of the terms and utilize the statistical independence of ṽ(k) and
γ(k − 1), we obtain
Let us further assume that the initial estimate is zero; then we recognize that after k steps
we have a linear combination of the vectors x*l, l = 1 . . . k. Let us combine them in a matrix
Xk and the weighting terms µẽa(l) in a vector ẽk; then we can formulate the adaptation
(without the initial value ŵ0) as follows:

ŵk = Xk ẽk.

Obviously, the columns of Xk span a space. If the space is of dimension M, thus of the length
of the vector wo, then the algorithm can select the weights ẽk so that ŵk approximates
the optimal value wo. If the dimension of the spanned vector space is smaller than M, the
resulting estimator cannot approach the solution. We have thus found a necessary condition
for system adaptation:
Lemma 3.1 In order to achieve a system adaptation (and not only a signal adaptation)
with the LMS algorithm, the following condition needs to be satisfied additionally (Ger.:
hartnäckige Anregung, Engl: persistent excitation)
Let us consider an LMS algorithm excited by a cosine. The i-th entry of the vector xk is

x(k − i) = B cos(Ωo(k − i)) = (B/2) [exp(jΩo(k − i)) + exp(−jΩo(k − i))], (3.60)

with a real-valued amplitude B ∈ R. Thus the i-th entry of the parameter error vector
reads:

w̃i(k) = w̃i(k − 1) − (µB/2) [exp(jΩo(k − i)) + exp(−jΩo(k − i))] ẽa(k). (3.61)

Neglecting the initial conditions and applying the Z-transform we obtain:

W̃i(z) = −(z/(z − 1)) (µB/2) [Ẽa(z exp(−jΩo)) exp(−jiΩo) + Ẽa(z exp(jΩo)) exp(jiΩo)]. (3.62)
The undistorted error ea(k) can also be described as a linear combination of xk and w̃k−1.
This leads in the Z-domain to:

Ea(z) = z⁻¹ (B/2) Σ_{i=1}^{M} [W̃i(z exp(−jΩo)) exp(jΩo(1 − i)) + W̃i(z exp(jΩo)) exp(−jΩo(1 − i))]. (3.63)

If we neglect the terms at 2Ωo, we obtain after substitution:

Ea(z) = −(µB²M/4) Ẽa(z) [1/(z exp(−jΩo) − 1) + 1/(z exp(jΩo) − 1)] (3.64)
      = (µB²M/2) · (1 − z cos(Ωo)) / (z² − 2z cos(Ωo) + 1) · Ẽa(z). (3.65)
As the distorted a-priori error comprises the undistorted version and the noise, Ẽa(z) =
Ea(z) + V(z), we obtain the following Z-transfer function:

Ea(z)/V(z) = (µ/µ̄) [1 − z cos(Ωo)] / (z² − 2z cos(Ωo)(1 − µ/(2µ̄)) + 1 − µ/µ̄). (3.66)
[Figure: signal flow graph of (3.66) with the allpass (z⁻¹ − cos(Ωo))/(z − cos(Ωo)) in the feedforward path and the lossy feedback gain 1 − µ/µ̄.]
Figure 3.4: LMS Algorithm under sinusoidal excitation as allpass in the feedforward path
and lossy feedback.
Exercise 3.13 Applying the same approximations as in (3.66), compute the expression
D(z)/Ẽa(z), which is often applied in active noise control.
If a scaling with respect to the input energy is desired, it can be realized with little effort.
For the NLMS algorithm, for example, the computation of the input power can be performed
recursively. Note that this only works reliably in fixed-point arithmetic; in floating-point
arithmetic such a recursion can accumulate cut-off and rounding errors. In this case a block
operation is useful: the recursion is implemented over a length of M, but at the same time
a new block sum is computed in parallel, and the correct results are taken over at block
boundaries to avoid a growing round-off effect.
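The recursion and its block-wise refresh can be sketched as follows (a NumPy illustration of the scheme described above; names are ours):

```python
import numpy as np

def recursive_power(x, M, refresh=None):
    """Sliding-window input power p(k) = ||x_k||^2 over the last M samples,
    updated recursively as p(k) = p(k-1) + |x(k)|^2 - |x(k-M)|^2.
    Every `refresh` samples a fresh block sum replaces the recursion,
    limiting round-off accumulation in floating point."""
    p = np.zeros(len(x))
    p[M - 1] = np.sum(np.abs(x[:M]) ** 2)
    for k in range(M, len(x)):
        if refresh is not None and k % refresh == 0:
            p[k] = np.sum(np.abs(x[k - M + 1:k + 1]) ** 2)   # fresh block sum
        else:
            p[k] = p[k - 1] + np.abs(x[k]) ** 2 - np.abs(x[k - M]) ** 2
    return p

rng = np.random.default_rng(3)
x = rng.standard_normal(200)
p = recursive_power(x, M=8, refresh=64)
```

Each recursive step costs only two MACs instead of M, which is what makes the normalization affordable in the NLMS update.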
At very high processing speeds even the low complexity of 2M MAC operations may still be
too high. Three variants with reduced complexity are in use, with the drawback of less
precision. The sign operation is applied to every element of the vector individually; for
complex-valued numbers it works independently on real and imaginary parts. Of course,
the algorithms change their behavior due to such brute-force modifications.
Note that next to complexity, the data rate can also become a problem. If the LMS
algorithm is operated in two steps (first the error computation, then the updates), then
for each x a value from ŵ needs to be loaded to compute the error. Then, after the error
is computed, again all values of x and ŵ need to be loaded, and finally ŵ stored. Even if two
parallel data buses are available, one for x and one for ŵ, the LMS algorithm can then not be
computed in 2M steps. In order to achieve the complexity of 2M steps, we have to include
some further tricks. As in transversal filters, the values in xk+1 are obtained by shifting all
elements of xk by one position and introducing a single new value. Taking advantage of
this property, we can start computing the update error for k + 1 already at time k. With this
we only require two load and one store operation per step.
Most research effort went into developing algorithms that learn faster. As we have
already seen, a correlated input process causes slow learning; a natural way to speed
up an algorithm is therefore to decorrelate its input first. For example, it is possible either
to know the correlation matrix Rxx of the input process beforehand, or to estimate it, and
then apply the following matrix step-size:

Newton-LMS: ŵk = ŵk−1 + µ ẽa(k) (R*xx)^{-1} x*k.
A better idea is shown in Figure 3.5 (after Schultheiss [18]). Here, a filter F is included
in such a way that, on the one hand, the input signal is decorrelated and, on the other hand,
the estimation problem remains unchanged. If the input process is speech, we can take
advantage of its short-time stationarity and compute a new optimal filter F every 10-20 ms.
The switch from the old to the new filter needs to be done cleverly so that no click is audible.
The filter length of F is typically 10-20 coefficients, thus very short compared to the
length of the adaptive filter w.
In the area of speech processing, in which typically many operations per sample are computed on a DSP, block operations are often applied. In this case not every sample is processed individually but a block of, say, 20 or 50 samples. In general this leads to a gain in complexity. For example, the error computation, which is a filtering process, can be computed by an FFT. A successful block-processing approach, suitable for echo compensation in hands-free telephones as well as in long-distance calls, are the so-called polyphase filter banks. They split the entire frequency band of interest into small bands, so-called sub-bands. Processing in sub-bands has the advantage that, due to the smaller bandwidth, one can operate at a lower sampling rate, and that the signals in the sub-bands are roughly white. Thus, next to the complexity reduction, there is also an increase in learning rate due to the decorrelation. The drawback of such filter banks is the additional delay, as block processing causes a delay according to its block length. Depending on the filter bank design it can amount to several tens of milliseconds.
Univ.Prof. DI Dr.-Ing. Markus Rupp 57
[Figure 3.5: Block diagram of the prefiltering structure after Schultheiss. The input x(k) drives the unknown system w, whose output plus the noise v(k) forms y(k); the decorrelation filter F is applied both in the input path to the adaptive filter ŵ and in the path of the error signal e(k) = y(k) − ŷ(k), so that the estimation problem remains unchanged.]
A further problem in hands-free telephony is double-talk detection (Ger.: Gegensprecherkennung). Presume that after successful adaptation the local speaker becomes active and thus the microphone signal increases dramatically. The DSP has to decide whether this is because the local speaker became active, or because the system has changed and the adaptation needs to continue in order to track such a system change. Both situations can also appear simultaneously. Modern algorithms thus have a so-called step-size control unit that tries to make the right decision and sets the step-size accordingly.
If block processing is not allowed (due to delay constraints), other means are required. One possibility is to extend the scalar step-size to a matrix, similar to the Newton-LMS. It does not need to be the inverse of the acf matrix; a diagonal matrix with well-chosen diagonal elements can also be advantageous. The diagonal elements can be chosen proportional to the expected weights, or selected adaptively depending on the estimated weights:
M_{k|k} = diag[ (ŵ_{k−1})_1, (ŵ_{k−1})_2, . . . , (ŵ_{k−1})_M ]   (3.75)

M_k = µ M_{k|k} / trace[M_{k|k}]   (3.76)

ŵ_k = ŵ_{k−1} + M_k ẽ_a(k) x_k^∗.   (3.77)
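The update (3.75)-(3.77) can be sketched as follows; the floor value `delta` is an assumption added here (in the spirit of PNLMS-type rules) so that the trace does not vanish at the all-zero start:

```python
import numpy as np

def proportionate_lms(x, d, M, mu, delta=0.01):
    """LMS with the diagonal matrix step-size (3.75)-(3.77): each diagonal
    element is proportional to the magnitude of the current weight estimate.
    The floor `delta` is an assumption added here (in the spirit of
    PNLMS-type rules) so that the trace cannot vanish at start-up."""
    w = np.zeros(M)
    for k in range(M - 1, len(x)):
        xk = x[k - M + 1:k + 1][::-1]
        e = d[k] - xk @ w
        g = np.maximum(np.abs(w), delta)   # diagonal of M_{k|k}, cf. (3.75)
        Mk = mu * g / g.sum()              # normalization by the trace, (3.76)
        w = w + Mk * e * xk                # diagonal matrix step-size, (3.77)
    return w
```

For a sparse system, the large taps receive most of the step-size budget, which is exactly why such rules speed up the learning of the dominant coefficients.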
Exercise 3.14 Formulate the LMS algorithm for a transversal filter so that it requires
only two load and one store operation.
Exercise 3.15 Show that the Newton-LMS algorithm under correlated excitation behaves like an LMS algorithm under white excitation.
Exercise 3.16 How does the filter structure in Figure 3.5 alter the distortion? Is it still a
system identification?
Matlab Experiment 3.3 Extend the algorithm from Experiment 3.2 by adding prefiltering after Schultheiss and compare the results. Is the learning rate the same as if white excitation were applied? Also implement the Newton-LMS algorithm and compare.
3.5 Literature
Good overviews on adaptive filters and the LMS algorithm can be found in [29, 80, 72, 42]. The original idea of the LMS algorithm goes back to Widrow and Hoff [78], although gradient-type algorithms in similar form can be found in older literature. The independence assumptions were introduced by [46]; in [10] a very complex method is derived to make exact predictions without the independence assumption, however the results are not in analytical form. The derivation of the parameter error vector shown here is based on [31, 17], although older work [75] was already moving along such paths. Extensions to the NLMS algorithm can be found in [3, 51] and to spherically invariant processes in [53]. A first analysis of the Sign-Error algorithm can be found in [8]. A good explanation of convergence with probability one is in [70]. A deeper understanding is provided in [72]. Polyphase filter banks for hands-free telephony were introduced by Kellermann [36] and for equalizers in [47]. The excitation with sinusoidal signals was introduced by [21, 9], and the feedback structure appeared in this form for the first time in [58]. The PNLMS can be found in [13, 20, 65].
Chapter 4
Next to the LMS algorithm, the Recursive Least Squares (RLS) algorithm is the most prominent one. The problems that come with RLS often exclude it from practical applications. As the RLS algorithm is just a recursive implementation of the LS problem, its properties are identical to those of a classic LS solution. We will therefore start with a brief introduction to the problem of least squares.
d_N = X_N w_o + v_N.   (4.1)
Here we wrote N observations d(k) = wTo xk + v(k), k = 1..N in vectors and matrices:
Note that the solution depends on the number of observations N. If the value of N is not fixed but grows with time, the problem is called LS with a growing window. The solution is given by
wLS,N = wN = arg min kdN − XN ŵk22 . (4.7)
ŵ
As we only discuss LS estimation in this chapter, we will leave out the index 'LS'. We keep, however, the index N, as it not only indicates how many observations we are using but, for a growing window, also denotes the time. If not indicated otherwise, we will assume in the following that
• The system of equations is underdetermined, that is M > N: we have fewer observations than parameters to estimate. In this case we assume rank(X_N) = N.
Differentiating the quadratic form (4.7) with respect to the unknown vector leads to the following orthogonality condition:
∂‖d_N − X_N ŵ‖_2^2 / ∂ŵ = −(d_N − X_N ŵ)^H X_N = 0.   (4.9)

∂²‖d_N − X_N ŵ‖_2^2 / ∂ŵ² = X_N^H X_N > 0.   (4.11)
The form X_N^H X_N > 0 indicates that the matrix X_N^H X_N is positive definite. The minimum cost function can be computed without explicitly knowing the LS solution:
g_LS(ŵ_LS,N) = ‖d_N‖_2^2 − d_N^H X_N (X_N^H X_N)^{−1} X_N^H d_N.   (4.13)
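This identity is easy to check numerically, assuming the overdetermined case in which X_N^H X_N is invertible; a real-valued NumPy sketch with randomly chosen data:

```python
import numpy as np

# Overdetermined example: more observations than parameters, X^T X invertible.
rng = np.random.default_rng(3)
N, M = 20, 4
X = rng.standard_normal((N, M))
d = rng.standard_normal(N)

# LS solution w = (X^H X)^{-1} X^H d and its residual cost ||d - X w||^2
w = np.linalg.solve(X.T @ X, X.T @ d)
cost_at_solution = np.linalg.norm(d - X @ w) ** 2

# Closed form (4.13): ||d||^2 - d^H X (X^H X)^{-1} X^H d
cost_closed_form = d @ d - d @ X @ np.linalg.solve(X.T @ X, X.T @ d)

assert np.isclose(cost_at_solution, cost_closed_form)
```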
Compare this equation with the corresponding equations of the steepest-descent method. Except for the expectation values they appear formally identical. The various terms can also be related:

g_LS(ŵ_N) = g_LS(ŵ_LS,N) + (ŵ_LS,N − ŵ_N)^H X_N^H X_N (ŵ_LS,N − ŵ_N).   (4.14)
Both equations, (4.10) and (4.12), can be combined into the so-called normal equations:

[ d_N^H d_N    d_N^H X_N ] [  1   ]   [ g_LS(ŵ_N) ]
[ X_N^H d_N    X_N^H X_N ] [ −ŵ_N ] = [     0     ].   (4.15)
4.1.1 Existence
Before we continue with closed form solutions, we have to consider existence and
uniqueness of the LS solution.
Lemma 4.1 If X_N is of full rank M with N ≥ M, then the solution is unique and given by ŵ_LS,N = [X_N^H X_N]^{−1} X_N^H d_N.
If X_N is not of full rank, then the normal equations have more than one solution, of which any two solutions ŵ_1 and ŵ_2 differ by a vector in the nullspace of X_N, thus X_N[ŵ_2 − ŵ_1] = 0.
z = X_N^H X_N p = X_N^H q.

The columns of X_N^H X_N span a space (the column range) to which the vector z belongs. If there is a non-zero vector p for which X_N^H X_N p = 0, then such a vector p belongs to the null space of X_N^H X_N. Assuming that X_N is of full rank M, the nullspace contains only the zero vector. As X_N and X_N^H X_N have the same nullspace (see also Appendix F), there cannot exist a non-zero vector for which X_N p = 0. Thus, with the choice of the vector q the vector p is uniquely defined, and vice versa.
In the second part we assume that X_N is not of full rank M. Then X_N^H X_N is also not of full rank, and non-zero solutions p exist in the null space of X_N^H X_N, to which correspond the same solutions in the nullspace of X_N (as it is the same nullspace). The choice of q is now not uniquely defined by p.
The matrix X_N X_N^H spans a smaller space and thus the nullspace becomes larger. A multitude of solutions exists, among which those with minimum norm are of most interest. With the Singular Value Decomposition (SVD) it can be shown [29] that the solution (4.16) has minimum norm.
4.1.2 LS Estimation
Lemma 4.2 Consider the following linear model
dN = XN wo + vN . (4.17)
• E[ŵ_LS,N] = w_o,
• the LS estimator is the best linear unbiased estimator (BLUE).
The first expectation over the noise results in the unit matrix and finally we find the desired
result.
The third property is shown by assuming an arbitrary linear estimator B, thus
w̄ = BdN . (4.18)
w̄ = wo + BvN . (4.19)
Lemma 4.3 A set of input vectors {x_k, k > 0} is persistently exciting if positive numbers α, β, N_o exist so that

α I < Σ_{k=n}^{n+N_o} x_k x_k^H ≤ β I,   for all n.   (4.23)
and Q ∈ IR^{N×N} to be symmetric and positive definite. We then obtain the following cost function:
The choice of the window length N determines the solution. If N < M, we have an underdetermined problem and we need to take solution (4.16). Alternatively, such a solution is useful if the algorithm is not excited persistently. This underdetermined algorithm is also known in the literature under the name Affine Projection Algorithm (APA). For the special case N = 1 we obtain the NLMS algorithm with step-size µ(k) = 1/‖x_k‖_2^2.
4.1.5 Summary
Finally, we would like to compare the RLS algorithm in Table 4.1 with respect to a stochastic and a deterministic view. In order to achieve this we extended (4.17) in such a way that the estimated system is considered a random variable. Table 4.1 reveals a few unexpected analogies.
stochastic                                              deterministic
d = Xw + v                                              d = Xw + v
m_w = E[w]                                              w_o = w̄
E[(w − m_w)(w − m_w)^H] = R_ww                          Π_o
m_v = E[v]                                              v_o
E[(v − m_v)(v − m_v)^H] = R_vv                          Q^{−1}
m_d = X m_w + m_v                                       d_o = X w_o + v_o
ŵ                                                       ŵ
min_K ‖w − m_w − K(d − m_d)‖_2^2                        min_w (w − w_o)^H Π_o^{−1} (w − w_o) + ‖d − Xw − v_o‖_Q^2
K_o = R_ww X^H [X R_ww X^H + R_vv]^{−1}                 K_o = Π_o X^H [X Π_o X^H + Q^{−1}]^{−1}
K_o = [R_ww^{−1} + X^H R_vv^{−1} X]^{−1} X^H R_vv^{−1}  K_o = [Π_o^{−1} + X^H Q X]^{−1} X^H Q
ŵ = K_o [d − X m_w − m_v]                               ŵ = K_o [d − X w_o − v_o]

Table 4.1: Comparison of terms of the LS algorithm in stochastic and deterministic descriptions.
Exercise 4.3 Based on the statement of Lemma 4.1, show that the LS algorithm is also capable of linear prediction. For this consider the autoregressive random process of order P

x(k) = Σ_{i=1}^{P} x(k − i) a(i) + v(k)   (4.28)

and estimate its coefficients a(i) by LS. Show the estimator's properties by formulating the random process in vector notation of length M > P. Show that for such vectors v̂_k = x_k − X_k â we have

[ x_k^T ; X_k^T ] v̂_k = ‖v̂_k‖_2^2 [ 1 ; 0 ].   (4.29)
Exercise 4.4 To transmit data over a wireless channel, a constant modulus signal (|x(k)| =
1) is employed. The best channel estimation can be achieved if trace([XNH XN ]−1 ) becomes
minimal. By which property(ies) of the transmitted training signal can this be achieved?
Exercise 4.5 Minimize the following cost function with constraint
kŵk − ŵk−1 k22 + λkdk − XP (k)ŵk k22 , (4.30)
for optimal ŵk . Let XP (k) be a matrix with instantaneous value xk and past values
xk−1 ...xk−P +1 .
Consider alternatively the formulation
kŵk − ŵk−1 k22 + λT [dk − XP (k)ŵk ] + λH [dk − XP (k)ŵk ]∗ (4.31)
With the help of the Matrix Inversion Lemma 2.4 we can also write this recursion in its inverted form and obtain:

P_{N+1} = P_N − (P_N x_{N+1}^∗ x_{N+1}^T P_N) / (1 + x_{N+1}^T P_N x_{N+1}^∗),   P_0 = Π_o.   (4.38)
We now recognize the meaning of our initial certainty parameter Π_o. This recursion does not require a matrix inversion; that is, instead of O(M³) we only require O(M²) operations.
Substituting the recursive form into Eqn. (4.35) results in the following:

ŵ_{N+1} = P_{N+1} [X_N^H d_N + x_{N+1}^∗ d(N + 1)]   (4.39)
        = [ P_N − (P_N x_{N+1}^∗ x_{N+1}^T P_N)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ] [X_N^H d_N + x_{N+1}^∗ d(N + 1)]   (4.40)
        = ŵ_N − (P_N x_{N+1}^∗ x_{N+1}^T)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ŵ_N + P_N x_{N+1}^∗ [ 1 − (x_{N+1}^T P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ] d(N + 1)
        = ŵ_N + (P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) [ d(N + 1) − x_{N+1}^T ŵ_N ].   (4.41)

Here we used P_N X_N^H d_N = ŵ_N and 1 − (x_{N+1}^T P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) = 1/(1 + x_{N+1}^T P_N x_{N+1}^∗).
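The recursions (4.38) and (4.41) together give the growing-window RLS; a real-valued sketch (the variable names are illustrative):

```python
import numpy as np

def rls_growing_window(X, d, Pi0):
    """Growing-window RLS via recursions (4.38) and (4.41) (real-valued).
    X holds one regression vector per row; Pi0 is the initial 'certainty'
    matrix P_0 = Pi_o."""
    P = Pi0.copy()
    w = np.zeros(X.shape[1])
    for xk, dk in zip(X, d):
        denom = 1.0 + xk @ P @ xk                  # 1 + x^T P x = 1/gamma
        k_vec = P @ xk / denom                     # gain vector, cf. (4.42)
        w = w + k_vec * (dk - xk @ w)              # update (4.41)
        P = P - np.outer(P @ xk, xk @ P) / denom   # recursion (4.38)
    return w, P
```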
This description is not so different from that of the LMS algorithm. The essential difference is the new regression vector, and thus a different update direction than before. In the RLS algorithm this direction depends on the previous directions. Let us consider the regression vector
k_{N+1} = (P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗)   (4.42)
        = P_{N+1} x_{N+1}^∗   (4.43)
        = P_N x_{N+1}^∗ γ(N + 1).   (4.44)
In the matrix P_{N+1} all past values of the vectors x_k, k = 1..N are gathered. The scalar term γ(N + 1) is called the conversion factor. We have:
γ(N + 1) = 1/(1 + x_{N+1}^T P_N x_{N+1}^∗).   (4.45)

P_{N+1} = P_N − (k_{N+1} k_{N+1}^H)/γ(N + 1);   P_0 = Π_o.   (4.46)
A further interesting connection exists between the a-priori and the a-posteriori error. By substituting the recursion (4.41) into the definition of the a-priori error, we obtain

ẽ_p(N + 1) = γ(N + 1) ẽ_a(N + 1).

As γ(N + 1) is strictly smaller than one, the a-posteriori error is always smaller than the a-priori error in magnitude.
With the last substitution from (4.49) we also recognize that the conversion factor
γ(N + 1) is real valued.
The index N only indicated that the observation duration stretches over N elements in the past. The step-size µ and the regularization parameter ǫ are both assumed to be positive. A time index k is now required to distinguish the various terms. For the classical RLS algorithm this was not required, as with a growing window N also defined the time.
Note that the inner matrix X_N^H(k) X_N(k) is of dimension M × M. Selecting an observation window N < M results in an underdetermined system. Due to the positive regularization parameter ǫ > 0 the modified matrix ǫI + X_N^H(k) X_N(k) can be inverted.
However, it is not necessary to invert a matrix of the large dimension M × M, as we can apply the matrix inversion lemma:

[ǫI + X_N^H(k) X_N(k)]^{−1} X_N^H(k) = X_N^H(k) [ǫI + X_N(k) X_N^H(k)]^{−1}.   (4.60)
We thus only have to invert a matrix of dimension N × N. The update equation of the ǫ-APA is thus given by:

ŵ_N(k) = ŵ_N(k − 1) + µ X_N^H(k) [ǫI + X_N(k) X_N^H(k)]^{−1} [d_N(k) − X_N(k) ŵ_N(k − 1)].   (4.61)
Two special cases are of particular interest and will be discussed next. The first case is
for N = 1. We then obtain the ǫ−LMS algorithm and recognize that it can be interpreted
as a special case of the ǫ-APA with observation window length one.
The second special case is given for µ = 1 and ǫ = 0. If the matrix X_N(k) X_N^H(k) is of full rank for every k, it can be inverted and the algorithm will converge. The dependence of the a-posteriori errors on the a-priori errors is now of interest. Collecting N values in a vector we find:
ẽ_a(k) ≜ d_N(k) − X_N(k) ŵ_N(k − 1),   ẽ_p(k) ≜ d_N(k) − X_N(k) ŵ_N(k).   (4.62)
Substituting this into the update equation, we obtain the desired relation:

ẽ_p(k) = [ I − X_N^H(k) (X_N(k) X_N^H(k))^{−1} X_N(k) ] ẽ_a(k).   (4.63)
The matrix I − X_N^H(k)(X_N(k) X_N^H(k))^{−1} X_N(k) is a so-called projection matrix. Consider a vector z comprising two orthogonal components: a linear combination of the columns of X_N^H(k) and a vector orthogonal to the subspace spanned by X_N^H(k), thus z = y + X_N^H(k) x. Multiplying such a vector by the projection matrix, we recognize that it becomes smaller by the amount of the linear combination in X_N^H(k); thus x disappears. On the other hand, the orthogonal part y remains unchanged. We can further show that the a-priori error vector ẽ_a(k) is composed only of a linear combination in X_N^H(k), and thus the a-posteriori error must vanish: ẽ_p(k) = 0.
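The two properties of the projection matrix, that the in-span component disappears while the orthogonal component passes unchanged, can be checked numerically (real-valued sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 3, 8                        # underdetermined window: N < M
X = rng.standard_normal((N, M))

# Projection matrix onto the orthogonal complement of the span of X^H
P = np.eye(M) - X.T @ np.linalg.solve(X @ X.T, X)

xcoef = rng.standard_normal(N)
in_span = X.T @ xcoef              # component inside the span of X^H
y = rng.standard_normal(M)
y = y - X.T @ np.linalg.solve(X @ X.T, X @ y)   # component orthogonal to it
z = y + in_span

# The in-span part disappears, the orthogonal part remains unchanged:
assert np.allclose(P @ z, y)
assert np.allclose(P @ in_span, 0.0)
```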
With this property the cost function of the APA can also be described as:
min ‖ŵ_N(k) − ŵ_N(k − 1)‖ subject to the constraint ẽ_p(k) = 0.
There exists an infinite number of vectors ŵ_N(k) that solve ẽ_p(k) = 0 (all with different orthogonal parts y). This set of solutions is denoted an affine subspace (also: hyperplane, manifold) to indicate that the plane defined by the set of solutions does not necessarily pass through ŵ_N(k) = 0. For the special case N = 1 we say that the APA (NLMS algorithm with normalized step-size α = 1) obtains the solution ŵ_N(k) by a projection onto the affine subspace. For N > 1 the solution is the intersection of all these affine subspaces. The APA thus finds its solution by projection onto the intersection of all affine subspaces. Note that such projection properties are lost for the overdetermined case N > M, thus for the RLS algorithm.
Exercise 4.6 Derive the recursive form of the LS algorithm with sliding rectangular
window (4.26).
Exercise 4.7 Derive the recursive form of the LS algorithm with exponential window:

g_LS(ŵ) = Σ_{i=1}^{N} λ^{N−i} |ẽ_a(i)|² = Σ_{i=1}^{N} λ^{N−i} |d(i) − x_i^T ŵ|².   (4.64)
Exercise 4.8 Is it possible to derive for the LMS algorithm a relation like in (4.41)
between a-priori- and a-posteriori error?
Exercise 4.9 Derive the LMS algorithm from the Steepest-Descent algorithm for the fol-
lowing estimates:
R̂_xx = (1/N) Σ_{l=0}^{N−1} x_{l−k}^∗ x_{l−k}^T   (4.65)

r̂_xd^∗ = (1/N) Σ_{l=0}^{N−1} x_{l−k}^∗ d(l − k).   (4.66)
Assume for this derivation the driving process x(k) to be stationary. Under which step-size
condition do you obtain the RLS algorithm?
d N = XN w o + v N . (4.67)
As the LS problem (the same goes for its recursive form) is solved in the form of a set of linear equations of order M, a stationary solution will exist after N = M steps, whose value then only fluctuates with the noise terms. When applying the recursive form, we typically start with zero vectors x_k (initialized by zero entries), and thus for the first M steps the algorithm cannot work properly. We thus require 2M steps from the zero start to the convergence of the algorithm. This is a substantial speed-up compared to the LMS algorithm, even compared to the Newton-LMS algorithm, for which the RLS algorithm can be viewed as an approximation: P^{−1} ≈ R_xx. Similar to the Newton-LMS algorithm, the learning speed is independent of the correlation of the driving sequence.
To compute the steady-state misadjustment, we consider the RLS algorithm with exponentially decaying window, as this is its most common form. Its update is given by:

ŵ_k = ŵ_{k−1} + k_k [d(k) − x_k^T ŵ_{k−1}],   (4.68)

k_k = (λ^{−1} P_{k−1} x_k^∗)/(1 + λ^{−1} x_k^T P_{k−1} x_k^∗),   (4.69)

P_k = λ^{−1} [P_{k−1} − k_k x_k^T P_{k−1}].   (4.70)
Note that we changed the notation slightly: instead of an index N that denoted time as well as growing size before, the index now denotes the time instant k. On the one hand we wish to point out the formal similarity to the LMS algorithm, but on the other hand we also wish to point out that there is no growing window of size N any more.
We consider the additive noise as a random process, by which the estimate ŵ_k now also becomes a random process. From the last line in (4.25) and the reference model we find, for Π^{−1} = 0, Q = Λ and arbitrary σ_v²,
lim_{k→∞} E_v[(w_o − ŵ_k)(w_o − ŵ_k)^H] = lim_{k→∞} [X_{k−1}^H Λ X_{k−1}]^{−1} [X_{k−1}^H Λ² X_{k−1}] [X_{k−1}^H Λ X_{k−1}]^{−1} σ_v².   (4.71)
The diagonal entries of the matrix Λ are Λ_ii = λ^{k−i}. Correspondingly, the driving process x(k) can be viewed as a random process, which allows us to compute the expectations in X_{k−1} (approximately):

E_x{ [X_{k−1}^H Λ X_{k−1}]^{−1} [X_{k−1}^H Λ² X_{k−1}] [X_{k−1}^H Λ X_{k−1}]^{−1} }
   ≈ ( R_xx Σ_{i=1}^{k} λ^{k−i} )^{−1} [ R_xx Σ_{i=1}^{k} λ^{2k−2i} ] ( R_xx Σ_{i=1}^{k} λ^{k−i} )^{−1}.   (4.72)
The steady-state value of the a-priori error is found to be (assuming σ_x² = 1):

lim_{k→∞} E[|ẽ_a(k)|²] = σ_v² + lim_{k→∞} tr{ E[(w_o − ŵ_k)(w_o − ŵ_k)^H] R_xx }   (4.76)
                       = σ_v² ( 1 + M (1 − λ)/(1 + λ) )   (4.77)

and thus the misadjustment

m_LS = M (1 − λ)/(1 + λ).   (4.78)
Note that for real- and complex-valued Gaussian processes more precise expressions exist, as the terms can be modeled as a Wishart process. Typically the expressions provided here are valid for dimensions M ≥ 10.
Matlab Experiment 4.1 Repeat Matlab Experiments 3.1 and 3.2, however with expo-
nentially weighted RLS algorithm. Instead of various step-sizes, apply forgetting factors λ
in the range [0.7..1.0]. Compare experimental results with theoretical predictions.
Exercise 4.10 Compute the exact expression of (4.72) in case the forgetting factor is one. Assume the driving process to be a zero-mean Gaussian process. What is then obtained for misadjustment and mismatch?
the application in sub-bands. As previously described, such methods divide the problem into independent bands of smaller bandwidth. If an FTF is run in such a sub-band, it also becomes unstable. If, however, the re-starting points of the sub-bands are shifted in time relative to each other, the instability in each sub-band will start at a different time. This can be detected, and then the algorithm in each sub-band is restarted from zero. During this phase a simple LMS algorithm can take over the updates [23].
4.5 Literature
Good tutorials on LS and RLS algorithms can be found in [29]. Polyphase implementations
of FTF versions are explained in [23]. Detailed descriptions to various implementations are
in [67]. Details to CORDIC implementations are in [2].
Chapter 5
Until now our assumption was that the system under consideration is time-invariant, thus a fixed w_o. But not all systems are fixed. Due to aging and temperature, the properties of systems alter slowly. Some systems change fast: the loudspeaker-room-microphone system changes rapidly if the speaker moves through the room. Also the wireless channel may change rapidly with a moving receiver or moving scattering objects. Next to the initial learning or transient response there is also a tracking behavior (Ger.: Nachführverhalten). This describes how an adaptive filter reacts if the system is permanently changing. One possibility to describe such behavior in a general form is to assume a rotational change, as described in the following:

d(k) = x_k^T w_o e^{jΩ_o k} + v(k).   (5.1)

The system w_o that is to be estimated now rotates with an unknown frequency Ω_o. A direct application of this formulation is given in wireless channel estimation under frequency offset. As the receiver utilizes a different oscillator than the transmitter, a frequency offset Ω_o occurs. We will recognize that the reaction of the adaptive system to such a rotation is typically of linear nature. This means that the reaction to an arbitrarily changing time-variant system can be treated as the superposition of individual rotational components. It is thus sufficient to analyze the behavior for a single rotation at frequency Ω_o.
w̃_k ≜ w_o e^{jΩ_o k} − ŵ_k.   (5.2)
75
The update equations for the LMS and RLS algorithm can be provided in a unified form:
The vector g_k is simply µ x_k for the LMS algorithm and k_k = P_k x_k in the case of the RLS, thus

g_k = µ x_k (LMS);   g_k = P_k x_k (RLS).   (5.6)
In the next step, we consider the signals v(k) and xk as random processes, thus v(k) and
xk . We can now compute the expectation with respect to the driving source:
Theorem 5.1 The stationary solution for the LMS and RLS algorithms, for a system that changes periodically with frequency Ω_o, is given in the mean by:

E[ŵ_k] = { I − (e^{jΩ_o} − 1) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) } w_o e^{jΩ_o k}.   (5.8)
Proof: As E[w̃k ] is an output of a linear system, we must have a solution of the form:
Thus, the expectation of the parameter error vector also becomes periodically time-variant. For k → ∞ the initial transients disappear and the mean parameter error vector eventually becomes

E[w̃_k] = (1 − e^{−jΩ_o}) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) w_o e^{jΩ_o (k+1)},   (5.11)
We can formulate the result (5.12) for an arbitrarily small frequency range dΩ:

dE[ŵ_k(Ω)] = { I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A) } w_o(Ω) e^{jΩk} dΩ.
This interpretation allows the computation of the algorithmic response to arbitrary system
changes:
E[ŵ_k] = (1/2π) ∫_{−π}^{π} { I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A) } w_o(Ω) e^{jΩk} dΩ.   (5.13)
The kernel of this integral is the Fourier transform of the algorithmic response, also called the Fourier transform G of the Green's function of the LMS/RLS algorithm:

G(Ω) ≜ I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A).
The Green's function in the mean is thus obtained by the inverse Fourier transform:

g(k) = [I − (I − A)^k] u(k) − [I − (I − A)^{k−1}] u(k − 1)
with the unit step u(k). In other words: the algorithmic response to arbitrarily changing systems can be computed by a convolution with the Green's function g(k). Two well-known results can be obtained from this:
• In case of a frequency offset we have w_o(Ω) = w_o δ(Ω − Ω_o) and obtain:

  E[ŵ_k] = { I − (e^{jΩ_o} − 1) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) } w_o e^{jΩ_o k}.

• In the initial phase of the adaptation we have w_o(Ω) = w_o/[1 − e^{−jΩ}] and obtain

  E[ŵ_k] = [ I − (I − A)^{k+1} ] w_o.
Theorem 5.2 Under white excitation the LMS and RLS algorithm show identical tracking
behavior in the mean.
Proof: We obtain A = µI for the LMS algorithm and [1 − λ]I for the RLS. In other words
the choice µ = 1 − λ results in the same tracking behavior.
Matlab Experiment 5.1 A system wTo = [1, 10, 1] excited by a white sequence is changing
periodically with frequency Ωo . Compute the algorithmic response of the LMS and the RLS
algorithm as a function of the frequency. Simulate this in Matlab and verify your result.
Plot the relative parameter error vector for a range of Ω over µ and 1 − λ. How can ŵk be
used to estimate the unknown frequency Ω?
The unknown system w_k varies according to (5.14), driven by the signal u_k. The output of the system is a linear combination of the system and the input X_k, with an additional noise term v_k. In the general case the output values are also vectors. Such systems are often called Multiple-Input Multiple-Output (MIMO) systems.
In order to estimate such a system w_k, the adaptive algorithm should reflect as much of the a-priori knowledge as possible, for example the state-space form. In the recursive update part of the adaptive algorithm, we introduce a prediction component F_k ŵ_k. As we now have errors in vector form, we have to introduce an optimal matrix step-size M_k. The optimal adaptation algorithm is thus given by:
The only open problem is now to find the optimal step-size matrix M_k.
For this we assume that the signals uk and v k are random processes. We will further
assume that
E[v_k v_i^H] = R_vv δ(k − i);   E[u_k u_i^H] = R_uu δ(k − i);   E[v_k u_i^H] = 0.   (5.18)
E[w_0 v_i^H] = 0;   E[w_0 u_i^H] = 0;   E[w_0 w_0^H] = P_0.   (5.19)
In order to ensure a unique solution we also have to assume that R_vv is positive definite, a condition that is in general satisfied. Note that other, more relaxed conditions are possible as well; the solution only becomes more and more difficult to interpret.
then we can also compute the covariance matrix of the a-priori error vector:
E[ẽ_{a,k} ẽ_{a,k}^H] = R_vv + X_k P_{k−1} X_k^H = R_ee.   (5.23)
The optimal step-size matrix can be found by minimizing the recursion of the covariance
matrix with respect to Mk :
Having this available we can now formulate the equation for the parameter covariance
matrix:
P_k = F_k P_{k−1} F_k^H − M̄_k R_ee M̄_k^H + G_k R_uu G_k^H.   (5.28)
Eventually the complete Kalman algorithm is obtained.
w_k = F_k w_{k−1} + G_k u_k   (5.29)
d_k = X_k w_{k−1} + v_k   (5.30)

with the conditions

E{ [u_k; v_k; w_0; 1] [u_i^H, v_i^H, w_0^H, 1] } = diag( R_uu δ(k − i), R_vv δ(k − i), P_0, 1 ).   (5.31)
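A sketch of one Kalman recursion for the model (5.29)-(5.30); the gain expression M̄_k = F_k P_{k−1} X_k^H R_ee^{−1} is written here as an assumption consistent with (5.23) and (5.28), since the intermediate equations are not repeated in this section:

```python
import numpy as np

def kalman_step(w_hat, P, d, X, F, G, Ruu, Rvv):
    """One recursion of the Kalman algorithm for the state model
    (5.29)-(5.30).  The gain M = F P X^H Ree^{-1} is an assumed form,
    consistent with (5.23) and (5.28)."""
    e_a = d - X @ w_hat                       # a-priori error vector
    Ree = Rvv + X @ P @ X.T                   # error covariance, (5.23)
    M_bar = F @ P @ X.T @ np.linalg.inv(Ree)  # gain (assumed form)
    w_hat = F @ w_hat + M_bar @ e_a           # prediction plus correction
    P = F @ P @ F.T - M_bar @ Ree @ M_bar.T + G @ Ruu @ G.T   # (5.28)
    return w_hat, P
```

The matrix inversion of R_ee in the gain is what makes the general complexity O(M³) when the observation dimension grows with M.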
Note that the application of the Kalman algorithm requires the assumption of random signals. This was not the case for the previous algorithms. In the LMS algorithm we needed the statistics only for finding the optimal step-size parameters and for computing its properties. The RLS worked entirely without randomness. We should further mention that the complexity of the Kalman algorithm is O(M³), as we have to invert a matrix. Only in very simple cases can this be avoided. Classically the Kalman algorithm is found in automation control, where the time-variant nature of the system is somewhat known and there is sufficient time to compute the algorithmic equations between two observation samples. In recent years further applications have included satellite communications and adaptive equalizers in wireless systems. Often the time-variant behavior is only partly known or known only approximately. Then, next to the Kalman equations, estimates for the system parameters are also required, for example for the matrix F_k. This is called the extended Kalman filter. In automation control the term LQC (linear quadratic control) is used to describe that the controller is linear and the cost function quadratic. Here the actuating variable {u_k} in (5.29) is to be tuned so that
g_LQC(u_{N+1}) = w_{N+1}^H P_{N+1} w_{N+1} + Σ_{k=1}^{N} u_k^H R_vv u_k + Σ_{k=1}^{N} d_k^H R_dd d_k   (5.35)
becomes minimal. Newer research is found under the name Model Predictive Control. It can be shown that the substitution of F_k by F_k^∗, H_k^∗ by −G_k, and G_k^∗ by H_k results in the Kalman algorithm as solution. These two problems are said to be dual.
Exercise 5.2 For the special case Fk = I, Gk = 0, Rvv = 1 and Xk = xTk derive the
Kalman equations and compare them with the RLS algorithm.
Matlab Experiment 5.2 Repeat Matlab Experiment 5.1 and apply now the Kalman
algorithm. Compare the results with the previous ones.
5.3 Literature
First publications on the tracking behavior of adaptive algorithms are found in [14, 15, 24, 45]. A good introduction to Kalman filters is given in [29] and [34, 67]. The original paper by Kalman is [35]. In [1] applications of the algorithm are described.
Chapter 6
Generalized LS Methods
We have considered several variants of LS solutions: with and without initial estimates,
under and over-determined systems, with and without weighting. When using weighting,
we made sure the weighting matrix was positive definite: Q > 0. We now address the
general question of what form such weighting matrices can have and what consequences
this can have for the solution. In particular we are interested in recursive forms of the
algorithms. To this end, let us start again with the standard LS problem:
d_N = X_N w_o + v_N.   (6.1)
We have collected N observations d(k) = wTo xk + v(k), k = 1..N in vectors and matrices:
g_WLS(ŵ_N) = ŵ_N^H Π_o^{−1} ŵ_N + (d_N − X_N ŵ_N)^H Q (d_N − X_N ŵ_N).   (6.6)

Possible extensions including initial values as in (4.24) are straightforward and are not shown here, to keep the focus on the important aspects. We re-formulate the cost function:

g_WLS(ŵ_N) = [ d_N ; ŵ_N ]^H [ Q , −Q X_N ; −X_N^H Q , Π_o^{−1} + X_N^H Q X_N ] [ d_N ; ŵ_N ]   (6.7)
Obviously the point ŵ_N = w̄_N is special and deserves our interest. We consider three cases:

1. Π_o^{−1} + X_N^H Q X_N > 0: the second term in (6.8) is non-negative for every choice of ŵ_N and reaches zero if and only if ŵ_N = w̄_N. Thus the cost function satisfies

g_WLS(ŵ_N) ≥ d_N^H { Q − Q X_N [Π_o^{−1} + X_N^H Q X_N]^{−1} X_N^H Q } d_N

with equality only if ŵ_N = w̄_N. In this case we have a global minimum with a unique solution.

2. Π_o^{−1} + X_N^H Q X_N < 0: the second term in (6.8) is non-positive for every choice of ŵ_N and reaches zero if and only if ŵ_N = w̄_N. The cost function thus satisfies

g_WLS(ŵ_N) ≤ d_N^H { Q − Q X_N [Π_o^{−1} + X_N^H Q X_N]^{−1} X_N^H Q } d_N

with equality only if ŵ_N = w̄_N. We have a global maximum with a unique solution.
3. Π_o^{−1} + X_N^H Q X_N indefinite: at least one eigenvalue of Π_o^{−1} + X_N^H Q X_N is negative and at least one is positive. Starting at the point ŵ_N = w̄_N, we run uphill in one direction (the eigenvector corresponding to a positive eigenvalue) while we run downhill in another direction (the eigenvector corresponding to a negative eigenvalue). Such a point ŵ_N = w̄_N is called a saddle point.
In each case the point ŵ_N = w̄_N is special. It is thus called a critical or stationary point. We have not yet considered the case of vanishing eigenvalues, but such non-invertible matrices can also be included in our considerations.
This formulation is analogous to the formulation for the RLS algorithm, only that the new components are not vectors but matrices. The vector d̃_N = [d_1(N), d_2(N)], for example, may comprise two components. We again introduce a matrix P_N:

P_N = [Π_o^{−1} + X_N^H Q_N X_N]^{−1};   P_0 = Π_o.   (6.9)
For a solution to exist at each time instant N, it must be guaranteed that P_N > 0. We thus obtain the following recursive algorithm:
Generalized RLS Algorithm: Given a regular matrix Π_o and a regular weighting matrix Q_N with block structure, the solution of the minimization problem

    min_{ŵ_N} g_VLS(ŵ_N)

can be obtained by a recursive algorithm. Start with ŵ_0 = 0 and P_0 = Π_o. Then we have
for k > 0:
    Γ_k = [Q̃_k^{-1} + X̃_k P_{k-1} X̃_k^H]^{-1}   (6.10)
    K_k = P_{k-1} X̃_k^H Γ_k   (6.11)
    ŵ_k = ŵ_{k-1} + K_k [d̃_k − X̃_k ŵ_{k-1}]   (6.12)
    P_k = P_{k-1} − K_k Γ_k^{-1} K_k^H.   (6.13)
For each time instant 0 ≤ k ≤ N we find ŵ_k to be the desired minimum of g_VLS(ŵ_k) if and only if P_k > 0, i.e., positive definite. Compare this form with (4.46).
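As a sanity check, the recursion (6.10)–(6.13) can be sketched as follows. The dimensions, the noise level, and the choices Π_o = I and Q̃_k = I are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, p = 3, 200, 2            # parameters, time steps, block size (illustrative)
w_true = rng.standard_normal(M)

P = np.eye(M)                  # P_0 = Pi_o, chosen as identity here
w = np.zeros(M)                # w_0 = 0
Qt = np.eye(p)                 # block weight Q~_k, identity for simplicity

for _ in range(N):
    X = rng.standard_normal((p, M))                # block regressor X~_k
    d = X @ w_true + 0.01 * rng.standard_normal(p)
    Gamma = np.linalg.inv(np.linalg.inv(Qt) + X @ P @ X.T)   # (6.10)
    K = P @ X.T @ Gamma                                      # (6.11)
    w = w + K @ (d - X @ w)                                  # (6.12)
    P = P - K @ np.linalg.inv(Gamma) @ K.T                   # (6.13)
    # P_k > 0 guarantees that w_k is the desired minimum
    assert np.all(np.linalg.eigvalsh(P) > 0)

print(np.allclose(w, w_true, atol=0.05))
```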
Note that this last condition was always guaranteed for the RLS algorithm under the assumption of persistent excitation. Now the condition must be required explicitly and also checked at each time instant k. Computing the eigenvalues of a matrix is very complex; there are alternative methods (matrix inertia) just for testing definiteness.
We can again derive a relation between the a-priori and the a-posteriori errors:

    ẽ_{p,k} = Q̃_k^{-1} Γ_k ẽ_{a,k}.   (6.14)
6.2 Robustness
All adaptive algorithms can be driven by random processes, whereby a mean squared error is minimized. In many applications, for example speech processing, this is an appropriate measure. In other applications, however, it may be inappropriate. Consider for example a milling machine whose cutting depth is controlled in the mean or by another statistical measure: with a certain probability the desired depth would then be exceeded. Even worse is the situation for an autopilot in an airplane. If an airplane flies correctly only in the mean, it may very well crash. Here we need a measure that can ensure a certain robustness against the worst case.
Such a measure can be defined in terms of energy (or power). Let us again consider the input and output of a linear system.

Here the initial error in terms of the (usually unknown) initial conditions, the control signal u_k, and the additive noise are the inputs, while the a-priori error energy is the output. More generally we can formulate this as:
We can for example select Z_k = X_k, which means that we observe the distorted as well as the undistorted system. If it is possible to find an algorithm such that the value k_a(l) remains below a threshold for every l, that is

then such an algorithm satisfies the desired robustness. The problem is known under the name "H∞ filter design with finite time horizon". Note that the requirement (6.20) can be formulated as:
    w̃_0^H Π_o^{-1} w̃_0 + Σ_{k=1}^{N−1} ||u_k||² + Σ_{k=1}^{N−1} ||v_k||² − γ^{-2} Σ_{k=1}^{N} ||e_{a,k}||² ≥ 0.   (6.23)
This is a quadratic form in ŵ_N, as considered in the previous chapter. After some reformulation we recognize, however:

    Q_N = [−γ^{-2} I, 0; 0, I].   (6.24)
For both formulations a filter algorithm is known. However, its existence is not guaranteed for every value of γ². The corresponding a-posteriori algorithm reads:
A-posteriori H∞ Filter: The a-posteriori filter with bound γ exists if and only if the following expression is positive definite for every k:

    P_{k-1}^{-1} − (1/γ²) Z_k Z_k^H + X_k X_k^H > 0.   (6.27)
In this case the filter equations are:

    R_{e,k} = [−γ² I, 0; 0, I] + [Z_k; X_k] P_{k-1} [Z_k^H, X_k^H],   (6.28)

    ŵ_k = F_k ŵ_{k-1} + P_{k-1} X_k^H [I + X_k P_{k-1} X_k^H]^{-1} (d_k − X_k F_k ŵ_{k-1}),   (6.29)

    P_k = F_k P_{k-1} F_k^H − F_k P_{k-1} [Z_k^H, X_k^H] R_{e,k}^{-1} [Z_k; X_k] P_{k-1} F_k^H + G_k G_k^H;   P_0 = Π_o.   (6.30)
We immediately recognize that (6.27) may fail to be positive definite. The problem in robust filtering is thus that it is difficult to predict whether a filter with the desired robustness exists.
Comparing the filter algorithm with the generalized RLS and the Kalman filter, we find that the robust filter algorithm contains elements of both. From the Kalman filter comes the time-variant system dynamic described by the matrix F_k. For Z_k = X_k it takes on the form of the generalized RLS algorithm. And indeed there also exists an a-priori form of the algorithm for F_k = I that is identical to the RLS algorithm.
The problem in formulating adaptive filter algorithms with robust properties is thus not to find them but to predict their stability. The H∞ formulation delivers a robust solution form but does not tell us whether the filter converges. We will thus follow a different path in the following and derive robustness directly by means of energy (passivity) relations. This method allows us to cover a very large class of adaptive algorithms.
Closely related to robust filtering are passivity relations. If we consider the signals w̃_0 and v(k) as inputs and the corresponding error signals as outputs of the adaptive algorithm, we recognize that for γ² < 1 less energy comes out of the system than goes in. In this case we call the system passive.
To avoid mathematical difficulties in the case q = w, we can instead argue with |x_k^T w − x_k^T q|² − μ^{-1}(k)||w − q||²₂ ≤ 0. The statement certainly remains correct if we extend the denominator by |v(k)|²:

    |x_k^T w − x_k^T q|² / (μ^{-1}(k)||w − q||²₂ + |v(k)|²) ≤ 1.   (7.2)
As such relations hold for any arbitrary estimate q as long as μ(k)||x_k||²₂ ≤ 1, they must also hold for the estimates of the LMS algorithm, thus

    |x_k^T w − x_k^T ŵ_{k-1}|² / (μ^{-1}(k)||w − ŵ_{k-1}||²₂ + |v(k)|²) ≤ 1.   (7.3)
The pressing question is now in which sense an LMS estimate changes such a limit. Denoting by e_a(k) = x_k^T [w − ŵ_{k-1}] = x_k^T w̃_{k-1} the undistorted a-priori error, by e_p(k) = x_k^T [w − ŵ_k] = x_k^T w̃_k the undistorted a-posteriori error, and setting γ(k) = μ^{-1}(k) − ||x_k||²₂, the following theorem holds.
Theorem 7.1 (Local Passivity Property) For the adaptive gradient method (LMS al-
gorithm with variable step-size) we have at every time instant k:
Proof: We show the first relation. The update equations for the parameter error vector
are:
    w̃_k = w̃_{k-1} − μ(k) x_k^* [e_a(k) + v(k)],   (7.5)
where we split the distorted a-priori error ẽa (k) = ea (k) + v(k) into an undistorted a-priori
error and noise. Computing the quadratic l2 −norm on both sides, we obtain
    ||w̃_k||²₂ = ||w̃_{k-1}||²₂ + μ²(k)||x_k||²₂ |e_a(k)+v(k)|² − μ(k)[e_a(k)+v(k)] e_a^*(k) − μ(k)[e_a(k)+v(k)]^* e_a(k).
Note that
    |e_a(k) + v(k)|² = |e_a(k)|² + |v(k)|² + e_a(k)v^*(k) + e_a^*(k)v(k).
We thus obtain
Exercise 7.2 Show that the LMS algorithm can be derived by the local cost function
The numerator of (7.9) is thus the energy of the normalized undistorted a-priori errors √μ(k) e_a(k) for 1 ≤ k ≤ N, plus the energy of the remaining parameter error vector at time instant N. Correspondingly, the denominator also consists of two terms: the energy of the normalized noise/disturbance over the entire time period as well as the initial parameter error vector energy. This is a global energy relation: a matrix T_N maps the signals
{√μ(k) v(k)}_{k=1}^N and w̃_0 onto the normalized a-priori error signals {√μ(k) e_a(k)}_{k=1}^N and the remaining parameter error vector w̃_N.
With a causal (block lower-triangular) matrix T_N:

    [√μ(1) e_a(1); …; √μ(N) e_a(N); w̃_N] = T_N [w̃_0; √μ(1) v(1); …; √μ(N) v(N)].   (7.10)
According to (7.9), such a matrix must be passive, or contractive; that is, the induced l2-norm of the matrix is bounded: ||T_N||_{2,ind} ≤ 1. In terms of robust control, such an induced matrix norm is called the H∞ norm. Figure 7.1 illustrates the relation.
[Figure 7.1: The operator T_N maps w̃_0 and {√μ(k) v(k)}_{k=1}^N to w̃_N and {√μ(k) e_a(k)}_{k=1}^N.]
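The contraction ||T_N||_{2,ind} ≤ 1 can be checked empirically by running the LMS recursion with μ(k)||x_k||²₂ ≤ 1 and comparing input energy (initial error plus weighted noise) with output energy (weighted a-priori errors plus final error). All signal choices below are illustrative, real-valued assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 300
w_true = rng.standard_normal(M)

for trial in range(20):
    w = np.zeros(M)                        # so w~_0 = w_true
    in_energy = float(np.sum((w_true - w) ** 2))
    out_energy = 0.0
    for _ in range(N):
        x = rng.standard_normal(M)
        v = 0.1 * rng.standard_normal()
        mu = 0.5 / np.dot(x, x)            # mu(k) ||x_k||^2 <= 1
        e_a = np.dot(x, w_true - w)        # undistorted a-priori error
        out_energy += mu * e_a ** 2
        in_energy += mu * v ** 2
        w = w + mu * x * (e_a + v)         # LMS update with distorted error
    out_energy += float(np.sum((w_true - w) ** 2))
    assert out_energy <= in_energy + 1e-12   # passivity: ||T_N|| <= 1
print("passive in all trials")
```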
Exercise 7.3 Show the robustness of the LMS algorithm with fixed step-size μ. Which relation must μ satisfy so that the algorithm is robust?
Exercise 7.4 Consider the gradient algorithm with regular step-size matrix Mk :
    ŵ_k = ŵ_{k-1} + M_k x_k^* ẽ_a(k).   (7.11)

Derive the following relations for 0 < x_k^H M_k x_k ≤ 1 and γ(k) = 1 − x_k^H M_k x_k:

    (w̃_k^H M_k^{-1} w̃_k + |e_a(k)|²) / (w̃_{k-1}^H M_k^{-1} w̃_{k-1} + |v(k)|²) ≤ 1,   (7.12)

    (γ(k) w̃_k^H M_k^{-1} w̃_k + |e_p(k)|²) / (γ(k) w̃_{k-1}^H M_k^{-1} w̃_{k-1} + |v(k)|²) ≤ 1.   (7.13)
For M_k = ν(k)M with ν(k) > 0 and M > 0, derive the robustness conditions of this algorithm from the first relation.
Which value exactly maximizes the expression? If we consider the particular noise sequence v(k) = −e_a(k), we recognize that the gradient method does not update; in other words, the estimate remains at its initial value, ŵ_k = ŵ_0. We thus have

    max_{ŵ_0 ≠ w, √μ(k)v(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²) = 1.   (7.16)
The restriction ŵ_0 ≠ w is a technicality. If we allow ŵ_0 = w, then the denominator can become zero; in this case we would have to argue with differences rather than ratios.
Let us now consider the ratio in (7.15) for an arbitrary algorithm A. If we again select v(k) = −e_a(k), and this time ŵ_0 = w, then we have

    Σ_{k=1}^N μ(k)|v(k)|² = Σ_{k=1}^N μ(k)|e_a(k)|² ≤ ||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²,   (7.17)
or written differently:

    max_{v̄(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²) ≥ 1.   (7.18)
If relation (7.18) holds for an arbitrary algorithm A, while we know that (7.16) holds for the gradient method, then we can summarize this property in the following theorem.
Theorem 7.2 (Minimax Property of the Gradient Type Algorithm) The gradient type algorithm solves, for μ(k)||x_k||²₂ ≤ 1, the following minimax problem:

    min_{class of algorithms}  max_{ŵ_0, √μ(k)v(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²).   (7.19)
For a bounded initial error ||w̃_0||²₂ < ∞ and bounded disturbance energy, the energy of the a-priori error must be bounded as well. An infinite series of finite energy is a Cauchy series, and thus √μ(k) e_a(k) → 0.
Consider further the update equation at time instant k,

    w̃_k = w̃_{k-1} − μ(k) ẽ_a(k) x_k^*,   (7.23)

which we can equivalently write as

    w̃_{k+p-1} = w̃_k − Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+l}^*.   (7.24)

Consider a sequence of P > M vectors x_{k+p}, p = 1..P; each of them can be tested against w̃_k, and we find

    x_{k+p}^T w̃_k = x_{k+p}^T w̃_{k+p-1} + Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+p}^T x_{k+l}^*   (7.25)
                  = ẽ_a(k+p) + Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+p}^T x_{k+l}^*.   (7.26)
Because all e_a(k) → 0 and, due to the bounded energy, also v(k) → 0, we conclude that ẽ_a(k) → 0. We have thus shown that the right-hand side of (7.26) converges to zero. Stacking all vectors x_{k+p}, we find further:
    [x_{k+1}^T; x_{k+2}^T; …; x_{k+P}^T] w̃_k → 0.   (7.27)
Unfortunately, we cannot conclude from here that w̃_k → 0, as w̃_k could have components in the null space of the matrix. A consequence of the persistent excitation condition is that the matrix has full rank M and thus its null space is trivial. Formally this can be shown by multiplying with the Hermitian transpose of the matrix from the left. We then have
    [x_{k+1}^*, x_{k+2}^*, …, x_{k+P}^*] [x_{k+1}^T; x_{k+2}^T; …; x_{k+P}^T] w̃_k → 0.   (7.28)
Due to the persistent excitation condition (Lemma 4.3), the matrix on the left-hand side lies between αI and βI with 0 < α ≤ β. Thus the matrix is regular and its null space is trivial.
Some interesting consequences follow. In case the noise sequence v(k) is not of bounded energy, a bound must be enforced by the step-size, for example by μ(k) = a/[b + k]².
Lemma 7.1 For the gradient type method we find at each time instant k, with μ̄(k) = 1/||x_k||²₂:

    (||w̃_k||²₂ + μ(k)|e_a(k)|²) / (||w̃_{k-1}||²₂ + μ(k)|v(k)|²)   is  ≤ 1 for 0 < μ(k) < μ̄(k),  = 1 for μ(k) = μ̄(k),  ≥ 1 for μ(k) > μ̄(k).   (7.29)
Proof: The first relation has already been shown. The second is obtained by substituting μ(k) = μ̄(k). For the third relation we consider again (7.1):

    ||w̃_k||²₂ − ||w̃_{k-1}||²₂ + μ(k)|e_a(k)|² − μ(k)|v(k)|² = μ(k)|e_a(k) + v(k)|² [μ(k)||x_k||²₂ − 1].

    −v̄(k) = e_a(k) − (μ(k)/μ̄(k)) [e_a(k) + v(k)] = e_p(k).   (7.31)
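The three cases of Lemma 7.1 can be verified numerically for the real-valued case; the dimensions and random signals below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def ratio(c):
    """Energy ratio of Lemma 7.1 for step-size mu = c * mu_bar (one step, real case)."""
    M = 5
    w_tilde = rng.standard_normal(M)       # parameter error w~_{k-1}
    x = rng.standard_normal(M)
    v = rng.standard_normal()
    mu_bar = 1.0 / np.dot(x, x)
    mu = c * mu_bar
    e_a = np.dot(x, w_tilde)               # undistorted a-priori error
    w_next = w_tilde - mu * x * (e_a + v)  # error-vector update (7.5)
    num = np.dot(w_next, w_next) + mu * e_a ** 2
    den = np.dot(w_tilde, w_tilde) + mu * v ** 2
    return num / den

assert ratio(0.5) <= 1.0               # 0 < mu < mu_bar
assert abs(ratio(1.0) - 1.0) < 1e-9    # mu = mu_bar: lossless
assert ratio(1.5) >= 1.0               # mu > mu_bar
```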
Figure 7.2 illustrates this structure. The gradient type method can thus be explained as a feedback structure: in the forward path there is a lossless system, an allpass with ||T_k||_{2,ind} = 1, while the feedback path contains a lossy system. Compare this with Equation (3.68) and Figure 3.4. Such a local relation can also be reformulated into a global one, allowing
Figure 7.2: Gradient type method as a lossless system (allpass, ||T_k|| = 1) in the forward path and a lossy feedback path with gain 1 − μ(k)/μ̄(k).
    Σ_{k=1}^N μ̄(k)|e_a(k)|² ≤ ||w̃_0||²₂ + Σ_{k=1}^N μ̄(k)|v̄(k)|².   (7.32)
As long as δ(N ) < 1, we can conclude in a global sense what we expected. The
statements are summarized in the following theorem.
ŵk → w. (7.39)
It is interesting to remark on the energy flow of such a system. As the forward path is lossless, all energy that enters must also come out. The energy component in the parameter error vector is fed back to the input of the system. Energy can thus only be lost in the feedback path. The more energy is lost, the faster the algorithm learns. Thus the fastest learning is obtained for μ(k) = μ̄(k), as the feedback then becomes zero.
for the third condition µ̄(k) < µ(k) < 2µ̄(k) in Lemma 7.1.
We recognize the RLS algorithm with exponential weighting that is obtained for
µ(k) = β(k) = 1 and λ(k) = λ.
With the second property we can derive a feedback structure according to Figure 7.3.
[Figure 7.3: Feedback structure of the RLS algorithm: a lossless forward path (||T_k|| = 1) mapping λ^{1/2}(k) P_{k-1}^{-1/2} w̃_{k-1} and √μ̄(k) v̄(k) to P_k^{-1/2} w̃_k and √(μ̄(k) − β(k)) e_a(k), with a lossy feedback path.]
(7.45)
We have employed the following shorthand terms:

    δ(N) = max_{1≤k≤N} (1 − μ(k)/μ̄(k)) / √(1 − β(k)/μ̄(k))   and   γ(N) = max_{1≤k≤N} μ(k)/μ̄(k),   (7.46)

as well as

    λ_{[i,j]} = Π_{k=i}^{j} λ(k);   μ̄(k) = 1/(x_k^T P_k x_k^*).   (7.47)
(7.49)
now with a modified shorthand term:

    γ̃(N) = max_{1≤k≤N} (μ(k) − β(k)) / (μ̄(k) − β(k)).   (7.50)
Exercise 7.7 Find the stability bound for the step-size µ(k) of the Gauß-Newton Algo-
rithm. Consider both variants (7.45) and (7.49).
Exercise 7.8 For the particular case P_0 = εI, λ(k) = λ, μ(k) = μ_o μ̄(k) and β(k) = β_o μ̄(k), find the stability bound as well as the robustness measures. Compare the results to the gradient algorithm.
A linear synapse comprises a linear combiner (as before) but additionally has a nonlinear device f[z] at the output of the linear combiner, as shown in Figure 7.4. Such a nonlinear function is called an activation function. Its value can also be interpreted as the likelihood of the class to which a given vector x belongs. A common choice of such activation functions f[z]
[Figure 7.4: Linear combiner with input x, output z, and activation function f[z].]
Let us now consider a set of possible input vectors {x_k} with their corresponding correct output decisions {y(k)}. The values {y(k)} belong to the range of the activation function f[·]; that is, there are unknown vectors w such that

In supervised learning, data pairs {x_k, y(k)} are presented to the synapse so that the adaptation algorithm can estimate the unknown w. Best known is the Perceptron Learning Algorithm (PLA). The algorithm starts with an initial guess w_1 and applies the following rule:

    ŵ_k = ŵ_{k-1} + μ x_k (y(k) − f[x_k^T ŵ_{k-1}]).   (7.53)
In order to keep the result more general, we also added noise v(k) to the reference; this can be interpreted as a modeling error. The additively distorted reference values we again denote by {d(k)}. We thus observe

for which we also included a variable step-size. The only difference compared to our previous method is the nonlinear mapping f[·], as it occurs in the estimation path. By the mean value theorem we have:
Figure 7.5 exhibits the feedback structure in this case. The nonlinear mapping occurs in the passive feedback path, resulting in a modified convergence condition. Writing down the equations for global convergence we obtain

    δ(N) ≜ max_{1≤k≤N} |1 − f′[η(k)] μ(k)/μ̄(k)|,   (7.56)

which defines the condition for convergence and thus robustness: δ(N) < 1.
[Figure 7.5: Feedback structure of the nonlinear gradient method: lossless forward path (||T_k|| = 1) with feedback gain 1 − f′[η(k)] μ(k)/μ̄(k).]
Very typical for the PLA is that its analysis is limited to real-valued signals, as is its typical application. The problem with this limitation is the definition of the nonlinear mapping for complex-valued signals. As this is a requirement for adaptive equalizers, we now have to take a closer look at it. Let us consider Figure 1.7 of the first chapter. A complex-valued symbol s(k) is transmitted through a linear channel c. Additive noise alters the received symbol further before it is passed through a linear adaptive filter. We do not expect that purely linear filtering will recover the symbol s(k − D) entirely (up to an unavoidable delay D). On the other hand, only very particular symbols are being sent. We expect that after linear filtering the symbol value lies close to a valid symbol from the transmission alphabet and can thus be mapped onto the correct symbol by a suitable nonlinear mapping. We thus assume that this structure, with a given filter length and optimal parameter set w_o, is able to "equalize" the channel so that the output is close to the correct symbol. Our reference model thus delivers a value y(k), which we can compare with the estimate ŷ(k) conditioned on a parameter set w_{k-1}. We call this a training mode, even though we lack a concrete reference signal.
Let us consider this situation more closely. The reference signal (assume s(k − D), or a function of it) is given by f[y(k)] = f[x^T w_o], by which we can compute the error signal e_o(k) = f[y(k)] − f[ŷ(k)]. In order to relate it to the a-priori error, we write:

With the help of this new function, the update equations can be formulated in the typical form:

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* e_o(k) = ŵ_{k-1} + μ(k) x_k^* h[y(k), ŷ(k)] e_a(k).   (7.60)
The difficulty lies in the function h[·, ·]. In order to guarantee l2-stability, we must have

    δ(N) ≜ max_{1≤k≤N} |1 − h[y(k), ŷ(k)] μ(k)/μ̄(k)| < 1.   (7.61)
Example 7.1: Consider for example BPSK transmission. The expected symbols are thus {−1, +1}. We select as nonlinear mapping f[z] = sgn[z]. Then we obtain

    h[y(k), ŷ(k)] = (sgn[y(k)] − sgn[ŷ(k)]) / (y(k) − ŷ(k)).   (7.62)

As a negative value of sgn[y(k)] − sgn[ŷ(k)] can only occur when the difference of the arguments is negative, the function itself is non-negative, and thus a step-size exists for which stability is guaranteed.
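The non-negativity of (7.62) is easy to check numerically; the random, real-valued test points below are illustrative.

```python
import numpy as np

def h(y, y_hat):
    """h[y, y^] = (sgn[y] - sgn[y^]) / (y - y^) as in (7.62)."""
    return (np.sign(y) - np.sign(y_hat)) / (y - y_hat)

rng = np.random.default_rng(3)
y = rng.standard_normal(1000)
y_hat = rng.standard_normal(1000)
mask = np.abs(y - y_hat) > 1e-9      # avoid division by zero at y = y^
assert np.all(h(y[mask], y_hat[mask]) >= 0)   # h is never negative
```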
In case there is no reference signal, the algorithm can be driven in the so-called blind mode. As we have some a-priori information about the transmitted symbols, we can select the nonlinear mapping so that a constant appears at the output if the filter has been selected correctly. We can for example employ so-called constant modulus (CM) signals, that is, signals whose amplitude remains constant; the information is transmitted in the phase. Computing |z(k)| = |x_k^T w_o|, the optimal system will put out a constant amplitude, thus |z(k)| = γ. An adaptive algorithm can thus be:
CMA-q-2 Algorithm
Corresponding stability conditions are obtained from the function h[·, ·].
Exercise 7.9 Compute the stability conditions of the step-size µ(k) in case of BPSK train-
ing and utilization of a sign function.
Exercise 7.10 CM-signals are being used for a transmission and the CMA-2-2 algorithm
for training the equalizer. What is its stability condition?
Exercise 7.11 BPSK is being transmitted and the equalizer is run in blind mode. Its updates are given by:

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* [sgn[ŷ(k)] − ŷ(k)].   (7.64)

Draw the algorithm in its feedback structure and determine its stability conditions.
Draw the algorithm in its feedback structure and define its stability conditions.
known signals are offered as the desired signal. This is equivalent to the mode in which the switch is set to MMSE. In this mode the filter coefficients ŵ are selected so that the MSE is minimized, a requirement that can only be satisfied up to some error term, as the optimal solution requires a doubly infinite length equalizer. Let us denote the received signal by r(k) = c^T s_k + v(k). The MMSE error can be computed as

    e_MMSE(k) = z(k) − r_k^T ŵ_{k-1} = [s(k − D) − r_k^T ŵ_{k-1}] + [z(k) − s(k − D)] = e_NL(k) − g(z),   (7.65)

where we have collected the receiver values r(k) in the vector r_k^T = [r(k), r(k − 1), …, r(k − M + 1)]. In the blind mode of operation (also decision-directed mode), the reference signal
[Figure: Decision-directed equalizer structure: s(k) passes through the channel c, noise v(k) is added to form r(k); the adaptive equalizer ŵ produces z(k); a switch selects the error reference either from the MMSE branch (s(k − D)) or from the nonlinear decision device NL.]
is extracted from the linearly equalized signal by a nonlinear mapping. We thus have a nonlinear device in the reference path (switch in position NL). The relation between the error signal e_NL(k) so obtained and the MMSE error is shown in Equation (7.65). However, it was assumed in (7.65) that only correct symbols s(k − D) have been recovered. In the blind mode we do not have the correct values but only estimates of them. Equation (7.65) thus requires some correction:

The updates thus occur with an error signal ê_NL(k) = e_MMSE(k) + ĝ[z(k)]. From this we can conclude that:
• The adaptive filter of a nonlinear equalizer works in a system identification mode.
• The excitation for the system identification is a composite signal, comprising linearly filtered transmit symbols and additive noise.
Important for the equalizer is the tracking behavior of the algorithm. This can be described well by the feedback structure. Consider the update equation in the form:

For the CMA-2-2 algorithm we obtain, for example, f[ŷ(k)] = ŷ(k)[γ − |ŷ(k)|²]. Reformulating in the well-known form we obtain:
    μ̄(k)|e_a(k)|² + ||w̃_k||² = μ̄(k) |e_a(k) − (μ(k)/μ̄(k)) f[ŷ(k)]|² + ||w̃_{k-1}||².   (7.68)
If we consider the signals as random processes, we can compute the expectation on both sides. In steady-state we find that E[||w̃_k||²] = E[||w̃_{k-1}||²], and thus we have

    E[μ̄(k)|e_a(k)|²] = E[μ̄(k) |e_a(k) − (μ(k)/μ̄(k)) f[ŷ(k)]|²].   (7.69)
• We assume that in steady-state the reciprocal instantaneous energy µ̄(k) and the
estimated value ŷ(k) are statistically independent.
With these assumptions we can further process Equation (7.69). For small and constant step-sizes we obtain

    E[|e_a(k)|²] ≈ μ E[||x_k||²₂] · E[|s(k)|² γ² − 2γ|s(k)|⁴ + |s(k)|⁶] / (2 E[δ|s(k)|² − γ]),   (7.70)

with δ = 2 for the complex-valued case and δ = 3 for the real-valued case. A few remarks on the procedure and the result (7.70):
• It is interesting that the steady-state error energy can now also be minimized with respect to γ, so we can find the smallest steady-state error energy.

• Utilizing a gradient algorithm, the learning speed depends strongly on the eigenvalue spread of the input process. This spread is defined only by the channel if we assume a white data sequence to be transmitted. On the other hand, the additive white noise helps to decrease the eigenvalue spread.
Exercise 7.13 Compute the Excess-Mean-Square Error of the LMS algorithm based on the
statistical method presented here and compare the results with those in Chapter 3. Compute
also the steady-state error energy for the Least-Mean Fourth algorithm.
Exercise 7.14 Under the assumption of a reference model w_k = w_{k-1} + q_k, in which the vectors q_k are statistically independent, compute the Excess-Mean-Square error of the LMS algorithm as a function of the step-size.
Consider the odd function f[·] as the derivative of a convex cost function ψ(·). The adaptive algorithm (7.71) is a gradient type method that approximately minimizes E{ψ(d(k) − x_k^T ŵ_k)}. Several algorithms treated in the literature belong to this situation, such as the sign-error, least-mean-K, and power-of-two quantized algorithms (see [12, 79]). The sign-error algorithm we had briefly discussed in Chapter 3 as a means to save complexity. For its analysis we take our standard reference model:
Theorem 7.5 The update for the gradient type method with nonlinear mapping in the error path is given by

    ŵ_k = ŵ_{k-1} + (x_k^* / ||x_k||²₂) (ẽ_a(k) − q[ẽ_a(k)]),   (7.73)

in which q[·] is an odd function with e·q[e] > 0. Under this condition the gradient method is l2-stable and robust as long as q[e] is contracting, that is
and
    ||w̃_k||²₂ + ((2 − β(β² + 1))/2) · |e_a(k)|²/||x_k||²₂ ≤ ||w̃_{k-1}||²₂ + (β² + 1) |v(k)|²/||x_k||²₂.   (7.80)
Now we only have to iterate (7.80) from k = 1 to k = N and we obtain

    ||w̃_N||²₂ + ((2 − β(β² + 1))/2) Σ_{k=1}^N |e_a(k)|²/||x_k||²₂ ≤ ||w̃_0||²₂ + (β² + 1) Σ_{k=1}^N |v(k)|²/||x_k||²₂,   (7.81)
and thus

    Σ_{k=1}^N |e_a(k)|²/||x_k||²₂ < (2/(2 − β(β² + 1))) ||w̃_0||²₂ + (2(β² + 1)/(2 − β(β² + 1))) Σ_{k=1}^N |v(k)|²/||x_k||²₂.   (7.82)
For N → ∞ the error terms |e_a(k)|²/||x_k||²₂ remain bounded as long as the terms |v(k)|²/||x_k||²₂ remain bounded.
We had already mentioned in Chapter 3 that the gradient method with a-posteriori error is equivalent to a particular normalization of the step-size in the LMS algorithm. This form of the gradient method can be extended as shown next.

Theorem 7.6 Given a gradient method with an odd nonlinearity satisfying e·f[e] > 0 applied to the a-posteriori error:

    ŵ_k = ŵ_{k-1} + μ(k) f[ẽ_p(k)] x_k^*.   (7.83)

The gradient method is l2-stable and robust for every bounded step-size μ(k) > 0.
Proof: By multiplication with x_k^T and subtraction from d(k) we obtain the relation

or also

    ẽ_p(k) + μ(k)||x_k||²₂ f[ẽ_p(k)] = ẽ_a(k).   (7.85)

As f[e] is an odd function, we have sgn[ẽ_p(k)] = sgn[ẽ_a(k)], and thus

The relation between ẽ_p(k) and ẽ_a(k) is thus contracting; therefore, with Theorem 7.5 we conclude that the method is l2-stable.
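The implicit a-posteriori relation (7.85) can be illustrated numerically. The cubic f[e] = e³ below is a hypothetical odd nonlinearity with e·f[e] > 0 (not one from the text); the root is found by bisection, and the contraction |ẽ_p| ≤ |ẽ_a| is verified.

```python
def f(e):
    """Hypothetical odd nonlinearity with e * f(e) > 0."""
    return e ** 3

def solve_posterior(e_a, c):
    """Solve e_p + c * f(e_p) = e_a for e_p by bisection (c = mu * ||x||^2 > 0).

    Since g(e) = e + c*f(e) is monotone with g(0) and g(e_a) bracketing e_a,
    the root lies between 0 and e_a, which already implies |e_p| <= |e_a|.
    """
    lo, hi = (0.0, e_a) if e_a >= 0 else (e_a, 0.0)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (mid + c * f(mid) - e_a) * (lo + c * f(lo) - e_a) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for e_a in (-2.0, -0.3, 0.7, 3.0):
    e_p = solve_posterior(e_a, c=1.5)
    assert abs(e_p + 1.5 * f(e_p) - e_a) < 1e-6   # satisfies (7.85)
    assert abs(e_p) <= abs(e_a)                   # contraction
```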
Exercise 7.15 Compute the robustness condition for the sign-error algorithm:
before we can apply the update equation. Pipelining results in an error signal that is available only a few clock cycles later. If this is the case, the update equation takes the form

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* ẽ_a(k − D),   (7.88)

in which the error occurs delayed by D cycles. This is the simplest case of a filtered error.

In active noise control the error is constructed in the acoustic path and captured by a microphone (see also Figure 1.5). The path from the loudspeaker through the control unit is part of the linear filter path of the update error.
A further application of such a filtered error path are adaptive IIR filters. Until now we considered only transversal filter structures (and linear combiners). If the reference model consists of an IIR filter, a large number of coefficients of a transversal filter would need to be estimated (depending on the pole locations), while a corresponding IIR filter would require only a few taps. Already in the 1970s this motivated many researchers to investigate adaptive IIR algorithms [77, 73, 16]. Their success, however, remained very limited, the major problem being stability. The reason for this problem is again the occurrence of a linear filter in the error path, as we show next.
In adaptive IIR structures we have to distinguish between the so-called output error and the equation error. In the first form we apply estimated output values as input of the filter, thus the output of the estimator:

    u_k^T = [x(k), x(k−1), …, x(k−M+1), ŷ(k−1), ŷ(k−2), …, ŷ(k−N)].   (7.90)

This is opposed to the equation error method, where noisy outputs of the reference model are employed:

    u_k^T = [x(k), x(k−1), …, x(k−M+1), d(k−1), d(k−2), …, d(k−N)].   (7.91)
Applying the equation error we can straightforwardly write down the Wiener solution for random signals:

    E[u_k^* u_k^T] ŵ = E[d_k^* u_k^T].   (7.92)

As parts of d(k) are entries of the regression vector, there is more correlation than desired. Splitting the regression vector u_k into the two components x_k and d_k, we obtain

    E[R_xx, R_dx; R_xd, R_dd] ŵ = E[r_dx; r_dd].   (7.93)
Assuming white additive noise, we find r_dd = r_yy. On the left-hand side of the equation we find a term R_dd = R_yy + σ_v² I that behaves differently than usual: due to the noise it has an additional component. The estimator so obtained is thus not bias-free. For the output error method, on the other hand, such a bias is not expected. We will understand this better after a detailed analysis.
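The bias of the equation-error method can be made visible in a small simulation. The first-order model below, with a = 0.8, b = 1 and noise variance 0.25, is an illustrative choice, not from the text: the LS estimate of the feedback coefficient is attenuated because the regressor d(k−1) carries the noise v(k−1).

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, sig_v = 0.8, 1.0, 0.5
N = 200_000

x = rng.standard_normal(N)
v = sig_v * rng.standard_normal(N)
y = np.zeros(N)                 # noise-free output of the reference model
for k in range(1, N):
    y[k] = a * y[k - 1] + b * x[k]
d = y + v                       # observed (noisy) reference

# Equation-error LS: regress d(k) on [x(k), d(k-1)]
U = np.column_stack([x[1:], d[:-1]])
b_hat, a_hat = np.linalg.lstsq(U, d[1:], rcond=None)[0]

print(a_hat)   # attenuated: |a_hat| < |a| because of the noisy regressor
assert abs(a_hat) < abs(a) - 0.01
```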
Until now it remains unclear why the output error method belongs to the algorithms with a linear filter in the error path. To understand this we use the IIR reference model:

    w^T = [b(0), b(1), …, b(M−1), a(1), a(2), …, a(N)] = [b^T, a^T],   (7.94)

with the IIR filter coefficients b(0)…b(M−1) and a(1)…a(N). For the undistorted a-priori output error we find
We recognize a particular linear filtering in the error path. In the case of a constant step-size μ(k) = μ the algorithm is called the Feintuch algorithm [16]. These considerations are of course not restricted to simple gradient methods; methods of the Gauß-Newton class can be treated equally. Utilizing an output error with a Gauß-Newton type algorithm, the corresponding algorithm is called the pseudo-linear regression (PLR) algorithm.
Figure 7.7: Adaptive algorithm structure with linear filter in the error path.
If the error path F[·] is given in matrix form, for example here with three coefficients f_0, f_1, f_2,

          [ f_0               ]
          [ f_1  f_0          ]
    F_N = [ f_2  f_1  f_0     ]   (7.106)
          [      f_2  f_1  f_0]
          [        ⋱   ⋱   ⋱  ]

and furthermore the step-sizes are collected in diagonal matrices M̄_k and M_k, then the abbreviations are

    δ(N) ≜ ||I − M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind},   (7.107)
    γ(N) ≜ ||M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind}.   (7.108)
Figure 7.8: Gradient type algorithm with linear filter in the error path in feedback structure.
With such definitions it is possible to derive robustness conditions also for the case of
linearly filtered errors.
Theorem 7.7 For the gradient method with linearly filtered error we find l2-stability with the definitions (7.107) and (7.108):

    √(Σ_{k=1}^N μ̄(k)|e_a(k)|²) ≤ (1/(1 − δ(N))) [||w̃_0||₂ + γ(N) √(Σ_{k=1}^N μ̄(k)|v(k)|²)],   (7.109)

    √(Σ_{k=1}^N μ(k)|e_a(k)|²) ≤ (γ(N)/(1 − δ(N))) [||w̃_0||₂ + γ(N) √(Σ_{k=1}^N μ(k)|v(k)|²)].   (7.110)
The proof follows the previous procedure (see also Exercise 7.11). However, the stability condition

    ||I − M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind} < 1   (7.111)

is much harder to check with the abbreviations in (7.107), as we have to deal with time-variant components in the matrices. For relatively large filter lengths M we can approximately claim that μ̄(k) = 1/||x_k||²₂ ≈ 1/(M σ_x²). With constant step-size μ(k) = μ_o the following relation holds:

    max_Ω |1 − (μ_o/(M σ_x²)) F(e^{jΩ})| < 1.   (7.112)
The step-size μ_o providing the fastest convergence can be found by the following minimax optimization:

    μ_opt = arg min_{μ_o} max_Ω |1 − (μ_o/(M σ_x²)) F(e^{jΩ})|.   (7.113)
From the stability condition (7.112) one recognizes that it is not necessarily satisfied for all positive step-sizes. The linear function F[·] can exhibit a negative real part at various frequencies; in this case the algorithm behaves unstably even for small step-sizes (assuming excitation at these frequencies). The necessary condition on F[·] is known in the literature as strict positive realness (SPR):

    Re{F(e^{jΩ})} > 0   for all Ω.   (7.114)
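The SPR condition (7.114) and the step-size bound (7.112) can be tested on a frequency grid. The two FIR error paths below are hypothetical examples (one SPR, one not), and M and σ_x² are illustrative values.

```python
import numpy as np

def F_resp(coeffs, omega):
    """Frequency response F(e^{jOmega}) of an FIR error path."""
    return sum(c * np.exp(-1j * k * omega) for k, c in enumerate(coeffs))

omega = np.linspace(0, 2 * np.pi, 2048, endpoint=False)
M, sigma_x2 = 16, 1.0

f_spr = [1.0, 0.5]       # Re F = 1 + 0.5 cos(Omega) > 0 everywhere: SPR
f_not = [1.0, -1.5]      # Re F = 1 - 1.5 cos(Omega) <= 0 near Omega = 0: not SPR

assert np.all(np.real(F_resp(f_spr, omega)) > 0)
assert np.any(np.real(F_resp(f_not, omega)) <= 0)

# For the SPR filter a small step-size satisfies (7.112) ...
mu_o = 0.5
assert np.max(np.abs(1 - mu_o / (M * sigma_x2) * F_resp(f_spr, omega))) < 1
# ... while for the non-SPR filter no positive step-size does
# (at Omega = 0 we get |1 + 0.5 * mu_o/(M sigma_x^2)| > 1).
for mu_o in (1e-3, 0.1, 1.0, 10.0):
    assert np.max(np.abs(1 - mu_o / (M * sigma_x2) * F_resp(f_not, omega))) >= 1
```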
Exercise 7.16 Prove Theorem 7.7.
Exercise 7.17 Extend the proof to include the Gauß-Newton algorithm with linearly
filtered error path.
Exercise 7.18 An adaptive IIR filter with two feedback coefficients a(1) and a(2) is to be adapted by the Feintuch algorithm. Which condition must the two coefficients satisfy so that a stable algorithm is obtained? What is the optimal step-size?
Exercise 7.19 Let an undisturbed nonlinear system be given by y(k) = a^T y_k + b_1^T x_k + b_2^T xx_k with xx_k^T = [x(k)x(k), x(k)x(k−1), …, x(k)x(k−M_2+1)]. Derive stability conditions for the corresponding gradient method with output error.
Exercise 7.20 A compromise could be an adaptive algorithm whose update error is partially output error and partially equation error. Derive such an algorithm and find its stability conditions.
Exercise 7.21 Steiglitz and McBride had the idea [74] that the adaptive IIR filter can be improved by pre-filtering the update error of the Feintuch algorithm by 1 − Â(q^{-1}). Derive the stability condition in this case.
Exercise 7.22 The neural network by Narendra and Parthasarathy (see Figure 1.11) can be interpreted as an adaptive IIR filter with a nonlinearity in the estimation path. Derive a gradient algorithm for the training of such a network and find the required stability conditions.
114 Adaptive Filters (preliminary)
Matlab Exercise 7.1 Write a Matlab program to identify the following system:
Let the system be disturbed additively by white Gaussian noise with σ_v² = 0.01. Utilize an equation error as well as an output error and compare the results. Use the three different sets
with the frequencies f₁ = 0.1, f₂ = 0.17, f₃ = 0.25, f₄ = 0.35, f₅ = 0.4. Find the step-sizes for fastest convergence and the stability bound.
For small step-sizes the estimate changes only slowly and, assuming it remains constant during M_F updates, M_F being the filter length of F[·], we can write
which can be argued similarly as for the method with the decorrelation filter. However, slow adaptation is usually not of much interest, which raises the question how the algorithm reacts to larger step-sizes. To treat the most general case we consider the generalized FXLMS algorithm:
ŵ_k = ŵ_{k−1} + μ(k) F[x_k^*] G_k(q^{−1}) {F[ẽ_a(k)]}.   (7.116)
In addition to the fixed filter F we assume a time-variant filter G_k. By multiple reformulations it can be shown [61] that

F[ẽ_a(k)] = F[v(k)] + F[x_k^T] w̃_{k−1} + Σ_{l=1}^{M_F−1} c(k,l) G_{k−l}(q^{−1}) F[ẽ_a(k−l)],   (7.117)
which allows us to convert the FXLMS algorithm into a type with a filtered error path (and not a filtered regression vector). The generalized FXLMS algorithm can thus be reformulated as:

ŵ_k = ŵ_{k−1} + μ(k) F[x_k^*] [G_k(q^{−1}) / (1 − C_k(q^{−1}) G_k(q^{−1}))] {F[v(k)] + F[x_k^T] w̃_{k−1}}.   (7.121)

The error path is thus governed by the time-variant filter G_k(q^{−1}) / (1 − C_k(q^{−1}) G_k(q^{−1})).
Exercise 7.24 Show that the optimal filter for the generalized FXLMS algorithm is given by:

G_opt,k(q^{−1}) = 1 / (1 + C_k(q^{−1})).   (7.122)
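The effect of (7.122) can be checked numerically on the unit circle: with G_opt = 1/(1 + C), the error-path filter G/(1 − CG) appearing in (7.121) collapses to unity at every frequency. A minimal sketch (the coefficients of C below are illustrative assumptions):

```python
import numpy as np

c = np.array([0.0, 0.5, -0.2])            # assumed C(q^{-1}) = 0.5 q^{-1} - 0.2 q^{-2}
omega = np.linspace(0.0, np.pi, 512)
q = np.exp(-1j * omega)                   # q^{-1} evaluated on the unit circle
C = sum(ck * q**k for k, ck in enumerate(c))
G_opt = 1.0 / (1.0 + C)                   # optimal filter (7.122)
H = G_opt / (1.0 - C * G_opt)             # error-path filter of (7.121)
max_dev = np.max(np.abs(H - 1.0))         # deviation from an identity error path
```

With G_opt the recursion through C is exactly cancelled, so H(e^{jΩ}) ≡ 1.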
Exercise 7.25 Under the assumption of constant coefficients c(k) and a normalized step-size version of the algorithm, provide the stability condition as a function of the normalized step-size μ_o and the error path filter with optimal G_opt,k.
Exercise 7.27 The Zero-Forcing algorithm for channel equalization is given by the updates

ŵ_k = ŵ_{k−1} + μ(k)[s(k−D) − ŝ(k−D)] s_k^*,   (7.124)

for which the meaning of the quantities can be found in Figure 7.6 and the vector s_k^T = [s(k), s(k−1), ..., s(k−M+1)]. Analyze the algorithm and find step-sizes μ(k) that guarantee convergence.
7.7 Literature
Good tutorials and introductions to robust control are found in [11, 37]. An introduction to adaptive, robust filtering can be found in [42]. The small-gain theorem is well explained in [38, 76]. First publications on the robustness of LMS algorithms are [25, 26]. In [69] the minimax interpretation of the LMS algorithm can be found. In [57, 58] the feedback structure of gradient-type algorithms as well as of the Gauß-Newton method is explained. Good tutorials on neural networks are [27, 33, 39]. In [56] the explanation of the Feintuch algorithm is found. The DLMS algorithm has been treated classically in [40, 41] and [54]. Further details on the tracking of equalizers are in [43]. Details on the FXLMS algorithm can be found in [64].
Appendix A
with a single unit entry at position i, all other entries being equal to zero. If we consider a linear combination of the elements v_i with constant coefficients a_i, the differentiation with respect to one element v_j leads to:

∂/∂v_j (a^T v) = ∂/∂v_j Σ_{k=1}^M a_k v_k = a_j.
If we now build a vector of length M that contains at position j the derivative of a^T v with respect to v_j, we can write:

∂/∂v (a^T v) = [∂/∂v_1 (a^T v), ∂/∂v_2 (a^T v), ..., ∂/∂v_M (a^T v)] = a^T.

Note that choosing the gradient as a row vector here is arbitrary. Since in these lecture notes we consistently use gradients in the form of row vectors, this choice simplifies matters.
These operations are also called the Wirtinger differential operators, in recognition of the
mathematician Wilhelm Wirtinger [81].
Consider for example the term g(z) = |z|² = z z* = x² + y². By differentiating separately with respect to x and y, we obtain

∂g(z)/∂x = 2x

and

∂g(z)/∂y = 2y.

Consequently, according to the above stated definitions, the derivative of g(z) = |z|² with respect to z reads

∂g(z)/∂z = x − jy = z*.
We observe that the differentiation with respect to z treats the complex conjugate z* like a constant. Analogously, we can differentiate with respect to z*. For the above example g(z) = |z|², the derivative with respect to z* simply becomes z.
Finally, the rules for differentiation with respect to complex-valued vectors can be derived based on the rules presented in the previous part of this Appendix A, in combination with the above introduced Wirtinger differential operators. Below, we present a collection of a few useful rules for the differentiation of scalar functions which depend on a complex-valued vector z:

∂‖z‖₂²/∂z = ∂(x^T x + y^T y)/∂z = x^T − j y^T = z^H,

∂‖z‖₂²/∂z^H = ∂(x^T x + y^T y)/∂z^H = x + jy = z,

∂(z^H A z)/∂z = ∂(x^T A x − j y^T A x + j x^T A y + y^T A y)/∂z = x^T A − j y^T A = z^H A.
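The Wirtinger rules above can be verified numerically by implementing ∂/∂z = ½(∂/∂x − j ∂/∂y) with finite differences. A sketch (step size and test points are arbitrary choices):

```python
import numpy as np

def wirtinger_dz(g, z, h=1e-6):
    # d/dz = 0.5 * (d/dx - j d/dy), applied to a real-valued function g
    dgdx = (g(z + h) - g(z - h)) / (2 * h)
    dgdy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
    return 0.5 * (dgdx - 1j * dgdy)

z0 = 1.2 - 0.7j
d_scalar = wirtinger_dz(lambda z: abs(z) ** 2, z0)   # expect z0* (rule d|z|^2/dz = z*)

# vector case: the entries of d||z||^2/dz are z_i*, i.e. the row vector z^H
zv = np.array([1.0 + 2.0j, -0.5 + 0.3j])
d_vec = np.array([wirtinger_dz(lambda w, i=i: abs(w) ** 2 + abs(zv[1 - i]) ** 2, zv[i])
                  for i in range(2)])
```

The numerical derivatives match the closed-form rules ∂|z|²/∂z = z* and ∂‖z‖₂²/∂z = z^H componentwise.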
Appendix B
Definition [Convergence]: If for every δ > 0 there exists an integer n_o such that the distance dist(x_n, y) < δ for each n > n_o with some constant value y, then we call the sequence x_n convergent and y the limit of x_n:

x_n → y,   y = lim_{n→∞} x_n.

We also say the sequence x_n converges towards y. Here, the distance dist(·) is a metric. If there is more than one limit value (for example at the output of a discrete-time oscillator), we call these points limit points.
Such a definition of convergence is not very practical since it requires the limit point y to be known a priori. Only then is it possible to check whether the sequence converges or not. This form of convergence is also called convergence everywhere. It is applicable for deterministic sequences as well as for random sequences and processes. For a random process x_n(ζ), the limit may depend on the outcome of a random experiment ζ.

Definition [Convergence with probability one]: If a random process x_n(ζ) converges to a limit x(ζ) such that

P{x_n → x} = 1,

then we call it convergence with probability one or convergence almost everywhere.
Definition [Convergence in the mean]: If the following is true for a random process

lim_{n→∞} E[x_n − x] = 0,

then we call it convergence in the mean, which is not to be confused with the next expression.

Definition [Convergence in the mean square]: If the following is true for a random process

lim_{n→∞} E[|x_n − x|²] = 0,

then we call it convergence in the mean square sense or limit in the mean.
The common shorthand notations for these modes of convergence are d (in distribution), p (in probability), a.e. (almost everywhere), and MS (mean square).
Definition: Let us first consider two random variables x and y. They are called spherically invariant if their joint density function f_xy(x,y) can be written as

f_xy(x,y) = g(√(x² + y²)) = g(r).

The following theorem holds.
Theorem C.1 If two random variables x and y are spherically invariant and statistically independent, then they are Gaussian distributed, zero-mean and of identical variance.

Proof: Statistical independence means:

g(r) = g(√(x² + y²)) = f_x(x) f_y(y).
The theorem is thus proven. Note, however, that this does not mean that all spherically invariant random variables are Gaussian distributed. Rather, it means that those that are not Gaussian distributed are not statistically independent. In those cases we do not have f_xy(x,y) = f_x(x) f_y(y).

We can now relax the condition of circles and extend the space towards elliptically invariant random processes:
and thus the condition for an ellipsoid: B² − 4AC = 4b² − 4ac < 0, thus b² < ac, which is satisfied since the correlation coefficient ρ is bounded by −1 ≤ ρ ≤ 1 (proof by the Cauchy-Schwarz inequality). The form (C.3) is much easier to handle, as we can immediately recognize the translational parameters (x₀, y₀). Also the variances and the correlation factor are given by:

σ_x² = c/(ac − b²);   σ_y² = a/(ac − b²);   ρ = −b/√(ac).

The form in (C.2) involves a matrix inversion but on the other hand allows for relatively quick reading of the desired parameters. With this form we can easily extend the description towards more than two dimensions, for example three:
(x − x₀, y − y₀, z − z₀) [ σ_x²  ρ_xy σ_x σ_y  ρ_xz σ_x σ_z ; ρ_xy σ_x σ_y  σ_y²  ρ_yz σ_y σ_z ; ρ_xz σ_x σ_z  ρ_yz σ_y σ_z  σ_z² ]^{−1} (x − x₀, y − y₀, z − z₀)^T = const.   (C.4)
x^T R_xx^{−1} x = const.   (C.5)
Theorem C.2 Elliptical random processes maintain this property under linear transformations. There exists at least one linear transformation that transforms elliptical random variables into spherical ones.
and

R_xx = A^{−1} R_yy A^{−T}.

Thus y^T R_yy^{−1} y = x^T A^T R_yy^{−1} A x = x^T R_xx^{−1} x. For density functions under a linear transformation we have

f_y(y) = f_x(x)/|det A| = g(x^T R_xx^{−1} x)/|det A| = g(y^T R_yy^{−1} y)/|det A|.
Thus, the first part of the theorem is shown. Let us now select the particular A = R_xx^{−1/2}; then we have

R_yy = E[y y^T] = A R_xx A^T = I.

The linearly transformed random vector y is thus uncorrelated. This particular choice of A is always possible as the autocorrelation matrix is always Hermitian and positive definite. Note that with this choice of A we not only guarantee decorrelation but also identical variance of all entries in y. If only decorrelation were required, we could satisfy this with any diagonal matrix D via A = D R_xx^{−1/2}.
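The particular choice A = R_xx^{−1/2} can be illustrated with a small numeric sketch (the covariance matrix below is an assumed example):

```python
import numpy as np

Rxx = np.array([[2.0, 0.8],
                [0.8, 1.0]])               # assumed elliptical covariance
w, V = np.linalg.eigh(Rxx)                 # symmetric eigendecomposition
A = V @ np.diag(w ** -0.5) @ V.T           # A = Rxx^{-1/2}
Ryy = A @ Rxx @ A.T                        # covariance after y = A x
```

Ryy equals the identity, i.e. the transformed vector is uncorrelated with identical unit variances in all entries.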
f_xy(x,y) = g(√(x² + y²)) = g(r) = f_r(r)/(2πr) = (1/π) f(r², 2).   (C.6)
Such a formulation is independent of the radial density. Note that the angular component φ is always uniformly distributed on [−π, π]. Furthermore, we immediately conclude statistical independence of the two random variables r and φ in polar form, while the Cartesian counterparts x and y are not necessarily statistically independent. We have already shown that this is only the case for Gaussian distributions. Such formulations in polar coordinates can be extended to more dimensions. Doing so, we recognize that all angular densities are statistically independent of each other and of the radial density. For all higher angles we always find a constant distribution on [0, π].
In (C.6) we already introduced a particular description for the radial density, in this case for the two-dimensional density function (1/π) f(r², 2), or more generally (1/π^{M/2}) f(r², M) for M dimensions. To emphasize the dimension M we took it on as a parameter of the density function. The prefactor 1/π^{M/2} turns out to be quite practical for compact writing. As marginal densities can be computed by integrating out joint densities, we find:

(1/π^{M/2}) f(r², M) = ∫_{−∞}^{∞} (1/π^{(M+1)/2}) f(r² + s_{M+1}², M+1) ds_{M+1}.

Here the joint density (1/π^{M/2}) f(r², M) consists of M Cartesian components s₁, s₂, ..., s_M with the condition r² = s₁² + s₂² + ... + s_M².
For spherically invariant processes even the converse is true; in other words, if the radial density is known in one dimension, it can be computed for all dimensions. This can easily be shown by integrating out two additional components in polar form:

(1/π^{M/2}) f(r², M) = (1/π^{M/2}) ∫_0^∞ 2ρ f(r² + ρ², M+2) dρ.
vector as the presence of a random process of which we always see a fraction of length M. Independently of time we always obtain the same joint density function for M components. We thus conclude that we have stationary processes.
In order to obtain the radial density we have to integrate over all angles, thus to compute the marginal density in r:

∫···∫ (1/π^{M/2}) f(r², M) dφ₁···dφ_{M−1} = f_r(r) ∫···∫ f_{φ₁}(φ₁)···f_{φ_{M−1}}(φ_{M−1}) Δ(φ₁, ..., φ_{M−1}) dφ₁···dφ_{M−1}.
Since all angular density functions are known, this term can be precomputed for a given dimension M:

f_{rM}(r) = (2/Γ(M/2)) r^{M−1} f(r², M).   (C.9)
Note that the density (1/π^{M/2}) f(r², M) refers to M components, while (1/π^{(M+1)/2}) f(r², M+1) refers to M + 1. The radial density f_{rM}(r) behaves similarly. To emphasize this we now introduced the dimension M in the form of an index in r_M. Special cases are known for M = 1, 2, 3:
Positive Gaussian:  f_{r1}(r) = (2/Γ(1/2)) f(r², 1) = √(2/π) e^{−r²/2}

Rayleigh:  f_{r2}(r) = (2/Γ(1)) r f(r², 2) = r e^{−r²/2}

Maxwell:  f_{r3}(r) = (2/Γ(3/2)) r² f(r², 3) = √(2/π) r² e^{−r²/2}
Note that r ≥ 0, which is a bit unusual for the Gaussian distribution, as it usually takes on negative arguments as well. In that case the factor is 1/√(2π) instead.
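For the Gaussian case the kernel is f(s, M) = 2^{−M/2} e^{−s/2}, and the three special cases can be checked against the general formula (C.9) numerically. A sketch using a simple Riemann sum:

```python
import numpy as np
from math import gamma, sqrt, pi

def f(s, M):
    # Gaussian kernel: (1/pi^{M/2}) f(r^2, M) = (2*pi)^{-M/2} exp(-r^2/2)
    return 2.0 ** (-M / 2) * np.exp(-s / 2)

def f_r(r, M):
    # radial density (C.9)
    return 2.0 / gamma(M / 2) * r ** (M - 1) * f(r ** 2, M)

r = np.linspace(0.0, 12.0, 200001)
dr = r[1] - r[0]
closed = {1: sqrt(2 / pi) * np.exp(-r ** 2 / 2),           # positive Gaussian
          2: r * np.exp(-r ** 2 / 2),                      # Rayleigh
          3: sqrt(2 / pi) * r ** 2 * np.exp(-r ** 2 / 2)}  # Maxwell
max_err = max(np.max(np.abs(f_r(r, M) - closed[M])) for M in (1, 2, 3))
norms = {M: np.sum(f_r(r, M)) * dr for M in (1, 2, 3)}     # each should be ~1
```

The general formula reproduces the listed densities exactly, and each integrates to one over r ≥ 0.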
Thus its standard deviation estimate σ_M = M^{−1/2} r_M follows the radial density. Let us now grow M to infinity. What happens?

lim_{M→∞} σ_M = lim_{M→∞} M^{−1/2} r_M.

We thus find that in general, in the limit, the variance is also a random variable, described by the density function f_σ(r). We cannot expect the ensemble average and the temporal average to be identical, as the temporal average would need to be a constant. We can thus conclude that in general spherically invariant processes are not ergodic, although they are stationary.
A particularity is again the Gaussian process. For the Gaussian process we have (just considering even orders)

f_{r,2M}(r) = (2/Γ(M)) r^{2M−1} f(r², 2M) = (2/Γ(M)) (1/2^M) r^{2M−1} e^{−r²/2}.   (C.10)

The density function follows a χ²-distribution. In the limit, M^{1/2} f_{rM}(M^{1/2} r) tends to a unit impulse at r = 1:

lim_{M→∞} M^{1/2} f_{rM}(M^{1/2} r) = δ(r − 1).

The Gaussian process is thus ergodic. The property of the Gaussian process that the radial component tends to a constant with increasing dimension is often called the hardening phenomenon.
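The hardening phenomenon is easy to observe by simulation (a sketch with arbitrary dimensions and a fixed seed): the spread of the normalized radius ‖x‖₂/√M of an IID Gaussian vector shrinks as the dimension M grows.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for M in (4, 64, 1024):
    x = rng.standard_normal((2000, M))                 # 2000 Gaussian vectors of dimension M
    radius = np.linalg.norm(x, axis=1) / np.sqrt(M)    # normalized radius, concentrates at 1
    spread[M] = radius.std()                           # shrinks roughly like 1/sqrt(2M)
```

For M = 1024 the normalized radius is already nearly deterministic, which is the hardening effect behind the ergodicity of the Gaussian process.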
Example C.2 This raises the question how we can generate spherically invariant processes other than the Gaussian process, possibly with a particular radial density. Consider first the multiplication of a Gaussian process x_k by a random variable y to obtain z_k = x_k y. Let us first consider one element at time instant k, thus z = xy. The density function of z can be obtained by the following two methods:
Path 1: The joint density is f_zy(z,y) = f_z(z|y) f_y(y). We obtain the desired marginal density by integration:

f_z(z) = ∫ f_z(z|y) f_y(y) dy = ∫ (1/√(2πy²)) e^{−(z/y)²/2} f_y(y) dy.
Path 2: Construct the auxiliary variable w = y and compute the transformation of (x, y) to (z, w):

f_zw(z,w) = f_xy(z/w, w)/|w| = f_x(z/w) f_y(w)/|w|.

We obtain the marginal density of z again by integration:

f_z(z) = ∫ f_x(z/w) f_y(w)/|w| dw = (1/√(2π)) ∫ e^{−(z/w)²/2} f_y(w)/|w| dw.
In both cases we thus obtain the same result. We recall that r ≥ 0 and identify

f(r², 1) = 2^{−1/2} ∫ e^{−(r/w)²/2} f_y(w)/|w| dw.   (C.11)
Example C.3 We now want to extend this example. Consider three IID Gaussian dis-
tributed random variables x1 , x2 , x3 being multiplied by a fourth random variable y of density
fy (y) that is statistically independent of the first three. For the joint densities we have:
fx1 ,x2 ,x3 ,y (x1 , x2 , x3 , y) = fx1 (x1 )fx2 (x2 )fx3 (x3 )fy (y)
We obtain

f_{z₁,z₂,z₃,w}(z₁,z₂,z₃,w) = f_{x₁}(z₁/w) f_{x₂}(z₂/w) f_{x₃}(z₃/w) f_y(w) / |w|³.

As all three variables x₁, x₂, x₃ are Gaussian we find:

f_{z₁,z₂,z₃,w}(z₁,z₂,z₃,w) = (2π)^{−3/2} f_y(w) e^{−(z₁²+z₂²+z₃²)/(2w²)} / |w|³.
In this way we can compute the joint densities of more than three variables. We must have

f(s, 2n+1) = (−1)^n (d^n/ds^n) f(s, 1).

In our case thus

f(s, 2n+1) = 2^{−(2n+1)/2} ∫ e^{−s/(2w²)} f_y(w)/|w|^{2n+1} dw.
With this trick we can now compute the radial densities for odd dimensions M = 2n + 1 (see (C.9)):

f_{rM}(r) = (2/Γ(M/2)) 2^{−M/2} r^{M−1} ∫ e^{−r²/(2w²)} f_y(w)/|w|^M dw.
Remark: The examples here showed a so-called product process. It can be shown that every spherically invariant process is a product process, in particular a product of a Gaussian process with a random variable (for more details see [5]).
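A small simulation sketch of such a product process (the modulating density f_y below is an arbitrary illustrative choice): the polar angle of a pair z = y·(x₁, x₂) stays uniform, reflecting spherical invariance, while the marginal density becomes heavier-tailed than Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal((n, 2))                 # IID Gaussian pair
y = rng.uniform(0.5, 2.0, size=n)               # independent radial modulation (assumed density)
z = x * y[:, None]                              # product-process samples

phi = np.arctan2(z[:, 1], z[:, 0])              # polar angle
hist, _ = np.histogram(phi, bins=8, range=(-np.pi, np.pi))
angle_ratio = hist.max() / hist.min()           # ~1 for a uniform angle

kurt = np.mean(z[:, 0] ** 4) / np.mean(z[:, 0] ** 2) ** 2   # equals 3 for Gaussian
```

The angle histogram is flat (spherically invariant), yet the kurtosis exceeds 3, so the marginal is not Gaussian; by Theorem C.1 the two components therefore cannot be statistically independent.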
Appendix D
Gaussian random variables play an important role in many applications, in particular whenever a sum of several random variables shows up. Most often the central limit theorem then applies, which, loosely speaking, says that a sum of many arbitrary random variables results asymptotically in a Gaussian random variable. For a real-valued vector x of p dimensions with Gaussian entries, mean x̄ and covariance matrix C_xx = E[(x − x̄)(x − x̄)^T], we obtain the following probability density function:

f_x(x) = (1/√((2π)^p det C_xx)) e^{−½ [x − x̄]^T C_xx^{−1} [x − x̄]}.   (D.1)
For zero-mean random variables the covariance matrix and the autocorrelation matrix R_xx = C_xx are identical. Correspondingly, two jointly Gaussian variables x ∈ IR^{1×p} and y ∈ IR^{1×q} can be described by their joint pdf

f_{x,y}(x,y) = (1/√((2π)^p (2π)^q det C_xx,yy)) e^{−½ [(x−x̄)^T, (y−ȳ)^T] C_xx,yy^{−1} [(x−x̄); (y−ȳ)]},   (D.2)
Gaussian. Assume the first two moments of the complex-valued Gaussian process are given, thus z̄ and C_zz:
This information is not sufficient to uniquely find C_xx, C_yy, C_xy, as the real part of C_zz only defines the sum C_xx + C_yy and the imaginary part only the difference C_yx − C_xy. In order to separate the three parts we need further knowledge, for example

C_zz* = E[(z − z̄)(z − z̄)^T] = C_xx − C_yy + j(C_yx + C_xy).   (D.5)

With this additional knowledge the three parameters can be defined uniquely. Gaussian processes that are circular (spherically invariant) have the following property¹:

C_zz* = C_xx − C_yy + j(C_yx + C_xy) = 0.   (D.6)
This is equivalent to

C_xx = C_yy and C_xy = −C_yx.   (D.7)

Note that the requirement C_xy = −C_yx is unusual; indeed, this requirement ensures the circularity of the process. Consider for example a two-dimensional random vector. The correlation matrix between x and y is given by

C_xy = [α₁₁  α₁₂; α₂₁  α₂₂].   (D.8)

In order to have C_xy = −C_yx, we must have α₁₁ = α₂₂ = 0 and α₁₂ = −α₂₁. Thus we obtain the particular form

C_xy = [0  α₁₂; −α₁₂  0].   (D.9)
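A simulation sketch of property (D.6) (the covariance matrix below is an assumed example): building z = (x + jy)/√2 from two independent real Gaussian vectors with identical covariance yields C_xx = C_yy and C_xy = 0 = −C_yx, so the pseudo-covariance C_zz* vanishes while the ordinary covariance survives.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Cr = np.array([[1.0, 0.4],
               [0.4, 2.0]])                     # assumed real covariance
L = np.linalg.cholesky(Cr)
x = rng.standard_normal((n, 2)) @ L.T           # real-part samples
y = rng.standard_normal((n, 2)) @ L.T           # independent imaginary-part samples
z = (x + 1j * y) / np.sqrt(2)

Czz = z.T @ z.conj() / n                        # ordinary covariance E[z z^H] -> Cr
Czz_star = z.T @ z / n                          # pseudo-covariance E[z z^T]  -> 0
```

The sample pseudo-covariance is (up to Monte Carlo noise) zero, confirming the circularity condition, while E[z z^H] recovers Cr.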
Is this a contradiction to the statement that quadratic forms need to be positive, thus with vanishing imaginary part? The answer is no, as we will show with the next example. We consider the quadratic form z^H C_zz z of a circular complex-valued random vector z:

½ z^H C_zz z = (x − jy)^T [C_xx + C_yy] (x + jy).   (D.10)

We obtain for its real part:

½ ℜ{z^H C_zz z} = x^T [C_xx + C_yy] x + y^T [C_xx + C_yy] y ≥ 0.   (D.11)
2
¹Note that spherical invariance requires that the joint density function is a function of the radius r; for a Gaussian process it is proportional to exp(−r²).
½ ℑ{z^H C_zz z} = x^T [C_xx + C_yy] y − y^T [C_xx + C_yy] x = 0.   (D.12)
Equivalently, the covariance matrix in (D.3) for a circular complex-valued vector z can be written as

C_xx,yy = [C_xx  C_xy; −C_xy  C_xx] = ½ [Re{C_zz}  −Im{C_zz}; Im{C_zz}  Re{C_zz}].   (D.13)
With this, the p-dimensional joint density of a complex-valued random vector z can be compactly described by its covariance matrix C_zz (compare to (D.2) with p = q):

f_{x,y}(x,y) = f_z(z) = (1/(π^p |det C_zz|)) e^{−[z − z̄]^H C_zz^{−1} [z − z̄]}.   (D.14)
The first identity is shown by computing the inverse of the matrix and comparing the individual terms of the x and y pairs:

C_xx,yy^{−1} = [C_xx  C_xy; −C_xy  C_xx]^{−1} = [Δ^{−1}  C_xx^{−1}C_xy Δ^{−1}; −C_xx^{−1}C_xy Δ^{−1}  Δ^{−1}]   (D.17)

with

Δ = C_xx + C_xy C_xx^{−1} C_xy,   (D.18)

[½ C_zz]^{−1} = [C_xx + j C_xy]^{−1} = [I − j C_xx^{−1} C_xy] Δ^{−1}.   (D.19)
Writing both terms in real and imaginary parts x and y, we obtain for the second part (D.19):

z^H [½ C_zz]^{−1} z = x^T Δ^{−1} x + y^T Δ^{−1} y   (D.20)
  + x^T C_xx^{−1}C_xy Δ^{−1} y − y^T C_xx^{−1}C_xy Δ^{−1} x
  − j [x^T C_xx^{−1}C_xy Δ^{−1} x + y^T C_xx^{−1}C_xy Δ^{−1} y]
  − j [y^T Δ^{−1} x − x^T Δ^{−1} y],
and realize that the real part is indeed identical. Note that the imaginary part becomes zero again. However, this is not obvious and will be considered more closely. The last term in the imaginary part vanishes:

y^T Δ^{−1} x − x^T Δ^{−1} y = 0.

The first term is of a more difficult nature:

x^T C_xx^{−1}C_xy Δ^{−1} x + y^T C_xx^{−1}C_xy Δ^{−1} y.
Here we have (C_xx^{−1}C_xy Δ^{−1})^T = −Δ^{−1}C_xy C_xx^{−1}, and due to the skew-symmetry C_xy^T = −C_xy also Δ^{−1}C_xy C_xx^{−1} = C_xx^{−1}C_xy Δ^{−1}. And further we find

x^T C_xx^{−1}C_xy Δ^{−1} x = 0.

The latter relation we prove by

C_xx^{−1}C_xy Δ^{−1} = Δ^{−1}C_xy C_xx^{−1}.
Multiplying by Δ from the left and from the right we obtain:

C_xy [I + (C_xx^{−1}C_xy)²] = [I + (C_xy C_xx^{−1})²] C_xy.

Pulling C_xy through the bracket finally shows the identity.
Obviously g(0,0) = 1 brings back the original density. We further introduce the matrix

L = [ R_xx^{−1} + [0 α 0 0; α 0 0 0; 0 0 0 β; 0 0 β 0] ]^{−1}   (D.29)

for which we have

g(α, β) √(det(R_xx)/det(L)) = 1   (D.30)

independent of α and β. Differentiating g(·,·) with respect to α and β delivers

(∂²/∂α∂β) g(α, β) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} x₁x₂x₃x₄ f_x(x) dx,   (D.31)
Developing the determinant det(R_xx)/det(L) along its first row yields four 3 × 3 subdeterminants; collecting all terms, we abbreviate the result as h(α, β), so that g(α, β) = h(α, β)^{−1/2}.
We now take advantage of the symmetries r_ij = r_ji and differentiate h(α, β)^{−1/2} with respect to α and β:

(∂²/∂α∂β) h(α, β)^{−1/2} = (3/4) h(α, β)^{−5/2} h_α(α, β) h_β(α, β) − (1/2) h(α, β)^{−3/2} h_{α,β}(α, β).   (D.34)

Here the first derivatives with respect to α and β are indexed accordingly. As we are interested in the result for α = 0 and β = 0, we finally obtain

E[x₁x₂x₃x₄] = (∂²/∂α∂β) h(α, β)^{−1/2} |_{α=0, β=0}   (D.35)
= r₁₂r₃₄ + r₁₃r₂₄ + r₁₄r₂₃.   (D.36)
It is worthwhile to read the corresponding section in Papoulis' textbook, where a much simpler derivation is shown. Note, however, that the path taken here can readily be modified to compute other terms, like sixth-order moments.
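The result (D.36) can be cross-checked by a short Monte Carlo experiment (the covariance matrix is an arbitrary positive definite example):

```python
import numpy as np

R = np.array([[1.0, 0.3, 0.2, 0.1],
              [0.3, 1.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.3],
              [0.1, 0.2, 0.3, 1.0]])            # assumed covariance (diagonally dominant, hence SPD)
rng = np.random.default_rng(7)
X = rng.multivariate_normal(np.zeros(4), R, size=400_000)
mc = np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3])        # sample fourth moment
formula = R[0, 1] * R[2, 3] + R[0, 2] * R[1, 3] + R[0, 3] * R[1, 2]   # (D.36)
```

The sample average agrees with the closed-form value r₁₂r₃₄ + r₁₃r₂₄ + r₁₄r₂₃ up to Monte Carlo noise.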
Appendix E
are in general not in the modal space of the driving process u_k^T = [u_k, u_{k−1}, ..., u_{k−M+1}] with autocorrelation matrix R_uu = E[u_k u_k^T] = QΛ_uQ^T, thus
Since the derivation of the LMS algorithm only requires knowledge of the trace of such matrices, it is sufficient to analyze only the algorithm's impact on the parameter error vector with respect to R_uu. It is therefore proposed to decompose a given matrix K into a first part that lies in the modal space of the autocorrelation matrix R_uu of the driving process u_k and a second part in its orthogonal complement space, that is:

K = b₀I + b₁R_uu + ... + b_{M−1}R_uu^{M−1} + K^⊥ = P(R_uu) + K^⊥.   (E.3)

Here, P(R_uu) denotes a polynomial in R_uu. Note that due to the Cayley-Hamilton theorem, an exponent larger than M − 1, with M denoting the system order, is not required [66].
Lemma E.1 Any Hermitian matrix K can be decomposed into a part from the subspace of a given modal space R_u = span{I, R_uu, R_uu², ..., R_uu^{M−1}} and its orthogonal complement subspace R_u^⊥, for which

tr[K^⊥ R_uu^l] = 0

for any value of l = 0, 1, 2, ....

Proof: The optimal set of coefficients to approximate the Hermitian matrix K is found by

min_{b₀, b₁, ..., b_{M−1}} tr[(K − P(R_uu))(K − P(R_uu))^T],   (E.4)
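Lemma E.1 can be checked numerically: projecting a symmetric K onto span{I, R_uu, ..., R_uu^{M−1}} under the trace inner product leaves a remainder that is trace-orthogonal to every power of R_uu. A sketch (the matrices below are random illustrative examples):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 5
B = rng.standard_normal((M, M))
Ruu = B @ B.T + M * np.eye(M)                  # example SPD autocorrelation matrix
Ruu /= np.linalg.norm(Ruu, 2)                  # scaling for numerical conditioning
S = rng.standard_normal((M, M))
K = 0.5 * (S + S.T)                            # arbitrary symmetric K

basis = [np.linalg.matrix_power(Ruu, l) for l in range(M)]
G = np.array([[np.trace(Bi @ Bj) for Bj in basis] for Bi in basis])   # Gram matrix
rhs = np.array([np.trace(Bi @ K) for Bi in basis])
b = np.linalg.solve(G, rhs)                    # optimal coefficients of (E.4)
P = sum(bl * Bl for bl, Bl in zip(b, basis))   # modal-space part P(Ruu)
K_perp = K - P                                 # orthogonal-complement part
residual = max(abs(np.trace(K_perp @ np.linalg.matrix_power(Ruu, l)))
               for l in range(2 * M))          # also l >= M, via Cayley-Hamilton
```

The residual traces vanish for all powers l, including l ≥ M, since those powers are themselves polynomials in the first M powers.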
Example 1: Assume first a specific solution for a real-valued Gaussian random process and for K₀ = R_uu⁰ = I, that is, K₀ is a member of the modal space R_u of R_uu. If it is for example assumed that the initial parameter estimate is ŵ₀ = 0 and an average over many possible systems w is performed, K₀ = E[ww^T] = I can be a realistic assumption. If on the other hand a-priori knowledge on the set of systems is present, other values may be more realistic (see for example [44] for impulse responses of rooms, in which case K₀ is of diagonal shape). In the first step
K₁ = I − 2μR_uu + μ²(2R_uu² + R_uu tr[R_uu])   (E.8)

is obtained; that is, K₁ is a second-order polynomial in R_uu and thus lies in the modal space of R_uu. Assume now that K_k develops into an arbitrary polynomial in R_uu. How does it change from time instant k to k + 1?
In other words, it remains a polynomial in R_uu. The same is in fact true for K₀ being any polynomial in R_uu. It can thus be concluded that the LMS update equation under a real-valued Gaussian process forces the parameter error vector covariance matrix K₀ = P(R_uu) to evolve as a polynomial in the modal space of R_uu. Terms of the orthogonal complement are never generated.
Example 2: Let us now assume that the initial covariance matrix is entirely from the orthogonal complement R_u^⊥, that is K₀ = K^⊥. In the first step
The part K₁^⊥ from the orthogonal complement space thus only contributes to this space as K₂^⊥ but has no influence in the modal space of R_uu. Thus, any component from the orthogonal complement will remain there and will not generate a component in the modal space of R_uu.
A general K₀ will be a linear combination as shown in (E.3). Take for example a fixed system w to be identified. In this case K₀ = ww^T. This value can be decomposed into P(R_uu) in the modal space of R_uu and a component K^⊥ from its orthogonal complement. The polynomial will evolve into a polynomial and thus stay in the modal space and contribute to the learning performance terms, while the perpendicular terms will not contribute to the algorithm's performance curves under the trace operation. This also allows the description of the evolution of the individual components: starting with K_k = K_k^∥ + K_k^⊥, with K_k^∥ ∈ R_u and K_k^⊥ ∈ R_u^⊥, a set of homogeneous equations is obtained:

K_{k+1}^∥ = K_k^∥ − μR_uu K_k^∥ − μK_k^∥ R_uu + μ²(2R_uu K_k^∥ R_uu + R_uu tr[K_k^∥ R_uu]),   (E.12)

K_{k+1}^⊥ = K_k^⊥ − μR_uu K_k^⊥ − μK_k^⊥ R_uu + 2μ²R_uu K_k^⊥ R_uu,   (E.13)

which in turn allows us to formulate a first statement for Gaussian driving processes.
Lemma E.2 Assume the driving process u_k = Ax_k to be a linearly filtered white Gaussian process x_k with E[x_k x_k^T] = I_{M+P} and A an upper-right Toeplitz matrix performing the linear filtering. Under the IA, the initial parameter error vector covariance matrix K₀ of the LMS algorithm evolves 1) into a polynomial in AA^T = R_uu of the modal space of R_uu, solely responsible for the mismatch and the misadjustment of the algorithm, and 2) a part in its orthogonal complement which has no impact on the performance measures.
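The two homogeneous recursions can be simulated directly (random example matrices, real-valued Gaussian case): a component started in the orthogonal complement keeps tr[K_k^⊥ R_uu^l] = 0 at every iteration of (E.13), i.e. it never shows up in the performance measures, and it contracts over time.

```python
import numpy as np

rng = np.random.default_rng(5)
M, mu = 4, 0.05
B = rng.standard_normal((M, M))
Ruu = B @ B.T + M * np.eye(M)
Ruu /= np.linalg.norm(Ruu, 2)                  # scale spectrum into (0, 1]
basis = [np.linalg.matrix_power(Ruu, l) for l in range(M)]

# construct K0 entirely in the orthogonal complement (projection as in Lemma E.1)
S = rng.standard_normal((M, M))
K = 0.5 * (S + S.T)
G = np.array([[np.trace(Bi @ Bj) for Bj in basis] for Bi in basis])
b = np.linalg.solve(G, np.array([np.trace(Bi @ K) for Bi in basis]))
Kp = K - sum(bl * Bl for bl, Bl in zip(b, basis))
norm0 = np.linalg.norm(Kp)

leak = 0.0
for _ in range(50):                            # iterate recursion (E.13)
    Kp = Kp - mu * (Ruu @ Kp + Kp @ Ruu) + 2 * mu ** 2 * Ruu @ Kp @ Ruu
    leak = max(leak, max(abs(np.trace(Kp @ Bl)) for Bl in basis))
```

The trace leakage stays at numerical zero while the norm of the orthogonal component decays, matching "remain there or die out".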
As such examples are rather intuitive for the particular case of a Gaussian driving process (spherically invariant processes as a generalization of Gaussian processes can be included straightforwardly), it is of interest what can be said about larger classes of driving processes. To achieve this goal a few considerations with respect to the driving process are required.

Driving Process: The properties of Lemma E.2 are not only maintained by Gaussian random processes but by a much larger class of driving processes. It will be shown that these properties hold for random processes that are constructed as a linearly filtered white, zero-mean random process u_k = A[x_k] = Σ_{m=0}^P a_m x_{k−m}, whose only conditions are that:
Lemma E.3 Assume the driving process u_k = A[x_k] to originate from a linearly filtered white random process x_k so that u_k = Ax_k with x_k^T = [x_k, x_{k−1}, ..., x_{k−N+1}], A denoting an upper-right Toeplitz matrix with the correlation filter impulse response and x_k satisfying the conditions (E.14)-(E.22). The parameter error vector covariance matrix K₀ = K₀^∥ + K₀^⊥ of the LMS algorithm essentially (with an error of order O(μ²/M)) evolves into a polynomial in AA^T in the modal space of R_uu, while terms in its orthogonal complement K^⊥ remain there or die out.

¹Alternatively, the IA can be removed by employing particular processes in which each element of the regression vector u_k^T = [u_{k,1}, u_{k,2}, ..., u_{k,M}] is generated by a filtered version of individual processes x_{k,1}, ..., x_{k,M}. As such processes seem artificial, this approach is not followed here.
Note that this formulation may suggest that this is only true for linearly filtered processes of moving average (MA) type. As no condition on the order N of such a process is imposed, N can become arbitrarily large, and thus autoregressive (AR) processes or combinations (ARMA) can be resembled as well (e.g., see [28], Chapter 2.7).
Proof: The proof proceeds in two steps: first, rewriting (E.7) for K₀ = I to get to know the most important terms and mathematical steps based on a simpler formulation, and then refining the arguments for arbitrary transitions from K_k to K_{k+1}.

For K₀ = I, and recalling that R_uu = E[u_k u_k^T] = AE[x_k x_k^T]A^T = AA^T, the following is obtained:

K₁ = I − 2μAA^T + μ²AE[x_k x_k^T A^T A x_k x_k^T]A^T.   (E.23)

On the main diagonal of the M × M matrix AA^T identical elements are found: Σ_{i=0}^P |a_i|². Thus tr[A^T A] = tr[AA^T] = M Σ_{i=0}^P |a_i|², with P denoting the filter order of the MA process.
is found, where diag[L] denotes a diagonal matrix with the diagonal terms of a matrix L as entries. In case x_k ∈ ℂ the slightly different result

E[x_k x_k^H L x_k x_k^H] = m_x^{(2,2)} L + m_x^{(2,2)} tr[L] I_{M+P} + (m_x^{(4)} − 2m_x^{(2,2)}) diag[L]   (E.26)

is obtained.
For spherically invariant random processes (including Gaussian) the term (m_x^{(4)} − 3m_x^{(2,2)}) for real-valued signals, or (m_x^{(4)} − 2m_x^{(2,2)}) for complex-valued signals, vanishes, and thus the problem can be solved classically.
In our particular case L = A^T A ∈ ℝ^{(M+P)×(M+P)} with tr[A^T A] = M Σ_{i=0}^P |a_i|² and diag[AA^T] = Σ_{i=0}^P |a_i|² I_{M+P} = (tr[A^T A]/M) I_{M+P}. One problematic term remains, however: diag[A^T A]. At this point the following is proposed:
Univ.Prof. DI Dr.-Ing. Markus Rupp 143
a value that depends on the statistics of the process x_k. The term m_x^{(4)}/m_x^{(2,2)} − 3 is similar to the excess kurtosis E[|x − m_x|⁴]/(E[|x − m_x|²])² − 3 = m_x^{(4)}/(m_x^{(2)})² − 3. Processes with negative excess kurtosis are often referred to as sub-Gaussian processes, while a positive excess kurtosis leads to so-called super-Gaussian processes. This (slightly abused) terminology will be used correspondingly to discriminate the term m_x^{(4)}/m_x^{(2,2)} − 3. Thus sub-Gaussian processes in this sense take on γ_x values smaller than one, while super-Gaussian processes have values larger than one.
However, it is also noted that our approximation error Δε has an impact only in the case γ_x ≠ 1, and it vanishes not only for Gaussian pdfs but also with growing filter order M. Note further that the term in the LMS algorithm to which Approximation A2 applies is proportional to μ². It thus has no impact for small step-sizes, but certainly on the stability bound. A first conclusion thus is that the error on the parameter error vector covariance matrix due to this approximation is of O(μ²). Furthermore, the approximation error term is proportional to γ_x − 1, that is, proportional to 1/M. Approximation A2 can thus be concluded to cause an error of the parameter error vector covariance matrix of order O(μ²/M). The consequence that the applied approximation is negligible for large filter order M as well as for Gaussian-type processes is reflected in Lemma E.3 by the wording "essentially". This means that in extreme cases (small M and far from Gaussian) indeed a very small proportion can leak into the complementary space. At the first update with K₀ = I
Now the proof proceeds for general updates from K_k to K_{k+1}. While the terms that are linear in μ are straightforward, the quadratic part in μ needs more attention. This can be split into two parts, one in the modal space and one in its orthogonal complement.
The consequence of this statement is that the parameter error vector covariance matrix is forced by the driving process to remain in the modal space of the driving process. This is not only true for its initial values but at every time instant k. The components of the orthogonal complement remain there or die out. This statement will be addressed later in more detail in the context of step-size bounds for stability. Note also that for complex-valued processes the only difference in (E.33) is the occurrence of AA^H K_k AA^H rather than 2AA^T K_k AA^T.
is obtained, or equivalently
Since K_∞ exists only in the modal space of AA^T, diagonalizing both by the same unitary matrix³ leads to Q K_∞ Q^T = Λ_K and Q AA^T Q^T = Λ_u.

2Λ_u Λ_K − μ m_x^{(2,2)} (2Λ_u² Λ_K − γ_x Λ_u tr[Λ_u Λ_K]) = μ σ_v² Λ_u.   (E.39)
obtained by employing the matrix inversion lemma [28]: [P(Λ_u) + λ_u λ_u^T]^{-1} λ_u = P(Λ_u)^{-1} λ_u / [1 + λ_u^T P(Λ_u)^{-1} λ_u]. The final steady-state system mismatch is thus obtained, the only difference to classic solutions for SIRPs [53] being the term γ_x, which contains influences of the fourth-order moments m_x^{(4)} and m_x^{(2,2)}, with m_x^{(2,2)} also appearing explicitly.
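The rank-one inversion step employed above can be checked numerically. The following sketch (random diagonal P(Λ_u) and vector λ_u of illustrative dimension; these values are not from the text) verifies the identity [P + λλ^T]^{-1}λ = P^{-1}λ / (1 + λ^T P^{-1}λ):

```python
import numpy as np

rng = np.random.default_rng(4)
P = np.diag(rng.uniform(0.5, 2.0, 5))   # diagonal P(Lambda_u), illustrative
lam = rng.standard_normal(5)            # vector lambda_u, illustrative

# Left-hand side: solve with the rank-one-perturbed matrix directly.
lhs = np.linalg.solve(P + np.outer(lam, lam), lam)
# Right-hand side: matrix inversion lemma form using only the diagonal P.
rhs = np.linalg.solve(P, lam) / (1.0 + lam @ np.linalg.solve(P, lam))

print(np.allclose(lhs, rhs))   # True
```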
³ Even if A2 is not satisfied and K_∞ has a component in the orthogonal complement space, the method applying Q can be used. Although K_∞ is then not diagonalized and Λ_K is not of diagonal form, only its diagonal terms are of importance for the performance measures and will be considered later on.
For complex-valued driving processes the final steady-state system mismatch is given correspondingly by

tr[K_∞] = 1^T λ_K = [μ σ_v² Σ_i λ_i/(2 − μ m_x^{(2,2)} λ_i)] / [1 − μ m_x^{(2,2)} γ_x Σ_i λ_i/(2 − μ m_x^{(2,2)} λ_i)].   (E.46)
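To get a feeling for (E.46), the expression can be evaluated numerically. The sketch below uses illustrative eigenvalues λ_i of R_uu and parameter values (all chosen for this example only, not taken from the text):

```python
import numpy as np

# Illustrative quantities (not from the text): eigenvalues of R_uu,
# joint moment m_x^(2,2), factor gamma_x, step-size, and noise power.
lam = np.array([0.5, 1.0, 1.5, 2.0])
m22, gamma_x, mu, sigma_v2 = 1.0, 1.0, 0.05, 1e-3

# Steady-state mismatch (E.46): a weighted sum over the eigenvalues.
s = np.sum(lam / (2.0 - mu * m22 * lam))
trK = mu * sigma_v2 * s / (1.0 - mu * m22 * gamma_x * s)
print(trK)   # a small positive mismatch, growing with mu and sigma_v2
```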
Theorem E.1 Assuming the driving process u_k = A[x_k] to originate from a linearly filtered white random process x_k with properties (E.14)-(E.22), any parameter error vector covariance matrix K_0 evolves essentially into two parts: a polynomial in AA^T, stemming from its decomposition onto the modal space of R_uu, and a second part K_k^⊥ in its orthogonal complement. The LMS update affects these two parts independently of each other.
The proof is straightforward by applying all previous results. In other words, the complementary subspace part K_k^⊥ has no impact on the LMS performance measures and can thus be neglected, not only for Gaussian but for a large class of linearly filtered random processes.
A consequence of this theorem is the step-size bound that can be derived either from (E.44) or by Gershgorin's circle theorem from the matrix in (E.40). The result is identical either way and conservative for real-valued x_k:

0 < μ ≤ 2 / (m_x^{(2,2)} (2λ_max + γ_x tr[R_uu])).   (E.48)
Depending on the statistics of the driving process, a more or less conservative bound is obtained. It is worth distinguishing the sub- and super-Gaussian cases. For sub-Gaussian distributions γ_x < 1, while for super-Gaussian distributions γ_x > 1. The step-size bound thus varies with the distribution type more or less through tr[R_uu] in the bound (E.48). For SIRP (and thus Gaussian) distributions as well as for very long filters, γ_x = 1 and thus

0 < μ ≤ 2 / (3 m_x^{(2,2)} tr[R_uu]) ≤ 2 / (m_x^{(2,2)} (2λ_max + tr[R_uu])).   (E.49)
148 Adaptive Filters (preliminary)
This result is identical to the classic term 2/(3 tr[R_uu]) for Gaussian processes, as m_x^{(2,2)} = 1. Note that the results are conservative. For a statistically white driving process, for example, an exact bound leads to μ ≤ 2/[m_x^{(2,2)} tr[R_uu] (γ_x + 2/M)], thus a significantly larger bound, still depending on the distribution through the value of γ_x. For complex-valued processes x_k the bounds are obtained very similarly, simply substituting the term 2λ_max by λ_max.
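As a sanity check of these bounds, one can run LMS with a statistically white Gaussian driving process, for which R_uu = I, m_x^{(2,2)} = 1 and γ_x = 1, so that (E.49) reduces to μ ≤ 2/(3M). The following sketch (filter order, noise level, and data length are illustrative choices) adapts at half the bound and observes stable convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16            # filter order (illustrative choice)
sigma_v2 = 1e-4   # observation-noise variance (illustrative)
N = 20000

# White Gaussian driving process: R_uu = I, lambda_max = 1, tr[R_uu] = M,
# m_x^(2,2) = 1 and gamma_x = 1, so the bound (E.49) reads mu <= 2/(3M).
mu_bound = 2.0 / (3.0 * M)
mu = 0.5 * mu_bound          # well inside the stable range

w_true = rng.standard_normal(M)
w = np.zeros(M)
x = rng.standard_normal(N + M)
for k in range(N):
    u = x[k:k + M][::-1]                     # regression vector
    e = u @ w_true + np.sqrt(sigma_v2) * rng.standard_normal() - u @ w
    w += mu * u * e                          # LMS update

mismatch = np.sum((w - w_true) ** 2)
print(mismatch)   # small: stable adaptation for mu below the bound
```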
Further note that the components in the orthogonal complement space indeed vanish, as argued at the beginning of Section III. Take for example Eqn. (E.13) or (E.36) as the evolution of the orthogonal complement. It is straightforward to show that for the given step-size range in (E.48) or (E.49) the components K_k^⊥ vanish, as long as no new components are induced by violating Assumption A2.
Theorem E.2 Assuming the driving process u_k = C[x_k] to originate from a linearly filtered random process x_k, the very long adaptive LMS filter simplifies to:

λ_K = μ σ_v² [2Λ_u − μ m_x^{(2,2)} Λ_u² − μ m_x^{(2,2)} λ_u λ_u^T]^{−1} λ_u.   (E.50)
Proof: Following the fact that cyclic matrices can be diagonalized by DFT matrices, say F, we reconsider the term E[u_k u_k^T K_k u_k u_k^T], remembering that a process linearly filtered by a unitary filter F preserves its properties at the filter output for very long filters. We thus find

E[u_k u_k^T K_k u_k u_k^T] = F^H Λ_u^{1/2} E[f_k f_k^H (Λ_u^{H/2} F K_k F^H Λ_u^{1/2}) f_k f_k^H] Λ_u^{H/2} F   (E.52)
= F^H Λ_u^{1/2} E[f_k f_k^H (Λ_u^{H/2} Λ_K Λ_u^{1/2}) f_k f_k^H] Λ_u^{H/2} F.   (E.53)
Note that the filter is now formulated in terms of a complex-valued driving process f_k = F x_k. We now notice that the center term of the last equation is of diagonal form, simplifying the terms to

E[f_k f_k^H L f_k f_k^H] = m_f^{(2,2)} L + m_f^{(2,2)} tr(L) I + (m_f^{(4)} − 2 m_f^{(2,2)}) L   (E.54)
= m_f^{(2,2)} tr(L) I + m_f^{(2,2)} L,   (E.55)

with the particular solution L = Λ_u^{H/2} Λ_K Λ_u^{1/2} = Λ_u Λ_K. Note that we have used the fact that m_f^{(4)} = 2 m_f^{(2,2)}, as shown in the appendix. The parameter error vector covariance matrix now evolves as
This shows that for very large filter orders our previous approximations hold exactly and the parameter error vector covariance matrix K_k indeed remains in the modal space of the driving process u_k defined by the DFT matrices of order M.
and finally

λ_K = β_∞ [2Λ_u − μ m_f^{(2,2)} Λ_u²]^{−1} λ_u,   (E.60)

with

β_∞ = μ σ_v² / (1 − μ m_f^{(2,2)} Σ_i λ_i/(2 − μ m_f^{(2,2)} λ_i)).   (E.61)
Note that this result for the long filter depends only on the joint moment m_f^{(2,2)} of the DFT of the driving process. As shown in the appendix, this moment takes on the same value for most distributions. This explains why the long LMS filter behaves more or less identically, independently of the driving process, as long as the correlation is the same. The interested reader may also compare this to an older publication by Gardner [19], in which the fourth-order moment m_x^{(4)} was emphasized for purely white driving processes.
The step-size bound for the very long filter is thus considerably larger (by one third) than for short lengths. An explicit dependency on the distribution or on the length M of the filter is no longer present. Note, however, that m_f^{(2,2)} depends on M, and the correction term γ_x can be applied to the bound (E.64) as well to address small filters.
There is also an alternative bound possible now. As the eigenvalues λ_i simply originate from the cyclic matrices C that linearly filter the driving process, they are obtained by a DFT on the filter matrices CC^T, or equivalently correspond to the powers of the spectrum of u_k at equidistant frequencies 2π/M. This allows an alternative bound via λ_max ≤ max_Ω |C(e^{jΩ})|², thus

0 < μ ≤ 1 / (m_f^{(2,2)} max_Ω |C(e^{jΩ})|²) ≤ 1 / (m_f^{(2,2)} λ_max).   (E.65)
This bound relates more to the spectral variations of the driving process, while the former bound including the trace term focuses more on the gain of the correlation filter. A similar bound was already proposed by Butterweck [7] in the context of LMS analysis following a wave-theoretical argument. In our notation it reads:

0 < μ_Butterweck ≤ 1 / (m_x^{(2)} M max_Ω |C(e^{jΩ})|² / tr(C^T C)).   (E.66)
It is thus reassuring to learn that classical matrix approaches lead to similar results. In Butterweck's analysis the fourth-order joint moment m_f^{(2,2)} is not accounted for. In particular for spherically invariant processes that are not Gaussian, this moment plays a crucial role.
Appendix F
Both sets of vectors thus build a basis of a linear vector space of dimension four. While the first set {e_1, e_2, e_3, e_4} is an orthonormal basis (normalized to unity and mutually orthogonal), the second set {g_1, g_2, g_3, g_4} is simply a basis of this space.
Let us now consider a matrix H ∈ ℂ^{N×M}. Obviously, H describes a mapping of vectors x ∈ ℂ^{M×1} onto vectors y ∈ ℂ^{N×1}:

y = Hx.   (F.1)
The range space of the matrix H is the linear vector space spanned by its columns, thus

R(H) = {Hx | x ∈ ℂ^{M×1}}.   (F.2)
That R(H) is indeed a linear space follows from the property that if {Hx_1, Hx_2} ∈ R(H) for some {x_1, x_2}, then also c_1 Hx_1 + c_2 Hx_2 ∈ R(H). Such properties do not allow us to conclude whether the mapping (F.1) is unique, that is, whether there are two different x_1 and x_2 that generate the same y. If this is the case, we have H[x_1 − x_2] = 0. If such vectors exist, their differences define the nullspace of H,

N(H) = {x ∈ ℂ^{M×1} | Hx = 0}.   (F.3)

Repeating the above argument, we can show that the nullspace is a linear space too. In summary, we find that the uniqueness of a solution depends on the nullspace: if it contains only the zero vector, the solution is unique.
Lemma F.1 (column space and nullspace) We have the following properties:
N(H) = R(H^H)^⊥, N(H^H) = R(H)^⊥, R(H) = N(H^H)^⊥, R(H^H) = N(H)^⊥.
Proof: Let us consider the first property. Let x ∈ N(H). Then Hx = 0, and hence y^H Hx = x^H H^H y = 0 for all y. Thus x is orthogonal to R(H^H), i.e., x ∈ R(H^H)^⊥, and since x ∈ N(H) was arbitrary, we must have N(H) ⊆ R(H^H)^⊥. Starting the argument instead with x ∈ R(H^H)^⊥, we conclude that x ∈ N(H) and thus R(H^H)^⊥ ⊆ N(H). Therefore N(H) = R(H^H)^⊥. The remaining properties follow accordingly.
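Lemma F.1 can be illustrated numerically: an SVD delivers orthonormal bases of N(H) and of R(H^H), and the two spaces are orthogonal. A small NumPy sketch with an illustrative rank-deficient H:

```python
import numpy as np

rng = np.random.default_rng(1)
# A 5x7 matrix of rank 3 (product of random 5x3 and 3x7 factors).
H = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))

U, s, Vh = np.linalg.svd(H)
r = int(np.sum(s > 1e-10))     # numerical rank
null_H = Vh[r:].T              # columns span N(H)
range_HH = Vh[:r].T            # columns span R(H^H)

# N(H) = R(H^H)^perp: nullspace vectors are annihilated by H and
# orthogonal to every vector in the range of H^H.
print(np.allclose(H @ null_H, 0))                 # True
print(np.abs(range_HH.T @ null_H).max() < 1e-10)  # True
```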
Lemma F.2 R(H^H) = R(H^H H).
Proof: With Lemma F.1 it is sufficient to show that N(H) = N(H^H H). Let x ∈ N(H^H H); then H^H Hx = 0. Thus we also have x^H H^H Hx = ‖Hx‖₂² = 0, and Hx = 0 is equivalent to x ∈ N(H); therefore N(H^H H) ⊆ N(H). Starting from the other end: let x ∈ N(H); then Hx = 0 and also H^H Hx = 0. Thus x ∈ N(H^H H) and therefore N(H) ⊆ N(H^H H). Hence N(H) = N(H^H H) and thus also R(H^H) = R(H^H H).
Appendix G
A solution x_o to this problem, if it exists, will in general not be a stationary (or extremal) point of J(x). It thus need not be a point (and in general is not one) at which the gradient of J(·) vanishes.

J(x) = x² + x − 2,  f(x) = x²,  b = 1.

The solutions of f(x) = 1 are x = ±1, while the extremal point of J(x) lies at x = −1/2. Under the constraint f(x) = 1 we obtain the two candidates x = ±1 with J(1) = 0 and J(−1) = −2; only x = −1 minimizes J(x).
In general such an elimination is not necessarily available and may be difficult to obtain. The method of Lagrangian multipliers that we explain next offers a procedure which may lead to the desired result. We first have to consider the differentials of J(x) and f(x):

dJ(x) = (∂J/∂x_1) dx_1 + (∂J/∂x_2) dx_2 + ... + (∂J/∂x_n) dx_n   (G.3)

df(x) = (∂f/∂x_1) dx_1 + (∂f/∂x_2) dx_2 + ... + (∂f/∂x_n) dx_n.   (G.4)
df(x_o) = 0.   (G.5)

J(x) = x² + x − 2,  f(x) = x²,  b = 1.

(2x_o + 1) − 2λx_o = 0,
Summarizing, we can say that the method of Lagrangian multipliers transforms an optimization problem J(x) with constraints f_i(x) = b_i into a new optimization problem

V(x, λ) = J(x) − Σ_{i=1}^{n} λ_i [f_i(x) − b_i]   (G.8)

whose solution satisfies

dV(x_o) = 0,  f_i(x_o) = b_i.
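For the scalar example above, the Lagrangian reads V(x, λ) = x² + x − 2 − λ(x² − 1), with stationarity condition (2x + 1) − 2λx = 0. The short sketch below checks both feasible points and picks the constrained minimizer:

```python
# Feasible points of the constraint f(x) = x^2 = 1:
candidates = [1.0, -1.0]

def J(x):
    return x * x + x - 2.0

# For each candidate, solve the stationarity condition (2x + 1) - 2*lam*x = 0
# for the multiplier lam and confirm that it is indeed satisfied.
for x in candidates:
    lam = (2 * x + 1) / (2 * x)
    assert abs((2 * x + 1) - 2 * lam * x) < 1e-12

best = min(candidates, key=J)
print(best, J(best))   # -1.0 -2.0 -> the constrained minimum
```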
If random processes with particular spectral properties are desired, they can be designed by filtering a white process. The properties of the process are then defined by the filter parameters. Gaussian processes are of particular interest, as they do not lose their Gaussian distribution under linear filtering; the filtering merely changes the variance and autocorrelation of the process. Essentially there are three important classes of filtered model processes: AR, MA, and ARMA, the last being a superposition of the first two.
x_k = Σ_{l=1}^{P} a_l x_{k−l} + v_k   (H.1)
Here v_k ∼ N(0, 1) is the Gaussian driving source with unit variance. AR processes typically appear when resonances are to be modeled. Speech signals, for example, can be modeled well by AR processes, as the vocal tract in mouth and throat exhibits many such resonances. Even sinusoidal or narrow-band signals with heavy noise can be modeled well. A further advantage of AR processes is that the autocorrelation matrix (typically the autocovariance matrix, as the signals have zero mean) can be computed directly from the AR parameters a_1, a_2, ..., a_P and vice versa. For this we only need to multiply the process in (H.1) by x_k (for complex-valued processes by x_k^*) and compute the expectation. We thus obtain:
E[x_k²] = r_xx(0) = Σ_{l=1}^{P} a_l r_xx(l) + 1.   (H.2)
The last term E[x_k v_k] = E[v_k v_k] = 1 is obtained since, apart from v_k itself, x_k depends only on past values of the white process v_k. By multiplying (H.1) with previous values x_{k−l} we further obtain
If we want to compute the spectrum of such a process, we can use the generator equation (H.1) and find:

s_xx(Ω) = 1 / (1 − Σ_{l=1}^{P} a_l e^{jΩl}) · 1 / (1 − Σ_{l=1}^{P} a_l e^{−jΩl}).   (H.4)

The recursive filtering preserves the property of exhibiting sharp resonances; it is poles close to the unit circle that ensure this property.
A problem with AR processes in simulations are their initial values. Note that when starting at time instant zero, the past values x_{k−1}, x_{k−2}, ... are undefined. They are thus often set to zero, which is incorrect, as the joint density of these variables also needs to be Gaussian. Since recursive processes forget their initial values over time, it is recommended to let them run for a while first and only then use their outputs.
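The relation (H.2) can be checked in a small simulation. For an AR(1) process x_k = a_1 x_{k−1} + v_k it yields r_xx(0) = 1/(1 − a_1²). The sketch below (coefficient, sample size, and burn-in are illustrative choices) discards a burn-in phase, as recommended above, and compares the estimated variance with the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(2)
a1 = 0.9                    # AR(1) coefficient (illustrative)
N, burn = 200000, 2000      # sample size and discarded burn-in

v = rng.standard_normal(N + burn)
x = np.zeros(N + burn)
for k in range(1, N + burn):
    x[k] = a1 * x[k - 1] + v[k]    # generator (H.1) with P = 1
x = x[burn:]                       # forget the (incorrect) zero start

# (H.2) with P = 1 and r_xx(1) = a1 * r_xx(0) gives r_xx(0) = 1/(1 - a1^2).
r0_theory = 1.0 / (1.0 - a1 ** 2)
r0_est = np.mean(x * x)
print(r0_theory, r0_est)   # the estimate is close to the theoretical value
```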
However, different from the AR process, the computation of the filter coefficients is now not so straightforward, as we do not obtain a set of linear equations. Iterative solutions exist for this problem.
The spectrum of the MA process can again be computed from the generator (H.5):

s_xx(Ω) = (1 − Σ_{l=1}^{Q} b_l e^{jΩl}) (1 − Σ_{l=1}^{Q} b_l e^{−jΩl}).   (H.8)
Different from the AR process, the MA process can describe zeros in the spectrum rather well. The initial values can be generated easily by selecting independent Gaussian variables v_{k−1}, v_{k−2}, ..., v_{k−Q} of unit variance.
A combination of AR and MA processes, the so-called ARMA processes, allows for both resonances and zeros in the spectrum. ARMA models offer high flexibility, as only few coefficients are needed to model specific spectra. However, it can be difficult to compute the filter coefficients from a given acf.
Appendix I
Appendix J
Small-Gain Theorem
The small gain theorem can be interpreted as a generalization of stability results for the linear case. It is well known for linear time-invariant systems that a closed-loop system is stable if and only if the open-loop system has a gain smaller than one. The small gain theorem extends this statement to arbitrary nonlinear systems. Consider an input signal x_k, k = 1..N, described by the vector x_N of dimension 1 × N. The response of a system H_N to such a signal is given by

y_N = H_N x_N.   (J.1)
Definition 1: A mapping H is called l-stable if two positive constants γ, β exist such that for all input signals x_N the output is upper bounded by:
Definition 2: The smallest positive constant γ which satisfies l-stability is called the gain of the system H_N.
Remark: So-called bounded-input bounded-output (BIBO) stability is nothing else but l_∞-stability.
Let us now consider a feedback system with two components H_N and G_N of individual gains γ_h and γ_g. We find:

y_N = H_N h_N = H_N [x_N − z_N]   (J.3)
z_N = G_N g_N = G_N [u_N + y_N].   (J.4)

Theorem J.1 (Small Gain Theorem) If the gains γ_h and γ_g are such that

γ_h γ_g < 1,   (J.5)
‖h_N‖ ≤ [‖x_N‖ + γ_g ‖u_N‖ + β_g + γ_g β_h] / (1 − γ_h γ_g)   (J.6)

‖g_N‖ ≤ [‖u_N‖ + γ_h ‖x_N‖ + β_h + γ_h β_g] / (1 − γ_h γ_g).   (J.7)
Proof: We have

h_N = x_N − z_N   (J.8)
g_N = u_N + y_N.   (J.9)
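A minimal numerical illustration of the theorem, assuming H and G are pure scalar gains (so β_h = β_g = 0) with γ_h γ_g < 1: iterating the loop equations (J.3)/(J.4) converges, and the resulting signal respects the bound (J.6). All numeric values are illustrative choices:

```python
gh, gg = 0.5, 0.8        # gains, gh * gg = 0.4 < 1
x, u = 1.0, 0.3          # external inputs (illustrative)

h = z = 0.0
for _ in range(200):     # fixed-point iteration of the feedback loop
    h = x - z            # h = x - z   (J.3, inner signal)
    y = gh * h           # y = H h     (pure gain H)
    g = u + y            # g = u + y   (J.4, inner signal)
    z = gg * g           # z = G g     (pure gain G)

# Bound (J.6) with beta_h = beta_g = 0:
bound_h = (abs(x) + gg * abs(u)) / (1.0 - gh * gg)
print(abs(h) <= bound_h)   # True: the loop signal stays bounded
```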
[1] A. Bahai, M. Rupp, “Training and tracking of adaptive DFE algorithms under IS-136,”
SPAWC97 in Paris, April 1997.
[2] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.
[3] N.J. Bershad, “Analysis of the normalized LMS algorithm with Gaussian inputs,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–34, no. 4, pp. 793–806,
Aug. 1986.
[4] N.J. Bershad, “Behavior of the ε-normalized LMS algorithm with Gaussian inputs,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–35, no. 5, pp. 636–644,
May 1987.
[6] H.-J. Butterweck, ”Iterative analysis of the steady-state weight fluctuations in LMS-
type adaptive filters,” IEEE Transactions on Signal Processing, vol. 47, pp. 2558-2561,
Sept. 1999.
[7] H.-J. Butterweck, ”A wave theory of long adaptive filters,” IEEE Transactions on
Circuits and Systems I: Fundamental Theory and Applications, vol. 48, pp. 739-747,
June 2001.
[8] Theo A.C.M. Claasen, Wolfgang F.G. Mecklenbräuker, “Comparison of the conver-
gence of two algorithms for adaptive FIR digital filters,” IEEE Trans. Acoust., Speech,
Signal Processing, vol. ASSP–29, no. 3, pp. 670–678, June 1981.
[9] Peter M. Clarkson, Paul R. White, “Simplified analysis of the LMS adaptive filter using
a transfer function approximation,” IEEE Trans. Acoust., Speech, Signal Processing,
vol. ASSP–35, no. 7, pp. 987–993, July 1987.
[10] S.C. Douglas, T.H.–Y. Meng, “Exact expectation analysis of the LMS adaptive filter
without the independence assumption,” Proc. ICASSP, San Francisco, pp. IV61–IV64,
Apr. 1992.
[24] M. Hajivandi, W.A. Gardner, “Measures of tracking performance for the LMS algo-
rithm,” IEEE Trans. Acoustics, Speech and Signal Proc., vol. ASSP-38, no. 11, pp.
1953-1958, Nov. 1990.
[25] B. Hassibi, A.H. Sayed, and T. Kailath, “LMS and Backpropagation are minimax
filters,” in Neural Computation and Learning, ed. V. Roychowdhury, K. Y. Siu, and
A. Orlitsky, Ch. 12, pp. 425–447, Kluwer Academic Publishers, 1994.
[26] B. Hassibi, A.H. Sayed, and T. Kailath,“H∞ optimality of the LMS algorithm,” IEEE
Trans. Signal Processing, vol. 44, no. 2, pp. 267–280, Feb. 1996.
[28] Simon Haykin, Adaptive Filter Theory, 1st edition, Prentice Hall, 1986.
[29] Simon Haykin, Adaptive Filter Theory, 3rd edition, Prentice Hall, 1996.
[30] Simon Haykin, Adaptive Filter Theory, 4th edition, Prentice Hall, 2002.
[31] L.L. Horowitz and K.D. Senne, ”Performance advantage of complex LMS for control-
ling narrow-band adaptive arrays,” IEEE Transactions on Signal Processing, vol. 29 ,
no. 3, pp.722-736, June 1981.
[32] S. Hui, S.H. Zak, “The Widrow-Hoff algorithm for McCulloch-Pitts type neurons,”
IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 924-929, Nov. 1994.
[33] D.R. Hush, B.G. Horne, “Progress in supervised neural networks,” IEEE Signal Pro-
cessing Magazine, vol. 10, no. 1, pp. 8-39, Jan. 1993.
[35] R. E. Kalman, “Design of a self–optimizing control system,” Trans. ASME, vol. 80,
pp. 468–478, 1958.
[36] W. Kellermann, “Analysis and Design of Multirate Systems for Cancellation of Acous-
tical Echoes,” Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing, NY, 1988, vol. 5, pp. 2570-2573.
[39] R.P. Lippmann, “An introduction to computing with neural nets,” IEEE Acoustics,
Speech and Signal Processing Magazine, vol. 4, no. 2, pp.4-22, April 1987.
[40] Guozho Long, Fuyun Ling, John A. Proakis, “The LMS algorithm with delayed coeffi-
cient adaptation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–37, no.
9, pp. 1397–1405, Sep. 1989.
[41] Guozho Long, Fuyun Ling, John A. Proakis, “Corrections to ‘The LMS algorithm with
delayed coefficient adaptation’,” IEEE Trans. Signal Processing, vol. SP–40, no. 1, pp.
230–232, Jan. 1992.
[42] Vijay K. Madisetti: Editor, The DSP Handbook, CRC Press, 1997.
[44] S. Makino, Y. Kaneda and N. Koizumi, “Exponentially weighted step-size NLMS adap-
tive filter based on the statistics of a room impulse response,” IEEE Trans. on Speech
and Audio Processing, vol. 1, no. 1, pp. 101–108, Jan. 1993.
[45] S. Marcos, O. Macchi, “Tracking capability of the least mean square algorithm: appli-
cation to an asynchronous echo canceller,” IEEE Trans. Acoustics, Speech and Signal
Proc., vol. ASSP-35, no. 11, pp. 1570-1578, Nov. 1987.
[46] J. E. Mazo, “On the independence theory of equalizer convergence,” Bell Syst. Tech.
J., vol. 58, pp. 963–993, 1979.
[50] K.S. Narendra, K. Parthasarathy, “Gradient methods for the optimization of dynamical
systems containing neural networks,” IEEE Trans. on Neural Networks, vol. 2, pp. 252-
262, March 1991.
[51] R. Nitzberg, “Normalized LMS algorithm degradation due to estimation noise,” IEEE
Trans. Aerosp. Electron. Syst., vol. AES–22, no. 6., p. 740–749, Nov. 1986.
[52] K. Ozeki, T. Umeda, “An adaptive filtering algorithm using orthogonal projection to
an affine subspace and its properties,” Electronics and Communications in Japan, vol.
67–A, no. 5, pp. 19–27, 1984.
[53] M. Rupp, “The behavior of LMS and NLMS algorithms in the presence of spherically
invariant processes,” IEEE Trans. Signal Processing, vol. SP–41, no. 3, pp. 1149-1160,
March 1993.
[54] M. Rupp, R. Frenzel “The behavior of LMS and NLMS algorithms with delayed coef-
ficient update in the presence of spherically invariant processes,” IEEE Trans. Signal
Processing, vol. SP–42, no. 3, pp. 668-672, March 1994.
[55] M. Rupp, “Bursting in the LMS algorithm,” IEEE Transactions on Signal Processing,
vol. 43, no. 10, pp. 2414-2417, Oct. 1995.
[56] M. Rupp, A. H. Sayed, “On the stability and convergence of Feintuch’s algorithm for
adaptive IIR filtering,” Proc. IEEE Conf. ASSP, Detroit, MI, May 1995.
[59] M. Rupp, “Saving complexity of modified filtered-X-LMS and delayed update LMS
algorithm,” IEEE Transactions on Circuits & Systems, pp. 57-59, Jan. 1997.
[60] M.Rupp, A.H. Sayed, “Improved convergence performance for supervised learning of
perceptron and recurrent neural networks: a feedback analysis via the small gain
theorem,” IEEE Transaction on Neural Networks, vol. 8, no. 3, pp. 612-623, May
1997.
[61] M. Rupp, A.H. Sayed, “Robust FxLMS algorithm with improved convergence perfor-
mance,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 78-85,
Jan. 1998.
[62] M. Rupp, “A family of adaptive filter algorithm with decorrelating properties,” IEEE
Transactions on Signal Processing, vol. 46, no. 3, pp. 771-775, March 1998.
[63] M. Rupp, “On the learning behavior of decision feedback equalizers,” 33rd. Asilomar
Conference, Monterey, California, Oct. 1999.
[64] M. Rupp, A.H. Sayed, “On the convergence of blind adaptive equalizers for constant
modulus signals,” IEEE Transactions on Communications, vol. 48, no. 5, pp. 795-803,
May 2000.
[65] M. Rupp, J. Cezanne, “Robustness conditions of the LMS algorithm with time-variant
matrix step-size,” Signal Processing, vol. 80, no. 9, Sept. 2000.
[66] M. Rupp and H.-J. Butterweck, ”Overcoming the independence assumption in LMS
filtering,” in Proc. of Asilomar Conference, pp. 607-611, Nov. 2003.
[67] A.H. Sayed and T. Kailath, “A state-space approach to adaptive RLS filtering,” IEEE
Signal Processing Magazine, vol. 11, no. 3, pp. 18–60, July 1994.
[68] A.H. Sayed, M. Rupp, “Error energy bounds for adaptive gradient algorithms,” IEEE
Transactions on Signal Processing, vol. 44, no. 8, pp. 1982-1989, Aug. 1996.
[69] A.H. Sayed, M. Rupp, “An l2-stable feedback structure for nonlinear H∞-adaptive
filtering,” Automatica, vol. 33, no.1, pp. 13-30, Jan. 1997.
[70] V.H. Nascimento, A.H. Sayed, “Are ensemble averaging learning curves reliable in evaluating the performance of adaptive filters?” Proceedings 32nd Asilomar Conference on Circuits, Systems, and Computers, pp. 1171-1174, Nov. 1998.
[71] A.H. Sayed, ”Fundamentals of Adaptive Filtering,” Wiley 2003.
[72] Solo, V. and X. Kong, Adaptive Signal Processing Algorithms: Stability and Perfor-
mance, Prentice Hall, New Jersey, 1995.
[73] S.D. Stearns, G. R. Elliot, “On adaptive recursive filtering,” Proceedings 10th Asilomar
Conference on Circuits, Systems, and Computers, pp. 5–11, Nov. 1976.
[74] K. Steiglitz, L. E. McBride, “A technique for the identification of linear systems,”
IEEE Trans. Autom. Control, vol. AC–10, pp. 461–464, 1965.
[75] G. Ungerboeck, “Theory on the speed of convergence in adaptive equalizers for digital
communication,” IBM J. Res. Develop., vol. 16, no. 6, pp. 546–555, 1972.
[76] M. Vidyasagar, Nonlinear Systems Analysis, Prentice Hall, New Jersey, second edition,
1993.
[77] S. A. White, “An adaptive recursive digital filter,” Proc. 9th Asilomar Conf. Circuits
Syst. Comput., pp. 21–25, Nov. 1975.
[78] Bernard Widrow, M. E. Hoff Jr., “Adaptive switching circuits,” IRE WESCON conv.
Rec., Part 4, pp. 96–104, 1960.
[79] E. Walach and B. Widrow, “The least-mean-fourth (LMF) adaptive algorithm and its
family,” IEEE Trans. Inform. Theory, vol. IT-30, pp.275-283, March 1984.
[80] B. Widrow and S. D. Stearns, Adaptive Signal Processing, NY: Prentice-Hall, Inc.,
1985.
[81] W. Wirtinger, ”Zur formalen Theorie der Funktionen von mehr komplexen
Veränderlichen,” Mathematische Annalen, vol. 97, pp. 357-376, 1927.