ADAPTIVE FILTERS
PRELIMINARY VERSION
2 Fundamentals of Stochastics  18
2.1 Least-Mean-Squares Estimators  18
2.2 Linear Least-Mean-Squares Estimators  23
2.3 Different Interpretations of the Wiener Solutions  27
2.4 Complexity of the Exact Wiener Solution  27
2.4.1 Durbin Algorithm  28
2.4.2 Levinson Algorithm  30
2.4.3 Trench Algorithm  31
2.5 The Steepest Descent Algorithm  32
2.6 Literature  37
Univ.Prof. DI Dr.-Ing. Markus Rupp 3
4.1.2 LS Estimation  62
4.1.3 Conditions on Excitation  63
4.1.4 Generalizations and special cases  63
4.1.5 Summary  64
4.2 Classic RLS Derivation  66
4.2.1 Underdetermined Forms  68
4.3 Stationary Behavior  70
4.4 Alternative Forms of LS Solutions  72
4.5 Literature  74
6 Generalized LS Methods  82
6.1 Recursive Algorithm  84
6.2 Robustness  85
E More General Derivation for the Mean Squared Parameter Error Vector  137
E.1 Decomposition of Matrices  137
E.2 Modal Space of LMS  138
E.3 Influence of Noise  145
E.4 Complete LMS Solution  147
E.5 Very Long Adaptive Filters  148
[Figure 1.1: Block diagram of a hybrid: loudspeaker, microphone, the echo path between them, the balancing network N, and the two-wire line to the switching board.]
During the past 30 years, adaptive filter algorithms have made their way into many
electronic products. In most cases, the user is not even aware of their existence,
which demonstrates their excellent performance. Adaptive filters adapt to a
permanently changing environment, by which their behaviour is kept optimal. In
the following, we will review various applications of adaptive filters. Figure 1.1 shows the block
diagram of a so-called hybrid (Ger.: Gabelschaltung (Wien-Brücke)). It allows for
connecting a two-wire line from the switching board to the two wires of a microphone on the
one hand, and to the two wires of a loudspeaker on the other hand. Perfect balancing is
achieved if the user does not hear his own voice in the loudspeaker. Such an optimal situation
is typically not given, as the far-end load (Ger.: Nachbildung N) is in general unknown.
Typically, a small leakage from the microphone to the loudspeaker is even considered good, since
then, the user is convinced that the device is working properly. However, in the context of
hands-free telephones (Ger.: Freisprechanlagen), this can cause acoustic feedback and thus
needs to be attenuated. The attenuation or even a complete cancellation of this echo signal
can be achieved if the leaking signal is subtracted at the loudspeaker input. The leaking
signal can be estimated by a convolution of the impulse response from the microphone to
the loudspeaker with the user’s voice signal. Since this impulse response is unknown, an
adaptive filter is required to estimate it.
[Figure 1.2: Two far-end speakers A and B connected via their hybrids (with balancing networks N) and a landline; each hybrid reflects an echo of the far-end signal back towards its origin.]
Figure 1.2 schematically depicts two far-end users, connected via their local hybrids and
a landline. At each point along the landline where a change from two wires to four wires
(and vice versa) is required, such hybrids are in use. For connections over long distances,
many such hybrids may occur. In this case, the hybrids outside the user equipment
do not pass on the local signals from microphone to loudspeaker; instead, far-end
signals are reflected back to their origin. For long-distance connections, time delays of 500 ms
and more lead to very disturbing echoes which force the end users to speak against their
own voices. If such an echo signal is additionally amplified along the transmission path,
under the assumption that the hybrids have an attenuation of 6 dB, it can happen that the
closed-loop connection (established via the acoustic coupling from the loudspeaker to the
microphone) has an attenuation of zero decibels or even less, and thus the system becomes
unstable. Adaptive filters on both ends can compensate for such effects and make the long-distance
connection stable. While local echo compensation typically deals with few filter
coefficients (<100), far-end echo compensation may require many coefficients (500-4000).
[Figure 1.3: Sound propagation in a room: the loudspeaker signal reaches the microphone via multiple reflections at the walls.]
Nowadays, almost all telephone sets feature hands-free equipment (Ger.: Freisprecheinrichtungen).
They are applied in offices as well as in video conferences, and even in
cars while driving. Figure 1.3 schematically depicts the sound propagation in a room (or
some other reflecting environment). The far-end speaker signal enters the room via the
loudspeaker. The corresponding sound is reflected at the walls and adds up with the voice
of the local speaker at the microphone. The resulting electric signal is transferred to the far
end, where the far-end speaker's own signal appears delayed as an echo. If the far-end speaker is also
using a hands-free telephone, the loop is closed and, in the worst case, a strong sinusoidal
whistling is audible (Ger.: Rückkopplungspfeifen). In such scenarios, an adaptive filter can
estimate the impulse response of the two-port system loudspeaker-room-microphone and
estimate the echo signal in order to subtract it. It is important to note that, depending on
the room size, the occurring impulse responses may be very long (typically several thousand
taps). In this application, the local speaker is seen as a disturbance for the adaptive filter
estimation. Nevertheless, it is the signal of interest which has to be transmitted, and this
necessitates a special treatment.
All applications discussed so far fall into the category of system identification. As shown
in Figure 1.4, the path of the echo is a linear system with an unknown impulse response.
By observing the input and output signals of this system, the adaptive filter “learns” (i.e.,
identifies) its impulse response. With the identified impulse response and the known input
signal, the filter can reconstruct the echo excluding the signal of interest, and by subtraction
the clean, echo-free signal can be obtained.
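The identification principle just described can be sketched in a few lines of code. The following minimal example is illustrative only (it is not part of these notes): a normalized LMS update, one of the standard adaptive algorithms, learns the impulse response of an unknown FIR echo path from the observed input and output.

```python
import numpy as np

rng = np.random.default_rng(0)

h = np.array([0.5, -0.4, 0.2, 0.1])    # unknown echo path (illustrative values)
M = len(h)

x = rng.standard_normal(5000)          # excitation (far-end signal)
d = np.convolve(x, h)[:len(x)]         # echo observed at the microphone

w = np.zeros(M)                        # adaptive filter coefficients
mu = 0.5                               # step size, 0 < mu < 2

for k in range(M, len(x)):
    xk = x[k - M + 1:k + 1][::-1]      # regression vector [x(k), ..., x(k-M+1)]
    e = d[k] - w @ xk                  # a priori error e(k)
    w += mu * e * xk / (xk @ xk)       # normalized LMS update

print(np.round(w, 3))                  # w approaches the unknown h
```

In the noise-free case the coefficient vector converges to the true impulse response; with the disturbance v(k) of Figure 1.4 present, it fluctuates around it.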
[Figure 1.4: System identification structure: the input x(k) drives both the unknown system h and the adaptive filter ŵ; the system output y(k), disturbed by v(k), minus the filter output ŷ(k) forms the error e(k).]
An entirely different problem is given in active noise control. Figure 1.5a depicts the
situation. A primary noise source (engine, hair dryer, ...) causes an undesired disturbance
at the position of the microphone. By a second, controlled source, the disturbing noise is
reduced at the microphone position by destructive interference. Often, direct access to
the primary source is not feasible. In this case, a second sensor captures a correlated signal
instead. Figure 1.5b exhibits the corresponding block diagram including the adaptive filter.
The adaptive filter does not estimate the impulse response corresponding to the transfer
function P̃, representing the path from the primary source to the microphone. Instead,
it only estimates the impulse response corresponding to this path reduced by H, which
represents the path between the secondary source and the microphone. This can cause non-causal
parts in the solution. We will later classify such a problem as a reference model with a linear
filter in the error path.
[Figure 1.5b block diagram: the reference x(k) drives the primary path P̃(z) (output d(k)) and the adaptive filter W; the filter output y(k) passes through the secondary path H(z) before being subtracted to form the error e(k).]
Figure 1.5: (a) Active noise suppression. (b) Equivalent block diagram of the control algo-
rithm.
In speech processing, adaptive filters are used in the context of linear prediction as
required for the vocoder principle. Figure 1.6 depicts the basic structure. The speech
signal is first delayed and then serves as input signal for the adaptive filter. In contrast, the
reference signal is the original speech signal without delay. Thus, the adaptive filter will aim
to approximate the speech signal based on past observations. Typical applications for linear
prediction are voice coders, which are used to reduce the data rates in speech communications.
Here, in the simplest case, only the prediction error signal is transmitted. Since this
signal carries less energy than the original signal, it can be quantized with fewer bits per
sample. More sophisticated is the so-called vocoder principle. As the spectrum of speech
signals remains approximately constant for about 10 ms, it is possible to train the adaptive filter
during this time period, and to transmit only the resulting filter coefficients (prediction
coefficients). The speech signal is then artificially synthesized applying these prediction
coefficients. With this technique, an enormous reduction in data rate can be achieved.
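The prediction structure can be illustrated with a short, hypothetical experiment (not from these notes): an adaptive predictor is trained on a synthetic AR(2) signal standing in for speech. The prediction error indeed carries far less power than the signal itself, and the filter coefficients approach the AR parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "speech-like" signal: an AR(2) process driven by white noise
N = 20000
u = rng.standard_normal(N)
d = np.zeros(N)
for k in range(2, N):
    d[k] = 1.5 * d[k - 1] - 0.7 * d[k - 2] + u[k]

M, mu = 2, 0.1                     # predictor order and NLMS step size
w = np.zeros(M)
e = np.zeros(N)
for k in range(M, N):
    xk = d[k - M:k][::-1]          # delayed samples [d(k-1), d(k-2)]
    e[k] = d[k] - w @ xk           # prediction error
    w += mu * e[k] * xk / (xk @ xk + 1e-12)

# the predictor removes the predictable part: var(e) << var(d)
print(np.var(d[1000:]), np.var(e[1000:]), w)
```

Transmitting e(k) (or, following the vocoder idea, only the two coefficients in w) instead of d(k) is exactly the data-rate reduction described above.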
[Figure 1.6: Linear prediction structure: the speech signal d(k), delayed by q⁻¹, is the input x(k) to the adaptive filter w; the prediction y(k) is subtracted from d(k) to form the error e(k).]
A further application of adaptive filters can be found in data transmission over frequency-selective
(time-dispersive) channels. Figure 1.7 shows the block diagram of such a data
transmission when employing an adaptive filter for equalization. A digital signal is
transmitted via a linear channel c and disturbed by additive noise v(k). At the receiver, a
digital filter needs to be adjusted such that a nonlinear one-to-one mapping can be applied
to unambiguously map the receive symbols (extracted from the filtered receive signal) to
the transmit symbol alphabet. Often, the choice of a suitable filter structure as well as the
possibly high number of filter coefficients are secondary problems. Then, the predominant
problem is posed by finding an adequate algorithm that allows for rapid tracking of
channel alterations (frequency-dispersive or time-variant channel) and at the same time
does not place high demands on numerical precision. In contrast to the previously
described system identification problem, here we do not have a reference signal available.
Yet, it can be obtained by introducing a training sequence which is known by both the
transmitter and the receiver. Of course, the purpose of data transmission is the transmission of unknown
data, and thus it is necessary to keep the training sequence very small compared to the
unknown data. Alternatively, the transmission of training sequences can be circumvented
if the decoded signals can be used as reference signals. However, this only works as long as
the errors remain small.
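A minimal training-based equalizer sketch (illustrative assumptions, not prescribed by these notes: BPSK symbols, a short minimum-phase channel, a normalized LMS update):

```python
import numpy as np

rng = np.random.default_rng(2)

N = 20000
s = rng.choice([-1.0, 1.0], size=N)      # BPSK training symbols
c = np.array([1.0, 0.5, 0.25])           # illustrative minimum-phase channel
x = np.convolve(s, c)[:N]                # receive signal ...
x += 0.01 * rng.standard_normal(N)       # ... plus additive noise v(k)

M, delay, mu = 11, 5, 0.2                # equalizer length, decision delay, step size
w = np.zeros(M)
for k in range(M, N):
    xk = x[k - M + 1:k + 1][::-1]        # receive samples inside the filter
    e = s[k - delay] - w @ xk            # error against the delayed training symbol
    w += mu * e * xk / (xk @ xk)         # normalized LMS update

z = np.convolve(x, w)[:N]                # equalized signal
ser = np.mean(np.sign(z[M:]) != s[M - delay:N - delay])
print(ser)                               # symbol error rate after training
```

The hard decision sign(·) plays the role of the nonlinear one-to-one mapping f in Figure 1.7; after training, it recovers the transmit alphabet almost without error. Replacing s(k − delay) by the decided symbols gives the decision-directed mode mentioned above.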
[Figure 1.7: Adaptive equalization: the transmit signal x(k) passes through the channel c; with the additive noise v(k) it forms the receive signal y(k); the equalizer w produces z(k), from which the decision device f delivers the symbol estimates x̂(k).]
Both applications, linear prediction and adaptive equalization, follow the same reference
structure, shown in Figure 1.8. In the case of linear prediction, we set c = q⁻¹ and ∆ = 0,
while in the case of adaptive channel equalization, c denotes the channel and the training
sequence is represented by the reference signal, delayed by ∆.
[Figure 1.8: Common reference structure: d(k) passes through c and the adaptive filter w to produce y(k), which is subtracted from d(k) delayed by z^{−∆} to form the error e(k).]
A similar but even more complicated problem is the nonlinear pre-distortion of power
amplifiers, which is commonly used in wireless communications. In order to achieve a very
high power efficiency, such amplifiers are operated in the nonlinear C or F mode. For
bandwidths of less than 1 MHz the occurring nonlinearity is typically memoryless (Saleh
model),

    A(r) = α_A r / (1 + β_A r²)
    Φ(r) = α_Φ r² / (1 + β_Φ r²),        (1.1)

and can be corrected by nonlinear one-to-one mappings in amplitude and phase. For larger
bandwidths however, memory effects emerge and become more and more pronounced
with increasing bandwidth. So-called Volterra series are one possibility to describe such a
nonlinear system with memory. For example, a Volterra series truncated at order P maps
an input signal u(n) to an output y(n) according to

    y(n) = Σ_{p=0}^{P} h_{p,n}[u(n)].        (1.2)
The advantage of this representation is that the output is linear in the coefficients. The
nonlinear effect is achieved by delaying the input signal and combining different delayed
versions in higher-order polynomials. A further advantage of a truncated Volterra series is
that its inverse can be derived symbolically, which is very important for pre-distortion. In
practice, however, Volterra series require a large number of coefficients. Figure 1.9 shows a
typical pre-distortion scenario. After the identification of the power amplifier, the system
is inverted and the so-obtained pre-distorter is placed before the power amplifier. In the
ideal case, the cascade of the pre-distorter followed by the power amplifier behaves linearly
(up to the saturation point).
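The memoryless Saleh characteristics of (1.1) are straightforward to evaluate. The sketch below uses illustrative parameter values (α_A, β_A, α_Φ, β_Φ are assumptions, not values from these notes):

```python
import numpy as np

# parameter values below are illustrative assumptions, not from these notes
alpha_A, beta_A = 2.0, 1.0        # AM/AM parameters
alpha_P, beta_P = np.pi / 3, 1.0  # AM/PM parameters

def saleh(u):
    """Memoryless Saleh model (1.1) applied to a complex baseband signal."""
    r = np.abs(u)
    gain = alpha_A / (1.0 + beta_A * r**2)           # A(r) / r
    phase = alpha_P * r**2 / (1.0 + beta_P * r**2)   # Phi(r)
    return u * gain * np.exp(1j * phase)

amps = np.linspace(0.0, 3.0, 301)            # input amplitude sweep
out = np.abs(saleh(amps.astype(complex)))
print(out.max())                             # saturates at alpha_A / (2 sqrt(beta_A))
```

The AM/AM curve A(r) peaks at r = 1/√β_A, which is the saturation behavior the pre-distorter has to invert (and beyond which no memoryless inverse exists).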
In order to reduce the complexity of the problem, other adaptive filter structures are
being employed. Most suitable are two-block models consisting of a linear dispersive block
and a memoryless nonlinear block. Depending on the order of these two blocks, the systems
are called Wiener model or Hammerstein model. As depicted in Figure 1.10, the structure
of the Wiener model is given by a linear filter followed by a memoryless nonlinearity. For
the Hammerstein model, the order of the blocks is reversed. Currently, it is not known how
to guarantee convergence of such adaptive filter structures.
[Figure: adaptive two-block structure with coefficient blocks b and a, the memoryless nonlinearity f(z), the delay q⁻¹, and the adaptive laws driving the coefficients.]
As a second categorization, adaptive algorithms can be divided into online and offline
algorithms. If all required data are available before the algorithm is started, it can run a
certain number of iterations until it stops, resulting in a (sub-)optimal solution. Hence,
algorithms operated in this offline mode are often called iterative algorithms. In
contrast, if new data become available with each iteration step, the algorithm operates in
online mode, which is also called recursive mode. An adaptive filter typically works in
recursive mode, since it tries to adapt to a permanently changing environment.
Example 1.1 (Iterative Algorithm) Consider the following update equation, starting
with w₀ = 0:

    w_k = w_{k−1} + (1/k!) a ;    k = 1, 2, ..., N

Example 1.2 (Recursive Algorithm) Consider the following update equation, starting
with w₀ = 0:

    w_k = w_{k−1} + (1/k!) a_k ;    k = 1, 2, ..., N

In the second (recursive) algorithm the data a_k change from step to step, while in the first (iterative)
algorithm, a remains constant. Question: where does the first algorithm end?¹
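Example 1.1 is easy to verify numerically: with the factorial step size 1/k!, the iterates are the partial sums of a·Σ 1/k!, which explains the limit (e − 1)a stated in the footnote.

```python
import math

a = 2.0                        # the constant datum of Example 1.1
w = 0.0                        # w_0 = 0
for k in range(1, 21):         # N = 20 iterations suffice
    w += a / math.factorial(k)

print(w, (math.e - 1) * a)     # the iteration ends near (e - 1) * a ~= 1.71 a
```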
A third distinction of adaptive algorithms is made with respect to their cost functions. The
reference model defines a desired signal, the reference signal, that is to be approximated by
the adaptive filter. In general, the error between this reference signal and its approximation
is assessed by a specific cost function. Typically, the larger the difference, the larger are the
costs. Depending on the class of signals considered, the cost function may contain
expectations for stochastic signals, or simple norms respectively metrics for deterministic
signals. Often, in the stochastic case, the expectation E[|e(n)|^p] has to be minimized; in
the deterministic case, the sum Σ_{i=0}^{n} |e(i)|^p. These two cost functions are similar but not
the same. Also minimax formulations are common and will be presented in later chapters.
1.3 Nomenclature
The following nomenclature is applied throughout these lecture notes:

∗ denotes the complex conjugate of a scalar, a vector, or a matrix.
T as a superscript indicates the transpose of a vector or a matrix.
H as a superscript indicates the Hermitian transpose, i.e., the conjugate
transpose of a vector or a matrix.
a, b, c, ... denote (deterministic) scalars.
a, b, c, ... denote scalar random variables.
f_a(a) denotes the probability density function of the random variable a.
E(·) or E[·] denotes the expectation of a random variable. If the argument is a vector
or a matrix, the expectation is taken element by element.
σ_a² denotes the variance of the random variable a.
ā denotes the mean of the random variable a.
a, b, c, ... denote deterministic (column-)vectors.
1 is a (column-)vector (of adequate dimension) with all elements equal to one.
a, b, c, ... denote (column-)vectors whose entries are all random variables.
f_a(a) denotes the joint probability density function of the random vector a.
R_aa denotes the autocorrelation matrix of the random vector a, R_aa = E[aa^H].
R_ab denotes the cross-correlation matrix of the two random vectors a and b,
R_ab = E[ab^H].
r_ab or r_ba denotes the cross-correlation vector of the random vector a with the random
variable b, or vice versa; r_ab = E[ab*] is a column vector, r_ba = E[ba^H] is
a row vector, and obviously it holds that r_ab = r_ba^H.
r_ab denotes the cross-correlation of the random variables a and b, r_ab = E[ab*].
¹The algorithm converges to (e − 1)a ≈ 1.71a.
In these lecture notes, in some contexts, linear filters are represented by capital letters
followed by the argument q⁻¹. Here, q⁻¹ denotes the backward-shift operator, and it is added
to stress that the capital letter has the meaning of an operator as well. For example, we write
B(q⁻¹)[v(k)] = B[v(k)] for a sequence v(k), filtered by an FIR filter with the coefficients
b(0), b(1), ..., b(M − 1). Using this operator notation, also recursive filter structures (IIR)
can easily be described. To see this, consider the expression y(k) = B(q⁻¹)/(1 − A(q⁻¹)) [v(k)], which
corresponds to the difference equation

    y(k) = Σ_{l=1}^{N} a(l) y(k − l) + Σ_{l=0}^{M−1} b(l) v(k − l).        (1.4)

We prefer this notation to the z-transform since, by this, we do not need to state any
conditions regarding the input sequence (Dirichlet conditions).
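The difference equation (1.4) translates directly into code; the following minimal sketch (an illustration, not part of the notes) assumes the filter is at rest, i.e., all samples with negative index are zero:

```python
def iir_filter(v, a, b):
    """Evaluate the difference equation (1.4),
    y(k) = sum_{l=1}^{N} a(l) y(k-l) + sum_{l=0}^{M-1} b(l) v(k-l),
    for a finite input sequence v; samples with negative index are zero."""
    y = []
    for k in range(len(v)):
        acc = sum(a[l - 1] * y[k - l] for l in range(1, len(a) + 1) if k - l >= 0)
        acc += sum(b[l] * v[k - l] for l in range(len(b)) if k - l >= 0)
        y.append(acc)
    return y

# first-order example y(k) = 0.5 y(k-1) + v(k): geometric impulse response
print(iir_filter([1.0, 0.0, 0.0, 0.0], [0.5], [1.0]))
```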
As we operate a lot with matrices and vectors, we will introduce some short notation.
In particular, for the differentiation with respect to a real- or a complex-valued vector, we
will use the following rules (for w ∈ ℝ and z ∈ ℂ; see also Appendix A for a more detailed
discussion):

    ∂(Rw)/∂w = R
    ∂(wᵀRw)/∂w = wᵀ[R + Rᵀ]
    ∂(Rz)/∂z = R
    ∂(z^H Rz)/∂z = z^H R.

²Reminder: a Hermitian matrix is positive (semi-)definite if all of its eigenvalues are greater than (or equal to) zero.
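The rule for the quadratic form can be checked numerically with finite differences. The sketch below (an illustration for the real-valued case) compares wᵀ(R + Rᵀ) against a central-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.standard_normal((4, 4))            # a general (non-symmetric) matrix
w = rng.standard_normal(4)

f = lambda v: v @ R @ v                    # the quadratic form w^T R w
grad = w @ (R + R.T)                       # claimed gradient: w^T (R + R^T)

eps = 1e-6                                 # central finite differences
num = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.max(np.abs(grad - num)))          # agreement up to rounding
```

Since the cost is quadratic, the central difference is exact up to floating-point rounding, so the two gradients coincide to high precision.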
Exercise 1.1 Consider the adaptive structure in Figure 1.12. The blocks indicate transversal
filters with the weights collected in the corresponding vectors. Find an expression for
the error signal e(k) as a function of the input signal x(k) and the disturbance v(k) (hint:
make use of the operator notation for linear systems!). What solution is to be expected for
w_k if perfect adaptation is achieved? Which class of adaptive schemes does this structure
belong to? Which class would it belong to if the blocks a and w_k were interchanged?
[Figure 1.12: Adaptive structure with the blocks b, c, a and the adaptive filter w_k; signals d(k), v(k), x(k), y(k), and error e(k).]
Exercise 1.2 Show under which conditions the equalizer in Figure 1.7 becomes a system
identification. Assume that the blocks c and w are linear systems.
Chapter 2
Fundamentals of Stochastics
In this chapter, we briefly summarize a few fundamental methods which are commonly used
in stochastics. These methods will form the basis for many analyses presented throughout
these lecture notes. We will present least-mean-squares (LMS) estimation in the general case as well
as for a linear system model, which will lead us to the so-called linear least-mean-squares
(LLMS) estimator together with the Wiener solution. As such parabolic problems in most
cases lead to systems of linear equations, we discuss efficient methods (Durbin, Levinson,
and Trench) to solve them. Furthermore, we will discuss the so-called steepest-descent
algorithm, which is an iterative approach for solving parabolic problems.
Consequently, for zero mean random variables, we find σ_x² = E[x²]. Intuitively, a low
variance indicates that a desired value is near to the mean. More precisely, with a high
probability, the desired value will be found in the vicinity of the mean, and the size of
the corresponding interval around the mean can be deduced from the variance σ_x²:
• A small variance σ_x² indicates that the desired value is (likely to be) close to the mean.
• A large variance σ_x² indicates that the desired value may fall into a large interval
around the mean.
Thus, the variance provides a measure for the uncertainty of the estimated (desired) value.
The just-described meaning of the variance can be formulated quantitatively for some
random (desired) variable x by the so-called Chebyshev inequality¹, which states

    P(|x − x̄| ≥ δ) ≤ σ_x²/δ².        (2.2)

Accordingly, the probability that the desired value lies outside the interval [x̄ − δ, x̄ + δ] is bounded
by σ_x²/δ². For example, the probability that the desired value lies outside of x̄ ± σ_x is bounded by
100%. Thus, in this case, there is no effective bound on the probability. But the probability
that the desired value is not contained in the interval [x̄ − 5σ_x, x̄ + 5σ_x] is bounded by 4%.
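A quick Monte Carlo check of Chebyshev's inequality (2.2): the Gaussian distribution and the parameter values below are illustrative assumptions, while the bound itself holds for any distribution with finite variance.

```python
import numpy as np

rng = np.random.default_rng(4)
mean, sigma = 1.0, 2.0                        # illustrative parameter values
x = mean + sigma * rng.standard_normal(200000)

delta = 5 * sigma
p_emp = np.mean(np.abs(x - mean) >= delta)    # empirical tail probability
p_cheb = sigma**2 / delta**2                  # Chebyshev bound: 1/25 = 4%
print(p_emp, p_cheb)
```

For the Gaussian case the empirical 5σ tail probability is far below the distribution-free 4% bound, illustrating that Chebyshev's inequality is loose but universal.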
We assume now that (only) mean and variance of the random variable x are known,
and ask the question how to obtain an adequate estimate for the (unknown) value of x.
Of course, which estimate is adequate depends on the chosen method. Such a method is
generally called an estimator and relies on some measure of quality. An estimator suitable
with respect to one quality measure may be unacceptable if evaluated based on another
quality measure. A suitable quality measure could be the distance of the estimate from the
true value:
E[x − x̂]. (2.3)
Here, negative and positive values are obviously canceling each other out, and thus, an
estimator based on this quality measure may deliver a wrong picture. Recognizing that,
a more suitable measure or metric for the quality of estimation is given by the quadratic
measure E[(x − x̂)2 ]. Having agreed on a metric, the best estimator is the one which leads
to a minimal value of the metric. Consequently, for the quadratic metric, we want to know
the estimator x̂ which is optimal according to the minimization problem:
Quadratic metrics are very suitable as they are simple to manipulate, and they typically
lead to explicit analytical results. Also other metrics like the l1 -norm (absolute norm) are
used, because sometimes they allow for implementations with lower complexity.
Lemma 2.1 Given the mean x̄ and variance σ_x² of a random variable x, the least-mean-squares
(LMS) estimate x̂ is optimal if x̂ = x̄.
Proof: E[(x − x̂)²] = E[([x − x̄] + [x̄ − x̂])²] = σ_x² + (x̄ − x̂)², since the cross term vanishes due to E[x − x̄] = 0; the expression is obviously minimized by x̂ = x̄.
¹Note that this inequality does not rely on the actual probability density function of x.
Obviously, in the absence of further information, the best estimate is given by
the mean. Consider the estimation error e:

    e = x − x̂ = x − x̄.

At this point it is noteworthy that the variance of the estimation error is as large
as the variance of the random variable itself². We thus learn that this estimator has not
changed the uncertainty regarding x.
Let us now imagine that we observe a second random variable y that is correlated to
the first one x, meaning that y increases our knowledge about x. It should be possible
to formulate the estimate x̂ for x such that the quality of it is improved in comparison to
the estimator which does not incorporate the additional information provided by y. If the
estimator for x can be expressed by a function (mapping) of y such that
x̂ = h[y],
then, this function h[·] itself is called an estimator (Ger.: Schätzverfahren oder Schätzer).
Once an argument is provided, we obtain an estimate (Ger.: Schätzwert). Note that now
the estimate x̂ is a random variable itself, because it is obtained by the mapping of the
random variable correlated to x.
Lemma 2.2 The least-mean-squares estimator (LMSE) of x given y is E[x|y]. (The estimate
is given by E[x|y = y].) The minimum mean-square error (MMSE, Ger.: minimales
Fehlerquadrat) is given by:

    min_x̂ E[(x − x̂)²] = E[x²] − E[x̂²].

Due to f_y(y) ≥ 0, the first term is positive. Therefore, it can simply be interpreted as a
positive weighting term. The second term can be differentiated with respect to x̂, which
provides us with the solution for the minimum:

    x̂ = E[x|y = y] = ∫ x f_{x|y}(x|y) dx.        (2.8)

²Check this!
Example 2.1 Consider a random variable z = x + y. Let the two random variables x and
y be statistically independent. Moreover assume that x takes on the values ±1 with equal
probability, and that y is zero mean Gaussian distributed with variance σ_y². We now want
to identify the LMS estimator for x given z = z.
Solution:
The LMS estimator is given by

    x̂ = E[x|z = z] = ∫ x f_{x|z}(x|z) dx.        (2.9)

In order to obtain the conditional density function, we first compute the density f_z(z).
As x takes on only two different values with equal probability, we find:

    f_z(z) = (1/2) f_y(z + 1) + (1/2) f_y(z − 1).

In the next step we compute the conditional density function f_{x|z}(x|z):

    f_{x|z}(x|z) = [f_y(z − 1) δ(x − 1) + f_y(z + 1) δ(x + 1)] / [f_y(z + 1) + f_y(z − 1)]

and thus

    x̂ = [f_y(z − 1) − f_y(z + 1)] / [f_y(z + 1) + f_y(z − 1)]        (2.12)
      = tanh(z / σ_y²).        (2.13)

Obviously, if y has unit variance, the best estimator for x given z is x̂ = tanh(z). Often,
it is not as easy to derive an explicit expression. However, if x and z are jointly Gaussian
distributed, this is usually possible.
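The estimator of Example 2.1 can be validated by simulation. The sketch below (illustrative, not from the notes) compares the empirical mean-square error of x̂ = tanh(z/σ_y²) with that of a plausible competitor, the hard decision sign(z); the conditional-mean estimator must perform at least as well:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400000
sigma_y = 1.0
x = rng.choice([-1.0, 1.0], size=N)        # x = +/-1 with equal probability
y = sigma_y * rng.standard_normal(N)       # independent Gaussian noise
z = x + y                                  # the observation of Example 2.1

xhat = np.tanh(z / sigma_y**2)             # LMS estimator (2.13)
mse_tanh = np.mean((x - xhat)**2)
mse_sign = np.mean((x - np.sign(z))**2)    # hard decision as a competitor
print(mse_tanh, mse_sign)                  # the conditional mean wins
```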
There is also a geometric interpretation of the LMS estimator. Consider a function g(·)
operating on the random variable y. Since, by the law of total expectation, we have
E[x g(y)] = E[E[x|y] g(y)] = E[x̂ g(y)], we find:

    E[(x − x̂)g(y)] = 0.        (2.16)

Consequently, the estimation error e = x − x̂ is uncorrelated to
any arbitrary function of the second (correlated) random variable. The latter equation
actually states the orthogonality of e and g(y); that they are also uncorrelated is a direct
consequence of e being zero mean. Hence, we briefly say: “the error is orthogonal”.
Exercise 2.1 Let x and y be two complex-valued (second-order circular) jointly Gaussian
and zero mean random vectors with dimensions p × 1 and q × 1, respectively. Their individual
probability density functions are given by

    f_x(x) = 1/(π^p |det R_xx|) · exp{−x^H R_xx⁻¹ x}
    f_y(y) = 1/(π^q |det R_yy|) · exp{−y^H R_yy⁻¹ y}.

Moreover, assume that the cross-correlation matrix does not vanish, i.e., R_xy = R_yx^H ≠ 0.
1.) Determine the joint probability density function f_{x,y}(x, y).
3.) Arrange the terms in f_{x|y}(x|y) such that one term only depends on y.
Hint:

    [ R_xx  R_xy ]⁻¹   [ I             0 ] [ Σ⁻¹  0      ] [ I  −R_xy R_yy⁻¹ ]
    [ R_yx  R_yy ]    = [ −R_yy⁻¹ R_yx  I ] [ 0    R_yy⁻¹ ] [ 0   I           ],

where Σ = R_xx − R_xy R_yy⁻¹ R_yx.
4.) Derive the optimum estimator h(y) for x (in the least-mean-squares sense).
5.) Determine the (minimum) mean-square-error which is achieved by the estimator h(y).
6.) Now, allow the random vectors to have some non vanishing mean. Modify the results
of 4.) and 5.) according to this case.
    x̂ = E[x|y₁, y₂, ..., yₙ].

Now we will consider particular solutions for special cases. We have already seen that jointly
Gaussian processes typically lead to estimators which can be handled much more easily, since
they lead to linear expressions. Due to this insight, in this section we will consider linear
estimators even if the joint probability densities are not Gaussian. Such estimators are of
particular interest if the considered random variables are zero mean.
Consider two correlated zero mean random vectors x and y which may have different
dimensions. Let an estimator for x be given by x̂ = Ky. The dimensions of the matrix K
are inherently given by the sizes of x and y. We now want to find K such that the error
metric becomes minimal, that is,

    min_K E[(x − x̂)(x − x̂)^H].
Here, the cost function is not a scalar but a matrix! Consequently, we search for the matrix
K which minimizes the error-covariance matrix (which can actually be done based on any
matrix norm). Using the fact that both random vectors were assumed to be zero mean, we
find:

    E[x̂] = E[Ky] = K E[y] = 0 = E[x].        (2.17)

Obviously, in this case, the estimator x̂ is bias-free (Ger.: erwartungstreu). The following
lemma generalizes the just-considered example.

Lemma 2.3 The best linear LMS estimator (LLMSE) K_o for two correlated random vectors
x and y which are both zero mean is given by:

    K_o = R_xy R_yy⁻¹.        (2.18)
We are interested in the MMSE; hence, this expression needs to be minimized with respect
to K. To achieve this, we compare the above expression with (K − K_o)B(K − K_o)^H and
identify the various terms. As B = R_yy is positive definite, the minimum is achieved for
K = K_o = R_xy R_yy⁻¹. The MMSE is finally found by substituting K in the expression on
the right-hand side of (2.19) with the optimal estimator.
Both equations, the optimal linear estimator and the corresponding MMSE, can be
unified in one single expression:

    [ R_xx  R_xy ] [ I      ]   [ G_o ]
    [ R_yx  R_yy ] [ −K_o^H ] = [ 0   ].        (2.21)

This set of equations is called the normal equations (Ger.: Normalengleichungen), and the
estimator is called the Wiener solution. The estimated signal x is often called the
desired signal (Ger.: Wunschsignal). We thus find the optimal linear estimator by
minimizing the covariance matrix between the desired signal and the estimator.
Theorem 2.1 Consider two zero mean random vectors x and y. The linear estimator Ky
is an LLMSE for x if and only if:

    E[(x − Ky)y^H] = 0.        (2.22)

If R_yy is invertible (Ger.: regulär), then a unique K = K_o exists with this property.
Proof: From (2.18) it directly follows that K_o R_yy − R_xy = 0, which is equivalent to
E[xy^H] − K_o E[yy^H] = 0 (note the similarity to (2.16)). Accordingly, given the LLMSE,
orthogonality in the sense of (2.22) is ensured. However, we still have to show that the
orthogonality is also necessary to obtain the LLMSE. For an arbitrary K, the MSE is

    E[(x − Ky)(x − Ky)^H] = E[(x − Ky)x^H] − E[(x − Ky)y^H] K^H.        (2.23)

Due to the orthogonality condition (2.22), the second term is zero. We see that then
(2.23) is equal to the minimum G_o in Lemma 2.3 if K = K_o. Since it is also known that
the cost function reaches its minimum in exactly one point, which is K_o, it is shown that
the orthogonality condition (2.22) has to be satisfied to achieve the MMSE, respectively
to obtain the LLMSE.
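The Wiener solution (2.18) and the orthogonality condition (2.22) can be checked with sample statistics; the construction of the correlated vectors below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)

# correlated zero mean random vectors x (2x1) and y (3x1), generated as
# linear mixtures of a common source vector (an illustrative construction)
n = 100000
s = rng.standard_normal((5, n))
A = rng.standard_normal((2, 5))
B = rng.standard_normal((3, 5))
x, y = A @ s, B @ s

Rxy = x @ y.T / n                  # sample cross-correlation E[x y^H]
Ryy = y @ y.T / n                  # sample autocorrelation  E[y y^H]
Ko = Rxy @ np.linalg.inv(Ryy)      # Wiener solution (2.18)

e = x - Ko @ y                     # estimation error
orth = e @ y.T / n                 # sample version of E[e y^H], cf. (2.22)
print(np.max(np.abs(orth)))        # numerically zero: the error is orthogonal
```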
Example 2.2 Consider the linear observation model

    y = W x + v,

where the zero mean random vector x has the autocorrelation matrix R_xx. We furthermore assume that R_xv = 0. We want to optimally estimate x based on the
observation y. We obtain:

    R_yy = W R_xx W^H + R_vv        (2.24)
    R_xy = R_xx W^H.        (2.25)

Thus, we find the LLMSE of x given by:

    x̂ = R_xx W^H [W R_xx W^H + R_vv]⁻¹ y.        (2.26)

If both matrices R_xx and R_vv are invertible, we can reformulate this as:

    x̂ = [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹ W^H R_vv⁻¹ y.        (2.27)
We thus find the corresponding MMSE:

    min_x̂ E[(x − x̂)(x − x̂)^H] = min_x̂ E[(x − x̂)x^H]        (2.28)
    = R_xx − [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹ W^H R_vv⁻¹ W R_xx        (2.29)
    = [W^H R_vv⁻¹ W + R_xx⁻¹]⁻¹.        (2.30)

In the above example, (2.27) and (2.29) are derived from (2.26) and (2.28), respectively,
using the matrix inversion lemma:

    [A + BCD]⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹.
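That (2.26) and (2.27) describe the same estimator can be confirmed numerically; the covariance choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 5
W = rng.standard_normal((n, m))    # observation matrix (real, so W^H = W^T)
Rxx = 2.0 * np.eye(m)              # illustrative covariance of x
Rvv = 0.5 * np.eye(n)              # illustrative covariance of v
y = rng.standard_normal(n)         # an arbitrary observation

# estimator form (2.26)
x1 = Rxx @ W.T @ np.linalg.inv(W @ Rxx @ W.T + Rvv) @ y
# estimator form (2.27), related to (2.26) by the matrix inversion lemma
x2 = np.linalg.inv(W.T @ np.linalg.inv(Rvv) @ W + np.linalg.inv(Rxx)) \
     @ W.T @ np.linalg.inv(Rvv) @ y

print(np.max(np.abs(x1 - x2)))     # both forms give the same estimate
```

Form (2.26) requires inverting an n × n matrix, form (2.27) an m × m matrix, so which one is cheaper depends on the dimensions of x and y.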
Example 2.3 Finally, we want to consider a special case of Example 2.2, where y and v
reduce to scalars. Assuming that the vector x has dimension m × 1, the matrix W reduces
to a row vector wᵀ, where w has the same dimension as x. We then find the observation
equation to be

    y = wᵀx + v.

Based on the results of Example 2.2, the estimator can be shown to be

    x̂ = (y R_xx w*) / (wᵀ R_xx w* + σ_v²).

If x is moreover a white random process, its autocorrelation simplifies to R_xx = σ_x² I, and
thus,

    x̂ = (y w*) / (‖w‖₂² + σ_v²/σ_x²).
26 Adaptive Filters (preliminary)
Table 2.2: Comparison of the linear LMS estimator for a minimal parameter error and a minimal observation error.

Given                                          | LLMSE of x
-----------------------------------------------|--------------------------------------------------------------
{x, y}, {Ryy, Rxy, Rxx},                       | x̂ = Rxy Ryy^{-1} y
E x = E y = 0                                  | MMSE = Rxx − Rxy Ryy^{-1} Ryx
-----------------------------------------------|--------------------------------------------------------------
y = W x + v, {Rxx, Rvv, W},                    | x̂ = Rxx W^H [W Rxx W^H + Rvv]^{-1} y, or
E x = E y = E v = 0,                           | x̂ = [W^H Rvv^{-1} W + Rxx^{-1}]^{-1} W^H Rvv^{-1} y
E[xv^H] = 0                                    | MMSE = [W^H Rvv^{-1} W + Rxx^{-1}]^{-1}
-----------------------------------------------|--------------------------------------------------------------
y = w^T x + v, {σx², σv², w},                  | x̂ = y Rxx w* / (w^T Rxx w* + σv²)
E x = 0, E y = E v = 0,                        | MMSE = Rxx − Rxx w* w^T Rxx / (w^T Rxx w* + σv²)
E[xv*] = 0                                     |
Exercise 2.2 Consider a linear estimator K for the random vector x based on the obser-
vation y. Show that the error G(K) satisfies for arbitrary v and K:
v H Go v ≤ v H G(K)v,
where Go denotes the minimum mean-square-error (MMSE), which is achieved by the op-
timal linear estimator.
Let further a be a zero-mean random variable with variance σa2 , and v(k) be zero-mean white
noise with variance σv2 . Assume that a and v(k) are uncorrelated for all k = 1, 2, . . . , N ,
and that the frequency f0 is constant and known. Derive the best linear estimator for a,
based on the observations y(1), y(2), . . . , y(N ) and determine the achieved MMSE.
Univ.Prof. DI Dr.-Ing. Markus Rupp 27
d = w^T x + v = x^T w + v. (2.32)

In contrast to the previous analyses we now assume that the output d as well as the input
x can be observed. Thus, the autocorrelation matrix³ Rxx and the cross-correlation vector
rxd = E[x d*] are known. Moreover, it is assumed that the additive noise v is statistically
independent of the input x.

³ Note the identities Rxx = E[xx^H] and R*xx = E[x* x^T] = Rxx^T!
To find the optimal w which minimizes the observation error in the LMS sense, it is obviously
sufficient to know Rxx and r*xd. If Rxx is invertible the solution is uniquely given by:

wo = (R*xx)^{-1} r*xd.
In Section 2.2, it has already been shown that this is the linear LMS estimator, respectively
the Wiener solution. The required matrix inversion can be problematic and challenging. On
the one hand, the matrix may not be well conditioned, which leads to numerical problems
(this problem commonly occurs if speech signals are involved). On the other hand, for a
matrix with dimensions M × M, the numerical inversion generally has a complexity of order
O(M³), which especially for large matrices leads to high demands on processing power.
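For real-valued signals the conjugates drop and the Wiener solution reduces to wo = Rxx^{-1} rxd; the O(M³) cost mentioned above is exactly that of a general linear solve. A minimal NumPy sketch (the function name and test signals are ours, not the book's):

```python
import numpy as np

def wiener_solution(x, d, M):
    """Estimate the order-M Wiener solution w_o = R_xx^{-1} r_xd from sample
    averages (real-valued sketch; a general solve still costs O(M^3))."""
    N = len(d)
    # regressor rows x_k = [x(k), x(k-1), ..., x(k-M+1)]
    X = np.array([x[k - M + 1:k + 1][::-1] for k in range(M - 1, N)])
    Rxx = X.T @ X / len(X)             # sample autocorrelation matrix
    rxd = X.T @ d[M - 1:N] / len(X)    # sample cross-correlation vector
    return np.linalg.solve(Rxx, rxd)   # preferable to an explicit inverse

# identify a short FIR system from noisy observations d(k) = w_o^T x_k + v(k)
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])
x = rng.standard_normal(5000)
d = np.convolve(x, w_true)[:len(x)] + 0.01 * rng.standard_normal(len(x))
w_hat = wiener_solution(x, d, 3)
```

With enough data the sample averages approach Rxx and rxd, and w_hat approaches the true coefficients.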
The essential observation is that we can derive the solution for order M as soon as the
solution of order M − 1 is available. To achieve this, we reformulate the matrix in a block
structure:

Rxx,M wM = rxx,M (2.36)

[ Rxx,M−1       r^B_xx,M−1 ] [ w̃M−1 ]   [ rxx,M−1 ]
[ r^BH_xx,M−1   rxx(0)     ] [ w(M)  ] = [ rxx(M)  ]. (2.37)
Here, the index indicates the dimension of the solution (as well as the dimension of the
square matrices and the vectors). The notation r^B_xx,M means that the vector is applied in
a backward form:

[rxx(1), rxx(2), rxx(3), . . . , rxx(M)]^B = [rxx(M), rxx(M−1), rxx(M−2), . . . , rxx(1)]. (2.38)
Equivalently, such a backward notation can be achieved by a Hankel matrix B, the exchange
matrix with ones on the antidiagonal and zeros elsewhere:

r^B_xx,M = B rxx,M. (2.39)
For reasons of symmetry, we find: Rxx,M B = B R*xx,M and R*xx,M B = B Rxx,M. Without
loss of generality, we now assume that rxx(0) = 1, which simply normalizes the entire
equation. We thus obtain⁴:

Rxx,M yM = rxx,M (2.40)

[ Rxx,M−1       B rxx,M−1 ] [ zM−1 ]   [ rxx,M−1 ]
[ r^H_xx,M−1 B  1         ] [ αM   ] = [ rxx(M)  ]. (2.41)
Note that we deliberately introduced the vector zM−1 to circumvent the symbol yM−1, which
would denote the solution of order M − 1. The Durbin algorithm can now be derived as
follows. Assume that the solution yM−1 of order M − 1 is available.
For the remaining element αM, we find an expression from the second line of (2.41).
Reordering all relevant equations, we obtain the well known Durbin algorithm.
Note that the dimensions of the vectors rxx,k, wk, and zk increase according to the index k.
If we count the number of required MAC (Multiply and Accumulate, or simply Mult/Add)
operations, we find a complexity of 3M². We can furthermore show that

βk = (1 − α²k−1) βk−1. (2.49)
Rxx,M yM = b (2.50)

[ Rxx,M−1       r^B_xx,M−1 ] [ zM−1 ]   [ bM−1 ]
[ r^BH_xx,M−1   rxx(0)     ] [ δ    ] = [ b(M) ]. (2.51)
As before, the right hand side is partitioned into a vector of dimension M − 1 and
a scalar b(M). The entire Levinson algorithm reads:
Analogous to the Durbin algorithm, the dimensions of the vectors rxx,k, wk, ηk, and zk
grow corresponding to the index k. The complexity of the Levinson algorithm is 4M². Thus,
it is just twice as large as the complexity of the Durbin algorithm.
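The order-recursive idea can be sketched as follows; this is a standard real-valued formulation of the Durbin recursion in our own notation (not a transcription of the book's update equations):

```python
import numpy as np

def durbin(r):
    """Durbin recursion: solve the Toeplitz system R_M y_M = [r(1), ..., r(M)]^T,
    where R_M has first row [r(0), ..., r(M-1)], in O(M^2) operations
    (real-valued sketch)."""
    r = np.asarray(r, dtype=float)
    M = len(r) - 1
    y = np.array([r[1] / r[0]])          # order-1 solution
    beta = r[0] * (1.0 - y[0] ** 2)      # beta_k = (1 - alpha_{k-1}^2) beta_{k-1}
    for k in range(2, M + 1):
        alpha = (r[k] - r[1:k][::-1] @ y) / beta
        y = np.concatenate([y - alpha * y[::-1], [alpha]])   # order update
        beta *= 1.0 - alpha ** 2
    return y

# autocorrelation of an AR(1) process, r(m) = 0.7^m: the exact solution is
# the one-tap predictor [0.7, 0, 0, 0]
y = durbin([1.0, 0.7, 0.49, 0.343, 0.2401])
```

The quantity beta tracks the recursion (2.49); each order update costs O(M) operations, giving the quadratic overall complexity quoted above.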
γ = (1 + r^T_xx,M−1 yM−1)^{-1};
zM−1 = γ y^B_M−1;
L(1,1) = γ;
L(1,2:M) = γ y^T_M−1;
for m = 2 : (M−1)/2 + 1
    for n = 2 : M − m + 1
        L(m,n) = L(m−1,n−1) + (1/γ)[z(M+1−m) z(M+1−n) − z(m−1) z(n−1)];
    end
end                                                          (2.52)

The result is found in the matrix L; note that only the upper triangular entries of the
matrix are calculated by the algorithm, since the matrix is symmetric. The complexity is
roughly 3M² (exactly: (13/4)M²).
Exercise 2.4 Show that for the Durbin algorithm (2.49), we find:

βk = (1 − α²k−1) βk−1.
Exercise 2.5 Formulate the Levinson algorithm, without conditions of the kind (k < M −
1). Program a RISC processor (for example TI-C6x) and compare the number of operations
with the predicted complexity.
Exercise 2.6 Reformulate the Durbin algorithm to solve Rxx,M −1 y M −1 = −rxx,M −1 . Show
that the computation of γ, or actually of 1/γ, is already solved in the Trench algorithm.
Let us consider one more time the observation equation (2.32). Again, the input x and
the output d are assumed to be known, and the final goal is to find a vector w such
that the linear combination of x and w optimally resembles the original output. Similar to
(2.33), this optimum is given by the ŵ which minimizes the quadratic cost function

g(ŵ) = E[|d − ŵ^T x|²]. (2.54)

The cost function has the shape of a multi-dimensional paraboloid, with the minimum

go = g(wo) = min_ŵ E[|d − x^T ŵ|²]. (2.56)
The optimum solution ŵ = wo is already known from the Wiener solution, and the
corresponding (minimum) value of the cost function can be easily calculated using the well
known orthogonality relation (2.16), leading to

go = σd² − r^H_xd Rxx^{-1} rxd. (2.57)
By substitution of (2.57) in (2.55), we obtain the description (2.58), which again
demonstrates that the cost function is quadratic. Knowing that, we can expand it into a
Taylor series at some point ŵk−1:

g(ŵ) = g(ŵk−1) + ∇g(ŵk−1)(ŵ − ŵk−1) + ½ (ŵ − ŵk−1)^H ∇²g(ŵk−1)(ŵ − ŵk−1). (2.59)
However, we still need to calculate the first and second order derivatives.
The gradient is obtained by differentiating the cost function in (2.58) with respect to ŵ:

∇g(ŵk−1) = ∂g(ŵ)/∂ŵ |_{ŵ=ŵk−1} = [ŵk−1 − wo]^H R*xx. (2.60)

Note that the gradient is a row vector! Expanding the result in (2.60) leads to an alternative
representation of the gradient:

∂g(ŵ)/∂ŵ = ŵ^H R*xx − r^T_xd. (2.61)
In (2.61), the orthogonality relation was applied to directly see that after the expansion,
the second term coincides with the cross-correlation between x and d, i.e., wo^H R*xx = r^T_xd.
The second derivative is obtained by differentiating the gradient:

∇²g(ŵk−1) = R*xx. (2.62)
The function g(ŵ) is sufficiently smooth so that every point can be associated with
its gradient. In each point, the gradient points in the direction of the (locally) steepest
ascent. Thus, it points away from the minimum. On the other hand, then, its negative
(and complex conjugate) value has to point in the direction of (locally) steepest descent,
which (at least roughly) aims at the global minimum. According to this insight, in the
update equation (2.53), we take the negative (complex conjugate, transposed) gradient as
the direction of improvement: zk = (r*xd − R*xx ŵk−1).
A typical shape of the cost function (2.58) is depicted in Figure 2.1 for a 2-dimensional
real-valued parameter vector (ŵ^T = [ŵ₁, ŵ₂]). On the right-hand side, the corresponding
contour plot is given, which shows some lines of constant cost. Additionally, for a few
points, the negative gradients are included. We observe that for points far away from the
minimum, the negative gradient does not directly point to the minimum. Nevertheless, it
gives a hint where the minimum will be found. Therefore, it becomes clear that only a
step-wise iteration can ensure the localization of the minimum.
Considering again the initial general update equation (2.53), we can express the cost of
ŵk with respect to the cost of ŵk−1 by substituting (2.53) in the Taylor series expansion
(2.59), leading to:
then, (2.63) provides us with the insight that the following inequality has to be satisfied:
Here, the index i denotes the i-th entry of the vector u and λi are the diagonal entries of
the diagonal matrix Λ. From the diagonalized form (2.72), we can immediately identify the
convergence conditions of this iterative algorithm:
We see that by means of the diagonalization and coordinate transformation, the solution
of the equivalent system of homogeneous linear difference equations can be written in the
form ∏_{l=1}^{k} (1 − µ(l)λi). If the above discussed convergence condition for the step-size is
satisfied, these products decay exponentially and we can determine the adaptation
rate. The adaptation rate is the speed at which these products decay, and therefore, it
is also the speed of the learning process performed by the filter when it adapts towards
the correct value. The nearer the expressions (1 − µ(k)λi) are to zero, the faster the error
value approaches zero. At first glance, the choice µ(k) = 1/λi may seem to be optimal;
however, this is only optimal with respect to one eigenvalue.
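The iteration just described — repeatedly stepping along the direction zk = rxd − Rxx ŵk−1 — can be sketched for the real-valued case as follows (a minimal NumPy illustration; the matrices and names are ours):

```python
import numpy as np

def steepest_descent(Rxx, rxd, mu, iterations):
    """Steepest descent on the quadratic cost: each step moves along the
    negative gradient z_k = r_xd - R_xx w_{k-1} (real-valued sketch)."""
    w = np.zeros(len(rxd))
    for _ in range(iterations):
        w = w + mu * (rxd - Rxx @ w)
    return w

Rxx = np.array([[1.0, 0.5],
                [0.5, 1.0]])           # eigenvalues 0.5 and 1.5
wo = np.array([1.0, -2.0])
rxd = Rxx @ wo                         # cross-correlation of the reference model
w_hat = steepest_descent(Rxx, rxd, mu=1.0 / 1.5, iterations=200)
```

With µ = 1/λmax the error mode belonging to λmax is removed in a single step, while the mode belonging to λmin decays only as (1 − λmin/λmax)^k — exactly the "optimal with respect to one eigenvalue" effect noted above.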
The considerations in this section were focused on quadratic cost functions. Nevertheless,
this does not mean that the method of steepest descent is restricted to such cost functions.
If an arbitrary smooth cost function is given, we can always derive a Taylor series as done
in (2.59), leading to:
g(ŵ) = g(ŵk−1) + ∇g(ŵk−1)(ŵ − ŵk−1) + ½ (ŵ − ŵk−1)^H ∇²g(ŵk−1)(ŵ − ŵk−1) + . . . (2.75)
Of course, if the cost function has higher than quadratic order, the Taylor series
expansion contains not only the constant, the linear, and the quadratic term, but also
terms of higher order. Then, the condition (2.65) for finding the global minimum is no
longer suitable. On the other hand, the quadratic terms can be seen as a hint that
there exist one or more points in whose vicinity the cost function has approximately the
shape of a parabola, and thus may show one or more (local) minima. If the steepest descent
method is applied, we will find one of these local minima. However, (in most cases) this
may not be the desired global minimum. Only for a quadratic cost function is it ensured
that the found minimum is the global one.
Exercise 2.7 Let x be the excitation signal of the steepest descent algorithm with the update
direction chosen to be the negative conjugate gradient of the quadratic cost function. Assume
that the eigenvalues λi of the autocorrelation matrix Rxx are known. Find the optimum fixed
step-size µopt in the sense that

µopt = arg min_µ max_{λi} |1 − µλi|.
Exercise 2.9 Derive the steepest descent algorithm (2.58)-(2.67) for real-valued signals.
Compare the results to the complex-valued case. What are the differences?
Exercise 2.10 Assume the matrix R to be positive definite. Under which conditions does
the following series converge, and which limit does it converge to?

Σ_{k=0}^{∞} (I − µR)^k
Exercise 2.11 Consider the quadratic cost function (2.59) for the standard steepest descent
algorithm (2.67). Show that the costs decrease fastest for the time-variant step-size:

µopt(k) = ‖∇g(ŵk−1)‖² / (∇g(ŵk−1) Rxx ∇^H g(ŵk−1)).
2.6 Literature
A good overview on estimation methods can be found in [34]. An introduction to the
steepest descent algorithm is given in [29], respectively the more recent edition [30], and in
[71].
Chapter 3: The Least-Mean-Squares Algorithm
The least-mean-squares (LMS) algorithm is by far the most frequently applied adaptive
algorithm. Its advantages are its numerical stability, its low computational complexity, as
well as its robustness. Almost all adaptive algorithms employed in practice are
LMS algorithms or derivatives of it. In this chapter, we introduce the LMS algorithm start-
ing with its classic interpretation as an approximation of the steepest descent algorithm. Its
most important properties like convergence bounds, convergence speed, and steady-state
error will be derived based on stochastic analyses, that is, the driving signals will be as-
sumed to be random processes. Additionally, we will also investigate the behavior of the
algorithm under deterministic sinusoidal excitation. The chapter will close with application
examples.
R̂*xx = x*k x^T_k (3.2)
r̂xd = xk d*(k). (3.3)
In (2.67), the gradient is a fixed direction which is given by the signal statistics of xk
and d(k). In contrast, the approximation of the gradient in (3.4) is itself a stochastically
changing direction, since it varies with the instantaneous values of the input vector xk .
Therefore, the name LMS is somewhat misleading. More precisely, it is a stochastic gradient
method. Nevertheless, the name LMS has been used extensively in literature, and thus,
will be used throughout this text as well. Note however that the original LMS estimator
from Lemma 2.2, in general, is an estimator given by a nonlinear function as was shown
in Section 2.1. We will see later in the context of robustness (see Chapter 7) that even
the name ‘stochastic gradient method’ is not entirely correct, since the algorithm works
perfectly well in the absence of any randomness.
Returning to (3.4), note that in contrast to (2.67), the estimated parameter vector ŵk is
also random, just like d(k) and xk. The error term [d(k) − x^T_k ŵk−1] is called the disturbed
a-priori error ẽa, as it is constructed from the a-priori estimate ŵk−1. Analogously, there
exists also a disturbed a-posteriori error constructed from the a-posteriori estimate ŵk:
ẽp = d(k) − x^T_k ŵk.
We obtain first variants of this algorithm by choosing different step-sizes. Time variant
step-sizes µ(k) appear to be practical, in particular when they are related to the power of
the input process. The following algorithms are common:
• general time variant step-size µ(k): stochastic gradient type algorithm.
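The resulting update ŵk = ŵk−1 + µ x*k ẽa(k) can be sketched for real-valued signals as follows (a minimal NumPy illustration; the reference model and names are ours, not the book's):

```python
import numpy as np

def lms(x, d, M, mu):
    """LMS adaptation of an M-tap transversal filter (real-valued sketch).
    Each step uses the instantaneous gradient estimate x_k * e_a(k)."""
    w = np.zeros(M)
    e = np.zeros(len(d))
    for k in range(M - 1, len(d)):
        xk = x[k - M + 1:k + 1][::-1]    # regressor [x(k), ..., x(k-M+1)]
        e[k] = d[k] - xk @ w             # disturbed a-priori error
        w = w + mu * e[k] * xk           # stochastic-gradient update
    return w, e

# identification of a reference model d(k) = w_o^T x_k + v(k)
rng = np.random.default_rng(1)
wo = np.array([1.0, 0.5, -0.25])
x = rng.standard_normal(20000)
d = np.convolve(x, wo)[:len(x)] + 0.001 * rng.standard_normal(len(x))
w_hat, e = lms(x, d, M=3, mu=0.01)
```

Unlike the steepest descent recursion, no statistics are required in advance; the price is a stochastically fluctuating update direction.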
Differentiating a cost function with respect to the parameters ŵk−1 and writing down a
gradient method following the idea "new estimate equals old estimate plus a step along the
negative gradient" is in most cases a successful approach. Strictly speaking, the algorithm
needs to be analyzed first; with statistical methods this is often not feasible. Finally we
show a typical example. We minimize E[|ẽa(k)|^K]. Differentiating with respect to ŵk−1
leads to the gradient −E[(K/2) |ẽa(k)|^{K−2} xk ẽa(k)]. The expectation is substituted by
its instantaneous value and we obtain:
Exercise 3.1 Derive an adaptive algorithm to minimize the cost function E[|ẽa (k)|] and
distinguish here complex-valued as well as real-valued signals.
Exercise 3.2 Consider an undisturbed, nonlinear system: y(k) = x^T_k w₁ + xx^T_k w₂ with
xx(k − i) = x(k)x(k − i), i = 0, 1, . . . , M₂ − 1. The parameter vectors w₁ and w₂ have the
dimensions M₁ × 1 and M₂ × 1. Derive an adaptive algorithm to minimize the additively
disturbed squared error signal. What is the acf matrix of the input process if x(k) is a
white Gaussian process?
3.2.1 Assumptions
Assumptions: Independence Assumption (Ger.: Unabhängigkeitsannahme)
• The observed desired signal d(k) originates from a reference model d(k) = w^T_o xk + v(k), with
zero mean processes x(k) and v(k).
• The vectors xk of the input process are statistically independent of each other, that
is, fxx(xk, xl) = fx(xk) fx(xl) for k ≠ l.
• The driving input process xk is of zero mean and circular (spherically invariant[5])
Gaussian distributed.
• The additive noise v(k) is statistically independent of the input process xk .
Note that by such conditions the vectors ŵk are statistically independent of xl, l > k.
Note that due to the independence assumption we can write E[Pk−1 x*k x^T_k] = Pk−1 R*xx
and E[x*k x^T_k |v(k)|²] = R*xx σv². Equation (3.13) can thus be reformulated to:

Pk = Pk−1 − µ Pk−1 R*xx − µ R*xx Pk−1 + µ² E[x*k x^T_k Pk−1 x*k x^T_k] + µ² R*xx σv². (3.14)
Furthermore we have for complex-valued spherically invariant Gaussian processes (see
Appendix D):

E[x*k x^T_k Pk−1 x*k x^T_k] = E[x*k x^T_k] Pk−1 E[x*k x^T_k] + trace[Pk−1 E[x*k x^T_k]] E[x*k x^T_k]
                           = R*xx Pk−1 R*xx + trace[Pk−1 R*xx] R*xx.

Hint 1: For real-valued spherically invariant Gaussian processes we have:

E[xk x^T_k Pk−1 xk x^T_k] = 2 E[xk x^T_k] Pk−1 E[xk x^T_k] + trace[Pk−1 E[xk x^T_k]] E[xk x^T_k]
                         = 2 Rxx Pk−1 Rxx + trace[Pk−1 Rxx] Rxx.

Hint 2: The same statements are also true for the larger class of spherically invariant
complex-valued processes.
Knowing the matrix B we find sufficient conditions for convergence in the mean square
sense.
Theorem 3.1 The LMS algorithm is convergent in the mean square sense if it satisfies
the given assumptions and the condition:

0 < µ < 2 / (2λmax + trace[Λ]). (3.21)
Proof: Convergence of Equation (3.19) is given if the eigenvalues of matrix B are smaller
than one in magnitude. Note that B is positive definite, that is, all eigenvalues are positive.
A sufficient condition for convergence is thus that the largest eigenvalue is smaller than
one. The largest eigenvalue is given by the 2-induced norm. It can further be bounded by
the 1-induced norm, that is,

λmax = ‖B‖2,ind ≤ ‖B‖1,ind.

We can take an arbitrary row of B:
in which the eigenvalues λl of matrix B are weighted. In this form we assumed that
all eigenvalues are different. The weighting factors depend also on the initial values
of the parameter error vector. It can thus happen that particular eigenvalues do not
appear at all. If we consider the worst case, then the largest eigenvalue of B will dominate
the convergence speed. By Equation (3.18) we can describe the temporal evolution of
the adaptation process. If we, on the other hand, consider a single realization, the process
can look very different. The reason for this is that (3.18) describes the learning in the
mean. Only if we ensemble-average many realizations will we find a good agreement with
the theoretical prediction. The ensemble-averaged adaptation curves are called learning
curves. Figure 3.1 displays learning curves of the relative system mismatch for various
step-sizes.
[Figure: theory, steady-state limit, and simulation curves for µ = µOPT/2, µ = µOPT, and µ = 0.8 µg; vertical axis: relative system distance in dB, horizontal axis: iterations.]
Figure 3.1: Learning curves: relative system distance for various step-sizes.
Besides the convergence speed, the remaining parameter error vector, also called the
mismatch, is of interest. This steady-state value is theoretically reached for k → ∞. Let
us consider again Equation (3.18). For k → ∞, the mismatch is given by
Even more interesting than the mismatch is the distorted a-priori error

ẽa(k) := d(k) − ŵ^T_{k−1} xk (3.27)

for k → ∞. We obtain:

E[|ẽa(k)|²] = E[|d(k) − ŵ^T_{k−1} xk|²] (3.28)
            = E[|v(k) + w̃^T_{k−1} xk|²] (3.29)
            = σv² + E[|w̃^T_{k−1} xk|²] (3.30)
            = σv² + E[w̃^T_{k−1} xk x^H_k w̃*_{k−1}] (3.31)
            = σv² + E[w̃^T_{k−1} R*xx w̃*_{k−1}] (3.32)
            = σv² + trace{Pk−1 R*xx} (3.33)
            = σv² + λ^T ck−1. (3.34)
Here, we can recognize the part of the LMS approximation. The Wiener solution only
has the noise part σv², while the LMS algorithm produces an additional error term gex called
the excess mean square error. By applying the matrix inversion lemma we can also give a
simple expression for this error:

gex = λ^T c∞ = µ σv² · [Σ_{l=1}^{M} λl/(2 − 2µλl)] / [1 − µ Σ_{l=1}^{M} λl/(2 − 2µλl)]. (3.35)
A further parameter that is often used is the so-called misadjustment (Ger.: Fehlanpassung).
It is the excess mean square error relative to the Wiener solution:

mLMS = gex/go = µ · [Σ_{l=1}^{M} λl/(2 − 2µλl)] / [1 − µ Σ_{l=1}^{M} λl/(2 − 2µλl)]. (3.36)
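Formula (3.36) is easy to evaluate numerically. The following sketch (function name ours) computes the misadjustment from the eigenvalues λl of Rxx and the step-size µ; for small µ it reduces to the familiar approximation m ≈ µ Σ λl / 2:

```python
import numpy as np

def misadjustment(eigenvalues, mu):
    """Misadjustment m_LMS = g_ex / g_o according to (3.36):
    mu * S / (1 - mu * S), with S = sum_l lambda_l / (2 - 2 mu lambda_l)."""
    lam = np.asarray(eigenvalues, dtype=float)
    S = np.sum(lam / (2.0 - 2.0 * mu * lam))
    return mu * S / (1.0 - mu * S)

# white input with unit power and M = 8 taps: all eigenvalues equal 1,
# so for mu = 0.01 the small-step approximation gives m ~ mu * M / 2 = 0.04
m = misadjustment([1.0] * 8, mu=0.01)
```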
Figure 3.2: Learning curves: mean squared a-priori error for various step-sizes.
Due to the independence assumption we can interpret the convergence in the mean square
sense as if we had isolated terms (I − µx*l x^T_l), although in reality the entire product in
(3.37) is of importance.
Due to the independence assumption we find the convergence condition in the mean square
sense by applying the expectation on both sides:
E[|w̃(k)|²] = E[ ∏_{l=1}^{k} (1 − µ|x(l)|²)² ] E[|w̃(0)|²] (3.42)
           = ∏_{l=1}^{k} E[(1 − µ|x(l)|²)²] · E[|w̃(0)|²]. (3.43)

It is thus sufficient to consider the term E[(1 − µ|x(l)|²)²] to guarantee convergence in the
mean square sense. For this the following needs to be true:

E[(1 − µ|x(l)|²)²] < 1 (3.44)

ln |w̃(k)|² = Σ_{l=1}^{k} ln (1 − µ|x(l)|²)² + ln |w̃(0)|². (3.46)
l=1
Dividing this term by the number of iterations k and letting this number grow, we obtain
for ergodic random processes x(k):

lim_{k→∞} ln(|w̃(k)|²)/k = E[ ln (1 − µ|x(l)|²)² ]. (3.47)

Remark: This can also be argued by the law of large numbers, as the elements x(k) are i.i.d.
with bounded variance.
The condition for convergence is now, after we made use of the logarithm:

E[ ln (1 − µ|x(l)|²)² ] < 0. (3.48)

If we compare this with our previous condition on the convergence in the mean square
sense, we can formulate the latter equivalently as:

ln E[ (1 − µ|x(l)|²)² ] < 0. (3.49)
In Figure 3.3 both functions are plotted for the case of a uniform distribution of x(k) in
the range [−1, +1]. The convergence condition in the mean square sense delivers a stability
bound of µg = 10/3 = 3.33, while the new condition allows step-sizes up to µg = 6.1. As the
new condition was found via a stochastic limit, we call it almost sure convergence or
convergence with probability one¹.
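The two conditions can be checked numerically for the uniform example. Below is a sketch using a sample-average approximation of the expectations; the exact mean-square bound follows from E[x²] = 1/3 and E[x⁴] = 1/5:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 2_000_000)   # uniform excitation as in the example

def ms_moment(mu):
    """E[(1 - mu x^2)^2]; the mean-square condition (3.44) requires this < 1.
    For x ~ U[-1,1] it equals 1 - 2 mu/3 + mu^2/5, giving the bound mu_g = 10/3."""
    return float(np.mean((1.0 - mu * x ** 2) ** 2))

def as_moment(mu):
    """E[ln(1 - mu x^2)^2]; the almost-sure condition (3.48) requires this < 0."""
    return float(np.mean(np.log((1.0 - mu * x ** 2) ** 2)))
```

Evaluating both moments on a grid of µ values reproduces the two crossing points of Figure 3.3: the mean-square moment crosses 1 at µ = 10/3, while the logarithmic moment stays negative well beyond that.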
[Figure 3.3: E[log(u)] and log(E[u]) plotted over the range 0 ≤ u ≤ 7.]
Exercise 3.5 Compute the step-size so that for a white random process the largest
eigenvalue of B becomes minimal. How fast does the algorithm converge as a function of
the filter length M?
¹ See also Appendix B for more details.
Exercise 3.7 Provide the adaptation for a parameter error vector w̃k of length M = 1
and draw a signal flow graph for it. Which conditions on the step-size µ and the pdf of the
driving process are required to obtain stability?
Exercise 3.8 Provide the adaptation for the parameter error vector w̃k of length M = 1.
Assume the driving process to be bipolar noise of zero mean with σx² = 1. Compute the pdf
of the error vector. Now compute the mean and the covariance matrix of the parameter error
vector for arbitrary length M.
Exercise 3.9 Substitute the a-priori error ẽa (k) = d(k) − xTk ŵk−1 by the a-posteriori
error ẽp (k) = d(k) − xTk ŵk . Reformulate the gradient method so that only a-priori error
terms occur.
Exercise 3.10 Consider the matrix equation

K = XKX + R

with K, X, R Hermitian matrices. Given X and R, solve the equation by using Kronecker
properties and vectorization of K.
Apply the same method to solve the recursive equation of the parameter covariance
matrix Kk in the LMS algorithm. What step-size condition for stability in the mean square
sense can be derived?
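A sketch of the vectorization mechanics for the first part (assuming the relevant inverse exists; it rests on the identity vec(AKB) = (Bᵀ ⊗ A) vec(K) with column-major vec):

```python
import numpy as np

def solve_kxk(X, R):
    """Solve K = X K X + R by vectorization:
    vec(X K X) = (X^T kron X) vec(K), hence (I - X^T kron X) vec(K) = vec(R).
    Sketch; assumes I - X^T kron X is invertible, e.g. when all eigenvalues
    of X have magnitude smaller than one."""
    n = X.shape[0]
    A = np.eye(n * n) - np.kron(X.T, X)
    vecK = np.linalg.solve(A, R.flatten(order='F'))   # column-major vec(.)
    return vecK.reshape((n, n), order='F')

X = np.array([[0.5, 0.1],
              [0.1, 0.3]])     # Hermitian (here real symmetric), eigenvalues < 1
R = np.eye(2)
K = solve_kxk(X, R)
```

The solution can be verified by substituting K back into the original equation.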
Exercise 3.11 Show that the stability limit for convergence with probability one after
Example 3.4 is indeed µg = 6.1.
Matlab Experiment 3.1 Write a Matlab Program for parameter identification. The
driving process is a real-valued, zero-mean Gaussian process with σx2 = 1. Let the unknown
system have M = 32 coefficients all different from zero. Run the LMS adaptation of a
transversal filter for various step-sizes and plot the relative system mismatch as well as the
a-priori error energy over time. Compare with theoretical results on fastest convergence
and stability limit.
In a second experiment, realize 50 independent runs for each of the step-sizes and plot the
ensemble-averaged value. Discuss the differences.
Matlab Experiment 3.2 Rerun the previous experiment, however with a colored driving
process, obtained by filtering the white process with the filter

F(z) = √(1 − b²) / (1 − b z⁻¹), b = −0.7.

Compute the autocorrelation function and provide the acf matrix for M = 32. Repeat the
previous experiment and compare to the theoretical values. Discuss the results.
Repeat the experiments with an NLMS algorithm. What is different now?
x(k) = A exp(−jΩo k)
with A ∈ C, a complex-valued amplitude. Thus we have for the vector of the driving
process:
for the vector ŵk−1 . Staying with the reference model d(k) = v(k) + wTo xk , we now obtain:
The equation has obviously simplified, and temporal changes are now only visible in the
parameter error vector w̃k and in the noise ṽ(k). If we first consider only the homogeneous
difference equation, we recognize that only one directional component of the vector
is changing. The error vector at time instant k can be decomposed into two components:
one parallel to x and one orthogonal to it, thus
with x^T z* = 0. The homogeneous part in Equation (3.53) only changes γ(k − 1),
while z remains unchanged. We can thus write for the homogeneous part alone:
The component γ(k − 1) is thus multiplied by the factor (1 − µ|A|² ‖x‖₂²). With this we can
formulate the convergence condition immediately:
0 < µ < 2 / (|A|² ‖x‖₂²). (3.55)

Under this condition the reduction of γ(k − 1) continues until it reaches zero (asymptotically).
This can also be obtained within a single adaptation step, if

µ = 1 / (|A|² ‖x‖₂²).
The steady-state γ(k) = 0 is equivalent to the complete suppression of the signal at
frequency Ωo. We call this a signal adaptation, rather than a system adaptation,
in which the entire error vector tends to zero.
Let us continue with the inhomogeneous Equation (3.53). The disturbance ṽ(k) causes
the component γ(k) not to remain constant at zero. As the excitation is only in
the direction x, no component of z is changed by the inhomogeneous equation. In other
words, the inhomogeneous equation changes only γ(k). We thus obtain:
If we consider the energy of the terms and utilize the statistical independence of ṽ(k) and
γ(k − 1), we obtain
Let us further assume that the initial estimate is zero; then we recognize that after k steps
we have a linear combination of the vectors x*l, l = 1 . . . k. Let us combine them in a matrix
Xk and the weighting terms µẽa(l) in a vector ẽk; then we can formulate the adaptation
(without the initial value ŵ0) as follows:

ŵk = Xk ẽk.

Obviously, the columns of Xk span a space. If the space is of dimension M, thus of the length
of the vector wo, then the algorithm can select the weights ẽk so that ŵk approximates
the optimal value wo. If the dimension of the spanned vector space is smaller than M, the
resulting estimator cannot approach the solution. We have thus found a necessary condition
for system adaptation:
Lemma 3.1 In order to achieve a system adaptation (and not only a signal adaptation)
with the LMS algorithm, the following condition needs to be satisfied additionally (Ger.:
hartnäckige Anregung, Engl: persistent excitation)
Let us consider an LMS algorithm excited by a cosine. The i-th entry of the vector xk is

x(k − i) = B cos(Ωo(k − i)) = (B/2) [exp(jΩo(k − i)) + exp(−jΩo(k − i))], (3.60)

with a real-valued amplitude B ∈ R. Thus the i-th entry of the parameter error vector
reads:

w̃i(k) = w̃i(k − 1) − (µB/2) [exp(jΩo(k − i)) + exp(−jΩo(k − i))] ẽa(k). (3.61)

Neglecting the initial conditions and applying the Z-transform we obtain:

W̃i(z) = −(z/(z − 1)) (µB/2) [Ẽa(z exp(−jΩo)) exp(−jiΩo) + Ẽa(z exp(jΩo)) exp(jiΩo)]. (3.62)
The undistorted error ea(k) can also be described as a linear combination of xk and w̃k−1.
This leads in the Z-domain to:

Ea(z) = z⁻¹ (B/2) Σ_{i=1}^{M} [W̃i(z exp(−jΩo)) exp(jΩo(1 − i)) + W̃i(z exp(jΩo)) exp(−jΩo(1 − i))]. (3.63)

If we neglect the terms at 2Ωo, we obtain after substitution:

Ea(z) = −(µB²M/4) Ẽa(z) [1/(z exp(−jΩo) − 1) + 1/(z exp(jΩo) − 1)] (3.64)
      = (µB²M/2) · (1 − z cos(Ωo)) / (z² − 2z cos(Ωo) + 1) · Ẽa(z). (3.65)
As the distorted a-priori error comprises the undistorted version and the noise, Ẽa(z) =
Ea(z) + V(z), we obtain the following Z-transfer function:

Ea(z)/V(z) = (µ/µ̄) [1 − z cos(Ωo)] / (z² − 2z cos(Ωo)(1 − µ/(2µ̄)) + 1 − µ/µ̄). (3.66)
[Figure: signal flow graph of (3.66) with the allpass (z⁻¹ − cos(Ωo))/(z − cos(Ωo)) in the feedforward path and the lossy feedback gain 1 − µ/µ̄.]
Figure 3.4: LMS Algorithm under sinusoidal excitation as allpass in the feedforward path
and lossy feedback.
Exercise 3.13 Applying the same approximations as in (3.66), compute the expression
D(z)/Ẽa(z), which is often applied in active noise control.
If a scaling with respect to the input energy is desired, it can be realized with little effort.
For the NLMS algorithm, for example, the computation of the input power can be performed
recursively. Note that this only works reliably in fixed-point arithmetic; in floating-point
arithmetic such a recursion can accumulate cut-off and rounding errors. In this case a block
operation is useful: the recursion is implemented over a length of M, but at the same time
a new block sum is computed in parallel, and the correct results are taken over at block
boundaries to avoid a growing round-off effect.
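The recursion and its block-wise refresh can be sketched as follows (a NumPy illustration of the scheme described above; names are ours):

```python
import numpy as np

def recursive_power(x, M, refresh=None):
    """Sliding-window input power p(k) = ||x_k||^2 over the last M samples,
    updated recursively as p(k) = p(k-1) + |x(k)|^2 - |x(k-M)|^2.
    Every `refresh` samples a fresh block sum replaces the recursion,
    limiting round-off accumulation in floating point."""
    p = np.zeros(len(x))
    p[M - 1] = np.sum(np.abs(x[:M]) ** 2)
    for k in range(M, len(x)):
        if refresh is not None and k % refresh == 0:
            p[k] = np.sum(np.abs(x[k - M + 1:k + 1]) ** 2)   # fresh block sum
        else:
            p[k] = p[k - 1] + np.abs(x[k]) ** 2 - np.abs(x[k - M]) ** 2
    return p

rng = np.random.default_rng(3)
x = rng.standard_normal(200)
p = recursive_power(x, M=8, refresh=64)
```

Each recursive step costs only two MACs instead of M, which is what makes the normalization affordable in the NLMS update.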
At very high processing speeds even the low complexity of 2M MAC operations may still be
too high. Three variants with reduced complexity are in use, with the drawback of less
precision. The sign operation is applied to every element of the vector individually; for
complex-valued numbers it works independently on real and imaginary parts. Of course,
the algorithms change their behavior due to such brute-force modifications.
Note that next to complexity, the data rate can also become a problem. If the LMS
algorithm is operated in two steps (first the error computation, then the updates), then
for each x a value from ŵ needs to be loaded to compute the error. Then, after the error
is computed, again all values of x and ŵ need to be loaded, and finally ŵ stored. Even if two
parallel data buses are available, one for x and one for ŵ, the LMS algorithm can then not be
computed in 2M steps. In order to achieve the complexity of 2M steps, we have to include
some further tricks. As in transversal filters, the values in xk+1 are obtained by shifting all
elements of xk by one position and introducing a single new value. Taking advantage of
this property, we can start computing the update error for k + 1 already at time k. With this
we only require two load and one store operation per step.
Most research effort went into developing algorithms that learn faster. As we have
already seen, a correlated input process causes slow learning; a natural way to speed
up an algorithm is therefore to decorrelate its input first. For example, it is possible either
to know the correlation matrix Rxx of the input process beforehand, or to estimate it, and
then apply the following matrix step-size:

Newton-LMS: ŵk = ŵk−1 + µ ẽa(k) (R*xx)^{-1} x*k.
A better idea is shown in Figure 3.5 (after Schultheiss [18]). Here, a filter F is included
in such a way that, on the one hand, the input signal is decorrelated and, on the other hand,
the estimation problem remains unchanged. If the input process is speech, we can take
advantage of its short-time stationarity and compute a new optimal filter F every 10-20 ms.
The switch from the old to the new filter needs to be done cleverly so that no click is audible.
The filter length of F is typically 10-20 coefficients, thus very short compared to the
length of the adaptive filter w.
In the area of speech processing, in which typically many operations per sample are computed on a DSP, block operations are often applied. In this case not every sample is processed individually but a block of, say, 20 or 50 samples. In general this leads to a gain in complexity. For example, the error computation, which is a filtering process, can be computed by an FFT. A successful block-processing approach, suitable for echo compensation in hands-free telephones as well as in long-distance calls, are the so-called polyphase filter banks. They split the entire frequency band of interest into small bands, so-called sub-bands. Processing in sub-bands has the advantage that, due to the smaller bandwidth, one can operate at a lower sampling rate, and that the signals in the sub-bands are roughly white. Thus, next to the complexity reduction, there is also an increase in learning rate due to the decorrelation. The drawback of such filter banks is the additional delay, as block processing causes a delay according to its block length. Depending on the filter bank design it can amount to several tens of milliseconds.
Univ.Prof. DI Dr.-Ing. Markus Rupp 57
[Figure 3.5: Block diagram of the prefiltering structure after Schultheiss. The input x(k) drives the unknown system w, whose output plus the noise v(k) forms y(k); the decorrelation filter F is applied both in the input path to the adaptive filter ŵ and in the path of the error signal e(k) = y(k) − ŷ(k), so that the estimation problem remains unchanged.]
A further problem in hands-free telephony is double-talk detection (Ger.: Gegensprecherkennung). Presume that after successful adaptation the local speaker becomes active and thus the microphone signal increases dramatically. The DSP has to decide whether this is because the local speaker became active, or because the system has changed and the adaptation needs to continue in order to track such a system change. Both situations can also appear simultaneously. Modern algorithms thus have a so-called step-size control unit that tries to make the right decision and sets the step-size accordingly.
If block processing is not allowed (due to delay constraints), other means are required. One possibility is to extend the scalar step-size to a matrix, similar to the Newton-LMS. It does not need to be the inverse of the acf matrix; a diagonal matrix with well-chosen diagonal elements can also be advantageous. The diagonal elements can be chosen proportional to the expected weights, or selected adaptively depending on the estimated weights:
M_{k|k} = diag[ (ŵ_{k−1})_1, (ŵ_{k−1})_2, . . . , (ŵ_{k−1})_M ]   (3.75)

M_k = µ M_{k|k} / trace[M_{k|k}]   (3.76)

ŵ_k = ŵ_{k−1} + M_k ẽ_a(k) x_k^∗.   (3.77)
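The update (3.75)-(3.77) can be sketched as follows; the floor value `delta` is an assumption added here (in the spirit of PNLMS-type rules) so that the trace does not vanish at the all-zero start:

```python
import numpy as np

def proportionate_lms(x, d, M, mu, delta=0.01):
    """LMS with the diagonal matrix step-size (3.75)-(3.77): each diagonal
    element is proportional to the magnitude of the current weight estimate.
    The floor `delta` is an assumption added here (in the spirit of
    PNLMS-type rules) so that the trace cannot vanish at start-up."""
    w = np.zeros(M)
    for k in range(M - 1, len(x)):
        xk = x[k - M + 1:k + 1][::-1]
        e = d[k] - xk @ w
        g = np.maximum(np.abs(w), delta)   # diagonal of M_{k|k}, cf. (3.75)
        Mk = mu * g / g.sum()              # normalization by the trace, (3.76)
        w = w + Mk * e * xk                # diagonal matrix step-size, (3.77)
    return w
```

For a sparse system, the large taps receive most of the step-size budget, which is exactly why such rules speed up the learning of the dominant coefficients.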
Exercise 3.14 Formulate the LMS algorithm for a transversal filter so that it requires
only two load and one store operation.
Exercise 3.15 Show that the Newton-LMS algorithm under correlated excitation behaves like an LMS algorithm under white excitation.
Exercise 3.16 How does the filter structure in Figure 3.5 alter the distortion? Is it still a
system identification?
Matlab Experiment 3.3 Extend the algorithm from Experiment 3.2 by adding prefiltering after Schultheiss and compare the results. Is the learning rate the same as if white excitation were applied? Also implement the Newton-LMS algorithm and compare.
3.5 Literature
Good overviews on adaptive filters and the LMS algorithm can be found in [29, 80, 72, 42]. The original idea of the LMS algorithm goes back to Widrow and Hoff [78], although gradient-type algorithms in similar form can be found in older literature. The independence assumptions were introduced by [46]; in [10] a very complex method is derived to make exact predictions without the independence assumption, however the results are not in analytical form. The derivation of the parameter error vector shown here is based on [31, 17], although older work [75] was already moving along such paths. Extensions to the NLMS algorithm can be found in [3, 51] and to spherically invariant processes in [53]. A first analysis of the Sign-Error algorithm can be found in [8]. A good explanation of convergence with probability one is in [70]. A deeper understanding is provided in [72]. Polyphase filter banks for hands-free telephony were introduced by Kellermann [36] and for equalizers in [47]. The excitation with sinusoidal signals was introduced by [21, 9], and the feedback structure appeared in this form for the first time in [58]. The PNLMS can be found in [13, 20, 65].
Chapter 4
Next to the LMS algorithm, the Recursive Least Squares (RLS) algorithm is the most prominent one. The problems that come with RLS often exclude it from practical applications. As the RLS algorithm is just a recursive implementation of the LS problem, its properties are identical to those of a classic LS solution. We will therefore start with a brief introduction to the problem of least squares.
d_N = X_N w_o + v_N.   (4.1)
Here we wrote N observations d(k) = wTo xk + v(k), k = 1..N in vectors and matrices:
Note that the solution depends on the number of observations N. If the value of N is not fixed but grows with time, the problem is called LS with a growing window. The solution is given by
wLS,N = wN = arg min kdN − XN ŵk22 . (4.7)
ŵ
As we only discuss LS estimation in this chapter, we will leave out the index 'LS'. We keep, however, the index N, as it not only indicates how many observations we are using but, for a growing window, also denotes the time. If not indicated otherwise, we will assume in the following that
• The system of equations is underdetermined, that is M > N: we have fewer observations than parameters to estimate. In this case we assume rank(X_N) = N.
Differentiating the quadratic form (4.7) with respect to the unknown vector leads to the following orthogonality condition:
∂‖d_N − X_N ŵ‖_2^2 / ∂ŵ = −(d_N − X_N ŵ)^H X_N = 0.   (4.9)

∂²‖d_N − X_N ŵ‖_2^2 / ∂ŵ² = X_N^H X_N > 0.   (4.11)
The form X_N^H X_N > 0 indicates that the matrix X_N^H X_N is positive definite. The minimum cost function can be computed without explicitly knowing the LS solution:
g_LS(ŵ_LS,N) = ‖d_N‖_2^2 − d_N^H X_N (X_N^H X_N)^{−1} X_N^H d_N.   (4.13)
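This identity is easy to check numerically, assuming the overdetermined case in which X_N^H X_N is invertible; a real-valued NumPy sketch with randomly chosen data:

```python
import numpy as np

# Overdetermined example: more observations than parameters, X^T X invertible.
rng = np.random.default_rng(3)
N, M = 20, 4
X = rng.standard_normal((N, M))
d = rng.standard_normal(N)

# LS solution w = (X^H X)^{-1} X^H d and its residual cost ||d - X w||^2
w = np.linalg.solve(X.T @ X, X.T @ d)
cost_at_solution = np.linalg.norm(d - X @ w) ** 2

# Closed form (4.13): ||d||^2 - d^H X (X^H X)^{-1} X^H d
cost_closed_form = d @ d - d @ X @ np.linalg.solve(X.T @ X, X.T @ d)

assert np.isclose(cost_at_solution, cost_closed_form)
```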
Compare this equation with the corresponding equations of the steepest-descent method. Except for the expectation values they appear formally identical. The various terms can also be related:

g_LS(ŵ_N) = g_LS(ŵ_LS,N) + (ŵ_LS,N − ŵ_N)^H X_N^H X_N (ŵ_LS,N − ŵ_N).   (4.14)
Both equations, (4.10) and (4.12), can be combined into the so-called normal equations:

[ d_N^H d_N    d_N^H X_N ] [  1   ]   [ g_LS(ŵ_N) ]
[ X_N^H d_N    X_N^H X_N ] [ −ŵ_N ] = [     0     ].   (4.15)
4.1.1 Existence
Before we continue with closed form solutions, we have to consider existence and
uniqueness of the LS solution.
Lemma 4.1 If X_N is of full rank M with N ≥ M, then the solution is unique and given by ŵ_LS,N = [X_N^H X_N]^{−1} X_N^H d_N.
If X_N is not of full rank, then the normal equations have more than one solution, of which any two solutions ŵ_1 and ŵ_2 differ by a vector in the nullspace of X_N, thus X_N[ŵ_2 − ŵ_1] = 0.
z = X_N^H X_N p = X_N^H q.

The columns of X_N^H X_N span a space (the column range) to which the vector z belongs. If there is a non-zero vector p for which X_N^H X_N p = 0, then such a vector p belongs to the null space of X_N^H X_N. Assuming that X_N is of full rank M, the nullspace contains only the zero vector. As X_N and X_N^H X_N have the same nullspace (see also Appendix F), there cannot exist a non-zero vector for which X_N p = 0. Thus, with the choice of the vector q the vector p is uniquely defined, and vice versa.
In the second part we assume that X_N is not of full rank M. Then X_N^H X_N is also not of full rank, and non-zero solutions p exist in the null space of X_N^H X_N, to which correspond the same solutions in the nullspace of X_N (as it is the same nullspace). The choice of q is now not uniquely defined by p.
The matrix X_N X_N^H spans a smaller space and thus the nullspace becomes larger. A multitude of solutions exists, among which those with minimum norm are of most interest. With the Singular Value Decomposition (SVD) it can be shown [29] that the solution (4.16) has minimum norm.
4.1.2 LS Estimation
Lemma 4.2 Consider the following linear model
dN = XN wo + vN . (4.17)
• E[ŵ_LS,N] = w_o,
• the LS estimator is the best linear unbiased estimator (BLUE).
The first expectation over the noise results in the unit matrix and finally we find the desired
result.
The third property is shown by assuming an arbitrary linear estimator B, thus
w̄ = BdN . (4.18)
w̄ = wo + BvN . (4.19)
Lemma 4.3 A set of input vectors {x_k, k > 0} is persistently exciting if positive numbers α, β, N_o exist so that

α I < Σ_{k=n}^{n+N_o} x_k x_k^H ≤ β I,   for all n.   (4.23)
and Q ∈ IR^{N×N} to be symmetric and positive definite. We then obtain the following cost function:
The choice of the window length N determines the solution. If N < M, we have an underdetermined problem and we need to take solution (4.16). Alternatively, such a solution is useful if the algorithm is not excited persistently. This underdetermined algorithm is also known in the literature under the name Affine Projection Algorithm (APA). For the special case N = 1 we obtain the NLMS algorithm with step-size µ(k) = 1/‖x_k‖_2^2.
4.1.5 Summary
Finally, we would like to compare the RLS algorithm in Table 4.1 with respect to a stochastic and a deterministic view. In order to achieve this we extended (4.17) in such a way that the estimated system is considered a random variable. Table 4.1 reveals a few unexpected analogies.
stochastic                                              deterministic
d = Xw + v                                              d = Xw + v
m_w = E[w]                                              w_o = w̄
E[(w − m_w)(w − m_w)^H] = R_ww                          Π_o
m_v = E[v]                                              v_o
E[(v − m_v)(v − m_v)^H] = R_vv                          Q^{−1}
m_d = X m_w + m_v                                       d_o = X w_o + v_o
ŵ                                                       ŵ
min_K ‖w − m_w − K(d − m_d)‖_2^2                        min_w (w − w_o)^H Π_o^{−1} (w − w_o) + ‖d − Xw − v_o‖_Q^2
K_o = R_ww X^H [X R_ww X^H + R_vv]^{−1}                 K_o = Π_o X^H [X Π_o X^H + Q^{−1}]^{−1}
K_o = [R_ww^{−1} + X^H R_vv^{−1} X]^{−1} X^H R_vv^{−1}  K_o = [Π_o^{−1} + X^H Q X]^{−1} X^H Q
ŵ = K_o [d − X m_w − m_v]                               ŵ = K_o [d − X w_o − v_o]

Table 4.1: Comparison of terms of the LS algorithm in stochastic and deterministic descriptions.
Exercise 4.3 Based on the statement of Lemma 4.1, show that the LS algorithm is also capable of linear prediction. For this consider the autoregressive random process of order P

x(k) = Σ_{i=1}^{P} x(k − i) a(i) + v(k)   (4.28)

and estimate its coefficients a(i) by LS. Show the estimator's properties by formulating the random process in vector notation of length M > P. Show that for such vectors v̂_k = x_k − X_k â we have

[ x_k^T ; X_k^T ] v̂_k = ‖v̂_k‖_2^2 [ 1 ; 0 ].   (4.29)
Exercise 4.4 To transmit data over a wireless channel, a constant modulus signal (|x(k)| =
1) is employed. The best channel estimation can be achieved if trace([XNH XN ]−1 ) becomes
minimal. By which property(ies) of the transmitted training signal can this be achieved?
Exercise 4.5 Minimize the following cost function with constraint
kŵk − ŵk−1 k22 + λkdk − XP (k)ŵk k22 , (4.30)
for optimal ŵk . Let XP (k) be a matrix with instantaneous value xk and past values
xk−1 ...xk−P +1 .
Consider alternatively the formulation
kŵk − ŵk−1 k22 + λT [dk − XP (k)ŵk ] + λH [dk − XP (k)ŵk ]∗ (4.31)
With the help of the Matrix Inversion Lemma 2.4 we can also write this recursion in its inverted form and obtain:

P_{N+1} = P_N − (P_N x_{N+1}^∗ x_{N+1}^T P_N) / (1 + x_{N+1}^T P_N x_{N+1}^∗),   P_0 = Π_o.   (4.38)
We now recognize the meaning of our initial certainty parameter Π_o. This recursion does not require a matrix inversion; that is, instead of O(M³) we only require O(M²) operations.
Substituting the recursive form into Eqn. (4.35) results in the following:

ŵ_{N+1} = P_{N+1} [X_N^H d_N + x_{N+1}^∗ d(N + 1)]   (4.39)
        = [ P_N − (P_N x_{N+1}^∗ x_{N+1}^T P_N)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ] [X_N^H d_N + x_{N+1}^∗ d(N + 1)]   (4.40)
        = ŵ_N − (P_N x_{N+1}^∗ x_{N+1}^T)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ŵ_N + P_N x_{N+1}^∗ [ 1 − (x_{N+1}^T P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) ] d(N + 1)
        = ŵ_N + (P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) [ d(N + 1) − x_{N+1}^T ŵ_N ].   (4.41)

Here we used P_N X_N^H d_N = ŵ_N and 1 − (x_{N+1}^T P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗) = 1/(1 + x_{N+1}^T P_N x_{N+1}^∗).
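The recursions (4.38) and (4.41) together give the growing-window RLS; a real-valued sketch (the variable names are illustrative):

```python
import numpy as np

def rls_growing_window(X, d, Pi0):
    """Growing-window RLS via recursions (4.38) and (4.41) (real-valued).
    X holds one regression vector per row; Pi0 is the initial 'certainty'
    matrix P_0 = Pi_o."""
    P = Pi0.copy()
    w = np.zeros(X.shape[1])
    for xk, dk in zip(X, d):
        denom = 1.0 + xk @ P @ xk                  # 1 + x^T P x = 1/gamma
        k_vec = P @ xk / denom                     # gain vector, cf. (4.42)
        w = w + k_vec * (dk - xk @ w)              # update (4.41)
        P = P - np.outer(P @ xk, xk @ P) / denom   # recursion (4.38)
    return w, P
```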
This description is not so different from that of the LMS algorithm. The essential difference is the new regression vector, and thus a different update direction than before. In the RLS algorithm this direction depends on the previous directions. Let us consider the regression vector
k_{N+1} = (P_N x_{N+1}^∗)/(1 + x_{N+1}^T P_N x_{N+1}^∗)   (4.42)
        = P_{N+1} x_{N+1}^∗   (4.43)
        = P_N x_{N+1}^∗ γ(N + 1).   (4.44)
In the matrix P_{N+1} all past values of the vectors x_k, k = 1..N are gathered. The scalar term γ(N + 1) is called the conversion factor. We have:
γ(N + 1) = 1/(1 + x_{N+1}^T P_N x_{N+1}^∗).   (4.45)

P_{N+1} = P_N − (k_{N+1} k_{N+1}^H)/γ(N + 1);   P_0 = Π_o.   (4.46)
A further interesting connection exists between the a-priori and the a-posteriori error. By substituting the recursion (4.41) into the definition of the a-priori error, we obtain

ẽ_p(N + 1) = γ(N + 1) ẽ_a(N + 1).

As γ(N + 1) is strictly smaller than one, the a-posteriori error is always smaller than the a-priori error in magnitude.
With the last substitution from (4.49) we also recognize that the conversion factor
γ(N + 1) is real valued.
The index N only indicated that the observation duration stretches over N elements in the past. The step-size µ and the regularization parameter ǫ are both assumed to be positive. A time index k is now required to distinguish the various terms. For the classical RLS algorithm this was not required, as with a growing window N also defined the time.
Note that the inner matrix X_N^H(k) X_N(k) is of dimension M × M. Selecting an observation window N < M results in an underdetermined system. Due to the positive regularization parameter ǫ > 0 the modified matrix ǫI + X_N^H(k) X_N(k) can be inverted.
However, it is not necessary to invert a matrix of the large dimension M × M, as we can apply the matrix inversion lemma:

[ǫI + X_N^H(k) X_N(k)]^{−1} X_N^H(k) = X_N^H(k) [ǫI + X_N(k) X_N^H(k)]^{−1}.   (4.60)
We thus only have to invert a matrix of dimension N × N. The update equation of the ǫ-APA is thus given by:

ŵ_N(k) = ŵ_N(k − 1) + µ X_N^H(k) [ǫI + X_N(k) X_N^H(k)]^{−1} [d_N(k) − X_N(k) ŵ_N(k − 1)].   (4.61)
Two special cases are of particular interest and will be discussed next. The first case is
for N = 1. We then obtain the ǫ−LMS algorithm and recognize that it can be interpreted
as a special case of the ǫ-APA with observation window length one.
The second special case is given for µ = 1 and ǫ = 0. If the matrix X_N(k) X_N^H(k) is of full rank for every k, it can be inverted and the algorithm will converge. The dependence of the a-posteriori errors on the a-priori errors is now of interest. Collecting N values in a vector we find:
ẽ_a(k) ≜ d_N(k) − X_N(k) ŵ_N(k − 1),   ẽ_p(k) ≜ d_N(k) − X_N(k) ŵ_N(k).   (4.62)
Substituting this into the update equation, we obtain the desired relation:

ẽ_p(k) = [ I − X_N^H(k) (X_N(k) X_N^H(k))^{−1} X_N(k) ] ẽ_a(k).   (4.63)
The matrix I − X_N^H(k)(X_N(k) X_N^H(k))^{−1} X_N(k) is a so-called projection matrix. Consider a vector z comprising two orthogonal components: a linear combination of the columns of X_N^H(k) and a vector orthogonal to the subspace spanned by X_N^H(k), thus z = y + X_N^H(k) x. Multiplying such a vector by the projection matrix, we recognize that it becomes smaller by the amount of the linear combination in X_N^H(k); thus x disappears. On the other hand, the orthogonal part y remains unchanged. We can further show that the a-priori error vector ẽ_a(k) is composed only of a linear combination in X_N^H(k), and thus the a-posteriori error must vanish: ẽ_p(k) = 0.
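The two properties of the projection matrix, that the in-span component disappears while the orthogonal component passes unchanged, can be checked numerically (real-valued sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 3, 8                        # underdetermined window: N < M
X = rng.standard_normal((N, M))

# Projection matrix onto the orthogonal complement of the span of X^H
P = np.eye(M) - X.T @ np.linalg.solve(X @ X.T, X)

xcoef = rng.standard_normal(N)
in_span = X.T @ xcoef              # component inside the span of X^H
y = rng.standard_normal(M)
y = y - X.T @ np.linalg.solve(X @ X.T, X @ y)   # component orthogonal to it
z = y + in_span

# The in-span part disappears, the orthogonal part remains unchanged:
assert np.allclose(P @ z, y)
assert np.allclose(P @ in_span, 0.0)
```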
With this property the cost function of the APA can also be described as:
min ‖ŵ_N(k) − ŵ_N(k − 1)‖ subject to the constraint ẽ_p(k) = 0.
There exists an infinite number of vectors ŵ_N(k) that solve ẽ_p(k) = 0 (all with different orthogonal parts y). This set of solutions is denoted an affine subspace (also: hyperplane, manifold) to indicate that the plane defined by the set of solutions does not necessarily pass through ŵ_N(k) = 0. For the special case N = 1 we say that the APA (NLMS algorithm with normalized step-size α = 1) obtains the solution ŵ_N(k) by a projection onto the affine subspace. For N > 1 the solution is the intersection of all these affine subspaces. The APA thus finds its solution by projection onto the intersection of all affine subspaces. Note that such projection properties are lost for the overdetermined case N > M, thus for the RLS algorithm.
Exercise 4.6 Derive the recursive form of the LS algorithm with sliding rectangular
window (4.26).
Exercise 4.7 Derive the recursive form of the LS algorithm with exponential window:

g_LS(ŵ) = Σ_{i=1}^{N} λ^{N−i} |ẽ_a(i)|² = Σ_{i=1}^{N} λ^{N−i} |d(i) − x_i^T ŵ|².   (4.64)
Exercise 4.8 Is it possible to derive for the LMS algorithm a relation like in (4.41)
between a-priori- and a-posteriori error?
Exercise 4.9 Derive the LMS algorithm from the Steepest-Descent algorithm for the fol-
lowing estimates:
R̂_xx = (1/N) Σ_{l=0}^{N−1} x_{l−k}^∗ x_{l−k}^T   (4.65)

r̂_xd^∗ = (1/N) Σ_{l=0}^{N−1} x_{l−k}^∗ d(l − k).   (4.66)
Assume for this derivation the driving process x(k) to be stationary. Under which step-size
condition do you obtain the RLS algorithm?
d N = XN w o + v N . (4.67)
As the LS problem (the same goes for its recursive form) is solved in the form of a set of linear equations of order M, a stationary solution will exist after N = M steps, whose value then only fluctuates with the noise terms. When applying the recursive form, we typically start with zero vectors x_k (initialized by zero entries), and thus for the first M steps the algorithm cannot work properly. We thus require 2M steps from the zero start to the convergence of the algorithm. This is a substantial speed-up compared to the LMS algorithm, even compared to the Newton-LMS algorithm, for which the RLS algorithm can be viewed as an approximation: P^{−1} ≈ R_xx. Similar to the Newton-LMS algorithm, the learning speed is independent of the correlation of the driving sequence.
To compute the steady-state misadjustment, we consider the RLS algorithm with exponentially decaying window, as this is its most common form. Its update is given by:

ŵ_k = ŵ_{k−1} + k_k [d(k) − x_k^T ŵ_{k−1}],   (4.68)

k_k = (λ^{−1} P_{k−1} x_k^∗)/(1 + λ^{−1} x_k^T P_{k−1} x_k^∗),   (4.69)

P_k = λ^{−1} [P_{k−1} − k_k x_k^T P_{k−1}].   (4.70)
Note that we changed the notation slightly: instead of an index N that denoted time as well as growing size before, the index now denotes the time instant k. On the one hand we wish to point out the formal similarity to the LMS algorithm, but on the other hand we also wish to point out that there is no growing window of size N any more.
We consider the additive noise as a random process, by which the estimate ŵ_k now also becomes a random process. From the last line in (4.25) and the reference model we find, for Π^{−1} = 0, Q = Λ and arbitrary σ_v²,
lim_{k→∞} E_v[(w_o − ŵ_k)(w_o − ŵ_k)^H] = lim_{k→∞} [X_{k−1}^H Λ X_{k−1}]^{−1} [X_{k−1}^H Λ² X_{k−1}] [X_{k−1}^H Λ X_{k−1}]^{−1} σ_v².   (4.71)
The diagonal entries of the matrix Λ are Λ_ii = λ^{k−i}. Correspondingly, the driving process x(k) can be viewed as a random process, which allows us to compute the expectations in X_{k−1} (approximately):

E_x{ [X_{k−1}^H Λ X_{k−1}]^{−1} [X_{k−1}^H Λ² X_{k−1}] [X_{k−1}^H Λ X_{k−1}]^{−1} }
   ≈ ( R_xx Σ_{i=1}^{k} λ^{k−i} )^{−1} [ R_xx Σ_{i=1}^{k} λ^{2k−2i} ] ( R_xx Σ_{i=1}^{k} λ^{k−i} )^{−1}.   (4.72)
The steady-state value of the a-priori error is found to be (assuming σ_x² = 1):

lim_{k→∞} E[|ẽ_a(k)|²] = σ_v² + lim_{k→∞} tr{ E[(w_o − ŵ_k)(w_o − ŵ_k)^H] R_xx }   (4.76)
                       = σ_v² ( 1 + M (1 − λ)/(1 + λ) )   (4.77)

and thus the misadjustment

m_LS = M (1 − λ)/(1 + λ).   (4.78)
Note that for real- and complex-valued Gaussian processes more precise expressions exist, as the terms can be modeled as a Wishart process. Typically the expressions provided here are valid for dimensions M ≥ 10.
Matlab Experiment 4.1 Repeat Matlab Experiments 3.1 and 3.2, however with expo-
nentially weighted RLS algorithm. Instead of various step-sizes, apply forgetting factors λ
in the range [0.7..1.0]. Compare experimental results with theoretical predictions.
Exercise 4.10 Compute the exact expression of (4.72) in case the forgetting factor is one. Assume the driving process to be a zero-mean Gaussian process. What is then obtained for misadjustment and mismatch?
the application in sub-bands. As previously described, such methods divide the problem into independent bands of smaller bandwidth. If an FTF is run in such a sub-band, it also becomes unstable. If, however, the re-starting points of the sub-bands are shifted in time relative to each other, the instability in each sub-band will start at a different time. This can be detected, and then the algorithm in each sub-band is restarted from zero. During this phase a simple LMS algorithm can take over the updates [23].
4.5 Literature
Good tutorials on LS and RLS algorithms can be found in [29]. Polyphase implementations
of FTF versions are explained in [23]. Detailed descriptions to various implementations are
in [67]. Details to CORDIC implementations are in [2].
Chapter 5
Until now our assumption was that the system under consideration is time-invariant, thus a fixed w_o. But not all systems are fixed. Due to aging and temperature, the properties of systems alter slowly. Some systems change fast: the loudspeaker-room-microphone system changes rapidly if the speaker moves through the room. Also the wireless channel may change rapidly with a moving receiver or moving scattering objects. Next to the initial learning or transient response there is also a tracking behavior (Ger.: Nachführverhalten). This describes how an adaptive filter reacts if the system is permanently changing. One possibility to describe such behavior in a general form is to assume a rotational change, as described in the following:

d(k) = x_k^T w_o e^{jΩ_o k} + v(k).   (5.1)

The system w_o that is to be estimated now rotates with an unknown frequency Ω_o. A direct application of this formulation is given in wireless channel estimation under frequency offset. As the receiver utilizes a different oscillator than the transmitter, a frequency offset Ω_o occurs. We will recognize that the reaction of the adaptive system to such a rotation is typically of linear nature. This means that the reaction to an arbitrarily changing time-variant system can be treated as the superposition of individual rotational components. It is thus sufficient to analyze the behavior for a single rotation at frequency Ω_o.
w̃_k ≜ w_o e^{jΩ_o k} − ŵ_k.   (5.2)
75
The update equations for the LMS and RLS algorithm can be provided in a unified form:
The vector g_k is simply µ x_k for the LMS algorithm and k_k = P_k x_k in the case of the RLS, thus

g_k = µ x_k (LMS);   g_k = P_k x_k (RLS).   (5.6)
In the next step, we consider the signals v(k) and xk as random processes, thus v(k) and
xk . We can now compute the expectation with respect to the driving source:
Theorem 5.1 The stationary solution for the LMS and RLS algorithms, for a system that changes periodically with frequency Ω_o, is given in the mean by:

E[ŵ_k] = { I − (e^{jΩ_o} − 1) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) } w_o e^{jΩ_o k}.   (5.8)
Proof: As E[w̃k ] is an output of a linear system, we must have a solution of the form:
Thus, the expectation of the parameter error vector also becomes periodically time-variant. For k → ∞ the initial transients disappear and the mean parameter error vector eventually becomes

E[w̃_k] = (1 − e^{−jΩ_o}) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) w_o e^{jΩ_o (k+1)},   (5.11)
We can formulate the result (5.12) for an arbitrarily small frequency range dΩ:

dE[ŵ_k(Ω)] = { I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A) } w_o(Ω) e^{jΩk} dΩ.
This interpretation allows the computation of the algorithmic response to arbitrary system
changes:
E[ŵ_k] = (1/2π) ∫_{−π}^{π} { I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A) } w_o(Ω) e^{jΩk} dΩ.   (5.13)
The kernel of this integral is the Fourier transform of the algorithmic response, also called the Fourier transform G of the Green's function of the LMS/RLS algorithm:

G(Ω) ≜ I − (e^{jΩ} − 1) [e^{jΩ} I − (I − A)]^{−1} (I − A).
The Green's function in the mean is thus obtained by the inverse Fourier transform:

g(k) = [I − (I − A)^k] u(k) − [I − (I − A)^{k−1}] u(k − 1)
with the unit step u(k). In other words: the algorithmic response to arbitrarily changing systems can be computed by a convolution with the Green's function g(k). Two well-known results can be obtained from this:
• In case of a frequency offset we have w_o(Ω) = w_o δ(Ω − Ω_o) and obtain:

  E[ŵ_k] = { I − (e^{jΩ_o} − 1) [e^{jΩ_o} I − (I − A)]^{−1} (I − A) } w_o e^{jΩ_o k}.

• In the initial phase of the adaptation we have w_o(Ω) = w_o/[1 − e^{−jΩ}] and obtain

  E[ŵ_k] = [ I − (I − A)^{k+1} ] w_o.
Theorem 5.2 Under white excitation the LMS and RLS algorithm show identical tracking
behavior in the mean.
Proof: We obtain A = µI for the LMS algorithm and [1 − λ]I for the RLS. In other words
the choice µ = 1 − λ results in the same tracking behavior.
Matlab Experiment 5.1 A system wTo = [1, 10, 1] excited by a white sequence is changing
periodically with frequency Ωo . Compute the algorithmic response of the LMS and the RLS
algorithm as a function of the frequency. Simulate this in Matlab and verify your result.
Plot the relative parameter error vector for a range of Ω over µ and 1 − λ. How can ŵk be
used to estimate the unknown frequency Ω?
The unknown system w_k varies according to (5.14), driven by the signal u_k. The output of the system is a linear combination of the system and the input X_k, with an additional noise term v_k. In the general case the output values are also vectors. Such systems are often called Multiple-Input Multiple-Output (MIMO) systems.
In order to estimate such a system w_k, the adaptive algorithm should reflect as much of the a-priori knowledge as possible, for example the state-space form. In the recursive update part of the adaptive algorithm, we introduce a prediction component F_k ŵ_k. As we now have errors in vector form, we have to introduce an optimal matrix step-size M_k. The optimal adaptation algorithm is thus given by:
The only open problem is now to find the optimal step-size matrix M_k.
For this we assume that the signals uk and v k are random processes. We will further
assume that
E[v_k v_i^H] = R_vv δ(k − i);   E[u_k u_i^H] = R_uu δ(k − i);   E[v_k u_i^H] = 0.   (5.18)
E[w_0 v_i^H] = 0;   E[w_0 u_i^H] = 0;   E[w_0 w_0^H] = P_0.   (5.19)
In order to ensure a unique solution we also have to assume that R_vv is positive definite, a condition that is in general satisfied. Note that other, more relaxed conditions are possible as well; the solution only becomes more and more difficult to interpret.
then we can also compute the covariance matrix of the a-priori error vector:
E[ẽ_{a,k} ẽ_{a,k}^H] = R_vv + X_k P_{k−1} X_k^H = R_ee.   (5.23)
The optimal step-size matrix can be found by minimizing the recursion of the covariance
matrix with respect to Mk :
Having this available we can now formulate the equation for the parameter covariance
matrix:
P_k = F_k P_{k−1} F_k^H − M̄_k R_ee M̄_k^H + G_k R_uu G_k^H.   (5.28)
Eventually the complete Kalman algorithm is obtained.
w_k = F_k w_{k−1} + G_k u_k   (5.29)
d_k = X_k w_{k−1} + v_k   (5.30)

with the conditions

E{ [u_k; v_k; w_0; 1] [u_i^H, v_i^H, w_0^H, 1] } = diag( R_uu δ(k − i), R_vv δ(k − i), P_0, 1 ).   (5.31)
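A sketch of one Kalman recursion for the model (5.29)-(5.30); the gain expression M̄_k = F_k P_{k−1} X_k^H R_ee^{−1} is written here as an assumption consistent with (5.23) and (5.28), since the intermediate equations are not repeated in this section:

```python
import numpy as np

def kalman_step(w_hat, P, d, X, F, G, Ruu, Rvv):
    """One recursion of the Kalman algorithm for the state model
    (5.29)-(5.30).  The gain M = F P X^H Ree^{-1} is an assumed form,
    consistent with (5.23) and (5.28)."""
    e_a = d - X @ w_hat                       # a-priori error vector
    Ree = Rvv + X @ P @ X.T                   # error covariance, (5.23)
    M_bar = F @ P @ X.T @ np.linalg.inv(Ree)  # gain (assumed form)
    w_hat = F @ w_hat + M_bar @ e_a           # prediction plus correction
    P = F @ P @ F.T - M_bar @ Ree @ M_bar.T + G @ Ruu @ G.T   # (5.28)
    return w_hat, P
```

The matrix inversion of R_ee in the gain is what makes the general complexity O(M³) when the observation dimension grows with M.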
Note that the application of the Kalman algorithm requires the assumption of random signals. This was not the case for the previous algorithms. In the LMS algorithm we needed the statistics only for finding the optimal step-size parameters and for computing its properties. The RLS worked entirely without randomness. We should further mention that the complexity of the Kalman algorithm is O(M³), as we have to invert a matrix. Only in very simple cases can this be avoided. Classically the Kalman algorithm is found in automation control, where the time-variant nature of the system is somewhat known and there is sufficient time to compute the algorithmic equations between two observation samples. In recent years further applications have included satellite communications and adaptive equalizers in wireless systems. Often the time-variant behavior is only partly known or known only approximately. Then, next to the Kalman equations, estimates for the system parameters are also required, for example for the matrix F_k. This is called the extended Kalman filter. In automation control the term LQC (linear quadratic control) is used to describe that the controller is linear and the cost function quadratic. Here the actuating variable {u_k} in (5.29) is to be tuned so that
g_LQC(u_{N+1}) = w_{N+1}^H P_{N+1} w_{N+1} + Σ_{k=1}^{N} u_k^H R_vv u_k + Σ_{k=1}^{N} d_k^H R_dd d_k   (5.35)
becomes minimal. Newer research is found under the name Model Predictive Control. It can be shown that the substitution of F_k by F_k^∗, H_k^∗ by −G_k, and G_k^∗ by H_k results in the Kalman algorithm as solution. These two problems are said to be dual.
Exercise 5.2 For the special case Fk = I, Gk = 0, Rvv = 1 and Xk = xTk derive the
Kalman equations and compare them with the RLS algorithm.
Matlab Experiment 5.2 Repeat Matlab Experiment 5.1 and apply now the Kalman
algorithm. Compare the results with the previous ones.
5.3 Literature
First publications on the tracking behavior of adaptive algorithms are found in [14, 15, 24, 45]. A good introduction to Kalman filters is given in [29] and [34, 67]. The original paper by Kalman is [35]. In [1] applications of the algorithm are described.
Chapter 6
Generalized LS Methods
We have considered several variants of LS solutions: with and without initial estimates,
under and over-determined systems, with and without weighting. When using weighting,
we made sure the weighting matrix was positive definite: Q > 0. We now address the
general question of what form such weighting matrices can have and what consequences
this can have for the solution. In particular we are interested in recursive forms of the
algorithms. To this end, let us start again with the standard LS problem:
d_N = X_N w_o + v_N.   (6.1)
We have collected N observations d(k) = wTo xk + v(k), k = 1..N in vectors and matrices:
g_WLS(ŵ_N) = ŵ_N^H Π_o^{−1} ŵ_N + (d_N − X_N ŵ_N)^H Q (d_N − X_N ŵ_N).   (6.6)

Possible extensions including initial values as in (4.24) are straightforward and are not shown here, to keep the focus on the important aspects. We re-formulate the cost function:

g_WLS(ŵ_N) = [ d_N ; ŵ_N ]^H [ Q , −Q X_N ; −X_N^H Q , Π_o^{−1} + X_N^H Q X_N ] [ d_N ; ŵ_N ]   (6.7)
Obviously the point ŵ_N = w̄_N is special and deserves our interest. We consider three cases:

1. Π_o^{−1} + X_N^H Q X_N > 0: the second term in (6.8) is non-negative for every choice of ŵ_N and reaches zero if and only if ŵ_N = w̄_N. Thus the cost function satisfies

g_WLS(ŵ_N) ≥ d_N^H { Q − Q X_N [Π_o^{−1} + X_N^H Q X_N]^{−1} X_N^H Q } d_N

with equality only if ŵ_N = w̄_N. In this case we have a global minimum with a unique solution.

2. Π_o^{−1} + X_N^H Q X_N < 0: the second term in (6.8) is non-positive for every choice of ŵ_N and reaches zero if and only if ŵ_N = w̄_N. The cost function thus satisfies

g_WLS(ŵ_N) ≤ d_N^H { Q − Q X_N [Π_o^{−1} + X_N^H Q X_N]^{−1} X_N^H Q } d_N

with equality only if ŵ_N = w̄_N. We have a global maximum with a unique solution.
3. Π_o^{−1} + X_N^H Q X_N indefinite: at least one eigenvalue of Π_o^{−1} + X_N^H Q X_N is negative and at least one is positive. Starting at the point ŵ_N = w̄_N, we run uphill in one direction (the eigenvector corresponding to a positive eigenvalue) while we run downhill in another direction (the eigenvector corresponding to a negative eigenvalue). Such a point ŵ_N = w̄_N is called a saddle point.
In each case the point ŵ_N = w̄_N is special. It is thus called a critical or stationary point. We have not yet considered the case of vanishing eigenvalues, but such non-invertible matrices can also be included in our considerations.
This formulation is analogous to the formulation for the RLS algorithm, only that the new components are not vectors but matrices. The vector d̃_N = [d_1(N), d_2(N)], for example, may comprise two components. We again introduce a matrix P_N:

P_N = [Π_o^{−1} + X_N^H Q_N X_N]^{−1};   P_0 = Π_o.   (6.9)
For a solution to exist at each time instant N, it must be guaranteed that P_N > 0. We thus obtain the following recursive algorithm:
Generalized RLS Algorithm: Given a regular matrix Π_o and a regular weighting matrix Q_N with block structure, the solution of the minimization problem

    min_{ŵ_N} g_VLS(ŵ_N)

can be obtained by a recursive algorithm. Start with ŵ_0 = 0 and P_0 = Π_o. Then we have
for k > 0:
    Γ_k = [Q̃_k^{-1} + X̃_k P_{k-1} X̃_k^H]^{-1}   (6.10)
    K_k = P_{k-1} X̃_k^H Γ_k   (6.11)
    ŵ_k = ŵ_{k-1} + K_k [d̃_k − X̃_k ŵ_{k-1}]   (6.12)
    P_k = P_{k-1} − K_k Γ_k^{-1} K_k^H.   (6.13)
For each time instant 0 ≤ k ≤ N we find ŵ_k to be the desired minimum of g_VLS(ŵ_k) if and only if P_k > 0, i.e., positive definite. Compare this form with (4.46).
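As a sanity check, the recursion (6.10)–(6.13) can be sketched as follows. The dimensions, the noise level, and the choices Π_o = I and Q̃_k = I are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, p = 3, 200, 2            # parameters, time steps, block size (illustrative)
w_true = rng.standard_normal(M)

P = np.eye(M)                  # P_0 = Pi_o, chosen as identity here
w = np.zeros(M)                # w_0 = 0
Qt = np.eye(p)                 # block weight Q~_k, identity for simplicity

for _ in range(N):
    X = rng.standard_normal((p, M))                # block regressor X~_k
    d = X @ w_true + 0.01 * rng.standard_normal(p)
    Gamma = np.linalg.inv(np.linalg.inv(Qt) + X @ P @ X.T)   # (6.10)
    K = P @ X.T @ Gamma                                      # (6.11)
    w = w + K @ (d - X @ w)                                  # (6.12)
    P = P - K @ np.linalg.inv(Gamma) @ K.T                   # (6.13)
    # P_k > 0 guarantees that w_k is the desired minimum
    assert np.all(np.linalg.eigvalsh(P) > 0)

print(np.allclose(w, w_true, atol=0.05))
```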
Note that this last condition was always guaranteed for the RLS algorithm under the assumption of persistent excitation. Now the condition must be required explicitly and also checked at each time instant k. Computing the eigenvalues of a matrix is very complex; there are alternative methods (matrix inertia) just for testing definiteness.
We can again derive a relation between the a-priori and the a-posteriori errors:

    ẽ_{p,k} = Q̃_k^{-1} Γ_k ẽ_{a,k}.   (6.14)
6.2 Robustness
All adaptive algorithms can be driven by random processes, whereby a mean squared error is minimized. In many applications, for example speech processing, this is an appropriate measure. In other applications, however, it may be inappropriate. Consider for example a milling machine whose cutting depth is controlled in the mean or by another statistical measure: with a certain probability the desired depth would then be exceeded. Even worse is the situation for an autopilot in an airplane. If an airplane flies correctly only in the mean, it may very well crash. Here we need a measure that can ensure a certain robustness against the worst case.
Such a measure can be defined in terms of energy (or power). Let us again consider the input and output of a linear system.

Here the initial error in terms of the (usually unknown) initial conditions, the control signal u_k, and the additive noise are the inputs, while the a-priori error energy is the output. More generally we can formulate this as:
We can for example select Z_k = X_k, which means that we observe the distorted as well as the undistorted system. If it is possible to find an algorithm such that the value k_a(l) remains below a threshold for every l, that is

then such an algorithm satisfies the desired robustness. The problem is known under the name "H∞ filter design with finite time horizon". Note that the requirement (6.20) can be formulated as:
    w̃_0^H Π_o^{-1} w̃_0 + Σ_{k=1}^{N−1} ||u_k||² + Σ_{k=1}^{N−1} ||v_k||² − γ^{-2} Σ_{k=1}^{N} ||e_{a,k}||² ≥ 0.   (6.23)
This is a quadratic form in ŵ_N, as considered in the previous chapter. After some reformulation we recognize, however:

    Q_N = [−γ^{-2} I, 0; 0, I].   (6.24)
For both formulations a filter algorithm is known. However, its existence is not guaranteed for every value of γ². The corresponding a-posteriori algorithm reads:
A-posteriori H∞ Filter: The a-posteriori filter with bound γ exists if and only if the following expression is positive definite for every k:

    P_{k-1}^{-1} − (1/γ²) Z_k Z_k^H + X_k X_k^H > 0.   (6.27)
In this case the filter equations are:

    R_{e,k} = [−γ² I, 0; 0, I] + [Z_k; X_k] P_{k-1} [Z_k^H, X_k^H],   (6.28)

    ŵ_k = F_k ŵ_{k-1} + P_{k-1} X_k^H [I + X_k P_{k-1} X_k^H]^{-1} (d_k − X_k F_k ŵ_{k-1}),   (6.29)

    P_k = F_k P_{k-1} F_k^H − F_k P_{k-1} [Z_k^H, X_k^H] R_{e,k}^{-1} [Z_k; X_k] P_{k-1} F_k^H + G_k G_k^H;   P_0 = Π_o.   (6.30)
We immediately recognize that (6.27) may fail to be positive definite. The problem in robust filtering is thus that it is difficult to predict whether a filter with the desired robustness exists.
Comparing the filter algorithm with the generalized RLS and the Kalman filter, we find that the robust filter algorithm contains elements of both. From the Kalman filter comes the time-variant system dynamic described by the matrix F_k. For Z_k = X_k it takes on the form of the generalized RLS algorithm. And indeed there also exists an a-priori form of the algorithm for F_k = I that is identical to the RLS algorithm.
The problem in formulating adaptive filter algorithms with robust properties is thus not to find them but to predict their stability. The H∞ formulation delivers a robust solution form but does not tell us whether the filter converges. We will thus follow a different path in the following and derive robustness directly by means of energy (passivity) relations. This method allows us to cover a very large class of adaptive algorithms.
Closely related to robust filtering are passivity relations. If we consider the signals w̃_0 and v(k) as inputs and the corresponding error signals as outputs of the adaptive algorithm, we recognize that for γ² < 1 less energy comes out of the system than goes in. In this case we call the system passive.
To avoid mathematical difficulties in the case q = w, we can instead argue with |x_k^T w − x_k^T q|² − μ^{-1}(k)||w − q||²₂ ≤ 0. The statement certainly remains correct if we extend the denominator by |v(k)|²:

    |x_k^T w − x_k^T q|² / (μ^{-1}(k)||w − q||²₂ + |v(k)|²) ≤ 1.   (7.2)
As such relations hold for any arbitrary estimate q as long as μ(k)||x_k||²₂ ≤ 1, they must also hold for the estimates of the LMS algorithm, thus

    |x_k^T w − x_k^T ŵ_{k-1}|² / (μ^{-1}(k)||w − ŵ_{k-1}||²₂ + |v(k)|²) ≤ 1.   (7.3)
The pressing question is now in which sense an LMS estimate changes such a limit. Denoting by e_a(k) = x_k^T [w − ŵ_{k-1}] = x_k^T w̃_{k-1} the undistorted a-priori error, by e_p(k) = x_k^T [w − ŵ_k] = x_k^T w̃_k the undistorted a-posteriori error, and setting γ(k) = μ^{-1}(k) − ||x_k||²₂, the following theorem holds.
Theorem 7.1 (Local Passivity Property) For the adaptive gradient method (LMS al-
gorithm with variable step-size) we have at every time instant k:
Proof: We show the first relation. The update equations for the parameter error vector
are:
    w̃_k = w̃_{k-1} − μ(k) x_k^* [e_a(k) + v(k)],   (7.5)
where we split the distorted a-priori error ẽa (k) = ea (k) + v(k) into an undistorted a-priori
error and noise. Computing the quadratic l2 −norm on both sides, we obtain
    ||w̃_k||²₂ = ||w̃_{k-1}||²₂ + μ²(k)||x_k||²₂ |e_a(k)+v(k)|² − μ(k)[e_a(k)+v(k)] e_a^*(k) − μ(k)[e_a(k)+v(k)]^* e_a(k).
Note that
    |e_a(k) + v(k)|² = |e_a(k)|² + |v(k)|² + e_a(k)v^*(k) + e_a^*(k)v(k).
We thus obtain
Exercise 7.2 Show that the LMS algorithm can be derived by the local cost function
The numerator of (7.9) is thus the energy of the normalized undistorted a-priori errors √μ(k) e_a(k) for 1 ≤ k ≤ N, plus the energy of the remaining parameter error vector at time instant N. Correspondingly, the denominator also consists of two terms: the energy of the normalized noise/disturbance over the entire time period as well as the initial parameter error vector energy. This is a global energy relation: a matrix T_N maps the signals
{√μ(k) v(k)}_{k=1}^N and w̃_0 onto the normalized a-priori error signals {√μ(k) e_a(k)}_{k=1}^N and the remaining parameter error vector w̃_N.
With a causal (block lower-triangular) matrix T_N:

    [√μ(1) e_a(1); …; √μ(N) e_a(N); w̃_N] = T_N [w̃_0; √μ(1) v(1); …; √μ(N) v(N)].   (7.10)
According to (7.9), such a matrix must be passive, or contractive; that is, the induced l2-norm of the matrix is bounded: ||T_N||_{2,ind} ≤ 1. In terms of robust control, such an induced matrix norm is called the H∞ norm. Figure 7.1 illustrates the relation.
[Figure 7.1: The operator T_N maps w̃_0 and {√μ(k) v(k)}_{k=1}^N to w̃_N and {√μ(k) e_a(k)}_{k=1}^N.]
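The contraction ||T_N||_{2,ind} ≤ 1 can be checked empirically by running the LMS recursion with μ(k)||x_k||²₂ ≤ 1 and comparing input energy (initial error plus weighted noise) with output energy (weighted a-priori errors plus final error). All signal choices below are illustrative, real-valued assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 4, 300
w_true = rng.standard_normal(M)

for trial in range(20):
    w = np.zeros(M)                        # so w~_0 = w_true
    in_energy = float(np.sum((w_true - w) ** 2))
    out_energy = 0.0
    for _ in range(N):
        x = rng.standard_normal(M)
        v = 0.1 * rng.standard_normal()
        mu = 0.5 / np.dot(x, x)            # mu(k) ||x_k||^2 <= 1
        e_a = np.dot(x, w_true - w)        # undistorted a-priori error
        out_energy += mu * e_a ** 2
        in_energy += mu * v ** 2
        w = w + mu * x * (e_a + v)         # LMS update with distorted error
    out_energy += float(np.sum((w_true - w) ** 2))
    assert out_energy <= in_energy + 1e-12   # passivity: ||T_N|| <= 1
print("passive in all trials")
```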
Exercise 7.3 Show the robustness of the LMS algorithm with fixed step-size μ. Which relation must μ satisfy so that the algorithm is robust?
Exercise 7.4 Consider the gradient algorithm with regular step-size matrix Mk :
    ŵ_k = ŵ_{k-1} + M_k x_k^* ẽ_a(k).   (7.11)

Derive the following relations for 0 < x_k^H M_k x_k ≤ 1 and γ(k) = 1 − x_k^H M_k x_k:

    (w̃_k^H M_k^{-1} w̃_k + |e_a(k)|²) / (w̃_{k-1}^H M_k^{-1} w̃_{k-1} + |v(k)|²) ≤ 1,   (7.12)

    (γ(k) w̃_k^H M_k^{-1} w̃_k + |e_p(k)|²) / (γ(k) w̃_{k-1}^H M_k^{-1} w̃_{k-1} + |v(k)|²) ≤ 1.   (7.13)
For M_k = ν(k)M with ν(k) > 0 and M > 0, derive the robustness conditions of this algorithm from the first relation.
Which value exactly maximizes the expression? If we consider the particular noise sequence v(k) = −e_a(k), we recognize that the gradient method does not update; in other words, the estimate remains at its initial value, ŵ_k = ŵ_0. We thus have

    max_{ŵ_0 ≠ w, √μ(k)v(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²) = 1.   (7.16)
The restriction ŵ_0 ≠ w is a technicality. If we allow ŵ_0 = w, then the denominator can become zero; in this case we would have to argue with differences rather than ratios.
Let us now consider the ratio in (7.15) for an arbitrary algorithm A. If we again select v(k) = −e_a(k), and this time ŵ_0 = w, then we have

    Σ_{k=1}^N μ(k)|v(k)|² = Σ_{k=1}^N μ(k)|e_a(k)|² ≤ ||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²,   (7.17)
or written differently:

    max_{v̄(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²) ≥ 1.   (7.18)
If relation (7.18) holds for an arbitrary algorithm A, while we know that (7.16) holds for the gradient method, then we can summarize this property in the following theorem.
Theorem 7.2 (Minimax Property of the Gradient Type Algorithm) The gradient type algorithm solves, for μ(k)||x_k||²₂ ≤ 1, the following minimax problem:

    min_{class of algorithms}  max_{ŵ_0, √μ(k)v(·)}  (||w − ŵ_N||²₂ + Σ_{k=1}^N μ(k)|e_a(k)|²) / (||w − ŵ_0||²₂ + Σ_{k=1}^N μ(k)|v(k)|²).   (7.19)
For a bounded initial error ||w̃_0||²₂ < ∞ and bounded disturbance energy, the energy of the a-priori error must be bounded as well. An infinite series of finite energy is a Cauchy series, and thus √μ(k) e_a(k) → 0.
Consider further the update equation at time instant k,

    w̃_k = w̃_{k-1} − μ(k) ẽ_a(k) x_k^*,   (7.23)

which we can equivalently write as

    w̃_{k+p-1} = w̃_k − Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+l}^*.   (7.24)

Consider a sequence of P > M vectors x_{k+p}, p = 1..P; each of them can be tested against w̃_k, and we find

    x_{k+p}^T w̃_k = x_{k+p}^T w̃_{k+p-1} + Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+p}^T x_{k+l}^*   (7.25)
                  = ẽ_a(k+p) + Σ_{l=1}^{p-1} μ(k+l) ẽ_a(k+l) x_{k+p}^T x_{k+l}^*.   (7.26)
Because all e_a(k) → 0 and, due to the bounded energy, also v(k) → 0, we conclude that ẽ_a(k) → 0. We have thus shown that the right-hand side of (7.26) converges to zero. Stacking all vectors x_{k+p}, we find further:
    [x_{k+1}^T; x_{k+2}^T; …; x_{k+P}^T] w̃_k → 0.   (7.27)
Unfortunately, we cannot conclude from here that w̃_k → 0, as w̃_k could have components in the null space of the matrix. A consequence of the persistent excitation condition is that the matrix has full rank M and thus its null space is trivial. Formally this can be shown by multiplying with the Hermitian transpose of the matrix from the left. We then have
    [x_{k+1}^*, x_{k+2}^*, …, x_{k+P}^*] [x_{k+1}^T; x_{k+2}^T; …; x_{k+P}^T] w̃_k → 0.   (7.28)
Due to the persistent excitation condition (Lemma 4.3), the matrix on the left-hand side lies between αI and βI with 0 < α ≤ β. Thus the matrix is regular and its null space is trivial.
Some interesting consequences follow. In case the noise sequence v(k) is not of bounded energy, a bound must be enforced by the step-size, for example by μ(k) = a/[b + k]².
Lemma 7.1 For the gradient type method we find at each time instant k, with μ̄(k) = 1/||x_k||²₂:

    (||w̃_k||²₂ + μ(k)|e_a(k)|²) / (||w̃_{k-1}||²₂ + μ(k)|v(k)|²)   is  ≤ 1 for 0 < μ(k) < μ̄(k),  = 1 for μ(k) = μ̄(k),  ≥ 1 for μ(k) > μ̄(k).   (7.29)
Proof: The first relation has already been shown. The second is obtained by substituting μ(k) = μ̄(k). For the third relation we consider again (7.1):

    ||w̃_k||²₂ − ||w̃_{k-1}||²₂ + μ(k)|e_a(k)|² − μ(k)|v(k)|² = μ(k)|e_a(k) + v(k)|² [μ(k)||x_k||²₂ − 1].

    −v̄(k) = e_a(k) − (μ(k)/μ̄(k)) [e_a(k) + v(k)] = e_p(k).   (7.31)
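The three cases of Lemma 7.1 can be verified numerically for the real-valued case; the dimensions and random signals below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def ratio(c):
    """Energy ratio of Lemma 7.1 for step-size mu = c * mu_bar (one step, real case)."""
    M = 5
    w_tilde = rng.standard_normal(M)       # parameter error w~_{k-1}
    x = rng.standard_normal(M)
    v = rng.standard_normal()
    mu_bar = 1.0 / np.dot(x, x)
    mu = c * mu_bar
    e_a = np.dot(x, w_tilde)               # undistorted a-priori error
    w_next = w_tilde - mu * x * (e_a + v)  # error-vector update (7.5)
    num = np.dot(w_next, w_next) + mu * e_a ** 2
    den = np.dot(w_tilde, w_tilde) + mu * v ** 2
    return num / den

assert ratio(0.5) <= 1.0               # 0 < mu < mu_bar
assert abs(ratio(1.0) - 1.0) < 1e-9    # mu = mu_bar: lossless
assert ratio(1.5) >= 1.0               # mu > mu_bar
```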
Figure 7.2 illustrates this structure. The gradient type method can thus be explained as a feedback structure: in the forward path there is a lossless system, an allpass with ||T_k||_{2,ind} = 1, while the feedback path contains a lossy system. Compare this with Equation (3.68) and Figure 3.4. Such a local relation can also be reformulated into a global one, allowing
Figure 7.2: Gradient type method as a lossless system (allpass, ||T_k|| = 1) in the forward path and a lossy feedback path with gain 1 − μ(k)/μ̄(k).
    Σ_{k=1}^N μ̄(k)|e_a(k)|² ≤ ||w̃_0||²₂ + Σ_{k=1}^N μ̄(k)|v̄(k)|².   (7.32)
As long as δ(N ) < 1, we can conclude in a global sense what we expected. The
statements are summarized in the following theorem.
ŵk → w. (7.39)
It is interesting to remark on the energy flow of such a system. As the forward path is lossless, all energy that enters must also come out. The energy component in the parameter error vector is fed back to the input of the system. Energy can thus only be lost in the feedback path. The more energy is lost, the faster the algorithm learns. Thus the fastest learning is obtained for μ(k) = μ̄(k), as the feedback then becomes zero.
for the third condition µ̄(k) < µ(k) < 2µ̄(k) in Lemma 7.1.
We recognize the RLS algorithm with exponential weighting that is obtained for
µ(k) = β(k) = 1 and λ(k) = λ.
With the second property we can derive a feedback structure according to Figure 7.3.
[Figure 7.3: Feedback structure of the RLS algorithm: a lossless forward path (||T_k|| = 1) mapping λ^{1/2}(k) P_{k-1}^{-1/2} w̃_{k-1} and √μ̄(k) v̄(k) to P_k^{-1/2} w̃_k and √(μ̄(k) − β(k)) e_a(k), with a lossy feedback path.]
(7.45)
We have employed the following shorthand terms:

    δ(N) = max_{1≤k≤N} (1 − μ(k)/μ̄(k)) / √(1 − β(k)/μ̄(k))   and   γ(N) = max_{1≤k≤N} μ(k)/μ̄(k),   (7.46)

as well as

    λ_{[i,j]} = Π_{k=i}^{j} λ(k);   μ̄(k) = 1/(x_k^T P_k x_k^*).   (7.47)
(7.49)
now with a modified shorthand term:

    γ̃(N) = max_{1≤k≤N} (μ(k) − β(k)) / (μ̄(k) − β(k)).   (7.50)
Exercise 7.7 Find the stability bound for the step-size µ(k) of the Gauß-Newton Algo-
rithm. Consider both variants (7.45) and (7.49).
Exercise 7.8 For the particular case P_0 = εI, λ(k) = λ, μ(k) = μ_o μ̄(k) and β(k) = β_o μ̄(k), find the stability bound as well as the robustness measures. Compare the results to the gradient algorithm.
A linear synapse comprises a linear combiner (as before) but additionally has a nonlinear device f[z] at the output of the linear combiner, as shown in Figure 7.4. Such a nonlinear function is called an activation function. Its value can also be interpreted as the likelihood of the class to which a given vector x belongs. A common choice of such activation functions f[z]
[Figure 7.4: Linear combiner with input x, output z, and activation function f[z].]
Let us now consider a set of possible input vectors {x_k} with their corresponding correct output decisions {y(k)}. The values {y(k)} belong to the range of the activation function f[·]; that is, there are unknown vectors w such that

In supervised learning, data pairs {x_k, y(k)} are presented to the synapse so that the adaptation algorithm can estimate the unknown w. Best known is the Perceptron Learning Algorithm (PLA). The algorithm starts with an initial guess w_1 and applies the following rule:

    ŵ_k = ŵ_{k-1} + μ x_k (y(k) − f[x_k^T ŵ_{k-1}]).   (7.53)
In order to keep the result more general, we also added noise v(k) to the reference; this can be interpreted as a modeling error. The additively distorted reference values we again denote by {d(k)}. We thus observe

for which we also included a variable step-size. The only difference compared to our previous method is the nonlinear mapping f[·], as it occurs in the estimation path. By the mean value theorem we have:
Figure 7.5 exhibits the feedback structure in this case. The nonlinear mapping occurs in the passive feedback path, resulting in a modified convergence condition. Writing down the equations for global convergence we obtain

    δ(N) ≜ max_{1≤k≤N} |1 − f′[η(k)] μ(k)/μ̄(k)|,   (7.56)

which defines the condition for convergence and thus robustness: δ(N) < 1.
[Figure 7.5: Feedback structure of the nonlinear gradient method: lossless forward path (||T_k|| = 1) with feedback gain 1 − f′[η(k)] μ(k)/μ̄(k).]
Very typical for the PLA is that its analysis is limited to real-valued signals, as is its typical application. The problem with this limitation is the definition of the nonlinear mapping for complex-valued signals. As this is a requirement for adaptive equalizers, we now have to take a closer look at it. Let us consider Figure 1.7 of the first chapter. A complex-valued symbol s(k) is transmitted through a linear channel c. Additive noise alters the received symbol further before it is passed through a linear adaptive filter. We do not expect that purely linear filtering will recover the symbol s(k − D) entirely (up to an unavoidable delay D). On the other hand, only very particular symbols are being sent. We expect that after linear filtering the symbol value lies close to a valid symbol from the transmission alphabet and can thus be mapped onto the correct symbol by a suitable nonlinear mapping. We thus assume that this structure, with a given filter length and optimal parameter set w_o, is able to "equalize" the channel so that the output is close to the correct symbol. Our reference model thus delivers a value y(k), which we can compare with the estimate ŷ(k) conditioned on a parameter set w_{k-1}. We call this a training mode, even though we lack a concrete reference signal.
Let us consider this situation more closely. The reference signal (assume s(k − D), or a function of it) is given by f[y(k)] = f[x^T w_o], by which we can compute the error signal e_o(k) = f[y(k)] − f[ŷ(k)]. In order to relate it to the a-priori error, we write:

With the help of this new function, the update equations can be formulated in the typical form:

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* e_o(k) = ŵ_{k-1} + μ(k) x_k^* h[y(k), ŷ(k)] e_a(k).   (7.60)
The difficulty lies in the function h[·, ·]. In order to guarantee l2-stability, we must have

    δ(N) ≜ max_{1≤k≤N} |1 − h[y(k), ŷ(k)] μ(k)/μ̄(k)| < 1.   (7.61)
Example 7.1: Consider for example BPSK transmission. The expected symbols are thus {−1, +1}. We select as nonlinear mapping f[z] = sgn[z]. Then we obtain

    h[y(k), ŷ(k)] = (sgn[y(k)] − sgn[ŷ(k)]) / (y(k) − ŷ(k)).   (7.62)

As a negative value of sgn[y(k)] − sgn[ŷ(k)] can only occur when the difference of the arguments is negative, the function itself is non-negative, and thus a step-size exists for which stability is guaranteed.
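The non-negativity of (7.62) is easy to check numerically; the random, real-valued test points below are illustrative.

```python
import numpy as np

def h(y, y_hat):
    """h[y, y^] = (sgn[y] - sgn[y^]) / (y - y^) as in (7.62)."""
    return (np.sign(y) - np.sign(y_hat)) / (y - y_hat)

rng = np.random.default_rng(3)
y = rng.standard_normal(1000)
y_hat = rng.standard_normal(1000)
mask = np.abs(y - y_hat) > 1e-9      # avoid division by zero at y = y^
assert np.all(h(y[mask], y_hat[mask]) >= 0)   # h is never negative
```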
In case there is no reference signal, the algorithm can be driven in the so-called blind mode. As we have some a-priori information about the transmitted symbols, we can select the nonlinear mapping so that a constant appears at the output if the filter has been selected correctly. We can for example employ so-called constant modulus (CM) signals, that is, signals whose amplitude remains constant; the information is transmitted in the phase. Computing |z(k)| = |x_k^T w_o|, the optimal system will put out a constant amplitude, thus |z(k)| = γ. An adaptive algorithm can thus be:
CMA-q-2 Algorithm
Corresponding stability conditions are obtained from the function h[·, ·].
Exercise 7.9 Compute the stability conditions of the step-size µ(k) in case of BPSK train-
ing and utilization of a sign function.
Exercise 7.10 CM-signals are being used for a transmission and the CMA-2-2 algorithm
for training the equalizer. What is its stability condition?
Exercise 7.11 BPSK is being transmitted and the equalizer is run in blind mode. Its updates are given by:

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* [sgn[ŷ(k)] − ŷ(k)].   (7.64)

Draw the algorithm in its feedback structure and determine its stability conditions.
Draw the algorithm in its feedback structure and define its stability conditions.
known signals are offered as the desired signal. This is equivalent to the mode in which the switch is set to MMSE. In this mode the filter coefficients ŵ are selected so that the MSE is minimized, a requirement that can only be satisfied up to some error term, as the optimal solution requires a doubly infinite length equalizer. Let us denote the received signal by r(k) = c^T s_k + v(k). The MMSE error can be computed as

    e_MMSE(k) = z(k) − r_k^T ŵ_{k-1} = [s(k − D) − r_k^T ŵ_{k-1}] + [z(k) − s(k − D)] = e_NL(k) − g(z),   (7.65)

where we have collected the receiver values r(k) in the vector r_k^T = [r(k), r(k − 1), …, r(k − M + 1)]. In the blind mode of operation (also decision-directed mode), the reference signal
[Figure: Decision-directed equalizer structure: s(k) passes through the channel c, noise v(k) is added to form r(k); the adaptive equalizer ŵ produces z(k); a switch selects the error reference either from the MMSE branch (s(k − D)) or from the nonlinear decision device NL.]
is extracted from the linearly equalized signal by a nonlinear mapping. We thus have a nonlinear device in the reference path (switch in position NL). The relation between the error signal e_NL(k) so obtained and the MMSE error is shown in Equation (7.65). However, it was assumed in (7.65) that only correct symbols s(k − D) have been recovered. In the blind mode we do not have the correct values but only estimates of them. Equation (7.65) thus requires some correction:

The updates thus occur with an error signal ê_NL(k) = e_MMSE(k) + ĝ[z(k)]. From this we can conclude that:
• The adaptive filter of a nonlinear equalizer works in a system identification mode.
• The excitation for the system identification is a composite signal, comprising linearly filtered transmit symbols and additive noise.
Important for the equalizer is the tracking behavior of the algorithm. This can be described well by the feedback structure. Consider the update equation in the form:

For the CMA-2-2 algorithm we obtain, for example, f[ŷ(k)] = ŷ(k)[γ − |ŷ(k)|²]. Reformulating in the well-known form we obtain:
    μ̄(k)|e_a(k)|² + ||w̃_k||² = μ̄(k) |e_a(k) − (μ(k)/μ̄(k)) f[ŷ(k)]|² + ||w̃_{k-1}||².   (7.68)
If we consider the signals as random processes, we can compute the expectation on both sides. In steady-state we find that E[||w̃_k||²] = E[||w̃_{k-1}||²], and thus we have

    E[μ̄(k)|e_a(k)|²] = E[μ̄(k) |e_a(k) − (μ(k)/μ̄(k)) f[ŷ(k)]|²].   (7.69)
• We assume that in steady-state the reciprocal instantaneous energy µ̄(k) and the
estimated value ŷ(k) are statistically independent.
With these assumptions we can further process Equation (7.69). For small and constant step-sizes we obtain

    E[|e_a(k)|²] ≈ μ E[||x_k||²₂] · E[|s(k)|² γ² − 2γ|s(k)|⁴ + |s(k)|⁶] / (2 E[δ|s(k)|² − γ]),   (7.70)

with δ = 2 for the complex-valued case and δ = 3 for the real-valued case. A few remarks on the procedure and the result (7.70):
• It is interesting that the steady-state error energy can now also be minimized with respect to γ, so we can find the smallest steady-state error energy.

• Utilizing a gradient algorithm, the learning speed depends strongly on the eigenvalue spread of the input process. This spread is defined only by the channel if we assume a white data sequence to be transmitted. On the other hand, the additive white noise helps to decrease the eigenvalue spread.
Exercise 7.13 Compute the Excess-Mean-Square Error of the LMS algorithm based on the
statistical method presented here and compare the results with those in Chapter 3. Compute
also the steady-state error energy for the Least-Mean Fourth algorithm.
Exercise 7.14 Under the assumption of a reference model w_k = w_{k-1} + q_k, in which the vectors q_k are statistically independent, compute the Excess-Mean-Square error of the LMS algorithm as a function of the step-size.
Consider the odd function f[·] as the derivative of a convex cost function ψ(·). The adaptive algorithm (7.71) is a gradient type method that approximately minimizes E{ψ(d(k) − x_k^T ŵ_k)}. Several algorithms treated in the literature belong to this situation, such as the sign-error, least-mean-K, and power-of-two quantized algorithms (see [12, 79]). The sign-error algorithm we had briefly discussed in Chapter 3 as a means to save complexity. For its analysis we take our standard reference model:
Theorem 7.5 The update for the gradient type method with nonlinear mapping in the error path is given by

    ŵ_k = ŵ_{k-1} + (x_k^* / ||x_k||²₂) (ẽ_a(k) − q[ẽ_a(k)]),   (7.73)

in which q[·] is an odd function with e·q[e] > 0. Under this condition the gradient method is l2-stable and robust as long as q[e] is contracting, that is
and
    ||w̃_k||²₂ + ((2 − β(β² + 1))/2) · |e_a(k)|²/||x_k||²₂ ≤ ||w̃_{k-1}||²₂ + (β² + 1) |v(k)|²/||x_k||²₂.   (7.80)
Now we only have to iterate (7.80) from k = 1 to k = N and we obtain

    ||w̃_N||²₂ + ((2 − β(β² + 1))/2) Σ_{k=1}^N |e_a(k)|²/||x_k||²₂ ≤ ||w̃_0||²₂ + (β² + 1) Σ_{k=1}^N |v(k)|²/||x_k||²₂,   (7.81)
and thus

    Σ_{k=1}^N |e_a(k)|²/||x_k||²₂ < (2/(2 − β(β² + 1))) ||w̃_0||²₂ + (2(β² + 1)/(2 − β(β² + 1))) Σ_{k=1}^N |v(k)|²/||x_k||²₂.   (7.82)
For N → ∞ the error terms |e_a(k)|²/||x_k||²₂ remain bounded as long as the terms |v(k)|²/||x_k||²₂ remain bounded.
We had already mentioned in Chapter 3 that the gradient method with a-posteriori error is equivalent to a particular normalization of the step-size in the LMS algorithm. This form of the gradient method can be extended as shown next.

Theorem 7.6 Given a gradient method with an odd nonlinearity satisfying e·f[e] > 0 applied to the a-posteriori error:

    ŵ_k = ŵ_{k-1} + μ(k) f[ẽ_p(k)] x_k^*.   (7.83)

The gradient method is l2-stable and robust for every bounded step-size μ(k) > 0.
Proof: By multiplication with x_k^T and subtraction from d(k) we obtain the relation

or also

    ẽ_p(k) + μ(k)||x_k||²₂ f[ẽ_p(k)] = ẽ_a(k).   (7.85)

As f[e] is an odd function, we have sgn[ẽ_p(k)] = sgn[ẽ_a(k)], and thus

The relation between ẽ_p(k) and ẽ_a(k) is thus contracting; therefore, with Theorem 7.5 we conclude that the method is l2-stable.
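The implicit a-posteriori relation (7.85) can be illustrated numerically. The cubic f[e] = e³ below is a hypothetical odd nonlinearity with e·f[e] > 0 (not one from the text); the root is found by bisection, and the contraction |ẽ_p| ≤ |ẽ_a| is verified.

```python
def f(e):
    """Hypothetical odd nonlinearity with e * f(e) > 0."""
    return e ** 3

def solve_posterior(e_a, c):
    """Solve e_p + c * f(e_p) = e_a for e_p by bisection (c = mu * ||x||^2 > 0).

    Since g(e) = e + c*f(e) is monotone with g(0) and g(e_a) bracketing e_a,
    the root lies between 0 and e_a, which already implies |e_p| <= |e_a|.
    """
    lo, hi = (0.0, e_a) if e_a >= 0 else (e_a, 0.0)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (mid + c * f(mid) - e_a) * (lo + c * f(lo) - e_a) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for e_a in (-2.0, -0.3, 0.7, 3.0):
    e_p = solve_posterior(e_a, c=1.5)
    assert abs(e_p + 1.5 * f(e_p) - e_a) < 1e-6   # satisfies (7.85)
    assert abs(e_p) <= abs(e_a)                   # contraction
```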
Exercise 7.15 Compute the robustness condition for the sign-error algorithm:
before we can apply the update equation. Pipelining results in an error signal that is available only a few clock cycles later. If this is the case, the update equation takes the form

    ŵ_k = ŵ_{k-1} + μ(k) x_k^* ẽ_a(k − D),   (7.88)

in which the error occurs delayed by D cycles. This is the simplest case of a filtered error.

In active noise control the error is constructed in the acoustic path and captured by a microphone (see also Figure 1.5). The path from the loudspeaker through the control unit is part of the linear filter path of the update error.
A further application of such a filtered error path are adaptive IIR filters. Until now we considered only transversal filter structures (and linear combiners). If the reference model consists of an IIR filter, a large number of coefficients of a transversal filter would need to be estimated (depending on the pole locations), while a corresponding IIR filter would require only a few taps. Already in the 1970s this motivated many researchers to investigate adaptive IIR algorithms [77, 73, 16]. Their success, however, remained very limited, the major problem being stability. The reason for this problem is again the occurrence of a linear filter in the error path, as we show next.
In adaptive IIR structures we have to distinguish between the so-called output error and the equation error. In the first form we apply estimated output values as input of the filter, thus the output of the estimator:

    u_k^T = [x(k), x(k−1), …, x(k−M+1), ŷ(k−1), ŷ(k−2), …, ŷ(k−N)].   (7.90)

This is opposed to the equation error method, where noisy outputs of the reference model are employed:

    u_k^T = [x(k), x(k−1), …, x(k−M+1), d(k−1), d(k−2), …, d(k−N)].   (7.91)
Applying the equation error we can straightforwardly write down the Wiener solution for random signals:

    E[u_k^* u_k^T] ŵ = E[d_k^* u_k^T].   (7.92)

As parts of d(k) are entries of the regression vector, there is more correlation than desired. Splitting the regression vector u_k into the two components x_k and d_k, we obtain

    E[R_xx, R_dx; R_xd, R_dd] ŵ = E[r_dx; r_dd].   (7.93)
Assuming white additive noise, we find r_dd = r_yy. On the left-hand side of the equation we find a term R_dd = R_yy + σ_v² I that behaves differently than usual: due to the noise it has an additional component. The estimator so obtained is thus not bias-free. For the output error method, on the other hand, such a bias is not expected. We will understand this better after a detailed analysis.
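The bias of the equation-error method can be made visible in a small simulation. The first-order model below, with a = 0.8, b = 1 and noise variance 0.25, is an illustrative choice, not from the text: the LS estimate of the feedback coefficient is attenuated because the regressor d(k−1) carries the noise v(k−1).

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, sig_v = 0.8, 1.0, 0.5
N = 200_000

x = rng.standard_normal(N)
v = sig_v * rng.standard_normal(N)
y = np.zeros(N)                 # noise-free output of the reference model
for k in range(1, N):
    y[k] = a * y[k - 1] + b * x[k]
d = y + v                       # observed (noisy) reference

# Equation-error LS: regress d(k) on [x(k), d(k-1)]
U = np.column_stack([x[1:], d[:-1]])
b_hat, a_hat = np.linalg.lstsq(U, d[1:], rcond=None)[0]

print(a_hat)   # attenuated: |a_hat| < |a| because of the noisy regressor
assert abs(a_hat) < abs(a) - 0.01
```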
Until now it remains unclear why the output error method belongs to the algorithms with a linear filter in the error path. To understand this we use the IIR reference model:

    w^T = [b(0), b(1), …, b(M−1), a(1), a(2), …, a(N)] = [b^T, a^T],   (7.94)

with the IIR filter coefficients b(0)…b(M−1) and a(1)…a(N). For the undistorted a-priori output error we find
We recognize a particular linear filtering in the error path. In the case of a constant step-size μ(k) = μ the algorithm is called the Feintuch algorithm [16]. These considerations are of course not restricted to simple gradient methods; methods of the Gauß-Newton class can be treated equally. Utilizing an output error with a Gauß-Newton type algorithm, the corresponding algorithm is called the pseudo-linear regression (PLR) algorithm.
Figure 7.7: Adaptive algorithm structure with linear filter in the error path.
If the error path F[·] is given in matrix form, for example here with three coefficients f_0, f_1, f_2,

          [ f_0               ]
          [ f_1  f_0          ]
    F_N = [ f_2  f_1  f_0     ]   (7.106)
          [      f_2  f_1  f_0]
          [        ⋱   ⋱   ⋱  ]

and furthermore the step-sizes are collected in diagonal matrices M̄_k and M_k, then the abbreviations are

    δ(N) ≜ ||I − M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind},   (7.107)
    γ(N) ≜ ||M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind}.   (7.108)
Figure 7.8: Gradient type algorithm with linear filter in the error path in feedback structure.
With such definitions it is possible to derive robustness conditions also for the case of
linearly filtered errors.
Theorem 7.7 For the gradient method with linearly filtered error we find l2-stability with the definitions (7.107) and (7.108):

    √(Σ_{k=1}^N μ̄(k)|e_a(k)|²) ≤ (1/(1 − δ(N))) [||w̃_0||₂ + γ(N) √(Σ_{k=1}^N μ̄(k)|v(k)|²)],   (7.109)

    √(Σ_{k=1}^N μ(k)|e_a(k)|²) ≤ (γ(N)/(1 − δ(N))) [||w̃_0||₂ + γ(N) √(Σ_{k=1}^N μ(k)|v(k)|²)].   (7.110)
The proof follows the previous procedure (see also Exercise 7.11). However, the stability condition

    ||I − M̄_N^{-1/2} M_N F_N M̄_N^{-1/2}||_{2,ind} < 1   (7.111)

is much harder to check with the abbreviations in (7.107), as we have to deal with time-variant components in the matrices. For relatively large filter lengths M we can approximately claim that μ̄(k) = 1/||x_k||²₂ ≈ 1/(M σ_x²). With constant step-size μ(k) = μ_o the following relation holds:

    max_Ω |1 − (μ_o/(M σ_x²)) F(e^{jΩ})| < 1.   (7.112)
The step-size μ_o providing the fastest convergence can be found by the following minimax optimization:

    μ_opt = arg min_{μ_o} max_Ω |1 − (μ_o/(M σ_x²)) F(e^{jΩ})|.   (7.113)
From the stability condition (7.112) one recognizes that it is not necessarily satisfied for all positive step-sizes. The linear function F[·] can exhibit a negative real part at various frequencies; in this case the algorithm behaves unstably even for small step-sizes (assuming excitation at these frequencies). The necessary condition on F[·] is known in the literature as strict positive realness (SPR):

    Re{F(e^{jΩ})} > 0   for all Ω.   (7.114)
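The SPR condition (7.114) and the step-size bound (7.112) can be tested on a frequency grid. The two FIR error paths below are hypothetical examples (one SPR, one not), and M and σ_x² are illustrative values.

```python
import numpy as np

def F_resp(coeffs, omega):
    """Frequency response F(e^{jOmega}) of an FIR error path."""
    return sum(c * np.exp(-1j * k * omega) for k, c in enumerate(coeffs))

omega = np.linspace(0, 2 * np.pi, 2048, endpoint=False)
M, sigma_x2 = 16, 1.0

f_spr = [1.0, 0.5]       # Re F = 1 + 0.5 cos(Omega) > 0 everywhere: SPR
f_not = [1.0, -1.5]      # Re F = 1 - 1.5 cos(Omega) <= 0 near Omega = 0: not SPR

assert np.all(np.real(F_resp(f_spr, omega)) > 0)
assert np.any(np.real(F_resp(f_not, omega)) <= 0)

# For the SPR filter a small step-size satisfies (7.112) ...
mu_o = 0.5
assert np.max(np.abs(1 - mu_o / (M * sigma_x2) * F_resp(f_spr, omega))) < 1
# ... while for the non-SPR filter no positive step-size does
# (at Omega = 0 we get |1 + 0.5 * mu_o/(M sigma_x^2)| > 1).
for mu_o in (1e-3, 0.1, 1.0, 10.0):
    assert np.max(np.abs(1 - mu_o / (M * sigma_x2) * F_resp(f_not, omega))) >= 1
```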
Exercise 7.16 Prove Theorem 7.7.
Exercise 7.17 Extend the proof to include the Gauß-Newton algorithm with linearly
filtered error path.
Exercise 7.18 An adaptive IIR filter with two feedback coefficients a(1) and a(2) is to be adapted by the Feintuch algorithm. Which condition must the two coefficients satisfy so that a stable algorithm is obtained? What is the optimal step-size?
Exercise 7.19 Let an undisturbed nonlinear system be given by y(k) = a^T y_k + b_1^T x_k + b_2^T xx_k with xx_k^T = [x(k)x(k), x(k)x(k−1), …, x(k)x(k−M_2+1)]. Derive stability conditions for the corresponding gradient method with output error.
Exercise 7.20 A compromise could be an adaptive algorithm whose update error is partially output error and partially equation error. Derive such an algorithm and find its stability conditions.
Exercise 7.21 Steiglitz and McBride had the idea [74] that the adaptive IIR filter can be improved by pre-filtering the update error of the Feintuch algorithm by 1 − Â(q^{-1}). Derive the stability condition in this case.
Exercise 7.22 The neural network by Narendra and Parthasarathy (see Figure 1.11) can be interpreted as an adaptive IIR filter with a nonlinearity in the estimation path. Derive a gradient algorithm for the training of such a network and find the required stability conditions.
114 Adaptive Filters (preliminary)
Matlab Exercise 7.1 Write a Matlab program to identify the following system:
Let the system be disturbed additively by white Gaussian noise with σ_v² = 0.01. Utilize an equation error as well as an output error and compare the results. Use the three different sets
with the frequencies f₁ = 0.1, f₂ = 0.17, f₃ = 0.25, f₄ = 0.35, f₅ = 0.4. Find the step-sizes for fastest convergence and the stability bound.
For small step-sizes the estimate changes only slowly and, assuming it remains constant during M_F updates, M_F being the filter length of F[·], we can write
which can be argued similarly as for the method with the decorrelation filter. However, slow adaptation is usually not of much interest, which raises the question how the algorithm reacts to larger step-sizes. To treat the most general case we consider the generalized FXLMS algorithm:
ŵ_k = ŵ_{k−1} + μ(k) F[x_k^*] G_k(q^{−1}) {F[ẽ_a(k)]}.   (7.116)
In addition to the fixed filter F we assume a time-variant filter G_k. By multiple reformulations it can be shown [61] that

F[ẽ_a(k)] = F[v(k)] + F[x_k^T] w̃_{k−1} + Σ_{l=1}^{M_F−1} c(k,l) G_{k−l}(q^{−1}) F[ẽ_a(k−l)],   (7.117)
which allows us to convert the FXLMS algorithm into a type with a filtered error path (and not a filtered regression vector). The generalized FXLMS algorithm can thus be reformulated as:

ŵ_k = ŵ_{k−1} + μ(k) F[x_k^*] [G_k(q^{−1}) / (1 − C_k(q^{−1}) G_k(q^{−1}))] {F[v(k)] + F[x_k^T] w̃_{k−1}}.   (7.121)

The error path is thus governed by the time-variant filter G_k(q^{−1}) / (1 − C_k(q^{−1}) G_k(q^{−1})).
Exercise 7.24 Show that the optimal filter for the generalized FXLMS algorithm is given by:

G_opt,k(q^{−1}) = 1 / (1 + C_k(q^{−1})).   (7.122)
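The effect of (7.122) can be checked numerically on the unit circle: with G_opt = 1/(1 + C), the error-path filter G/(1 − CG) appearing in (7.121) collapses to unity at every frequency. A minimal sketch (the coefficients of C below are illustrative assumptions):

```python
import numpy as np

c = np.array([0.0, 0.5, -0.2])            # assumed C(q^{-1}) = 0.5 q^{-1} - 0.2 q^{-2}
omega = np.linspace(0.0, np.pi, 512)
q = np.exp(-1j * omega)                   # q^{-1} evaluated on the unit circle
C = sum(ck * q**k for k, ck in enumerate(c))
G_opt = 1.0 / (1.0 + C)                   # optimal filter (7.122)
H = G_opt / (1.0 - C * G_opt)             # error-path filter of (7.121)
max_dev = np.max(np.abs(H - 1.0))         # deviation from an identity error path
```

With G_opt the recursion through C is exactly cancelled, so H(e^{jΩ}) ≡ 1.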
Exercise 7.25 Under the assumption of constant coefficients c(k) and a normalized step-size version of the algorithm, provide the stability condition as a function of the normalized step-size μ_o and the error path filter with optimal G_opt,k.
Exercise 7.27 The Zero-Forcing algorithm for channel equalization is given by the updates

ŵ_k = ŵ_{k−1} + μ(k)[s(k−D) − ŝ(k−D)] s_k^*,   (7.124)

for which the meaning of the quantities can be found in Figure 7.6 and the vector s_k^T = [s(k), s(k−1), ..., s(k−M+1)]. Analyze the algorithm and find step-sizes μ(k) that guarantee convergence.
7.7 Literature
Good tutorials and introductions to robust control are found in [11, 37]. An introduction to adaptive, robust filtering can be found in [42]. The small-gain theorem is well explained in [38, 76]. First publications on the robustness of LMS algorithms are [25, 26]. In [69] the minimax interpretation of the LMS algorithm can be found. In [57, 58] the feedback structure of gradient-type algorithms as well as of the Gauß-Newton method is explained. Good tutorials on neural networks are [27, 33, 39]. In [56] the explanation of the Feintuch algorithm is found. The DLMS algorithm has been treated classically in [40, 41] and [54]. Further details on the tracking of equalizers are in [43]. Details on the FXLMS algorithm can be found in [64].
Appendix A
with a single unit entry at position i, all other entries being equal to zero. If we consider a linear combination of the elements v_i with constant coefficients a_i, the differentiation with respect to one element v_j leads to:

∂/∂v_j (a^T v) = ∂/∂v_j Σ_{k=1}^M a_k v_k = a_j.
If we now build a vector of length M that contains at position j the derivative of a^T v with respect to v_j, we can write:

∂/∂v (a^T v) = [∂/∂v_1 (a^T v), ∂/∂v_2 (a^T v), ..., ∂/∂v_M (a^T v)] = a^T.

Note that choosing the gradient as a row vector here is arbitrary. Since in these lecture notes we consistently use gradients in the form of row vectors, this choice simplifies matters.
These operations are also called the Wirtinger differential operators, in recognition of the
mathematician Wilhelm Wirtinger [81].
Consider for example the term g(z) = |z|² = z z* = x² + y². By differentiating separately with respect to x and y, we obtain

∂g(z)/∂x = 2x

and

∂g(z)/∂y = 2y.

Consequently, according to the above stated definitions, the derivative of g(z) = |z|² with respect to z reads

∂g(z)/∂z = x − jy = z*.
We observe that the differentiation with respect to z treats the complex conjugate z* like a constant. Analogously, we can differentiate with respect to z*. For the above example g(z) = |z|², the derivative with respect to z* simply becomes z.
Finally, the rules for differentiation with respect to complex-valued vectors can be derived based on the rules presented in the previous part of this Appendix A, in combination with the above introduced Wirtinger differential operators. Below, we present a collection of a few useful rules for the differentiation of scalar functions which depend on a complex-valued vector z:

∂‖z‖₂²/∂z = ∂(x^T x + y^T y)/∂z = x^T − j y^T = z^H,

∂‖z‖₂²/∂z^H = ∂(x^T x + y^T y)/∂z^H = x + jy = z,

∂(z^H A z)/∂z = ∂(x^T A x − j y^T A x + j x^T A y + y^T A y)/∂z = x^T A − j y^T A = z^H A.
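The Wirtinger rules above can be verified numerically by implementing ∂/∂z = ½(∂/∂x − j ∂/∂y) with finite differences. A sketch (step size and test points are arbitrary choices):

```python
import numpy as np

def wirtinger_dz(g, z, h=1e-6):
    # d/dz = 0.5 * (d/dx - j d/dy), applied to a real-valued function g
    dgdx = (g(z + h) - g(z - h)) / (2 * h)
    dgdy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)
    return 0.5 * (dgdx - 1j * dgdy)

z0 = 1.2 - 0.7j
d_scalar = wirtinger_dz(lambda z: abs(z) ** 2, z0)   # expect z0* (rule d|z|^2/dz = z*)

# vector case: the entries of d||z||^2/dz are z_i*, i.e. the row vector z^H
zv = np.array([1.0 + 2.0j, -0.5 + 0.3j])
d_vec = np.array([wirtinger_dz(lambda w, i=i: abs(w) ** 2 + abs(zv[1 - i]) ** 2, zv[i])
                  for i in range(2)])
```

The numerical derivatives match the closed-form rules ∂|z|²/∂z = z* and ∂‖z‖₂²/∂z = z^H componentwise.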
Appendix B
Definition [Convergence]: If for every δ > 0 there exists an integer n_o such that the distance dist(x_n, y) < δ for each n > n_o with some constant value y, then we call the sequence x_n convergent and y the limit of x_n:

x_n → y,   y = lim_{n→∞} x_n.

We also say the sequence x_n converges towards y. Here, the distance dist(·) is a metric. If there is more than one limit value (for example at the output of a discrete-time oscillator), we call these points limit points.
Such a definition of convergence is not very practical since it requires the limit point y to be known a priori. Only then is it possible to check whether the sequence converges or not. This form of convergence is also called convergence everywhere. It is applicable for deterministic sequences as well as for random sequences and processes. For a random process x_n(ζ), the limit may depend on the outcome of a random experiment ζ.

Definition [Convergence with probability one]: If a random process x_n(ζ) converges to a limit x(ζ) such that

P{x_n → x} = 1,

then we call it convergence with probability one or convergence almost everywhere.
Definition [Convergence in the mean]: If the following is true for a random process

lim_{n→∞} E[x_n − x] = 0,

then we call it convergence in the mean, which is not to be confused with the next expression.

Definition [Convergence in the mean square]: If the following is true for a random process

lim_{n→∞} E[|x_n − x|²] = 0,

then we call it convergence in the mean square sense or limit in the mean.
The common shorthand notations for these modes of convergence are d (in distribution), p (in probability), a.e. (almost everywhere), and MS (mean square).
Definition: Let us first consider two random variables x and y. They are called spherically invariant if their joint density function f_xy(x,y) can be written as

f_xy(x,y) = g(√(x² + y²)) = g(r).

The following theorem holds.
Theorem C.1 If two random variables x and y are spherically invariant and statistically independent, then they are Gaussian distributed, zero-mean and of identical variance.

Proof: Statistical independence means:

g(r) = g(√(x² + y²)) = f_x(x) f_y(y).
The theorem is thus proven. Note, however, that this does not mean that all spherically invariant random variables are Gaussian distributed. Rather, it means that those that are not Gaussian distributed are not statistically independent. In those cases we do not have f_xy(x,y) = f_x(x) f_y(y).

We can now relax the condition of circles and extend the space towards elliptically invariant random processes:
and thus the condition for an ellipsoid: B² − 4AC = 4b² − 4ac < 0, thus b² < ac, which is satisfied since the correlation coefficient ρ is bounded by −1 ≤ ρ ≤ 1 (proof by the Cauchy-Schwarz inequality). The form (C.3) is much easier to handle, as we can immediately recognize the translational parameters (x₀, y₀). Also the variances and the correlation factor are given by:

σ_x² = c/(ac − b²);   σ_y² = a/(ac − b²);   ρ = −b/√(ac).

The form in (C.2) involves a matrix inversion but on the other hand allows for relatively quick reading of the desired parameters. With this form we can easily extend the description towards more than two dimensions, for example three:
(x − x₀, y − y₀, z − z₀) [ σ_x²  ρ_xy σ_x σ_y  ρ_xz σ_x σ_z ; ρ_xy σ_x σ_y  σ_y²  ρ_yz σ_y σ_z ; ρ_xz σ_x σ_z  ρ_yz σ_y σ_z  σ_z² ]^{−1} (x − x₀, y − y₀, z − z₀)^T = const.   (C.4)
x^T R_xx^{−1} x = const.   (C.5)
Theorem C.2 Elliptical random processes maintain this property under linear transformations. There exists at least one linear transformation that transforms elliptical random variables into spherical ones.
and

R_xx = A^{−1} R_yy A^{−T}.

Thus y^T R_yy^{−1} y = x^T A^T R_yy^{−1} A x = x^T R_xx^{−1} x. For density functions under a linear transformation we have

f_y(y) = f_x(x)/|det A| = g(x^T R_xx^{−1} x)/|det A| = g(y^T R_yy^{−1} y)/|det A|.
Thus, the first part of the theorem is shown. Let us now select the particular A = R_xx^{−1/2}; then we have

R_yy = E[y y^T] = A R_xx A^T = I.

The linearly transformed random vector y is thus uncorrelated. This particular choice of A is always possible as the autocorrelation matrix is always Hermitian and positive definite. Note that with this choice of A we not only guarantee decorrelation but also identical variance of all entries in y. If only decorrelation were required, we could satisfy this with any diagonal matrix D via A = D R_xx^{−1/2}.
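The particular choice A = R_xx^{−1/2} can be illustrated with a small numeric sketch (the covariance matrix below is an assumed example):

```python
import numpy as np

Rxx = np.array([[2.0, 0.8],
                [0.8, 1.0]])               # assumed elliptical covariance
w, V = np.linalg.eigh(Rxx)                 # symmetric eigendecomposition
A = V @ np.diag(w ** -0.5) @ V.T           # A = Rxx^{-1/2}
Ryy = A @ Rxx @ A.T                        # covariance after y = A x
```

Ryy equals the identity, i.e. the transformed vector is uncorrelated with identical unit variances in all entries.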
f_xy(x,y) = g(√(x² + y²)) = g(r) = f_r(r)/(2πr) = (1/π) f(r², 2).   (C.6)
Such a formulation is independent of the radial density. Note that the angular component φ is always uniformly distributed on [−π, π]. Furthermore, we immediately conclude statistical independence of the two random variables r and φ in polar form, while the Cartesian counterparts x and y are not necessarily statistically independent. We have already shown that this is only the case for Gaussian distributions. Such formulations in polar coordinates can be extended to more dimensions. Doing so, we recognize that all angular densities are statistically independent of each other and of the radial density. For all higher angles we always find a constant distribution on [0, π].
In (C.6) we already introduced a particular description for the radial density, in this case for the two-dimensional density function (1/π) f(r², 2), or more generally (1/π^{M/2}) f(r², M) for M dimensions. To emphasize the dimension M we took it on as a parameter of the density function. The prefactor 1/π^{M/2} turns out to be quite practical for compact writing. As marginal densities can be computed by integrating out joint densities, we find:

(1/π^{M/2}) f(r², M) = ∫_{−∞}^{∞} (1/π^{(M+1)/2}) f(r² + s_{M+1}², M+1) ds_{M+1}.

Here the joint density (1/π^{M/2}) f(r², M) consists of M Cartesian components s₁, s₂, ..., s_M with the condition r² = s₁² + s₂² + ... + s_M².
For spherically invariant processes even the converse is true; in other words, if the radial density is known in one dimension, it can be computed for all dimensions. This can easily be shown by integrating out two additional components in polar form:

(1/π^{M/2}) f(r², M) = (1/π^{M/2}) ∫_0^∞ 2ρ f(r² + ρ², M+2) dρ.
vector as the presence of a random process of which we always see a fraction of length M. Independently of time we always obtain the same joint density function for M components. We thus conclude that we have stationary processes.
In order to obtain the radial density we have to integrate over all angles, thus to compute the marginal density in r:

∫···∫ (1/π^{M/2}) f(r², M) dφ₁···dφ_{M−1} = f_r(r) ∫···∫ f_{φ₁}(φ₁)···f_{φ_{M−1}}(φ_{M−1}) Δ(φ₁, ..., φ_{M−1}) dφ₁···dφ_{M−1}.
Since all angular density functions are known, this term can be precomputed for a given dimension M:

f_{rM}(r) = (2/Γ(M/2)) r^{M−1} f(r², M).   (C.9)
Note that the density (1/π^{M/2}) f(r², M) refers to M components, while (1/π^{(M+1)/2}) f(r², M+1) refers to M + 1. The radial density f_{rM}(r) behaves similarly. To emphasize this we now introduced the dimension M in the form of an index in r_M. Special cases are known for M = 1, 2, 3:
Positive Gaussian:  f_{r1}(r) = (2/Γ(1/2)) f(r², 1) = √(2/π) e^{−r²/2}

Rayleigh:  f_{r2}(r) = (2/Γ(1)) r f(r², 2) = r e^{−r²/2}

Maxwell:  f_{r3}(r) = (2/Γ(3/2)) r² f(r², 3) = √(2/π) r² e^{−r²/2}
Note that r ≥ 0, which is a bit unusual for the Gaussian distribution, as it usually takes on negative arguments as well. In that case the factor is 1/√(2π) instead.
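For the Gaussian case the kernel is f(s, M) = 2^{−M/2} e^{−s/2}, and the three special cases can be checked against the general formula (C.9) numerically. A sketch using a simple Riemann sum:

```python
import numpy as np
from math import gamma, sqrt, pi

def f(s, M):
    # Gaussian kernel: (1/pi^{M/2}) f(r^2, M) = (2*pi)^{-M/2} exp(-r^2/2)
    return 2.0 ** (-M / 2) * np.exp(-s / 2)

def f_r(r, M):
    # radial density (C.9)
    return 2.0 / gamma(M / 2) * r ** (M - 1) * f(r ** 2, M)

r = np.linspace(0.0, 12.0, 200001)
dr = r[1] - r[0]
closed = {1: sqrt(2 / pi) * np.exp(-r ** 2 / 2),           # positive Gaussian
          2: r * np.exp(-r ** 2 / 2),                      # Rayleigh
          3: sqrt(2 / pi) * r ** 2 * np.exp(-r ** 2 / 2)}  # Maxwell
max_err = max(np.max(np.abs(f_r(r, M) - closed[M])) for M in (1, 2, 3))
norms = {M: np.sum(f_r(r, M)) * dr for M in (1, 2, 3)}     # each should be ~1
```

The general formula reproduces the listed densities exactly, and each integrates to one over r ≥ 0.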
Thus its standard deviation estimate σ_M = M^{−1/2} r_M follows the radial density. Let us now grow M to infinity. What happens?

lim_{M→∞} σ_M = lim_{M→∞} M^{−1/2} r_M.

We thus find that in general, in the limit, the variance is also a random variable, described by the density function f_σ(r). We cannot expect the ensemble average and the temporal average to be identical, as the temporal average would need to be a constant. We can thus conclude that in general spherically invariant processes are not ergodic, although they are stationary.
A particularity is again the Gaussian process. For the Gaussian process we have (just considering even orders)

f_{r,2M}(r) = (2/Γ(M)) r^{2M−1} f(r², 2M) = (2/Γ(M)) (1/2^M) r^{2M−1} e^{−r²/2}.   (C.10)

The density function follows a χ²-distribution. In the limit, M^{1/2} f_{rM}(M^{1/2} r) tends to a unit impulse at r = 1:

lim_{M→∞} M^{1/2} f_{rM}(M^{1/2} r) = δ(r − 1).

The Gaussian process is thus ergodic. The property of the Gaussian process that the radial component tends to a constant with increasing dimension is often called the hardening phenomenon.
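The hardening phenomenon is easy to observe by simulation (a sketch with arbitrary dimensions and a fixed seed): the spread of the normalized radius ‖x‖₂/√M of an IID Gaussian vector shrinks as the dimension M grows.

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for M in (4, 64, 1024):
    x = rng.standard_normal((2000, M))                 # 2000 Gaussian vectors of dimension M
    radius = np.linalg.norm(x, axis=1) / np.sqrt(M)    # normalized radius, concentrates at 1
    spread[M] = radius.std()                           # shrinks roughly like 1/sqrt(2M)
```

For M = 1024 the normalized radius is already nearly deterministic, which is the hardening effect behind the ergodicity of the Gaussian process.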
Example C.2 This raises the question how we can generate spherically invariant processes other than the Gaussian process, possibly with a particular radial density. Consider first the multiplication of a Gaussian process x_k by a random variable y to obtain z_k = x_k y. Let us first consider one element at time instant k, thus z = xy. The density function of z can be obtained by the following two methods:
Path 1: The joint density is f_zy(z,y) = f_z(z|y) f_y(y). We obtain the desired marginal density by integration:

f_z(z) = ∫ f_z(z|y) f_y(y) dy = ∫ (1/√(2πy²)) e^{−(z/y)²/2} f_y(y) dy.
Path 2: Construct the auxiliary variable w = y and compute the transformation of (x, y) to (z, w):

f_zw(z,w) = f_xy(z/w, w)/|w| = f_x(z/w) f_y(w)/|w|.

We obtain the marginal density of z again by integration:

f_z(z) = ∫ f_x(z/w) f_y(w)/|w| dw = (1/√(2π)) ∫ e^{−(z/w)²/2} f_y(w)/|w| dw.
In both cases we thus obtain the same result. We recall that r ≥ 0 and identify

f(r², 1) = 2^{−1/2} ∫ e^{−(r/w)²/2} f_y(w)/|w| dw.   (C.11)
Example C.3 We now want to extend this example. Consider three IID Gaussian dis-
tributed random variables x1 , x2 , x3 being multiplied by a fourth random variable y of density
fy (y) that is statistically independent of the first three. For the joint densities we have:
fx1 ,x2 ,x3 ,y (x1 , x2 , x3 , y) = fx1 (x1 )fx2 (x2 )fx3 (x3 )fy (y)
We obtain

f_{z₁,z₂,z₃,w}(z₁,z₂,z₃,w) = f_{x₁}(z₁/w) f_{x₂}(z₂/w) f_{x₃}(z₃/w) f_y(w) / |w|³.

As all three variables x₁, x₂, x₃ are Gaussian we find:

f_{z₁,z₂,z₃,w}(z₁,z₂,z₃,w) = (2π)^{−3/2} f_y(w) e^{−(z₁²+z₂²+z₃²)/(2w²)} / |w|³.
In this way we can compute the joint densities of more than three variables. We must have

f(s, 2n+1) = (−1)^n (d^n/ds^n) f(s, 1).

In our case thus

f(s, 2n+1) = 2^{−(2n+1)/2} ∫ e^{−s/(2w²)} f_y(w)/|w|^{2n+1} dw.
With this trick we can now compute the radial densities for odd dimensions M = 2n + 1 (see (C.9)):

f_{rM}(r) = (2/Γ(M/2)) 2^{−M/2} r^{M−1} ∫ e^{−r²/(2w²)} f_y(w)/|w|^M dw.
Remark: The examples here showed a so-called product process. It can be shown that every spherically invariant process is a product process, in particular a product of a Gaussian process with a random variable (for more details see [5]).
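A small simulation sketch of such a product process (the modulating density f_y below is an arbitrary illustrative choice): the polar angle of a pair z = y·(x₁, x₂) stays uniform, reflecting spherical invariance, while the marginal density becomes heavier-tailed than Gaussian.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal((n, 2))                 # IID Gaussian pair
y = rng.uniform(0.5, 2.0, size=n)               # independent radial modulation (assumed density)
z = x * y[:, None]                              # product-process samples

phi = np.arctan2(z[:, 1], z[:, 0])              # polar angle
hist, _ = np.histogram(phi, bins=8, range=(-np.pi, np.pi))
angle_ratio = hist.max() / hist.min()           # ~1 for a uniform angle

kurt = np.mean(z[:, 0] ** 4) / np.mean(z[:, 0] ** 2) ** 2   # equals 3 for Gaussian
```

The angle histogram is flat (spherically invariant), yet the kurtosis exceeds 3, so the marginal is not Gaussian; by Theorem C.1 the two components therefore cannot be statistically independent.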
Appendix D
Gaussian random variables play an important role in many applications, in particular whenever a sum of several random variables shows up. Most often the central limit theorem then applies, which, loosely speaking, says that a sum of many arbitrary random variables results asymptotically in a Gaussian random variable. For a real-valued vector x of p dimensions with Gaussian entries, mean x̄ and covariance matrix C_xx = E[(x − x̄)(x − x̄)^T], we obtain the following probability density function:

f_x(x) = (1/√((2π)^p det C_xx)) e^{−½ [x − x̄]^T C_xx^{−1} [x − x̄]}.   (D.1)
For zero-mean random variables the covariance matrix and the autocorrelation matrix R_xx = C_xx are identical. Correspondingly, two jointly Gaussian variables x ∈ IR^{1×p} and y ∈ IR^{1×q} can be described by their joint pdf

f_{x,y}(x,y) = (1/√((2π)^p (2π)^q det C_xx,yy)) e^{−½ [(x−x̄)^T, (y−ȳ)^T] C_xx,yy^{−1} [(x−x̄); (y−ȳ)]},   (D.2)
Gaussian. Assume the first two moments of the complex-valued Gaussian process are given, thus z̄ and C_zz:
This information is not sufficient to uniquely find C_xx, C_yy, C_xy, as the real part of C_zz only defines the sum C_xx + C_yy and the imaginary part only the difference C_yx − C_xy. In order to separate the three parts we need further knowledge, for example

C_zz* = E[(z − z̄)(z − z̄)^T] = C_xx − C_yy + j(C_yx + C_xy).   (D.5)

With this additional knowledge the three parameters can be defined uniquely. Gaussian processes that are circular (spherically invariant) have the following property¹:

C_zz* = C_xx − C_yy + j(C_yx + C_xy) = 0.   (D.6)
This is equivalent to

C_xx = C_yy and C_xy = −C_yx.   (D.7)

Note that the requirement C_xy = −C_yx is unusual; indeed, this requirement ensures the circularity of the process. Consider for example a two-dimensional random vector. The correlation matrix between x and y is given by

C_xy = [α₁₁  α₁₂; α₂₁  α₂₂].   (D.8)

In order to have C_xy = −C_yx, we must have α₁₁ = α₂₂ = 0 and α₁₂ = −α₂₁. Thus we obtain the particular form

C_xy = [0  α₁₂; −α₁₂  0].   (D.9)
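A simulation sketch of property (D.6) (the covariance matrix below is an assumed example): building z = (x + jy)/√2 from two independent real Gaussian vectors with identical covariance yields C_xx = C_yy and C_xy = 0 = −C_yx, so the pseudo-covariance C_zz* vanishes while the ordinary covariance survives.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Cr = np.array([[1.0, 0.4],
               [0.4, 2.0]])                     # assumed real covariance
L = np.linalg.cholesky(Cr)
x = rng.standard_normal((n, 2)) @ L.T           # real-part samples
y = rng.standard_normal((n, 2)) @ L.T           # independent imaginary-part samples
z = (x + 1j * y) / np.sqrt(2)

Czz = z.T @ z.conj() / n                        # ordinary covariance E[z z^H] -> Cr
Czz_star = z.T @ z / n                          # pseudo-covariance E[z z^T]  -> 0
```

The sample pseudo-covariance is (up to Monte Carlo noise) zero, confirming the circularity condition, while E[z z^H] recovers Cr.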
Is this a contradiction to the statement that quadratic forms need to be positive, thus with vanishing imaginary part? The answer is no, as we will show with the next example. We consider the quadratic form z^H C_zz z of a circular complex-valued random vector z:

½ z^H C_zz z = (x − jy)^T [C_xx + C_yy] (x + jy).   (D.10)

We obtain for its real part:

½ ℜ{z^H C_zz z} = x^T [C_xx + C_yy] x + y^T [C_xx + C_yy] y ≥ 0.   (D.11)
2
¹Note that spherical invariance requires that the joint density function is a function of the radius r; for a Gaussian process it is proportional to exp(−r²).
½ ℑ{z^H C_zz z} = x^T [C_xx + C_yy] y − y^T [C_xx + C_yy] x = 0.   (D.12)
Equivalently, the covariance matrix in (D.3) for a circular complex-valued vector z can be written as

C_xx,yy = [C_xx  C_xy; −C_xy  C_xx] = ½ [Re{C_zz}  −Im{C_zz}; Im{C_zz}  Re{C_zz}].   (D.13)
With this, the p-dimensional joint density of a complex-valued random vector z can be compactly described by its covariance matrix C_zz (compare to (D.2) with p = q):

f_{x,y}(x,y) = f_z(z) = (1/(π^p |det C_zz|)) e^{−[z − z̄]^H C_zz^{−1} [z − z̄]}.   (D.14)
The first identity is shown by computing the inverse of the matrix and comparing the individual terms of the x and y pairs:

C_xx,yy^{−1} = [C_xx  C_xy; −C_xy  C_xx]^{−1} = [Δ^{−1}  C_xx^{−1}C_xy Δ^{−1}; −C_xx^{−1}C_xy Δ^{−1}  Δ^{−1}]   (D.17)

with

Δ = C_xx + C_xy C_xx^{−1} C_xy,   (D.18)

[½ C_zz]^{−1} = [C_xx + j C_xy]^{−1} = [I − j C_xx^{−1} C_xy] Δ^{−1}.   (D.19)
Writing both terms in real and imaginary parts x and y, we obtain for the second part (D.19):

z^H [½ C_zz]^{−1} z = x^T Δ^{−1} x + y^T Δ^{−1} y   (D.20)
  + x^T C_xx^{−1}C_xy Δ^{−1} y − y^T C_xx^{−1}C_xy Δ^{−1} x
  − j [x^T C_xx^{−1}C_xy Δ^{−1} x + y^T C_xx^{−1}C_xy Δ^{−1} y]
  − j [y^T Δ^{−1} x − x^T Δ^{−1} y],
and realize that the real part is indeed identical. Note that the imaginary part becomes zero again. However, this is not obvious and will be considered more closely. The last term in the imaginary part vanishes:

y^T Δ^{−1} x − x^T Δ^{−1} y = 0.

The first term is of a more difficult nature:

x^T C_xx^{−1}C_xy Δ^{−1} x + y^T C_xx^{−1}C_xy Δ^{−1} y.
Here we have (C_xx^{−1}C_xy Δ^{−1})^T = −Δ^{−1}C_xy C_xx^{−1}, and due to the skew-symmetry C_xy^T = −C_xy also Δ^{−1}C_xy C_xx^{−1} = C_xx^{−1}C_xy Δ^{−1}. And further we find

x^T C_xx^{−1}C_xy Δ^{−1} x = 0.

The latter relation we prove by

C_xx^{−1}C_xy Δ^{−1} = Δ^{−1}C_xy C_xx^{−1}.
Multiplying by Δ from the left and from the right we obtain:

C_xy [I + (C_xx^{−1}C_xy)²] = [I + (C_xy C_xx^{−1})²] C_xy.

Pulling C_xy through the bracket finally shows the identity.
Obviously g(0,0) = 1 brings back the original density. We further introduce the matrix

L = [ R_xx^{−1} + [0 α 0 0; α 0 0 0; 0 0 0 β; 0 0 β 0] ]^{−1}   (D.29)

for which we have

g(α, β) √(det(R_xx)/det(L)) = 1   (D.30)

independent of α and β. Differentiating g(·,·) with respect to α and β delivers

(∂²/∂α∂β) g(α, β) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} x₁x₂x₃x₄ f_x(x) dx,   (D.31)
Developing the determinant det(R_xx)/det(L) along its first row yields four 3 × 3 subdeterminants; collecting all terms, we abbreviate the result as h(α, β), so that g(α, β) = h(α, β)^{−1/2}.
We now take advantage of the symmetries r_ij = r_ji and differentiate h(α, β)^{−1/2} with respect to α and β:

(∂²/∂α∂β) h(α, β)^{−1/2} = (3/4) h(α, β)^{−5/2} h_α(α, β) h_β(α, β) − (1/2) h(α, β)^{−3/2} h_{α,β}(α, β).   (D.34)

Here the first derivatives with respect to α and β are indexed accordingly. As we are interested in the result for α = 0 and β = 0, we finally obtain

E[x₁x₂x₃x₄] = (∂²/∂α∂β) h(α, β)^{−1/2} |_{α=0, β=0}   (D.35)
= r₁₂r₃₄ + r₁₃r₂₄ + r₁₄r₂₃.   (D.36)
It is worthwhile to read the corresponding section in Papoulis' textbook, where a much simpler derivation is shown. Note, however, that the path taken here can readily be modified to compute other terms, like sixth-order moments.
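The result (D.36) can be cross-checked by a short Monte Carlo experiment (the covariance matrix is an arbitrary positive definite example):

```python
import numpy as np

R = np.array([[1.0, 0.3, 0.2, 0.1],
              [0.3, 1.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.3],
              [0.1, 0.2, 0.3, 1.0]])            # assumed covariance (diagonally dominant, hence SPD)
rng = np.random.default_rng(7)
X = rng.multivariate_normal(np.zeros(4), R, size=400_000)
mc = np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3])        # sample fourth moment
formula = R[0, 1] * R[2, 3] + R[0, 2] * R[1, 3] + R[0, 3] * R[1, 2]   # (D.36)
```

The sample average agrees with the closed-form value r₁₂r₃₄ + r₁₃r₂₄ + r₁₄r₂₃ up to Monte Carlo noise.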
Appendix E
are in general not in the modal space of the driving process u_k^T = [u_k, u_{k−1}, ..., u_{k−M+1}] with autocorrelation matrix R_uu = E[u_k u_k^T] = QΛ_uQ^T, thus
Since the derivation of the LMS algorithm only requires knowledge of the trace of such matrices, it is sufficient to analyze only the algorithm's impact on the parameter error vector with respect to R_uu. It is therefore proposed to decompose a given matrix K into a first part that lies in the modal space of the autocorrelation matrix R_uu of the driving process u_k and a second part in its orthogonal complement space, that is:

K = b₀I + b₁R_uu + ... + b_{M−1}R_uu^{M−1} + K^⊥ = P(R_uu) + K^⊥.   (E.3)

Here, P(R_uu) denotes a polynomial in R_uu. Note that due to the Cayley-Hamilton theorem, an exponent larger than M − 1, with M denoting the system order, is not required [66].
Lemma E.1 Any Hermitian matrix K can be decomposed into a part from the subspace of a given modal space R_u = span{I, R_uu, R_uu², ..., R_uu^{M−1}} and its orthogonal complement subspace R_u^⊥, for which

tr[K^⊥ R_uu^l] = 0

for any value of l = 0, 1, 2, ....

Proof: The optimal set of coefficients to approximate the Hermitian matrix K is found by

min_{b₀, b₁, ..., b_{M−1}} tr[(K − P(R_uu))(K − P(R_uu))^T],   (E.4)
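Lemma E.1 can be checked numerically: projecting a symmetric K onto span{I, R_uu, ..., R_uu^{M−1}} under the trace inner product leaves a remainder that is trace-orthogonal to every power of R_uu. A sketch (the matrices below are random illustrative examples):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 5
B = rng.standard_normal((M, M))
Ruu = B @ B.T + M * np.eye(M)                  # example SPD autocorrelation matrix
Ruu /= np.linalg.norm(Ruu, 2)                  # scaling for numerical conditioning
S = rng.standard_normal((M, M))
K = 0.5 * (S + S.T)                            # arbitrary symmetric K

basis = [np.linalg.matrix_power(Ruu, l) for l in range(M)]
G = np.array([[np.trace(Bi @ Bj) for Bj in basis] for Bi in basis])   # Gram matrix
rhs = np.array([np.trace(Bi @ K) for Bi in basis])
b = np.linalg.solve(G, rhs)                    # optimal coefficients of (E.4)
P = sum(bl * Bl for bl, Bl in zip(b, basis))   # modal-space part P(Ruu)
K_perp = K - P                                 # orthogonal-complement part
residual = max(abs(np.trace(K_perp @ np.linalg.matrix_power(Ruu, l)))
               for l in range(2 * M))          # also l >= M, via Cayley-Hamilton
```

The residual traces vanish for all powers l, including l ≥ M, since those powers are themselves polynomials in the first M powers.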
Example 1: Assume first a specific solution for a real-valued Gaussian random process and for K₀ = R_uu⁰ = I, that is, K₀ is a member of the modal space R_u of R_uu. If it is for example assumed that the initial parameter estimate is ŵ₀ = 0 and an average over many possible systems w is performed, K₀ = E[ww^T] = I can be a realistic assumption. If on the other hand a-priori knowledge on the set of systems is present, other values may be more realistic (see for example [44] for impulse responses of rooms, in which case K₀ is of diagonal shape). In the first step
K₁ = I − 2μR_uu + μ²(2R_uu² + R_uu tr[R_uu])   (E.8)

is obtained; that is, K₁ is a second-order polynomial in R_uu and thus lies in the modal space of R_uu. Assume now that K_k develops into an arbitrary polynomial in R_uu. How does it change from time instant k to k + 1?
In other words, it remains a polynomial in R_uu. The same is in fact true for K₀ being any polynomial in R_uu. It can thus be concluded that the LMS update equation under a real-valued Gaussian process forces the parameter error vector covariance matrix K₀ = P(R_uu) to evolve as a polynomial in the modal space of R_uu. Terms of the orthogonal complement are never generated.
Example 2: Let us now assume that the initial covariance matrix is entirely from the orthogonal complement R_u^⊥, that is K₀ = K^⊥. In the first step
The part K₁^⊥ from the orthogonal complement space thus only contributes to this space as K₂^⊥ but has no influence in the modal space of R_uu. Thus, any component from the orthogonal complement will remain there and will not generate a component in the modal space of R_uu.
A general K₀ will be a linear combination as shown in (E.3). Take for example a fixed system w to be identified. In this case K₀ = ww^T. This value can be decomposed into P(R_uu) in the modal space of R_uu and a component K^⊥ from its orthogonal complement. The polynomial will evolve into a polynomial and thus stay in the modal space and contribute to the learning performance terms, while the perpendicular terms will not contribute to the algorithm's performance curves under the trace operation. This also allows the description of the evolution of the individual components: starting with K_k = K_k^∥ + K_k^⊥, with K_k^∥ ∈ R_u and K_k^⊥ ∈ R_u^⊥, a set of homogeneous equations is obtained:

K_{k+1}^∥ = K_k^∥ − μR_uu K_k^∥ − μK_k^∥ R_uu + μ²(2R_uu K_k^∥ R_uu + R_uu tr[K_k^∥ R_uu]),   (E.12)

K_{k+1}^⊥ = K_k^⊥ − μR_uu K_k^⊥ − μK_k^⊥ R_uu + 2μ²R_uu K_k^⊥ R_uu,   (E.13)

which in turn allows us to formulate a first statement for Gaussian driving processes.
Lemma E.2 Assume the driving process u_k = Ax_k to be a linearly filtered white Gaussian process x_k with E[x_k x_k^T] = I_{M+P} and A an upper-right Toeplitz matrix performing the linear filtering. Under the IA, the initial parameter error vector covariance matrix K₀ of the LMS algorithm evolves 1) into a polynomial in AA^T = R_uu of the modal space of R_uu, solely responsible for the mismatch and the misadjustment of the algorithm, and 2) a part in its orthogonal complement which has no impact on the performance measures.
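The two homogeneous recursions can be simulated directly (random example matrices, real-valued Gaussian case): a component started in the orthogonal complement keeps tr[K_k^⊥ R_uu^l] = 0 at every iteration of (E.13), i.e. it never shows up in the performance measures, and it contracts over time.

```python
import numpy as np

rng = np.random.default_rng(5)
M, mu = 4, 0.05
B = rng.standard_normal((M, M))
Ruu = B @ B.T + M * np.eye(M)
Ruu /= np.linalg.norm(Ruu, 2)                  # scale spectrum into (0, 1]
basis = [np.linalg.matrix_power(Ruu, l) for l in range(M)]

# construct K0 entirely in the orthogonal complement (projection as in Lemma E.1)
S = rng.standard_normal((M, M))
K = 0.5 * (S + S.T)
G = np.array([[np.trace(Bi @ Bj) for Bj in basis] for Bi in basis])
b = np.linalg.solve(G, np.array([np.trace(Bi @ K) for Bi in basis]))
Kp = K - sum(bl * Bl for bl, Bl in zip(b, basis))
norm0 = np.linalg.norm(Kp)

leak = 0.0
for _ in range(50):                            # iterate recursion (E.13)
    Kp = Kp - mu * (Ruu @ Kp + Kp @ Ruu) + 2 * mu ** 2 * Ruu @ Kp @ Ruu
    leak = max(leak, max(abs(np.trace(Kp @ Bl)) for Bl in basis))
```

The trace leakage stays at numerical zero while the norm of the orthogonal component decays, matching "remain there or die out".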
As such examples are rather intuitive for the particular case of a Gaussian driving process (spherically invariant processes as a generalization of Gaussian processes can be included straightforwardly), it is of interest what can be said about larger classes of driving processes. To achieve this goal a few considerations with respect to the driving process are required.

Driving Process: The properties of Lemma E.2 are not only maintained by Gaussian random processes but by a much larger class of driving processes. It will be shown that these properties hold for random processes that are constructed as a linearly filtered white, zero-mean random process u_k = A[x_k] = Σ_{m=0}^P a_m x_{k−m}, whose only conditions are that:
Lemma E.3 Assume the driving process u_k = A[x_k] to originate from a linearly filtered white random process x_k so that u_k = Ax_k with x_k^T = [x_k, x_{k−1}, ..., x_{k−N+1}], A denoting an upper-right Toeplitz matrix with the correlation filter impulse response and x_k satisfying the conditions (E.14)-(E.22). The parameter error vector covariance matrix K₀ = K₀^∥ + K₀^⊥ of the LMS algorithm essentially (with an error of order O(μ²/M)) evolves into a polynomial in AA^T in the modal space of R_uu, while terms in its orthogonal complement K^⊥ remain there or die out.

¹Alternatively, the IA can be removed by employing particular processes in which each element of the regression vector u_k^T = [u_{k,1}, u_{k,2}, ..., u_{k,M}] is generated by a filtered version of individual processes x_{k,1}, ..., x_{k,M}. As such processes seem artificial, this approach is not followed here.
Note that this formulation may suggest that this is only true for linearly filtered processes of moving average (MA) type. As no condition on the order N of such a process is imposed, N can become arbitrarily large, and thus autoregressive (AR) processes or combinations (ARMA) can be resembled as well (e.g., see [28], Chapter 2.7).
Proof: The proof proceeds in two steps: first, rewriting (E.7) for K₀ = I to get to know the most important terms and mathematical steps based on a simpler formulation, and then refining the arguments for arbitrary transitions from K_k to K_{k+1}.

For K₀ = I, and recalling that R_uu = E[u_k u_k^T] = AE[x_k x_k^T]A^T = AA^T, the following is obtained:

K₁ = I − 2μAA^T + μ²AE[x_k x_k^T A^T A x_k x_k^T]A^T.   (E.23)

On the main diagonal of the M × M matrix AA^T identical elements are found: Σ_{i=0}^P |a_i|². Thus tr[A^T A] = tr[AA^T] = M Σ_{i=0}^P |a_i|², with P denoting the filter order of the MA process.
is found, where diag[L] denotes a diagonal matrix with the diagonal terms of a matrix L as entries. In case x_k ∈ ℂ the slightly different result

E[x_k x_k^H L x_k x_k^H] = m_x^{(2,2)} L + m_x^{(2,2)} tr[L] I_{M+P} + (m_x^{(4)} − 2m_x^{(2,2)}) diag[L]   (E.26)

is obtained.
For spherically invariant random processes (including Gaussian) the term (m_x^{(4)} − 3m_x^{(2,2)}) for real-valued signals, or (m_x^{(4)} − 2m_x^{(2,2)}) for complex-valued signals, vanishes, and thus the problem can be solved classically.
In our particular case L = A^T A ∈ ℝ^{(M+P)×(M+P)} with tr[A^T A] = M Σ_{i=0}^P |a_i|² and diag[AA^T] = Σ_{i=0}^P |a_i|² I_{M+P} = (tr[A^T A]/M) I_{M+P}. One problematic term remains, however: diag[A^T A]. At this point the following is proposed:
Univ.Prof. DI Dr.-Ing. Markus Rupp 143
a value that depends on the statistics of the process x_k. The term m_x^{(4)}/m_x^{(2,2)} − 3 is similar to the excess kurtosis E[|x − m_x|⁴]/(E[|x − m_x|²])² − 3 = m_x^{(4)}/(m_x^{(2)})² − 3. Processes with negative excess kurtosis are often referred to as sub-Gaussian processes, while a positive excess kurtosis leads to so-called super-Gaussian processes. This (slightly abused) terminology will be used correspondingly to discriminate the term m_x^{(4)}/m_x^{(2,2)} − 3. Thus sub-Gaussian processes in this sense take on γ_x values smaller than one, while super-Gaussian processes have values larger than one.
However, it is also noted that our approximation error Δε has an impact only in the case γ_x ≠ 1, and it vanishes not only for Gaussian pdfs but also with growing filter order M. Note further that the term in the LMS algorithm to which Approximation A2 applies is proportional to μ². It thus has no impact for small step-sizes, but certainly on the stability bound. A first conclusion thus is that the error on the parameter error vector covariance matrix due to this approximation is of O(μ²). Furthermore, the approximation error term is proportional to γ_x − 1, that is, proportional to 1/M. Approximation A2 can thus be concluded to cause an error of the parameter error vector covariance matrix of order O(μ²/M). The consequence that the applied approximation is negligible for large filter order M as well as for Gaussian-type processes is reflected in Lemma E.3 by the wording "essentially". This means that in extreme cases (small M and far from Gaussian) indeed a very small proportion can leak into the complementary space. At the first update with K₀ = I
Now the proof proceeds for general updates from K_k to K_{k+1}. While the terms that are linear in μ are straightforward, the quadratic part in μ needs more attention. This can be split into two parts, one in the modal space and one in its orthogonal complement.
The consequence of this statement is that the parameter error vector covariance matrix is forced by the driving process to remain in the modal space of the driving process. This is not only true for its initial values but at every time instant k. The components of the orthogonal complement remain there or die out. This statement will be addressed later in more detail in the context of step-size bounds for stability. Note also that for complex-valued processes the only difference in (E.33) is the occurrence of AA^H K_k AA^H rather than 2AA^T K_k AA^T.
is obtained, or equivalently
Since K_∞ exists only in the modal space of AA^T, diagonalizing both by the same unitary matrix³ leads to Q K_∞ Q^T = Λ_K and Q AA^T Q^T = Λ_u.

2Λ_u Λ_K − μ m_x^{(2,2)} (2Λ_u² Λ_K − γ_x Λ_u tr[Λ_u Λ_K]) = μ σ_v² Λ_u.   (E.39)
obtained by employing the matrix inversion lemma [28]: [P(Λ_u) + λ_u λ_u^T]^{-1} λ_u = P(Λ_u)^{-1} λ_u / [1 + λ_u^T P(Λ_u)^{-1} λ_u]. The final steady-state system mismatch is thus obtained, the only difference to classic solutions for SIRPs [53] being the term γ_x, which contains influences of the fourth-order moments m_x^{(4)} and m_x^{(2,2)}, with m_x^{(2,2)} also appearing explicitly.
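The rank-one inversion step employed above can be checked numerically. The following sketch (random diagonal P(Λ_u) and vector λ_u of illustrative dimension; these values are not from the text) verifies the identity [P + λλ^T]^{-1}λ = P^{-1}λ / (1 + λ^T P^{-1}λ):

```python
import numpy as np

rng = np.random.default_rng(4)
P = np.diag(rng.uniform(0.5, 2.0, 5))   # diagonal P(Lambda_u), illustrative
lam = rng.standard_normal(5)            # vector lambda_u, illustrative

# Left-hand side: solve with the rank-one-perturbed matrix directly.
lhs = np.linalg.solve(P + np.outer(lam, lam), lam)
# Right-hand side: matrix inversion lemma form using only the diagonal P.
rhs = np.linalg.solve(P, lam) / (1.0 + lam @ np.linalg.solve(P, lam))

print(np.allclose(lhs, rhs))   # True
```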
³ Even if A2 is not satisfied and K_∞ has a component in the orthogonal complement space, the method applying Q can be used. Although K_∞ is then not diagonalized and Λ_K is not of diagonal form, only its diagonal terms are of importance for the performance measures and will be considered later on.
For complex-valued driving processes the final steady-state system mismatch is given correspondingly by

tr[K_∞] = 1^T λ_K = [μ σ_v² Σ_i λ_i/(2 − μ m_x^{(2,2)} λ_i)] / [1 − μ m_x^{(2,2)} γ_x Σ_i λ_i/(2 − μ m_x^{(2,2)} λ_i)].   (E.46)
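To get a feeling for (E.46), the expression can be evaluated numerically. The sketch below uses illustrative eigenvalues λ_i of R_uu and parameter values (all chosen for this example only, not taken from the text):

```python
import numpy as np

# Illustrative quantities (not from the text): eigenvalues of R_uu,
# joint moment m_x^(2,2), factor gamma_x, step-size, and noise power.
lam = np.array([0.5, 1.0, 1.5, 2.0])
m22, gamma_x, mu, sigma_v2 = 1.0, 1.0, 0.05, 1e-3

# Steady-state mismatch (E.46): a weighted sum over the eigenvalues.
s = np.sum(lam / (2.0 - mu * m22 * lam))
trK = mu * sigma_v2 * s / (1.0 - mu * m22 * gamma_x * s)
print(trK)   # a small positive mismatch, growing with mu and sigma_v2
```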
Theorem E.1 Assuming the driving process u_k = A[x_k] to originate from a linearly filtered white random process x_k with properties (E.14)-(E.22), any parameter error vector covariance matrix K_0 evolves essentially into two parts: a polynomial in AA^T, stemming from its decomposition onto the modal space of R_uu, and a second part K_k^⊥ in its orthogonal complement. The LMS update affects these two parts independently of each other.
The proof is straightforward by applying all previous results. In other words, the complementary subspace part K_k^⊥ has no impact on the LMS performance measures and can thus be neglected, not only for Gaussian but for a large class of linearly filtered random processes.
A consequence of this theorem is the step-size bound that can be derived either from (E.44) or by Gershgorin's circle theorem from the matrix in (E.40). The result is identical either way and conservative for real-valued x_k:

0 < μ ≤ 2 / (m_x^{(2,2)} (2λ_max + γ_x tr[R_uu])).   (E.48)
Depending on the statistics of the driving process, a more or less conservative bound is obtained. It is worth distinguishing the sub- and super-Gaussian cases. For sub-Gaussian distributions γ_x < 1, while for super-Gaussian distributions γ_x > 1. The step-size bound thus varies with the distribution type more or less through tr[R_uu] in the bound (E.48). For SIRP (and thus Gaussian) distributions as well as for very long filters, γ_x = 1 and thus

0 < μ ≤ 2 / (3 m_x^{(2,2)} tr[R_uu]) ≤ 2 / (m_x^{(2,2)} (2λ_max + tr[R_uu])).   (E.49)
148 Adaptive Filters (preliminary)
This result is identical to the classic term 2/(3 tr[R_uu]) for Gaussian processes, as m_x^{(2,2)} = 1. Note that the results are conservative. For a statistically white driving process, for example, an exact bound leads to μ ≤ 2/[m_x^{(2,2)} tr[R_uu] (γ_x + 2/M)], thus a significantly larger bound, still depending on the distribution through the value of γ_x. For complex-valued processes x_k the bounds are obtained very similarly, simply substituting the term 2λ_max by λ_max.
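As a sanity check of these bounds, one can run LMS with a statistically white Gaussian driving process, for which R_uu = I, m_x^{(2,2)} = 1 and γ_x = 1, so that (E.49) reduces to μ ≤ 2/(3M). The following sketch (filter order, noise level, and data length are illustrative choices) adapts at half the bound and observes stable convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16            # filter order (illustrative choice)
sigma_v2 = 1e-4   # observation-noise variance (illustrative)
N = 20000

# White Gaussian driving process: R_uu = I, lambda_max = 1, tr[R_uu] = M,
# m_x^(2,2) = 1 and gamma_x = 1, so the bound (E.49) reads mu <= 2/(3M).
mu_bound = 2.0 / (3.0 * M)
mu = 0.5 * mu_bound          # well inside the stable range

w_true = rng.standard_normal(M)
w = np.zeros(M)
x = rng.standard_normal(N + M)
for k in range(N):
    u = x[k:k + M][::-1]                     # regression vector
    e = u @ w_true + np.sqrt(sigma_v2) * rng.standard_normal() - u @ w
    w += mu * u * e                          # LMS update

mismatch = np.sum((w - w_true) ** 2)
print(mismatch)   # small: stable adaptation for mu below the bound
```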
Further note that the components in the orthogonal complement space indeed vanish, as argued at the beginning of Section III. Take for example Eqn. (E.13) or (E.36) as the evolution of the orthogonal complement. It is straightforward to show that for the given step-size range in (E.48) or (E.49) the components K_k^⊥ vanish, as long as no new components are induced by violating Assumption A2.
Theorem E.2 Assuming the driving process u_k = C[x_k] to originate from a linearly filtered random process x_k, the very long adaptive LMS filter simplifies to:

λ_K = μ σ_v² [2Λ_u − μ m_x^{(2,2)} Λ_u² − μ m_x^{(2,2)} λ_u λ_u^T]^{−1} λ_u.   (E.50)
Proof: Following the fact that cyclic matrices can be diagonalized by DFT matrices, say F, we reconsider the term E[u_k u_k^T K_k u_k u_k^T], remembering that a process linearly filtered by a unitary filter F preserves its properties at the filter output for very long filters. We thus find

E[u_k u_k^T K_k u_k u_k^T] = F^H Λ_u^{1/2} E[f_k f_k^H (Λ_u^{H/2} F K_k F^H Λ_u^{1/2}) f_k f_k^H] Λ_u^{H/2} F   (E.52)
= F^H Λ_u^{1/2} E[f_k f_k^H (Λ_u^{H/2} Λ_K Λ_u^{1/2}) f_k f_k^H] Λ_u^{H/2} F.   (E.53)
Note that the filter is now formulated in terms of a complex-valued driving process f_k = F x_k. We now notice that the center term of the last equation is of diagonal form, simplifying the terms to

E[f_k f_k^H L f_k f_k^H] = m_f^{(2,2)} L + m_f^{(2,2)} tr(L) I + (m_f^{(4)} − 2 m_f^{(2,2)}) L   (E.54)
= m_f^{(2,2)} tr(L) I + m_f^{(2,2)} L,   (E.55)

with the particular solution L = Λ_u^{H/2} Λ_K Λ_u^{1/2} = Λ_u Λ_K. Note that we have used the fact that m_f^{(4)} = 2 m_f^{(2,2)}, as shown in the appendix. The parameter error vector covariance matrix now evolves as
This shows that for very large filter orders our previous approximations hold exactly and the parameter error vector covariance matrix K_k indeed remains in the modal space of the driving process u_k defined by the DFT matrices of order M.
and finally

λ_K = β_∞ [2Λ_u − μ m_f^{(2,2)} Λ_u²]^{−1} λ_u,   (E.60)

with

β_∞ = μ σ_v² / (1 − μ m_f^{(2,2)} Σ_i λ_i/(2 − μ m_f^{(2,2)} λ_i)).   (E.61)
Note that this result for the long filter depends only on the joint moment m_f^{(2,2)} of the DFT of the driving process. As shown in the appendix, this moment takes on the same value for most distributions. This explains why the long LMS filter behaves more or less identically, independently of the driving process, as long as the correlation is the same. The interested reader may also compare this to an older publication by Gardner [19], in which the fourth-order moment m_x^{(4)} was emphasized for purely white driving processes.
The step-size bound for the very long filter is thus considerably larger (by one third) than for short lengths. An explicit dependency on the distribution or on the length M of the filter is no longer present. Note, however, that m_f^{(2,2)} depends on M, and the correction term γ_x can be applied to the bound (E.64) as well to address small filters.
There is also an alternative bound possible now. As the eigenvalues λ_i simply originate from the cyclic matrices C that linearly filter the driving process, they are obtained by a DFT on the filter matrices CC^T, or equivalently correspond to the powers of the spectrum of u_k at equidistant frequencies 2π/M. This allows an alternative bound via λ_max ≤ max_Ω |C(e^{jΩ})|², thus

0 < μ ≤ 1 / (m_f^{(2,2)} max_Ω |C(e^{jΩ})|²) ≤ 1 / (m_f^{(2,2)} λ_max).   (E.65)
This bound relates more to the spectral variations of the driving process, while the former bound including the trace term focuses more on the gain of the correlation filter. A similar bound was already proposed by Butterweck [7] in the context of LMS analysis following a wave-theoretical argument. In our notation it reads:

0 < μ_Butterweck ≤ 1 / (m_x^{(2)} M max_Ω |C(e^{jΩ})|² / tr(C^T C)).   (E.66)
It is thus reassuring to learn that classical matrix approaches lead to similar results. In Butterweck's analysis the fourth-order joint moment m_f^{(2,2)} is not accounted for. In particular for spherically invariant processes that are not Gaussian, this moment plays a crucial role.
Appendix F
Both sets of vectors thus build a basis of a linear vector space of dimension four. While the first set {e_1, e_2, e_3, e_4} is an orthonormal basis (normalized to unity and mutually orthogonal), the second set {g_1, g_2, g_3, g_4} is simply a basis of this space.
Let us now consider a matrix H ∈ ℂ^{N×M}. Obviously, H describes a mapping of vectors x ∈ ℂ^{M×1} onto vectors y ∈ ℂ^{N×1}:

y = Hx.   (F.1)
The range space of the matrix H is the linear vector space spanned by its columns, thus

R(H) = {Hx | x ∈ ℂ^{M×1}}.   (F.2)
That R(H) is indeed a linear space follows from the property that if {Hx_1, Hx_2} ∈ R(H) for some {x_1, x_2}, then also c_1 Hx_1 + c_2 Hx_2 ∈ R(H). Such properties do not allow us to conclude whether the mapping (F.1) is unique, that is, whether there are two different x_1 and x_2 that generate the same y. If this is the case, we have H[x_1 − x_2] = 0. If such vectors exist, their differences define the nullspace of H,

N(H) = {x ∈ ℂ^{M×1} | Hx = 0}.   (F.3)

Repeating the above argument, we can show that the nullspace is a linear space too. In summary, we find that the uniqueness of a solution depends on the nullspace: if it contains only the zero vector, the solution is unique.
Lemma F.1 (column space and nullspace) We have the following properties:
N(H) = R(H^H)^⊥, N(H^H) = R(H)^⊥, R(H) = N(H^H)^⊥, R(H^H) = N(H)^⊥.
Proof: Let us consider the first property. Let x ∈ N(H). Then Hx = 0, and hence y^H Hx = x^H H^H y = 0 for all y. Thus x is orthogonal to R(H^H), i.e., x ∈ R(H^H)^⊥, and since x ∈ N(H) was arbitrary, we must have N(H) ⊆ R(H^H)^⊥. Starting the argument instead with x ∈ R(H^H)^⊥, we conclude that x ∈ N(H) and thus R(H^H)^⊥ ⊆ N(H). Therefore N(H) = R(H^H)^⊥. The remaining properties follow accordingly.
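Lemma F.1 can be illustrated numerically: an SVD delivers orthonormal bases of N(H) and of R(H^H), and the two spaces are orthogonal. A small NumPy sketch with an illustrative rank-deficient H:

```python
import numpy as np

rng = np.random.default_rng(1)
# A 5x7 matrix of rank 3 (product of random 5x3 and 3x7 factors).
H = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))

U, s, Vh = np.linalg.svd(H)
r = int(np.sum(s > 1e-10))     # numerical rank
null_H = Vh[r:].T              # columns span N(H)
range_HH = Vh[:r].T            # columns span R(H^H)

# N(H) = R(H^H)^perp: nullspace vectors are annihilated by H and
# orthogonal to every vector in the range of H^H.
print(np.allclose(H @ null_H, 0))                 # True
print(np.abs(range_HH.T @ null_H).max() < 1e-10)  # True
```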
Lemma F.2 R(H^H) = R(H^H H).
Proof: With Lemma F.1 it is sufficient to show that N(H) = N(H^H H). Let x ∈ N(H^H H); then H^H Hx = 0. Thus we also have x^H H^H Hx = ‖Hx‖₂² = 0, and Hx = 0 is equivalent to x ∈ N(H); therefore N(H^H H) ⊆ N(H). Starting from the other end: let x ∈ N(H); then Hx = 0 and also H^H Hx = 0. Thus x ∈ N(H^H H) and therefore N(H) ⊆ N(H^H H). Hence N(H) = N(H^H H) and thus also R(H^H) = R(H^H H).
Appendix G
A solution x_o to this problem, if it exists, will in general not be a stationary (or extremal) point of J(x). It thus need not be a point (and in general is not one) at which the gradient of J(·) vanishes.

J(x) = x² + x − 2,  f(x) = x²,  b = 1.

The solutions of f(x) = 1 are x = ±1, while the extremal point of J(x) lies at x = −1/2. Under the constraint f(x) = 1 we obtain the two candidates x = ±1 with J(1) = 0 and J(−1) = −2; only x = −1 minimizes J(x).
In general such an elimination is not necessarily available and may be difficult to obtain. The method of Lagrangian multipliers that we explain next offers a procedure which may lead to the desired result. We first have to consider the differentials of J(x) and f(x):

dJ(x) = (∂J/∂x_1) dx_1 + (∂J/∂x_2) dx_2 + ... + (∂J/∂x_n) dx_n   (G.3)

df(x) = (∂f/∂x_1) dx_1 + (∂f/∂x_2) dx_2 + ... + (∂f/∂x_n) dx_n.   (G.4)
df(x_o) = 0.   (G.5)

J(x) = x² + x − 2,  f(x) = x²,  b = 1.

(2x_o + 1) − 2λx_o = 0,
Summarizing, we can say that the method of Lagrangian multipliers transforms an optimization problem J(x) with constraints f_i(x) = b_i into a new optimization problem

V(x, λ) = J(x) − Σ_{i=1}^{n} λ_i [f_i(x) − b_i]   (G.8)

whose solution satisfies

dV(x_o) = 0,  f_i(x_o) = b_i.
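For the scalar example above, the Lagrangian reads V(x, λ) = x² + x − 2 − λ(x² − 1), with stationarity condition (2x + 1) − 2λx = 0. The short sketch below checks both feasible points and picks the constrained minimizer:

```python
# Feasible points of the constraint f(x) = x^2 = 1:
candidates = [1.0, -1.0]

def J(x):
    return x * x + x - 2.0

# For each candidate, solve the stationarity condition (2x + 1) - 2*lam*x = 0
# for the multiplier lam and confirm that it is indeed satisfied.
for x in candidates:
    lam = (2 * x + 1) / (2 * x)
    assert abs((2 * x + 1) - 2 * lam * x) < 1e-12

best = min(candidates, key=J)
print(best, J(best))   # -1.0 -2.0 -> the constrained minimum
```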
If random processes with particular spectral properties are desired, they can be designed by filtering a white process. The properties of the process are then defined by the filter parameters. Gaussian processes are of particular interest, as they do not lose their Gaussian distribution under linear filtering; the filtering merely changes the variance and autocorrelation of the process. Essentially there are three important classes of filtered model processes: AR, MA, and ARMA, the last being a superposition of the first two.
x_k = Σ_{l=1}^{P} a_l x_{k−l} + v_k   (H.1)
Here v_k ∼ N(0, 1) is the Gaussian driving source with unit variance. AR processes typically appear when resonances are to be modeled. Speech signals, for example, can be modeled well by AR processes, as the vocal tract in mouth and throat exhibits many such resonances. Even sinusoidal or narrow-band signals with heavy noise can be modeled well. A further advantage of AR processes is that the autocorrelation matrix (typically the autocovariance matrix, as the signals have zero mean) can be computed directly from the AR parameters a_1, a_2, ..., a_P and vice versa. For this we only need to multiply the process in (H.1) by x_k (for complex-valued processes by x_k^*) and compute the expectation. We thus obtain:
E[x_k²] = r_xx(0) = Σ_{l=1}^{P} a_l r_xx(l) + 1.   (H.2)
The last term E[x_k v_k] = E[v_k v_k] = 1 is obtained since, apart from v_k itself, x_k depends only on past values of the white process v_k. By multiplying (H.1) with previous values x_{k−l} we further obtain
If we want to compute the spectrum of such a process, we can use the generator equation (H.1) and find:

s_xx(Ω) = 1 / (1 − Σ_{l=1}^{P} a_l e^{jΩl}) · 1 / (1 − Σ_{l=1}^{P} a_l e^{−jΩl}).   (H.4)

The recursive filtering preserves the property of exhibiting sharp resonances; it is poles close to the unit circle that ensure this property.
A problem with AR processes in simulations are their initial values. Note that when starting at time instant zero, the past values x_{k−1}, x_{k−2}, ... are undefined. They are thus often set to zero, which is incorrect, as the joint density of these variables also needs to be Gaussian. Since recursive processes forget their initial values over time, it is recommended to let them run for a while first and only then use their outputs.
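The relation (H.2) can be checked in a small simulation. For an AR(1) process x_k = a_1 x_{k−1} + v_k it yields r_xx(0) = 1/(1 − a_1²). The sketch below (coefficient, sample size, and burn-in are illustrative choices) discards a burn-in phase, as recommended above, and compares the estimated variance with the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(2)
a1 = 0.9                    # AR(1) coefficient (illustrative)
N, burn = 200000, 2000      # sample size and discarded burn-in

v = rng.standard_normal(N + burn)
x = np.zeros(N + burn)
for k in range(1, N + burn):
    x[k] = a1 * x[k - 1] + v[k]    # generator (H.1) with P = 1
x = x[burn:]                       # forget the (incorrect) zero start

# (H.2) with P = 1 and r_xx(1) = a1 * r_xx(0) gives r_xx(0) = 1/(1 - a1^2).
r0_theory = 1.0 / (1.0 - a1 ** 2)
r0_est = np.mean(x * x)
print(r0_theory, r0_est)   # the estimate is close to the theoretical value
```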
However, different from the AR process, the computation of the filter coefficients is now not so straightforward, as we do not obtain a set of linear equations. Iterative solutions exist for this problem.
The spectrum of the MA process can again be computed from the generator (H.5):

s_xx(Ω) = (1 − Σ_{l=1}^{Q} b_l e^{jΩl}) (1 − Σ_{l=1}^{Q} b_l e^{−jΩl}).   (H.8)
Different from the AR process, the MA process can describe zeros in the spectrum rather well. The initial values can be generated easily by selecting independent Gaussian variables v_{k−1}, v_{k−2}, ..., v_{k−Q} of unit variance.
A combination of AR and MA processes, the so-called ARMA processes, allows for both resonances and zeros in the spectrum. ARMA models offer high flexibility, as only few coefficients are needed to model specific spectra. However, it can be difficult to compute the filter coefficients from a given acf.
Appendix I
Appendix J
Small-Gain Theorem
The small gain theorem can be interpreted as a generalization of stability results for the linear case. It is well known for linear time-invariant systems that a closed-loop system is stable if and only if the open-loop system has a gain smaller than one. The small gain theorem extends this statement to arbitrary nonlinear systems. Consider an input signal x_k, k = 1..N, described by the vector x_N of dimension 1 × N. The response of a system H_N to such a signal is given by

y_N = H_N x_N.   (J.1)
Definition 1: A mapping H is called l-stable if two positive constants γ, β exist such that for all input signals x_N the output is upper bounded by:
Definition 2: The smallest positive constant γ which satisfies l-stability is called the gain of the system H_N.
Remark: So-called bounded-input bounded-output (BIBO) stability is nothing else but l_∞-stability.
Let us now consider a feedback system with two components H_N and G_N of individual gains γ_h and γ_g. We find:

y_N = H_N h_N = H_N [x_N − z_N]   (J.3)
z_N = G_N g_N = G_N [u_N + y_N].   (J.4)

Theorem J.1 (Small Gain Theorem) If the gains γ_h and γ_g are such that

γ_h γ_g < 1,   (J.5)
‖h_N‖ ≤ [‖x_N‖ + γ_g ‖u_N‖ + β_g + γ_g β_h] / (1 − γ_h γ_g)   (J.6)

‖g_N‖ ≤ [‖u_N‖ + γ_h ‖x_N‖ + β_h + γ_h β_g] / (1 − γ_h γ_g).   (J.7)
Proof: We have

h_N = x_N − z_N   (J.8)
g_N = u_N + y_N.   (J.9)
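A minimal numerical illustration of the theorem, assuming H and G are pure scalar gains (so β_h = β_g = 0) with γ_h γ_g < 1: iterating the loop equations (J.3)/(J.4) converges, and the resulting signal respects the bound (J.6). All numeric values are illustrative choices:

```python
gh, gg = 0.5, 0.8        # gains, gh * gg = 0.4 < 1
x, u = 1.0, 0.3          # external inputs (illustrative)

h = z = 0.0
for _ in range(200):     # fixed-point iteration of the feedback loop
    h = x - z            # h = x - z   (J.3, inner signal)
    y = gh * h           # y = H h     (pure gain H)
    g = u + y            # g = u + y   (J.4, inner signal)
    z = gg * g           # z = G g     (pure gain G)

# Bound (J.6) with beta_h = beta_g = 0:
bound_h = (abs(x) + gg * abs(u)) / (1.0 - gh * gg)
print(abs(h) <= bound_h)   # True: the loop signal stays bounded
```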
[1] A. Bahai, M. Rupp, “Training and tracking of adaptive DFE algorithms under IS-136,”
SPAWC97 in Paris, April 1997.
[2] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.
[3] N.J. Bershad, “Analysis of the normalized LMS algorithm with Gaussian inputs,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–34, no. 4, pp. 793–806,
Aug. 1986.
[4] N.J. Bershad, “Behavior of the ε-normalized LMS algorithm with Gaussian inputs,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–35, no. 5, pp. 636–644,
May 1987.
[6] H.-J. Butterweck, ”Iterative analysis of the steady-state weight fluctuations in LMS-
type adaptive filters,” IEEE Transactions on Signal Processing, vol. 47, pp. 2558-2561,
Sept. 1999.
[7] H.-J. Butterweck, ”A wave theory of long adaptive filters,” IEEE Transactions on
Circuits and Systems I: Fundamental Theory and Applications, vol. 48, pp. 739-747,
June 2001.
[8] Theo A.C.M. Claasen, Wolfgang F.G. Mecklenbräuker, “Comparison of the conver-
gence of two algorithms for adaptive FIR digital filters,” IEEE Trans. Acoust., Speech,
Signal Processing, vol. ASSP–29, no. 3, pp. 670–678, June 1981.
[9] Peter M. Clarkson, Paul R. White, “Simplified analysis of the LMS adaptive filter using
a transfer function approximation,” IEEE Trans. Acoust., Speech, Signal Processing,
vol. ASSP–35, no. 7, pp. 987–993, July 1987.
[10] S.C. Douglas, T.H.–Y. Meng, “Exact expectation analysis of the LMS adaptive filter
without the independence assumption,” Proc. ICASSP, San Francisco, pp. IV61–IV64,
Apr. 1992.
[24] M. Hajivandi, W.A. Gardner, “Measures of tracking performance for the LMS algo-
rithm,” IEEE Trans. Acoustics, Speech and Signal Proc., vol. ASSP-38, no. 11, pp.
1953-1958, Nov. 1990.
[25] B. Hassibi, A.H. Sayed, and T. Kailath, “LMS and Backpropagation are minimax
filters,” in Neural Computation and Learning, ed. V. Roychowdhury, K. Y. Siu, and
A. Orlitsky, Ch. 12, pp. 425–447, Kluwer Academic Publishers, 1994.
[26] B. Hassibi, A.H. Sayed, and T. Kailath,“H∞ optimality of the LMS algorithm,” IEEE
Trans. Signal Processing, vol. 44, no. 2, pp. 267–280, Feb. 1996.
[28] Simon Haykin, Adaptive Filter Theory, 1st edition, Prentice Hall, 1986.
[29] Simon Haykin, Adaptive Filter Theory, 3rd edition, Prentice Hall, 1996.
[30] Simon Haykin, Adaptive Filter Theory, 4th edition, Prentice Hall, 2002.
[31] L.L. Horowitz and K.D. Senne, ”Performance advantage of complex LMS for control-
ling narrow-band adaptive arrays,” IEEE Transactions on Signal Processing, vol. 29 ,
no. 3, pp.722-736, June 1981.
[32] S. Hui, S.H. Zak, “The Widrow-Hoff algorithm for McCulloch-Pitts type neurons,”
IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 924-929, Nov. 1994.
[33] D.R. Hush, B.G. Horne, “Progress in supervised neural networks,” IEEE Signal Pro-
cessing Magazine, vol. 10, no. 1, pp. 8-39, Jan. 1993.
[35] R. E. Kalman, “Design of a self–optimizing control system,” Trans. ASME, vol. 80,
pp. 468–478, 1958.
[36] W. Kellermann, “Analysis and Design of Multirate Systems for Cancellation of Acous-
tical Echoes,” Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing, NY, 1988, vol. 5, pp. 2570-2573.
[39] R.P. Lippmann, “An introduction to computing with neural nets,” IEEE Acoustics,
Speech and Signal Processing Magazine, vol. 4, no. 2, pp.4-22, April 1987.
[40] Guozho Long, Fuyun Ling, John A. Proakis, “The LMS algorithm with delayed coeffi-
cient adaptation,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP–37, no.
9, pp. 1397–1405, Sep. 1989.
[41] Guozho Long, Fuyun Ling, John A. Proakis, “Corrections to ‘The LMS algorithm with
delayed coefficient adaptation’,” IEEE Trans. Signal Processing, vol. SP–40, no. 1, pp.
230–232, Jan. 1992.
[42] Vijay K. Madisetti: Editor, The DSP Handbook, CRC Press, 1997.
[44] S. Makino, Y. Kaneda and N. Koizumi, “Exponentially weighted step-size NLMS adap-
tive filter based on the statistics of a room impulse response,” IEEE Trans. on Speech
and Audio Processing, vol. 1, no. 1, pp. 101–108, Jan. 1993.
[45] S. Marcos, O. Macchi, “Tracking capability of the least mean square algorithm: appli-
cation to an asynchronous echo canceller,” IEEE Trans. Acoustics, Speech and Signal
Proc., vol. ASSP-35, no. 11, pp. 1570-1578, Nov. 1987.
[46] J. E. Mazo, “On the independence theory of equalizer convergence,” Bell Syst. Tech.
J., vol. 58, pp. 963–993, 1979.
[50] K.S. Narendra, K. Parthasarathy, “Gradient methods for the optimization of dynamical
systems containing neural networks,” IEEE Trans. on Neural Networks, vol. 2, pp. 252-
262, March 1991.
[51] R. Nitzberg, “Normalized LMS algorithm degradation due to estimation noise,” IEEE
Trans. Aerosp. Electron. Syst., vol. AES–22, no. 6., p. 740–749, Nov. 1986.
[52] K. Ozeki, T. Umeda, “An adaptive filtering algorithm using orthogonal projection to
an affine subspace and its properties,” Electronics and Communications in Japan, vol.
67–A, no. 5, pp. 19–27, 1984.
[53] M. Rupp, “The behavior of LMS and NLMS algorithms in the presence of spherically
invariant processes,” IEEE Trans. Signal Processing, vol. SP–41, no. 3, pp. 1149-1160,
March 1993.
[54] M. Rupp, R. Frenzel “The behavior of LMS and NLMS algorithms with delayed coef-
ficient update in the presence of spherically invariant processes,” IEEE Trans. Signal
Processing, vol. SP–42, no. 3, pp. 668-672, March 1994.
[55] M. Rupp, “Bursting in the LMS algorithm,” IEEE Transactions on Signal Processing,
vol. 43, no. 10, pp. 2414-2417, Oct. 1995.
[56] M. Rupp, A. H. Sayed, “On the stability and convergence of Feintuch’s algorithm for
adaptive IIR filtering,” Proc. IEEE Conf. ASSP, Detroit, MI, May 1995.
[59] M. Rupp, “Saving complexity of modified filtered-X-LMS and delayed update LMS
algorithm,” IEEE Transactions on Circuits & Systems, pp. 57-59, Jan. 1997.
[60] M.Rupp, A.H. Sayed, “Improved convergence performance for supervised learning of
perceptron and recurrent neural networks: a feedback analysis via the small gain
theorem,” IEEE Transaction on Neural Networks, vol. 8, no. 3, pp. 612-623, May
1997.
[61] M. Rupp, A.H. Sayed, “Robust FxLMS algorithm with improved convergence perfor-
mance,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 78-85,
Jan. 1998.
[62] M. Rupp, “A family of adaptive filter algorithm with decorrelating properties,” IEEE
Transactions on Signal Processing, vol. 46, no. 3, pp. 771-775, March 1998.
[63] M. Rupp, “On the learning behavior of decision feedback equalizers,” 33rd. Asilomar
Conference, Monterey, California, Oct. 1999.
[64] M. Rupp, A.H. Sayed, “On the convergence of blind adaptive equalizers for constant
modulus signals,” IEEE Transactions on Communications, vol. 48, no. 5, pp. 795-803,
May 2000.
[65] M. Rupp, J. Cezanne, “Robustness conditions of the LMS algorithm with time-variant
matrix step-size,” Signal Processing, vol. 80, no. 9, Sept. 2000.
[66] M. Rupp and H.-J. Butterweck, ”Overcoming the independence assumption in LMS
filtering,” in Proc. of Asilomar Conference, pp. 607-611, Nov. 2003.
[67] A.H. Sayed and T. Kailath, “A state-space approach to adaptive RLS filtering,” IEEE
Signal Processing Magazine, vol. 11, no. 3, pp. 18–60, July 1994.
[68] A.H. Sayed, M. Rupp, “Error energy bounds for adaptive gradient algorithms,” IEEE
Transactions on Signal Processing, vol. 44, no. 8, pp. 1982-1989, Aug. 1996.
[69] A.H. Sayed, M. Rupp, “An l2-stable feedback structure for nonlinear H∞-adaptive
filtering,” Automatica, vol. 33, no.1, pp. 13-30, Jan. 1997.
[70] V.H. Nascimento, A.H. Sayed, “Are ensemble averaging learning curves reliable in evaluating the performance of adaptive filters?” Proceedings 32nd Asilomar Conference on Circuits, Systems, and Computers, pp. 1171-1174, Nov. 1998.
[71] A.H. Sayed, ”Fundamentals of Adaptive Filtering,” Wiley 2003.
[72] Solo, V. and X. Kong, Adaptive Signal Processing Algorithms: Stability and Perfor-
mance, Prentice Hall, New Jersey, 1995.
[73] S.D. Stearns, G. R. Elliot, “On adaptive recursive filtering,” Proceedings 10th Asilomar
Conference on Circuits, Systems, and Computers, pp. 5–11, Nov. 1976.
[74] K. Steiglitz, L. E. McBride, “A technique for the identification of linear systems,”
IEEE Trans. Autom. Control, vol. AC–10, pp. 461–464, 1965.
[75] G. Ungerboeck, “Theory on the speed of convergence in adaptive equalizers for digital
communication,” IBM J. Res. Develop., vol. 16, no. 6, pp. 546–555, 1972.
[76] M. Vidyasagar, Nonlinear Systems Analysis, Prentice Hall, New Jersey, second edition,
1993.
[77] S. A. White, “An adaptive recursive digital filter,” Proc. 9th Asilomar Conf. Circuits
Syst. Comput., pp. 21–25, Nov. 1975.
[78] Bernard Widrow, M. E. Hoff Jr., “Adaptive switching circuits,” IRE WESCON conv.
Rec., Part 4, pp. 96–104, 1960.
[79] E. Walach and B. Widrow, “The least-mean-fourth (LMF) adaptive algorithm and its
family,” IEEE Trans. Inform. Theory, vol. IT-30, pp.275-283, March 1984.
[80] B. Widrow and S. D. Stearns, Adaptive Signal Processing, NY: Prentice-Hall, Inc.,
1985.
[81] W. Wirtinger, ”Zur formalen Theorie der Funktionen von mehr komplexen
Veränderlichen,” Mathematische Annalen, vol. 97, pp. 357-376, 1927.