
FPGA Implementation of an Optimized NLMS Algorithm
Cristian Stanciu, Cristian Anghel, Constantin Paleologu, Silviu Ciochină
Telecommunications Department
University Politehnica of Bucharest
Bucharest, Romania
Email: {cristian, canghel, pale, silviu}@comm.pub.ro

Jacob Benesty
INRS-EMT
University of Quebec
Montreal, Canada
Email: benesty@emt.inrs.ca

This paper is dedicated to the memory of Steven L. Grant for his exceptional contributions to the echo cancellation problem.

Abstract—The recently proposed joint-optimized normalized least-mean-square (JO-NLMS) adaptive algorithm employs a variable step-size parameter to achieve both high convergence and low misadjustment. In this paper, an efficient hardware implementation design is proposed, which exploits the key numerical properties of the JO-NLMS algorithm to reduce the hardware workload (corresponding to the adaptive filter's operations), in terms of binary numerical representations and multiplication blocks. Simulation results are provided in an acoustic echo cancellation scenario, using a ModelSIM-Matlab testing platform.

Index Terms—Adaptive filters; echo cancellation; FPGA implementation; normalized least-mean-square (NLMS) algorithm; variable step-size NLMS.

I. INTRODUCTION

In many applications of adaptive filtering, the family of least-mean-square (LMS) algorithms [including the normalized LMS (NLMS)] [1], [2] represents a reliable choice. The main reasons behind this popularity are the good numerical properties and the low computational complexity of the LMS-based algorithms.

The main parameter that controls the overall performance of LMS-based algorithms is the step-size. The choice of this parameter reflects a compromise between the convergence rate and tracking of the algorithm versus its misadjustment and robustness. In order to deal with this issue, a variable step-size could be used (instead of a constant one). Consequently, many interesting variable step-size (VSS) algorithms can be found in the literature, e.g., [3]–[11].

Most of these algorithms were developed in a system identification context, following an optimization criterion based on the minimization of the system misalignment. However, in most of the developments, the unknown system is considered time-invariant, which is not the case in many applications. For example, in echo cancellation [12], [13], the impulse response of the echo path is variable in time, requiring good tracking capabilities for the adaptive filters.

The recently proposed joint-optimized NLMS (JO-NLMS) algorithm [10], [11] was developed in the context of a state variable model, inheriting some features from Kalman filtering [14], [15]. This algorithm behaves like a VSS adaptive filter, achieving a good compromise between the performance criteria.

In this paper, we provide an FPGA implementation of the JO-NLMS algorithm, in the context of acoustic echo cancellation [12], [13] (which is one of the most challenging system identification problems). Simulation results indicate that the JO-NLMS algorithm could be a reliable choice in real-world applications.

II. THE JO-NLMS ALGORITHM

In this section, we briefly detail the development and the performance features of the JO-NLMS algorithm [10], [11]. In the context of a system identification problem, the desired signal at the discrete-time index n is obtained as

d(n) = x^T(n)h(n) + v(n),   (1)

where x(n) = [x(n)  x(n−1)  ⋯  x(n−L+1)]^T is a vector containing the L most recent samples of the input signal, superscript T is the transpose operator, h(n) = [h_0(n)  h_1(n)  ⋯  h_{L−1}(n)]^T is the impulse response (of length L) of the system that we need to identify, and v(n) is the system noise (usually considered as a zero-mean white Gaussian noise signal).

Let us consider that the unknown system h(n) follows a simplified first-order Markov model (which is reasonable to assume in the context of many applications), i.e.,

h(n) = h(n−1) + w(n),   (2)

where w(n) is a zero-mean white Gaussian noise signal vector [uncorrelated with h(n−1)]. The correlation matrix of w(n) is assumed to be R_w = σ_w^2 I_L, where I_L is the L × L identity matrix and the variance, σ_w^2, captures the uncertainties in h(n). In this framework, the main objective is to estimate h(n) with an adaptive filter, denoted by ĥ(n) = [ĥ_0(n)  ĥ_1(n)  ⋯  ĥ_{L−1}(n)]^T. Equations (1) and (2) define a state variable model, like in Kalman filtering [15].

Let us consider the update of the LMS-based algorithms [1], [2], which is given by

ĥ(n) = ĥ(n−1) + μx(n)e(n),   (3)



where μ denotes the step-size parameter and

e(n) = d(n) − x^T(n)ĥ(n−1)   (4)

represents the error signal of the adaptive filter. In this context, we can define the a posteriori misalignment as c(n) = h(n) − ĥ(n). Consequently, developing (3) in terms of the a posteriori misalignment, we obtain

c(n) = c(n−1) + w(n) − μx(n)e(n).   (5)

At this point, we focus on an optimization criterion based on the minimization of the system misalignment, which is a reliable choice in system identification problems. Therefore, taking the ℓ2 norm in (5), followed by mathematical expectation on both sides, it results in

E[‖c(n)‖_2^2] = E[‖c(n−1)‖_2^2] + Lσ_w^2 − 2μE[x^T(n)c(n−1)e(n)] − 2μE[x^T(n)w(n)e(n)] + μ^2 E[e^2(n)x^T(n)x(n)].   (6)

Following the development from [16] and introducing the notation m(n) = E[‖c(n)‖_2^2], (6) results in

m(n) = [1 − 2μσ_x^2 + (L+2)μ^2σ_x^4] m(n−1) + Lμ^2σ_x^2[σ_v^2 + (L+2)σ_x^2σ_w^2] − 2Lμσ_x^2σ_w^2 + Lσ_w^2,   (7)

where σ_x^2 = E[x^2(n)] and σ_v^2 = E[v^2(n)] are the variances of x(n) and v(n), respectively. Considering that the step-size μ is time dependent and evaluating ∂m(n)/∂μ(n) = 0, we obtain an optimal step-size:

μ(n) = 1 / {(L+2)σ_x^2 + Lσ_v^2 / [m(n−1) + Lσ_w^2]}.   (8)
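For clarity, the step between (7) and (8) can be reconstructed as follows (a short sketch, using the same notation). Differentiating (7) with respect to μ(n) and grouping the terms around the factor m(n−1) + Lσ_w^2 gives

∂m(n)/∂μ(n) = −2σ_x^2[m(n−1) + Lσ_w^2] + 2(L+2)μ(n)σ_x^4[m(n−1) + Lσ_w^2] + 2Lμ(n)σ_x^2σ_v^2 = 0,

which, after dividing by 2σ_x^2 and solving for μ(n), yields

μ(n) = [m(n−1) + Lσ_w^2] / {(L+2)σ_x^2[m(n−1) + Lσ_w^2] + Lσ_v^2},

i.e., exactly (8); in the practical form of the algorithm summarized later in Table I, this ratio is computed as p(n)/r(n).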
Finally, based on (8), the filter update is

ĥ(n) = ĥ(n−1) + μ(n)x(n)e(n)   (9)

and the update of the parameter m(n) becomes

m(n) = [1 − μ(n)σ_x^2][m(n−1) + Lσ_w^2].   (10)

Consequently, the resulting JO-NLMS algorithm is defined by the relations (4), (9), and (10). As an alternative, the JO-NLMS algorithm can also be obtained starting from the NLMS algorithm and following a joint-optimization problem on its important parameters, i.e., the normalized step-size and the regularization term [10], [11].

It should also be mentioned that, in practice, we need to estimate three main parameters of the JO-NLMS algorithm. The first one is the variance of the input signal, which could be evaluated as σ̂_x^2(n) = x^T(n)x(n)/L (similar to the NLMS algorithm). Second, there is the noise power, σ_v^2, which can be estimated in different ways, e.g., like in [17], [18]. For example, the method proposed in [17] assumes that the adaptive filter has converged to a certain degree, i.e., y(n) ≈ ŷ(n), where ŷ(n) denotes the output of the adaptive filter. Consequently,

σ̂_v^2(n) = σ̂_d^2(n) − σ̂_ŷ^2(n),   (11)

where σ̂_d^2(n) and σ̂_ŷ^2(n) represent the power estimates of d(n) and ŷ(n), respectively. The terms in (11) can be evaluated as

σ̂_d^2(n) = λσ̂_d^2(n−1) + (1−λ)d^2(n),   (12)

σ̂_ŷ^2(n) = λσ̂_ŷ^2(n−1) + (1−λ)ŷ^2(n),   (13)

where λ = 1 − 1/(KL), with K ≥ 1 [6].

The third parameter to be found is σ_w^2. Following the proposal from [15], we can take the ℓ2 norm on both sides of (2) and then replace h(n) by its estimate ĥ(n). Consequently,

σ̂_w^2(n) = (1/L)‖ĥ(n) − ĥ(n−1)‖_2^2.   (14)

Finally, the JO-NLMS algorithm is presented in Table I, in order to facilitate its implementation [16].

TABLE I
THE JO-NLMS ALGORITHM

Initialization:
  ĥ(0) = 0_{L×1}
  m(0) = ε > 0
  q(0) = 0
  σ̂_d^2(0) = σ̂_ŷ^2(0) = 0
  λ = 1 − 1/(KL), K ≥ 1

For time index n = 1, 2, ...:
  ŷ(n) = x^T(n)ĥ(n−1)
  e(n) = d(n) − ŷ(n)
  p(n) = m(n−1) + q(n−1)
  σ̂_x^2(n) = x^T(n)x(n)/L
  σ̂_d^2(n) = λσ̂_d^2(n−1) + (1−λ)d^2(n)
  σ̂_ŷ^2(n) = λσ̂_ŷ^2(n−1) + (1−λ)ŷ^2(n)
  σ̂_v^2(n) = σ̂_d^2(n) − σ̂_ŷ^2(n)
  r(n) = Lσ̂_v^2(n) + (L+2)σ̂_x^2(n)p(n)
  μ(n) = p(n)/r(n)
  u(n) = μ(n)x(n)e(n)
  ĥ(n) = ĥ(n−1) + u(n)
  m(n) = [1 − μ(n)σ̂_x^2(n)]p(n)
  q(n) = u^T(n)u(n)
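To make the steps of Table I concrete, the following floating-point C routine sketches one iteration of the JO-NLMS algorithm. It is only a reference model (e.g., a golden model against which a fixed-point design can be checked), not the VHDL implementation itself; the variable names mirror Table I, and the guards against negative or zero-valued estimates are practical additions that do not appear in Table I.

    #include <stddef.h>

    typedef struct {
        size_t  L;        /* filter length */
        double *h_hat;    /* adaptive coefficients, length L */
        double  m;        /* m(n-1), initialized with a small eps > 0 */
        double  q;        /* q(n-1) = u^T(n-1)u(n-1), initialized with 0 */
        double  sig_d2;   /* power estimate of d(n), initialized with 0 */
        double  sig_y2;   /* power estimate of y_hat(n), initialized with 0 */
        double  lambda;   /* forgetting factor, 1 - 1/(K*L), K >= 1 */
    } jo_nlms_t;

    /* x points to the L most recent input samples: x[0] = x(n), ..., x[L-1] = x(n-L+1) */
    static double jo_nlms_step(jo_nlms_t *s, const double *x, double d)
    {
        const size_t L = s->L;
        double y_hat = 0.0, xx = 0.0;
        for (size_t i = 0; i < L; i++) {              /* y_hat(n) and x^T(n)x(n) */
            y_hat += x[i] * s->h_hat[i];
            xx    += x[i] * x[i];
        }
        double e      = d - y_hat;                    /* error signal e(n) */
        double p      = s->m + s->q;                  /* p(n) = m(n-1) + q(n-1) */
        double sig_x2 = xx / (double)L;               /* input power estimate */
        s->sig_d2 = s->lambda * s->sig_d2 + (1.0 - s->lambda) * d * d;         /* (12) */
        s->sig_y2 = s->lambda * s->sig_y2 + (1.0 - s->lambda) * y_hat * y_hat; /* (13) */
        double sig_v2 = s->sig_d2 - s->sig_y2;        /* noise power estimate, (11) */
        if (sig_v2 < 0.0) sig_v2 = 0.0;               /* practical guard, not in Table I */
        double r  = L * sig_v2 + (L + 2.0) * sig_x2 * p;
        double mu = (r > 0.0) ? p / r : 0.0;          /* variable step-size, mu(n) = p(n)/r(n) */
        double q_new = 0.0;
        for (size_t i = 0; i < L; i++) {              /* u(n) = mu(n)x(n)e(n); coefficient update */
            double u = mu * x[i] * e;
            s->h_hat[i] += u;
            q_new += u * u;                           /* q(n) = u^T(n)u(n) */
        }
        s->q = q_new;
        s->m = (1.0 - mu * sig_x2) * p;               /* m(n), cf. (10) */
        return e;
    }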
the adaptive filter has converged to a certain degree, i.e.,
The implementation block scheme is represented in Fig. 1.
𝑦(𝑛) ≈ 𝑦ˆ(𝑛), where 𝑦ˆ(𝑛) denotes the output of the adaptive
We consider 16 bits input/output signals [i.e., 𝑥(𝑛), 𝑑(𝑛), and
filter. Consequently,
𝑒(𝑛)] with a sampling rate of 8 kHz. The same precision is
ˆ𝑣2 (𝑛) = 𝜎
𝜎 ˆ𝑑2 (𝑛) − 𝜎
ˆ𝑦ˆ2 (𝑛), (11) used for the representation of the adaptive filter coefficients.
[Figure: block scheme showing the mem_x and mem_h memories, the multipliers M1 and M2, the accumulator, and the data paths producing ŷ(n), e(n), Lσ̂_x^2(n), p(n), r(n), μ(n), and u(n).]
Fig. 1. Block scheme of the JO-NLMS implementation.
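As a small illustration of the 16 bits two's complement format mentioned above, the sketch below converts a normalized sample to and from a 16-bit word. The Q1.15 binary-point placement is an assumption made here for illustration purposes; the design is only specified as using 16 bits two's complement words.

    #include <stdint.h>

    /* Hypothetical Q1.15 quantizer for the 16-bit signals x(n), d(n), e(n) and
       the filter coefficients; saturation and truncation behavior are illustrative. */
    static int16_t to_q15(double v)
    {
        if (v >=  32767.0 / 32768.0) return INT16_MAX;   /* saturate at +(1 - 2^-15) */
        if (v <  -1.0)               return INT16_MIN;   /* saturate at -1 */
        return (int16_t)(v * 32768.0);                   /* scale by 2^15, truncate toward zero */
    }

    static double from_q15(int16_t q)
    {
        return (double)q / 32768.0;
    }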

A system clock frequency of 100 MHz is targeted, associated with a number of 12500 clock cycles between consecutive input samples x(n). Two 18 kb BRAMs (block random access memories) are used to store the values corresponding to the vectors x(n) and ĥ(n).

The arithmetic workload corresponding to the multiplications is distributed between two distinct blocks (denoted as M1 and M2 in Fig. 1), which are implemented with digital signal processing (DSP) modules available on the FPGA. The multiplication units work with 16 bits × 16 bits and 32 bits × 32 bits operands, respectively. Considering that the DSP physical blocks are limited to 18 bits × 25 bits multiplications, the modules require a total number of 5 DSPs (one unit for M1 and the rest for M2).
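To suggest why the wide multiplier consumes several DSP slices, the sketch below assembles a 32 bits × 32 bits product from four partial products on 16-bit halves. This only illustrates the decomposition principle; the actual DSP48E primitives operate on 18 bits × 25 bits operands, so the mapping chosen by the synthesis tools is different.

    #include <stdint.h>

    /* Illustrative decomposition of a 32x32-bit signed multiplication into four
       narrower partial products (arithmetic right shift assumed for a >> 16). */
    static int64_t mul32x32(int32_t a, int32_t b)
    {
        int32_t  ah = a >> 16;                  /* signed high half */
        uint32_t al = (uint32_t)a & 0xFFFFu;    /* unsigned low half */
        int32_t  bh = b >> 16;
        uint32_t bl = (uint32_t)b & 0xFFFFu;

        int64_t hh = (int64_t)ah * bh;          /* the four partial products */
        int64_t hl = (int64_t)ah * (int64_t)bl;
        int64_t lh = (int64_t)bh * (int64_t)al;
        int64_t ll = (int64_t)al * (int64_t)bl;

        const int64_t B16 = (int64_t)1 << 16;
        return hh * B16 * B16 + (hl + lh) * B16 + ll;   /* recombine the partial products */
    }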
The module M1 is employed for most of the multiplications performed in the JO-NLMS algorithm. At its output, it is connectable to a resettable accumulator based on configurable logic, for the computations of the echo estimate ŷ(n) and the term Lσ̂_x^2(n). The error signal e(n) is obtained by truncating ŷ(n) to a 16 bits representation (from the full precision of 32 bits generated by M1) and subtracting it from the desired signal d(n). The value of Lσ̂_x^2(n), however, is computed by accumulating the output of M1 with 25 bits for the fractional part and 6 bits for the integer part (an overall representation of 32 bits).
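A behavioral C sketch of this M1/accumulator path is given below, assuming Q1.15 two's complement samples and coefficients; the shift amounts (and therefore the exact binary-point alignments) are illustrative rather than taken from the VHDL design.

    #include <stdint.h>

    #define FILTER_LEN 512

    /* y_hat(n) = x^T(n) h_hat(n-1): Q1.15 x Q1.15 products accumulated in a wide
       register, then truncated to 16 bits before forming e(n) = d(n) - y_hat(n). */
    static int16_t mac_y_hat(const int16_t *x, const int16_t *h_hat)
    {
        int64_t acc = 0;                          /* resettable accumulator */
        for (int i = 0; i < FILTER_LEN; i++)
            acc += (int32_t)x[i] * h_hat[i];      /* one M1 multiplication per tap */
        return (int16_t)(acc >> 15);              /* back to Q1.15; saturation omitted */
    }

    /* L*sigma_x^2(n) = x^T(n)x(n), kept on 32 bits with 6 integer bits and
       25 fractional bits (assumes x^T(n)x(n) < 64 so the result fits). */
    static int32_t mac_l_sigma_x2(const int16_t *x)
    {
        int64_t acc = 0;
        for (int i = 0; i < FILTER_LEN; i++)
            acc += (int32_t)x[i] * x[i];          /* Q2.30 partial products */
        return (int32_t)(acc >> 5);               /* re-align to 6 integer / 25 fractional bits */
    }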
As the analysis performed in [16] revealed that the variables q(n), p(n), m(n), and r(n) (see Table I) require high precision in order to obtain a good performance of the JO-NLMS algorithm, we implemented the corresponding registers on 32 bits. Consequently, we use the approximation L + 2 ≈ L (which is valid for L ≫ 1) and perform the multiplication between Lσ̂_x^2(n) and p(n) using M2. The result is truncated to 32 bits and added to the value Lσ̂_v^2(n) in order to compute r(n). Since we need the product Lσ̂_v^2(n), the estimates from (12) and (13) can be expressed (using a general notation) as

Lσ̂_α^2(n) = Lσ̂_α^2(n−1) − σ̂_α^2(n−1)/K + α^2(n)/K.   (15)

Choosing the value of K as a power of 2, we replace the divisions required for the computation of Lσ̂_v^2(n) with bit-shifts to the right. The remaining multiplications [i.e., those corresponding to α^2(n)] are performed using M1. The final estimate has a 32 bits format with 6 bits for the integer part, in order to be aligned with the binary structure of Lσ̂_x^2(n)p(n) for the addition performed to obtain the parameter r(n).
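The shift-based form of (15) can be sketched as follows, assuming K = 8 and L = 512 (both powers of two; the value of K is an assumption, as it is not stated here), and assuming that α^2(n) has already been aligned to the same fixed-point format as Lσ̂_α^2(n); arithmetic right shifts are assumed.

    #include <stdint.h>

    #define LOG2_K 3    /* K = 8 (assumed value) */
    #define LOG2_L 9    /* L = 512 */

    /* One update of L*sigma_alpha^2(n), alpha in {d, y_hat}, following (15):
       both divisions become right shifts when K and L are powers of 2.
       a2 is alpha^2(n) (an M1 product) in the same format as l_sig2_prev. */
    static int32_t update_l_sigma2(int32_t l_sig2_prev, int32_t a2)
    {
        int32_t sig2_prev = l_sig2_prev >> LOG2_L;           /* sigma_alpha^2(n-1) */
        return l_sig2_prev - (sig2_prev >> LOG2_K) + (a2 >> LOG2_K);
    }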
Next, the step-size computation generates a 16 bits result, where 7 bits are associated with the integer part. The result is used as input for M1, which computes the product with the error signal. The M2 multiplier is further employed for the L multiplications of the M1 output with the input samples x(n). The results [i.e., the values of u(n)] are truncated to 16 bits and used for two operations. First, the filter coefficients are updated through L reading/writing operations associated with the mem_h memory block (see Fig. 1). Second, the values of u(n) are used as operands for M1, in order to compute the associated squared values, which are accumulated to generate the parameter q(n). The procedures are synchronized in such a manner that the vector u(n) does not require BRAM resources.

The module M2 is finally used for the multiplication of μ(n) with Lσ̂_x^2(n)p(n). The result is shifted and aligned with the format of p(n), and the corresponding subtraction is performed in order to generate m(n).
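A behavioral sketch of the u(n) path described above is shown below. The word lengths follow the description, but the binary-point alignment (the U_SHIFT value) is an assumption, and saturation handling is omitted.

    #include <stdint.h>

    #define FILTER_LEN 512
    #define U_SHIFT    23   /* depends on the binary points of mu(n)e(n) and x(n); illustrative */

    /* u(n) = [mu(n)e(n)] * x(n), truncated to 16 bits, then used both for the
       coefficient update (read-modify-write on mem_h) and for accumulating
       q(n) = u^T(n)u(n). Returns q(n). */
    static int64_t update_coefficients(int16_t *h_hat, const int16_t *x, int32_t mu_e)
    {
        int64_t q_acc = 0;
        for (int i = 0; i < FILTER_LEN; i++) {
            int64_t prod = (int64_t)mu_e * x[i];        /* wide (M2-style) multiplication */
            int16_t u    = (int16_t)(prod >> U_SHIFT);  /* truncate u(n) to 16 bits */
            h_hat[i]     = (int16_t)(h_hat[i] + u);     /* coefficient update; saturation omitted */
            q_acc       += (int32_t)u * u;              /* square u(n) and accumulate q(n) */
        }
        return q_acc;
    }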
IV. SIMULATION RESULTS

The fixed-point design of the JO-NLMS algorithm is simulated using a ModelSIM-Matlab test platform in an acoustic echo cancellation (AEC) scenario. The measured acoustic impulse response has 512 coefficients and the same length is used for the adaptive filter (i.e., L = 512). The input (i.e., far-end) signal is a speech sequence and the output of the echo path is corrupted by an independent white Gaussian noise; the signal-to-noise ratio (SNR) is 25 dB. The performance of the fixed-point implementation of the JO-NLMS algorithm is compared to the full Matlab precision. Also, the NLMS algorithm using the largest step-size (i.e., the fastest convergence mode) 1/[δ + x^T(n)x(n)] (with δ = 20σ_x^2) is used for comparison. The performance measure is the normalized misalignment (in dB), which is evaluated as 20 log10[‖h(n) − ĥ(n)‖_2 / ‖h(n)‖_2].
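For reference, this measure can be computed as in the short sketch below (a direct implementation of the formula above).

    #include <math.h>
    #include <stddef.h>

    /* Normalized misalignment in dB: 20*log10(||h - h_hat||_2 / ||h||_2),
       computed as 10*log10 of the ratio of the squared norms. */
    static double misalignment_db(const double *h, const double *h_hat, size_t len)
    {
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < len; i++) {
            double diff = h[i] - h_hat[i];
            num += diff * diff;
            den += h[i] * h[i];
        }
        return 10.0 * log10(num / den);
    }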
[Figure: misalignment curves for JO-NLMS (Matlab), JO-NLMS (VHDL), and NLMS (Matlab) over 60 seconds, in panels (a) and (b); vertical axis: misalignment (dB), horizontal axis: time (seconds).]
Fig. 2. Normalized misalignment of the NLMS algorithm using the step-size 1/[δ + x^T(n)x(n)] and of the JO-NLMS algorithm (VHDL precision and Matlab precision) in (a) a tracking scenario and (b) a double-talk scenario. The input signal is speech, L = 512, and SNR = 25 dB.

In the first experiment, a single-talk case is considered. Also, in order to evaluate the tracking capability of the algorithms, an echo path change is simulated in the middle of the experiment, by shifting the impulse response to the right by 25 samples. The results are presented in Fig. 2(a). First, it can be noticed that the JO-NLMS algorithm clearly outperforms the NLMS algorithm in terms of misalignment. Second, in the case of the JO-NLMS algorithm, the difference between the VHDL implementation and the full Matlab precision is minor.

In the second experiment, a double-talk scenario is considered. The near-end speech appears between 25 and 30 seconds. The algorithms do not use any double-talk detector (DTD), which is usually required in such scenarios [12], [13]. According to the results presented in Fig. 2(b), the JO-NLMS algorithm is much more robust in this case, due to the practical evaluation of the near-end signal power from (11).

Finally, it should be mentioned that the synthesis process was performed using the Xilinx ISE Design Suite for the target Virtex 5 device. The results showed that the system can work with a maximum clock frequency of 102.039 MHz, while requiring 4835 Slice Registers and 2208 Look-Up-Tables (i.e., 10.79% and 4.9%, respectively, of the corresponding available resources on the targeted FPGA model). In addition, the two multiplication units require only 5 of the 128 available DSP48E slices and two 18 kb BRAMs.

V. CONCLUSIONS

This paper proposed a hardware implementation of the JO-NLMS algorithm. The design uses a moderate computational cost to achieve good performance, in terms of both fast convergence rate and tracking, but also low misadjustment and robustness. Simulation results obtained in an AEC scenario recommend the JO-NLMS algorithm as a reliable option for practical implementations and real-world applications.

ACKNOWLEDGEMENTS

This work was supported by the UEFISCDI under Grants PN-II-RU-TE-2014-4-1880 and PN-II-ID-PCE-2011-3-0097.

REFERENCES

[1] S. Haykin, Adaptive Filter Theory. Fourth Edition, Upper Saddle River, NJ: Prentice-Hall, 2002.
[2] A. H. Sayed, Adaptive Filters. New York, NY: Wiley, 2008.
[3] E. Hänsler and G. Schmidt, "Control of LMS-type adaptive filters," in S. Haykin and B. Widrow (eds.), Least-Mean-Square Adaptive Filters, pp. 175–240. New York, NY: Wiley, 2003.
[4] A. I. Sulyman and A. Zerguine, "Convergence and steady-state analysis of a variable step-size NLMS algorithm," Signal Processing, vol. 83, pp. 1255–1273, June 2003.
[5] H.-C. Shin, A. H. Sayed, and W.-J. Song, "Variable step-size NLMS and affine projection algorithms," IEEE Signal Processing Lett., vol. 11, pp. 132–135, Feb. 2004.
[6] J. Benesty, H. Rey, L. Rey Vega, and S. Tressens, "A nonparametric VSS-NLMS algorithm," IEEE Signal Processing Lett., vol. 13, pp. 581–584, Oct. 2006.
[7] P. Park, M. Chang, and N. Kong, "Scheduled-stepsize NLMS algorithm," IEEE Signal Processing Lett., vol. 16, pp. 1055–1058, Dec. 2009.
[8] H.-C. Huang and J. Lee, "A new variable step-size NLMS algorithm and its performance analysis," IEEE Trans. Signal Processing, vol. 60, pp. 2055–2060, Apr. 2012.
[9] I. Song and P. Park, "A normalized least-mean-square algorithm based on variable-step-size recursion with innovative input data," IEEE Signal Processing Lett., vol. 19, pp. 817–820, Dec. 2012.
[10] C. Paleologu, S. Ciochină, J. Benesty, and S. L. Grant, "An overview on optimized NLMS algorithms for acoustic echo cancellation," EURASIP J. Advances in Signal Processing, vol. 2015:97, pp. 1–19, Dec. 2015.
[11] S. Ciochină, C. Paleologu, and J. Benesty, "An optimized NLMS algorithm for system identification," Signal Processing, vol. 118, pp. 115–121, Jan. 2016.
[12] J. Benesty, T. Gaensler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. Berlin, Germany: Springer-Verlag, 2001.
[13] C. Paleologu, J. Benesty, and S. Ciochină, Sparse Adaptive Filters for Echo Cancellation. Morgan & Claypool Publishers, Synthesis Lectures on Speech and Audio Processing, 2010.
[14] P. A. C. Lopes and J. B. Gerald, "New normalized LMS algorithms based on the Kalman filter," in Proc. IEEE ISCAS, 2007, pp. 117–120.
[15] C. Paleologu, J. Benesty, and S. Ciochină, "Study of the general Kalman filter for echo cancellation," IEEE Trans. Audio, Speech, Language Process., vol. 21, pp. 1539–1549, Aug. 2013.
[16] C. Stanciu, C. Anghel, C. Paleologu, S. Ciochină, and J. Benesty, "On the numerical properties of an optimized NLMS algorithm," in Proc. IEEE COMM, 2016, 4 p.
[17] C. Paleologu, S. Ciochină, and J. Benesty, "Variable step-size NLMS algorithm for under-modeling acoustic echo cancellation," IEEE Signal Processing Lett., vol. 15, pp. 5–8, 2008.
[18] M. A. Iqbal and S. L. Grant, "Novel variable step size NLMS algorithms for echo cancellation," in Proc. IEEE ICASSP, 2008, pp. 241–244.
[19] Xilinx ML507 evaluation platform user guide, www.xilinx.com.
