
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 1, JANUARY 2018

A Delay Relaxed RLS-DCD Algorithm for Real-Time Implementation

Geonu Kim, Student Member, IEEE, Hyuk Lee, Student Member, IEEE, Jinjoo Chung,
and Jungwoo Lee, Senior Member, IEEE

Abstract—The recursive least squares (RLS) algorithm using dichotomous coordinate descent (DCD) iterations, namely RLS-DCD, is regarded as well suited for hardware implementation because of its small computational complexity compared to the classical RLS algorithm. While this is true, another important aspect, one that ultimately determines its applicability to real-time applications with high sample rates, is its iteration bound. In this brief, we discuss this issue and propose a modified RLS-DCD algorithm based on delay relaxation, whose iteration bound can be reduced arbitrarily. The degradation in convergence speed is shown to be tolerable, and the resulting convergence is still much faster than that of the normalized least mean square algorithm.

Index Terms—Adaptive filter, RLS, DCD, delay relaxation, iteration bound, FPGA.

Manuscript received October 27, 2016; revised February 24, 2017; accepted April 20, 2017. Date of publication May 19, 2017; date of current version December 22, 2017. This work was supported by the Basic Science Research Program through NRF funded by MSIP under Grant NRF-2015R1A2A1A15052493, the Technology Innovation Program funded by MOTIE under Grant 10051928, the Bio-Mimetic Robot Research Center funded by DAPA under Grant UD130070ID, INMC, and BK21-plus. This brief was recommended by Associate Editor M. Cao. (Corresponding author: Jungwoo Lee.)

The authors are with the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: bdkim@wspl.snu.ac.kr; hyuklee@wspl.snu.ac.kr; jjpearll@wspl.snu.ac.kr; junglee@snu.ac.kr).

Digital Object Identifier 10.1109/TCSII.2017.2706367

1549-7747 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE DICHOTOMOUS coordinate descent (DCD) algorithm has been introduced as an efficient tool for solving least squares (LS) problems [1]. The LS problem is usually formulated in the matrix form of the normal equation

Rh = β,

where R is an N × N symmetric positive definite matrix, and both h and β are N × 1 vectors. The objective of the LS problem is to find a solution vector h for given R and β. Different from the canonical method using matrix inversion, the DCD algorithm belongs to the family of iterative methods, which are preferable due to reduced algorithm complexity and better numerical stability [2]. The DCD algorithm requires neither multiplications nor divisions, and uses only addition and bit-shift operations, which makes it well suited for hardware implementation.

It has also been shown that the DCD algorithm is applicable to the recursive LS (RLS) algorithm [3]. While the classical RLS algorithm is well known to converge much faster than the least mean square (LMS) algorithm [4], its computational complexity is far larger, which limits its use in practice. While the reductions in computational complexity achieved by numerous variants of the classical RLS algorithm are only marginal [3], the RLS-DCD algorithm shows a remarkable improvement. In particular, it has been shown that the computational complexity can be reduced to 3N multiplications per sample, where N is the filter length, which is only slightly larger than that of the LMS algorithm—2N multiplications.

Although hardware implementations of the RLS-DCD algorithm have been investigated and shown to be of low complexity [2], [3], little has been said about its fundamental throughput limit, which can be measured in terms of the iteration bound [5]. In real-time applications with high sample rates, it is often the iteration bound, not the computational complexity, that ultimately determines the feasibility of an algorithm. Accordingly, various methods [6]–[8] have been proposed to improve the iteration bound of the QR-decomposition RLS (QRD-RLS) algorithm—an algorithm that is equivalent to the classical RLS algorithm, but with improved numerical stability [4].

In this brief, we first identify the iteration bound of the RLS-DCD algorithm and argue that it is excessive, making the algorithm inappropriate for real-time implementation with a high sample rate. We then propose a modified RLS-DCD algorithm based on delay relaxation, whose iteration bound can be made arbitrarily small by simply increasing its delay parameter. Simulation results reveal that the convergence speed degrades only moderately, and is still much faster than that of the popular normalized LMS (NLMS) algorithm—a variant of the LMS algorithm which mitigates the gradient noise amplification problem and is also generally faster [4].

The rest of this brief is organized as follows. Section II briefly reviews the original RLS-DCD algorithm. In Section III, our delay relaxed RLS-DCD algorithm is presented with emphasis on the iteration bound, followed by experimental results in Section IV. A discussion comparing our algorithm with other high throughput algorithms is given in Section V, and concluding remarks in Section VI.

II. RLS-DCD ALGORITHM

The RLS algorithm solves the sequence of normal equations

R(n)h(n) = β(n), (1)


n = 0, 1, . . . , in a recursive manner. R is the time-average correlation matrix of the tap-input reference signal x, and β is the time-average cross-correlation vector between the tap-input reference signal x and the desired response d. By imposing a forgetting factor λ, they are generally computed recursively as

R(n) = λR(n − 1) + x(n)x^T(n), (2)

β(n) = λβ(n − 1) + d(n)x(n). (3)

While the classical RLS algorithm makes use of the matrix inversion lemma in order to construct a recursion formula for h(n) and avoid computing the inverse of the matrix R(n), the RLS-DCD algorithm first obtains a recursion by reformulating the RLS problem in terms of a sequence of auxiliary normal equations with respect to increments of the filter weights, and then avoids the costly matrix inversion by approximately solving the auxiliary normal equations via the DCD algorithm.

TABLE I
REFORMULATED RLS ALGORITHM
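The exponentially weighted recursions (2) and (3) can be sketched in a few lines. The following is a minimal illustration; the function and variable names are ours, not from the paper:

```python
import numpy as np

def update_correlations(R, beta, x, d, lam):
    """One step of the time-average recursions (2) and (3).

    R    : N x N correlation matrix R(n-1)
    beta : length-N cross-correlation vector beta(n-1)
    x    : current tap-input vector x(n)
    d    : current desired response d(n)
    lam  : forgetting factor lambda in (0, 1]
    """
    R_new = lam * R + np.outer(x, x)   # (2): R(n) = lam*R(n-1) + x(n)x^T(n)
    beta_new = lam * beta + d * x      # (3): beta(n) = lam*beta(n-1) + d(n)x(n)
    return R_new, beta_new
```

With λ < 1, older samples are discounted geometrically; λ = 1 reduces the recursions to growing-window time averages.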

A. RLS Reformulation

Assume that, at the previous step n − 1, (1) is approximately solved such that ĥ(n − 1) is the solution with residual vector

r(n − 1) = β(n − 1) − R(n − 1)ĥ(n − 1). (4)

As a recursion,

Δh(n) = h(n) − ĥ(n − 1) (5)

is the filter weight increment that will be (approximately) computed at the current step n. From (1) and (5), the auxiliary normal equation to be solved is

R(n)Δh(n) = β₀(n), (6)

where

β₀(n) = β(n) − R(n)ĥ(n − 1). (7)

By defining

ΔR(n) = R(n) − R(n − 1),
Δβ(n) = β(n) − β(n − 1),

and using them together with (4) in (7), β₀(n) can be expressed as

β₀(n) = r(n − 1) + Δβ(n) − ΔR(n)ĥ(n − 1).

Then, an approximate auxiliary solution Δĥ(n) for (6) will be computed, and an approximate solution for (1) is given by

ĥ(n) = ĥ(n − 1) + Δĥ(n), (8)

together with an auxiliary residual vector

r₀(n) = β₀(n) − R(n)Δĥ(n).

Observe that

r(n) = β(n) − R(n)ĥ(n)
     = β₀(n) − R(n)Δĥ(n)
     = r₀(n),

where the second equality is due to (7) and (8). Therefore, r(n) for the next recursion does not require additional computation. Further incorporating (2) and (3) with some derivation, the algorithm can be summarized as in Table I.

TABLE II
DCD ALGORITHM

B. DCD Algorithm

The DCD algorithm can be employed in Step 5 of Table I. It is an approximation of the coordinate descent algorithm—a line search algorithm that solves normal equations iteratively with gradients lying on a single coordinate direction. The coordinate descent property is favorable from an implementation point of view, since the computational resources can be shared among all the coordinates in each iteration. The DCD algorithm further greatly reduces computational complexity by eliminating all multiplication and division operations. This is achieved by cleverly choosing the filter weight step size to be a power of two.

The DCD algorithm with a leading element [3] is shown in Table II. It is a slight variation of the original DCD algorithm with a modification in the coordinate selection process, i.e., Step 1. In particular, instead of cycling through all the coordinates, the algorithm selects the coordinate with the largest absolute value in the current residual vector, generally resulting in faster convergence. Steps 2, 3, and 4 determine the step size, which is constrained to be a power of two as mentioned above. In Steps 5 and 6, the solution and the residual vector are updated, respectively, where the former is updated only in the selected coordinate. More details on the DCD algorithm can be found in [1] and [3].
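As a concrete illustration of the iteration just described, here is a software sketch of the leading-element DCD update. It mirrors the coordinate selection, power-of-two step-size control, and solution/residual updates, but the exact bookkeeping of [1], [3] may differ, and a hardware design would replace the multiplications below with bit-shifts:

```python
import numpy as np

def dcd_leading(R, beta0, H=1.0, Mb=16, Nu=8):
    """Approximately solve R*dh = beta0 by leading-element DCD (sketch).

    The step size alpha is always a power of two times H, so every update
    below is a shift-and-add in fixed-point hardware; the floating-point
    multiplications here only emulate that behavior.
    """
    dh = np.zeros(len(beta0))
    r = beta0.copy()                  # residual starts at the right-hand side
    alpha, m = H / 2.0, 1             # current power-of-two step size
    for _ in range(Nu):               # at most Nu successful updates
        p = int(np.argmax(np.abs(r)))                # Step 1: leading element
        while m <= Mb and abs(r[p]) <= (alpha / 2.0) * R[p, p]:
            alpha /= 2.0                             # Steps 2-4: halve step
            m += 1
        if m > Mb:                                   # step size exhausted
            break
        s = 1.0 if r[p] > 0 else -1.0
        dh[p] += s * alpha                           # Step 5: solution update
        r = r - s * alpha * R[:, p]                  # Step 6: residual update
    return dh, r
```

Each successful iteration touches only one coordinate of the solution, which is what allows the hardware resources to be shared across all coordinates.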


Fig. 1. Signal flow graph of the RLS-DCD algorithm.

Fig. 2. Signal flow graph for a single iteration of the DCD algorithm.

III. DELAY RELAXED RLS-DCD ALGORITHM

Even though the RLS-DCD algorithm shows remarkably reduced computational complexity, it still may not be suitable for real-time implementation because of its rather large iteration bound. In this section we analyze this issue and propose a modified RLS-DCD algorithm based on a delay relaxation technique, which allows an almost arbitrary reduction in the iteration bound with only moderate performance degradation.

A. Iteration Bound of RLS-DCD

An abstract signal flow graph representation of the RLS-DCD algorithm is shown in Fig. 1. The node numbering matches the step numbering of Table I, and the computation of each algorithm step is encapsulated in its corresponding node. The input signals x and d are omitted since they are irrelevant to the iteration bound analysis. The node RLS-5 represents the DCD algorithm, whose detailed signal flow graph for a single iteration is shown in Fig. 2. Here the node numbering again matches Table II. Note that Steps 2, 3, and 4 of Table II are merged into a single node, namely node DCD-2*. The superscripts on the signal variables denote the iteration index.

From Fig. 1, the critical loop is easily identified as the one going through the nodes RLS-2, 3, 4, 5, and 6. It is worthwhile to note that the signal flow graph of the RLS-DCD algorithm is structurally similar to that of the LMS algorithm, whose signal flow graph can be obtained by replacing the dotted box of Fig. 1 with a node computing Δĥ(n) = μ·e(n)x(n) (μ is the step size) [4]. Therefore, the iteration bound of the RLS-DCD algorithm is expected to be larger than that of the LMS algorithm by an amount approximately equal to the computational delay of node RLS-5, the DCD step.

For a rough analysis of the computational delay of the DCD step, first observe that its signal flow graph consists of Nu cascaded copies of the blocks in Fig. 2. This is inevitable due to the algorithm's iterative nature. For a single iteration, the critical path turns out to be that through the nodes DCD-1, 2*, and 6, which includes a large number of comparators, adders, multiplexers, and nontrivial logic blocks.

Although our analysis above is only qualitative, we argue that the iteration bound of the RLS-DCD algorithm will cause real-time implementation issues in various situations based on the following observations.

1) Even the iteration bound of the LMS algorithm is not small enough for many applications, so that the delayed LMS (DLMS) algorithm [5], [9] is widely used in practice. The RLS-DCD algorithm is even more inadequate for such applications because of the additional delay of the DCD step.

2) The computational delay of the DCD step can be very large compared to the other parts in the critical loop. Even though it includes no multiplication, which is a dominant factor in the computational delay of the other parts, addition and similar operations are not much faster than multiplication unless implemented in complex fast architectures [10]. Note that the DCD step consists of a very long chain of such operations.

3) For field-programmable gate-array (FPGA) implementation, multiplications are usually mapped to fast embedded multipliers, hence the relative gap between the iteration bounds of the RLS-DCD and the LMS algorithms becomes far larger.

Fig. 3. Signal flow graph of the delay relaxed RLS-DCD algorithm.

B. Proposed Delay Relaxation

Fig. 3 shows the signal flow graph of the proposed RLS-DCD algorithm based on delay relaxation. The corresponding algorithm is given in Table III. The outputs of the node RLS-5 are delayed by LD samples, where LD is an arbitrary parameter.

Authorized licensed use limited to: Kongu Engineering College. Downloaded on January 24,2022 at 08:54:23 UTC from IEEE Xplore. Restrictions apply.
64 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 1, JANUARY 2018

TABLE III
DELAY RELAXED RLS-DCD ALGORITHM∗

Due to the newly added delay elements, the iteration bound is reduced LD-fold compared to the original RLS-DCD algorithm, and a sample rate close to this bound can be achieved by properly retiming the delay elements. Note that retiming the LD delay elements is not limited to the inside of the node RLS-5; they may be moved to any point along the upstream feedback paths.

The motivation and intuition of our proposed algorithm originate from those of the DLMS algorithm [5], [9]. While we do not have an analytical proof of the convergence of our proposed algorithm, we have strong evidence from numerous simulations under various parameter settings.

Fig. 4. MSE learning curve for uncorrelated reference inputs; N = 16, σ_w^2 = 0.2, λ = 1, δ = 0.004, H = 2^−3, Mb = 8, μ = 0.03.
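Table III itself is not reproduced in this text, so the following is only our sketch of the delayed feedback idea: the increment/residual pair produced by the (approximate) solver at time n is fed back and applied LD samples later, while the correlation recursions keep running at the sample rate. The solver interface, state layout, and all names are assumptions for illustration:

```python
from collections import deque

import numpy as np

def delay_relaxed_loop(samples, N, LD, solver, lam=1.0):
    """Run the adaptation with the solver outputs delayed by LD samples.

    samples : iterable of (x_new, d) pairs
    solver  : any approximate solver for R*dh = beta0 returning (dh, r),
              e.g., a DCD routine; here it is a free parameter
    """
    R = 1e-2 * np.eye(N)                 # delta*I initialization (assumed)
    beta = np.zeros(N)
    h = np.zeros(N)
    xbuf = np.zeros(N)
    pipe = deque([(np.zeros(N), np.zeros(N))] * LD)  # the LD delay elements
    for x_new, d in samples:
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x_new                               # tap-input vector x(n)
        dR = (lam - 1.0) * R + np.outer(xbuf, xbuf)   # Delta R(n)
        db = (lam - 1.0) * beta + d * xbuf            # Delta beta(n)
        R, beta = R + dR, beta + db
        dh_late, r_late = pipe.popleft()              # LD-sample-old outputs
        h = h + dh_late                               # delayed update, cf. (8)
        beta0 = r_late + db - dR @ h                  # auxiliary RHS, cf. (7)
        pipe.append(solver(R, beta0))                 # new output enters pipe
    return h
```

The `pipe` deque plays the role of the LD delay elements: whatever the solver produces at one sample instant only reaches the weight update LD instants later, which is exactly what breaks the tight feedback loop.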

IV. EXPERIMENTAL RESULTS

A. Simulation Results

To demonstrate the convergence performance of our proposed algorithm, a system identification simulation averaged over 1000 ensemble runs has been performed. The desired response is generated as

d(n) = h0^T x(n) + w(n),

where h0 represents the unknown system and w(n) is a white Gaussian noise sequence with variance σ_w^2. The sequence x(n), which forms the reference signal vector x(n) = [x(n) x(n − 1) . . . x(n − N + 1)]^T, is generated either as an uncorrelated Gaussian sequence with unit variance or as a correlated first order autoregressive sequence defined as x(n) = 0.9 · x(n − 1) + v(n), where v(n) is white Gaussian with unit variance. The elements of the vector h0 are generated randomly with a uniform distribution on the interval [−1, 1]. The mean squared error (MSE) performance, defined as ‖h0 − ĥ(n)‖²/‖h0‖², is compared among the classical RLS, the NLMS, and our proposed algorithm. Note that we omit comparison with the original RLS-DCD algorithm since its computational results are similar to those of the RLS algorithm. The comparison with the LMS algorithm is also not included, since both its stability and its convergence are inferior to the NLMS algorithm. The step size μ of the NLMS algorithm is chosen such that its residual MSE is at least as large as those of all other tested algorithms; reducing the residual MSE by choosing a smaller step size would result in even slower convergence.

Fig. 5. MSE learning curve for correlated reference inputs; the algorithm parameters are the same as in Fig. 4.

Simulation results are shown in Figs. 4 and 5. Though the convergence of our algorithm is slightly slower than that of the RLS algorithm, it is still far faster than that of the NLMS algorithm, especially for correlated reference inputs, where the NLMS algorithm suffers substantial performance degradation. Note that our simulation includes the case with only a single DCD iteration (Nu = 1) and a large delay relaxation (LD = 10), which possibly gives a very small iteration bound. In such a case, the NLMS algorithm should also be delay relaxed [11] to balance the iteration bounds, resulting in an even larger MSE performance gap. We also note that, for our delay relaxed RLS-DCD algorithm, increasing the number of DCD iterations contributes little to the MSE convergence speed, while its impact on the iteration bound is critical; it can even worsen the steady-state tail of the MSE curve. Finally, the MSE curves for various LD values are shown in Fig. 6. The impact of increasing LD on the MSE curve is observed to be gradual, providing a good tradeoff between the iteration bound and the MSE performance.

B. Implementation Results

As a proof of concept, we compared two prototype designs implementing the conventional and the proposed algorithms for a system of N = 32 with λ = 1 − 2^−11, δ = 2^−8, H = 2^−1, Mb = 8, and Nu = 2. The conventional design performs the original RLS-DCD algorithm without delay relaxation, while delay relaxation of LD = 20 is applied in the proposed design.
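The system identification setup of Section IV-A is easy to reproduce in software. The sketch below generates the noiseless uncorrelated-input case and runs the NLMS baseline; the RLS-DCD curves of the paper would be produced by substituting the corresponding update into the same loop, and all names here are ours:

```python
import numpy as np

def misalignment(h0, h):
    # MSE metric of Sec. IV-A: ||h0 - h(n)||^2 / ||h0||^2
    return float(np.sum((h0 - h) ** 2) / np.sum(h0 ** 2))

def run_nlms(x_seq, d_seq, N, mu=0.5, eps=1e-8):
    """NLMS baseline: the normalized step mitigates gradient noise
    amplification, as noted in the introduction."""
    h = np.zeros(N)
    xbuf = np.zeros(N)
    for x_new, d in zip(x_seq, d_seq):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x_new
        e = d - h @ xbuf                              # a priori error
        h = h + mu * e * xbuf / (eps + xbuf @ xbuf)   # normalized update
    return h

rng = np.random.default_rng(0)
N = 16
h0 = rng.uniform(-1.0, 1.0, N)        # unknown system, uniform on [-1, 1]
x_seq = rng.standard_normal(4000)     # uncorrelated unit-variance reference

# Desired response d(n) = h0^T x(n) (noiseless variant of Sec. IV-A)
d_seq, xbuf = [], np.zeros(N)
for x_new in x_seq:
    xbuf = np.roll(xbuf, 1)
    xbuf[0] = x_new
    d_seq.append(float(h0 @ xbuf))

h_hat = run_nlms(x_seq, d_seq, N)
```

Plotting `misalignment(h0, ·)` over time for each algorithm yields learning curves of the kind shown in Figs. 4 and 5.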


Fig. 6. MSE learning curve LD tradeoff; Nu = 1 and the other parameters are the same as in Fig. 4.

TABLE IV
IMPLEMENTATION RESULTS

In the proposed design, 14 delay elements are retimed into the two DCD units, and 6 delay elements into the upstream feedback paths. Furthermore, the 6 delay elements retimed towards the node RLS-1 in Fig. 1 are shifted and merged efficiently into the time-average correlation matrix calculation and tap-input reference units, where the transversal filter structure [3] allows a great reduction in the number of registers.

Both designs are implemented in VHDL and compiled with Altera Quartus II on an EP4CE115-F23C8 FPGA. Fixed-point arithmetic is used, with a 14-bit Q7 number format for the input and output signals and correspondingly optimized number formats for the internal signals.

As a result, the proposed design achieves a 16-fold throughput increase over the conventional design, as shown in Table IV, with tolerable MSE performance degradation.

V. COMPUTATIONAL COMPLEXITY

A favorable aspect of the proposed algorithm is that its computational complexity is equal to that of the original RLS-DCD algorithm, which requires only 3N multiplications and no divisions [3]. This is only slightly larger than that of the (delay relaxed) LMS algorithm, which is 2N. The NLMS algorithm further requires a division operation along with some additional multiplications. The additional adders and registers of the proposed algorithm are easily justified by its much superior convergence performance.

High throughput QRD-RLS algorithms also have some drawbacks compared to the proposed algorithm. In particular, while the work in [8] reduces the overall latency of the QRD-RLS algorithm by using lookup tables (LUTs) and multiplication units instead of coordinate rotation digital computer (CORDIC) units, it does not tackle the recursive nature of the QRD-RLS algorithm, and the throughput increase obtainable by pipelining its unfolded architecture is therefore strictly limited. Its computational complexity is also high, including LUTs on the order of N and multiplications on the order of N².

The algorithm in [7] also performs an exact QRD-RLS computation as in [8], but is capable of reducing the iteration bound in proportion to its look-ahead [5] factor—a property shared with our algorithm. However, its computational complexity in terms of CORDIC units increases linearly with the look-ahead factor, and hence becomes prohibitive when the throughput requirement is high; the number of CORDIC units is on the order of LN², where L denotes the look-ahead factor. In contrast, the small complexity of the proposed algorithm, inherited from the original RLS-DCD algorithm, is kept constant by trading off performance against throughput. While the algorithm in [6] shows a similar property with a marginal increase in complexity, its overall complexity is still much larger than ours because its modified CORDIC units include multiplication and division.

VI. CONCLUSION

In this brief, we have proposed a modified RLS-DCD algorithm based on delay relaxation, whose iteration bound can be reduced arbitrarily compared to the original RLS-DCD algorithm. Experimental results show a favorable tradeoff between the iteration bound and convergence speed.

REFERENCES

[1] Y. V. Zakharov and T. C. Tozer, "Multiplication-free iterative algorithm for LS problem," Electron. Lett., vol. 40, no. 9, pp. 567–569, Apr. 2004.
[2] J. Liu, Y. V. Zakharov, and B. Weaver, "Architecture and FPGA design of dichotomous coordinate descent algorithms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 11, pp. 2425–2438, Nov. 2009.
[3] Y. V. Zakharov, G. P. White, and J. Liu, "Low-complexity RLS algorithms using dichotomous coordinate descent iterations," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3150–3161, Jul. 2008.
[4] S. Haykin, Adaptive Filter Theory, 4th ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[5] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, 1st ed. Hoboken, NJ, USA: Wiley, 1999.
[6] K. J. Raghunath and K. K. Parhi, "Pipelined RLS adaptive filtering using scaled tangent rotations (STAR)," IEEE Trans. Signal Process., vol. 44, no. 10, pp. 2591–2604, Oct. 1996.
[7] L. Gao and K. K. Parhi, "Hierarchical pipelining and folding of QRD-RLS adaptive filters and its application to digital beamforming," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, no. 12, pp. 1503–1519, Dec. 2000.
[8] M. S. Alizadeh, J. Bagherzadeh, and M. Sharifkhani, "A low-latency QRD-RLS architecture for high-throughput adaptive applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 708–712, Jul. 2016.
[9] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Trans. Acoust. Speech Signal Process., vol. 37, no. 9, pp. 1397–1405, Sep. 1989.
[10] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, 2002.
[11] P. J. Voltz, "Sample convergence of the normed LMS algorithm with feedback delay," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 3, Toronto, ON, Canada, Apr. 1991, pp. 2129–2132.

