Abstract—The recursive least squares (RLS) algorithm using dichotomous coordinate descent (DCD) iterations, namely RLS-DCD, is regarded as well suited for hardware implementation because of its small computational complexity compared to the classical RLS algorithm. While this is true, another important aspect that ultimately determines its applicability to real-time applications with high sample rates is its iteration bound. In this brief, we discuss this issue and propose a modified RLS-DCD algorithm based on delay relaxation whose iteration bound can be reduced arbitrarily. The degradation in convergence speed is shown to be tolerable, so that convergence is still much faster than that of the normalized least mean square algorithm.

Index Terms—Adaptive filter, RLS, DCD, delay relaxation, iteration bound, FPGA.

Manuscript received October 27, 2016; revised February 24, 2017; accepted April 20, 2017. Date of publication May 19, 2017; date of current version December 22, 2017. This work was supported by the Basic Science Research Program through NRF funded by MSIP under Grant NRF-2015R1A2A1A15052493, the Technology Innovation Program funded by MOTIE under Grant 10051928, the Bio-Mimetic Robot Research Center funded by DAPA under Grant UD130070ID, INMC, and BK21-plus. This brief was recommended by Associate Editor M. Cao. (Corresponding author: Jungwoo Lee.)

The authors are with the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: bdkim@wspl.snu.ac.kr; hyuklee@wspl.snu.ac.kr; jjpearll@wspl.snu.ac.kr; junglee@snu.ac.kr).

Digital Object Identifier 10.1109/TCSII.2017.2706367

I. INTRODUCTION

The dichotomous coordinate descent (DCD) algorithm has been introduced as an efficient tool for solving least squares (LS) problems [1]. The LS problem is usually formulated in the matrix form of the normal equation

Rh = β,

where R is an N × N symmetric positive definite matrix, and both h and β are N × 1 vectors. The objective of the LS problem is to find a solution vector h for given R and β. Different from the canonical method using matrix inversion, the DCD algorithm belongs to the family of iterative methods, which are preferable due to reduced algorithm complexity and numerical stability [2]. The DCD algorithm requires neither multiplications nor divisions, and uses only addition and bit-shift operations, which makes it well suited for hardware implementation.

It has also been shown that the DCD algorithm is applicable to the recursive LS (RLS) algorithm [3]. While the classical RLS algorithm is well known to exhibit much faster convergence than the least mean square (LMS) algorithm [4], its computational complexity is far larger, making its use quite limited in practice. While the reductions in computational complexity achieved by numerous variants of the classical RLS algorithm are only marginal [3], the RLS-DCD algorithm shows a remarkable improvement. In particular, it has been shown that its computational complexity can be reduced to 3N multiplications per sample, where N is the filter length, which is only slightly larger than the 2N multiplications of the LMS algorithm.

Although hardware implementations of the RLS-DCD algorithm have been investigated and shown to be of low complexity [2], [3], little has been said about its fundamental throughput limit, which can be measured in terms of the iteration bound [5]. In real-time applications with high sample rates, it is often the iteration bound, not the computational complexity, that ultimately determines the feasibility of an algorithm. Accordingly, various methods [6]–[8] have focused on improving the iteration bound of the QR-decomposition RLS (QRD-RLS) algorithm, an algorithm that is equivalent to the classical RLS algorithm but with improved numerical stability [4].

In this brief, we first identify the iteration bound of the RLS-DCD algorithm and argue that it is excessive, which makes the algorithm inappropriate for real-time implementation with a high sample rate. We then propose a modified RLS-DCD algorithm based on delay relaxation, whose iteration bound can be made arbitrarily small by simply increasing its delay parameter. Simulation results reveal that the convergence speed degrades only moderately, and is still much faster than that of the popular normalized LMS (NLMS) algorithm, a variant of the LMS algorithm which mitigates the gradient noise amplification problem and is also generally faster [4].

The rest of this brief is organized as follows. Section II briefly reviews the original RLS-DCD algorithm. In Section III, our delay relaxed RLS-DCD algorithm is presented with emphasis on the iteration bound, followed by experimental results in Section IV. A discussion comparing our algorithm with other high-throughput algorithms is given in Section V, and concluding remarks in Section VI.
1549-7747 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
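To make the multiplication-free idea described in the Introduction concrete, the following sketch solves Rh = β by coordinate descent with power-of-two step sizes. This is an illustrative floating-point model in Python, not the exact fixed-point DCD algorithm of [1]; the parameter names H (step-size range), Mb (number of step-size halvings), and Nu (update budget) follow the usual DCD notation, but the function itself is an assumption of this sketch.

```python
def dcd_sketch(R, beta, H=1.0, Mb=15, Nu=1000):
    """Coordinate descent for R h = beta with power-of-two step sizes.

    Illustrative floating-point model (not the exact DCD of [1]):
    the step size alpha starts at H/2 and is halved Mb times, so every
    weight update is by +/- alpha with alpha a power-of-two multiple
    of H. In fixed-point hardware the scalings by alpha reduce to bit
    shifts; Nu caps the accepted updates per step-size level.
    """
    N = len(beta)
    h = [0.0] * N
    r = list(beta)                 # residual of R h = beta at h = 0
    alpha = H / 2.0
    for _ in range(Mb):
        updated, iters = True, 0
        while updated and iters < Nu:
            updated = False
            for k in range(N):
                # accept a step along coordinate k only if it lowers
                # the quadratic cost: |r_k| > (alpha / 2) * R_kk
                if abs(r[k]) > 0.5 * alpha * R[k][k]:
                    s = 1.0 if r[k] > 0 else -1.0
                    h[k] += s * alpha
                    for i in range(N):      # r <- r - s * alpha * R[:, k]
                        r[i] -= s * alpha * R[i][k]
                    updated, iters = True, iters + 1
        alpha *= 0.5               # a right bit shift in hardware
    return h, r
```

For a small symmetric positive definite system such as R = [[4, 1], [1, 3]] and β = [1, 2], the returned h approaches R⁻¹β to within the final step size, while every weight update used only sign tests, additions, and halvings.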
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 1, JANUARY 2018

II. RLS-DCD ALGORITHM

The RLS algorithm solves the sequence of normal equations

R(n)h(n) = β(n),    (1)

n = 0, 1, . . . , in a recursive manner. R is the time-average correlation matrix of the tap-input reference signal x, and β is the time-average cross-correlation vector between the tap-input reference signal x and the desired response d. By imposing a forgetting factor λ, they are generally computed recursively as

R(n) = λR(n − 1) + x(n)xᵀ(n),    (2)

β(n) = λβ(n − 1) + d(n)x(n).    (3)

While the classical RLS algorithm makes use of the matrix inversion lemma in order to construct a recursion formula for h(n) and avoid computing the inverse of the matrix R(n), the RLS-DCD algorithm first obtains a recursion by reformulating the RLS problem in terms of a sequence of auxiliary normal equations with respect to increments of the filter weights, and then avoids the costly matrix inversion by approximately solving the auxiliary normal equations via the DCD algorithm.

A. RLS Reformulation

Assume that, at the previous step n − 1, (1) is approximately solved such that ĥ(n − 1) is the solution with residual vector

r(n − 1) = β(n − 1) − R(n − 1)ĥ(n − 1).    (4)

As a recursion,

Δh(n) = h(n) − ĥ(n − 1)    (5)

is the filter weight increment that will be (approximately) computed at the current step n. From (1) and (5), the auxiliary normal equation to be solved is

R(n)Δh(n) = β₀(n),    (6)

where

β₀(n) = β(n) − R(n)ĥ(n − 1).    (7)

By defining

ΔR(n) = R(n) − R(n − 1),
Δβ(n) = β(n) − β(n − 1),

and using them together with (4) in (7), β₀(n) can be expressed as

β₀(n) = r(n − 1) + Δβ(n) − ΔR(n)ĥ(n − 1).

Then, an approximate auxiliary solution Δĥ(n) for (6) will be computed, and an approximate solution for (1) is given by

ĥ(n) = ĥ(n − 1) + Δĥ(n),    (8)

together with an auxiliary residual vector

r₀(n) = β₀(n) − R(n)Δĥ(n).

Observe that

r(n) = β(n) − R(n)ĥ(n)
     = β₀(n) − R(n)Δĥ(n)
     = r₀(n),

where the second equality is due to (7) and (8). Therefore, r(n) for the next recursion does not require additional computation. Further incorporating (2) and (3) with some derivation, the algorithm can be summarized as in Table I.

TABLE I
REFORMULATED RLS ALGORITHM

B. DCD Algorithm

The DCD algorithm can be employed in Step 5 of Table I. It is an approximation of the coordinate descent algorithm—a line search algorithm that solves normal equations iteratively with gradients lying on a single coordinate direction. The coordinate descent property is favorable from an implementation point of view, since the computational resources can be shared among all the coordinates in each iteration. The DCD algorithm further greatly reduces the computational complexity by getting rid of all multiplication and division operations. This is achieved by cleverly choosing the filter weight step size to be a power of two.

The DCD algorithm with a leading element [3] is shown in Table II. It is a slight variation of the original DCD algorithm with a modification in the coordinate selection process, i.e., Step 1. In particular, instead of cycling through all the coordinates, the algorithm selects the coordinate with the largest absolute value in the current residual vector, generally resulting in faster convergence. Steps 2, 3, and 4 determine the step size, which is constrained to be a power of two as aforementioned. In Steps 5 and 6, the solution and residual vector are updated, respectively, where the former is updated only in the selected coordinate. More details on the DCD algorithm can be found in [1] and [3].

TABLE II
DCD ALGORITHM

III. DELAY RELAXED RLS-DCD ALGORITHM

Even though the RLS-DCD algorithm shows remarkably reduced computational complexity, it still may not be suitable for real-time implementation because of its rather large iteration bound.
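The reformulated recursion of Section II-A, equations (2) through (8), can be sketched directly. In the sketch below, dcd_solve is a stand-in for the inner solver of Step 5 (the DCD algorithm of Table II); any routine that approximately solves the auxiliary normal equation and returns its residual fits. The function and variable names are illustrative, not the paper's.

```python
def rls_dcd_step(R, beta, h_hat, r, x, d, lam, dcd_solve):
    """One step of the reformulated RLS recursion of Section II-A.

    Equation numbers refer to the brief. dcd_solve(R, b) is any
    routine that approximately solves R dh = b and returns
    (dh, residual); in the brief it is the DCD algorithm of Table II.
    """
    N = len(x)
    # increments implied by (2) and (3):
    #   dR(n) = (lam - 1) R(n-1) + x x^T
    #   dbeta(n) = (lam - 1) beta(n-1) + d x
    dR = [[(lam - 1.0) * R[i][j] + x[i] * x[j] for j in range(N)]
          for i in range(N)]
    dbeta = [(lam - 1.0) * beta[i] + d * x[i] for i in range(N)]
    R_new = [[R[i][j] + dR[i][j] for j in range(N)] for i in range(N)]
    beta_new = [beta[i] + dbeta[i] for i in range(N)]
    # beta0(n) = r(n-1) + dbeta(n) - dR(n) h_hat(n-1)
    beta0 = [r[i] + dbeta[i] - sum(dR[i][j] * h_hat[j] for j in range(N))
             for i in range(N)]
    dh, r0 = dcd_solve(R_new, beta0)              # approximately solve (6)
    h_new = [h_hat[i] + dh[i] for i in range(N)]  # (8)
    # identity below (8): r(n) = r0(n), so no extra residual computation
    return R_new, beta_new, h_new, r0
```

Note that the identity r(n) = r₀(n) holds for any increment returned by the inner solver, exact or approximate, which is why the residual can simply be carried into the next step at no extra cost.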
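The delay relaxation of Section III builds on the idea of delayed coefficient adaptation [9]: the weight update consumes an error computed D samples earlier, so D pipeline registers can be placed in the error-to-update feedback loop, reducing the iteration bound at the price of slower convergence. As a minimal illustration of the principle, the sketch below applies the delay to the simpler LMS update rather than to the RLS-DCD recursion; it is a conceptual model, not the delay relaxed algorithm of Table III.

```python
from collections import deque

def delayed_lms(x_seq, d_seq, N, mu, D):
    """LMS with coefficient adaptation delayed by D samples [9].

    The error used to update the weights is D samples old, so the
    error-to-update feedback loop tolerates D pipeline registers,
    lowering the iteration bound at the cost of slower convergence.
    """
    w = [0.0] * N
    x_buf = deque([0.0] * N, maxlen=N)   # tap-delay line, newest first
    pending = deque()                    # (error, taps) waiting D samples
    errors = []
    for x, d in zip(x_seq, d_seq):
        x_buf.appendleft(x)
        taps = list(x_buf)
        y = sum(wi * xi for wi, xi in zip(w, taps))
        e = d - y
        errors.append(e)
        pending.append((e, taps))
        if len(pending) > D:             # update with a D-sample-old error
            e_old, taps_old = pending.popleft()
            for i in range(N):
                w[i] += mu * e_old * taps_old[i]
    return w, errors
```

With D = 0 the function reduces to ordinary LMS; increasing D slows adaptation while leaving the noise-free steady state unchanged, mirroring the iteration-bound versus convergence-speed tradeoff reported in the brief.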
KIM et al.: DELAY RELAXED RLS-DCD ALGORITHM FOR REAL-TIME IMPLEMENTATION

TABLE III
DELAY RELAXED RLS-DCD ALGORITHM

In the proposed design, 14 delay elements are retimed into the two DCD units, and 6 delay elements into the upstream feedback paths. Furthermore, the 6 delay elements retimed towards the node RLS-1 in Fig. 1 are shifted and merged efficiently into the time-average correlation matrix calculation and tap-input reference units, where the transversal filter structure [3] allows a great reduction in the number of registers.

Both designs are implemented in VHDL and compiled with Altera Quartus II on an EP4CE115-F23C8 FPGA. Fixed-point arithmetic is used with a 14-bit Q7 number format for input and output signals, and correspondingly optimized number formats for internal signals.

As a result, the proposed design achieves a 16-fold increase in throughput over the conventional design, as shown in Table IV, with tolerable MSE performance degradation.

V. COMPUTATIONAL COMPLEXITY

A favorable aspect of the proposed algorithm is that its computational complexity is equal to that of the original RLS-DCD algorithm, which requires only 3N multiplications and no divisions [3]. This is only slightly larger than that of the (delay relaxed) LMS algorithm, which is 2N. The NLMS algorithm further requires a division operation along with some additional multiplications. The additional complexity of adders and registers in the proposed algorithm can be simply justified by its much superior performance.

High-throughput QRD-RLS algorithms also have some drawbacks compared to the proposed algorithm. In particular, while the work in [8] reduces the overall latency of

VI. CONCLUSION

In this brief, we have proposed a modified RLS-DCD algorithm based on delay relaxation, whose iteration bound can be reduced arbitrarily compared to the original RLS-DCD algorithm. Experimental results show a tradeoff between the iteration bound and convergence speed.

REFERENCES

[1] Y. V. Zakharov and T. C. Tozer, "Multiplication-free iterative algorithm for LS problem," Electron. Lett., vol. 40, no. 9, pp. 567–569, Apr. 2004.
[2] J. Liu, Y. V. Zakharov, and B. Weaver, "Architecture and FPGA design of dichotomous coordinate descent algorithms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 11, pp. 2425–2438, Nov. 2009.
[3] Y. V. Zakharov, G. P. White, and J. Liu, "Low-complexity RLS algorithms using dichotomous coordinate descent iterations," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3150–3161, Jul. 2008.
[4] S. Haykin, Adaptive Filter Theory, 4th ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[5] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, 1st ed. Hoboken, NJ, USA: Wiley, 1999.
[6] K. J. Raghunath and K. K. Parhi, "Pipelined RLS adaptive filtering using scaled tangent rotations (STAR)," IEEE Trans. Signal Process., vol. 44, no. 10, pp. 2591–2604, Oct. 1996.
[7] L. Gao and K. K. Parhi, "Hierarchical pipelining and folding of QRD-RLS adaptive filters and its application to digital beamforming," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, no. 12, pp. 1503–1519, Dec. 2000.
[8] M. S. Alizadeh, J. Bagherzadeh, and M. Sharifkhani, "A low-latency QRD-RLS architecture for high-throughput adaptive applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 708–712, Jul. 2016.
[9] G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," IEEE Trans. Acoust. Speech Signal Process., vol. 37, no. 9, pp. 1397–1405, Sep. 1989.
[10] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A K Peters, 2002.
[11] P. J. Voltz, "Sample convergence of the normed LMS algorithm with feedback delay," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 3, Toronto, ON, Canada, Apr. 1991, pp. 2129–2132.