1053-587X © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Fondren Library Rice University. Downloaded on May 18,2020 at 07:55:40 UTC from IEEE Xplore. Restrictions apply.
574 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 68, 2020
implementations with an increasing number of users. This issue limits the application of hardware architectures (detectors) in massive MIMO systems [16]–[19]. Consequently, many methods have been proposed to further reduce computational complexity, improve the parallelism of MMSE and optimize hardware architectures. In [16] and [17], Cholesky decomposition (CHD)-based detectors and LU decomposition (LUD)-based detectors were proposed and implemented in hardware with high detection accuracy. However, the throughput is limited, and the architecture requires a significant amount of hardware resources. Neumann series approximation (NSA)-based architectures have been proposed to achieve high throughput for massive MIMO detection [18]–[20]. However, only a marginal reduction in complexity can be achieved. To achieve a reasonable balance of detection accuracy, throughput, and hardware resource consumption, architectures based on approximation methods, such as the Gauss-Seidel (GS) [21], [22], successive over-relaxation (SOR) [23], [24], weighted Jacobi iteration (WeJi) [25], [26] and Monte Carlo tree search (MCTS) [27] methods, have also been proposed. However, the computations in the GS, SOR, WeJi, and MCTS methods are difficult to parallelize due to high correlations when estimating each symbol from users. To explore the parallelism between each step, implicit methods have been proposed, including optimized coordinate descent (OCD) [28], parallelizable Chebyshev iteration (PCI) [29] and intraiterative interference cancellation (IIC) [30]. However, these implicit methods ignore the unique properties of massive MIMO systems (e.g., channel hardening). Therefore, the same Gram matrix needs to be calculated multiple times, which means that implicit method architectures suffer from higher energy consumption and latency than explicit methods.

Contributions: In this paper, we propose a recursive conjugate gradient (RCG) method to achieve massive MIMO detection with corresponding very-large-scale integration (VLSI) designs. Our contributions are summarized as follows:
1) We propose a modified conjugate gradient method to solve the detection problem with high parallelism and low complexity in the massive MIMO system.
2) According to the property of the massive MIMO system, we propose a quadrant-certain-based initial method and an approximated log likelihood ratio (LLR) method, all of which reduce the computational complexity while maintaining high detection accuracy.
3) We mathematically analyze the approximated error, complexity, and parallelism of the proposed RCG method. RCG maintains its advantages in terms of these factors. In addition, we provide symbol-error-rate (SER) simulation results to show that compared with related methods, the RCG method achieves high detection accuracy.
4) We develop a VLSI architecture for RCG that uses a parallel processing element (PE) array and a deeply pipelined user-level method to achieve high throughput with low hardware consumption.
5) The architecture is verified on an FPGA and fabricated on silicon. The chip achieves a 1.5 Gbps throughput under a 500 MHz working frequency while dissipating 557 mW at 1.2 V. We show reference FPGA and ASIC implementation results and compare our designs to those of recently reported massive MIMO detection implementations.

Our results demonstrate that RCG provides near-optimal detection accuracy with high parallelism and low complexity, and compared with other state-of-the-art architectures, the corresponding architecture achieves substantial improvements in terms of energy and area efficiencies.

Notation: Bold uppercase, bold lowercase, and lowercase letters denote matrices, vectors, and scalars, respectively. $(\cdot)_{i,j}$ denotes the element in the $i$th row and $j$th column of a matrix. $(\cdot)_i$ denotes the $i$th element of a vector. $\mathbf{I}_M$ denotes the $M \times M$ identity matrix. $(\cdot)^H$ and $(\cdot)^{-1}$ denote conjugate transpose and inversion, respectively.

Outline: The remainder of this paper is organized as follows. Section II briefly introduces the system model and motivation. Section III describes the proposed recursive conjugate gradient method for a massive MIMO system. Section IV shows the symbol-error-rate simulation results and comparisons. Section V presents the proposed VLSI architecture. Section VI shows the hardware implementation results and their comparisons with state-of-the-art designs. Conclusions are drawn in Section VII.

II. SYSTEM MODEL AND MOTIVATION

In an $N_r \times N_t$ MIMO system with $N_t$ transmitters on the user side and $N_r$ antennas on the base station (BS) side (predominantly $N_r \gg N_t$ [2]–[4]), the uplink system can be modeled as [10]–[12]

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$

where $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ represents a Rayleigh flat-fading channel matrix; $\mathbf{s} \in \mathbb{C}^{N_t \times 1}$ denotes the transmitted signal vector, which is based on the 64-QAM modulation constellation set $\Omega$ in this paper; $\mathbf{n} \in \mathbb{C}^{N_r \times 1}$ is an additive white Gaussian noise vector with zero mean and variance $\sigma^2$; and $\mathbf{y} \in \mathbb{C}^{N_r \times 1}$ is the signal vector received at the BS. In classical MMSE detection, the estimation of the transmitted signal can be expressed as [16], [17]

$$\hat{\mathbf{s}} = \left(\mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}\right)^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{W}^{-1}\mathbf{y}_{\mathrm{MF}}, \tag{2}$$

where $\mathbf{W} = \mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}$ is the MMSE filtering matrix with a noise power spectral density of $N_0$ and a signal power spectral density of $E_s$, and $\mathbf{y}_{\mathrm{MF}} = \mathbf{H}^H\mathbf{y}$ denotes the matched-filter vector. According to (1) and (2), the vector $\hat{\mathbf{s}}$ can be described as

$$\hat{\mathbf{s}} = \mathbf{W}^{-1}\mathbf{H}^H\mathbf{H}\mathbf{s} + \mathbf{W}^{-1}\mathbf{H}^H\mathbf{n}, \tag{3}$$

where $\mathbf{U} = \mathbf{W}^{-1}\mathbf{H}^H\mathbf{H}$ and $\mathbf{V} = \mathbf{U}\mathbf{W}^{-1}$ are the equivalent channel matrices, which can be used to compute the equivalent channel gain and postequalization noise plus interference (NPI, $\sigma^2$), respectively. The soft-output LLR for the $b$-th bit index and $i$-th user satisfies

$$L_{i,b} = \frac{U_{i,i}^2}{\sigma_i^2}\left(\min_{s \in S_b^0}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2 - \min_{s \in S_b^1}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2\right) = \zeta_i^2\,\varphi_b(\hat{s}_i), \tag{4}$$
LIU et al.: ENERGY- AND AREA-EFFICIENT RECURSIVE-CONJUGATE-GRADIENT-BASED MMSE DETECTOR 575
where $\zeta_i^2$ is the signal-to-interference-plus-noise ratio (SINR) for the $i$-th user; $\varphi_b(\hat{s}_i)$ is a piecewise linear function for Gray mappings; and $S_b^0$ and $S_b^1$ denote the sets of modulation constellation symbols whose $b$-th bits are 0 and 1, respectively. In addition, there is $\sigma_i^2 = \sum_{j \neq i}^{N_t} |U_{i,j}|^2 E_s + V_{i,i} N_0$.

In massive MIMO systems, MMSE detection can achieve near-optimal performance [3]. However, the result of large-scale multiplication during the computation of the Gram matrix is required in the following computations. Considering hardware implementation, this limitation significantly influences the throughput of the entire detector and affects resource consumption [16], [19], [25]. Furthermore, the computational complexities of matrix inversion $\mathbf{W}^{-1}$ in (2) and LLR in (4) are high, particularly in systems with a large number of antennas [16], [19], [25]. Considering hardware implementation, the computation of the matrix $\mathbf{W}^{-1}$ restricts system parallelism due to the high correlation of each computation.

III. RECURSIVE CONJUGATE GRADIENT METHOD FOR MASSIVE MIMO DETECTION

First, this section describes a recursive conjugate gradient detection algorithm that approximates an MMSE detector in a massive MIMO system. For the formulation of this algorithm, an optimized conjugate gradient-based method is presented and a quadrant-certain-based initial method is proposed. In addition, an approximate method of computing LLRs is presented. Second, it is demonstrated that the proposed method has a low approximated error. Third, analyses of the proposed recursive conjugate gradient detection algorithm are presented to show its advantages in terms of computational complexity and parallelism in comparison with other algorithms.

A. Proposed Recursive Conjugate Gradient Detection

1) Optimized Iteration Method: Conjugate Gradient (CG) iteration is a promising method for solving linear equations such as (2) [31]. Therefore, CG iteration can be used in the MMSE detection algorithm to reduce computational complexity by avoiding large-scale matrix inversions. However, the traditional CG method still has some shortcomings. In the original algorithm, there is strong data dependency between each iteration and between each element of the estimated vector in one iteration. To solve these problems, based on the CG method, Recursive Conjugate Gradient (RCG) iteration can be expressed as

$$\hat{\mathbf{s}}^{(k+1)} = \hat{\mathbf{s}}^{(k)} + \alpha^{(k+1)}\mathbf{p}^{(k+1)}, \tag{5}$$

where $\mathbf{p}^{(k)}$ is an orthogonal basis, $k$ is the iteration number, and the parameter $\alpha^{(k)}$ can be calculated as

$$\alpha^{(k)} = \frac{\left(\mathbf{z}^{(0)}, \mathbf{p}^{(k)}\right)}{\left(\mathbf{W}\mathbf{p}^{(k)}, \mathbf{p}^{(k)}\right)}. \tag{6}$$

In (6), $\mathbf{z}^{(k)}$ represents the residual vector, which can be expressed as

$$\mathbf{z}^{(k)} = \mathbf{y}_{\mathrm{MF}} - \mathbf{W}\hat{\mathbf{s}}^{(k)}. \tag{7}$$

In addition, according to the Lanczos orthogonalization algorithm [31], the residual vector $\mathbf{z}^{(k+1)}$ can be expressed as

$$\mathbf{z}^{(k+1)} = \rho^{(k)}\left(\mathbf{z}^{(k)} - \gamma^{(k)}\boldsymbol{\eta}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\mathbf{z}^{(k-1)}, \tag{8}$$

where $\rho^{(k)}$, $\gamma^{(k)}$, and $\boldsymbol{\eta}^{(k)} = \mathbf{W}\mathbf{z}^{(k)}$ are iterative parameters. When computing the iterative residual, the RCG algorithm improves the original CG method by using the Lanczos orthogonalization algorithm, which improves the parallelism of both inter-iteration and intra-iteration computation. Combining (7) and (8), the estimation of the transmitted vector can be expressed as

$$\hat{\mathbf{s}}^{(k+1)} = \rho^{(k)}\left(\hat{\mathbf{s}}^{(k)} + \gamma^{(k)}\mathbf{z}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\hat{\mathbf{s}}^{(k-1)}. \tag{9}$$

Next, it is important to compute the iterative parameters. Because the vectors $\mathbf{z}^{(k+1)}$, $\mathbf{z}^{(k)}$, and $\mathbf{z}^{(k-1)}$ are mutually orthogonal [31], there are

$$\left(\mathbf{z}^{(k+1)}, \mathbf{z}^{(k)}\right) = \left(\mathbf{z}^{(k-1)}, \mathbf{z}^{(k)}\right) = \left(\mathbf{z}^{(k-1)}, \mathbf{z}^{(k+1)}\right) = 0. \tag{10}$$

Hence, combining (8) and (10), the parameters can be obtained as

$$\gamma^{(k)} = \frac{\xi^{(k)}}{\phi^{(k)}}; \qquad \rho^{(k)} = \frac{\xi^{(k-1)}}{\xi^{(k-1)} + \gamma^{(k)}\left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k-1)}\right)}, \tag{11}$$

where $\xi^{(k)}$ and $\phi^{(k)}$ can be computed as

$$\xi^{(k)} = \left(\mathbf{z}^{(k)}, \mathbf{z}^{(k)}\right); \qquad \phi^{(k)} = \left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k)}\right). \tag{12}$$

In addition, according to (8) and (10), there are

$$\left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k-1)}\right) = \left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k-1)}\right), \qquad \boldsymbol{\eta}^{(k-1)} = -\frac{\mathbf{z}^{(k)}}{\rho^{(k-1)}\gamma^{(k-1)}} + \frac{\mathbf{z}^{(k-1)}}{\gamma^{(k-1)}} + \frac{\left(1 - \rho^{(k-1)}\right)\mathbf{z}^{(k-2)}}{\rho^{(k-1)}\gamma^{(k-1)}}. \tag{13}$$

Therefore, the parameter $\rho^{(k)}$ can be computed as

$$\rho^{(k)} = \left(1 - \frac{\gamma^{(k)}}{\gamma^{(k-1)}}\,\frac{\xi^{(k)}}{\xi^{(k-1)}}\,\frac{1}{\rho^{(k-1)}}\right)^{-1}. \tag{14}$$

In this proposed algorithm, computing the iteration and its parameters requires initial settings, such as $\rho^{(0)} = 1$, $\mathbf{z}^{(-1)} = \mathbf{z}^{(0)}$, and $\hat{\mathbf{s}}^{(-1)} = \hat{\mathbf{s}}^{(0)}$.

After the iteration, points need to be found in the constellation graph according to $\hat{\mathbf{s}}^{(k)}$. In the constellation graph, the round-off method can be used to find the nearest points rather than calculating all distances with all constellation points of an element in $\hat{\mathbf{s}}^{(k)}$. In this way, the computational complexity is decreased from quadratic to linear.
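The recurrences (7)-(9), the parameter updates (12) and (14), and the round-off step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rcg_detect` and `slice_qam64` are hypothetical, the iteration starts from a zero vector instead of the quadrant-certain-based initial method, the 64-QAM levels are taken as the unnormalized odd integers from -7 to 7, and the early-exit threshold is an added numerical safeguard.

```python
import numpy as np

def rcg_detect(H, y, n0_over_es, num_iters=3):
    """Approximate W^{-1} y_MF with the three-term recurrences (8) and (9)."""
    nt = H.shape[1]
    W = H.conj().T @ H + n0_over_es * np.eye(nt)   # MMSE filtering matrix W
    y_mf = H.conj().T @ y                          # matched-filter vector y_MF
    s = np.zeros(nt, dtype=complex)                # assumption: zero initial estimate
    z = y_mf - W @ s                               # residual, eq. (7)
    s_prev, z_prev = s, z                          # s^(-1) = s^(0), z^(-1) = z^(0)
    xi0 = np.vdot(z, z).real
    rho_prev = gamma_prev = xi_prev = 1.0
    for k in range(num_iters):
        xi = np.vdot(z, z).real                    # xi = (z, z), eq. (12)
        if xi <= 1e-24 * xi0:                      # assumption: stop once converged
            break
        eta = W @ z                                # eta = W z
        gamma = xi / np.vdot(z, eta).real          # gamma = xi / phi, eq. (11)
        if k == 0:
            rho = 1.0                              # initial setting rho^(0) = 1
        else:                                      # eq. (14)
            rho = 1.0 / (1.0 - (gamma / gamma_prev) * (xi / xi_prev) / rho_prev)
        s_next = rho * (s + gamma * z) + (1.0 - rho) * s_prev      # eq. (9)
        z_next = rho * (z - gamma * eta) + (1.0 - rho) * z_prev    # eq. (8)
        s_prev, z_prev, s, z = s, z, s_next, z_next
        rho_prev, gamma_prev, xi_prev = rho, gamma, xi
    return s

def slice_qam64(x):
    """Round each real and imaginary part to the nearest odd level in [-7, 7]."""
    def nearest_level(v):
        return np.clip(2.0 * np.floor(v / 2.0) + 1.0, -7.0, 7.0)
    return nearest_level(x.real) + 1j * nearest_level(x.imag)
```

Each iteration needs one matrix-vector product and two inner products, and the updates of the estimate and the residual are independent of each other, which is the source of the parallelism discussed above. The slicer touches each element once instead of measuring the distance to all 64 constellation points.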
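According to [31], the orthogonal basis $\mathbf{p}^{(k)}$ and residual vector $\mathbf{z}^{(k)}$ in (5) satisfy $\mathbf{p}^{(1)} = \mathbf{z}^{(0)}$ and $\mathbf{z}^{(1)} = \mathbf{z}^{(0)} - \alpha^{(1)}\mathbf{W}\mathbf{p}^{(1)} = \mathbf{z}^{(0)} - \alpha^{(1)}\mathbf{W}\mathbf{z}^{(0)}$. Using mathematical induction, for the $t$th iteration, there are

$$\mathbf{z}^{(t-1)} = \sum_{i=0}^{t-1} a_i^{(t-1)}\mathbf{W}^i\mathbf{z}^{(0)}; \qquad \mathbf{p}^{(t)} = \sum_{i=0}^{t-1} b_i^{(t)}\mathbf{W}^i\mathbf{z}^{(0)}, \tag{24}$$

39: $L_{i,b}(\hat{s}_i) = \zeta_i^2\,\varphi_b(\hat{s}_i)$;
40: end for
41: end for

$\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(0)}$ be one vector of the space $S_t$; then, there is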
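Hence, combining (24), (25), and (29), $(\mathbf{m}^{(t)} - \hat{\mathbf{s}})$ in (27) can be expressed as

$$\begin{aligned}
\mathbf{m}^{(t)} - \mathbf{s} &= \hat{\mathbf{s}}^{(0)} + \sum_{i=1}^{k} c_i\,\mathbf{p}^{(i)} - \mathbf{s} \\
&= \mathbf{Q}^{(0)} + \sum_{i=1}^{k} c_i \sum_{j=0}^{i-1} b_j^{(i)}\mathbf{W}^j\mathbf{z}^{(0)} = \mathbf{Q}^{(0)} + \sum_{i=1}^{k} c'_i\,\mathbf{W}^i\mathbf{r}^{(0)} \\
&= \left(\mathbf{I} - \sum_{i=1}^{k} c'_i\,\mathbf{W}^i\right)\mathbf{Q}^{(0)} = P_k(\mathbf{W})\,\mathbf{Q}^{(0)},
\end{aligned} \tag{30}$$

where $c'_i$ is a constant, $P_k(\mathbf{W})$ is the $k$-degree polynomial of matrix $\mathbf{W}$, and $P_k(0) = 1$. All polynomials satisfying $P_k(0) = 1$ are defined as a set $X_k$; therefore, according to (30), (28) can be expressed as

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} = \min_{P_k \in X_k}\left\|P_k(\mathbf{W})\,\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}. \tag{31}$$

Because $\mathbf{W}$ is a symmetric positive definite matrix, its eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_{N_t})$ are all positive. Therefore, there are eigenvectors $\boldsymbol{\psi}^{(1)}, \boldsymbol{\psi}^{(2)}, \ldots, \boldsymbol{\psi}^{(N_t)}$, which satisfy:

$$\left(\boldsymbol{\psi}^{(i)}, \mathbf{W}\boldsymbol{\psi}^{(j)}\right) = 0; \qquad \left\|\boldsymbol{\psi}^{(i)}\right\|_{\mathbf{W}} = 1, \tag{32}$$

where $i, j = 1, 2, \ldots, N_t$, and $i \neq j$. Suppose

$$\mathbf{Q}^{(0)} = \sum_{i=1}^{N_t} c_i\,\boldsymbol{\psi}^{(i)}. \tag{33}$$

Let $\lambda_{\max}$ and $\lambda_{\min}$ be the largest and smallest eigenvalues of matrix $\mathbf{W}$, respectively. Hence, there are

$$\left\|P_k(\mathbf{W})\,\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}^2 = \left\|\sum_{i=1}^{N_t} c_i\,P_k(\lambda_i)\,\boldsymbol{\psi}^{(i)}\right\|_{\mathbf{W}}^2$$

$$= \frac{1}{T_k(d)} = \frac{1}{T_k\!\left(\dfrac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}\right)}, \tag{36}$$

where $R_k(\varsigma) = T_k(\varsigma)/T_k(d)$. According to the formula of the degree-$k$ Chebyshev polynomial, there is

$$T_k(x) = \frac{1}{2}\left[\left(x + \sqrt{x^2 - 1}\right)^k + \left(x - \sqrt{x^2 - 1}\right)^k\right]. \tag{37}$$

Therefore, $T_k\!\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}\right)$ in (36) can be computed as (38) shown at the bottom of the next page. Hence, combining (31), (33), (34), (36), and (38) there is

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} \le 2\left(\frac{\sqrt{\lambda_{\max}} - \sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}}\right)^k\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}} = 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}}, \tag{39}$$

where $\kappa = \lambda_{\max}/\lambda_{\min}$. Given the difficulty of determining the eigenvalues of matrix $\mathbf{W}$, the largest and smallest eigenvalues are approximated using the properties of massive MIMO systems. In a massive MIMO system, as $N_r$ and $N_t$ increase, the $\lambda_{\max}$ and $\lambda_{\min}$ of matrix $\mathbf{W}$ can be approximated as

$$\lambda_{\max} \approx N_r + N_t + 2\sqrt{N_r N_t}; \qquad \lambda_{\min} \approx N_r + N_t - 2\sqrt{N_r N_t}. \tag{40}$$

Therefore, combining (39) and (40), (39) can be approximated as

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} \le 2\left(\frac{N_t}{N_r}\right)^{k/2}\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}}. \tag{41}$$

According to (41), when the number of users ($N_t$) is fixed, as the number of BS antennas $N_r$ increases, the approximation error will decrease. In other words, when the ratio $N_t/N_r$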
decreases, the approximation error of the proposed method will decrease. For massive MIMO systems, the number of antennas at the BS is much larger than the number of users, which indicates that the proposed method achieves low approximation error. In addition, according to (40) and (41), when the number of iterations $k$ increases, the approximation error further decreases. Therefore, for large $k$, $\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\|_{\mathbf{W}}$ approaches zero. Hence, in a massive MIMO system, the proposed method can achieve a low approximation error that approaches zero.

C. Algorithm Analyses

The number of real-valued multiplications (RMUL) is generally used to evaluate the computational complexity of methods. In the proposed RCG iterative algorithm, the first set of calculations consists of a series of computations of preiterative parameters $\mathbf{y}_{\mathrm{MF}}$, $\mathbf{W}$, and $\mathbf{z}^{(0)}$. These computations require multiplying the conjugate transpose of the $N_r \times N_t$ channel matrix $\mathbf{H}$ by the $N_r \times 1$ vector $\mathbf{y}$, the $N_t \times N_r$ matrix $\mathbf{H}^H$ by the $N_r \times N_t$ matrix $\mathbf{H}$, and the $N_t \times N_t$ symmetric matrix $\mathbf{W}$ by the $N_t \times 1$ vector $\hat{\mathbf{s}}^{(0)}$. The second set of calculations consists of a series of multiplications of a matrix with a vector in an iterative process, that is, the multiplication of an $N_t \times N_t$ matrix $\mathbf{W}$ with an $N_t \times 1$ vector $\mathbf{z}$. The third set of calculations is the two inner products $\xi^{(k)} = (\mathbf{z}^{(k)}, \mathbf{z}^{(k)})$ and $\phi^{(k)} = (\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k)})$ in an iterative process. The final set of calculations consists of updating $\hat{\mathbf{s}}$ and $\mathbf{z}$ in an iterative process. Note that when the iterative number $k = 1$, $\rho^{(0)} = 1$. Therefore, the multiplications in (9) can be ignored. Table I compares the computational complexities of the various methods. Table II shows the total number of real-valued multiplications for all the algorithms with different $k$ in a 128 × 8 MIMO system. The proposed RCG method has a lower computational complexity than that of the GS [21], CG [32], and conjugate gradient least squares (CGLS) [33] methods. For example, when $N_r = 128$, $N_t = 8$, and $k = 2$, the proposed RCG method requires 21632 RMULs, which is less than those of the GS (23152 RMULs), CG (23304 RMULs), and CGLS (27232 RMULs) methods. The OCD [28] and PCI [29] methods achieve large-scale matrix multiplication and inversion in an implicit way. Compared with other explicit methods, these implicit methods achieve low computational complexity (16416 and 12344 RMULs, respectively), as shown in Table I and Table II. However, these implicit methods ignore the channel hardening of a massive MIMO system. Therefore, in an actual system, the same Gram matrix needs to be calculated multiple times. Hence, the implicit methods suffer from high computational complexity in an actual system, which indicates high energy and area consumptions of implicit architectures. As $k$ increases, the computational complexity of the NSA method increases. When $k < 3$, the NSA method [19] has a lower complexity of $O(N_t^2)$. However, when $k = 3$, this method has a computational complexity of $O(N_t^3)$, and for $k > 3$ its complexity is even higher than that of exact MMSE detection. In general, to ensure detection accuracy, $k$ should be greater than 3 in the NSA method.

TABLE I
NUMBER OF REAL-VALUED MULTIPLICATIONS

TABLE II
NUMBER OF REAL-VALUED MULTIPLICATIONS IN A 128 × 8 MIMO SYSTEM

Additional important aspects of the hardware implementation of the detection algorithm must still be considered. The parallelism of a method is an important issue in both algorithm design

$$\begin{aligned}
T_k\!\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}\right)
&= \frac{1}{2}\left[\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}} + \frac{2\sqrt{\lambda_{\max}\lambda_{\min}}}{\lambda_{\max}-\lambda_{\min}}\right)^k + \left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}} - \frac{2\sqrt{\lambda_{\max}\lambda_{\min}}}{\lambda_{\max}-\lambda_{\min}}\right)^k\right] \\
&= \frac{1}{2}\left[\left(\frac{\left(\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}\right)^2}{\lambda_{\max}-\lambda_{\min}}\right)^k + \left(\frac{\left(\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}\right)^2}{\lambda_{\max}-\lambda_{\min}}\right)^k\right] \\
&= \frac{1}{2}\left[\left(\frac{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}\right)^k + \left(\frac{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}\right)^k\right] \\
&\ge \frac{1}{2}\left(\frac{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}\right)^k
\end{aligned} \tag{38}$$
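As a numerical sanity check of (37)-(41), the following minimal Python sketch evaluates the eigenvalue approximations in (40) for the 128 × 8 configuration, compares the closed form (37) with NumPy's Chebyshev evaluation, and computes the bound factors in (39) and (41). The function names are illustrative and do not come from the paper, and the eigenvalues are the asymptotic approximations of (40), not those of a measured channel.

```python
import math
from numpy.polynomial import chebyshev

def eig_bounds(nr, nt):
    """Approximate largest and smallest eigenvalues of W, eq. (40)."""
    root = 2.0 * math.sqrt(nr * nt)
    return nr + nt + root, nr + nt - root

def cheb_closed_form(k, x):
    """Degree-k Chebyshev polynomial for x >= 1, eq. (37)."""
    r = math.sqrt(x * x - 1.0)
    return 0.5 * ((x + r) ** k + (x - r) ** k)

def cheb_numpy(k, x):
    """Same polynomial evaluated from its Chebyshev-series coefficients."""
    return float(chebyshev.chebval(x, [0.0] * k + [1.0]))

def bound_39(k, lam_max, lam_min):
    """Error-reduction factor on the right-hand side of (39)."""
    a, b = math.sqrt(lam_max), math.sqrt(lam_min)
    return 2.0 * ((a - b) / (a + b)) ** k

def bound_41(k, nr, nt):
    """Error-reduction factor on the right-hand side of (41)."""
    return 2.0 * (nt / nr) ** (k / 2.0)

lam_max, lam_min = eig_bounds(128, 8)            # 200 and 72
x = (lam_max + lam_min) / (lam_max - lam_min)    # argument of T_k in (36)
for k in (1, 2, 3):
    print(k, cheb_closed_form(k, x), bound_39(k, lam_max, lam_min), bound_41(k, 128, 8))
```

For $N_r = 128$ and $N_t = 8$, (40) gives $\lambda_{\max} \approx 200$ and $\lambda_{\min} \approx 72$, so the per-iteration contraction in (39) equals $\sqrt{N_t/N_r} = 0.25$ and the two bounds coincide; after $k = 3$ iterations the factor is $2 \times 0.25^3 \approx 0.031$.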
Fig. 7. Architecture of the processing element array.
Fig. 9. Architecture of the user-level pipeline.
TABLE III
BIT-WIDTH OF ALL RCG VARIABLES
TABLE IV
COMPARISON OF RESOURCE USAGE ON A XILINX VIRTEX-7 FPGA
a) SNR values are targeted to an SER of $10^{-2}$ with floating-point designs, and the iteration number is listed for each algorithm in the simulation. In this work, the SNR is 10.38 dB with a fixed-point design. The SNR in this work is close to the performance of other algorithms. In the case of similar detection accuracy, the throughput, resource consumption and hardware efficiency are compared with related designs.
b) Hardware efficiency is defined as throughput/normalized source consumption (NSC), which is computed as NSC = LUTs + FFs + DSP × 280 [30].
TABLE V
COMPARISON OF ASIC IMPLEMENTATION RESULTS
a) SNR values are targeted to an SER of $10^{-2}$ with floating-point designs, and the iteration number is listed for each algorithm in the simulation. In this work, the SNR is 10.38 dB with a fixed-point design. The SNR in this work is close to the performance of other algorithms. In the case of similar detection accuracy, the throughput, area, power consumption, energy and area efficiencies are compared with related designs.
b) Energy and area efficiencies are defined as throughput/power and throughput/area (gate count), respectively.
c) There are two kinds of normalized aspects. (1) Technology normalized to 65 nm technology. According to [9], [13], [25], [28], frequency and power are increased by $s$ and $\frac{1}{s} \times \left(\frac{V'_{dd}}{V_{dd}}\right)^2$, respectively. (2) MIMO size normalized to 128 × 8. According to [12], throughput is increased by $\frac{8}{N_t} \times \frac{\log_2 N_t}{\log_2 8}$, and power and area are increased by $\frac{128}{N_r} \times \frac{8}{N_t} \times \frac{\log_2 N_t}{\log_2 8}$.
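The efficiency definitions in footnote b of Table V can be checked against the chip figures reported in this paper (1.5 Gbps throughput, 557 mW power, 1372 kGE gate count). The short Python sketch below does only that arithmetic; the function and constant names are illustrative, and no technology or MIMO-size normalization is applied.

```python
def energy_efficiency(throughput_mbps, power_mw):
    """Energy efficiency in Mbps/mW: throughput divided by power."""
    return throughput_mbps / power_mw

def area_efficiency(throughput_mbps, gate_count_kge):
    """Area efficiency in Mbps/kGE: throughput divided by gate count."""
    return throughput_mbps / gate_count_kge

# Figures reported for the fabricated 65 nm chip.
THROUGHPUT_MBPS = 1500.0
POWER_MW = 557.0
GATE_COUNT_KGE = 1372.0

print(round(energy_efficiency(THROUGHPUT_MBPS, POWER_MW), 2))      # 2.69
print(round(area_efficiency(THROUGHPUT_MBPS, GATE_COUNT_KGE), 2))  # 1.09
```

Both values match the 2.69 Mbps/mW and 1.09 Mbps/kGE figures quoted for the proposed architecture.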
core area, 14.173-102.7 Mbps/mW energy efficiency, and 0.499-6.855 Mbps/kGE area efficiency, respectively. Like [17], the architecture in [12] achieves high efficiency because of the small MIMO size. Hence, the energy and area efficiencies are normalized to the same MIMO size and technology to achieve fair comparisons. The RCG architecture achieves 6.85×, 7.18×, and 2.29× the normalized energy efficiency of the EPD [10], CHD [16], and MPD [12] architectures, respectively. In addition, the normalized area efficiency of the RCG architecture is 11.7×, 2.88×, and 2.39× that of the EPD [10], CHD [16], and MPD [12] architectures, respectively. Note that the architectures in [10] and [16] support higher modulation schemes.

The proposed architecture has a throughput of 1.5 Gbps, and the throughputs of the related works are 0.3-8 Gbps [12], [16]. Notably, the frequency of the architecture proposed in [19], [20] reaches 1 GHz. Because the throughput is related to the amount of resources provided, area (gate count) and power are both critical for evaluating efficiency. In the proposed architecture, the gate count is 1372 kGE, and the power consumption is 557 mW. In related works, the gate counts are 347-6650 kGE [17], [19], [20], and the power consumptions are 26.5-1720 mW. Without considering preprocessing, the gate count of the proposed architecture is 596 kGE, and the power consumption is 120 mW. The gate counts of architectures in [17], [19], [20] are 148-3607 kGE, and the power consumptions are 18-127 mW. [10] and [16] adopt 28 nm technology, [12] adopts 40 nm technology, and [19] and [20] adopt 45 nm technology. This paper adopts 65 nm technology. The normalization methodology used in this paper has two aspects: (1) technology normalized to 65 nm technology; (2) MIMO size normalized to 128 × 8. These two normalization methods are commonly used in related research and could provide reasonable information. The normalized energy efficiency of the proposed architecture is higher than that of [19], [20], and the normalized area efficiency is higher than that of [10], [16], [25]. The proposed architecture has advantages over related works in terms of throughput, power consumption, area, normalized energy and area efficiencies, owing mainly to the following reasons: (1) Low complexity of the algorithm: According to Table I and Table II, the proposed RCG algorithm has lower complexity than the other algorithms, such as NSA [19], [20] and CHD [16]. When compared with nonlinear algorithms, such as EPD [10] and MPD [12], the RCG algorithm also maintains its complexity advantage. Therefore, the RCG algorithm will consume fewer hardware resources than other algorithms when realizing similar throughput. Additionally, higher normalized energy and area efficiencies can be achieved. Complexity is one consideration for the algorithm design in this paper. (2) High parallelism of the algorithm: According to the analysis of Section III-C, the proposed RCG algorithm has high parallelism in the iterative process and therefore has higher throughput and normalized energy and area efficiencies. The other related designs, such as CHD [16], LUD [17], and WeJi [25], face problems with parallelism. High parallelism is also an important consideration for the algorithm design of this paper. (3) Optimized co-design of the algorithm and architecture: The design of the algorithm fully considers the hardware implementation. In other words, the algorithm is a hardware-friendly design. Meanwhile, the architecture design fully utilizes the advantage of the proposed algorithm, which improves the hardware utilization and reduces resource consumption. (4) Optimized fixed-point design: The fixed-point design of the proposed architecture is determined by a large number of simulations. The fixed-point design achieves a short bit-width with an acceptable SER performance loss. (5) Deep pipeline design: In the whole architecture, a deep pipeline is designed to improve the throughput according to the properties of the proposed algorithm. For example, the processing element array and user-level pipeline module in Section V-A and Section V-B implement deep pipelines according to the characteristics of the algorithm. Therefore, high throughput and hardware utilization are achieved, which results in high energy and area efficiencies. Based on the optimized design of both the algorithm and architecture, the proposed architecture achieves advantages over related works in terms of normalized energy and area efficiencies.

VII. CONCLUSION

We have proposed a modified massive MIMO detection algorithm based on the RCG method. In addition, according to the properties of massive MIMO systems, a quadrant-certain-based initial method is proposed. Moreover, an approximated LLR computation method is proposed in RCG to simplify the calculations. Then, we theoretically demonstrate that the proposed RCG method achieves low approximated error. Simulation results show that the SNR required to achieve an SER of $10^{-2}$ is 10.16 dB, which is close to that of the state-of-the-art designs (9.66-10.42 dB). To demonstrate the effectiveness of our algorithm, we have proposed a VLSI architecture that consists of a parallel processing element array and a deeply pipelined user-level method. We have verified the architecture on FPGA and fabricated a chip on silicon with TSMC 65 nm CMOS technology. The chip achieves 1.5 Gbps throughput with 3.5 mm² silicon area and 557 mW power consumption at 1.2 V. This architecture achieves 2.69 Mbps/mW energy efficiency and 1.09 Mbps/kGE area efficiency, which are 2.39 to 10.60× and 1.15 to 8.81× those of state-of-the-art designs, respectively. In future work, we plan to consider algorithms and VLSI architectures for situations with imperfect channel information. In addition, algorithms and architectures for higher modulation schemes will be considered.

REFERENCES

[1] G. Peng, L. Liu, S. Zhou, Q. Wei, S. Yin, and S. Wei, “A 2.69 Mbps/mW 1.09 Mbps/kge conjugate gradient-based MMSE detector for 64-QAM 128 × 8 massive MIMO systems,” in Proc. IEEE Asian Solid State Circuits Conf., Tainan, Taiwan, Nov. 2018, pp. 191–194.
[2] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, Nov. 2010.
[3] F. Rusek et al., “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[4] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, “Energy and spectral efficiency of very large multiuser MIMO systems,” IEEE Trans. Commun., vol. 61, no. 4, pp. 1436–1449, Apr. 2013.
[5] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.
[6] M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Trans. Inf. Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003.
[7] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
[8] G. Peng, L. Liu, S. Zhou, Y. Xue, S. Yin, and S. Wei, “Algorithm and architecture of a low-complexity and high-parallelism preprocessing-based K-best detector for large-scale MIMO systems,” IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1860–1875, Apr. 2018.
[9] C.-H. Liao, T.-P. Wang, and T.-D. Chiueh, “A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications,” IEEE J. Solid-State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.
[10] W. Tang, H. Prabhu, L. Liu, V. Öwall, and Z. Zhang, “A 1.8Gb/s 70.6pJ/b 128 × 16 link-adaptive near-optimal massive MIMO detector in 28 nm UTBB-FDSOI,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2018, pp. 224–226.
[11] W. Tang, C.-H. Chen, and Z. Zhang, “A 0.58 mm² 2.76 Gb/s 79.8 pJ/b 256-QAM massive MIMO message-passing detector,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2016, pp. 1–2.
[12] Y. T. Chen, C. C. Cheng, T. L. Tsai, W. C. Sun, Y. L. Ueng, and C. H. Yang, “A 501 mW 7.61 Gb/s integrated message-passing detector and decoder for polar-coded massive MIMO systems,” in Proc. IEEE Symp. VLSI Circuits, Kyoto, Japan, Jun. 2017, pp. 330–331.
[13] O. Castaneda, T. Goldstein, and C. Studer, “Data detection in large multi-antenna wireless systems via approximate semidefinite relaxation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 12, pp. 2334–2346, Dec. 2016.
[14] Y. Jiang, M. K. Varanasi, and J. Li, “Performance analysis of ZF and MMSE equalizers for MIMO systems: An in-depth study of the high SNR regime,” IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2008–2026, Apr. 2011.
[15] S. Ozyurt and M. Torlak, “Exact joint distribution analysis of zero-forcing V-BLAST gains with greedy ordering,” IEEE Trans. Wireless Commun., vol. 12, no. 11, pp. 5377–5385, Dec. 2012.
[16] H. Prabhu, J. N. Rodrigues, L. Liu, and O. Edfors, “A 60 pJ/b 300 Mb/s 128 × 8 massive MIMO precoder-detector in 28 nm FD-SOI,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2017, pp. 60–61.
[17] C. H. Chen, W. Tang, and Z. Zhang, “A 2.4 mm² 130 mW MMSE-nonbinary-LDPC iterative detector-decoder for 4 × 4 256-QAM MIMO in 65 nm CMOS,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2015, pp. 338–340.
[18] B. Yin, M. Wu, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, “A 3.8 Gb/s large-scale MIMO detector for 3GPP LTE-advanced,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 2014, pp. 3879–3883.
[19] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, “Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA implementations,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5,
[26] J. Minango, C. de Almeida, and C. D. Altamirano, “Low-complexity MMSE detector for massive MIMO systems based on damped Jacobi method,” in Proc. IEEE Int. Symp. Pers., Indoor Mobile Radio Commun., Montreal, QC, Canada, Oct. 2017, pp. 1–5.
[27] J. Chen, C. Fei, H. Lu, G. E. Sobelman, and J. Hu, “Hardware efficient massive MIMO detector based on the Monte Carlo tree search method,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 7, no. 4, pp. 523–533, Dec. 2017.
[28] M. Wu, C. Dick, J. R. Cavallaro, and C. Studer, “High-throughput data detection for massive MU-MIMO-OFDM using coordinate descent,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 12, pp. 2357–2367, Dec. 2016.
[29] G. Peng, L. Liu, P. Zhang, S. Yin, and S. Wei, “Low-computing-load, high-parallelism detection method based on Chebyshev iteration for massive MIMO systems with VLSI architecture,” IEEE Trans. Signal Process., vol. 65, no. 14, pp. 3775–3788, Jul. 2017.
[30] J. Chen, Z. Zhang, H. Lu, J. Hu, and G. E. Sobelman, “An intra-iterative interference cancellation detector for large-scale MIMO communications based on convex optimization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 11, pp. 2062–2072, Nov. 2016.
[31] D. R. Kincaid and E. W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, 3rd ed. Belmont, CA, USA: Wadsworth, 2002.
[32] B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “Conjugate gradient-based soft-output detection and precoding in massive MIMO systems,” in Proc. IEEE Global Telecommun. Conf., Austin, TX, USA, Dec. 2014, pp. 3696–3701.
[33] B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “VLSI design of large-scale soft-output MIMO detection using conjugate gradients,” in Proc. IEEE Int. Symp. Circuits Syst., Lisbon, Portugal, May 2015, pp. 1498–1501.
[34] B. E. Godana and T. Ekman, “Parametrization based limited feedback design for correlated MIMO channels using new statistical models,” IEEE Trans. Wireless Commun., vol. 12, no. 10, pp. 5172–5184, Oct. 2013.

Leibo Liu (Senior Member, IEEE) received the B.S. degree in electronic engineering and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 1999 and 2004, respectively. He is currently a Professor with the Institute of Microelectronics, Tsinghua University. His current research interests include reconfigurable computing, mobile computing, and VLSI digital signal processing.

Guiqiang Peng received the B.S. degree from the
pp. 916–929, Oct. 2014. School of Micro-Electronics and Solid-State Elec-
[20] B. Yin, “Low complexity detection and precoding for massive mimo tronics, University of Electronic Science and Tech-
systems: Algorithm, architecture, and application,” Ph.D. dissertation, nology of China, Chengdu, China, in 2013. He is
Dept. Elect. Comput. Eng., Rice Univ., Houston, TX, USA, 2014. currently working toward the Ph.D. degree with
[21] L. Dai, X. Gao, X. Su, S. Han, I. Chih-Lin, and Z. Wang, “Low-Complexity the Institute of Microelectronics, Tsinghua Uni-
soft-output signal detection based on gauss-seidel method for uplink versity, Beijing, China. His current research inter-
multi-user large-scale MIMO systems,” IEEE Trans. Veh. Technol., vol. 64, ests include reconfigurable computing, mobile com-
no. 10, pp. 4839–4845, Oct. 2015. puting, and VLSI signal processing and wireless
[22] Z. Wu, C. Zhang, Y. Xue, S. Xu, and X. You, “Efficient architecture communications.
for soft-output massive MIMO detection with Gauss-Seidel method,” in
Proc. IEEE Int. Symp. Circuits Syst., Montreal, QC, Canada, May 2016,
pp. 1886–1889.
[23] X. Gao, L. Dai, Y. Hu, and Z. Wang, “Matrix inversion-less signal
detection using SOR method for uplink large-scale MIMO systems,” in Pan Wang received the B.S. degree in microelec-
Proc. IEEE Global Telecommun. Conf., San Diego, CA, USA, Dec. 2015, tronics science and engineering from Central South
pp. 3291–3295. University, Changsha, China, in 2017. He is currently
[24] P. Zhang, L. Liu, G. Peng, and S. Wei, “Large-scale MIMO detection working toward the master’s degree with the Institute
design and FPGA implementations using SOR method,” in Proc. IEEE of Microelectronics, Tsinghua University, Beijing,
Int. Conf. Commun. Softw. Netw., Beijing, China, Jun. 2016, pp. 206–210. China. His current research interests include recon-
[25] G. Peng, L. Liu, S. Zhou, S. Yin, and S. Wei, “A 1.58 Gbps/W 0.40 figurable computing, mobile computing, and VLSI
Gbps/mm2 ASIC implementation of MMSE detection for 128 × 8 signal processing and wireless communications.
64-QAM massive MIMO in 65 nm CMOS,” IEEE Trans. Circuits Syst. I,
Reg. Papers, vol. 65, no. 5, pp. 1717–1730, May 2018.
Sheng Zhou (Member, IEEE) received the B.E. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2005 and 2011, respectively. From January to June 2010, he was a Visiting Student with the Wireless Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA, USA. He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. His research interests include cross-layer design for multiple-antenna systems, edge computing and caching, and green wireless communications.
Qiushi Wei received the B.S. degree from the School of Physical Electronics, University of Electronic Science and Technology of China, Chengdu, China, in 2015. He is currently working toward the master’s degree with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include reconfigurable computing, mobile computing, and VLSI signal processing and wireless communications.
Shouyi Yin (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000, 2002, and 2005, respectively. He was with Imperial College London as a Research Associate. He is currently with the Institute of Microelectronics, Tsinghua University, as an Associate Professor. His research interests include mobile computing, wireless communications, and SoC design.
Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from La Faculté Polytechnique de Mons, Mons, Belgium, in 1991. He became a Professor with the Institute of Microelectronics, Tsinghua University, in 1995. He is currently a Senior Member of the Chinese Institute of Electronics. His main research interests include VLSI SoC design, EDA methodology, and ASIC design for communications.