
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 68, 2020 573

Energy- and Area-Efficient Recursive-Conjugate-Gradient-Based MMSE Detector for Massive MIMO Systems

Leibo Liu, Senior Member, IEEE, Guiqiang Peng, Pan Wang, Sheng Zhou, Member, IEEE, Qiushi Wei, Shouyi Yin, Member, IEEE, and Shaojun Wei, Fellow, IEEE

Abstract: Minimum-mean-square-error (MMSE) detection is increasingly relevant for massive multiple-input multiple-output (MIMO) systems. MMSE suffers from high computational complexity and low parallelism because of the increasing number of users and antennas in massive MIMO systems. This paper proposes a recursive conjugate gradient (RCG) method to iteratively estimate signals. First, a recursive conjugate gradient detection algorithm is proposed that achieves high parallelism and low complexity through iteration. Second, a quadrant-certain-based initial method that improves detection accuracy without added complexity is proposed. Third, an approximated log likelihood ratio (LLR) computation method is proposed to achieve simplified calculation. The analyses show that compared with related methods, the proposed RCG algorithm reduces computational complexity and exploits the potential parallelism. RCG is mathematically demonstrated to achieve low approximated error. Based on the RCG method, an architecture is proposed in a 128 × 8 64-QAM massive MIMO system. First, a parallel processing element array with single-sided input is adopted; this array eliminates the throughput limitation. Second, a deeply pipelined user-level method based on the recursive conjugate gradient method is proposed. Third, an approximated architecture is proposed to compute the soft output. The architecture is verified on an FPGA and fabricated on 1.87 × 1.87 mm² silicon with TSMC 65 nm CMOS technology. The chip achieves 2.69 Mbps/mW and 1.09 Mbps/kG energy efficiency (throughput/power) and area efficiency (throughput/area), respectively, which are 2.39 to 10.60× and 1.15 to 8.81× those of the normalized state-of-the-art designs.

Index Terms: Massive multiple-input multiple-output (MIMO), detection, minimum mean square error (MMSE), recursive conjugate gradient, very-large-scale integration (VLSI), wireless communications.

Manuscript received January 2, 2019; revised July 5, 2019 and October 19, 2019; accepted January 1, 2020. Date of publication January 6, 2020; date of current version January 24, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xinming Huang. This work was supported in part by the National Natural Science Foundation of China under Grant 61834002, in part by the National Key R&D Program of China under Grant 2018YFB2202101. This article was presented in part at the 2018 IEEE Asian Solid-State Circuits Conference (A-SSCC), Tainan, Taiwan, November 2018 [1]. (Leibo Liu and Guiqiang Peng contributed equally to this work.) (Corresponding author: Shaojun Wei.)

L. Liu, G. Peng, P. Wang, Q. Wei, S. Yin, and S. Wei are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: liulb@tsinghua.edu.cn; pgq13@mails.tsinghua.edu.cn; wp17@mails.tsinghua.edu.cn; weiqs15@mails.tsinghua.edu.cn; yinsy@tsinghua.edu.cn; wsj@tsinghua.edu.cn).

S. Zhou is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: sheng.zhou@tsinghua.edu.cn).

Digital Object Identifier 10.1109/TSP.2020.2964234

I. INTRODUCTION

MASSIVE multiple-input multiple-output (MIMO), a key technology, can be used in next-generation wireless communication systems, such as 5G [2]–[4]. However, massive MIMO still faces various challenges, one of which is uplink signal detection. Data detection in MIMO generally involves low parallelism and high computational complexity, particularly when the number of antennas increases, which makes it a challenging task. Consequently, an efficient signal detection algorithm that satisfies the requirements of low complexity, high parallelism, and high accuracy is highly desirable. Maximum likelihood (ML) detection [5], [6] is an optimal detection algorithm. Nonetheless, its computational complexity increases notably with the number of users and the modulation order, which prevents the practical application of the ML algorithm. Other nonlinear algorithms, such as K-best [7], [8], sphere decoding (SD) [9], expectation-propagation detection (EPD) [10], message-passing detector (MPD) [11], [12], and triangular approximate semidefinite relaxation (TASER) [13], can achieve near-optimal ML detection performance. However, K-best and SD involve QR decomposition when the channel matrix is large, and these two algorithms are difficult to implement because of their high computational complexities. In addition, there are complex low-parallelism iterations in the EPD, MPD, and TASER algorithms. These nonlinear algorithms also require abundant area and power for hardware implementation. Various linear detection algorithms have been proposed to reduce the computational complexity [14]–[17] and can be employed in a massive MIMO system with a large but finite number of antennas and a comparatively small number of users. These linear methods can achieve near-optimal performance with relatively low computational complexity compared with nonlinear methods. Among these linear algorithms, the minimum mean square error (MMSE) is one of the most effective algorithms in reducing computational complexity with minimal detection accuracy loss; consequently, it has significant potential for use in practical massive MIMO systems [3], [16], [17].

However, considering the hardware architecture, MMSE detection involves complicated matrix inversions and multiplications, as well as low parallelism, causing difficulties for hardware

1053-587X © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications_standards/publications/rights/index.html for more information.


implementations with an increasing number of users. This issue limits the application of hardware architectures (detectors) in massive MIMO systems [16]–[19]. Consequently, many methods have been proposed to further reduce computational complexity, improve the parallelism of MMSE, and optimize hardware architectures. In [16] and [17], Cholesky decomposition (CHD)-based detectors and LU decomposition (LUD)-based detectors were proposed and implemented in hardware with high detection accuracy. However, the throughput is limited, and the architecture requires a significant amount of hardware resources. Neumann series approximation (NSA)-based architectures have been proposed to achieve high throughput for massive MIMO detection [18]–[20]. However, only a marginal reduction in complexity can be achieved. To achieve a reasonable balance of detection accuracy, throughput, and hardware resource consumption, architectures based on approximation methods, such as the Gauss-Seidel (GS) [21], [22], successive over-relaxation (SOR) [23], [24], weighted Jacobi iteration (WeJi) [25], [26], and Monte Carlo tree search (MCTS) [27] methods, have also been proposed. However, the computations in the GS, SOR, WeJi, and MCTS methods are difficult to parallelize due to high correlations when estimating each symbol from users. To explore the parallelism between each step, implicit methods have been proposed, including optimized coordinate descent (OCD) [28], parallelizable Chebyshev iteration (PCI) [29], and intraiterative interference cancellation (IIC) [30]. However, these implicit methods ignore the unique properties of massive MIMO systems (e.g., channel hardening). Therefore, the same Gram matrix needs to be calculated multiple times, which means that implicit method architectures suffer from higher energy consumption and latency than explicit methods.

Contributions: In this paper, we propose a recursive conjugate gradient (RCG) method to achieve massive MIMO detection with corresponding very-large-scale integration (VLSI) designs. Our contributions are summarized as follows:

1) We propose a modified conjugate gradient method to solve the detection problem with high parallelism and low complexity in the massive MIMO system.
2) According to the property of the massive MIMO system, we propose a quadrant-certain-based initial method and an approximated log likelihood ratio (LLR) method, both of which reduce the computational complexity while maintaining high detection accuracy.
3) We mathematically analyze the approximated error, complexity, and parallelism of the proposed RCG method. RCG maintains its advantages in terms of these factors. In addition, we provide symbol-error-rate (SER) simulation results to show that compared with related methods, the RCG method achieves high detection accuracy.
4) We develop a VLSI architecture for RCG that uses a parallel processing element (PE) array and a deeply pipelined user-level method to achieve high throughput with low hardware consumption.
5) The architecture is verified on an FPGA and fabricated on silicon. The chip achieves a 1.5 Gbps throughput under a 500 MHz working frequency while dissipating 557 mW at 1.2 V. We show reference FPGA and ASIC implementation results and compare our designs to those of recently reported massive MIMO detection implementations.

Our results demonstrate that RCG provides near-optimal detection accuracy with high parallelism and low complexity, and compared with other state-of-the-art architectures, the corresponding architecture achieves substantial improvements in terms of energy and area efficiencies.

Notation: Bold uppercase, bold lowercase, and lowercase letters denote matrices, vectors, and scalars, respectively. $(\cdot)_{i,j}$ denotes the element in the $i$th row and $j$th column of a matrix. $(\cdot)_i$ denotes the $i$th element of a vector. $\mathbf{I}_M$ denotes the $M \times M$ identity matrix. $(\cdot)^H$ and $(\cdot)^{-1}$ denote conjugate transpose and inversion, respectively.

Outline: The remainder of this paper is organized as follows. Section II briefly introduces the system model and motivation. Section III describes the proposed recursive conjugate gradient method for a massive MIMO system. Section IV shows the symbol-error-rate simulation results and comparisons. Section V presents the proposed VLSI architecture. Section VI shows the hardware implementation results and their comparisons with state-of-the-art designs. Conclusions are drawn in Section VII.

II. SYSTEM MODEL AND MOTIVATION

In an $N_r \times N_t$ MIMO system with $N_t$ transmitters on the user side and $N_r$ antennas on the base station (BS) side (predominantly $N_r \gg N_t$ [2]–[4]), the uplink system can be modeled as [10]–[12]

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$

where $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ represents a Rayleigh flat-fading channel matrix; $\mathbf{s} \in \mathbb{C}^{N_t \times 1}$ denotes the transmitted signal vector, which is based on the 64-QAM modulation constellation set $\Omega$ in this paper; $\mathbf{n} \in \mathbb{C}^{N_r \times 1}$ is an additive white Gaussian noise vector with zero mean and variance $\sigma^2$; and $\mathbf{y} \in \mathbb{C}^{N_r \times 1}$ is the signal vector received at the BS. In classical MMSE detection, the estimation of the transmitted signal can be expressed as [16], [17]

$$\hat{\mathbf{s}} = \left(\mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}\right)^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{W}^{-1}\mathbf{y}^{MF}, \tag{2}$$

where $\mathbf{W} = \mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}$ is the MMSE filtering matrix with a noise power spectral density of $N_0$ and a signal power spectral density of $E_s$, and $\mathbf{y}^{MF} = \mathbf{H}^H\mathbf{y}$ denotes the matched-filter vector. According to (1) and (2), the vector $\hat{\mathbf{s}}$ can be described as

$$\hat{\mathbf{s}} = \mathbf{W}^{-1}\mathbf{H}^H\mathbf{H}\mathbf{s} + \mathbf{W}^{-1}\mathbf{H}^H\mathbf{n}, \tag{3}$$

where $\mathbf{U} = \mathbf{W}^{-1}\mathbf{H}^H\mathbf{H}$ and $\mathbf{V} = \mathbf{U}\mathbf{W}^{-1}$ are the equivalent channel matrices, which can be used to compute the equivalent channel gain and the postequalization noise plus interference (NPI, $\sigma^2$), respectively. The soft-output LLR for the $b$-th bit index and $i$-th user satisfies

$$L_{i,b} = \frac{U_{i,i}^2}{\sigma_i^2}\left(\min_{s \in S_b^0}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2 - \min_{s \in S_b^1}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2\right) = \zeta_i^2\,\varphi_b(\hat{s}_i), \tag{4}$$


where $\zeta_i^2$ is the signal-to-interference-plus-noise ratio (SINR) for the $i$-th user; $\varphi_b(\hat{s}_i)$ is a piecewise linear function for Gray mappings; and $S_b^0$ and $S_b^1$ denote the sets of modulation constellation symbols whose $b$-th bits are 0 and 1, respectively. In addition, $\sigma_i^2 = \sum_{j \neq i}^{N_t} |U_{i,j}|^2 E_s + V_{i,i} N_0$.

In massive MIMO systems, MMSE detection can achieve near-optimal performance [3]. However, the result of the large-scale multiplication that forms the Gram matrix is required in the subsequent computations. Considering hardware implementation, this limitation significantly influences the throughput of the entire detector and affects resource consumption [16], [19], [25]. Furthermore, the computational complexities of the matrix inversion $\mathbf{W}^{-1}$ in (2) and the LLR in (4) are high, particularly in systems with a large number of antennas [16], [19], [25]. Considering hardware implementation, the computation of the matrix $\mathbf{W}^{-1}$ restricts system parallelism due to the high correlation of each computation.

III. RECURSIVE CONJUGATE GRADIENT METHOD FOR MASSIVE MIMO DETECTION

First, this section describes a recursive conjugate gradient detection algorithm that approximates an MMSE detector in a massive MIMO system. For the formulation of this algorithm, an optimized conjugate gradient-based method is presented and a quadrant-certain-based initial method is proposed. In addition, an approximate method of computing LLRs is presented. Second, it is demonstrated that the proposed method has a low approximated error. Third, analyses of the proposed recursive conjugate gradient detection algorithm are presented to show its advantages in terms of computational complexity and parallelism in comparison with other algorithms.

A. Proposed Recursive Conjugate Gradient Detection

1) Optimized Iteration Method: Conjugate Gradient (CG) iteration is a promising method for solving linear equations such as (2) [31]. Therefore, CG iteration can be used in the MMSE detection algorithm to reduce computational complexity by avoiding large-scale matrix inversions. However, the traditional CG method still has some shortcomings. In the original algorithm, there is strong data dependency between each iteration and between each element of the estimated vector in one iteration. To solve these problems, based on the CG method, Recursive Conjugate Gradient (RCG) iteration can be expressed as

$$\hat{\mathbf{s}}^{(k+1)} = \hat{\mathbf{s}}^{(k)} + \alpha^{(k+1)}\mathbf{p}^{(k+1)}, \tag{5}$$

where $\mathbf{p}^{(k)}$ is an orthogonal basis, $k$ is the iteration number, and the parameter $\alpha^{(k)}$ can be calculated as

$$\alpha^{(k)} = \frac{\left(\mathbf{z}^{(0)}, \mathbf{p}^{(k)}\right)}{\left(\mathbf{W}\mathbf{p}^{(k)}, \mathbf{p}^{(k)}\right)}. \tag{6}$$

In (6), $\mathbf{z}^{(k)}$ represents the residual vector, which can be expressed as

$$\mathbf{z}^{(k)} = \mathbf{y}^{MF} - \mathbf{W}\hat{\mathbf{s}}^{(k)}. \tag{7}$$

In addition, according to the Lanczos orthogonalization algorithm [31], the residual vector $\mathbf{z}^{(k+1)}$ can be expressed as

$$\mathbf{z}^{(k+1)} = \rho^{(k)}\left(\mathbf{z}^{(k)} - \gamma^{(k)}\boldsymbol{\eta}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\mathbf{z}^{(k-1)}, \tag{8}$$

where $\rho^{(k)}$, $\gamma^{(k)}$, and $\boldsymbol{\eta}^{(k)} = \mathbf{W}\mathbf{z}^{(k)}$ are iterative parameters. When computing the iterative residual, the RCG algorithm improves on the original CG method by using the Lanczos orthogonalization algorithm to increase the parallelism of inter-iteration and intra-iteration computation. Combining (7) and (8), the estimation of the transmitted vector can be expressed as

$$\hat{\mathbf{s}}^{(k+1)} = \rho^{(k)}\left(\hat{\mathbf{s}}^{(k)} + \gamma^{(k)}\mathbf{z}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\hat{\mathbf{s}}^{(k-1)}. \tag{9}$$

Next, the iterative parameters must be computed. Because the vectors $\mathbf{z}^{(k+1)}$, $\mathbf{z}^{(k)}$, and $\mathbf{z}^{(k-1)}$ are mutually orthogonal [31], there are

$$\left(\mathbf{z}^{(k+1)}, \mathbf{z}^{(k)}\right) = \left(\mathbf{z}^{(k-1)}, \mathbf{z}^{(k)}\right) = \left(\mathbf{z}^{(k-1)}, \mathbf{z}^{(k+1)}\right) = 0. \tag{10}$$

Hence, combining (8) and (10), the parameters can be obtained as

$$\gamma^{(k)} = \frac{\xi^{(k)}}{\phi^{(k)}}; \qquad \rho^{(k)} = \frac{\xi^{(k-1)}}{\xi^{(k-1)} + \gamma^{(k)}\left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k-1)}\right)}, \tag{11}$$

where $\xi^{(k)}$ and $\phi^{(k)}$ can be computed as

$$\xi^{(k)} = \left(\mathbf{z}^{(k)}, \mathbf{z}^{(k)}\right); \qquad \phi^{(k)} = \left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k)}\right). \tag{12}$$

In addition, according to (8) and (10), there are

$$\left(\boldsymbol{\eta}^{(k)}, \mathbf{z}^{(k-1)}\right) = \left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k-1)}\right), \qquad \boldsymbol{\eta}^{(k-1)} = -\frac{\mathbf{z}^{(k)}}{\rho^{(k-1)}\gamma^{(k-1)}} + \frac{\mathbf{z}^{(k-1)}}{\gamma^{(k-1)}} + \frac{\left(1 - \rho^{(k-1)}\right)\mathbf{z}^{(k-2)}}{\rho^{(k-1)}\gamma^{(k-1)}}. \tag{13}$$

Therefore, the parameter $\rho^{(k)}$ can be computed as

$$\rho^{(k)} = \left(1 - \frac{\gamma^{(k)}}{\gamma^{(k-1)}}\,\frac{\xi^{(k)}}{\xi^{(k-1)}}\,\frac{1}{\rho^{(k-1)}}\right)^{-1}. \tag{14}$$

In the proposed algorithm, computing the iteration and its parameters requires initial settings, namely $\rho^{(0)} = 1$, $\mathbf{z}^{(-1)} = \mathbf{z}^{(0)}$, and $\hat{\mathbf{s}}^{(-1)} = \hat{\mathbf{s}}^{(0)}$.

After the iteration, points need to be found in the constellation graph according to $\hat{\mathbf{s}}^{(k)}$. In the constellation graph, the round-off method can be used to find the nearest points rather than calculating the distances between an element of $\hat{\mathbf{s}}^{(k)}$ and all constellation points. In this way, the computational complexity is decreased from quadratic to linear.
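The recursion in (8), (9), (11), (12), and (14) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rcg_solve`, the random test channel, the values chosen for $N_0$ and $E_s$, the zero initial vector, and the early-exit threshold are all hypothetical choices made for this example. The sketch compares the iterate with a direct solve of $\mathbf{W}\hat{\mathbf{s}} = \mathbf{y}^{MF}$.

```python
import numpy as np

def rcg_solve(W, y_mf, s0, num_iters):
    """Three-term recursion following (8), (9), (11), (12) and (14)."""
    s_prev, s_cur = s0.copy(), s0.copy()       # s^(-1) = s^(0)
    z_cur = y_mf - W @ s_cur                   # residual, eq. (7)
    z_prev = z_cur.copy()                      # z^(-1) = z^(0)
    rho_prev = gamma_prev = xi_prev = 1.0      # placeholders, unused at t = 0
    for t in range(num_iters):
        eta = W @ z_cur                        # eta^(t) = W z^(t)
        xi = np.vdot(z_cur, z_cur).real        # xi^(t) = (z, z)
        if xi < 1e-30:                         # assumed guard: already converged
            break
        phi = np.vdot(z_cur, eta).real         # phi^(t) = (eta, z), real for Hermitian W
        gamma = xi / phi                       # eq. (11)
        if t == 0:
            rho = 1.0                          # rho^(0) = 1
        else:                                  # eq. (14)
            rho = 1.0 / (1.0 - (gamma * xi) / (gamma_prev * xi_prev * rho_prev))
        s_next = rho * (s_cur + gamma * z_cur) + (1.0 - rho) * s_prev   # eq. (9)
        z_next = rho * (z_cur - gamma * eta) + (1.0 - rho) * z_prev     # eq. (8)
        s_prev, s_cur = s_cur, s_next
        z_prev, z_cur = z_cur, z_next
        rho_prev, gamma_prev, xi_prev = rho, gamma, xi
    return s_cur

# Illustrative 128 x 8 uplink with 64-QAM symbols (all values assumed).
rng = np.random.default_rng(0)
Nr, Nt, N0, Es = 128, 8, 1.0, 42.0             # Es = 42 for levels +/-1..+/-7
levels = np.arange(-7, 8, 2)
s = rng.choice(levels, Nt) + 1j * rng.choice(levels, Nt)
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
n = np.sqrt(N0 / 2) * (rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr))
y = H @ s + n

W = H.conj().T @ H + (N0 / Es) * np.eye(Nt)    # MMSE filtering matrix, eq. (2)
y_mf = H.conj().T @ y                          # matched-filter vector
s_exact = np.linalg.solve(W, y_mf)             # reference MMSE estimate
s_rcg = rcg_solve(W, y_mf, np.zeros(Nt, dtype=complex), Nt)
rel_err = np.linalg.norm(s_rcg - s_exact) / np.linalg.norm(s_exact)
```

Running the recursion for $N_t$ steps is used here only to confirm that the three-term form reproduces the direct solve; the paper's operating points use far fewer iterations.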


2) Quadrant-Certain-Based Initial Solution: Any initial solution eventually leads to the final solution, but when the number of iterations is limited, the initial solution affects both the detection accuracy and the computational complexity. The next task is therefore to determine the initial solution, which is traditionally set as a zero vector because no a priori information about the final solution is available. Note that for uplink massive MIMO systems, the channel matrix $\mathbf{H}$ is asymptotically orthogonal when $N_r \gg N_t$ [2]–[4]; hence, there is

$$W_{i,j} \approx \begin{cases} N_r, & i = j, \\ 0, & i \neq j. \end{cases} \tag{15}$$

Therefore, the Gram matrix is diagonally dominant for an uplink massive MIMO system, and all elements in the diagonal are positive [21], [22]. Combining (2) and (15), there are

$$\mathrm{Re}(\hat{s}_i) \approx \mathrm{Re}\left(W_{i,i}^{-1} y_i^{MF}\right) \approx \frac{1}{N_r}\mathrm{Re}\left(y_i^{MF}\right); \qquad \mathrm{Im}(\hat{s}_i) \approx \mathrm{Im}\left(W_{i,i}^{-1} y_i^{MF}\right) \approx \frac{1}{N_r}\mathrm{Im}\left(y_i^{MF}\right), \tag{16}$$

which means that the $i$th element of the vector $\hat{\mathbf{s}}$ is located in the same quadrant as the $i$th element of the vector $\mathbf{y}^{MF}$. Based on this property, the $i$th element of the initial solution vector $\hat{\mathbf{s}}^{(0)}$ can be placed at a fixed point of the quadrant in which $y_i^{MF}$ lies, rather than at the conventional zero vector, as shown in Fig. 1. Hence, the $i$th element of the initial solution $\hat{\mathbf{s}}^{(0)}$ for 64-QAM modulation can be expressed as

$$\hat{s}_i^{(0)} = \pm 4 \pm 4i. \tag{17}$$

Fig. 1. Quadrant-certain-based initial solution.

When searching for constellation points during the iteration, the relative position between a constellation point and the initial solution affects the performance of the iteration. Below, the quadrant-certain initial points method proposed in this paper is compared with the traditional zero-vector initial solution in terms of the distance between each method's initial point and each constellation point. Taking 64-QAM as an example, among the 64 constellation points, there are 4 constellation points ($\pm 1 \pm 1i$) for which the traditional zero initial solution is better than the proposed initial solution, 8 constellation points ($\pm 1 \pm 3i$, $\pm 3 \pm 1i$) for which the two methods have the same performance, and 52 constellation points for which the proposed initial solution is better than the traditional one.

Additive noise has no effect on the distribution of the transmitted signal points, so the transmitted signal is considered to occur randomly among the 64 constellation points. The above analysis shows that the proposed quadrant-certain initial points method gives a better initial solution than the traditional zero vector for 52 of the 64 constellation points. The simulation results (Fig. 3 in Section IV) confirm that the proposed method achieves better performance than the traditional initial solution method. In addition, the computational complexity does not increase. Hence, the proposed method can reduce hardware resource consumption and increase throughput.

3) Approximate LLR: According to (3) and (4), the computation of the soft-output LLR requires the equivalent channel gain ($U_{i,i}$), the NPI ($\sigma^2$), and the estimated vector $\hat{\mathbf{s}}$. The estimated vector $\hat{\mathbf{s}}$ is obtained from the iteration; therefore, the following work focuses on the computation of $U_{i,i}$ and $\sigma^2$. According to (9), the estimated vector can be expressed as

$$\begin{aligned} \hat{\mathbf{s}}^{(k+1)} &= \rho^{(k)}\left(\hat{\mathbf{s}}^{(k)} + \gamma^{(k)}\mathbf{z}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\hat{\mathbf{s}}^{(k-1)} \\ &= \rho^{(k)}\left(\mathbf{I}_{N_t} - \gamma^{(k)}\mathbf{W}\right)\hat{\mathbf{s}}^{(k)} + \left(1 - \rho^{(k)}\right)\hat{\mathbf{s}}^{(k-1)} + \rho^{(k)}\gamma^{(k)}\mathbf{y}^{MF}. \end{aligned} \tag{18}$$

Combining (2) and (18), let $\mathbf{y}^{MF} = \mathbf{e}_{(N_r,1)}$; the estimate of the inverse of matrix $\mathbf{W}$ can be expressed as

$$\hat{\mathbf{W}}_{inv}^{(k+1)} = \rho^{(k)}\left(\mathbf{I}_{N_t} - \gamma^{(k)}\mathbf{W}\right)\hat{\mathbf{W}}_{inv}^{(k)} + \left(1 - \rho^{(k)}\right)\hat{\mathbf{W}}_{inv}^{(k-1)} + \rho^{(k)}\gamma^{(k)}\mathbf{e}_{(N_r,1)}. \tag{19}$$

According to (15), $\hat{\mathbf{W}}_{inv}^{(0)}$ in (19) can be computed as $\hat{\mathbf{W}}_{inv}^{(0)} = \frac{1}{N_r}\mathbf{I}_{N_t}$, which is a diagonal matrix. Combining (3) and (19), the matrices $\mathbf{U}$ and $\mathbf{V}$ can be approximated as

$$\begin{aligned} \hat{\mathbf{U}}^{(k+1)} &= \rho^{(k)}\left(\mathbf{I}_{N_t} - \gamma^{(k)}\mathbf{W}\right)\hat{\mathbf{U}}^{(k)} + \left(1 - \rho^{(k)}\right)\hat{\mathbf{U}}^{(k-1)} + \rho^{(k)}\gamma^{(k)}\mathbf{H}^H\mathbf{H}; \\ \hat{\mathbf{V}}^{(k+1)} &= \rho^{(k)}\left(\mathbf{I}_{N_t} - \gamma^{(k)}\mathbf{W}\right)\hat{\mathbf{U}}^{(k)}\hat{\mathbf{W}}_{inv} + \left(1 - \rho^{(k)}\right)\hat{\mathbf{U}}^{(k-1)}\hat{\mathbf{W}}_{inv} + \rho^{(k)}\gamma^{(k)}\mathbf{H}^H\mathbf{H}\hat{\mathbf{W}}_{inv}. \end{aligned} \tag{20}$$

The equivalent channel gain $U_{i,i}$ and NPI can be approximated as

$$U_{i,i} \approx \hat{U}_{i,i}^{(k+1)}; \qquad \sigma_i^2 \approx \sum_{j \neq i}^{N_t} |U_{i,j}|^2 E_s + V_{i,i} N_0. \tag{21}$$

Therefore, combining (19)–(21), the SINR $\zeta_i^2$ can be computed. However, these computations are complicated and computationally expensive. According to the properties of massive MIMO systems, the Gram matrix and matrix $\mathbf{U}$ are diagonally dominant.


Hence, NPI can be approximated as

$$\sigma_i^2 \approx N_0\,\hat{U}_{i,i}^{(k+1)}\left(\hat{\mathbf{W}}_{inv}^{(k+1)}\right)_{i,i}. \tag{22}$$

Combining (21) and (22), the SINR $\zeta_i^2$ in (4) can be approximated as

$$\zeta_i^2 \approx \frac{\left(\hat{U}_{i,i}^{(k+1)}\right)^2}{N_0\,\hat{U}_{i,i}^{(k+1)}\left(\hat{\mathbf{W}}_{inv}^{(k+1)}\right)_{i,i}} \approx \frac{N_r}{N_0}. \tag{23}$$

These approximations avoid large-scale matrix multiplications and inversions, thereby reducing computational complexity and increasing parallelism, which is suitable for hardware implementations. Finally, the proposed algorithm is summarized in Algorithm 1.

Algorithm 1: Proposed Recursive Conjugate Gradient (RCG) Algorithm for Soft-Output MMSE Detection.

Inputs:
    $N_r \times N_t$ flat Rayleigh fading channel matrix $\mathbf{H}$;
    $N_r \times 1$ received signal vector at the BS $\mathbf{y}$;
    parameters: iterative number $k$, noise power spectral density $N_0$, signal power spectral density $E_s$;
Outputs:
    LLR values;
Initialization:
    $\mathbf{y}^{MF} = \mathbf{H}^H\mathbf{y}$, $\mathbf{W} = \mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}$;
    for $i = 1, 2, \ldots, N_t$ do
        if $\mathrm{Re}(y_i^{MF}) > 0$, $\mathrm{Im}(y_i^{MF}) > 0$ then
            $\hat{s}_i^{(0)} = 4 + 4i$;
        else if $\mathrm{Re}(y_i^{MF}) > 0$, $\mathrm{Im}(y_i^{MF}) < 0$ then
            $\hat{s}_i^{(0)} = 4 - 4i$;
        else if $\mathrm{Re}(y_i^{MF}) < 0$, $\mathrm{Im}(y_i^{MF}) > 0$ then
            $\hat{s}_i^{(0)} = -4 + 4i$;
        else if $\mathrm{Re}(y_i^{MF}) < 0$, $\mathrm{Im}(y_i^{MF}) < 0$ then
            $\hat{s}_i^{(0)} = -4 - 4i$;
        end if
    end for
    $\hat{\mathbf{s}}^{(-1)} = \hat{\mathbf{s}}^{(0)}$, $\mathbf{z}^{(-1)} = \mathbf{z}^{(0)} = \mathbf{y}^{MF} - \mathbf{W}\hat{\mathbf{s}}^{(0)}$;
Iteration:
    for $t = 0, 1, \ldots, k - 1$ do
        $\boldsymbol{\eta}^{(t)} = \mathbf{W}\mathbf{z}^{(t)}$;
        $\xi^{(t)} = (\mathbf{z}^{(t)}, \mathbf{z}^{(t)})$, $\phi^{(t)} = (\boldsymbol{\eta}^{(t)}, \mathbf{z}^{(t)})$;
        $\gamma^{(t)} = \xi^{(t)} / \phi^{(t)}$;
        if $t = 0$ then
            $\rho^{(0)} = 1$;
        else
            $\rho^{(t)} = \left(1 - \frac{\gamma^{(t)}}{\gamma^{(t-1)}}\frac{\xi^{(t)}}{\xi^{(t-1)}}\frac{1}{\rho^{(t-1)}}\right)^{-1}$;
        end if
        $\hat{\mathbf{s}}^{(t+1)} = \rho^{(t)}(\hat{\mathbf{s}}^{(t)} + \gamma^{(t)}\mathbf{z}^{(t)}) + (1 - \rho^{(t)})\hat{\mathbf{s}}^{(t-1)}$;
        $\mathbf{z}^{(t+1)} = \rho^{(t)}(\mathbf{z}^{(t)} - \gamma^{(t)}\boldsymbol{\eta}^{(t)}) + (1 - \rho^{(t)})\mathbf{z}^{(t-1)}$;
    end for
LLR:
    for $i = 1, 2, \ldots, N_t$ do
        for $b = 1, 2, \ldots, \log_2 Q$ do
            $\varphi_b(\hat{s}_i) = \min_{s \in S_b^0}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2 - \min_{s \in S_b^1}\left|\frac{\hat{s}_i}{U_{i,i}} - s\right|^2$;
            $\zeta_i^2 = N_r / N_0$;
            $L_{i,b}(\hat{s}_i) = \zeta_i^2\,\varphi_b(\hat{s}_i)$;
        end for
    end for

Two approximate methods for LLR calculation were proposed in [25] and [29]. The method in [25] uses the diagonal elements of the MMSE filtering matrix $\mathbf{W}$ to approximate the SINR $\zeta_i^2$ and from it the LLR. Since the diagonal elements of matrix $\mathbf{W}$ are not exactly the same, the corresponding SINR $\zeta_i^2$ is different for each user (each diagonal element of matrix $\mathbf{W}$). Therefore, the SINR $\zeta_i^2$ needs to be calculated for each user. In addition, the SINR computation cannot start until the diagonal elements of matrix $\mathbf{W}$ have been computed, so successive computations are data dependent. Considering the hardware implementation, these dependencies limit the processing rate and throughput of the detector. The method in [29] approximates the SINR $\zeta_i^2$ and the LLR by calculating the eigenvalues of matrix $\mathbf{W}$, which would introduce extra computational complexity to the RCG algorithm in this paper. In addition, the SINR $\zeta_i^2$ and LLR can be calculated only after matrix $\mathbf{W}$ and its eigenvalues are available. Therefore, this method also has data dependencies that limit the processing rate and throughput of the detector. In this paper, the SINR $\zeta_i^2$ is approximated directly from $N_r$, as in (23), which requires only one operation. Furthermore, there is no data dependency before and after the calculation. Therefore, compared with [25] and [29], the approximate LLR method proposed in this paper is simple and conducive to hardware implementation.

B. Approximated Error

This section mathematically demonstrates that the proposed RCG detection algorithm achieves a low approximate error. According to [31], the orthogonal basis $\mathbf{p}^{(k)}$ and residual vector $\mathbf{z}^{(k)}$ in (5) satisfy $\mathbf{p}^{(1)} = \mathbf{z}^{(0)}$ and $\mathbf{z}^{(1)} = \mathbf{z}^{(0)} - \alpha^{(1)}\mathbf{W}\mathbf{p}^{(1)} = \mathbf{z}^{(0)} - \alpha^{(1)}\mathbf{W}\mathbf{z}^{(0)}$. Using mathematical induction, for the $t$th iteration, there are

$$\mathbf{z}^{(t-1)} = \sum_{i=0}^{t-1} a_i^{(t-1)}\mathbf{W}^i\mathbf{z}^{(0)}; \qquad \mathbf{p}^{(t)} = \sum_{i=0}^{t-1} b_i^{(t)}\mathbf{W}^i\mathbf{z}^{(0)}, \tag{24}$$

where $a_i^{(t-1)}$ and $b_i^{(t)}$ are constants. If $\mathbf{z}^{(t-1)}$ is not zero, then $\mathbf{z}^{(0)}, \mathbf{W}\mathbf{z}^{(0)}, \mathbf{W}^2\mathbf{z}^{(0)}, \ldots, \mathbf{W}^{t-1}\mathbf{z}^{(0)}$ are linearly independent, and $\mathbf{p}^{(1)}, \mathbf{p}^{(2)}, \ldots, \mathbf{p}^{(t)}$ form $t$ linearly independent vectors. Therefore, these two sets of vectors span the same space, called $S_t$. Let $\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(0)}$ be one vector of the space $S_t$; then, there is

$$\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(0)} = \sum_{i=1}^{t} c_i\mathbf{p}^{(i)}, \tag{25}$$

where $c_i$ is a constant. According to (5), there is

$$\hat{\mathbf{s}}^{(t)} - \hat{\mathbf{s}}^{(0)} = \sum_{i=1}^{t} \alpha^{(i)}\mathbf{p}^{(i)}. \tag{26}$$
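As a numerical aside on (24) to (26), the sketch below checks that $t$ steps of the three-term recursion of Section III-A keep $\hat{\mathbf{s}}^{(t)} - \hat{\mathbf{s}}^{(0)}$ inside the space spanned by $\mathbf{z}^{(0)}, \mathbf{W}\mathbf{z}^{(0)}, \ldots, \mathbf{W}^{t-1}\mathbf{z}^{(0)}$. It is only an illustration under assumed values: the random channel, the constant 0.1 standing in for $N_0/E_s$, the random right-hand side, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
Nr, Nt, t = 128, 8, 3                          # illustrative sizes, t iterations
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
W = H.conj().T @ H + 0.1 * np.eye(Nt)          # 0.1 stands in for N0/Es (assumed)
y_mf = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)
s0 = np.zeros(Nt, dtype=complex)

# Basis of S_t: [z0, W z0, ..., W^(t-1) z0], columns normalised for conditioning.
z0 = y_mf - W @ s0
cols, v = [], z0
for _ in range(t):
    cols.append(v / np.linalg.norm(v))
    v = W @ v
K = np.stack(cols, axis=1)

# t steps of the recursion (8), (9), (11), (12), (14).
s_prev, s_cur, z_prev, z_cur = s0, s0, z0, z0
rho_p = gamma_p = xi_p = 1.0
for it in range(t):
    eta = W @ z_cur
    xi = np.vdot(z_cur, z_cur).real
    gamma = xi / np.vdot(z_cur, eta).real
    rho = 1.0 if it == 0 else 1.0 / (1.0 - gamma * xi / (gamma_p * xi_p * rho_p))
    s_next = rho * (s_cur + gamma * z_cur) + (1.0 - rho) * s_prev
    z_next = rho * (z_cur - gamma * eta) + (1.0 - rho) * z_prev
    s_prev, s_cur, z_prev, z_cur = s_cur, s_next, z_cur, z_next
    rho_p, gamma_p, xi_p = rho, gamma, xi

# Project s^(t) - s^(0) onto span(K); a vanishing residual means it lies in S_t.
d = s_cur - s0
coeffs, *_ = np.linalg.lstsq(K, d, rcond=None)
rel_residual = np.linalg.norm(K @ coeffs - d) / np.linalg.norm(d)
rank_K = np.linalg.matrix_rank(K)
```

A rank of $t$ reflects the linear independence stated after (24), and a negligible `rel_residual` reflects the membership expressed by (26).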


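Combining (25) and (26), there are

$$\begin{aligned} \left\|\mathbf{m}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}}^2 &= \left\|\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(t)} + \hat{\mathbf{s}}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}}^2 \\ &= \left\|\sum_{i=1}^{t}\left(c_i - \alpha^{(i)}\right)\mathbf{p}^{(i)} - \sum_{i=t+1}^{k}\alpha^{(i)}\mathbf{p}^{(i)}\right\|_{\mathbf{W}}^2 \\ &= \left\|\sum_{i=1}^{t}\left(c_i - \alpha^{(i)}\right)\mathbf{p}^{(i)}\right\|_{\mathbf{W}}^2 + \left\|\sum_{i=t+1}^{k}\alpha^{(i)}\mathbf{p}^{(i)}\right\|_{\mathbf{W}}^2 \\ &= \left\|\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(t)}\right\|_{\mathbf{W}}^2 + \left\|\hat{\mathbf{s}}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}}^2 \geq \left\|\hat{\mathbf{s}}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}}^2. \end{aligned} \tag{27}$$

In (27), the norm is defined as $\|\mathbf{x}\|_{\mathbf{W}} = (\mathbf{W}\mathbf{x}, \mathbf{x})^{1/2}$. When $\mathbf{m}^{(t)} = \hat{\mathbf{s}}^{(t)}$, the term $\|\mathbf{m}^{(t)} - \mathbf{s}\|_{\mathbf{W}}^2$ in (27) reaches its minimum, which can be expressed as

$$\left\|\hat{\mathbf{s}}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}} = \min_{\mathbf{m}^{(t)} - \hat{\mathbf{s}}^{(0)} \in S_t}\left\|\mathbf{m}^{(t)} - \mathbf{s}\right\|_{\mathbf{W}}. \tag{28}$$

Let $\hat{\mathbf{s}}^{(t)} - \mathbf{s} = \mathbf{Q}^{(t)}$; then there are

$$\mathbf{z}^{(0)} = \mathbf{y}^{MF} - \mathbf{W}\hat{\mathbf{s}}^{(0)} = \mathbf{W}\left(\mathbf{s} - \hat{\mathbf{s}}^{(0)}\right) = -\mathbf{W}\mathbf{Q}^{(0)}. \tag{29}$$

Hence, combining (24), (25), and (29) and taking $t = k$, the term $\mathbf{m}^{(k)} - \mathbf{s}$ can be expressed as

$$\begin{aligned} \mathbf{m}^{(k)} - \mathbf{s} &= \hat{\mathbf{s}}^{(0)} + \sum_{i=1}^{k} c_i\mathbf{p}^{(i)} - \mathbf{s} = \mathbf{Q}^{(0)} + \sum_{i=1}^{k} c_i\sum_{j=0}^{i-1} b_j^{(i)}\mathbf{W}^j\mathbf{z}^{(0)} \\ &= \mathbf{Q}^{(0)} - \sum_{i=1}^{k} c'_i\mathbf{W}^i\mathbf{Q}^{(0)} = \left(\mathbf{I} - \sum_{i=1}^{k} c'_i\mathbf{W}^i\right)\mathbf{Q}^{(0)} = P_k(\mathbf{W})\,\mathbf{Q}^{(0)}, \end{aligned} \tag{30}$$

where $c'_i$ is a constant obtained by substituting (29) and collecting powers of $\mathbf{W}$, $P_k(\mathbf{W})$ is a $k$-degree polynomial of matrix $\mathbf{W}$, and $P_k(0) = 1$. All polynomials satisfying $P_k(0) = 1$ are defined as a set $X_k$; therefore, according to (30), (28) can be expressed as

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} = \min_{P_k \in X_k}\left\|P_k(\mathbf{W})\,\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}. \tag{31}$$

Because $\mathbf{W}$ is a Hermitian positive definite matrix, its eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_{N_t})$ are all positive. Therefore, there are eigenvectors $\boldsymbol{\psi}^{(1)}, \boldsymbol{\psi}^{(2)}, \ldots, \boldsymbol{\psi}^{(N_t)}$, which satisfy

$$\left(\boldsymbol{\psi}^{(i)}, \mathbf{W}\boldsymbol{\psi}^{(j)}\right) = 0; \qquad \left\|\boldsymbol{\psi}^{(i)}\right\|_{\mathbf{W}} = 1, \tag{32}$$

where $i, j = 1, 2, \ldots, N_t$, and $i \neq j$. Suppose

$$\mathbf{Q}^{(0)} = \sum_{i=1}^{N_t} c_i\boldsymbol{\psi}^{(i)}. \tag{33}$$

Let $\lambda_{max}$ and $\lambda_{min}$ be the largest and smallest eigenvalues of matrix $\mathbf{W}$, respectively. Hence, there are

$$\begin{aligned} \left\|P_k(\mathbf{W})\,\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}^2 &= \left\|\sum_{i=1}^{N_t} c_i P_k(\lambda_i)\boldsymbol{\psi}^{(i)}\right\|_{\mathbf{W}}^2 = \sum_{i=1}^{N_t} |c_i|^2 |P_k(\lambda_i)|^2 \\ &\leq \max_i |P_k(\lambda_i)|^2\sum_{i=1}^{N_t} |c_i|^2 = \max_i |P_k(\lambda_i)|^2\left\|\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}^2 \\ &\leq \max_{\lambda_{min} \leq x \leq \lambda_{max}} |P_k(x)|^2\left\|\mathbf{Q}^{(0)}\right\|_{\mathbf{W}}^2. \end{aligned} \tag{34}$$

Let $\varsigma = \frac{\lambda_{max} + \lambda_{min} - 2x}{\lambda_{max} - \lambda_{min}}$, so that $x = \frac{(\lambda_{max} + \lambda_{min}) - (\lambda_{max} - \lambda_{min})\varsigma}{2}$. Let $d = \varsigma|_{x=0} = \frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}}$, for which $d > 1$. Therefore, $P_k(x)$ in (34) can be rewritten as

$$P_k(x) = P_k\left(\frac{(\lambda_{max} + \lambda_{min}) - (\lambda_{max} - \lambda_{min})\varsigma}{2}\right) = R_k(\varsigma), \tag{35}$$

where $R_k(\varsigma)$ is a polynomial of $\varsigma$. When $x$ rises from $\lambda_{min}$ to $\lambda_{max}$, $\varsigma$ falls from 1 to $-1$. In addition, there are

$$R_k(d) = P_k(0) = 1; \qquad \min_{P_k(x) \in X_k}\ \max_{\lambda_{min} \leq x \leq \lambda_{max}} |P_k(x)| = \max_{-1 \leq \varsigma \leq 1} |R_k(\varsigma)| = \frac{1}{T_k(d)} = \frac{1}{T_k\left(\frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}}\right)}, \tag{36}$$

where $R_k(\varsigma) = \frac{T_k(\varsigma)}{T_k(d)}$. According to the formula of the $k$th-degree Chebyshev polynomial, there is

$$T_k(x) = \frac{1}{2}\left[\left(x + \sqrt{x^2 - 1}\right)^k + \left(x - \sqrt{x^2 - 1}\right)^k\right]. \tag{37}$$

Therefore, $T_k\left(\frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}}\right)$ in (36) can be bounded as in (38). Hence, combining (31), (33), (34), (36), and (38), there is

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} \leq 2\left(\frac{\sqrt{\lambda_{max}} - \sqrt{\lambda_{min}}}{\sqrt{\lambda_{max}} + \sqrt{\lambda_{min}}}\right)^k\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}} = 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}}, \tag{39}$$

where $\kappa = \lambda_{max}/\lambda_{min}$. Given the difficulty of determining the eigenvalues of matrix $\mathbf{W}$, the largest and smallest eigenvalues are approximated using the properties of massive MIMO systems. In a massive MIMO system, as $N_r$ and $N_t$ increase, the $\lambda_{max}$ and $\lambda_{min}$ of matrix $\mathbf{W}$ can be approximated as

$$\lambda_{max} \approx N_r + N_t + 2\sqrt{N_r N_t}; \qquad \lambda_{min} \approx N_r + N_t - 2\sqrt{N_r N_t}. \tag{40}$$

Therefore, combining (39) and (40), (39) can be approximated as

$$\left\|\hat{\mathbf{s}}^{(k)} - \mathbf{s}\right\|_{\mathbf{W}} \leq 2\left(\frac{N_t}{N_r}\right)^{k/2}\left\|\hat{\mathbf{s}}^{(0)} - \mathbf{s}\right\|_{\mathbf{W}}. \tag{41}$$

According to (41), when the number of users ($N_t$) is fixed, as the number of BS antennas $N_r$ increases, the approximation error will decrease. In other words, when the ratio $N_t/N_r$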


TABLE I
NUMBER OF REAL-VALUED MULTIPLICATIONS

TABLE II
NUMBER OF REAL-VALUED MULTIPLICATIONS IN A 128 × 8 MIMO SYSTEM

decreases, the approximation error of the proposed method will decrease. For massive MIMO systems, the number of antennas at the BS is much larger than the number of users, which indicates that the proposed method achieves low approximation error. In addition, according to (40) and (41), when the number of iterations k increases, the approximation error further decreases. Therefore, for large k, ŝ(k) − sW approaches zero. Hence, in a massive MIMO system, the proposed method can achieve a low approximation error that approaches zero.

C. Algorithm Analyses

The number of real-valued multiplications (RMULs) is generally used to evaluate the computational complexity of methods. In the proposed RCG iterative algorithm, the first set of calculations consists of the computations of the preiterative parameters yMF, W, and z(0). These computations comprise the multiplication of the conjugate transpose of the Nr × Nt channel matrix H with the Nr × 1 vector y, of the Nt × Nr matrix H^H with the Nr × Nt matrix H, and of the Nt × Nt symmetric matrix W with the Nt × 1 vector ŝ(0). The second set of calculations consists of a series of matrix-vector multiplications in the iterative process, that is, the multiplication of the Nt × Nt matrix W with an Nt × 1 vector z. The third set of calculations consists of the two inner products ξ(k) = (z(k), z(k)) and φ(k) = (η(k), z(k)) in the iterative process. The final set of calculations consists of updating ŝ and z in the iterative process. Note that when the iteration number k = 1, ρ(0) = 1. Therefore, the multiplications in (9) can be ignored. Table I compares the computational complexities of the various methods. Table II shows the total number of real-valued multiplications for all the algorithms with different k in a 128 × 8 MIMO system. The proposed RCG method has a lower computational complexity than that of the GS [21], CG [32], and conjugate gradient least squares (CGLS) [33] methods. For example, when Nr = 128, Nt = 8, and k = 2, the proposed RCG method requires 21632 RMULs, which is less than those of the GS (23152 RMULs), CG (23304 RMULs), and CGLS (27232 RMULs) methods. The OCD [28] and PCI [29] methods achieve large-scale matrix multiplication and inversion in an implicit way. Compared with other explicit methods, these implicit methods achieve low computational complexity (16416 and 12344 RMULs, respectively), as shown in Table I and Table II. However, these implicit methods ignore the channel hardening of a massive MIMO system. Therefore, in an actual system, the same Gram matrix needs to be calculated multiple times. Hence, the implicit methods suffer from high computational complexity in an actual system, which indicates high energy and area consumptions of implicit architectures. As k increases, the computational complexity of the NSA method increases. When k < 3, the NSA method [19] has a lower complexity of O(Nt^2). However, when k = 3, this method has a computational complexity of O(Nt^3), which is even higher than that of exact MMSE detection for k > 3. In general, to ensure detection accuracy, k should be greater than 3 in the NSA method.

Additional important aspects of the hardware implementation of the detection algorithm must still be considered. The parallelism of a method is an important issue in both algorithm design

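$$
\begin{aligned}
T_k\!\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}\right)
&= \frac{1}{2}\left[\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}+\frac{2\sqrt{\lambda_{\max}\lambda_{\min}}}{\lambda_{\max}-\lambda_{\min}}\right)^{k}
+\left(\frac{\lambda_{\max}+\lambda_{\min}}{\lambda_{\max}-\lambda_{\min}}-\frac{2\sqrt{\lambda_{\max}\lambda_{\min}}}{\lambda_{\max}-\lambda_{\min}}\right)^{k}\right] \\
&= \frac{1}{2}\left[\left(\frac{\left(\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}\right)^{2}}{\lambda_{\max}-\lambda_{\min}}\right)^{k}
+\left(\frac{\left(\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}\right)^{2}}{\lambda_{\max}-\lambda_{\min}}\right)^{k}\right] \\
&= \frac{1}{2}\left[\left(\frac{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}\right)^{k}
+\left(\frac{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}\right)^{k}\right] \\
&\geq \frac{1}{2}\left(\frac{\sqrt{\lambda_{\max}}+\sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}}-\sqrt{\lambda_{\min}}}\right)^{k}
\end{aligned}
\tag{38}
$$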
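The chain of equalities in (38) can be checked numerically. The sketch below assumes that T_k denotes the Chebyshev polynomial of the first kind, as is standard in CG convergence analysis; the function names and the eigenvalues 9 and 4 are illustrative assumptions and are not taken from the paper.

```python
import math


def chebyshev_recurrence(k, x):
    """T_k(x) from the three-term recurrence T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x)."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


def closed_form(k, lam_max, lam_min):
    """Two-term closed form that appears before the final bound in (38)."""
    a, b = math.sqrt(lam_max), math.sqrt(lam_min)
    ratio = (a + b) / (a - b)
    return 0.5 * (ratio ** k + ratio ** (-k))


def lower_bound(k, lam_max, lam_min):
    """Single-term bound on the last line of (38)."""
    a, b = math.sqrt(lam_max), math.sqrt(lam_min)
    return 0.5 * ((a + b) / (a - b)) ** k


# Illustrative eigenvalues (assumption): lam_max = 9, lam_min = 4.
lam_max, lam_min = 9.0, 4.0
x = (lam_max + lam_min) / (lam_max - lam_min)
for k in range(1, 6):
    print(k, chebyshev_recurrence(k, x), closed_form(k, lam_max, lam_min),
          lower_bound(k, lam_max, lam_min))
```

The second (decaying) term of the closed form is what the bound discards, so the gap between the closed form and the bound shrinks quickly as k grows.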


and hardware implementation. With regard to hardware implementation, there are strong correlations among all elements of the estimated vectors in the GS, CHD, LUD, and WeJi methods, which means that these methods have low parallelism [16], [17], [21], [25], [29]. In the RCG method, all elements in the matrices and vectors, such as matrix W in (2) and vector η(k) in (8), can be computed in parallel. In addition to parallelism for elements, steps can also be calculated in parallel, such as the computations of φ(k) and ξ(k) in (11) and z(k) and ŝ(k) in (8) and (9). In traditional CG and CGLS methods [32], [33], these computations cannot be parallelized. Hence, the proposed RCG method can be performed with high parallelism.

Fig. 2. SER performance comparison between the proposed initial solution and the traditional zero-vector initial solution (128 × 8 MIMO system).
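The sequential dependency attributed above to the conventional CG baseline can be seen in a minimal sketch of textbook CG applied to W ŝ = yMF. This is not the proposed RCG recursion, whose update equations (7)-(14) are not reproduced in this section. The form W = H^H H + N0·I, the helper names, and the random test data are assumptions made for illustration only.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product on lists of complex numbers."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def inner(u, v):
    """Real part of the Hermitian inner product u^H v."""
    return sum((a.conjugate() * b).real for a, b in zip(u, v))


def regularized_gram(H, n0):
    """W = H^H H + n0 * I (assumed form of the MMSE filtering matrix)."""
    nr, nt = len(H), len(H[0])
    W = [[sum(H[m][i].conjugate() * H[m][j] for m in range(nr))
          for j in range(nt)] for i in range(nt)]
    for i in range(nt):
        W[i][i] += n0
    return W


def matched_filter(H, y):
    """y_mf = H^H y."""
    nr, nt = len(H), len(H[0])
    return [sum(H[m][i].conjugate() * y[m] for m in range(nr)) for i in range(nt)]


def textbook_cg(W, y_mf, num_iters):
    """Textbook CG with a zero-vector initial solution (baseline, not RCG)."""
    s = [0j] * len(y_mf)
    r = list(y_mf)                 # residual for the zero initial solution
    p = list(r)                    # search direction
    rr = inner(r, r)
    for _ in range(num_iters):
        Wp = matvec(W, p)          # matrix-vector product comes first
        alpha = rr / inner(p, Wp)  # step size has to wait for Wp
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * wi for ri, wi in zip(r, Wp)]
        rr_new = inner(r, r)       # inner product has to wait for the new residual
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s


# Illustrative 128 x 8 example with random data (assumption, not the paper's setup).
random.seed(0)
nr, nt, n0 = 128, 8, 0.1
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(nt)]
     for _ in range(nr)]
y = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(nr)]
W = regularized_gram(H, n0)
y_mf = matched_filter(H, y)
s_hat = textbook_cg(W, y_mf, num_iters=6)
residual = [a - b for a, b in zip(matvec(W, s_hat), y_mf)]
rel_residual = (inner(residual, residual) / inner(y_mf, y_mf)) ** 0.5
print(f"relative residual after 6 iterations: {rel_residual:.2e}")
```

Within each iteration of this baseline, the step size depends on the matrix-vector product and the second inner product depends on the updated residual, so the steps execute one after another.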
The acronym RCG was introduced to distinguish it from traditional CG algorithms. Conventional CG and CGLS methods have been proposed to achieve massive MIMO detection [32], [33]. Nonetheless, compared with these conventional methods, the algorithm proposed in this paper has advantages in four aspects: parallelism, complexity, iteration number and initial solution design. First, the conventional CG [32] and CGLS [33] methods compute linear equations with relatively low parallelism. For example, in an iterative process, each step must be calculated in sequence; no two steps can be calculated simultaneously [32], [33]. However, the proposed RCG method achieves high parallelism between/in steps, as discussed above. This advantage is important in hardware implementation. Second, as shown in Table I and Table II, the proposed RCG method has a lower computational complexity than that of the CG [32] and CGLS [33] methods, as discussed above. Third, although the CG [32] and CGLS [33] methods can solve the linear equation in massive MIMO detection, such as (2), they require a large number of iterations (generally larger than three [32], [33]); thus, these methods do not converge very rapidly. A large number of iterations corresponds to higher computational complexity, which results in high energy and area consumptions. Finally, a zero vector is used as the initial solution in these methods [32], [33]. A good initial solution in iterative steps can make the computation reach an accurate final solution with a relatively small number of iterations; thus, choosing a proper initial solution can efficiently accelerate signal detection. The initial solution of the proposed RCG method is obtained by using the property of the massive MIMO system, which converges faster than the zero vector in [32], [33]. In addition, there is no added computational complexity. Hence, the proposed method can reduce hardware resource consumption and increase throughput.

Fig. 3. SER performance comparisons between the proposed algorithm and other algorithms (128 × 8 MIMO system).

IV. SYMBOL-ERROR-RATE PERFORMANCE

To evaluate the performance of the proposed RCG method, simulated SER results for RCG are compared with those of state-of-the-art methods, such as NSA and WeJi. The SER performance of exact MMSE detection based on the CHD method is also provided for comparison. The simulations adopt a 64-QAM modulation scheme and a rate-1/2 industry standard convolutional code with generator polynomials [133, 171] (octal) along with a random interleaver. In addition, coding was performed over 120 symbols, and the number of frames was 10,000. The channels were assumed to exhibit i.i.d. Rayleigh fading across the coded symbols. At the receiver, LLRs were used as the soft input for Viterbi decoding. Additionally, as in [18], [19], [21], the signal-to-noise ratio (SNR) was defined at the receiver.

Fig. 2 compares the SER performances of the RCG method with the proposed initial solution and with the conventional zero-vector initial solution. According to Fig. 2, to have the same SER and number of iterations (10^−2, k = 2), the proposed initial solution requires 1.74 dB less SNR than the traditional zero-vector initial solution. Hence, the proposed initial solution achieves a lower loss in terms of detection accuracy for a given number of iterations. In addition, the simulated SER performance obtained for k = 2 using the proposed initial solution is very close to that achieved for k = 3 using the conventional initial solution, which indicates that the proposed initial method reduces computational complexity while maintaining a similar detection accuracy.

Fig. 3 shows the simulated SER results of the proposed RCG method and different detection algorithms in a 64-QAM 128 × 8 MIMO system. The proposed RCG method achieves high detection accuracy compared with recent methods. For example, as shown in Fig. 3, when k = 2, the SNR required to achieve an SER of 10^−2 was 10.16 dB, which was close to the performance of the CHD-based MMSE (9.66 dB) [16], GS (10.10 dB) [21], [22] and OCD (10.11 dB) [28] detection methods. In contrast, to achieve the same SER, the required SNRs of the WeJi [25] and NSA [18]–[20] methods are 10.42 dB


and >20 dB, respectively. All simulations presented above are based on 128 × 8 MIMO systems and indicate that the proposed RCG algorithm is advantageous in terms of SER performance.

To confirm that the proposed method maintains its advantages for higher-order MIMO systems, Fig. 4 shows the SER performances under different simulation settings (128 × 16 and 128 × 8 MIMO systems). According to Fig. 4, compared with the MMSE detection algorithm, the proposed algorithm with k = 3 suffers an SNR loss of 0.34 dB (achieving an SER of 10^−2) in a 128 × 16 MIMO system. The proposed method achieves better detection performance than those of the methods of WeJi (>7 dB) and NSA (>7 dB) in [25] and [18]–[20]. Fig. 4 shows that, to achieve the same SER, the SNR required by the proposed algorithm is also smaller than that required by the NSA [18]–[20] and WeJi [25] methods, thus verifying that the proposed algorithm can maintain its advantages at different MIMO dimensions.

Fig. 4. SER performance comparisons at different MIMO dimensions (128 × 8 and 128 × 16 MIMO systems).

Different simulation results are shown in Figs. 2–4. Fig. 2 is used to compare the SER performance of the proposed initial solution in this paper with that of the traditional initial solution. To have the same SER, the proposed initial solution requires less SNR than the traditional zero-vector initial solution. Fig. 3 is used to compare the SER performances of the RCG algorithm and other related methods. The proposed RCG method achieves high detection accuracy compared with related methods. Fig. 4 is used to compare the SER performance of different algorithms at different MIMO dimensions. The proposed method can maintain its advantages at different MIMO dimensions. To show the SER performance in a more realistic channel model, the Kronecker channel model was used [21], [25], [29], [34]. In this channel model, channel matrix H is influenced by a correlation matrix. The correlation matrix is defined with a correlation factor between the neighboring branches (ξ). In this simulation, the correlation factor ξ is 0.2, 0.5 and 0.7, sequentially. Fig. 5 shows that, to achieve the same SER, the SNR required by the proposed RCG method is smaller than that required by the CG, RI and NSA methods, which indicates that the proposed algorithm can maintain its advantages in a realistic model.

Fig. 5. SER performance comparison for the Kronecker channel model (128 × 8 MIMO system).

V. VLSI ARCHITECTURE

In this section, a VLSI architecture is designed to achieve massive MIMO detection based on a modified MMSE signal detection algorithm. The architecture was designed for a case study of a 64-QAM, 128 × 8 massive MIMO system. Fig. 6 shows the top-level block diagram of the proposed VLSI architecture. To achieve high throughput with limited hardware resources, the top-level architecture is fully pipelined. The VLSI architecture is divided into three main modules. In the first module, a PE array is used to compute the matched-filter yMF and matrix W (Algorithm 1, line 8). These input data are stored in different memories of the architecture. The outputs are used to compute the initial solution s(0) of the estimated vector and other parameters in the preiterative block (Algorithm 1, lines 9–20). In the second module, a user-level pipeline module iteratively estimates the transmitted vector s(k) by using the proposed RCG method, which reduces computational complexity and enhances the parallelism of each iteration. In this module, vector η(k) (Algorithm 1, line 23) and parameters ξ(k), φ(k), and γ(k) (Algorithm 1, lines 24–29) are computed. Then, the estimated signal ŝ(k) (Algorithm 1, line 31) and residual vector z(k) are computed (Algorithm 1, line 32). In the third module, the final vector ŝ(k) and the parameters Nr and N0 are used to compute the soft outputs (LLRs) (Algorithm 1, lines 35–41). These blocks are further described in the following subsections.

Fig. 6. Top-level block diagram of VLSI architecture.

A. Processing Element Array

In the first module, a parallel PE array computes matrix W and vector yMF, as shown in Fig. 7. In the array, there are two types of PEs, including Nt PE-As and (Nt^2 + Nt)/2 PE-Bs, and all


PEs are deeply pipelined. For a 128 × 8 massive MIMO system, the numbers of PE-As and PE-Bs are 8 and 36, respectively. The PE-As are used to compute the diagonal elements of matrix W; these elements are real-valued. The PE-Bs are used to compute the off-diagonal elements of the matrix W (28 PE-Bs) and vector yMF (8 PE-Bs). In one clock cycle, eight elements of H^H and y are input for computation, and the PEs exhibit echelonment with a high processing speed because G is a symmetric matrix. The different types of calculations included in the PEs (all PEs in this massive MIMO detector) are achieved through multiple pipelines. The outputs, matrix W and vector yMF, are transmitted to the next module to iteratively compute the final vector, ŝ(k).

Fig. 7. Architecture of the processing element array.

As shown in Fig. 8(a), each PE-A includes eight groups of the same arithmetic logical units (ALUs), one accumulator, and an adder for the diagonal element. In each PE-A, there are 16 multipliers, 16 adders and 8 subtracters in total. Each ALU is used to compute part of the diagonal element in the Gram matrix. Then, the results of all ALUs are accumulated to obtain the value G_{i,i}. Fig. 8(b) shows the details of the PE-B, which computes the complex-valued multiplications and accumulations. There are eight groups of ALUs and two accumulators in a PE-B. In each PE-B, there are 32 multipliers, 38 adders and 8 subtracters in total. The real and imaginary parts of the output are computed at the same time. The outputs of the PE-A are transmitted to the next unit per clock cycle after the initial latency. Hence, this parallel PE array can achieve high throughput, high hardware utilization, and high energy and area efficiencies.

Fig. 8. Architectures of the PE-A and PE-B.

In [8], [18], [19], [22], [25], systolic arrays were used to compute the Gram matrix G and matched-filter vector yMF. In contrast to the proposed array, these architectures do not compute the diagonal and off-diagonal elements of matrix G simultaneously. Hence, the computations of the off-diagonal elements of matrix G are delayed. However, these diagonal and off-diagonal elements are required at the same time to achieve the following detection. The subsequent user-level pipeline module must wait longer to receive its inputs. Therefore, the throughput of the overall architecture is reduced, and the latency is increased. In [18], [19], [22], double-sided inputs for PE-A are used; however, in the proposed architecture, single-sided inputs are used, which reduces the number of registers on the input side by half. In architectures based on implicit methods (such as OCD, PCI, and IIC), the computations of the Gram matrices are converted into vector multiplications [28]–[30]. Therefore, these implicit architectures have advantages in terms of throughput, hardware resource usage, and power consumption. However, in an actual system with channel hardening, the implicit architectures compute the same Gram matrix results many times. Hence, in an actual system, the energy consumption and latency of the proposed architecture are lower than those of these implicit architectures because of the reusability of the Gram matrix result.

Fig. 9. Architecture of the user-level pipeline.

B. User-Level Pipeline Module

The user-level pipeline (Fig. 9) is designed to achieve RCG iteration, which includes two blocks: a preiterative block and an iterative block. The initial solution and some parameters are computed in the preiterative block, as shown in Fig. 9(a). In the preiterative block, there are Nt parallel PE-Cs (8 PE-Cs for a 128 × 8 MIMO system) to compute the initial solution by using the proposed initial solution method. According to the elements of the vector yMF, the initial solution vector ŝ(0) can be set. Then, according to (7), the first residual vector z(0) can be computed by using Nt parallel PE-Ds (8 PE-Ds for a 128 × 8 MIMO system). In each PE-D, there are 8 multipliers, 8 adders and 8 subtracters to multiply the matrix and vector. The details of PE-C and PE-D are also shown in Fig. 9(a). The iterative block estimates and updates the signal ŝ(k) and residual z(k) according to matrix W. As shown in Fig. 9(b), the whole iterative block includes two stages, which complete two iterations. The details of all stages are shown in Fig. 10. In the first stage, there are Nt + 2 parallel PE-Bs (10 PE-Bs for a 128 × 8 MIMO system), which are used for matrix-vector multiplications to obtain parameters η(0) in (8) and ξ(0), φ(0), and η(0) in (11). In addition, there are one PE-E and one PE-F to compute vectors ŝ(1) and z(1) according to (8) and (9). In each PE-E (PE-F), there are 8 highly parallel multipliers and 8 adders (subtracters). In the second stage, similar PE-Bs are used to compute the parameters in the


second iteration. In addition, there are additional operators used to compute the parameter ρ(1) in (14). Then, two PE-Es can be used to update the signal ŝ(2) according to the outputs of the first stage. The calculations of these two stages were deeply pipelined based on the user level, resulting in high parallelism. Moreover, the hardware utilizations of PE-B, PE-E and PE-F are high.

Fig. 10. Architecture of the iterative block.

In [28], the GS iteration is achieved in this architecture. Compared with [28], the proposed user-level parallel architecture can be utilized in this iterative module. Therefore, this module can be used to obtain a fully pipelined architecture that achieves high throughput of the complete system. In [18], [19], similar iterative computations were performed in eight groups of lower triangular systolic arrays. In those architectures, the pipeline delay was determined by the time consumption of the lower triangular systolic arrays. Hence, the throughput of the entire system was limited because of resource limitations. The proposed architecture can perform the iterations using highly parallel vector multiplications in place of the matrix multiplications, which results in high throughput and low area and power consumptions. In [8], [25], the iterations are achieved by one-row-based systolic arrays. These architectures reduced the hardware consumption by using these one-row-based systolic arrays. However, compared with the proposed architecture, these architectures increased the pipeline delay because of the transmission of data between the PEs in the systolic arrays. Hence, throughput will be limited to some extent in these architectures.

C. LLR Module

Fig. 11 shows the block diagram of the approximated LLR module, which includes (1/2) log2 Q PE-Fs in this architecture for Q-QAM modulation. According to (4), the computation of the LLR values requires the channel gain, U_{i,i}; NPI, σ_i^2; and SINR, ζ_i^2. Based on (4), the function ϕ_b(ŝ_i) is reformulated as a piecewise linear function for Gray mapping, which can be efficiently achieved in the hardware architecture. In this architecture, the coefficients of each linear segment of ϕ_b(ŝ_i) are stored in a correction LUT. To compute the approximate LLRs for each transmitted bit, the remaining task is to determine the SINR ζ_i^2. In this architecture, the approximated methods in (23) are used to calculate the SINR ζ_i^2 using Nr and N0. In addition, the reciprocal of Nr is obtained from an LUT. Next, the bit LLR values, L_{i,b}, are calculated using the SINR ζ_i^2. The block allows the computations of the LLRs to be simplified, which increases the processing speed and reduces both the area and power consumptions in the architecture.

Fig. 11. Architecture of the approximated LLR processing unit.

VI. IMPLEMENTATION AND COMPARISON

The proposed hardware architecture is verified on an FPGA platform (Xilinx Virtex-7) and implemented using TSMC 65 nm CMOS technology. Details of the hardware features, derived from the FPGA and silicon implementations, are compared to those of state-of-the-art designs. In addition, the fixed-point design and the SER performance with fixed-point are also presented.

TABLE III
BIT-WIDTH OF ALL RCG VARIABLES

A. Fixed-Point Design

In order to facilitate hardware implementation, fixed-point arithmetic is used in the whole detector. Based on a large number of simulations, the bit-widths of the detector data are fixed. Additionally, the bit-widths of the real and imaginary parts are the same for the complex computation. The bit-widths are 12-bit for input matrix H, received signal y and N0. The bit-width of the output, estimated signal s, is 13-bit, and the bit-width of the LLR value is 12-bit. The bit-widths of all RCG variables are listed in Table III. The SER performance with fixed-point is shown in


TABLE IV
COMPARISON OF RESOURCE USAGE ON A XILINX VIRTEX-7 FPGA

a
SNR values are targeted to an SER of 10−2 with floating-point designs, and the iteration number is listed for each algorithm in the simulation. In this work, the SNR
is 10.38 dB with a fixed-point design. The SNR in this work is close to the performance of other algorithms. In the case of similar detection accuracy, the throughput,
resource consumption and hardware efficiency are compared with related designs.
b
Hardware efficiency is defined as throughput/normalized source consumption (NSC), which is computed as NSC = LUTs+FFs+DSP × 280 [30].
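The hardware-efficiency metric in footnote b can be written as a short helper. The sketch below follows the stated definition (NSC = LUTs + FFs + DSP × 280, efficiency = throughput/NSC); the function names and the two example designs are hypothetical and do not reproduce any row of Table IV.

```python
def normalized_source_consumption(luts, ffs, dsps, dsp_weight=280):
    """NSC = LUTs + FFs + DSP x 280, as defined in footnote b."""
    return luts + ffs + dsps * dsp_weight


def hardware_efficiency(throughput_mbps, luts, ffs, dsps):
    """Throughput per unit of normalized source consumption."""
    return throughput_mbps / normalized_source_consumption(luts, ffs, dsps)


# Hypothetical designs (illustrative numbers only, not taken from Table IV).
design_a = {"throughput_mbps": 1000.0, "luts": 20000, "ffs": 15000, "dsps": 100}
design_b = {"throughput_mbps": 500.0, "luts": 10000, "ffs": 8000, "dsps": 50}

eff_a = hardware_efficiency(**design_a)
eff_b = hardware_efficiency(**design_b)
print(f"efficiency ratio A/B: {eff_a / eff_b:.3f}")
```

Because each DSP slice is weighted by 280, a design that doubles throughput by doubling every resource gains little in this metric, which is why throughput ratios and efficiency ratios can differ substantially.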

Fig. 12 (labeled ‘fp’) for a 128 × 8 MIMO system. In addition, related designs with k = 2 are also included in Fig. 12, such as NSA, GS, OCD, WeJi, IIC, and CHD. In this architecture, the SNR required to achieve an SER of 10^−2 was 10.38 dB, corresponding to a loss of approximately 0.2 dB relative to the floating-point design (10.16 dB). The fixed-point SNR is comparable with related designs, such as NSA, GS, OCD, WeJi, and IIC. Notably, the results for these related designs are based on floating-point designs.

Fig. 12. SER performance of the fixed-point design and comparison with other detectors (128 × 8 MIMO system).

B. FPGA Implementation

Table IV lists the key results of the FPGA implementation of the proposed architecture and other state-of-the-art designs in [13], [19], [22], [25], [28], [30], [33]. In this table, the iteration numbers and SNR values (targeting an SER of 10^−2) are listed for all the algorithms, and the SNR values are based on floating-point designs with the listed iteration number in each algorithm. The SNR in this work (10.16 dB) is close to the performances of other algorithms (9.66–10.47 dB). Note that the TASER [13] is a nonlinear algorithm. In massive MIMO systems (predominantly Nr ≫ Nt), linear algorithms can achieve near-optimal performance compared with nonlinear algorithms [2]–[4], [18]–[22]. In addition, the SNR is 10.38 dB in this architecture with the fixed-point design. Based on the similar SER performances, the throughputs and efficiencies are compared. The throughput in this architecture is similar to that in the CHD architecture in [16], but the resource consumption is noticeably reduced. According to [30], the normalized source consumption (NSC) can be computed as NSC = LUTs + FFs + DSP × 280. Hence, the hardware efficiency (throughput/NSC) of the RCG architecture is increased by 2.92× compared with that of the CHD architecture in [16]. Compared with the CHD architecture, the architectures in this work and recent works [19], [22], [25], [28], [30], [33] achieve massive MIMO detection with reduced complexity. Compared with the WeJi [25], OCD [28], NSA [19], GS [22], and CG [33] architectures, the proposed RCG architecture achieves 2.05×, 1.68×, 1.04×, 13.13×, and 31.5× increases in throughput, respectively. Considering that there are different hardware consumptions in these architectures, hardware efficiency is used to achieve a fair comparison. Compared with the WeJi [25], OCD [28], NSA [19], GS [22], and CG [33] architectures, the RCG architecture achieves 1.66×, 1.60×, 2.32×, 4.44×, and 1.75× increases in hardware efficiency, respectively. Note that compared with the RCG architecture, the IIC architecture achieves a 1.45× increase in throughput. However, the IIC architecture requires more resources; therefore, the RCG architecture achieves 1.33× greater hardware efficiency than the IIC architecture. In the TASER architectures, there is less hardware resource consumption than in the RCG architecture. However, compared with these two TASER architectures, the RCG architecture achieves 16.58× and 12.6× increases in throughput. Considering hardware efficiency, the RCG architecture achieves 1.20× and 2.88× greater efficiency than the two architectures.

C. Silicon Implementation

The proposed MMSE detector was implemented on 1.87 × 1.87 mm^2 silicon using TSMC 65 nm CMOS technology. Fig. 13 shows the die micrograph and the measured frequency and power consumption for different supply voltages of the chip. This chip was able to achieve a 1.5 Gbps data rate with a 500 MHz working frequency while dissipating 557 mW at 1.2 V. The measured results show that under supply voltages of


TABLE V
COMPARISON OF ASIC IMPLEMENTATION RESULTS

a
SNR values are targeted to an SER of 10−2 with floating-point designs, and the iteration number is listed for each algorithm in the simulation. In this work, the SNR
is 10.38 dB with a fixed-point design. The SNR in this work is close to the performance of other algorithms. In the case of similar detection accuracy, the throughput,
area, power consumption, energy and area efficiencies are compared with related designs.
b
Energy and area efficiencies are defined as throughput/power and throughput/area (gate count), respectively.
c
There are two kinds of normalized aspects. (1) Technology normalized to 65 nm technology. According to [9], [13], [25], [28], frequency and power are increased by $s$ and $\frac{1}{s}\times\left(\frac{V_{dd}}{V'_{dd}}\right)^{2}$, respectively. (2) MIMO size normalized to 128 × 8. According to [12], throughput is increased by $\frac{8}{N_t}\times\frac{\log_2 N_t}{\log_2 8}$, and power and area are increased by $\frac{128}{N_r}\times\frac{8}{N_t}\times\frac{\log_2 N_t}{\log_2 8}$.
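The energy and area efficiencies defined in footnote b are simple ratios, and the values reported for this work can be reproduced from the chip figures quoted in the text (1.5 Gbps, 557 mW and 1372 kGE with preprocessing; 120 mW and 596 kGE without). The sketch below is only an arithmetic check; the function and variable names are illustrative assumptions.

```python
def energy_efficiency(throughput_mbps, power_mw):
    """Throughput per unit power, in Mbps/mW."""
    return throughput_mbps / power_mw


def area_efficiency(throughput_mbps, gate_count_kge):
    """Throughput per unit gate count, in Mbps/kGE."""
    return throughput_mbps / gate_count_kge


THROUGHPUT_MBPS = 1500.0  # 1.5 Gbps, as reported for the chip

# Complete detector, including preprocessing.
full_energy = energy_efficiency(THROUGHPUT_MBPS, 557.0)
full_area = area_efficiency(THROUGHPUT_MBPS, 1372.0)

# Detector without preprocessing.
core_energy = energy_efficiency(THROUGHPUT_MBPS, 120.0)
core_area = area_efficiency(THROUGHPUT_MBPS, 596.0)

print(f"with preprocessing:    {full_energy:.3f} Mbps/mW, {full_area:.3f} Mbps/kGE")
print(f"without preprocessing: {core_energy:.3f} Mbps/mW, {core_area:.3f} Mbps/kGE")
```

These are the unnormalized efficiencies; the technology and MIMO-size normalization of footnote c is applied separately before designs are compared.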

0.9 V, 1.0 V, and 1.1 V, the proposed chip achieves frequencies of 310 MHz, 330 MHz, and 390 MHz and power consumptions of 194 mW, 283 mW, and 409 mW, respectively. In [10], under supply voltages of 0.7 V and 1.0 V, the frequencies are 238 MHz and 569 MHz and the power consumptions are 23.4 mW and 127 mW, respectively. In [12] and [16], the supply voltages are both 0.9 V, and in [17] and [25], the supply voltages are both 1.0 V. The measured results of these circuits are shown in Table V.

Fig. 13. (a) Die micrograph; (b) Measured frequency and power consumption for different supply voltages.

Table V lists the detailed hardware features of the RCG architecture and the state-of-the-art designs in [10], [12], [16], [17], [19], [20], [25], [27]. In Table V, the SNR values targeted to an SER of 10^−2 are listed for all these algorithms. In addition, these SNR values are based on floating-point designs with the listed iteration number in each algorithm. This work (10.16 dB) achieves performance close to that of the state-of-the-art designs (9.66–10.42 dB). In massive MIMO systems (predominantly Nr ≫ Nt), linear algorithms can achieve near-optimal performance compared with nonlinear algorithms, such as EPD [10], MPD [12], and MCTS [27] [2]–[4], [18]–[22]. Based on the similar SER performances, the energy and area efficiencies are compared. In this chip, the core area is 3.50 mm^2, and the energy and area efficiencies are 2.693 Mbps/mW and 1.093 Mbps/kGE, respectively. The energy and area efficiencies of related designs are 1.096–52.038 Mbps/mW and 0.3–3.974 Mbps/kGE. The architecture in [17] achieves high efficiency because of the small MIMO size, which is used in traditional MIMO systems. Hence, considering that different MIMO sizes and technologies influence these efficiencies, the energy and area efficiencies are normalized to the same MIMO size and technology (using the full-scaling approach) to achieve fair comparisons. The same normalization method is also used in [12]. The RCG architecture achieves 2.47×, 2.39×, 10.60×, and 7.09× greater normalized energy efficiency (throughput/power) than the WeJi [25], LUD [17], NSA [19], [20], and MCTS [27] architectures, respectively. In addition, the normalized area efficiency (throughput/gate count) of RCG is 1.15×, 8.81×, 5.25×, and 1.18× that in the WeJi [25], LUD [17], NSA [19], [20], and MCTS [27] architectures, respectively. Note that the values of the NSA and MCTS architectures are layout results without silicon proof. Note that the proposed RCG architecture includes preprocessing. To achieve fair comparisons with recent preprocessing-excluded architectures [10], [12], [16], the results of RCG without preprocessing are listed in Table V. When the preprocessing in the detector is excluded (considering only the design after preprocessing), the proposed architecture achieves a 1.62 mm^2 core area, 12.5 Mbps/mW energy efficiency, and 2.517 Mbps/kGE area efficiency. The state-of-the-art designs without preprocessing achieve 0.58–2.00 mm^2


core area, 14.173-102.7 Mbps/mW energy efficiency, and 0.499- algorithm is a hardware-friendly design. Meanwhile, the ar-
6.855 Mbps/kGE area efficiency, respectively. Like [17], the chitecture design fully utilizes the advantage of the proposed
architecture in [12] achieves high efficiency because of the algorithm, which improves the hardware utilization and reduces
small MIMO size. Hence, the energy and area efficiencies are resource consumption. (4) Optimized fixed-point design: The
normalized to the same MIMO size and technology to achieve fixed-point design of the proposed architecture is determined by
fair comparisons. The RCG architecture achieves 6.85×, 7.18×, a large number of simulations. The fixed-point design achieves
and 2.29× the normalized energy efficiency in the EPD [10], a short bit-width with an acceptable SER performance loss. (5)
CHD [16], and MPD [12] architectures, respectively. In addi- Deep pipeline design: In the whole architecture, a deep pipeline
tion, the normalized area efficiency of the RCG architecture is is designed to improve the throughput according to the prop-
11.7×, 2.88×, and 2.39× that of the designs in the EPD [10], erties of the proposed algorithm. For example, the processing
CHD [16], and MPD [12] architectures, respectively. Note that element array and user-level pipeline module in Section V-A
the architectures in [10] and [16] support higher modulation and Section V-B implement deep pipelines according to the
schemes. characteristics of the algorithm. Therefore, high throughput and
The proposed architecture has a throughput of 1.5 Gbps, and hardware utilization are achieved, which results in high energy
the throughputs of the related works are 0.3-8 Gbps [12], [16]. hardware utilization are achieved, which results in high energy
Notably, the frequency of the architecture proposed in [19], [20] algorithm and architecture, the proposed architecture achieves
reaches 1 GHz. Because the throughput is related to the amount of advantages over related works in terms of normalized energy
resource provision, area (gate count) and power are both critical and area efficiencies.
data for evaluating efficiency. In the proposed architecture, the
gate count is 1372 kGE, and the power consumption is 557 mW.
VII. CONCLUSION
In related works, the gate counts are 347-6650 kGE [17], [19],
[20], and the power consumptions are 26.5-1720 mW. Without We have proposed a modified massive MIMO detection al-
considering preprocessing, the gate count of the proposed archi- gorithm based on the RCG method. In addition, according to
tecture is 596 kGE, and the power consumption is 120 mW. The the properties of massive MIMO systems, a quadrant-certain-
gate counts of architectures in [17], [19], [20] are 148-3607 kGE, based initial method is proposed. Moreover, an approximated
and the power consumptions are 18-127 mW. [10] and [16] adopt LLR computation method is proposed in RCG to simplify
28 nm technology, [12] adopts 40 nm technology, [19] and [20] the calculations. Then, we theoretically demonstrate that the
adopt 45 nm technology. This paper adopts 65 nm technology. proposed RCG method achieves low approximated error. Sim-
The normalized methodology used in this paper has two aspects: ulation results shows that the SNR required to achieve an SER
(1) technology normalized to 65 nm technology. (2) MIMO of 10−2 is 10.16 dB, which is close to that of the state-of-the-
size normalized to 128 × 8. These two normalized methods are art designs (9.66-10.42 dB). To demonstrate the effectiveness
commonly used in related research and could provide reasonable of our algorithm, we have proposed a VLSI architecture that
information. The normalized energy efficiency of the proposed consists of a parallel processing element array and a deeply
architecture is higher than that of [19], [20], and the normalized pipelined user-level method. We have verified the architecture
area efficiency is higher than that of [10], [16], [25]. The pro- on FPGA and fabricated a chip on silicon with TSMC 65 nm
posed architecture has advantages over related works in terms CMOS technology. The chip achieves 1.5 Gbps throughput
of throughput, power consumption, area, normalized energy and with 3.5 mm2 silicon area and 557 mW power consumption
area efficiencies, owing mainly to the following reasons: (1) at 1.2 V. This architecture achieves 2.69 Mbps/mW energy
Low complexity of the algorithm: According to Table I and efficiency and 1.09 Mbps/kG area efficiency, respectively, which
Table II, the proposed RCG algorithm has lower complexity are 2.39 to 10.60× and 1.15 to 8.81× those of state-of-the-art
than the other algorithms, such as NSA [19], [20] and CHD [16]. designs. In future work, we plan to consider algorithms and VLSI
When compared with nonlinear algorithms, such as EPD [10] architectures for situations with imperfect channel information.
and MPD [12], the RCG algorithm also maintains its complexity In addition, algorithms and architectures for higher modulation
advantage. Therefore, the RCG algorithm will consume fewer schemes will be considered.
hardware resources than other algorithms when realizing similar
throughput. Additionally, higher normalized energy and area
REFERENCES
efficiencies can be achieved. Complexity is one consideration
for the algorithm design in this paper. (2) High parallelism of [1] G. Peng, L. Liu, S. Zhou, Q. Wei, S. Yin, and S. Wei, “A 2.69 Mbps/mW
1.09 Mbps/kge conjugate gradient-based MMSE detector for 64-QAM
the algorithm: According to the analysis of Section III-C, the 128 × 8 massive MIMO systems,” in Proc. IEEE Asian Solid State Circuits
proposed RCG algorithm has high parallelism in the iterative Conf., Tainan, Taiwan, Nov. 2018, pp. 191–194.
process and therefore has higher throughput and normalized [2] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers
of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11,
energy and area efficiencies. The other related designs, such pp. 3590–3600, Nov. 2010.
as CHD [16], LUD [17], and WeJi [25], face problems with [3] F. Rusek et al., “Scaling up MIMO: Opportunities and challenges with
parallelism. High parallelism is also an important consideration very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60,
Jan. 2013.
for the algorithm design of this paper. (3) Optimized co-design [4] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, “Energy and spectral
of the algorithm and architecture: The design of the algorithm efficiency of very large multiuser MIMO systems,” IEEE Trans. Commun.,
fully considers the hardware implementation. In other words, vol. 61, no. 4, pp. 1436–1449, Apr. 2013.


[5] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.
[6] M. O. Damen, H. El Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Trans. Inf. Theory, vol. 49, no. 10, pp. 2389–2402, Oct. 2003.
[7] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
[8] G. Peng, L. Liu, S. Zhou, Y. Xue, S. Yin, and S. Wei, “Algorithm and architecture of a low-complexity and high-parallelism preprocessing-based K-best detector for large-scale MIMO systems,” IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1860–1875, Apr. 2018.
[9] C.-H. Liao, T.-P. Wang, and T.-D. Chiueh, “A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications,” IEEE J. Solid-State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.
[10] W. Tang, H. Prabhu, L. Liu, V. Öwall, and Z. Zhang, “A 1.8Gb/s 70.6pJ/b 128 × 16 link-adaptive near-optimal massive MIMO detector in 28 nm UTBB-FDSOI,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2018, pp. 224–226.
[11] W. Tang, C.-H. Chen, and Z. Zhang, “A 0.58 mm² 2.76 Gb/s 79.8 pJ/b 256-QAM massive MIMO message-passing detector,” in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2016, pp. 1–2.
[12] Y. T. Chen, C. C. Cheng, T. L. Tsai, W. C. Sun, Y. L. Ueng, and C. H. Yang, “A 501 mW 7.61 Gb/s integrated message-passing detector and decoder for polar-coded massive MIMO systems,” in Proc. IEEE Symp. VLSI Circuits, Kyoto, Japan, Jun. 2017, pp. 330–331.
[13] O. Castaneda, T. Goldstein, and C. Studer, “Data detection in large multi-antenna wireless systems via approximate semidefinite relaxation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 12, pp. 2334–2346, Dec. 2016.
[14] Y. Jiang, M. K. Varanasi, and J. Li, “Performance analysis of ZF and MMSE equalizers for MIMO systems: An in-depth study of the high SNR regime,” IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2008–2026, Apr. 2011.
[15] S. Ozyurt and M. Torlak, “Exact joint distribution analysis of zero-forcing V-BLAST gains with greedy ordering,” IEEE Trans. Wireless Commun., vol. 12, no. 11, pp. 5377–5385, Dec. 2012.
[16] H. Prabhu, J. N. Rodrigues, L. Liu, and O. Edfors, “A 60 pJ/b 300 Mb/s 128 × 8 massive MIMO precoder-detector in 28 nm FD-SOI,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2017, pp. 60–61.
[17] C. H. Chen, W. Tang, and Z. Zhang, “A 2.4 mm² 130 mW MMSE-nonbinary-LDPC iterative detector-decoder for 4 × 4 256-QAM MIMO in 65 nm CMOS,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, USA, Feb. 2015, pp. 338–340.
[18] B. Yin, M. Wu, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, “A 3.8 Gb/s large-scale MIMO detector for 3GPP LTE-advanced,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 2014, pp. 3879–3883.
[19] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, “Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA implementations,” IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 916–929, Oct. 2014.
[20] B. Yin, “Low complexity detection and precoding for massive MIMO systems: Algorithm, architecture, and application,” Ph.D. dissertation, Dept. Elect. Comput. Eng., Rice Univ., Houston, TX, USA, 2014.
[21] L. Dai, X. Gao, X. Su, S. Han, I. Chih-Lin, and Z. Wang, “Low-complexity soft-output signal detection based on Gauss-Seidel method for uplink multi-user large-scale MIMO systems,” IEEE Trans. Veh. Technol., vol. 64, no. 10, pp. 4839–4845, Oct. 2015.
[22] Z. Wu, C. Zhang, Y. Xue, S. Xu, and X. You, “Efficient architecture for soft-output massive MIMO detection with Gauss-Seidel method,” in Proc. IEEE Int. Symp. Circuits Syst., Montreal, QC, Canada, May 2016, pp. 1886–1889.
[23] X. Gao, L. Dai, Y. Hu, and Z. Wang, “Matrix inversion-less signal detection using SOR method for uplink large-scale MIMO systems,” in Proc. IEEE Global Telecommun. Conf., San Diego, CA, USA, Dec. 2015, pp. 3291–3295.
[24] P. Zhang, L. Liu, G. Peng, and S. Wei, “Large-scale MIMO detection design and FPGA implementations using SOR method,” in Proc. IEEE Int. Conf. Commun. Softw. Netw., Beijing, China, Jun. 2016, pp. 206–210.
[25] G. Peng, L. Liu, S. Zhou, S. Yin, and S. Wei, “A 1.58 Gbps/W 0.40 Gbps/mm² ASIC implementation of MMSE detection for 128 × 8 64-QAM massive MIMO in 65 nm CMOS,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 5, pp. 1717–1730, May 2018.
[26] J. Minango, C. de Almeida, and C. D. Altamirano, “Low-complexity MMSE detector for massive MIMO systems based on damped Jacobi method,” in Proc. IEEE Int. Symp. Pers., Indoor Mobile Radio Commun., Montreal, QC, Canada, Oct. 2017, pp. 1–5.
[27] J. Chen, C. Fei, H. Lu, G. E. Sobelman, and J. Hu, “Hardware efficient massive MIMO detector based on the Monte Carlo tree search method,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 7, no. 4, pp. 523–533, Dec. 2017.
[28] M. Wu, C. Dick, J. R. Cavallaro, and C. Studer, “High-throughput data detection for massive MU-MIMO-OFDM using coordinate descent,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 12, pp. 2357–2367, Dec. 2016.
[29] G. Peng, L. Liu, P. Zhang, S. Yin, and S. Wei, “Low-computing-load, high-parallelism detection method based on Chebyshev iteration for massive MIMO systems with VLSI architecture,” IEEE Trans. Signal Process., vol. 65, no. 14, pp. 3775–3788, Jul. 2017.
[30] J. Chen, Z. Zhang, H. Lu, J. Hu, and G. E. Sobelman, “An intra-iterative interference cancellation detector for large-scale MIMO communications based on convex optimization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 11, pp. 2062–2072, Nov. 2016.
[31] D. R. Kincaid and E. W. Cheney, Numerical Analysis: Mathematics of Scientific Computing. 3rd ed. Belmont, CA, USA: Wadsworth, 2002.
[32] B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “Conjugate gradient-based soft-output detection and precoding in massive MIMO systems,” in Proc. IEEE Global Telecommun. Conf., Austin, TX, USA, Dec. 2014, pp. 3696–3701.
[33] B. Yin, M. Wu, J. R. Cavallaro, and C. Studer, “VLSI design of large-scale soft-output MIMO detection using conjugate gradients,” in Proc. IEEE Int. Symp. Circuits Syst., Lisbon, Portugal, May 2015, pp. 1498–1501.
[34] B. E. Godana and T. Ekman, “Parametrization based limited feedback design for correlated MIMO channels using new statistical models,” IEEE Trans. Wireless Commun., vol. 12, no. 10, pp. 5172–5184, Oct. 2013.

Leibo Liu (Senior Member, IEEE) received the B.S. degree in electronic engineering and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 1999 and 2004, respectively. He is currently a Professor with the Institute of Microelectronics, Tsinghua University. His current research interests include reconfigurable computing, mobile computing, and VLSI digital signal processing.

Guiqiang Peng received the B.S. degree from the School of Micro-Electronics and Solid-State Electronics, University of Electronic Science and Technology of China, Chengdu, China, in 2013. He is currently working toward the Ph.D. degree with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include reconfigurable computing, mobile computing, and VLSI signal processing and wireless communications.

Pan Wang received the B.S. degree in microelectronics science and engineering from Central South University, Changsha, China, in 2017. He is currently working toward the master’s degree with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include reconfigurable computing, mobile computing, and VLSI signal processing and wireless communications.


Sheng Zhou (Member, IEEE) received the B.E. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2005 and 2011, respectively. From January to June 2010, he was a Visiting Student with the Wireless Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA, USA. He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. His research interests include cross-layer design for multiple-antenna systems, edge computing and caching, and green wireless communications.

Qiushi Wei received the B.S. degree from the School of Physical Electronics, University of Electronic Science and Technology of China, Chengdu, China, in 2015. He is currently working toward the master’s degree with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include reconfigurable computing, mobile computing, and VLSI signal processing and wireless communications.

Shouyi Yin (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000, 2002, and 2005, respectively. He was with Imperial College London as a Research Associate. He is currently with the Institute of Microelectronics, Tsinghua University, as an Associate Professor. His research interests include mobile computing, wireless communications, and SoC design.

Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from La Faculté Polytechnique de Mons, Mons, Belgium, in 1991. He became a Professor with the Institute of Microelectronics, Tsinghua University, in 1995. He is currently a Senior Member of the Chinese Institute of Electronics. His main research interests include VLSI SoC design, EDA methodology, and ASIC design for communications.
