15-3 (8102)

15.3 (8102)
IEEE Asian Solid-State Circuits Conference
November 5-7, 2018 / Tainan, Taiwan

A 2.69 Mbps/mW 1.09 Mbps/kGE Conjugate Gradient-based MMSE Detector for 64-QAM 128×8 Massive MIMO Systems

Guiqiang Peng, Leibo Liu*, Qiushi Wei, Yao Wang, Shouyi Yin, and Shaojun Wei
Institute of Microelectronics, Tsinghua University, Beijing, China
Email: liulb@tsinghua.edu.cn

Abstract: This paper proposes a very-large-scale-integration (VLSI) architecture for an MMSE detector in an uplink 128×8 64-QAM massive MIMO system to achieve high energy and area efficiencies. A soft-output recursion conjugate gradient (RCG)-based minimum mean square error (MMSE) detector is fabricated on 3.5 mm² of silicon with TSMC 65 nm CMOS technology for a 64-QAM 128×8 massive MIMO system. A processing element (PE) array and a user-level pipeline are introduced for this architecture to achieve high energy and area efficiencies. The chip achieves a 1.5 Gbps throughput at a 500 MHz working frequency while dissipating 557 mW at 1.2 V. The energy efficiency (throughput/power) and area efficiency (throughput/area) are 2.69 Mbps/mW and 1.09 Mbps/kGE, which are 2.39-to-2.47× and 1.15-to-8.81× those of the normalized state-of-the-art designs, respectively.

I. INTRODUCTION

Massive MIMO scales up antennas by orders of magnitude while serving tens of users. It is generally accepted that this technology will be applied in future wireless communication techniques such as 5G and beyond. However, the complicated signal detection for uplink processing in a base station (BS) is difficult to implement efficiently in circuits [1]. Maximum likelihood (ML) detection is an optimal detection algorithm. Nonetheless, the computing load increases markedly with the number of users and the modulation order, thus preventing the practical application of the ML algorithm.
Other non-linear algorithms [2]-[4], such as K-best, sphere decoding (SD), expectation-propagation detection (EPD), and message-passing detection (MPD), are able to achieve near-optimal ML detection performance. However, these non-linear architectures have low throughput and require substantial area and power.

To reduce the computational complexity, various linear detection algorithms have been proposed [5], [6], which can be employed in a massive MIMO system with a large but finite number of antennas and a comparatively small number of users. Considering the hardware architecture, however, this detection involves complicated matrix inversions and multiplications as well as low parallelism, which makes hardware implementation difficult as the number of users increases. Consequently, numerous methods have been proposed to further reduce the computational complexity, improve the parallelism of MMSE, and optimize hardware architectures [1], [5]-[7]. However, the throughput is limited, and the architectures require large amounts of hardware resources.

978-1-5386-6419-1/18/$31.00 ©2018 IEEE

Fig. 1. Uplink massive MIMO system.

This paper proposes a VLSI architecture for signal detection to achieve higher energy and area efficiencies (i.e., throughput/power and throughput/area) in an uplink massive MIMO system. Given that the results of a Gram matrix are required in subsequent detection, a parallel processing element (PE) array with single-sided input is designed. This parallel PE array substantially reduces latency using limited hardware, and the architecture operates in a deeply pipelined manner. In addition, a user-level pipeline architecture is proposed to achieve detection based on the recursion conjugate gradient (RCG) method, which is proposed to scale down the high computational complexity and improve the parallelism of the MMSE detection.
The proposed VLSI architecture is verified on an FPGA and fabricated on a chip with TSMC 65 nm CMOS technology for an uplink massive MIMO system. The measurement results show that this architecture achieves substantial improvements in energy and area efficiencies compared with other state-of-the-art architectures.

This paper is organized as follows. Section II presents the proposed RCG massive MIMO detection algorithm, and Section III describes the architecture of the RCG detector. Then, the silicon implementation results and the algorithm simulation results are shown in Section IV. Finally, a conclusion is given in Section V.

II. MASSIVE MIMO DETECTION BY RECURSION CONJUGATE GRADIENT

In an $N_r \times N_t$ MIMO system with $N_t$ transmitters on the user side and $N_r$ antennas on the BS side (predominantly $N_r \gg N_t$), the uplink system can be modeled as $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$, where $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ represents a Rayleigh flat-fading channel matrix; $\mathbf{s} \in \mathbb{C}^{N_t \times 1}$ denotes the transmitted signal vector, which is based on the 64-QAM modulation constellation set $\Omega$ in this paper; $\mathbf{n} \in \mathbb{C}^{N_r \times 1}$ is an additive white Gaussian noise vector with zero mean and variance $\sigma^2$; and $\mathbf{y} \in \mathbb{C}^{N_r \times 1}$ is the received signal vector at the BS. The uplink massive MIMO system is shown in Fig. 1. In classical MMSE detection, the estimation of the transmitted signal can be expressed as

$$\hat{\mathbf{s}} = \left(\mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}\right)^{-1}\mathbf{H}^H\mathbf{y} = \mathbf{W}^{-1}\mathbf{y}^{MF}, \tag{1}$$

where $\mathbf{W} = \mathbf{H}^H\mathbf{H} + N_0 E_s^{-1}\mathbf{I}_{N_t}$ is the MMSE filtering matrix with a noise power spectral density of $N_0$ and a signal power spectral density of $E_s$, and $\mathbf{y}^{MF} = \mathbf{H}^H\mathbf{y}$ denotes the matched-filter vector.

The RCG iteration can be expressed by

$$\mathbf{s}^{(k+1)} = \mathbf{s}^{(k)} + \alpha^{(k)}\mathbf{p}^{(k)}, \tag{2}$$

where $\mathbf{p}^{(k)}$ is an orthogonal basis [8], $k$ is the iteration number, and the parameter $\alpha^{(k)}$ can be calculated by

$$\alpha^{(k)} = \frac{\left(\mathbf{z}^{(k)}, \mathbf{z}^{(k)}\right)}{\left(\mathbf{p}^{(k)}, \mathbf{W}\mathbf{p}^{(k)}\right)}. \tag{3}$$

In (3), $\mathbf{z}^{(k)}$ represents the residual vector, which can be expressed as

$$\mathbf{z}^{(k)} = \mathbf{y}^{MF} - \mathbf{W}\mathbf{s}^{(k)}. \tag{4}$$

In addition, according to the Lanczos orthogonalization algorithm, the residual vector $\mathbf{z}^{(k+1)}$
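can be expressed as [8]

$$\mathbf{z}^{(k+1)} = \rho^{(k)}\left(\mathbf{z}^{(k)} - \gamma^{(k)}\boldsymbol{\eta}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\mathbf{z}^{(k-1)}, \tag{5}$$

where $\rho^{(k)}$, $\gamma^{(k)}$, and $\boldsymbol{\eta}^{(k)} = \mathbf{W}\mathbf{z}^{(k)}$ are iterative parameters. Combining (4) and (5), the estimation of the transmitted vector can be expressed as

$$\mathbf{s}^{(k+1)} = \rho^{(k)}\left(\mathbf{s}^{(k)} + \gamma^{(k)}\mathbf{z}^{(k)}\right) + \left(1 - \rho^{(k)}\right)\mathbf{s}^{(k-1)}. \tag{6}$$

Next, the iterative parameters are computed. Because the vectors $\mathbf{z}^{(k+1)}$, $\mathbf{z}^{(k)}$, and $\mathbf{z}^{(k-1)}$ are mutually orthogonal [8],

$$\left(\mathbf{z}^{(k+1)}, \mathbf{z}^{(k)}\right) = \left(\mathbf{z}^{(k+1)}, \mathbf{z}^{(k-1)}\right) = \left(\mathbf{z}^{(k)}, \mathbf{z}^{(k-1)}\right) = 0. \tag{7}$$

Hence, combining (5) and (7), the parameter $\gamma^{(k)}$ can be calculated as

$$\gamma^{(k)} = \frac{\left(\mathbf{z}^{(k)}, \mathbf{z}^{(k)}\right)}{\left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k)}\right)} = \frac{\xi^{(k)}}{\varphi^{(k)}}, \tag{8}$$

where $\xi^{(k)} = \left(\mathbf{z}^{(k)}, \mathbf{z}^{(k)}\right)$ and $\varphi^{(k)} = \left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k)}\right)$. In addition, according to (5) and (7), there are

$$\left(\mathbf{z}^{(k-1)}, \boldsymbol{\eta}^{(k)}\right) = \left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k-1)}\right), \quad \left(\mathbf{z}^{(k-1)}, \boldsymbol{\eta}^{(k)}\right) = \frac{\left(1 - \rho^{(k)}\right)\xi^{(k-1)}}{\rho^{(k)}\gamma^{(k)}}, \quad \left(\mathbf{z}^{(k)}, \boldsymbol{\eta}^{(k-1)}\right) = -\frac{\xi^{(k)}}{\rho^{(k-1)}\gamma^{(k-1)}}. \tag{9}$$

Therefore, the parameter $\rho^{(k)}$ can be computed as

$$\rho^{(k)} = \left[1 - \frac{\gamma^{(k)}\xi^{(k)}}{\gamma^{(k-1)}\xi^{(k-1)}\rho^{(k-1)}}\right]^{-1}. \tag{10}$$

In this proposed algorithm, to perform the iteration and calculate the parameters, there are several required initial settings such as $\rho^{(0)} = 1$, $\mathbf{z}^{(-1)} = \mathbf{z}^{(0)}$, and $\mathbf{s}^{(-1)} = \mathbf{s}^{(0)}$.

Fig. 2. Top-level block diagram for the proposed massive MIMO detector.

Fig. 3. (Panels: 128×8 MIMO and 128×16 MIMO.)

III. VLSI ARCHITECTURE

In this section, a VLSI architecture is designed to achieve massive MIMO detection based on the RCG detection algorithm. The architecture was designed for a 64-QAM, 128×8 massive MIMO system case study. Fig. 2 shows the top-level block diagram for the proposed massive MIMO detector. To achieve a higher throughput with limited hardware resources, the top-level architecture is fully pipelined. There are three computation blocks in the detector. First, the PE array is used to compute the matched-filter vector $\mathbf{y}^{MF}$ and the matrix $\mathbf{W}$. The outputs are used to compute the initial solution of the estimated vector in the pre-iterative block. Next, a user-level parallelism-based full pipeline iteratively applies the RCG method. The high computational complexity of the MMSE arises from a series of matrix multiplication and matrix inversion operations.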
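The recursion of Section II can be sketched in a few lines of NumPy. This is a minimal floating-point illustration, not the fixed-point datapath of the chip: the function name `rcg_mmse_detect`, the parameter name `n0_over_es`, and the zero-vector default for the initial solution are assumptions made for the example. Each iteration forms η = Wz, the two inner products ξ and φ, then γ and ρ, and finally applies the three-term updates of the signal estimate and the residual, so no matrix inversion is needed.

```python
import numpy as np


def rcg_mmse_detect(H, y, n0_over_es, num_iters=2, s0=None):
    """Approximate the MMSE estimate W^{-1} y_mf with the three-term CG recursion.

    H: (Nr, Nt) complex channel matrix; y: (Nr,) received vector.
    n0_over_es: regularization term N0/Es (hypothetical parameter name).
    s0: optional initial solution; defaults to the zero vector.
    """
    nt = H.shape[1]
    W = H.conj().T @ H + n0_over_es * np.eye(nt)   # MMSE filtering matrix
    y_mf = H.conj().T @ y                          # matched-filter vector

    s = np.zeros(nt, dtype=complex) if s0 is None else np.array(s0, dtype=complex)
    z = y_mf - W @ s                               # residual z = y_mf - W s
    s_prev, z_prev = s.copy(), z.copy()            # s^(-1) = s^(0), z^(-1) = z^(0)
    rho_prev = gamma_prev = xi_prev = 1.0          # placeholders, unused when k == 0

    for k in range(num_iters):
        eta = W @ z                                # eta^(k) = W z^(k)
        xi = np.vdot(z, z).real                    # xi^(k) = (z, z)
        if xi == 0.0:                              # residual is exactly zero: done
            break
        phi = np.vdot(z, eta).real                 # phi^(k) = (z, eta)
        gamma = xi / phi
        if k == 0:
            rho = 1.0                              # initial setting rho^(0) = 1
        else:
            rho = 1.0 / (1.0 - (gamma * xi) / (gamma_prev * xi_prev * rho_prev))
        # Three-term updates of the estimate and the residual.
        s_next = rho * (s + gamma * z) + (1.0 - rho) * s_prev
        z_next = rho * (z - gamma * eta) + (1.0 - rho) * z_prev
        s_prev, s = s, s_next
        z_prev, z = z, z_next
        rho_prev, gamma_prev, xi_prev = rho, gamma, xi
    return s
```

With `num_iters` equal to the number of users, the result should agree with a direct solve of W s = y_mf up to rounding; a small `num_iters` (the paper's simulations use K = 2) trades accuracy for far fewer multiplications.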
A modified RCG-based MMSE detector is proposed to reduce the number of real-valued multiplications (RMULs) from $O(N_t^3)$ to $O(N_t^2)$. For example, as shown in Fig. 3, the RMULs are reduced by 36.6% and 73.8% compared to the traditional MMSE detector for 128×8 and 128×16 MIMO systems, respectively. In addition, the proposed detector enhances the parallelism of each iteration. Finally, the soft outputs are computed in a log-likelihood-ratio (LLR) block. The two key blocks (processing element array and user-level pipeline) are detailed in the following subsections.

Fig. 4. Architecture of the processing element array.

Fig. 5. Architecture of the iterative block in the user-level pipeline.

Fig. 6. Die micrograph.

TABLE I. Comparisons of ASIC implementation results.

A. Processing Element Array

The architecture of the proposed PE array, which computes the vector $\mathbf{y}^{MF}$ and the matrix $\mathbf{W}$, is shown in Fig. 4. In the array, there are two types of PEs: $N_t$ PE-As and $(N_t^2 + N_t)/2$ PE-Bs. The PE-As are used to compute the diagonal elements of the matrix $\mathbf{W}$; these elements are real valued. For a 128×8 massive MIMO system, there are 8 PE-As and 36 PE-Bs. The PE-Bs are used to compute the off-diagonal elements of the matrix $\mathbf{W}$ (28 PE-Bs) and the vector $\mathbf{y}^{MF}$ (8 PE-Bs). In one clock cycle, 8 elements of $\mathbf{H}^H$ and $\mathbf{y}$ are input for computation, and the PEs are arranged in an echelon structure with high processing speed because the Gram matrix $\mathbf{G} = \mathbf{H}^H\mathbf{H}$ is a symmetric matrix. As shown in Fig. 4, each PE-A includes eight groups of the same arithmetic logical units (ALUs), one accumulator, and an adder for diagonal elements.
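The division of work between the two PE types can be illustrated with a short software model. The sketch below is an assumption-laden reading of the text, not the hardware: it assumes that one row of H (one BS antenna, N_t elements) and the matching element of y are consumed per step, and the names `pe_array_counts` and `stream_gram_and_mf` are hypothetical. It accumulates only the real diagonal (the PE-A role), the strictly upper triangle, and the matched-filter vector (the PE-B role), and fills the lower triangle by conjugate symmetry.

```python
import numpy as np


def pe_array_counts(nt):
    """Return (number of PE-As, number of PE-Bs) for nt users."""
    return nt, (nt * nt + nt) // 2


def stream_gram_and_mf(H, y, n0_over_es):
    """Accumulate W = H^H H + (N0/Es) I and y_mf = H^H y one antenna row at a time."""
    nr, nt = H.shape
    diag = np.zeros(nt)                          # PE-A accumulators (real valued)
    off = np.zeros((nt, nt), dtype=complex)      # PE-B accumulators, entries i < j only
    y_mf = np.zeros(nt, dtype=complex)           # PE-B accumulators for y_mf

    for r in range(nr):                          # assumed: one antenna row per step
        h = H[r]
        diag += (h.conj() * h).real              # |h_i|^2 terms of the diagonal
        for i in range(nt):
            for j in range(i + 1, nt):           # nt*(nt-1)/2 off-diagonal products
                off[i, j] += np.conj(h[i]) * h[j]
        y_mf += h.conj() * y[r]                  # matched-filter accumulation

    upper = np.triu(off, 1)
    # The lower triangle is the conjugate transpose of the upper one.
    W = upper + upper.conj().T + np.diag(diag + n0_over_es)
    return W, y_mf
```

For N_t = 8, `pe_array_counts(8)` gives 8 PE-As and 36 PE-Bs, matching the 28 off-diagonal plus 8 matched-filter PE-Bs quoted above; computing only one triangle roughly halves the complex multiplications compared with forming the full Gram matrix.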
The ALU is used to compute part of a diagonal element of the Gram matrix. Then, the results of all ALUs are accumulated to obtain the value of $G_{i,i}$. Fig. 4 also shows the details of the PE-B, which performs the complex-valued multiplications and accumulations. In addition, the hardware utilizations of both PE-A and PE-B are high. Therefore, this PE array achieves high throughput and high energy and area efficiencies.

B. User-level Pipeline

The user-level pipeline (Fig. 5) is designed to perform the RCG iteration and includes two blocks: the pre-iterative block and the iterative block. The initial solution and some parameters are computed in the pre-iterative block. Instead of zero, each element of the initial solution can be set to a certain point of one quadrant, chosen according to the quadrant in which the corresponding element of $\mathbf{y}^{MF}$ is located. In addition, there is no extra computational cost. The iterative block estimates and updates the signal $\mathbf{s}^{(k)}$ and the residual $\mathbf{z}^{(k)}$ according to the channel matrix $\mathbf{H}$. The iterative block includes two stages in total. The first stage calculates $\gamma^{(0)}$, which is required to update the signal and the residual according to the initial solution $\mathbf{s}^{(0)}$; then, it completes the first update of $\mathbf{s}^{(1)}$ and $\mathbf{z}^{(1)}$. The second stage calculates $\rho^{(1)}$ and $\gamma^{(1)}$; then, it updates the signal $\mathbf{s}^{(2)}$ according to the outputs of the first stage. The calculations of these two stages were deeply pipelined based on the user level, resulting in high parallelism. The hardware utilizations of PE-B, PE-C, and PE-D are high.

IV. MEASUREMENT RESULTS

The proposed MMSE detector was implemented on a 1.87 × 1.87 mm² silicon die using TSMC 65 nm CMOS technology. Fig. 6 shows the die micrograph of the chip.
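As a quick consistency check on the headline figures quoted in the abstract, the energy efficiency follows directly from the reported throughput and power. The equivalent gate count and the cycles-per-vector figure below are inferred here from the reported numbers and are not stated in the text; the latter additionally assumes that the throughput counts detected bits, with $8 \times 6 = 48$ bits per 64-QAM symbol vector for 8 users.

$$\frac{1500\ \text{Mbps}}{557\ \text{mW}} \approx 2.69\ \text{Mbps/mW}, \qquad \frac{1500\ \text{Mbps}}{1.09\ \text{Mbps/kGE}} \approx 1376\ \text{kGE}, \qquad \frac{48\ \text{bits}}{1.5\ \text{Gbps} / 500\ \text{MHz}} = 16\ \text{cycles per vector}.$$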
This chip achieved a 1.5 Gbps data rate at a 500 MHz working frequency while dissipating 557 mW. Table I lists the implementation results of the proposed detector and the state-of-the-art designs in [4]-[7]; these designs were those with results closest to the results of the proposed detector. To achieve a fair comparison, the energy and area efficiencies are normalized to the same MIMO dimension and process (full-scaling approach); the same normalization method is also used in [2]. The proposed architecture achieves 2.47× (1.15×) and 2.39× (8.81×) normalized energy (area) efficiencies compared with those of [7] and [6], respectively. Leaving aside the pre-iterative processing in the detector (that is, considering only the design after the pre-iterative processing), the proposed architecture achieves 12.5 Mbps/mW energy efficiency and 2.52 Mbps/kGE area efficiency, which are 6.85× (11.7×) and 7.18× (2.88×) those of the designs in [4] and [5], respectively.

Fig. 7. SER performance comparison between the proposed initial solution and the traditional zero-vector initial solution.

Fig. 8. SER performance comparisons between the proposed algorithm and other algorithms.

Fig. 9. SER performance comparisons at different MIMO dimensions.

To evaluate the performance of the proposed RCG method, simulated symbol-error-rate (SER) results for RCG are compared with state-of-the-art methods such as NSA and WeJi. The SER performance of exact MMSE detection based on the CHD method is also provided for comparison. In these simulations, the settings are a 64-QAM modulation scheme and a rate-1/2 industry-standard convolutional code with a $[133_o, 171_o]$ polynomial along with a random interleaver. In addition, the coding is performed over 120 symbols, and the number of frames is 10,000. The channels were assumed to exhibit i.i.d. Rayleigh fading across the coded symbols.
At the receiver, the LLRs are used as the soft input for Viterbi decoding. In addition, as in [4]-[7], the signal-to-noise ratio (SNR) is defined at the receiver. Fig. 7 shows a comparison between the proposed initial solution and the traditional zero-vector initial solution. According to Fig. 7, to achieve the same SER ($10^{-?}$), the proposed initial solution has a 1.74 dB gain compared with the traditional zero-vector initial solution. Fig. 8 shows the simulated SER results of the proposed RCG method and different detection algorithms in a 64-QAM 128×8 MIMO system. When K = 2, the SNR required to achieve an SER of $10^{-?}$ was 10.17 dB, which was close to the performance of the Cholesky-based MMSE detection (9.66 dB) [5]. In contrast, the required SNRs of the WeJi and NSA methods mentioned in [7] and [1] are 10.42 dB and >20 dB, respectively. Fig. 9 shows that, to achieve the same SER, the SNR required by the proposed algorithm is also smaller than that required by the NSA and WeJi methods, indicating that the proposed algorithm maintains its advantages at different MIMO dimensions.

V. CONCLUSION

This paper proposes an ASIC implementation of a massive MIMO detector based on a recursion conjugate gradient method, achieving near-optimal performance for massive MIMO systems. This architecture achieves high energy and area efficiencies compared with state-of-the-art designs. Thus, we believe that this detector makes an important contribution to next-generation massive MIMO communication systems.

REFERENCES

[1] M. Wu, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Studer, "Large-scale MIMO detection for 3GPP LTE: Algorithms and FPGA implementations," IEEE J. Sel. Topics Signal Process., vol. 8, no. 5, pp. 916-928, Oct. 2014.
[2] Y. T. Chen, C. C. Cheng, T. L. Tsai, W. C. Sun, Y. L. Ueng, and C. H. Yang, "A 501mW 7.61Gb/s integrated message-passing detector and decoder for polar-coded massive MIMO systems," in Proc. IEEE Symp. VLSI Circuits, Kyoto, Japan, Jun. 2017, pp. C330-C331.
[3] W. Tang, C.-H. Chen, and Z. Zhang, "A 0.58mm² 2.76Gb/s 79.8pJ/b 256-QAM massive MIMO message-passing detector," in Proc. IEEE Symp. VLSI Circuits, Honolulu, USA, Jun. 2016, pp. 250-281.
[4] W. Tang, H. Prabhu, L. Liu, V. Öwall, and Z. Zhang, "A 1.8Gb/s 70.6pJ/b 128×16 link-adaptive near-optimal massive MIMO detector in 28nm UTBB-FDSOI," in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, Feb. 2018, pp. 224-235.
[5] H. Prabhu, J. Rodrigues, L. Liu, and O. Edfors, "A 60pJ/b 300Mb/s 128×8 massive MIMO precoder-detector in 28nm FD-SOI," in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, Feb. 2017, pp. 60-61.
[6] C.-H. Chen, W. Tang, and Z. Zhang, "A 2.4mm² 130mW MMSE-nonbinary-LDPC iterative detector-decoder for 4×4 256-QAM MIMO in 65nm CMOS," in Proc. IEEE Int. Solid-State Circuits Conf. Dig., San Francisco, CA, Feb. 2015, pp. 358-338.
[7] G. Peng, L. Liu, S. Zhou, S. Yin, and S. Wei, "A 1.58 Gbps/W 0.40 Gbps/mm² ASIC implementation of MMSE detection for 128×8 64-QAM massive MIMO in 65 nm CMOS," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 5, pp. 1717-1730, May 2018.
[8] D. R. Kincaid and E. W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, 3rd ed. Belmont, CA: Wadsworth, 2002.
