
High-Throughput FPGA Implementation of a Low-Complexity Detector for High-Rate Spatial Modulation
Van-Son Trinh, Thuong-Duc Duong, Vu-Duc Ngo
School of Electronics and Telecommunication, Hanoi University of Science and Technology, Vietnam
Email: trinhvanson92@gmail.com, duongthuongduc1292@gmail.com

Abstract: This paper proposes a hardware architecture that implements the Improved MMSE-SQRD (ISQRD) detection algorithm, one of the low-complexity suboptimal detectors proposed in [1] for the High-Rate Spatial Modulation (HR-SM) scheme introduced in [2]. The architecture is designed to maximize throughput and to reach a high clock frequency. An FPGA implementation is deployed on the Xilinx Virtex series. Finally, the simulation results are presented and compared with some related works.

Index Terms: MIMO system, High-rate Spatial Modulation, FPGA implementation, ISQRD, Improved MMSE-SQRD.

I. INTRODUCTION

Multiple-Input Multiple-Output (MIMO) wireless communication systems, which use multiple transmit and receive antennas, have been researched for many years. They were theoretically shown in [3], [4] to achieve higher spectral efficiency than conventional single-antenna systems. Many techniques have been developed to detect the transmitted signal. The maximum-likelihood (ML) detector gives the optimal performance but has high computational complexity, so many suboptimal MIMO detection techniques have been studied to reduce the complexity of the detector.

First, linear detectors such as zero-forcing (ZF) and minimum mean square error (MMSE) have lower complexity than the ML detector. The ZF detector completely removes the interference between symbols, but because of the correlation between the noise terms it works poorly unless the channel matrix is well conditioned. The MMSE detector represents a trade-off between noise amplification and interference suppression, so it gives some improvement in performance. However, both ZF and MMSE detectors have to find the pseudo-inverse of a matrix, so their computational complexity is still high, even when methods such as QR decomposition are used to avoid computing the pseudo-inverse directly. The problem is that they still make a full search for each signal received at each antenna.

To solve this problem, various successive interference cancellation (SIC) detectors were devised. They decrease the complexity by detecting the signals successively, treating each previously detected signal as known. However, the error rate increases because of error propagation. In [5], the ZF-BLAST detector was developed for signal detection in the layered space-time system. This detector limits error propagation by detecting the signals in the optimal order, from the strongest signal to the weakest. An adaptation of the original ZF-BLAST to the MMSE criterion, called MMSE-BLAST, was presented in [6], and a version with lower complexity was introduced in [7]. This optimal-order successive detection requires multiple calculations of pseudo-inverses, which is a computationally expensive task. A reduced-complexity detection algorithm that uses a sorted QR decomposition of the channel matrix, called MMSE-SQRD, was proposed in [8]. Nonetheless, its bit error rate (BER) performance is worse than that of MMSE-BLAST because it does not perform exactly optimal-order detection. Another line of research studies iterative tree-search algorithms such as the sphere decoder or the K-Best detector, which reduce the computational complexity while bringing the BER performance close to that of ML detection.

MIMO wireless communication systems can be classified into three main categories: space-time coding to achieve diversity gain, MIMO precoding with channel state information (CSI) available at the transmitter to achieve capacity gain, and layered space-time coding to exploit multiplexing gain. HR-SM is a spatial modulation scheme proposed in [2] that provides a substantial increase in spectral efficiency compared to the conventional spatial modulation (SM) scheme. In [1], the authors proposed several low-complexity suboptimal detectors for the HR-SM scheme, called MSQRD, MBLAST and ISQRD, which are modified versions of the MMSE-SQRD and MMSE-BLAST detectors.

In this paper, we are interested in the ISQRD detector, which makes a full search over one signal layer and handles the remaining layers with the MMSE-SQRD algorithm. ISQRD is shown to perform remarkably better than MBLAST (compared in Fig. 5), with lower complexity for 4-QAM modulation and acceptable complexity for 16-QAM. We therefore propose a hardware architecture of the ISQRD detector for an HR-SM system with 4-QAM modulation and 4x4 antennas. In this architecture, the detector is fully pipelined so that it can detect one vector y per clock cycle. The SQRD block uses a sorted Modified Gram-Schmidt QR decomposition and has a throughput of one output every 8 clock cycles. In particular, it can handle channel matrices that arrive at any spacing, as long as the spacing is at least 8 clock cycles. This novel feature makes it adaptable to many wireless interfaces. The FPGA implementation targets the Xilinx Virtex-5. The synthesized design uses 26128 slice registers, 17197 slice LUTs and 260 DSPs. The maximum frequency is 364.7 MHz, so the throughput of the design can reach 2.9 Gbps. These synthesis results are analyzed and compared with some related works.

The remainder of the paper is organized as follows. Section II introduces the system model and notation. Section III recalls the ISQRD algorithm. Section IV describes the hardware implementation. Section V reports the results, and Section VI concludes the paper.

Fig. 1. System model

TABLE I
ISQRD DETECTION ALGORITHM

Input: $Y$, $\bar{H}$
Output: $x$, $\bar{s}$
1. Decompose $D_t = \begin{bmatrix} H_t \\ \frac{1}{\sqrt{E_s}} I_{n_T-1} \end{bmatrix}$ using the MMSE-SQRD algorithm to get $Q$, $R$, and the permutation vector $P$.
2. Detection and cancellation:
   for $m = 1 : M$
      compute $t_m = Y - H_1 \times x_m$ and $v = Q^H \begin{bmatrix} t_m \\ 0 \end{bmatrix}$
      for $k = n_T - 1 : -1 : 1$
         for $l = k + 1 : n_T - 1$
            $v_k = v_k - r_{k,l} \times c_{m,l}$
         end
         $c_{m,k} = \arg\min_{c \in \Omega_x} \| v_k - c \times r_{k,k} \|^2$
      end
      compute $d_m = \| t_m - H_t \times \bar{c}_{m,p} \|^2$
   end
3. Find $\hat{m}$: $\hat{m} = \arg\min_m d_m$
4. Obtain the recovered modulated symbol $x = x_{\hat{m}}$ and the recovered SC codeword $\bar{s} = x^{-1} \bar{c}$
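The steps of Table I can be sketched in NumPy as follows. This is an illustrative floating-point model under stated assumptions, not the hardware design: the function names, the unit-energy 4-QAM constellation, and `Es = 1` are assumptions, and the MMSE-SQRD step is replaced by the plain unsorted `np.linalg.qr`, so the permutation vector P is the identity and the detection order is not optimized.

```python
import numpy as np

# Unit-energy 4-QAM constellation (assumption: Es = 1).
QAM4 = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def slice_symbol(v, r, omega):
    """Return argmin over c in omega of |v - c * r|^2."""
    return omega[np.argmin(np.abs(v - omega * r) ** 2)]


def isqrd_detect(Y, H_bar, omega=QAM4, Es=1.0):
    """Full search over x, SIC over the remaining layers (Table I).

    ASSUMPTION: an unsorted QR stands in for MMSE-SQRD, so P = identity.
    Returns (x_hat, s_bar_hat, distance).
    """
    n_t = H_bar.shape[1]
    H1, Ht = H_bar[:, 0], H_bar[:, 1:]

    # Step 1: QR of the augmented matrix D_t = [Ht; I / sqrt(Es)].
    D = np.vstack([Ht, np.eye(n_t - 1) / np.sqrt(Es)])
    Q, R = np.linalg.qr(D)
    Q1 = Q[: Ht.shape[0], :]  # rows that multiply t in Q^H [t; 0]

    best = (np.inf, None, None)
    for x in omega:  # Step 2: one candidate per constellation point
        t = Y - H1 * x
        v = Q1.conj().T @ t
        c_bar = np.zeros(n_t - 1, dtype=complex)
        for k in range(n_t - 2, -1, -1):
            # Cancel the already detected layers, then slice.
            v_k = v[k] - R[k, k + 1:] @ c_bar[k + 1:]
            c_bar[k] = slice_symbol(v_k, R[k, k], omega)
        d = np.linalg.norm(t - Ht @ c_bar) ** 2
        if d < best[0]:
            best = (d, x, c_bar)

    d, x_hat, c_bar = best  # Step 3: smallest distance wins
    return x_hat, c_bar / x_hat, d  # Step 4: s_bar = x^{-1} c_bar


# Noise-free usage example on a scaled identity channel.
H_demo = 10 * np.eye(4, dtype=complex)
s_demo = np.array([1, -1, 1j, -1j])
x_demo = QAM4[2]
Y_demo = H_demo @ (x_demo * s_demo)
x_hat, s_hat, dist = isqrd_detect(Y_demo, H_demo)
```

Because the slicing uses the MMSE-regularized QR factors, recovery of the transmitted codeword is not guaranteed for an arbitrary channel even without noise; the scaled identity channel in the usage example is chosen so that the result is easy to check by hand.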

II. SYSTEM MODEL

We consider a specific HR-SM system with $n_T = 4$ transmit and $n_R = 4$ receive antennas in the presence of a quasi-static Rayleigh fading MIMO channel, as illustrated in Fig. 1. During a symbol period, $l + m$ data bits are fed into the HR-SM transmitter. The first $l$ bits are mapped to an SC codeword $s$, one of the $K$ SC codewords in the spatial constellation $\Omega_s$. The remaining $m$ bits are modulated by an $M$-QAM/PSK modulator to obtain the modulated symbol $x$ in the constellation $\Omega_x$. In this paper, we use a 4-QAM modulator. As proposed in [2], the SC codeword $s$ is designed by fixing its first element to 1 and assigning the remaining elements values chosen from the set $\{\pm 1, \pm j\}$, for example $s = [1, -1, j, -j]$. Therefore, this system has a total of $K = 4^3 = 64$ SC codewords $s \in \Omega_s$. The codeword $c$ is created by multiplying $s$ and $x$, that is $c = s \times x$, so that $c = [x, x \times s_1, x \times s_2, x \times s_3] = [c_0; c_1; c_2; c_3]$. This codeword $c$ is then transmitted via the 4 antennas within a symbol period. At the receiver, the $4 \times 1$ received signal matrix $Y$ is given by:

$$Y = \sqrt{\frac{\gamma}{4 E_s}}\, H c + N \tag{1}$$

or

$$Y = \bar{H} s x + N \tag{2}$$

where $\bar{H} = \sqrt{\frac{\gamma}{4 E_s}}\, H = [H_1, H_2, H_3, H_4]$, and $H_1, H_2, H_3, H_4$ are the columns of $\bar{H}$. $H$ and $N$ denote the $4 \times 4$ channel matrix and the $4 \times 1$ noise matrix, respectively. The entries of $H$ and $N$ are assumed to be independent and identically distributed (i.i.d.) complex Gaussian random variables with zero mean and unit variance. $c = s \times x$ and $s$ denote the $4 \times 1$ HR-SM codeword matrix and the $4 \times 1$ SC codeword matrix, respectively. $E_s$ is the average symbol energy of $x$, and $\gamma$ is the average SNR at each receive antenna. The HR-SM codeword $c$ is detected at the receiver by the ISQRD detector proposed in [1].

III. IMPROVED MMSE-SQRD (ISQRD) ALGORITHM

A. The ISQRD Algorithm

In this system, the detector uses the ISQRD algorithm to detect the vector $c$ from the signal vector $Y$ received by the antennas and the channel matrix $H$, which is assumed to be known (for example, it can be estimated from training sequence bits). The idea of the algorithm is as follows. From (2) we get:

$$t = Y - H_1 \times x = H_t \times \bar{c} + N \tag{3}$$

where $H_1$ is the first column of $\bar{H}$, $H_t = [H_2, H_3, H_4]$, and $\bar{c} = [c_1; c_2; c_3]$. Equation (3) shows that the MMSE-SQRD algorithm can be applied to detect the vector $\bar{c}$. However, $x$ in (3) is unknown, so the solution is to make a full search over $x$. For each candidate $x$, we obtain a vector $\bar{c}$ and then compute the distance of the detected vector $c = [x, \bar{c}]$ with the function $d = \|Y - \bar{H} \times c\|^2 = \|t - H_t \times \bar{c}\|^2$. Finally, we compare the distances to decide the final vector $c$. The ISQRD algorithm is summarized in Table I.

B. Sorted QR Decomposition

Step one of the ISQRD algorithm uses the MMSE-SQRD algorithm proposed in [8], which is an extension of the ZF-SQRD algorithm introduced by the same authors. Instead of detecting the signals in the optimal order as V-BLAST-based detection does, it applies the same idea in a suboptimal order to reduce the computational complexity. The MMSE-SQRD in [8] is based on the Modified Gram-Schmidt (MGS) procedure. This algorithm was implemented in VLSI for a 4x4 matrix using a fixed-point number representation in [9], where it was compared fairly with another VLSI implementation of the MMSE-SQRD algorithm, based on Givens rotations (GR), that had been proposed earlier in [10]. The comparison clearly showed the superiority of the GR-based VLSI solution in terms of area, processing cycles, and throughput. However, in this paper we still use the MGS-based MMSE-SQRD algorithm for two reasons: 1) it is easy to implement on an FPGA; 2) we are the first group to implement an ISQRD detector for the HR-SM system, so we chose the variant that is easy to understand and widely used. The details of this algorithm are given in Table II.
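As a floating-point reference for the MGS-based sorted QR decomposition of Table II, the procedure can be sketched in NumPy as shown here. The function name `sorted_mgs_qrd` and the random 7 x 3 test matrix are assumptions made for illustration; the sketch takes $r_{ii}$ as the square root of the running column norm and says nothing about the fixed-point hardware.

```python
import numpy as np


def sorted_mgs_qrd(D):
    """Sorted QR decomposition by modified Gram-Schmidt.

    Returns Q, R, P with D[:, P] == Q @ R up to rounding, where the
    columns are processed in order of smallest remaining norm first.
    """
    Q = np.array(D, dtype=complex)  # working copy of the input
    n = Q.shape[1]
    R = np.zeros((n, n), dtype=complex)
    P = np.arange(n)
    norms = np.sum(np.abs(Q) ** 2, axis=0)  # squared column norms

    for i in range(n):
        # Pick the remaining column with the smallest norm.
        k = i + int(np.argmin(norms[i:]))
        # Swap columns i and k in Q, R, P and the norm vector.
        Q[:, [i, k]] = Q[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]
        P[[i, k]] = P[[k, i]]
        norms[[i, k]] = norms[[k, i]]

        R[i, i] = np.sqrt(norms[i])
        Q[:, i] = Q[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = np.vdot(Q[:, i], Q[:, j])  # q_i^H q_j
            Q[:, j] = Q[:, j] - R[i, j] * Q[:, i]
            norms[j] = norms[j] - np.abs(R[i, j]) ** 2
    return Q, R, P


# Usage example: a random 7 x 3 matrix, the shape of D_t for n_T = 4.
rng = np.random.default_rng(0)
D = rng.standard_normal((7, 3)) + 1j * rng.standard_normal((7, 3))
Q, R, P = sorted_mgs_qrd(D)
```

Updating the norms by subtraction avoids recomputing each column norm in every iteration, at the cost of some accumulated rounding error.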

TABLE II
MMSE-SQRD BASED ON MGS ALGORITHM

Input: $D$
Output: $Q$, $R$, $P$
1. $R = 0$, $Q = D$, $P = 1 : n_T'$ ($n_T' = n_T - 1$ for this system)
2. for $k = 1$ to $n_T'$
      $norm(k) = \|Q_k\|^2$
3. end
4. for $i = 1$ to $n_T'$
   a. $k = \arg\min_{k = i, \dots, n_T'} norm(k)$
   b. Swap columns $k$ and $i$ in $Q$, $P$, $norm$, $R$.
   c. $r_{ii} = \sqrt{norm(i)}$
   d. $q_i = q_i / r_{ii}$
   e. for $k = i + 1 : n_T'$
         $r_{ik} = q_i^H \times q_k$
         $q_k = q_k - r_{ik} \times q_i$
         $norm(k) = norm(k) - |r_{ik}|^2$
   f. end
5. end

IV. FPGA IMPLEMENTATION

In this section we consider the architecture and implementation of the ISQRD detector for an HR-SM system with 4-QAM modulation and 4×4 antennas. Fig. 2 shows the block diagram of the ISQRD detector. For this detector, we assume that the input signals are normalized and that the $E_s$ value in the $D$ matrix is a known constant. The SQRD block performs three functions. First, it decomposes the $D$ matrix into the $Q$ and $R$ matrices and the permutation vector $P$ using the MMSE-SQRD algorithm. Second, it buffers the channel matrix $H$ and applies the permutation vector $P$ to it, outputting the first column, $H_1$, and the $H_t$ matrix. Third, it pushes each output at the time the detector needs it, which means that its outputs are not all produced at the same time. The detector block performs the main part of the ISQRD algorithm. The Y buffer block delays the y signal in a FIFO buffer to synchronize it with the other input signals of the detector block. The following two parts detail the SQRD block and the detector block.

Fig. 2. ISQRD detector

A. SQRD block

Each term in the $D$ matrix is a complex number with a real part and an imaginary part. Each part is represented by a 12-bit fixed-point number with 4 integer bits and 8 fraction bits. Blocks that contain accumulate operations, such as the block computing the norm value, use a more accurate number representation to reduce the truncation error caused by the fixed-point type. For the FPGA implementation, the divider and square root blocks use the available Xilinx library, which gives good speed and hardware usage. The nd (new data) signal indicates that new data has been fetched, and the block receives each input over 4 clock cycles while nd is high. The latency is 143 clock cycles, and the throughput can reach one $H$ matrix every 8 clock cycles. In particular, the block can handle channel matrices that arrive at any spacing, as long as the spacing is at least 8 clock cycles. For example, the first channel matrix arrives at the beginning, the second arrives 8 clock cycles after the first, and the third arrives 9 clock cycles after the second. This feature makes the block adaptable to many wireless interfaces. To support this feature, the hardware architecture is organized like a waterfall through which the signals flow. This requires splitting the hardware into many sub-state machines to synchronize the signals, but it is well worth the cost. The block diagram of stage one, which performs the first loop iteration ($i = 1$) of the algorithm in Table II, is shown in Fig. 3. In this diagram, the main computing blocks are colored purple. The result of each computing block is held in storage blocks. A storage block is activated or reactivated by the nd (new data) signal and samples the stored signal cyclically every 8 clock cycles.

Fig. 3. Stage 1 of SQRD diagram

B. Detector block

The detector block performs the steps that follow the decomposition of the $D$ matrix. It consists of 4 calculating blocks in parallel. Each block computes the Euclidean distance for one signal in the signal constellation, and another block then compares the distances to find the smallest value. The decoded signal is the one with the smallest Euclidean distance.
Each calculating block performs the following steps:

1) Step 1:
$$\begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} - \begin{bmatrix} h_{11} x \\ h_{21} x \\ h_{31} x \\ h_{41} x \end{bmatrix}$$

2) Step 2:
$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} q_{11}^* & q_{21}^* & q_{31}^* & q_{41}^* \\ q_{12}^* & q_{22}^* & q_{32}^* & q_{42}^* \\ q_{13}^* & q_{23}^* & q_{33}^* & q_{43}^* \end{bmatrix} \times \begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{bmatrix}$$

3) Step 3:
a) $c_3 = \arg\min_{c \in \Omega_x} |v_3 - c \cdot r_{33}|^2$
b) $c_2 = \arg\min_{c \in \Omega_x} |(v_2 - r_{23} \cdot c_3) - c \cdot r_{22}|^2$
c) $c_1 = \arg\min_{c \in \Omega_x} |(v_1 - r_{13} \cdot c_3 - r_{12} \cdot c_2) - c \cdot r_{11}|^2$

4) Step 4:
$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} = \begin{bmatrix} h_{12} & h_{13} & h_{14} \\ h_{22} & h_{23} & h_{24} \\ h_{32} & h_{33} & h_{34} \\ h_{42} & h_{43} & h_{44} \end{bmatrix} \times \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}$$

5) Step 5:
$$\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix} = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{bmatrix} - \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}$$

6) Step 6: $d_m = |d_1|^2 + |d_2|^2 + |d_3|^2 + |d_4|^2$

Fig. 4. Detector diagram

The detector diagram is shown in Fig. 4. In this diagram, the Find c block, which performs step 3 as described above, completes its work in one clock cycle. This is possible because the design targets 4-QAM, so the decision is very simple: the block takes the combination of the sign bits of the real part and the imaginary part of the first term in the expression to decide $c$. The same principle can be applied to M-QAM, but with more complexity. This block uses fixed-point numbers with 6 integer bits (including one sign bit) and 8 fraction bits, except for its input and for the Euclidean distance. The distance is represented by a fixed-point number with 8 integer bits and 8 fraction bits. The delay of this block is 22 clock cycles and the throughput is one output per clock cycle.

V. EXPERIMENTAL RESULTS

This architecture was synthesized for the Xilinx Virtex-5 XC5VSX240T. The synthesis results are shown in Table III in comparison with some related works. Note that this comparison is not fair, because our design is for the HR-SM system, which has a spectral efficiency of $(2(n_T - 1) + \log_2 M)$ bits/s/Hz, while the others are for other 4x4 MIMO systems. Furthermore, the works use different detection algorithms, so their BER performance is not equal.

TABLE III
IMPLEMENTATION OF SOME DETECTORS FOR 4 X 4 MIMO SYSTEM

| Section | In [11] | In [12] | In [13] | This Work |
| Algorithm | QRD/Givens Rotation | 12-K+ best | MMSE-SIC with optimal order | ISQRD Sorted MGS |
| Target Chip Technology | 0.18 um CMOS | Xilinx Virtex4 | 0.18 um CMOS | Xilinx Virtex5 |
| Max Freq. | 100 MHz | 170.9 MHz | 208 MHz | 364.7 MHz |
| Hardware Usage | 152K Gate | 15682 Slices, 20682 Flip-Flops, 23699 LUTs, 116 Multipliers | 79K Gate | 26128 registers, 17197 LUTs, 260 DSPs |
| Detect throughput | 2.4 Gbps (64-QAM) | 455.7 Mbps (16-QAM) | 416 Mbps (4-QAM) | 2.9 Gbps (4-QAM) |
| Latency | not reported | 147 clocks | 68 clocks | 165 clocks |

In [11], the authors present a VLSI architecture for QR decomposition using the Givens rotation algorithm for 4x4 MIMO-OFDM systems. Their result was used to compute the detection throughput of a conventional MIMO system, which has a spectral efficiency of $n_T \log_2 M$ bits/s/Hz, where $M$ is the modulation order and was assumed to be 64 (64-QAM). The latency is not shown because the design is not a complete detector. The work in [12] introduced an improvement of the K-Best detector, a recent research topic that gives good BER performance. However, it seems to be appropriate only for systems with a small number of transmit antennas and a small constellation size. The work in [13] proposed a low-complexity hard-output MMSE-SIC detector for the general spatially multiplexed MIMO system. Its highlights are low hardware cost and short latency. This comparison shows the variety of detectors for MIMO systems. The highlight of our work is its high clock frequency, which gives a high detection throughput.
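The throughput figures quoted for this work and for [11] follow from the spectral-efficiency expressions and the maximum clock frequencies. A short arithmetic check, assuming one detected vector per clock cycle as stated for the fully pipelined detector (the function names are illustrative only):

```python
import math


def hr_sm_bits_per_vector(n_t, M):
    """Bits carried by one HR-SM codeword: 2 (n_T - 1) + log2 M."""
    return 2 * (n_t - 1) + int(math.log2(M))


def conventional_bits_per_vector(n_t, M):
    """Bits carried by one conventional MIMO vector: n_T log2 M."""
    return n_t * int(math.log2(M))


def throughput_bps(f_max_hz, bits_per_vector, vectors_per_cycle=1.0):
    """Detection throughput for a pipelined detector."""
    return f_max_hz * bits_per_vector * vectors_per_cycle


# This work: 4 antennas, 4-QAM, 364.7 MHz.
bits_hr_sm = hr_sm_bits_per_vector(4, 4)            # 8 bits per vector
tp_this_work = throughput_bps(364.7e6, bits_hr_sm)  # about 2.9 Gbps

# Reference [11]: 4 antennas, 64-QAM assumed, 100 MHz.
bits_conv = conventional_bits_per_vector(4, 64)     # 24 bits per vector
tp_ref11 = throughput_bps(100e6, bits_conv)         # 2.4 Gbps
```

With 8 bits per detected vector, 364.7 MHz gives roughly 2.92 Gbps, which matches the 2.9 Gbps reported in Table III.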
To assess the BER performance of the hardware implementation, a fixed-point model was built in MATLAB. The simulation result, labeled Fixed-point ISQRD, is shown in Fig. 5 in comparison with some floating-point detectors. From this figure, we can see that the error caused by the fixed-point representation has a significant influence in the high-SNR region.

Fig. 5. BERs of an HR-SM scheme with nT = nR = 4 when using the ML, MSQRD, ISQRD and Fixed-point ISQRD detectors; 4-QAM modulation

VI. CONCLUSION

In this paper we propose a hardware architecture that implements the ISQRD detection algorithm for a High-Rate Spatial Modulation (HR-SM) scheme. The architecture is implemented on a Xilinx Virtex-5. The synthesis results show that it uses 26128 slice registers, 17197 slice LUTs and 260 DSPs (used for multipliers). The delay of the whole block is 165 clock cycles. The maximum frequency is 364.7 MHz. At that frequency, the throughput of the design can reach 2.9 Gbps. These results were compared with some related works. The BER performance of fixed-point and floating-point ISQRD detection is shown in comparison with ML detection and Modified MMSE-SQRD (MSQRD) detection.

REFERENCES
[1] D. Nguyen, X. N. Tran, M. T. Do, V. D. Ngo, and M. T. Le, “Low-complexity detectors for high-rate spatial modulation.”
[2] T. P. Nguyen, M. T. Le, V. D. Ngo, X. N. Tran, and H.-W. Choi, “Spatial modulation for high-rate transmission systems,” May 2014.
[3] G. J. Foschini and M. J. Gans, “On limits of wireless communications in a fading environment when using multiple antennas,” Mar. 1998.
[4] E. Telatar, “Capacity of multi-antenna Gaussian channels,” 1999.
[5] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. Valenzuela, “V-BLAST: an architecture for realizing very high data rates over the rich-scattering wireless channel,” 1998.
[6] A. Benjebbour, H. Murata, and S. Yoshida, “Comparison of ordered successive receivers for space-time transmission,” Fall 2001.
[7] B. Hassibi, “An efficient square-root algorithm for BLAST,” June 2000.
[8] D. Wubben, R. Bohnke, V. Kuhn, and K.-D. Kammeyer, “MMSE extension of V-BLAST based on sorted QR decomposition,” Oct. 2003.
[9] P. Luethi, C. Studer, S. Duetsch, E. Zgraggen, H. Kaeslin, N. Felber, and W. Fichtner, “Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison,” 2008.
[10] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner, “VLSI implementation of a high-speed iterative sorted MMSE QR decomposition,” May 2007.
[11] Z.-Y. Huang and P.-Y. Tsai, “Efficient implementation of QR decomposition for gigabit MIMO-OFDM systems,” Oct. 2011.
[12] S. P. Nils Heidmann, Till Wiegand, “Architecture and FPGA-implementation of a high throughput K+-best detector,” Mar. 2011.
[13] J.-Y. J. Tsung-Hsien Liu and Y.-S. Chu, “A low-cost MMSE-SIC detector for the MIMO system: Algorithm and hardware implementation,” Jan. 2011.
