Hardware Architecture

ISSSTA2004, Sydney, Australia, 30 Aug. - 2 Sep. 2004

A reconfigurable systolic architecture for UMTS/TDD joint detection real time computation

D. NOGUET
CEA-G/LETI/DCIS - 17, rue des Martyrs - F-38054 GRENOBLE cedex 9 - FRANCE
Email: dominique.noguet@cea.fr
Abstract: Multi-user detectors based on joint detection have been proposed to enhance the performance of the classical rake receiver in TD-CDMA systems. They significantly improve on the rake receiver, which suffers from Multiple Access Interference. This interference arises when imperfectly synchronised users share the same frequency band at the same time (up-link case) or when they experience multi-path channels (down-link and up-link). These are classical conditions of mobile TD-CDMA communications, for instance in UMTS/TDD. The main drawback of joint detection algorithms is their computational complexity. Here the DSP approach has shown its limits when multimedia-compliant data rates must be addressed, as is the case for UMTS/TDD [1], and especially when channels with long delay profiles must be handled. In this paper, we propose a specific architecture that significantly improves the real-time system throughput at a reasonable hardware cost. This approach is based on an SIMD structure [2] known as a systolic array. Systolic arrays are well-known architectures mainly used for one-dimensional or two-dimensional signal processing. The architecture described in this paper is derived from the linear systolic array proposed in [3]. We describe how the systolic array has been modified to address the Zero-Forcing joint detection algorithm in the UMTS/TDD context, and how the algorithm can be simplified to lower the computational effort. Fixed-point accuracy degradation is discussed for an AWGN channel, and the hardware cost for an FPGA target is given.
I. INTRODUCTION

Universal Mobile Telecommunication System (UMTS) using Time Division Duplex (TDD) is a Time Division - Code Division Multiple Access (TD-CDMA) system expected to provide high data rate, flexible, asymmetrical communications for cellular systems. The classical rake receiver has shown limited results because it suffers from Multiple Access Interference (MAI). Recently, improved receivers aiming at detecting the bursts jointly have been introduced. They have proved to be a good trade-off between VERDÚ's optimal receiver [4] and the rake receiver in terms of both complexity and performance. The joint detection algorithm is used at the output of a multi-user rake detector in order to decorrelate the outputs of these detectors. The strategy relies on the fact that the joint detector knows the spreading codes and the channel estimates of the other users, and uses them to lower their influence on the user of interest. Several joint detectors can be applied according to the error minimisation criterion. In this paper, we focus on the Zero Forcing Block Linear Estimator (ZF-BLE), which aims at cancelling the MAI regardless of the noise. The principle of ZF-BLE joint detection is recalled in section 2. Then, section 3 describes the algorithm used to solve the linear equation system. The matrix properties allow this operation to be performed through a Cholesky decomposition followed by triangular linear system solving. Section 4 presents the systolic architecture. It is based on the linear systolic array dedicated to linear system solving [3], which has been extended to perform both linear system solving and Cholesky decomposition. An analytical performance analysis and a comparison with a purely sequential computation are given in section 5. Section 6 describes the performance degradation due to fixed-point implementation. Finally, the hardware implementation cost for an FPGA target is given.

0-7803-8408-3/04/$20.00 © 2004 IEEE

TABLE I
UMTS/TDD BURST TYPE 2 PARAMETER VALUES

Parameter          Value
Time slot length   2560 chips
N                  69 QPSK symbols
Q                  16 chips
Midamble           256 chips

II. JOINT DETECTION RECEIVER
A. Signal model
The UMTS frame is divided into 15 time slots of 666 µs. These time slots can be allocated either to up-link (UL) or down-link (DL) communication. Each time slot is composed of 2 data blocks separated by a midamble block used for channel estimation, as illustrated in figure 1. A data block is the sum of K bursts, each composed of N QPSK symbols spread by a known spreading sequence of spreading factor Q.
[Figure: a 666 µs time slot made of data block 1, midamble, data block 2 and a guard period; each data block spans Q.N chips (N = 69) and superimposes burst 1, burst 2, ..., burst 8.]
Fig. 1. UMTS/TDD time slot structure
For UMTS/TDD burst type 2, the parameter values of table I are considered.

Let $d_i$ be the data vector composed of the data at position i for all bursts, i.e. $d_i = [d_i^1, d_i^2, \ldots, d_i^K]^T$. Each data block can be expressed by the concatenation of the $d_i$ vectors, namely $d = [d_1^T, d_2^T, \ldots, d_N^T]^T$.
The receiver input signal vector $e$ after demodulation is sampled at the rate $J/T_{chip}$ and can be expressed by the concatenation of the partial channel outputs for each data symbol as follows:

$$e = A.d + n \qquad (1)$$
where $d$ is the user data vector, $A$ the block matrix built from the convolution vectors between the channel impulse response and the spreading sequence of each burst, and $n$ the Additive White Gaussian Noise (AWGN). Since the channel is considered stationary over a data block, the transfer function for burst k is $b_k = h_k * c_k$, where $h_k$ is the channel impulse response, $c_k$ the spreading code and $*$ denotes convolution. Then, $A$ is a block matrix composed of a repetition of the $b_k$ vectors, as shown in figure 2.
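As an illustration of equation 1, the following Python sketch builds a small A matrix and a received vector e. It is only a sketch under stated assumptions: chip-rate sampling (J = 1), toy dimensions rather than the burst type 2 values, and hypothetical helper names (`convolve`, `build_system_matrix`, `received_signal`) that do not come from the paper.

```python
import random

def convolve(h, c):
    """Full linear convolution of two sequences (b_k = h_k * c_k)."""
    out = [0j] * (len(h) + len(c) - 1)
    for i, hv in enumerate(h):
        for j, cv in enumerate(c):
            out[i + j] += hv * cv
    return out

def build_system_matrix(channels, codes, n_symbols):
    """Build A assuming J = 1: the column of symbol i of burst k
    holds b_k shifted down by i*Q chips."""
    K = len(codes)
    Q = len(codes[0])
    W = len(channels[0])
    rows = n_symbols * Q + W - 1
    A = [[0j] * (n_symbols * K) for _ in range(rows)]
    for k in range(K):
        b_k = convolve(channels[k], codes[k])
        for i in range(n_symbols):
            col = i * K + k  # matches the ordering d = [d_1^T, ..., d_N^T]^T
            for r, value in enumerate(b_k):
                A[i * Q + r][col] = value
    return A

def received_signal(A, d, noise_std, rng):
    """e = A.d + n with complex white Gaussian noise n."""
    e = []
    for row in A:
        noise = complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        e.append(sum(a * x for a, x in zip(row, d)) + noise)
    return e

# Toy example (assumed values): K = 2 bursts, Q = 4, W = 3, N = 3 symbols
codes = [[1, 1, 1, 1], [1, -1, 1, -1]]
channels = [[1.0, 0.5, 0.0], [0.8, 0.0, 0.3j]]
A = build_system_matrix(channels, codes, n_symbols=3)
d = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 + 1j]
e = received_signal(A, d, noise_std=0.0, rng=random.Random(0))
```

With these toy values A has N.Q + W - 1 = 14 rows and N.K = 6 columns, and the columns of consecutive symbol positions are offset by Q = 4 rows.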
[Figure: block structure of the A matrix, with labels d_1, d_2, K, (W - 1).J, (Q + W - 1).J and N.K.]
Fig. 2. Structure of the A matrix

B. ZF-BLE detector

The rake receiver computes the estimates of d as follows:

$$\hat{d}_{rake} = A^H e = z \qquad (2)$$

where $^H$ denotes the Hermitian transpose. It can easily be shown that such a detector does not take advantage of the knowledge of the other bursts to cancel their influence on the estimation of a given one. That is why the output of the rake receiver suffers from MAI. In order to cancel the MAI, the Zero Forcing Block Linear Equaliser (ZF-BLE) is introduced. It aims at cancelling the MAI by jointly estimating the data of all bursts.

The ZF-BLE output can be expressed as follows:

$$\hat{d}_{ZF-BLE} = (A^H A)^{-1} A^H e = (A^H A)^{-1} z \qquad (3)$$

It can be seen from equation 3 that the ZF-BLE receiver performs the joint detection by cancelling the MAI regardless of the additive noise. It can also be seen that the input of the ZF-BLE algorithm is the output of the rake receiver. Note that other estimators can be used, such as the Minimum Mean Square Estimator (MMSE), which requires noise variance estimation [5]. They are outside the scope of this paper.

III. JOINT DETECTION COMPUTATION

In this paper, we consider that the rake and the channel estimation providing the $z$ and $h_k$ vectors have already been performed. The size of the matrix $M = A^H A$ is N.K × N.K. When a load of 8 bursts is considered, the size of the M matrix is 552 × 552. Under these conditions, direct inversion of M cannot be performed in real time, as this operation would require a computational power of 22.5 GFLOPs [1] to compute all bursts with K = 8. Another strategy must therefore be considered. Since M is Hermitian and positive definite, a Cholesky decomposition of M can be used [6]. It consists of computing an upper triangular matrix U such that $M = U^H.U$. The system in equation 3 can then be solved in 2 steps, as shown by equations 4 and 5.

$$z = M.\hat{d}_{ZF-BLE} = U^H.U.\hat{d}_{ZF-BLE} \qquad (4)$$

Let y be given by $y = U.\hat{d}_{ZF-BLE}$. The ZF-BLE is performed by solving the two triangular systems of equation 5.

$$z = U^H.y \quad \text{then} \quad y = U.\hat{d}_{ZF-BLE} \qquad (5)$$

Moreover, due to the structure of the M matrix, it has been shown in [7] that the block lines of U converge rapidly to a block that can be copied to build the rest of the matrix. Typically, $N_0 = 10$ block lines of the U matrix are computed.

IV. SYSTOLIC ARRAY ARCHITECTURE

A. Triangular system solving

KUNG and LEISERSON proposed a systolic array to perform triangular system computation [3]. The systolic algorithm for the lower triangular system Ax = b consists of the recursive operations described in equation 6.
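$$\begin{aligned}
x_i^{(0)} &= 0 \\
x_i^{(k)} &= x_i^{(k-1)} + a_{i,k}.x_k \quad \text{for } 0 < k < i \\
x_i &= (b_i - x_i^{(i-1)}) / a_{i,i}
\end{aligned} \qquad (6)$$

in which $x_i^{(k)}$ represents the kth iteration step of the computation of $x_i$.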
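The recursion of equation 6 can be sketched sequentially in Python as below. This is a plain software rendering of the recursion, not a model of the systolic timing; the function name `solve_lower_triangular` and the 3 × 3 example values are assumptions chosen for illustration, and indices are 0-based.

```python
def solve_lower_triangular(a, b):
    """Solve a.x = b for a lower triangular matrix a, following equation 6."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = 0.0                      # x_i^(0) = 0
        for k in range(i):             # accumulation steps x_i^(k)
            acc += a[i][k] * x[k]
        x[i] = (b[i] - acc) / a[i][i]  # final subtract and divide
    return x

# Hypothetical example system
a = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, 2.0, 5.0]]
b = [2.0, 9.0, 22.0]
x = solve_lower_triangular(a, b)
```

Each unknown needs the previously computed ones, which is why the results have to travel back through the accumulation path in a pipelined realisation.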

The systolic array that processes the algorithm is composed of 2 kinds of Processor Elements (PE):
- the Multiply ACcumulate (MAC) cell (cf. figure 3), which performs the iterative computation of $x_i^{(k)}$ for $0 < k < i$;
- the DIAG cell (cf. figure 4), which performs the final subtract and divide operations required to compute $x_i$.
The PEs are connected as a linear array, as illustrated in figure 5, and are fed with the $a_{i,k}$ diagonal by diagonal. As $x_i^{(k)}$, initially set to 0, moves rightwards in the array, it accumulates the product terms of the previous iterations until it reaches the DIAG cell, in which the final value is computed. This value then moves backwards in the array in order to process the next $x_j^{(k)}$ (j > i).
[Figure: at instant t the MAC cell holds x, a and accu; at instant t+1 it outputs accu + a.x.]
Fig. 3. MAC cell operations

[Figure: at instant t the DIAG cell holds a, b and accu; at instant t+1 it outputs (b - accu)/a.]
Fig. 4. DIAG cell operations

i - 1 MAC cells and 1 DIAG cell are needed to compute $x_i$. Hence, to process an n × n matrix, n - 1 MAC cells are required. In the UMTS/TDD case, considering burst type 2, a straightforward estimation would lead to 551 MAC cells. However, for band matrix processing, it can be shown that the size of the array is limited to the band width. Indeed, the input of each cell is a diagonal of the matrix (cf. equation 6), so a MAC cell connected to a zero diagonal can be removed. The width p of the matrix band can be expressed as follows:

$$p = K.(P + 1) \quad \text{with} \quad P = \left\lfloor \frac{Q + W - 1}{Q} \right\rfloor \qquad (7)$$

In practice, a value of P = 2 covers all the paths delayed by less than 8.3 µs, which is sufficient for typical 3GPP channels. Then, p can be limited to 24.

B. Cholesky factorisation

The Cholesky factorisation consists of computing the $u_{i,j}$ such that $M = U^H.U$ with U upper triangular. The elements of U are computed line by line, from left to right. The diagonal element is computed as follows:

$$u_{i,i} = \sqrt{m_{i,i} - \sum_{k=1}^{i-1} |u_{k,i}|^2} \qquad (8)$$

Note that the diagonal elements are positive and real. Then, the non-diagonal elements can be processed according to:

$$u_{i,j} = \frac{m_{i,j} - \sum_{k=1}^{i-1} u_{k,i}^{*}.u_{k,j}}{u_{i,i}} \qquad (9)$$

TABLE II
DATA STREAM FOR LINE 1, 2 AND 3 COMPUTATION

Line 1 computation
t      MAC1    MAC2    DIAG           Result
t=1    0       0       u1,1  l1,2     x
t=2    0       0       u1,1  l1,3     u1,2
t=3    0       0       u1,1  l1,4     u1,3
t=4    0       0       0     0        u1,4

Line 2 computation
t      MAC1    MAC2    DIAG           Result
t=1    0       u1,3    0     0        x
t=2    0       u1,4    u2,2  l2,3     x
t=3    0       u1,5    u2,2  l2,4     u2,3
t=4    0       0       u2,2  l2,5     u2,4
t=5    0       0       0     0        u2,5

Line 3 computation
t      MAC1    MAC2    DIAG           Result
t=1    u1,4    0       0     0        x
t=2    u1,5    u2,4    0     0        x
t=3    u1,6    u2,5    u3,3  l3,4     x
t=4    0       u2,6    u3,3  l3,5     u3,4
t=5    0       0       u3,3  l3,6     u3,5
t=6    0       0       0     0        u3,6
Considering equation 9, it can be noticed that the computation of the non-diagonal elements is close to the equation for linear system solving: a sum of product terms followed by a subtract and divide operation. As for linear system solving, the number of MAC cells equals the number of product terms in $\sum_{k=1}^{i-1} u_{k,i}^{*}.u_{k,j}$. However, since j > i and since U is a band matrix of width p, the number of MAC cells can be reduced to p - 2. Then, for P = 2, only 22 MAC cells are required. Since triangular system solving needs p - 1 = 23 MAC cells, 23 MAC cells will be used to perform both Cholesky decomposition and triangular system solving. The DIAG cell is almost the same as for linear system solving, except that the backward stream is not fed with $u_{i,j}$ but with $u_{k,i}$.

The $u_{k,i}$ needed for the product terms are the same for all elements of the ith line. These values are stored in each MAC cell before the computation of the line elements. During a second step, the $u_{k,j}$, $l_{i,j}$ and $u_{i,i}$ coefficients are sent to the MAC cells. Table II gives an example of the computation of 3 lines of a matrix for which p = 4.

Finally, the block diagram of the reconfigurable systolic architecture that can perform both triangular system solving and Cholesky decomposition is shown in figure 6.
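To make the arithmetic concrete, the Cholesky recursion (equations 8 and 9) and the two triangular solves (equations 4 and 5) can be sketched sequentially in Python as follows. This is a minimal sketch, not the systolic implementation: it ignores the band structure and the block-copy simplification, uses 0-based indices, and the names `cholesky_upper`, `zf_ble_solve` and the 3 × 3 test matrix are assumptions made for illustration.

```python
import math

def cholesky_upper(m):
    """Return upper triangular u such that m = u^H.u (equations 8 and 9)."""
    n = len(m)
    u = [[0j] * n for _ in range(n)]
    for i in range(n):
        # Diagonal element (equation 8): real and positive
        s = m[i][i].real - sum(abs(u[k][i]) ** 2 for k in range(i))
        u[i][i] = complex(math.sqrt(s))
        # Non-diagonal elements of line i (equation 9)
        for j in range(i + 1, n):
            acc = sum(u[k][i].conjugate() * u[k][j] for k in range(i))
            u[i][j] = (m[i][j] - acc) / u[i][i]
    return u

def zf_ble_solve(u, z):
    """Solve u^H.u.d = z with two triangular solves (equations 4 and 5)."""
    n = len(z)
    y = [0j] * n
    for i in range(n):              # forward solve: u^H.y = z
        acc = sum(u[k][i].conjugate() * y[k] for k in range(i))
        y[i] = (z[i] - acc) / u[i][i].conjugate()
    d = [0j] * n
    for i in reversed(range(n)):    # backward solve: u.d = y
        acc = sum(u[i][k] * d[k] for k in range(i + 1, n))
        d[i] = (y[i] - acc) / u[i][i]
    return d

# Hypothetical Hermitian positive definite matrix and data vector
m = [[4 + 0j, 2 + 2j, 0j],
     [2 - 2j, 6 + 0j, 1 - 1j],
     [0j, 1 + 1j, 3 + 0j]]
d_true = [1 + 1j, -1 + 1j, 1 - 1j]
z = [sum(m[i][j] * d_true[j] for j in range(3)) for i in range(3)]
d_est = zf_ble_solve(cholesky_upper(m), z)
```

Both the factorisation and the solves reduce to a sum of products followed by a subtract and a divide (or square root), which is the common pattern the array exploits.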
Fig. 5. Linear systolic array

[Figure: block diagram made of MAC cells 1 to p - 1 and a DIAG cell with its LUT, each handling real and imaginary parts of the a, accu, back, m, b, l and x signals, together with memories (mem) and a control block.]
Fig. 6. Linear reconfigurable systolic array

V. PERFORMANCE ANALYSIS

In this section we analytically compare the performance of the systolic array with that of a sequential processor. To simplify the comparison, we assume that one arithmetic operation (addition or multiplication) is performed in one time unit, named a cycle, for both architectures.
A. System solving timing analysis

Each cell is fed with a coefficient every second cycle. Indeed, as the accumulation flow $x_j^{(k)}$ and the result flow move in opposite directions and must be combined, it is mandatory to insert a dead cycle so that the $x_i$ are computed with the right coefficients [8].

Let us reckon the time required to solve an n × n system. The first element is computed at t = 1. Then, every second cycle, the array outputs a new result. Hence, the computation time to output the n results of the output vector is $T^{systol}_{systlin} = 2n - 1$ cycles.

It is known that a sequential processor would require $T^{seq}_{systlin} = \frac{n^2}{2} + O(n)$ cycles. The gain can then be expressed as follows:

$$\frac{T^{seq}_{systlin}}{T^{systol}_{systlin}} = \frac{\frac{n^2}{2} + O(n)}{2n - 1} \approx \frac{n}{4} \text{ when n is large} \qquad (10)$$

This gain should be balanced against the complexity ratio, which is in favour of the sequential solution. [9] suggests computing an efficiency figure defined as the product of the complexity by the processing time:

$$g_{systlin} = \frac{C^{seq}_{systlin}.T^{seq}_{systlin}}{C^{systol}_{systlin}.T^{systol}_{systlin}} = \frac{\frac{n^2}{2}}{n.(2n - 1)} \approx \frac{1}{4} \text{ when n is large} \qquad (11)$$
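Equations 10 and 11 can be checked numerically with the short Python sketch below. It keeps only the leading n²/2 term of the sequential cost and assumes, from the denominators of equation 11, a complexity of 1 for the sequential processor and n for the systolic array; the function names are hypothetical.

```python
def systolic_cycles(n):
    """T_systol = 2n - 1 cycles for an n x n triangular system."""
    return 2 * n - 1

def sequential_cycles(n):
    """Leading term of T_seq = n^2/2 + O(n); the O(n) part is ignored here."""
    return n * n / 2

def speed_gain(n):
    """Gain of equation 10."""
    return sequential_cycles(n) / systolic_cycles(n)

def efficiency(n):
    """Efficiency figure of equation 11."""
    c_seq = 1      # assumed: a single processing unit
    c_systol = n   # assumed: n cells, read from the n factor in equation 11
    return (c_seq * sequential_cycles(n)) / (c_systol * systolic_cycles(n))

n = 552  # size of the full M matrix for K = 8, N = 69
gain = speed_gain(n)
eff = efficiency(n)
```

For n = 552 the gain is close to n/4 = 138 and the efficiency is close to 1/4, as predicted by the asymptotic forms.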
B. Cholesky decomposition timing analysis

Let us first study the computation of the non-diagonal elements. Before all the MAC cells are fed (Transient State (TS)), the processing time for the non-diagonal elements of line i equals i + (p - 1) cycles. When all the MAC cells are fed (Stationary State (SS), i ≥ p - 1), the processing time for the p - 1 non-diagonal elements of a line is 2.(p - 1) cycles. Then, the global processing time for l lines (l > p - 2) is:

$$T^{systol}_{chol\,nde} = \underbrace{(l - (p - 2)).2(p - 1)}_{SS} + \underbrace{\sum_{i=1}^{p-2} \left(i + (p - 1)\right)}_{TS} = (p - 1).\left(2l - \frac{p - 2}{2}\right) \qquad (12)$$
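As a sanity check of equation 12, the sketch below evaluates the SS and TS contributions term by term and compares them with the closed form. The function names are hypothetical, and the values l = 80 and p = 24 are the ones used for burst type 2 with P = 2 and N0 = 10.

```python
def chol_nde_cycles_sum(l, p):
    """Equation 12 evaluated term by term."""
    ss = (l - (p - 2)) * 2 * (p - 1)                # stationary state lines
    ts = sum(i + (p - 1) for i in range(1, p - 1))  # transient lines 1..p-2
    return ss + ts

def chol_nde_cycles_closed(l, p):
    """Closed form (p - 1).(2l - (p - 2)/2), kept in integer arithmetic."""
    return (p - 1) * (4 * l - (p - 2)) // 2

cycles = chol_nde_cycles_sum(80, 24)
```

Both forms give 3427 cycles for l = 80 and p = 24.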
It should be remembered that the static coefficients of the backward stream must be loaded before the non-diagonal elements are processed. However, this preload can be done during the diagonal element computation, since the coefficients to be loaded are the ones required for diagonal element processing. In addition to the preload, the following operations must be considered:
- 1 cycle for accumulation reset
- 1 cycle for $u_{i-(p-1),i}$ load
- 1 cycle to end accumulation
- sqrt_cy cycles for subtract and square root computation
- 1 cycle for result storage
For the first line, only the reset cycle, the sub/sqrt computation and the storage are required. Then, up to line p - 1, the load of $u_{i-(p-1),i}$ is not required. During SS, the load and computation time is (p - 2) + 4 + sqrt_cy. Hence, for the l first lines, the computation of the l diagonal elements requires $T^{systol}_{chol\,de}$ cycles, given by:

$$T^{systol}_{chol\,de} = \underbrace{(l - (p - 1)).(p + 2 + sqrt\_cy)}_{SS} + \underbrace{\sum_{i=2}^{p-1} (i + 3 + sqrt\_cy)}_{\text{lines 2 to } p-1} + \underbrace{2 + sqrt\_cy}_{\text{first line}} \qquad (13)$$

$$= (l - (p - 1)).(p + 2 + sqrt\_cy) \; + \qquad (14)$$

$$(p - 2).\left(\frac{p + 7}{2} + sqrt\_cy\right) + sqrt\_cy + 2 \qquad (15)$$

The global computation time for the Cholesky decomposition is $T^{systol}_{chol\,de} + T^{systol}_{chol\,nde}$. If it is assumed that sqrt_cy is set to 1 for both the sequential and systolic architectures, the performances of both architectures can be compared. This is summed up in table III.
TABLE III
COMPUTATION TIME COMPARISON FOR 1 TIME SLOT (BURST TYPE 2)
K=8, N=69, P=2, N0=10, l=80, p=24

Algorithm          T seq (cycles)   % seq    T systol (cycles)   % systol   gain
System solving     44496            71 %     4412                45.4 %     10
Cholesky decomp.   18104            29 %     5311                54.6 %     3.4
Total              62600            100 %    9723                100 %      6.4
VI. HARDWARE IMPLEMENTATION

A. Fixed-point performance impact
In order to reduce hardware requirements, a fixed-point implementation of the architecture is studied. The dynamic range of the data-path has to be determined as a trade-off between hardware complexity, processing speed and algorithm accuracy. In order to obtain a good estimate of these requirements, we compare the results obtained with the fixed-point implementation of the architecture to those obtained with a floating-point one, taken as the reference, in a full receiver. Both coded and uncoded receivers are analysed. The procedure is applied step by step. First, the output vector of the joint detection is modified and several ranges are evaluated. Then, for the dynamic range chosen, the range of the M, z and y values is studied. For these tests, a scenario consisting of an AWGN channel with a load of 8 bursts is used. It can be seen in figure 7 that a dynamic range of 8 bits for these operands provides results very close to those of the floating-point implementation.
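The effect of an 8-bit dynamic range on one operand can be illustrated with the Python sketch below. The paper does not describe its quantisation scheme, so everything here is an assumption: uniform rounding on a signed two's complement grid, saturation at the range limits, a full scale of 1.0, and the hypothetical names `quantize` and `quantize_complex`.

```python
def quantize(value, n_bits=8, full_scale=1.0):
    """Round value to a signed n_bits grid over [-full_scale, full_scale), with saturation."""
    levels = 1 << (n_bits - 1)        # 128 codes on each side for 8 bits
    step = full_scale / levels
    code = round(value / step)
    code = max(-levels, min(levels - 1, code))  # saturate instead of wrapping
    return code * step

def quantize_complex(value, n_bits=8, full_scale=1.0):
    """Quantise real and imaginary parts independently."""
    return complex(quantize(value.real, n_bits, full_scale),
                   quantize(value.imag, n_bits, full_scale))

sample = 0.3 - 0.71j
error = abs(quantize_complex(sample) - sample)
```

Within the unsaturated range, the rounding error on each real component is bounded by half a quantisation step, i.e. 1/256 of the full scale for 8 bits.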
[Figure: BER (from 1.0e-01 down to 1.0e-06) versus Eb/N0 (dB, from 0 to 10) for the fixed-point and floating-point implementations.]
Fig. 7. Fixed point vs floating point implementation

VII. HARDWARE COMPLEXITY

Now that the dynamic range of the operators has been set, the hardware complexity of the architecture can be analysed. A Xilinx Virtex 2 FPGA has been used as the target architecture. Table IV shows the hardware cost of the main data-path and control blocks of the architecture when P = 2.

TABLE IV
HARDWARE COST FOR A VIRTEX 2 IMPLEMENTATION

Cell type     Number of cells   Multipliers   Slices   Total multipliers   Total slices
MAC           23                2             115      46                  2645
Sqrt          1                 16            536      16                  536
DIAG          1                 2             164      2                   164
Master FSM    1                 0             221      0                   221
Slave FSM     22                0             62       0                   1364
Total                                                  64                  4930

The frequency requirement comes from the operations that have to be performed within the time slot duration. When only the joint detection is considered, about 10000 cycles must be processed in 666 µs. This leads to a clock frequency of 15 MHz, which means that a large number of other operations could be performed alongside.

VIII. CONCLUSION

In this paper a systolic architecture dedicated to joint detection has been proposed. The Cholesky decomposition and triangular system solving are performed to lower the computational cost of the matrix inversion. We showed that this architecture performs the joint detection about 6.4 times faster than a sequential processor (table III). The performance impact of decreasing the data dynamic range has been analysed, and we showed that 8-bit data can be chosen with little impact on the overall receiver performance. The hardware implementation cost is reasonable, since only 350 kgates are required, with light requirements on the clock frequency.

ACKNOWLEDGEMENT

This research work was carried out in the framework of the French RNRT project PETRUS in partnership with Mitsubishi Electric ITE, Bouygues Telecom, Supelec and ENST Bretagne. The author would also like to thank C. TASSIN for providing fixed-point simulation results.

REFERENCES
[1] D. Noguet, J.-P. Bouyoud, L. Zaghdoudi, D. Varreau, B. Jechoux, P. Le Corre, X. Lagrange, and J. Nasreddine, "A hardware testbed for UMTS/TDD joint detection base-band receivers," in ISSSTA, Sydney, September 2004.
[2] M. Flynn, "Some computer organisations and their effectiveness," IEEE Transactions on Computers, vol. C-21, 1972.
[3] H. Kung and C. Leiserson, Introduction to VLSI systems. Addison-Wesley, 1980, ch. Systolic arrays for VLSI (chap. 8.3).
[4] S. Verdú, "Optimum multi-user signal detection," Ph.D. dissertation, Urbana-Champaign, August 1984.
[5] A. Klein, G. K. Kaleh, and P. W. Baier, "Zero forcing and minimum mean-square-error equalization for multiuser detection in CDMA channels," IEEE Trans. on Vehicular Technology, vol. 45, no. 2, May 1996.
[6] G. Golub and C. Van Loan, Matrix computations. Johns Hopkins University Press, 1989.
[7] Y. Pigeonnat, "Joint detection for UMTS: complexity and alternative solutions," in VTC, Toronto, 1998.
[8] P. Quinton and Y. Robert, Algorithmes et architectures systoliques. Masson, 1989.
[9] R. Hockney and C. Jesshope, Parallel computers. Adam Hilger, 1981.
