You are on page 1of 5

SPARSE BAYESIAN LEARNING FOR DIRECTIONS OF ARRIVAL ON AN FPGA

Herbert Groll1 , Christoph Mecklenbräuker1 and Peter Gerstoft2


1
TU Wien, Vienna, Austria
2
University of California San Diego, La Jolla, CA, USA

ABSTRACT ultimately estimates one signal power parameter per source


rather than the complex source amplitudes individually per
A direction of arrival (DOA) estimator based on sparse
snapshot. This amounts to a significant reduction of the de-
Bayesian learning (SBL) is implemented as a fixed-point
grees of freedom in the estimation problem resulting in low
arithmetic prototype for an FPGA platform. The prototype is
variance of the DOA estimates. FPGA-based DOA estima-
developed from a known algorithm mainly using high-level
tion implementations using MUSIC were presented in [9, 10]
synthesis with C++ based model specifications. The spe-
which have been realized by hardware description on the
cialized equations of the algorithm are reduced to arithmetic
register-transfer level. Highly developed FPGAs enable rapid
operations considering the signal flow within the iterative
prototyping of algorithms using high-level synthesis (HLS)
structure. Cholesky factorization is used to solve the matrix
for FPGA as used in this work. A prototyping setup similar
inverse problem. Scheduling of each module is done as soon
to [11] ensures fast verification of the prototype.
as possible to make use of the parallel FPGA architecture.
Different fixed-point word length assumptions are explained
and implementation results are shown in terms of resources 2. SIGNAL MODEL
and latency. Finally, a representative DOA source scenario is
simulated and tested with the implemented prototype hard- A K-sparse vector x l ∈ CM is observed as signal y l ∈
ware in the loop. The comparison with a floating-point CN on a sensor array with N sensors, with N  M ,
reference implementation is found to have good agreement through linear transformation with a transfer matrix A =
with the fixed-point implementation. [a 1 , . . . , a M ] ∈ CN ×M and noise n l . A grid with M
points define the possible DOA of K impinging waves with
Index Terms— Array processing, directions of arrival far-field and narrowband assumption at a fixed frequency ω
(DOA) estimation, FPGA, high-level synthesis and velocity of propagation c. The indices of the K non-
zero sources of x l , where K  M , define the active set
1. INTRODUCTION Ml = {m ∈ N|xml 6= 0} = {m1 , m2 , . . . , mK }. A total
of L snapshots are taken with stationary active set Ml = M
Direction of arrival (DOA) estimation is used in radar, com- to form the multi-snapshot, or multiple measurement vector
munication, sonar and seismology applications for localizing (MMV), Y = [y 1 , . . . , y L ] ∈ CN ×L as
sources by sensing wavefields. A sparse representation of the Y = AX + N (1)
DOA problem allows estimation with high spatial or tempo-
M ×L
ral resolution [1]. Sparse Bayesian learning (SBL) [2], solves with X = [x 1 , . . . , x L ] ∈ C and additive noise N =
this undetermined linear inverse problem for which numerous [n 1 , . . . , n L ] ∈ CN ×L which is assumed to be i.i.d complex
algorithms have been presented [3–5]. Gaussian, CN (0, σ 2 ), across sensors and snapshots.
A field-programmable gate array (FPGA) implementation Each array steering vector a m , as m-th column of A, de-
for SBL is of great interest due to possible real-time dimen- scribes the time delays τ m = [τm1 , . . . , τmN ]T with respect
sion reduction of online measurements, e.g. in directional to the sensors for a wave front with direction θm . For a uni-
evaluation of fading and two-ray model measurements [6–8]. form linear array (ULA) with element spacing d, the nm-th
ωd
Moreover, FPGAs are becoming more tightly integrated to- element of A is e−jωτnm = e−j(n−1) c sin θm .
gether with high-speed analog-digital converters (ADCs)
which enable new applications. The use of arbitrary preci- 2.1. Bayesian Formulation
sion fixed-point arithmetic in FPGAs offers fast processing
The SBL framework [4] based on [3] defines a Gaussian like-
without loosing too much accuracy.
lihood function of Y |X [4, Eq. (3)] as
In this paper we present an FPGA implementation of
the SBL DOA estimator in [4] using fixed-point arithmetic. exp(− σ12 kY − AX k2F )
The SBL algorithm performs better than MUSIC [4] and p(Y |X ; σ 2 ) = , (2)
(πσ 2 )N L
additive complex Gaussian noise with variable variance σ 2 . which leads to an averaging over L snapshots. Only the lower
The prior distribution of each complex source amplitude xml triangle of Hermitian y l y Hl is calculated during snapshot ac-
is modeled with hyperparameters γm ∈ γ = [γ1 , . . . , γM ]T quisition. Double-buffering ensures gap-free operation be-
stationary across different snapshots l as tween multi-snapshots. By allowing only L = 2Lb with Lb ∈
( N, S y is derived through reinterpretation of the fixed-point
δ(xml ), for γm = 0 data type, i.e. shifting the binary point. The hyperparameters
pm (xml ; γm ) = 1 −|xml |2/γm (3) γmnew
in [4, Eq. (SBL1)] are updated iteratively as
πγm e , for γm > 0
γ old q
new
γm = √m Y H Σ−1 a / aH −1
m Σy a m (6)

L y m
p(X ; γ) =
Y
CN (0, Γ) (4) L 2

l=1
with the inverse data model covariance matrix Σ−1 y . [4, Eq.
with Γ = diag(γ) = E[x l x H (SBL)] and [4, Eq. (M-SBL)] are not considered in this work.
l ; γ], the covariance matrix of
the uncorrelated complex source amplitudes. The SBL algo- Σy is a Hermitian positive definite matrix with its lower tri-
rithm of [4] estimates the source powers by estimating the angular Cholesky factor
hyperparameters Γ.
LΣ = cholesky (Σy ) , (7)

3. HARDWARE ALGORITHM where Σy = LΣ LH


Σ is the Cholesky factorization. LΣ is used
to solve
Σ−1 −H −1
y a m = LΣ LΣ a m (8)
Table 1. SBL FPGA Algorithm by forward- and back-substitution. Using Eq. (8) and S y ,
new
1: Input: Y updating hyperparameters γm changes to
2: Initialize: LΣ = cholesky 0.1 · I N + AAH , jmax = 500

s
3: S y = 1
L
P
y ly H new old (L−1
Σ a m)
−H L−1 S L−H (L−1 a )
Σ y Σ Σ m
L l γm = γm ·
l=1 kL−1
Σ a k
m 2
2
4: tr(S y )
(SBL1)
5: while ∆γM > min · kγ old k1 and j < jmax do
6: j = j + 1, γ old = γ new
where L−1Σ a m is reused in the denominator as a H −1
m y am =
Σ
7: for m = 1 to Mr do kL−1
Σ a k
m 2
2
. We reuse intermediate results during Cholesky
new old (L−1 a m )−H L−1 S y L−H (L−1 am) factorization for forward- or back-substitution to reduce
8: γm = γm · Σ Σ Σ Σ
kL−1
Σ
a m k2
2 costly 1/x-operations.
H new
9: Σy,m = Σy,m−1 + a m a m γm The data model covariance matrix Σy is calculated in
new old
10: ∆γm =P∆γm−1 + |γm − γm | M + 1 passes. First, all contributions from each γm are
11: gm = m+1 γ
i=m−1 i
new
summed
(
gm (gm−1 < gm ≤ gm+1 ) ∧ (1 < m < M )
12: pm =
0 else Σy,m = Σy,m−1 + a m a H new
m γm , Σy,0 = 0 (9)
13: Mm = {i ∈ (Mm−1 ∪ m) | K largest peaks in pi }
14: end for by only evaluating the lower triangle of a m a Hm . Finally, the
15: AM = [a m1 , . . . , a mK ]  noise estimate is added to the real diagonal as
16: RM = cholesky AH M AM
17: Q = AM R−1 Σy = Σy,M + (σ 2 )new I N . (10)
M
(σ 2 )new = N −K
1
tr(S y ) − tr(Q H S y Q)

18:
19: Σy = Σy,M + (σ 2 )new I N
For finding the active set M, we use a 3-stage processing
20: LΣ = cholesky(Σy ) pipeline. First, to reduce false peaks due to limited precision
new T
21: end while we filter the γ new = [γ1new , . . . , γM ] with a moving sum of
22: Output: M, γ new , (σ 2 )new size 3
m+1
X
gm = γinew . (11)
A SBL algorithm, based on [4, Algorithm 1] and adopted i=m−1
for FPGA hardware implementation, is summarized in Ta- No scaling by 1/3 is applied as only indices are relevant. Sec-
ble 1. At the input, each multi-snapshot Y is used only as ond, we use a trivial peak detection algorithm on gm
data sample covariance matrix S y , defined as (
gm (gm−1 < gm ≤ gm+1 ) ∧ (1 < m < M )
L pm =
1X 0 else.
Sy = y ly H
l (5)
L (12)
l=1
Third, the K largest peaks are selected in M passes out of
K + 1 peak candidates yl Sy Syy
S tr(S y ) (σ 2 )new (σ 2 )new

Mm = {i ∈ (Mm−1 ∪ m) | K largest peaks in pi } (13) LΣ γ new find M M


cholesky  ≤ min end
with M = MM and M0 = {}. The transfer matrix defined
by the active set is AM ∈ CN ×K . Using the upper triangular γ new
Cholesky factor Σy Σy
 
RM = cholesky AH (a) data path
M AM (14)
Sy runtime →
with AH H
M AM = R M R M , the projection matrix P based on tr(S y )
the active set is Σy
−1 H
LΣ init
P = AM (RH
M RM ) AM = QQ H (15) γ new γ1 γ2 ... γM γ1 . . .
iteration 1 complete
 ≤ min
with Q ∈ CN ×K as
find M
Q = AM R−1 (σ 2 )new
M, (16)
(b) Scheduling
calculated with back-substitution. Alternatively, a matrix Q
can be directly obtained by QR factorization. However, Vi- Fig. 1. (a) data path and (b) scheduling of modules for the
vado HLS only includes a floating-point implementation of implemented SBL algorithm. Structure for iterative fitting
an QR factorization but a fixed-point Cholesky factorization. of Σy . S y and tr(S y ) need to be calculated only once per
The noise variance estimation [4, Eq. (27)] is rewritten multi-snapshot. Data dependencies limit possible parallelism
using Eq. (15), Eq. (16), and tr(QQ H S y ) = tr(Q H S y Q) as of operations.
1  
(σ 2 )new = tr(S y ) − tr(Q H S y Q) (17)
(N − K) focuses on the algorithmic level using a subset of C++, al-
−1
with constant (N − K) , a signal part tr(S y ) which is cal- lowed by Vivado HLS to produce synthesizable results. HLS
culated only once per multi-snapshot, and an active signal linear algebra library and custom operations are both us-
part tr(Q H S y Q) which only needs the upper triangle of S y . ing arbitrary precision fixed-point signed and unsigned data
Hyperparameters (γ new , (σ 2 )new ) are updated iteratively up to types ap fixed<W,WI> and ap ufixed<W,WI> each
jmax iterations or as long as source power changes sufficiently, with total bit-width W, integer width WI, and fraction width
WF=W-WI.
new old
∆γm = ∆γm−1 + |γm − γm |, ∆γ0 = 0 (18) The input y l is a 32 bit wide sensor data stream. Each
sensor sample consists of concatenated two’s complement
∆γM > min · kγ old k1 , (19) signed real and imaginary part with W=16 bit each and WI=2
reformulated from the convergence rate  [4, Eq. 25] to avoid to allow values symmetric around zero. This enables, e.g.,
a division and optimize for serial input. feeding the system with samples from an ADC, optionally
prepended with automatic gain control (ACG). The outputs
of the system are (σ 2 )new with W=32 bit, γ new with W=48 bit,
4. HARDWARE IMPLEMENTATION
active set M as unsigned integers, and exit criteria signaling.
The algorithm [4, Algorithm 1] is split up into modules ac- The transfer A is pre-calculated for a given sensor array with
cording to Fig. 1 (a) with well defined interfaces. Either W=16 bit, implemented as a look-up table and instantiated

streaming or dual-port memory is used between the mod- at several HLS modules. Normalization of A by 1/ M is
ules. Xilinx Vivado HLS 2017.2 is used for transforming beneficial for required WI when calculating LΣ .
the hardware specification of individual modules on an algo-
rithmic level based on C++ into a specification in Verilog 4.1. Hardware Results
or VHDL on the register-transfer level. We use several
modules, opposed to a monolithic HLS description of the The resources and latencies of the synthesized system are
algorithm, which reduces complexity for the HLS tool and based on the specific parameters M , N , K, and clock fre-
lead to more predictable results. Data dependencies be- quency fclk . For M = 512, N = 16, K = 3 and fclk =
tween modules translate to a coarse grain scheduling scheme 150 MHz, a summary of resources for each module is in Fig. 2
as shown in Fig. 1 (b). The development of each module together with introduced latency by scheduling a single mod-
ule. The synthesized HLS C++ implementation can processes 10 SBL1 fixed SBL1 fixed

DOA RMSE (◦ )
43 iterations/s. This is faster than the MATLAB implemen- SBL1 float 1 SBL1 float

(σ 2 )new /σT2
tation (floating-point reference) of 24 iterations/s on a Intel
Xeon CPU E5-2690 v3 at 2.60 GHz. 5
0.5

0.3 LUTs
0 0
utilization

FF
resource

0.2 0 10 20 0 10 20
DSP 400 20
0.1 BRAM

γRM SE (dB)
300

iterations
0 15
(cycles)

106
latency

200
103
100 100 SBL1 fixed SBL1 fixed
SBL1 float 10 SBL1 float
find M

tr(S y )

 ≤ min

γ new
(σ 2 )new

Mem γ new

Mem LΣ
Sy

Σy

Mem S y
Mem Σy
0
0 10 20 0 10 20
array SNR (dB) array SNR (dB)

Fig. 2. Resource utilization and introduced latency for each Fig. 3. Comparison between the presented fixed-point imple-
module. Based on used Kintex-7 XC7K325T FPGA with mentation and floating-point reference for J = 100 Monte
Slice look-up-tables (LUTs), Slice registers (FF), special Carlo simulations. (a) DOA estimation error, (b) noise power
digital-signal-processing (DSP) slices, block RAM (BRAM) estimation, (c) iterations for  ≤ min , (d) source power esti-
of 18 kbit. Total LUT=203800, FF=407600, DSP=840, mation error.
BRAM=890.

Clearly, calculating γmnew


utilizes most resources and con- is compared with the floating-point reference in Fig. 3 (a).
√ Noise power estimation is shown in Fig. 3 (b) with respect
tributes severely to the overall latency, e.g. due to 1/ x, which
could be improved by fast approximation techniques [12] or to σT = 10−SNR/10 E[kAX k2F ]/L/N . The mean number of
stronger parallelism. iterations necessary for the stop criteria Eq. (19) is shown in
Fig. 3 (c). The error in source power estimation is shown in
Fig. 3 (d) as
5. SIMULATION RESULTS
v
The fixed-point prototype, developed with Vivado HLS, is
u K X
J
u1 1X
new 2
running on an Kintex 7 FPGA and connected through a γ RMSE =t |E[|xmk ,l |2 ]j − γm k ,j
| (21)
KJ j
TCP/IP prototyping setup with MATLAB. It is compared k
with a floating-point reference in MATLAB. An example
scenario similar to [4] places K = 3 independent sources with k-th source power E[|xmk ,l |2 ]j and k-th estimated source
on an angular grid in the interval θ ∈ [−90, 90)◦ with power γm new
k ,j
. For high array SNR, estimated source power
M = 512 different angles. The sources are located at of the fixed-point implementation is less accurate than the
−3.3, 1.9, and 74.9◦ with magnitudes 12, 22, and 20 dB. floating-point reference due to limited dynamic range in cal-
An ULA with N = 16 sensors is used. Additive i.i.d. culating a single iteration.
complex Gaussian noise is added according to a given
array SNR = 20 log10 (kAx l k2 /nrel ). The MMV Y has
L = 64 snapshots, is mapped to [−1, 1] + [−1, 1]j for both,
6. CONCLUSION
and quantized for the fixed-point prototype. A Monte Carlo
simulation with J = 100 realizations are carried out. A
widely used quality measure for DOA estimation is The presented implementation of a sparse Bayesian learning
algorithm for directions of arrival estimation based on [4] is
v
u
u1 1X K X J suitable for FPGA implementation using on fixed-point arith-
DOA RMSE = t |θmk − θ̂mk ,j |2 (20) metic. It offers higher performance with continuous operation
KJ j and has a good agreement with the floating-point reference in
k
terms of DOA root mean squared error. Several signal pro-
with k-th source direction θmk and k-th estimated direc- cessing steps of the SBL algorithm benefit from the paral-
tion θ̂mk ,j , for which the fixed-point FPGA implementation lelism inherent to the FPGA architecture.
7. REFERENCES [11] A. Suardi, E. C. Kerrigan, and G. A. Constantinides,
“Fast FPGA prototyping toolbox for embedded opti-
[1] D. Malioutov, M. Cetin, and A. S. Willsky, “A sparse mization,” in Proc. of Eur. Control Conference (ECC).
signal reconstruction perspective for source localization IEEE, 2015, pp. 2589–2594.
with sensor arrays,” IEEE Transactions on Signal Pro-
cessing, vol. 53, no. 8, pp. 3010–3022, Aug 2005. [12] M. Iştoan and B. Pasca, “Flexible fixed-point function
generation for FPGAs,” in Proc. of IEEE 24th Sympo-
[2] M. E. Tipping, “Sparse Bayesian learning and the rel- sium on Computer Arithmetic (ARITH). IEEE, 2017, pp.
evance vector machine,” Journal of machine learning 123–130.
research, vol. 1, pp. 211–244, Jun 2001.

[3] D. P. Wipf and B. D. Rao, “An empirical Bayesian strat-


egy for solving the simultaneous sparse approximation
problem,” IEEE Transactions on Signal Processing, vol.
55, no. 7, pp. 3704–3716, Jul 2007.

[4] P. Gerstoft, C. F. Mecklenbräuker, A. Xenaki, and


S. Nannuru, “Multisnapshot sparse Bayesian learning
for DOA,” IEEE Signal Processing Letters, vol. 23, no.
10, pp. 1469–1473, 2016.

[5] Z. Zhang and B. D. Rao, “Sparse signal recovery


with temporally correlated source vectors using sparse
Bayesian learning,” IEEE Journal of Selected Topics in
Signal Processing, vol. 5, no. 5, pp. 912–926, 2011.

[6] E. Zöchmann, M. Lerch, S. Caban, R. Langwieser, C.F.


Mecklenbräuker, and M. Rupp, “Directional evaluation
of receive power, Rician K-factor and RMS delay spread
obtained from power measurements of 60 GHz indoor
channels,” in Proc. of IEEE-APS Topical Conference on
Antennas and Propagation in Wireless Communications
(APWC). IEEE, 2016, pp. 246–249.

[7] E. Zöchmann, M. Lerch, S. Pratschner, R. Nissel, S. Ca-


ban, and M. Rupp, “Associating spatial information to
directional millimeter wave channel measurements,” in
Proc. of IEEE 86th Vehicular Technology Conference
(VTC-Fall). IEEE, 2017, pp. 1–5.

[8] E. Zöchmann, K. Guan, and M. Rupp, “Two-ray models


in mmWave communications,” in Proc. of IEEE 18th
International Workshop on Signal Processing Advances
in Wireless Communications (SPAWC), July 2017, pp.
1–5.

[9] D. Xu, Z. Liu, X. Qi, Y. Xu, and Y. Zeng, “A FPGA-


based implementation of MUSIC for centrosymmetric
circular array,” in Proc. of 9th International Conference
on Signal Processing (ICSP), Oct 2008, pp. 490–493.

[10] K. Huang, J. Sha, W. Shi, and Z. Wang, “An efficient


FPGA implementation for 2-D MUSIC algorithm,” Cir-
cuits, Systems, and Signal Processing, vol. 35, no. 5, pp.
1795–1805, May 2016.

You might also like