
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 61, NO. 8, AUGUST 2014 4093


An FPGA-Based Fully Synchronized Design of a
Bilateral Filter for Real-Time Image Denoising
Anna Gabiger-Rose, Student Member, IEEE, Matthias Kube, Robert Weigel, Fellow, IEEE, and
Richard Rose, Student Member, IEEE
Abstract: In this paper, a detailed description of a synchronous field-programmable gate array implementation of a bilateral filter for image processing is given. The bilateral filter is chosen for one unique reason: It reduces noise while preserving details. The design is described on register-transfer level. The distinctive feature of our design concept consists of changing the clock domain in such a manner that kernel-based processing is possible, which means the processing of the entire filter window within one pixel clock cycle. This feature of the kernel-based design is supported by the arrangement of the input data into groups so that the internal clock of the design is a multiple of the pixel clock given by a targeted system. Additionally, by exploiting the separability and the symmetry of one filter component, the complexity of the design is considerably reduced. Combining these features, the bilateral filter is implemented as a highly parallelized pipeline structure with very economical and effective utilization of dedicated resources. Due to the modularity of the filter design, kernels of different sizes can be implemented with low effort using our design and the given instructions for scaling. As the original form of the bilateral filter with no approximations or modifications is implemented, the resulting image quality depends on the chosen filter parameters only. Due to the quantization of the filter coefficients, only negligible quality loss is introduced.
Index Terms: Bilateral filter, field-programmable gate array (FPGA), image processing, noise reduction, real-time processing.
I. INTRODUCTION
Bilateral filtering has gained great popularity in image processing due to its capability of reducing noise while preserving the structural information of an image. The bilateral filter [1] consists of two components. The detail-preserving property of the filter is mainly caused by the nonlinear filter component, also called the photometric filter. It selects the pixels of similar intensity, which are averaged by the linear component afterward. Very often, the linear component is formulated as a low-pass filter. The amount of noise reduction via selective averaging and the amount of blurring via low-pass filtering are adjusted by two parameters. The understanding of these parameters is very intuitive, which makes the bilateral filter an almost all-purpose solution in image processing. The authors of [2] and [3] show that noise filtering, despite the prevailing view, does not always imply resolution reduction but can even be used to sharpen the edges [2] or to enhance flowlike structures [3]. In [4], the motion-adaptive bilateral filter is used for quality improvement in low bit rate video coding. Also, in [5], the bilateral filter is applied for noise reduction in a method for local tone mapping which maps a high dynamic range image to a low dynamic range image.
Manuscript received March 5, 2012; revised August 6, 2012 and October 24, 2012; accepted December 6, 2012. Date of publication October 25, 2013; date of current version February 7, 2014.
A. Gabiger-Rose, R. Weigel, and R. Rose are with the Institute for Electronics Engineering, Friedrich-Alexander University of Erlangen-Nuremberg, 91058 Erlangen, Germany (e-mail: anna.gabiger-rose@fau.de; robert.weigel@fau.de; richard.rose@fau.de).
M. Kube is with the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, 91058 Erlangen, Germany (e-mail: matthias.kube@iis.fraunhofer.de).
Digital Object Identifier 10.1109/TIE.2013.2284133
Recently, bilateral filtering has attracted considerable attention in medical image processing and nondestructive testing. The authors of [6] studied the impact of noise reduction by the bilateral filter applied to the reconstructed images. They concluded that the images processed with this filter show a significant improvement in image quality compared to their unfiltered counterparts. In [7], the authors discuss the results of noise reduction by the bilateral filter in projection space. This means that the noise filtering takes place prior to computing the reconstructed volume. It has been concluded that noise reduction of this kind can be translated into a dose reduction in X-ray computed tomography. Considering industrial applications, the dose reduction permits the reduction of the scanning time and thus allows a higher throughput of test items.
Our own experiments and studies shown in [8] and [9] confirm the possible dose reduction. As the reduction of the exposure time due to filtering is feasible, we are interested in real-time filtering of projections. Moreover, the filter must not reduce the spatial resolution of the projections, so that the visibility of defects in a reconstruction is maintained. Since we achieve very satisfying results considering detail preservation with our field-programmable gate array (FPGA) implementation presented in [10], we intend to give a deeper insight into our work.
The major contribution of this paper is the detailed description of a novel FPGA design architecture of the bilateral filter on register-transfer level (RTL). This abstraction level is chosen for the possibility of direct specification of the clocking scheme [11]. The main advantages of this design are the capability of real-time processing and economical and effective utilization of resources through the following.
1) Sorting the data into equal groups to which separate pipelines are assigned.
2) Raising the internal clock frequency according to the data flow.
3) Avoiding the need for an external image buffer.
0278-0046 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Moreover, due to the modularity of the design, it can be extended to implement arbitrary kernel sizes with low effort. The instructions required for this can be found later in this paper.
The remainder of this paper is organized as follows. In Section II, we consider the related work. After a short description of the bilateral filter in Section III, we give a detailed description of our FPGA design in Section IV. Section IV is the main part of this paper, presenting the filter design stage by stage. In Section V, the criteria applied to the evaluation of the image quality before and after the noise filtering are detailed. After that, in Section VI, the results are discussed, and the performance potential of our filter design is analyzed.
II. RELATED WORK
Since the bilateral filter is in widespread use, a lot of effort has been put into its acceleration for use in practical applications. Among the publications concerned with speeding up bilateral filtering, two main trends can be identified. One stream is focused on the modification of the filtering components, resulting in an efficient algorithm. Another trend is to accelerate the filtering through parallelizing the algorithm or through hardware acceleration, including modifications of the filter at the same time.
In [12], a fast approximation of the original bilateral filter is proposed. Here, the 2-D filtering is separated into two 1-D operations, performing 1-D bilateral filtering in one arbitrary dimension and filtering the intermediate result in the same manner in the subsequent dimension. The authors report that the dependence of the execution time on the number of filter dimensions decreases from exponential to linear. This approach requires a little memory overhead but results in a filter which is fast enough to be used for preprocessing in video compression systems. However, as the photometric component of the bilateral filter is not separable, the image resulting from the modified filter is documented to be slightly different from the image produced by the original filter.
Another acceleration approach, proposed in [13], has provided a basis for numerous subsequent works. This approach provides a numerical scheme for speeding up the filtering via a piecewise-linear approximation of the bilateral filter in the intensity domain and substituting the low-pass filtering by downsampling. In [14], this technique is extended by transposing the computation to a 3-D space presenting the image intensity as a third dimension over the 2-D image coordinate space. After that, the authors of [15] formulated the concept of the bilateral grid and implemented the bilateral filter using the proposed data structure on three different graphics processing units (GPUs). Only with this hardware acceleration does processing at 30 fps become possible, which they regard as real-time performance. Later, the technique proposed in [13] was also implemented on a GPU by the authors of [16] and is also capable of real-time processing at the same frame rate. More recently, the lazy sliding window implementation of the approach in [13] was proposed in [17]. This method is suitable for single-instruction-multiple-data-type processors like DSPs. In this case, the speedup also allows applications requiring real-time performance. The main drawback of the filter acceleration approach discussed so far is the large amount of memory required for the implementation.
Instead of a piecewise-linear approximation and subsampling, the idea of utilizing a histogram-based approach for accelerating the filter is presented in [18] and [19]. The main difference between these two works is that, in [18], a hierarchy of partial distributed histograms on multiple tiers is computed and adjusted for each output pixel, while the author of [19] calculates the integral histogram of the image and extracts the histogram for each target filter window to obtain one output pixel. Both methods are fast, but real-time performance of the histogram-based approach in [19] can only be achieved by the very-large-scale-integration design of the filter shown in [20]. The memory demand of the histogram-based acceleration method is also high but is lower than that of the piecewise-linear approximation and subsampling approach.
The aforementioned examples show that a filter modification technique reaches real-time performance only if its implementation utilizes hardware acceleration. Most of the referred works rely on GPUs for acceleration. However, in fields of application in which high power efficiency is crucial, an FPGA solution is preferable. In [21], an algorithm for the denoising of medical images is implemented on an FPGA and four different GPUs. The authors show that the power consumption of their FPGA implementation is always significantly lower. Furthermore, the authors of [21] point out that an FPGA implementation allows latency to be counted in image lines, resulting in delays lower than one frame, while the latency on a GPU is always one frame. This is relevant for many medical applications which demand fast image output to support interactive operations.
The authors of [22] also choose an FPGA implementation for their image processing system because moving time-critical functionalities, like the edge detection in an image, to hardware platforms makes it possible to keep delays in the control loop to a minimum. The authors of [23] and [24] report excellent experience of using FPGAs for motion control of robots based on real-time image processing. The main reason for using FPGAs for real-time robotics tasks is the ability of FPGAs to satisfy the requirement for high computational power and data throughput [24]. Moreover, FPGA solutions offer additional advantages, such as reconfigurability and portability.
However, considering the complexity and timing constraints of the algorithm to be implemented, the suitability of the chosen hardware platform has to be checked [25]. A DSP implementation is regarded as more appropriate for complex algorithms with high data dependence. For algorithms with low data dependence and tight timing constraints, an FPGA solution is more suitable. The authors of [25] discuss in detail the advantages of using FPGAs even if the algorithm shows both high complexity and tight timing constraints. At the same time, the authors of [26] emphasize in their conclusion that FPGA-based digital processing systems achieve better performance, at a lower cost, than traditional solutions based on DSPs.
Furthermore, the parallel architecture of the FPGA provides an excellent platform for the implementation of parallel and pipelined structures. This conclusion is drawn by many authors. For example, by implementing an algorithm for color image segmentation for object detection in full parallelism on an FPGA, the authors of [27] report a drastic improvement of the speed of segmentation compared with the sequential-code-based segmentation. In [28], a design of a fully pipelined data path for real-time face detection using an FPGA is described which supports high-speed detection irrespective of the number of faces in an image. The authors of [29] implement their parallel and fully pipelined hardware for real-time electromagnetic transient simulation on an FPGA and thereby solve the challenging problem of implementing complex simulation models.
There are several publications dealing with FPGA implementations of the bilateral filter. In [30], one of these designs is presented. The VHSIC hardware description language (VHDL) code of this design is generated automatically from the models for FPGA synthesis using System Generator from Xilinx. Although the optimization setting for the code generation was for maximum clock frequency, the authors admit that the speed of their implementation for a 15 × 15 pixel filter kernel is insufficient for a real-time application. The authors of [31] compared a VHDL and a high-level synthesis (HLS) description, created by System Generator, of an adaptive impulse noise filter and concluded that a higher system clock speed can be achieved using the VHDL description. Thus, these publications show by example that the handcrafted optimization of an FPGA design regarding both the operating frequency and the resource utilization is still irreplaceable.
A different approach for the FPGA implementation of a real-time bilateral filter has been proposed in [32]. The modified filter is based on the calculation of the filter coefficients from the photometric filter only. The spatial filtering is eliminated due to the processing of the minimal window of 3 × 3 and raising of the derived photometric coefficients to the power of 8. According to the authors, for a moderate noise level, their modified bilateral filter can achieve slightly better results compared to the traditional bilateral filter shown in [1]. However, the original bilateral filter can be tuned by two parameters which are highly responsible for the filtering performance. Unfortunately, no description of the parameters used for this comparison is given in [32].
The work published in [33] is most closely related to our work. The major parallel to our design consists in implementing the bilateral filter on an FPGA without any modification. This approach is sometimes called the brute-force method. However, the main difference to our work is that the authors developed their design using an HLS tool. The resulting architecture presents a 3 × 3 filter kernel. In contrast, our design is based on an RTL description and presents a 5 × 5 filter kernel. Our design allows high clock frequency and high data throughput and shows only a slight increase of resource demand considering the larger kernel. It follows that our architecture utilizes hardware resources more efficiently and more economically.
III. BILATERAL FILTER
The bilateral filter [1] embodies the idea of a combination of domain and range filtering. The domain filter averages the nearby pixel values and acts thereby as a low-pass filter. The range filter stands for the nonlinear component and plays an important part in edge preserving. This component allows averaging of similar pixel values only, regardless of their position in the filter window. If the value of a pixel in the filter window diverges from the value of the pixel being filtered by a certain amount, the pixel is skipped.
Taking Gaussian noise into account, the shift-variant filtering operation of the bilateral filter is given by
$$\tilde{\Phi}(\tilde{\mathbf{m}}_0) = \frac{1}{k(\mathbf{m}_0)} \sum_{\mathbf{m} \in F} \Phi(\mathbf{m}) \cdot s\bigl(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\bigr) \cdot c(\mathbf{m}_0, \mathbf{m}). \tag{1}$$
The term $\mathbf{m} = (m, n)$ denotes the pixel coordinates in the image to be filtered, and $\mathbf{m}_0 = (m_0, n_0)$ and $\tilde{\mathbf{m}}_0 = (\tilde{m}_0, \tilde{n}_0)$ represent the coordinates of the centered pixel in the noisy and in the filtered images, respectively. With these notations, $\tilde{\Phi}(\tilde{\mathbf{m}}_0)$ denotes the gray value of the pixel being filtered, and $\Phi(\mathbf{m})$ identifies the gray value of the spatially neighboring pixels to $\Phi(\mathbf{m}_0)$ in the filter window $F$.
The following expressions (2) and (3) describe the photometric and the geometric components $s(\Phi(\mathbf{m}_0), \Phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively:
$$s\bigl(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\bigr) = \exp\left(-\frac{1}{2}\left(\frac{\Phi(\mathbf{m}_0) - \Phi(\mathbf{m})}{\sigma_{ph}}\right)^{2}\right) \tag{2}$$
$$c(\mathbf{m}_0, \mathbf{m}) = \exp\left(-\frac{1}{2}\left(\frac{\lVert \mathbf{m}_0 - \mathbf{m} \rVert}{\sigma_{c}}\right)^{2}\right) \tag{3}$$
where the parameters $\sigma_{ph}$ and $\sigma_{c}$ regulate the width of the Gaussian curve assigned to $s(\Phi(\mathbf{m}_0), \Phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively. The photometric component compares the gray value of the centered pixel with the gray values of the spatial neighborhood and computes the corresponding weight coefficients depending on the factor $\sigma_{ph}$. The more the absolute difference of the gray values exceeds $\sigma_{ph}$, the lower is the corresponding filter coefficient and vice versa. The domain filter $c(\mathbf{m}_0, \mathbf{m})$ acts as a standard low-pass filter, the weights of which decrease with increasing spatial distance between the centered pixel and the pixels in the neighborhood.
Normalization with
$$k(\mathbf{m}_0) = \sum_{\mathbf{m} \in F} s\bigl(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\bigr) \cdot c(\mathbf{m}_0, \mathbf{m}) \tag{4}$$
guarantees that the range of the filtered images does not change significantly due to the filtering. Owing to the fact that the coefficients of the photometric component cannot be computed in advance, the division by the normalization factor cannot be avoided by means of prescaling of the filter coefficients.
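As a point of reference for (1)-(4), the following is a minimal floating-point sketch of the brute-force filter in plain Python. It is not the hardware design described in this paper: the function name `bilateral_filter`, the list-of-lists image format, and the clipping of the window at the image borders are assumptions made for illustration only.

```python
import math

def bilateral_filter(image, sigma_ph, sigma_c, radius=2):
    """Brute-force bilateral filter on a 2-D list of gray values.

    radius=2 corresponds to a 5 x 5 window. Border handling (clipping the
    window to the image) is an assumption, not taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for m0 in range(rows):
        for n0 in range(cols):
            center = image[m0][n0]
            acc = 0.0  # weighted sum, numerator of (1)
            k = 0.0    # normalization factor (4)
            for m in range(max(0, m0 - radius), min(rows, m0 + radius + 1)):
                for n in range(max(0, n0 - radius), min(cols, n0 + radius + 1)):
                    value = image[m][n]
                    # photometric weight (2)
                    s = math.exp(-0.5 * ((center - value) / sigma_ph) ** 2)
                    # geometric weight (3), Euclidean pixel distance
                    dist = math.hypot(m0 - m, n0 - n)
                    c = math.exp(-0.5 * (dist / sigma_c) ** 2)
                    acc += value * s * c
                    k += s * c
            out[m0][n0] = acc / k
    return out
```

Because the center pixel always contributes the weight 1 to `k`, the division is well defined. A small `sigma_ph` leaves a sharp step edge essentially untouched, while a flat region is returned unchanged.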
IV. DESIGN CONCEPT
The image data, as well as all constants and coefficients used in the following design concept, are integer numbers. As discussed in Section VI, there is no need to implement floating-point computation. With the aid of the presented design concept, the bilateral filter can be realized as a highly parallelized pipeline structure giving great importance to effective resource utilization. In this paper, the data paths are detailed. The description of the control signals is not addressed here.
Fig. 1. Order of the functional units of the bilateral filter.
Fig. 2. Principle of the input data retrieval for the image filtering.
For the design description, a window size of 5 × 5 is chosen. This window size is a tradeoff between high noise reduction and low blurring effect.
The design concept for the implementation of the bilateral filter is subdivided into three functional blocks. The block-based design approach reduces design complexity and simplifies validation [34]. Fig. 1 presents these units and their order in the concept. The input data marked by Data_in are read line by line and arranged for further processing in the register matrix. The second unit is the photometric filter which weights the input data according to the intensity of the processed pixels. The filtering is completed by the geometric filter, and the filtered data are marked by Data_out.
A. Register Matrix
The photometric filter component, also often referred to as a range filter in the related literature, is a nonlinear filter. This means that the filter coefficients change for every filter position. Thus, the pixel weights for the photometric component have to be calculated separately for every pixel in the filter window. The number of weights depends on the filter window size. Here, 24 weights have to be computed for the filtering of one image pixel.
The filter window is shifted first along the input lines representing the image rows, moving one row down every time the preceding row has been filtered. Consequently, this filtering technique demands that at least five lines be stored for the period of time during which a line is filtered. As an external image buffer is undesired because of the additional expense of resources due to the memory controller and because of the additional latency due to the memory accesses, the five input lines are stored in the line storages, which are implemented as block RAMs for data with N bits. The five input lines are called image rows or rows in the following. These five rows include the row to be filtered, two preceding rows, and two succeeding rows.
This arrangement is depicted in Fig. 2. The pixel being filtered is marked by mid_pix. This pixel and its neighborhood in the solid box represent the kernel of the bilateral filter. After the middle row has been filtered, the outer preceding row line storage n-2 moves out of the register matrix. As the input data are read into the register matrix pixel by pixel, the content of the line storages and of the filter kernel is shifted by one pixel at each clock event. This shift emulates the shift of the filter kernel. In this way, at the end of an image line, all remaining rows are shifted one row down. The former succeeding row line storage n + 1 can now be processed. The output lines form the output image which is stored externally.
Fig. 3. Register matrix of the kernel-based design concept.
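The line storage principle can be illustrated by a small software model, given here as a sketch under simplifying assumptions. A single shift register of length 4 · width + 5 stands in for the block RAM line storages and the register matrix; the name `stream_windows` and the choice to emit only fully valid windows (no border treatment) are hypothetical and not taken from the design.

```python
from collections import deque

def stream_windows(pixels, width, size=5):
    """Yield (row, col, window) for every fully valid size x size window
    while consuming a raster-order pixel stream one pixel per step."""
    depth = (size - 1) * width + size      # pixels that must be held at once
    shift = deque(maxlen=depth)            # oldest pixel drops out automatically
    for i, p in enumerate(pixels):
        shift.append(p)                    # one shift per pixel clock
        row, col = divmod(i, width)
        if row >= size - 1 and col >= size - 1:
            # shift[0] is the top-left pixel of the window whose
            # bottom-right pixel has just arrived
            window = [[shift[r * width + c] for c in range(size)]
                      for r in range(size)]
            yield row - size // 2, col - size // 2, window
```

The model never holds more than four lines plus five pixels, which mirrors the statement that no external image buffer is required.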
The parallel calculation of 24 weights in the photometric filter component and the subsequent weighting in the geometric component combined with the final normalization at the filter output require a large amount of resources considering the short time of just one pixel cycle. Due to the flexibility of the clock management in FPGAs, this challenge can be met. The solution is offered by our kernel-based design concept in Fig. 3. The single registers are interconnected in such a manner that, aside from the shift of the filter window by one pixel, the entire kernel is provided to the next filter stage simultaneously. This is an important advantage of the presented kernel-based design concept as no extra data buffer is required. On the other hand, it is necessary to process all 25 pixels in one pixel cycle in order to keep up with the reading of the input lines into the register matrix.
The output of the register matrix is sorted into groups, in this case into six groups, and fed into the photometric filter component synchronously with the quadruple pixel clock frequency. The number of the groups is explained by the symmetry of the geometric filter component which is discussed later in Section IV-C. The sorting is done by means of multiplexing the pixels in the manner shown in Fig. 3. The quadruplication of the filter processing clock is implemented by setting the select signal of the multiplexers four times in one pixel clock. Here, the clock domain changes to the fourfold of the input pixel clock. The counter on the top of Fig. 3 generates the select signal and thus controls the readout of the register matrix. This counter is clocked with the quadruple pixel clock as well. The counter is first enabled after the whole register matrix is filled. The pixels in each group are processed in parallel while each group is pipelined through to the register matrix output stage. The pixel in the center of the filter window is not a part of any group and is forwarded to a latch belonging to the input stage of the photometric filter component. The sorting of the pixels into groups and the quadruplication of the pixel clock are the key to the presented synchronous FPGA design concept using a parallelized pipeline architecture.
Fig. 4. Abstract illustration of the photometric filter component.
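The relation between the group count and the clock factor follows from 24 neighbors divided by 6 groups, which gives 4 pixels per group and hence 4 internal clock cycles per pixel cycle. The sketch below illustrates one possible grouping. The concrete assignment (groups 0 to 4 taken as the window rows without the center column, group 5 taken as the center column without the center pixel) is an assumption inferred from the textual description of the processing order; the actual multiplexer wiring of Fig. 3 may differ. The names `group_window` and `schedule` are hypothetical.

```python
def group_window(window):
    """Split a 5 x 5 window into the center pixel and six groups of four.

    Assumed assignment: groups 0..4 are the rows without the center column,
    group 5 is the center column without the center pixel.
    """
    size = 5
    mid = size // 2
    side_cols = [c for c in range(size) if c != mid]
    groups = [[window[r][c] for c in side_cols] for r in range(size)]
    groups.append([window[r][mid] for r in range(size) if r != mid])
    return window[mid][mid], groups

def schedule(groups):
    """For each of the four internal clock ticks, list the six pixels
    (one per group) that are processed in parallel."""
    return [[g[t] for g in groups] for t in range(4)]
```

With the pixels numbered column by column from 1 to 25, the first tick of this sketch delivers the first column together with the first pixel of the center column, and all 24 neighbors are covered after four ticks.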
B. Photometric Component
After the register matrix has been filled, the grouped image data are provided to the photometric filter component which is pictured in Fig. 4. At the output of the photometric filter, the weighted pixels appear, still sorted into groups, accompanied by the weighted mid_pix. Additionally, the photometric coefficients have to be forwarded for the required normalization at the last stage of the filtering according to (4). Thus, in parallel to the pixels, the photometric coefficients also have to be processed by the geometric filter in order to obtain the normalization factor defined in (4). For this reason, the output of the photometric filter consists of the following:
1) weighted pixels sorted into groups 0 ... 5;
2) the weighted pixel being filtered, marked by mid_pix;
3) photometric coefficients corresponding to groups 0 ... 5.
In further stages of the design, the weighted pixel values, i.e., the outputs of the multipliers, are named by their groups 0 ... 5.
A detailed functional flow block diagram of the photometric filter is shown in Fig. 5. The pixel in the center of the filter window has to be available during the calculation of the required 24 pixel weights. Latching the centered pixel allows the computation of the gray value differences between the centered pixel and the remaining pixels inside the filter window. Each group contains four pixels. A separate pipeline belonging to each group makes it possible to process the entire neighborhood of mid_pix within one pixel clock cycle. All six pipelines are designed identically.
Fig. 5. Photometric filter component.
Fig. 6. Processing order of input data in the photometric filter component.
The arrangement and the processing order of the input data of the photometric component are shown in Fig. 6. At the first internal clock event $t_0$, the first pixels of each group are provided to the respective pipeline. At the second internal clock event $t_1$, the second pixels of each group enter the component. This organization of groups allows the processing of the whole filter window in four internal clock cycles corresponding to one pixel cycle. In the upper part of Fig. 5, the processing path for the group 0 is shown; in the lower part, there is the processing path for the group 5.
Fig. 7. Limitation of the number of coefficients.
The combinatorial blocks comb.0 ... 5 compute the absolute gray value difference required by (2). In order to keep the design synchronous, the gray values of each pipeline are registered during the difference calculation. The upper path in Fig. 5 shows the required registers labeled group 0 to make sure that the gray value appears at the input of the multiplier at the same time as the corresponding photometric coefficient. Throughout the following, we use registers to keep our design synchronous. This makes any delay control inside our architecture redundant.
To avoid the calculation of the expensive exponential, all possible values of the function (2) are precalculated and stored in the lookup table (LUT). The absolute difference of the gray values itself is directly interpreted as the address of the corresponding weight coefficient in the LUT.
Due to the quantization, the number of the weight coefficients is limited. This limit depends on three parameters:
1) the word length N of the input data;
2) the parameter $\sigma_{ph}$;
3) the word length W of the coefficients.
The first point means that increasing the color depth of an image causes a larger number of intensity differences that have to be stored in the LUT. Depending on the parameter $\sigma_{ph}$, the slope of the Gaussian curve is steeper or flatter, which influences the number of coefficients different from zero after the quantization. The word length W itself determines which coefficients are actually different from zero after the quantization.
In Fig. 7, the coefficients are plotted for N = 8 b, W = 8 b, and $\sigma_{ph} = 60$. As the negative exponential converges toward zero for increasing gray value differences, there are only a limited number of quantized coefficients that are different from zero. Considering the example in Fig. 7, there are only 188 coefficients to be stored. For simplification of the internal control, the number of coefficients is extended to the next power of 2, resulting in the highest address $2^P - 1$. In the example, the highest address is 255. The coefficients are stored in the LUT of each pipeline in the initialization phase of the filtering.
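The precalculation of the LUT can be sketched as follows. The scaling by $2^W - 1$ matches the highest coefficient named for the center pixel, but the rounding mode (round to nearest) and the rule for choosing P (the smallest P with $2^P$ greater than the number of nonzero coefficients, so that the last address always holds a zero) are assumptions. Since the rounding convention is not specified in the text, the count of nonzero entries produced here need not match the 188 quoted for Fig. 7. The function name `photometric_lut` is hypothetical.

```python
import math

def photometric_lut(sigma_ph, n_bits=8, w_bits=8):
    """Quantized values of the photometric function, indexed by the
    absolute gray value difference. Returns (lut, p_bits, nonzero)."""
    max_coeff = (1 << w_bits) - 1                      # highest coefficient
    full = [round(max_coeff * math.exp(-0.5 * (d / sigma_ph) ** 2))
            for d in range(1 << n_bits)]
    nonzero = sum(1 for c in full if c > 0)            # differences 0 .. nonzero-1
    # Assumed rule: extend the table to the next power of two above the
    # number of nonzero entries, but never beyond the N-bit address range.
    p_bits = min(n_bits, nonzero.bit_length())
    return full[: 1 << p_bits], p_bits, nonzero
```

For N = 8 b, W = 8 b, and $\sigma_{ph} = 60$, the sketch yields a table of 256 entries (P = 8) whose highest address 255 holds a zero, in line with the description above.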
Fig. 8. Abstract illustration of the geometric filter component.
If N is greater than P, a logical disjunction of the upper (N-P) bits checks whether the gray value difference is greater than the chosen limit $2^P - 1$. The result of the disjunction selects the coefficient address. If the gray value difference is greater than the limit, the weight coefficient is set to zero, which is stored at the address $2^P - 1$. In the opposite case, the corresponding coefficient is read out of the LUT. This coefficient may also be zero as the coefficient addresses are extended up to $2^P - 1$. During the readout of the coefficient, the related gray value is registered for synchronicity. At the next internal clock event, the gray values of each group are multiplied by the corresponding coefficients while registering the coefficients in coeff. group 0 ... 5 for the final normalization.
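The address selection can be modeled in a few lines. This is a behavioral sketch only; the function name `coefficient_address` and the example word lengths N = 12 and P = 8 are hypothetical and chosen for illustration.

```python
def coefficient_address(diff, n_bits, p_bits):
    """Return the LUT address for an N-bit absolute gray value difference.

    If any of the upper (N - P) bits is set, the difference exceeds the
    limit 2**P - 1 and the last address, which holds a zero, is selected.
    """
    limit_addr = (1 << p_bits) - 1
    if n_bits > p_bits:
        upper = diff >> p_bits        # the upper (N - P) bits
        if upper != 0:                # disjunction (OR) of these bits is 1
            return limit_addr
    return diff & limit_addr          # lower P bits address the LUT directly
```

For N = 12 and P = 8, a difference of 100 addresses entry 100, whereas a difference of 300 sets an upper bit and is redirected to address 255.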
The pixel in the center of the filter window does not belong to any group and is processed separately. This pixel is multiplied by the highest coefficient $2^W - 1$ and delayed by registers photo_k middle and geom_in middle for synchronicity.
C. Geometric Component
For the design of the geometric filter component, advantage is taken of its separability and its symmetry. Because of the separability, the geometric filter is split into the vertical and horizontal parts. Therefore, 2-D filtering is replaced by successive 1-D filtering in vertical and horizontal directions. This solution is preferred in the design of the geometric filter because 1-D filtering can be implemented more efficiently. Both parts are implemented twice to filter the weighted image data and the photometric weights simultaneously, which is shown in Fig. 8. The input of the vertical component parts is the 2-D array of the filter window and the 2-D array of the corresponding coefficients. Each output is a 1-D vector in which each entry represents one filtered and cumulated column. The coefficients of the geometric component are labeled C_0, C_1, C_2. The output of the geometric filter consists of the filtered unnormalized gray value (kernel result) and the normalization factor (norm result).
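The two parallel paths through the geometric filter can be sketched as an integer model of one output pixel. The code below is an illustration under assumptions: the tap values, the LUT built with $\sigma_{ph} = 30$, the use of floor division for the final normalization, and all names (`pass_1d`, `geometric`, `filter_pixel`) are hypothetical and not taken from the design.

```python
import math

def pass_1d(values, taps):
    """One 1-D weighting step with five taps."""
    return sum(t * v for t, v in zip(taps, values))

def geometric(window, taps):
    """Vertical pass over each column, then horizontal pass over the sums."""
    column_sums = [pass_1d([row[c] for row in window], taps) for c in range(5)]
    return pass_1d(column_sums, taps)

def filter_pixel(window, lut, taps):
    """Integer model of one output pixel for a 5 x 5 window."""
    mid = window[2][2]
    # photometric coefficients, addressed by the absolute difference
    coeff = [[lut[abs(p - mid)] for p in row] for row in window]
    weighted = [[k * p for k, p in zip(krow, prow)]
                for krow, prow in zip(coeff, window)]
    kernel_result = geometric(weighted, taps)   # filtered, unnormalized value
    norm_result = geometric(coeff, taps)        # normalization factor
    return kernel_result // norm_result         # assumed: floor division

# Example values (assumptions): 8-bit LUT and taps summing to 128.
lut = [round(255 * math.exp(-0.5 * (d / 30) ** 2)) for d in range(256)]
taps = [8, 28, 56, 28, 8]
edge = [[0, 0, 200, 200, 200] for _ in range(5)]
print(filter_pixel(edge, lut, taps))
```

Since both paths pass through identical geometric stages, any common scaling of the taps cancels in the final division, and the pixels on the far side of the edge receive the coefficient zero.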
Due to the symmetry of the weight coefficients of the geometric component, the order of multiplication and addition is swapped in both filter parts. This fact plays an important role in pixel group formation. At first, the weighted gray values which are located at the same distance from the centered pixel in the filter window are summed up [35]. Because of the equal distance, these gray values should be weighted with the same coefficient anyway. For a 5 × 5 window, there are always 4 or 8 pixels at the same distance from the centered pixel. For the simplicity of the design, it makes sense to assemble the pixels into equally large groups. Smaller groups allow for better handling of the design. For this reason, the pixels are divided into groups of four with regard to the subsequent processing explained in the following sections. After the accumulation of the pixels according to their symmetry, the sum is multiplied by the corresponding coefficient. The horizontal processing is done in the same way.
Fig. 9. Vertical part of the geometric filter component.
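The effect of swapping addition and multiplication can be checked with a short sketch. It assumes that C_0 weights the center entry and C_1 and C_2 weight the entries at distance one and two, which is a plausible but unconfirmed reading of the labels; the function names are hypothetical.

```python
def fold_1d(values, c0, c1, c2):
    """Symmetric 5-tap weighting: add mirrored entries first, then multiply.
    Needs 3 multiplications instead of 5."""
    p0, p1, p2, p3, p4 = values
    return c2 * (p0 + p4) + c1 * (p1 + p3) + c0 * p2

def folded_5x5(window, c0, c1, c2):
    """Vertical part on every column, then horizontal part on the results."""
    columns = [fold_1d([window[r][c] for r in range(5)], c0, c1, c2)
               for c in range(5)]
    return fold_1d(columns, c0, c1, c2)

def direct_5x5(window, c0, c1, c2):
    """Reference: full 2-D weighting with 25 products of separable taps."""
    k = [c2, c1, c0, c1, c2]
    return sum(k[r] * k[c] * window[r][c]
               for r in range(5) for c in range(5))
```

With integer data, both variants give identical results, so the folded form saves multipliers without changing the filter.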
The coefficients of the geometric component are scaled such that the sum of the vertical coefficients (and likewise of the horizontal ones) equals the so-called normalized one [35]. For signed coefficients with word length W, the normalized one is equal to $2^{W-1}$. This means that the division of the weighted gray values and photometric coefficients after geometric filtering can be realized as a simple shift operation. In the last stage, the normalized filtered gray value has to be divided by the normalized product of the photometric coefficients. The geometric coefficients are calculated in advance and stored in a block RAM.
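A possible way to obtain such coefficients offline is sketched below. It assumes Gaussian taps, rounding to the nearest integer, and a correction of the center tap so that the sum is exactly $2^{W-1}$; the correction rule and all names are assumptions, since the text does not state how the rounding residue is distributed.

```python
import math

def quantize_taps(sigma, radius, word_length):
    """Scale 1-D Gaussian taps to integers whose sum is 2**(word_length - 1)."""
    one = 1 << (word_length - 1)          # the "normalized one"
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma))
           for d in range(-radius, radius + 1)]
    total = sum(raw)
    taps = [round(one * v / total) for v in raw]
    taps[radius] += one - sum(taps)       # assumption: center tap absorbs the rounding residue
    return taps, one

W = 8
taps, one = quantize_taps(sigma=1.0, radius=2, word_length=W)
column = [100, 100, 100, 100, 100]        # constant test column
acc = sum(t * p for t, p in zip(taps, column))
normalized = acc >> (W - 1)               # the shift replaces the division by the normalized one
print(taps, one, normalized)
```

Because the taps sum to a power of two, a constant input column is reproduced exactly after the shift.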
1) Vertical Component Part: The first stage of the geometric component is the vertical part, which is pictured in Fig. 9. With the aid of Fig. 6, it can be seen that the pixels of the first column, numbered 1, 2, 3, 4, 5, and the first pixel of the middle column, numbered 11, enter the vertical component part simultaneously. The same order of processing applies to the corresponding photometric coefficients.
The groups 0, 1, 2, 3, 4, that is, all columns except the centered column, are processed as shown in the upper part of Fig. 9. The geometrically symmetrical pixels are first accumulated and then multiplied by the geometric weight coefficient. All coefficients of the geometric filter are constant for the chosen filter window size. The scaling of the geometric coefficients ensures that the accumulation does not result in a carry. The registers REGcol 0,1,2 in this part of the design delay the weighted data to maintain synchronicity. After the multiplication, the adder tree sums the weighted values into one value at each internal clock event.

Fig. 10. Horizontal part of the geometric filter component.
The processing of the centered column is detailed in the lower part of Fig. 9. The centered pixel is weighted and delayed by REGcen so that this pixel and the remaining pixels in the centered column can be fed to the input of the adder tree simultaneously. The remaining pixels enter the dedicated processing path one by one. They were multiplexed in the register matrix in such a way that they can be combined pairwise and multiplied by the same coefficient in the geometric component. To weight the pixels properly, every incoming pixel is stored in the register REGcol mid so that the subsequently calculated sum is valid at every second internal clock event. Multiplexing the filter coefficients with zeros ensures that invalid sums are multiplied by zero and therefore do not falsify the result.
As shown in Fig. 8, the vertical part of the geometric filter for the weighting of the photometric coefficients is designed identically.
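The behavior of the centered-column path can be mimicked with a small clock-by-clock model. This is only a behavioral sketch: the arrival order of the pixels (outer pair first, then inner pair), the coefficient values, and the running accumulator are assumptions made for illustration and do not reproduce the register-transfer structure of Fig. 9.

```python
def center_column_sum(center, arrivals, coeff_mux, c_center):
    """Accumulate the centered column from pixels arriving one per internal clock."""
    acc = c_center * center          # centered pixel, weighted (delayed by REGcen in the text)
    reg_mid = 0                      # previously received pixel (REGcol mid in the text)
    for pixel, coeff in zip(arrivals, coeff_mux):
        pair_sum = reg_mid + pixel   # a valid pair only at every second clock event
        acc += coeff * pair_sum      # a zero coefficient discards the invalid sums
        reg_mid = pixel
    return acc

arrivals = [12, 15, 40, 38]          # assumed order: outer pair, then inner pair
coeff_mux = [0, 7, 0, 31]            # coefficients multiplexed with zeros
result = center_column_sum(200, arrivals, coeff_mux, c_center=52)
print(result)
```

With these values, the second and fourth clock events contribute 7 · (12 + 15) and 31 · (40 + 38), while the sums formed at the first and third events are multiplied by zero.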
2) Horizontal Component Part: Fig. 10 displays the horizontal part of the geometric component. After processing in the vertical dimension, the filter window is reduced to one row, and its elements are computed at one internal clock event each. To reuse the symmetrical design, the values of the filtered columns 0, 1, 3, 4 are stored in the shift registers in the order of their reception. The filtered photometric coefficients are stored in the same way. Since the content of the shift register in the left part of Fig. 10 is valid at every fourth internal clock event, the time domain changes here to the domain of the pixel clock. This domain change is indicated by the dashed line in Fig. 10. All operations on the right-hand side of the dashed line are executed according to the pixel clock.

Fig. 11. Final normalization of the filtered data.
At every pixel clock signal, the valid column values are written to the registers, which perform the division of the weighted gray values by the normalized one. The division is implemented through a shift operation. The remaining processing is similar to that of the vertical part. The geometrically symmetrical pixels are first accumulated and afterward multiplied by the geometric weight coefficient. The horizontal geometric filtering uses the same geometric coefficients as the vertical filtering. The final division by the normalized one is performed in the next stage.
D. Normalization
At the final stage, the kernel result has to be normalized by the norm result, as shown in Fig. 11. After the final accumulation of these values, both are divided by the normalized one again. In this manner, the word lengths of the weighted gray values and of the norm are both (W − 1) bits shorter. Finally, after the division, N bits of the final result are forwarded to the output of the bilateral filter.
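The final stage can be sketched in integer arithmetic as follows. The truncating division, the guard against a zero norm, and the saturation to N output bits are assumptions made for this illustration; the text only states that both accumulators are shifted by (W − 1) bits and then divided.

```python
def normalize_output(kernel_acc, norm_acc, word_length, out_bits):
    """Shift both accumulators by W - 1 bits, then divide kernel by norm."""
    shift = word_length - 1
    kernel_result = kernel_acc >> shift    # division by the normalized one
    norm_result = norm_acc >> shift
    if norm_result == 0:                   # assumption: guard not described in the text
        return 0
    value = kernel_result // norm_result   # assumption: truncating integer division
    return min(value, (1 << out_bits) - 1) # assumption: saturate to N output bits

# Hypothetical accumulators for a flat region with gray value 200.
norm_acc = 11520
kernel_acc = 200 * norm_acc
pixel = normalize_output(kernel_acc, norm_acc, word_length=8, out_bits=8)
print(pixel)
```

In a flat image region, every gray value is the same, so the ratio of the two accumulators returns that gray value regardless of the weights.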
E. Design Scalability
In the previous paragraphs, we detailed the filter design for the 5 × 5 kernel. However, depending on the application, another kernel size might be required. For small images, a 3 × 3 window is more suitable to prevent blurring. Some authors choose to work with a larger kernel of 11 × 11 pixels [36]. Our design can be scaled for different kernel sizes. Starting at the register matrix, it has to be dimensioned according to the required kernel size. With the kernel size in one dimension denoted by K,

$$N_{\text{groups}} = K + 1 \tag{5}$$

where $N_{\text{groups}}$ is the number of pixel groups. The number of line storages equals K. The number of required multiplexers equals $N_{\text{groups}}$. The multiplexing pattern of the pixels remains unchanged for every kernel size. According to the symmetry of the kernel, the pixels have to be grouped into $N_{\text{groups}}$ groups containing $n_{\text{group\_member}}$ pixels each

$$n_{\text{group\_member}} = K - 1. \tag{6}$$
The groups are always built such that each row, except for its middle pixel, forms a pixel group. The middle column represents the last pixel group, in which particular attention has to be paid to the arrangement of the pixels in order to keep the weighting in the geometric component valid.
Furthermore, the number of pipelines, including combinatorial blocks and coefficient LUTs in the photometric component, equals $N_{\text{groups}}$. The design of the pipelines remains the same.
The number of pipelines in the vertical part of the geometric component changes according to the kernel size. For the structure in the upper part of Fig. 9, (K + 1)/2 pipelines are required because the geometrical symmetry of the pixels has to be taken into account. The lower part of the vertical geometric component remains unchanged except for the multiplexer, which has $n_{\text{group\_member}}$ inputs according to the required filter window size. The shift register of the horizontal part of the geometric component has to be dimensioned for (K − 1) values. The number of connected pipelines has to be adjusted to the length of the shift register, again taking the geometrical symmetry into account. The processing of the centered column remains unchanged. The same holds for the normalization coefficients.
Finally, if the maximal operating frequency $f_{\text{operating}}$ is known, the internal clock frequency $f_{\text{internal}}$ can be determined as follows:

$$f_{\text{internal}} = f_{\text{operating}} \cdot n_{\text{group\_member}}. \tag{7}$$

According to the internal clock frequency $f_{\text{internal}}$, the counter that generates the select signal for the multiplexers and the enable signal EnREG for the horizontal part of the geometric component has to be adjusted.
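The scaling rules of this subsection can be collected in a small helper. The function and dictionary key names are hypothetical; the formulas follow (5), (6), and (7) and the counts stated in the text.

```python
def scaling_parameters(kernel_size, f_operating_hz):
    """Derive the structural parameters of the filter for a K x K kernel."""
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel size must be odd and at least 3")
    n_groups = kernel_size + 1            # (5)
    n_group_member = kernel_size - 1      # (6)
    return {
        "n_groups": n_groups,
        "n_group_member": n_group_member,
        "line_storages": kernel_size,
        "multiplexers": n_groups,
        "photometric_pipelines": n_groups,
        "upper_vertical_pipelines": (kernel_size + 1) // 2,
        "horizontal_shift_register_length": kernel_size - 1,
        "f_internal_hz": f_operating_hz * n_group_member,   # (7)
    }

params_5x5 = scaling_parameters(5, 40_000_000)
params_3x3 = scaling_parameters(3, 110_000_000)
print(params_5x5["f_internal_hz"], params_3x3["f_internal_hz"])
```

For the 5 × 5 kernel at a 40-MHz pixel clock, this gives six groups of four pixels and an internal clock of 160 MHz; for a 3 × 3 kernel at 110 MHz, the internal clock is 220 MHz.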
V. IMAGE QUALITY ASSESSMENT
To evaluate the performance of the noise reduction and the accuracy of the detail preservation, criteria for image quality assessment are required. The criteria chosen in this work are $\mathrm{PSNR}_{\mathrm{dB}}$ and MSSIM.
1) $\mathrm{PSNR}_{\mathrm{dB}}$: The well-known peak signal-to-noise ratio $\mathrm{PSNR}_{\mathrm{dB}}$ in decibels is defined as follows:

$$\mathrm{PSNR}_{\mathrm{dB}} = 20 \log_{10}\left(\frac{GV_{\max}}{\sqrt{\mathrm{MSE}}}\right) \tag{8}$$

$$\mathrm{MSE} = \frac{1}{MN} \sum_{m=1}^{MN} \left(\Phi_{\mathrm{ref}}(m) - \tilde{\Phi}(m)\right)^{2} \tag{9}$$

where MSE denotes the mean squared error between the image to be compared and the reference image. $GV_{\max}$ represents the maximum gray value, which depends on the word length after the digitalization of the images. The noiseless M × N image with gray values $\Phi_{\mathrm{ref}}(m)$ provides the reference for the measurement of the MSE. The gray values $\tilde{\Phi}(m)$ originate from the image to be compared. Considering the quality of the noise filter, $\mathrm{PSNR}_{\mathrm{dB}}$ describes the capability of the filter to suppress noise regardless of the perceived visual quality of the filtered image.
2) MSSIM: The mean structural similarity index MSSIM is a method for the assessment of image quality that takes advantage of the characteristics of the human visual system [37]. First, the local structural similarity SSIM of the 11 × 11 image blocks $v(\Phi_{\mathrm{ref}})$ and $v(\tilde{\Phi})$ is calculated

$$\mathrm{SSIM}\left(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi})\right) = l\left(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi})\right) \cdot c\left(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi})\right) \cdot s\left(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi})\right) \tag{10}$$

where $l(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi}))$ is the luminance comparison function, $c(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi}))$ compares the contrast of the image blocks after luminance subtraction, and $s(v(\Phi_{\mathrm{ref}}), v(\tilde{\Phi}))$ conducts the structure comparison after contrast normalization. Averaging the SSIM of the J blocks over the whole image yields the mean value MSSIM

$$\mathrm{MSSIM}(\Phi_{\mathrm{ref}}, \tilde{\Phi}) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSIM}\left(v_j(\Phi_{\mathrm{ref}}), v_j(\tilde{\Phi})\right) \tag{11}$$

of the entire image represented by $\tilde{\Phi}$. The value MSSIM = 1 means that the two images are completely identical. The smaller the MSSIM, the less structural similarity the two images show. A detailed description of MSSIM can be found in [37].
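Both criteria can be prototyped in a few lines. The sketch below follows (8), (9), and (11), with two loudly stated simplifications for the SSIM part: it uses non-overlapping 11 × 11 blocks with unweighted statistics instead of the Gaussian-weighted sliding window of [37], and it uses the combined SSIM expression with the constants K1 = 0.01 and K2 = 0.03 from [37]. All function names and the test images are hypothetical.

```python
import math

def psnr_db(reference, test, gv_max=255):
    """PSNR in dB for two images given as lists of rows, following (8) and (9)."""
    ref = [v for row in reference for v in row]
    img = [v for row in test for v in row]
    mse = sum((r - t) ** 2 for r, t in zip(ref, img)) / len(ref)
    if mse == 0:
        return math.inf
    return 20.0 * math.log10(gv_max / math.sqrt(mse))

def ssim_block(x, y, gv_max=255):
    """SSIM of two pixel lists (combined luminance, contrast, structure form)."""
    c1 = (0.01 * gv_max) ** 2
    c2 = (0.03 * gv_max) ** 2
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def mssim(reference, test, block=11, gv_max=255):
    """Mean SSIM over non-overlapping blocks (simplification, see lead-in)."""
    rows, cols = len(reference), len(reference[0])
    if rows < block or cols < block:
        raise ValueError("image smaller than one block")
    scores = []
    for r0 in range(0, rows - block + 1, block):
        for c0 in range(0, cols - block + 1, block):
            idx = [(r, c) for r in range(r0, r0 + block)
                   for c in range(c0, c0 + block)]
            x = [reference[r][c] for r, c in idx]
            y = [test[r][c] for r, c in idx]
            scores.append(ssim_block(x, y, gv_max))
    return sum(scores) / len(scores)

# Hypothetical 22 x 22 test data: a synthetic pattern and a distorted copy.
reference = [[(7 * r + 13 * c) % 256 for c in range(22)] for r in range(22)]
distorted = [[min(255, v + 4 * ((r + c) % 5)) for c, v in enumerate(row)]
             for r, row in enumerate(reference)]
print(psnr_db(reference, distorted), mssim(reference, distorted))
```

Comparing an image with itself gives an infinite PSNR and an MSSIM of 1, which is a quick sanity check for any implementation of the two measures.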
VI. RESULTS
After an implementation in Matlab, the proposed architecture of the bilateral filter was implemented in VHDL and simulated with ModelSim. A test image was filtered by the Matlab implementation as well as by the ModelSim simulation, and the filtered images were compared. The purpose of this comparison is to analyze the drop in image quality due to the quantization of the filter coefficients in our FPGA design.
The test image Lighthouse shown in Fig. 12(a) is an 8-b grayscale image with a size of 512 × 512 pixels. Hence, in the following, $GV_{\max} = 255$ is used.
In order to apply the bilateral filter to a color image, the color data have to be transformed into the CIELab color space [1]. The structure of the filter remains unchanged. However, the processing of color images is beyond our research interest, so no results on this topic are reported.
A. Performance Analysis
For the comparison of the filtering capability between the Matlab implementation and the ModelSim simulation, Gaussian noise with standard deviation $\sigma_{\text{noise}} = [10, 20, 30, 40, 50, 60]$ was added to the test image.
In Fig. 12, the test image is contrasted with its noisy counterpart with $\sigma_{\text{noise}} = 20$ and two filtered images. The filter parameters $\sigma_{\text{ph}} = 3\sigma_{\text{noise}}$ and $\sigma_{c} = 1$ were chosen for the photometric and geometric components, respectively. For filtering in Matlab, no quantization of the filter coefficients was applied. The corresponding filtered image is shown in Fig. 12(c). For the simulation with ModelSim, the coefficient word length W = 8 was used. The simulation result is shown in Fig. 12(d). Between the Matlab implementation and the ModelSim simulation, no visually distinguishable difference can be registered.

Fig. 12. (a) Original image. (b) Noisy image with $\sigma_{\text{noise}} = 20$. (c) Filtering in Matlab. (d) Filtering in ModelSim.

Fig. 13. Performance comparison of the Matlab implementation and the ModelSim simulation.
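An unquantized reference of the kind used for such a comparison can be sketched for a single filter window. The sketch assumes Gaussian photometric and geometric weights in the form introduced in [1]; it is written in Python rather than Matlab, and the function name, the test windows, and the noise level are illustrative assumptions.

```python
import math

def bilateral_pixel(window, sigma_ph, sigma_c):
    """Filter the center pixel of a square window (list of rows) in floating point."""
    k = len(window)
    r = k // 2
    center = window[r][r]
    kernel = 0.0   # weighted gray-value sum (kernel result)
    norm = 0.0     # sum of the weights (norm result)
    for i in range(k):
        for j in range(k):
            d2 = (i - r) ** 2 + (j - r) ** 2
            geometric = math.exp(-d2 / (2.0 * sigma_c ** 2))
            diff = window[i][j] - center
            photometric = math.exp(-(diff * diff) / (2.0 * sigma_ph ** 2))
            kernel += window[i][j] * photometric * geometric
            norm += photometric * geometric
    return kernel / norm

sigma_noise = 5.0                          # hypothetical noise level
sigma_ph = 3.0 * sigma_noise               # photometric parameter as a multiple of the noise
flat = [[100] * 5 for _ in range(5)]
edge = [[50, 50, 200, 200, 200] for _ in range(5)]
flat_out = bilateral_pixel(flat, sigma_ph, sigma_c=1.0)
edge_out = bilateral_pixel(edge, sigma_ph, sigma_c=1.0)
print(flat_out, edge_out)
```

On the step-edge window, the pixels on the far side of the edge receive a photometric weight close to zero, so the center value stays close to 200, which is the detail-preserving behavior of the bilateral filter.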
The results of the quantitative comparison between the Matlab implementation and the ModelSim simulation are contrasted in Fig. 13 and summarized in Table I. As our recent research shows, by adjusting $\sigma_{\text{ph}}$ as a multiple of the measured standard deviation of noise rather than by a single constant, even better $\mathrm{PSNR}_{\mathrm{dB}}$ can be achieved. Thus, an optimal setting for the filter can be chosen which reduces noise and prevents blurring at the same time as far as possible. Exceeding this point causes oversmoothing, and choosing the adjusting parameter below this point leads to insufficient noise suppression. The discussion of this topic is important but beyond the scope of this paper. For more details, refer to [38].
Fig. 13 reveals that, across the increasing noise levels, $\mathrm{PSNR}_{\mathrm{dB}}$ and MSSIM both increase after noise filtering. The gain is higher for a higher standard deviation of noise. Using our setting $\sigma_{\text{ph}} = 3\sigma_{\text{noise}}$, averaging with higher weights is performed for increasing noise levels. Owing to this fact, $\mathrm{PSNR}_{\mathrm{dB}}$ rises by a higher amount. MSSIM also increases because the geometrical component remains narrow, preventing oversmoothing.
TABLE I
FILTERING RESULTS
TABLE II
SYNTHESIS RESULT
The numbers in Table I show that the presented filter architecture delivers results almost as good as those of the Matlab implementation. The slight decrease of the image quality in the ModelSim simulation is explained by coefficient quantization and by rounding of the internal values during the shift operations. No artifacts caused by quantization are introduced into the filtered image. In summary, the simulation results are highly satisfying.
B. Verification
For verification, a Virtex-5 FPGA platform equipped with a Virtex XC5VLX50-1 device was used. The shortened synthesis report of the filter design is shown in Table II. A long-term trial proved that the design is suitable for real-time processing. The FPGA board was connected to a camera with a 12-b resolution depth, generating 30 fps at a full resolution of 1024 × 1024 pixels.
Due to the technical specification of the camera, pauses between the frames are necessary so that 30 fps is the maximally achievable frame rate. Thus, the maximal data flow reaches approximately 31.5 Mpixel/s. Consequently, we restricted the clock frequency of our design to 40 MHz in this application. The internal clock frequency is 160 MHz. With this clock rate, a maximal throughput of 38 fps is possible.
With a different camera, an even higher frame rate is achievable. Using our FPGA platform, the maximal possible internal frequency shown in Table II is 220 MHz. Hence, the maximal operating frequency of our filter design on the Virtex-5 FPGA under consideration equals 55 MHz. Considering the image resolution of 1024 × 1024 pixels, the following frame rate can be computed:

$$\left( (1024 \times 1024)\,\frac{\text{pixels}}{\text{frame}} \cdot 18.18\,\frac{\text{ns}}{\text{pixel}} \right)^{-1} = 52.45\,\frac{\text{frames}}{\text{second}}. \tag{12}$$

This calculation is valid only for a throughput of 1 pixel/cycle, which is given by our design.
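The arithmetic behind (12) and behind the 38-fps figure can be reproduced with a short helper. The function name is hypothetical, and the calculation assumes a throughput of one pixel per pixel clock cycle with no pauses between frames.

```python
def max_frame_rate(f_internal_hz, n_group_member, width, height):
    """Frames per second at one pixel per pixel clock cycle, without frame pauses."""
    f_pixel_hz = f_internal_hz / n_group_member   # operating (pixel) clock, from (7)
    return f_pixel_hz / (width * height)

fps_max = max_frame_rate(220e6, 4, 1024, 1024)    # 55-MHz pixel clock
fps_camera = max_frame_rate(160e6, 4, 1024, 1024) # 40-MHz pixel clock
print(round(fps_max, 2), round(fps_camera, 2))
```

The first value corresponds to the result of (12), and the second one to the 38 fps stated for the 40-MHz setting.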
TABLE III
CITED FPGA IMPLEMENTATIONS OF THE BILATERAL FILTER
The total delay of the output pixels of our architecture with a kernel size of 5 × 5 pixels applied to an image of 512 × 512 pixels is 2560 + 36 cycles. The time required for filling up the register matrix depends on the kernel size and image width and results in a delay of 5 × 512 = 2560 cycles. The processing time from the multiplexers in the register matrix to the output of the normalization stage is constant and does not depend on the kernel size. The critical operations are performed at the internal clock frequency. If the kernel size is changed, the pixel groups have to be reordered, and the internal clock has to be adjusted according to (7). In this case, the processing time still amounts to 36 cycles. The normalization by division costs 24 cycles, which makes up about two-thirds (66%) of the whole processing time.
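The delay budget can be written down as a small calculation. The constant and function names are hypothetical, and the conversion to seconds assumes that the cycles are counted in pixel clock cycles, which the text does not state explicitly.

```python
PROCESSING_CYCLES = 36   # multiplexers to normalization output, constant per the text
DIVISION_CYCLES = 24     # share of the final division

def total_delay_cycles(kernel_size, image_width):
    """Fill-up time of the register matrix plus the constant processing delay."""
    return kernel_size * image_width + PROCESSING_CYCLES

def total_delay_seconds(kernel_size, image_width, f_pixel_hz):
    """Delay in seconds, assuming the cycles refer to the pixel clock."""
    return total_delay_cycles(kernel_size, image_width) / f_pixel_hz

cycles = total_delay_cycles(5, 512)
division_share = DIVISION_CYCLES / PROCESSING_CYCLES
print(cycles, round(division_share, 3), total_delay_seconds(5, 512, 40e6))
```

For the 5 × 5 kernel and a 512-pixel image width, the total is 2596 cycles, of which the fill-up time dominates by far.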
For the evaluation of the performance of the filter design, a comparison with other implementations from the references is given in Table III. Except for the authors of [32], all authors implement the original bilateral filter from [1]. From [32], the fully parallel architecture is used for the comparison in Table III. All filters are implemented on FPGAs of different families and generations, which makes the comparison less significant; still, itemizing features such as the maximum clock frequency of the design or the resource demand might give good insight.
Our design works at the highest clock frequency. However, considering the kernel size of 5 × 5 pixels and the switching of the time domain, our architecture presents only the third highest frame rate. The picture changes if we implement a 3 × 3 filter kernel. In this case, the operating frequency is 110 MHz, and the resulting frame rate doubles, which puts the performance of our design in second place.
Regarding the resource demand, it should be clear that the logic elements of Altera and the logic slices of Xilinx are built differently. The values in Table III give merely a hint at the FPGA area used by each design. On the other hand, the number of required multipliers can be compared directly. In [30], the number of multipliers is not available. According to the authors of [33], an efficient parallel implementation of a bilateral filter for a 5 × 5 mask requires 25 multipliers. We have shown that our design concept is efficient and requires only 23 multipliers. Therefore, considering the implemented window size of 5 × 5 pixels, we use the resources more economically.
VII. CONCLUSION
In this paper, we have given a detailed description of an FPGA design of the bilateral filter for real-time image processing. The advantages of our design can be summarized in the following points.
1) The filter design for a kernel size of 5 × 5 shown here utilizes the FPGA resources economically, which makes it feasible to implement the filter on a common medium-sized FPGA.
2) The register matrix introduced at the first stage of the filter makes external image storage redundant, contributing to the decrease of the resource demand of the filter implementation.
3) The shown architecture is synchronous and capable of real-time processing, supporting high clock frequencies. The maximal operating frequency depends on the chosen FPGA family.
4) In conceiving our filter architecture, we kept the scalability of the design in mind in order to enable the implementation of arbitrary filter window sizes with low effort.
5) The shown filter architecture assures a constant processing delay independent of the filter window size. The total delay is the sum of the processing delay and the fill-up time of the line storages, which depends on the kernel size and image width.
6) Image quality assessment in terms of $\mathrm{PSNR}_{\mathrm{dB}}$ and structural similarity confirmed that the image quality loss due to coefficient quantization and rounding of the internal results is negligible.
REFERENCES
[1] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. IEEE ICCV, 1998, pp. 839–846.
[2] B. Zhang and J. P. Allebach, "Adaptive bilateral filter for sharpness enhancement and noise removal," IEEE Trans. Image Process., vol. 17, no. 5, pp. 664–678, May 2008.
[3] B. Yan and A.-D. Saleh, "Structure enhancing bilateral filtering of images," in Proc. IEEE PCSPA, 2010, pp. 614–617.
[4] M. de-Frutos-López, H. Medina-Chanca, S. Sanz-Rodríguez, C. Peláez-Moreno, and F. Díaz-de-María, "Perceptually-aware bilateral filter for quality improvement in low bit rate video coding," in Proc. IEEE PCS, 2012, pp. 477–480.
[5] J. Won Lee, R.-H. Park, and S. Chang, "Noise reduction and adaptive contrast enhancement for local tone mapping," IEEE Trans. Consum. Electron., vol. 58, no. 2, pp. 578–586, May 2012.
[6] J. Giraldo, Z. Kelm, L. Yu, J. Fletcher, B. Erickson, and C. McCollough, "Comparative study of two image space noise reduction methods for computed tomography: Bilateral filter and nonlocal means," in Proc. Conf. IEEE EMBS, 2009, pp. 3529–3532.
[7] L. Yu, A. Manduca, J. Trzasko, N. Khaylova, J. Kofler, C. McCollough, and J. Fletcher, "Sinogram smoothing with bilateral filtering for low-dose CT," in Proc. SPIE Med. Imag.: Phys. Med. Imag., 2008, vol. 6913, pp. 691329-1–691329-8.
[8] A. Gabiger, R. Weigel, S. Oeckl, and P. Schmitt, "Enhancement of CT image quality via bilateral filtering of projections," in Proc. 1st Int. Conf. Image Formation X-ray Comput. Tomography, 2010, pp. 140–143.
[9] A. Gabiger-Rose, R. Rose, M. Kube, P. Schmitt, and R. Weigel, "Noise adaptive bilateral filtering of projections for computed tomography," in Proc. 11th Int. Meet. Fully Three-Dimens. Image Reconstruction Radiol. Nucl. Med., 2011, pp. 306–309.
[10] A. Gabiger, M. Kube, and R. Weigel, "A synchronous FPGA design of a bilateral filter for image processing," in Proc. IEEE IECON, 2009, pp. 1990–1995.
[11] T. Riesgo, Y. Torroja, and E. de la Torre, "Design methodologies based on hardware description languages," IEEE Trans. Ind. Electron., vol. 46, no. 1, pp. 3–12, Feb. 1999.
[12] T. Q. Pham and L. J. van Vliet, "Separable bilateral filtering for fast video preprocessing," in Proc. IEEE ICME, 2005, pp. 1–4.
[13] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Trans. Graph., vol. 21, no. 3, pp. 257–266, Jul. 2002.
[14] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," in Proc. ECCV, 2006, pp. 568–580.
[15] J. Chen, S. Paris, and F. Durand, "Real-time edge-aware image processing with the bilateral grid," ACM Trans. Graph., vol. 26, no. 3, pp. 1–9, Jul. 2007.
[16] Q. Yang, K.-H. Tan, and N. Ahuja, "Real-time O(1) bilateral filtering," in Proc. IEEE CVPR, 2009, pp. 557–564.
[17] M. M. Bronstein, "Lazy sliding window implementation of the bilateral filter on parallel architectures," IEEE Trans. Image Process., vol. 20, no. 6, pp. 1751–1756, Jun. 2011.
[18] B. Weiss, "Fast median and bilateral filtering," ACM Trans. Graph., vol. 25, no. 3, pp. 519–526, Jul. 2006.
[19] F. Porikli, "Constant time O(1) bilateral filtering," in Proc. IEEE CVPR, 2008, pp. 1–8.
[20] Y.-C. Tseng, P.-H. Hsu, and T.-S. Chang, "A 124 Mpixels/sec VLSI design for histogram-based joint bilateral filtering," IEEE Trans. Image Process., vol. 20, no. 11, pp. 3231–3241, Nov. 2011.
[21] F. Hannig, M. Schmid, J. Teich, and H. Hornegger, "A deeply pipelined and parallel architecture for denoising medical images," in Proc. IEEE FPT, 2010, pp. 485–490.
[22] L. Costas, P. Colodrón, J. J. Rodríguez-Andina, J. Fariña, and M.-Y. Chow, "Analysis of two FPGA design methodologies applied to an image processing system," in Proc. IEEE ISIE, 2010, pp. 3040–3044.
[23] N. Sudha and A. R. Mohan, "Hardware-efficient image-based robotic path planning in a dynamic environment and its FPGA implementation," IEEE Trans. Ind. Electron., vol. 58, no. 5, pp. 1907–1920, May 2011.
[24] R. Marin, G. León, R. Wirz, J. Sales, J. M. Claver, P. J. Sanz, and J. Fernández, "Remote programming of network robots within the UJI industrial robotics telelaboratory: FPGA vision and SNRP network protocol," IEEE Trans. Ind. Electron., vol. 56, no. 12, pp. 4806–4816, Dec. 2009.
[25] E. Monmasson and M. N. Cirstea, "FPGA design methodology for industrial control systems – A review," IEEE Trans. Ind. Electron., vol. 54, no. 4, pp. 1824–1842, Aug. 2007.
[26] J. J. Rodriguez-Andina, M. J. Moure, and M. D. Valdes, "Features, design tools, and application domains of FPGAs," IEEE Trans. Ind. Electron., vol. 54, no. 4, pp. 1810–1823, Aug. 2007.
[27] H. Zhuang, K.-S. Low, and W.-Y. Yau, "Multichannel pulse-coupled neural-network-based color image segmentation for object detection," IEEE Trans. Ind. Electron., vol. 59, no. 8, pp. 3299–3308, Aug. 2012.
[28] S. Jin, D. Kim, T. T. Nguyen, D. Kim, M. Kim, and J. W. Jeon, "Design and implementation of a pipelined datapath for high-speed face detection using FPGA," IEEE Trans. Ind. Informat., vol. 8, no. 1, pp. 158–167, Feb. 2012.
[29] Y. Chen and V. Dinavahi, "Digital hardware emulation of universal machine and universal line models for real-time electromagnetic transient simulation," IEEE Trans. Ind. Electron., vol. 59, no. 2, pp. 1300–1309, Feb. 2012.
[30] C. Charoensak and F. Sattar, "FPGA design of a real-time implementation of dynamic range compression for improving television picture," in Proc. IEEE ICICS, 2007, pp. 1–5.
[31] A. Rosado-Muñoz, M. Bataller-Mompeán, E. Soria-Olivas, C. Scarante, and J. F. Guerrero-Martínez, "FPGA implementation of an adaptive filter robust to impulsive noise: Two approaches," IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 860–870, Mar. 2011.
[32] T. Q. Vinh, J. H. Park, Y.-C. Kim, and S. H. Hong, "FPGA implementation of real-time edge-preserving filter for video noise reduction," in Proc. IEEE ICCEE, 2008, pp. 611–614.
[33] H. Dutta, F. Hannig, J. Teich, B. Heigl, and H. Hornegger, "A design methodology for hardware acceleration of adaptive filter algorithms in image processing," in Proc. IEEE ASAP, 2006, pp. 331–340.
[34] R. Chen, L. Chen, and L. Chen, "System design consideration for digital wheelchair controller," IEEE Trans. Ind. Electron., vol. 47, no. 4, pp. 898–907, Aug. 2000.
[35] R. Turney, "Two-dimensional linear filtering," in Application Note: Xilinx FPGAs, 2007, pp. 1–8.
[36] M. Zhang and B. K. Gunturk, "Multiresolution bilateral filter for image denoising," IEEE Trans. Image Process., vol. 17, no. 12, pp. 2324–2333, Dec. 2008.
[37] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[38] A. Gabiger-Rose, M. Kube, P. Schmitt, R. Weigel, and R. Rose, "Image denoising using bilateral filter with noise-adaptive parameter tuning," in Proc. IEEE IECON, 2011, pp. 4515–4520.
Anna Gabiger-Rose (S'09) was born in Ordshonikidse, Ukraine, in 1978. She received the Dipl.-Ing. degree in electrical engineering, electronics, and information technology from the Friedrich-Alexander University of Erlangen-Nuremberg, Erlangen, Germany, in 2007.
From 2001 to 2007, she was a Student Assistant with the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, Erlangen. She is currently a Research Assistant with the Institute for Electronics Engineering, University of Erlangen-Nuremberg. Her research interests include the design of embedded systems for image processing and the investigation of digital filtering techniques for image quality enhancement.
Mrs. Gabiger-Rose is a member of the IEEE Industrial Electronics Society. She served as a reviewer for the 35th Annual Conference of the IEEE Industrial Electronics Society (IECON'09).
Matthias Kube was born in Mainz, Germany, in 1975. He received the Dipl.-Ing. FH (M.Sc.) degree in electrical engineering and microelectronics from the Georg-Simon-Ohm University of Applied Science of Nuremberg, Nuremberg, Germany, in 2002.
Since 2003, he has been working as a member of the research staff at the Department of Contactless Test and Measuring Systems, Fraunhofer Institute for Integrated Circuits, Erlangen, Germany. He has the technical leadership for the development of an innovative indirect converting X-ray detector with conventional optical sensors for scientific and industrial applications of nondestructive testing (NDT), which is optimized for tasks that require a high dynamic range, a high speed, and a long life cycle. His research interests include optical sensors and cameras, field-programmable gate array design, embedded systems for image processing, and X-ray imaging for NDT.
Robert Weigel (S'88–M'89–SM'95–F'02) was born in Ebermannstadt, Germany, in 1956. He received the Dr.-Ing. and Dr.-Ing. habil. degrees in electrical engineering and computer science from the Munich University of Technology, Munich, Germany, in 1989 and 1992, respectively.
He was a Research Engineer from 1982 to 1988, a Senior Research Engineer from 1988 to 1994, and a Professor for RF Circuits and Systems from 1994 to 1996 with the Munich University of Technology. From 1996 to 2002, he was the Director of the Institute for Communications and Information Engineering, University of Linz, Linz, Austria. Since 2002, he has been the Head of the Institute for Electronics Engineering, University of Erlangen-Nuremberg, Erlangen, Germany.
Dr. Weigel was the recipient of the IEEE Microwave Applications Award in 2007. Within the IEEE Microwave Theory and Techniques Society (MTT-S), he has been the Founder and Chair of the Austrian Communications/Microwave Theory and Techniques Society Joint Chapter and Region 8 Coordinator. He is the Chair of MTT-2 Microwave Acoustics and the MTT-S President-Elect in 2013.
Richard Rose (S'09) was born in Nuremberg, Germany, in 1981. He received the Dipl.-Ing. degree in electrical engineering, electronics, and information technology from the Friedrich-Alexander University of Erlangen-Nuremberg, Erlangen, Germany, in 2007.
In 2008, he joined the Institute for Electronics Engineering, University of Erlangen-Nuremberg, as a Research Assistant, and since 2010, he has been the Team Leader of the System Engineering group. His research interests include digital signal processing, receiver design, antenna design, localization techniques, and wireless communication systems.
Mr. Rose is a member of the IEEE Microwave Theory and Techniques Society, the IEEE Signal Processing Society, the IEEE Antennas and Propagation Society, and the IEEE Communications Society. He served as a reviewer for the journal Mathematical Problems in Engineering and the International Journal of Electronics and Communications.