
Resource-Efficient FPGA Architecture and Implementation of Hough Transform

Zhong-Ho Chen, Alvin W. Y. Su, and Ming-Ting Sun, Fellow, IEEE

Abstract—Hough transform is widely used for detecting straight lines in an image, but it involves huge computations. For embedded applications, field-programmable gate arrays (FPGAs) are among the most used hardware accelerators for achieving a real-time implementation of the Hough transform. In this paper, we present a resource-efficient architecture and implementation of the Hough transform on an FPGA. The incrementing property of the Hough transform is described and used to reduce the resource requirement. In order to facilitate parallelism, we divide the image into blocks and apply the incrementing property to pixels within a block and between blocks. Moreover, the locality of the Hough transform is analyzed to reduce the memory access. The proposed architecture is implemented on an Altera EP2S180F1508C3 device and can operate at a maximum frequency of 200 MHz. It can compute the Hough transform of 512 × 512 test images with 180 orientations in 2.07–3.16 ms without using many FPGA resources (i.e., one could achieve the performance by adopting a low-cost, low-end FPGA).

Index Terms—FPGA, Hough transform, real-time.

I. INTRODUCTION

HOUGH transform [1] is a popular technique for detecting straight lines in images. In actual applications, through preprocessing and thresholding, the images are converted into binary feature images. The pixels with the pixel value 1 are called feature points. A line is one that passes through many feature points. Imagine we draw lines with various angles passing through a feature point; each line can be represented as a point in the (ρ, θ) space, where ρ is the perpendicular distance of the line to the origin and θ is the angle between a normal to the line and the positive x-axis. Each time a line is drawn for a feature point, it produces a (ρ, θ) value, which can be considered as a vote for that specific (ρ, θ). After processing all of the feature points, the (ρ, θ) value that has the largest accumulated votes corresponds to the line that passes through the largest number of feature points. In the implementation, the votes for a specific (ρ, θ) value can be stored in a memory addressed by that (ρ, θ) value. The Hough transform is robust and performs well even in the presence of noise or missing data, but it also involves huge computations and excessive memory requirements.

Through the Hough transform, the ρ for a line with an angle θ passing through a feature point at the image coordinate (x, y) can be calculated by

    ρ = x cos θ + y sin θ.    (1)

Practical implementations of the Hough transform generally involve a voting procedure over the discrete parameter space. Algorithm 1 below shows the voting process of the Hough transform. N_θ is the number of angles, and the rounding operation is applied to the results of (1) to get integer values of ρ.

Algorithm 1. Voting Process of Hough Transform
Initialize Votes[ρ][θ] as zeros
for all feature points (x, y)
    for θ from 0° to 180°, using 180°/N_θ as the step-size
        ρ = round(x cos θ + y sin θ)
        Votes[ρ][θ] = Votes[ρ][θ] + 1
    end for
end for.
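For reference, the voting loop of Algorithm 1 maps directly to a few lines of software. The following Python sketch mirrors the algorithm; the array layout, the 1° angle step, and the ρ-offset used to index negative ρ values are illustrative assumptions, not part of the hardware design described later.

    import numpy as np

    def hough_voting(feature_img, n_theta=180):
        # Software sketch of the voting process of Algorithm 1.
        h, w = feature_img.shape
        rho_max = int(np.ceil(np.hypot(w, h)))              # largest possible |rho|
        votes = np.zeros((2 * rho_max + 1, n_theta), dtype=np.int32)
        thetas = np.deg2rad(np.arange(n_theta) * (180.0 / n_theta))
        ys, xs = np.nonzero(feature_img)                    # only feature points vote
        for x, y in zip(xs, ys):
            for t, theta in enumerate(thetas):
                rho = int(round(x * np.cos(theta) + y * np.sin(theta)))   # (1), rounded
                votes[rho + rho_max, t] += 1                # offset so negative rho fits
        return votes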
After the voting process, the (ρ, θ) values with local-maximum numbers of votes are considered as candidate lines. In this paper, we focus only on the voting process of the Hough transform. Given a CIF (352 × 288) video with 30 frames/second (fps) and 10% feature points, it needs 109 M multiplications per second to compute the ρ values of 180 angles. For embedded applications, hardware accelerators are required to achieve a real-time Hough transform. Compared with application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) usually target smaller markets and require much less development time. In an FPGA, high throughput is often achieved by exploiting the parallelism of the design rather than by operating the chip at a very high clock frequency. In addition, a better architecture should make more efficient use of the function blocks in the FPGA. In this paper, we propose an architecture and an implementation of the Hough transform on an FPGA that exploit both angle-level and pixel-level parallelism. The goal is to achieve the highest throughput with the minimum hardware resources.

Manuscript received December 18, 2010; revised April 07, 2011; accepted June 01, 2011. Date of publication July 25, 2011; date of current version June 14, 2012. This work was supported in part by the National Science Council, Taiwan, under Grant NSC 98-2221-E-006-158-MY3.
Z.-H. Chen and A. W. Y. Su are with the SCREAM Lab, Department of Computer Science and Information Engineering, National Cheng-Kung University, 701 Tainan, Taiwan (e-mail: zhonghochen@gmail.com; alvinsu@mail.ncku.edu.tw).
M.-T. Sun is with the Electrical Engineering Department, University of Washington, Seattle, WA 91895 USA (e-mail: sun@ee.washington.edu).
Digital Object Identifier 10.1109/TVLSI.2011.2160002


The remainder of this paper is organized as follows. Section II provides a brief review of related works for the implementation of the Hough transform. Section III describes our observations, which lead to an efficient architecture for implementing the Hough transform. In Section IV, we describe the proposed Hough transform architecture and FPGA implementation. Section V evaluates the proposed architecture. Finally, a conclusion is given in Section VI.
II. RELATED WORKS
Because the general Hough transform is very computationally intensive, other line-detection schemes have been proposed,
such as the gradient-based Hough transform [2] and the kernel-based Hough transform [3]. These schemes require fewer computations than the Hough transform, but they still require a high-end CPU, which is often unavailable in practical applications, or special hardware devices to achieve real-time performance. The gradient-based Hough transform has been implemented in special hardware [4], but the kernel-based Hough transform, which adopts a linked-list data structure, is difficult to implement efficiently in hardware. There has been some research on implementing the Hough transform on special hardware, such as graphics processors [5], scan line array processors [6], and pyramid multiprocessors [7]. However, these devices are unsuitable for low-cost embedded systems. This paper focuses on the implementation of the general Hough transform using an FPGA.
One straightforward method to implement Hough transform
is using multipliers [8]. However, multipliers are less available on low-end FPGAs. Hence, some researchers implement
Hough transform using a coordinate rotation digital computer
(CORDIC) [9]–[13] or a simplified CORDIC algorithm such as the multisector algorithm [14]. CORDIC is an arithmetic technique developed by Volder [15] to solve trigonometric problems by rotating a vector in small angles until the desired angle is achieved. The CORDIC algorithms for FPGAs are surveyed in [16]. CORDIC can implement the Hough transform using only shifters and adders rather than multipliers. The major disadvantage is that it requires multiple iterations to obtain one ρ in the parameter space. Hence, pipelined CORDIC implementations have been proposed to improve the throughput; however, the required resources are also increased. Another drawback of CORDIC is that the result produced is not the correct ρ value but the correct value multiplied by a constant gain. The gain can be eliminated by applying an inverse gain to the initialization vector. However, this also requires additional resources. In [17], an FPGA platform is proposed to implement the Hough transform by using hybrid-log arithmetic. In [18], a distributed arithmetic (DA) architecture is proposed to implement the Hough transform with shift-add operations. Unfortunately, it also requires multiple iterations to obtain one ρ in the parameter space.
An accumulator-based architecture is proposed in [19]. An
angle-level parallelism is applied in this architecture. It could
obtain a point in the parameter space by a single accumulation. However, all of the pixels in the binary feature image need
to go through the computation pixel by pixel, which limits its
throughput. Additive Hough transform [20] utilizes the properties of Hough transform. Although the pixel-level parallelism is

facilitated in this architecture, the memory requirement is also


increased in proportion to the parallelism. Another drawback
is that it requires additional cycles to get the entire parameter
space.
Incremental Hough transforms [21]–[24] are modified Hough transforms for hardware implementation. They reuse previously computed values to derive another point in the parameter space.
Hough transform is not only computation-demanding but
also memory-demanding. In [25], a line-based implementation
is proposed to reduce the bandwidth requirement on an SIMD architecture. In [26], the authors propose a memory-efficient implementation of the Hough transform by using the circular buffers of DSP processors. In [27], the memory requirement is reduced by storing the coordinates of a binary feature image rather than the entire binary feature image. In [28] and [29], a modified
cache-friendly Hough transform is proposed to reduce the
memory requirement and the parallelism overhead on multiprocessors.
In this paper, we utilize the incrementing property of Hough
transform to achieve an efficient architecture for implementing
Hough transform. The proposed architecture facilitates both
pixel-level and angle-level parallelism. Unlike [20], the
memory requirement is not increased in proportion to the
parallelism. We propose using run-length encoding to skip
unnecessary computations and memory accesses. Run-length
encoding is widely used for data compression [30] but could
also be used for reducing the computing complexity of image
processing [31]. Another application of run-length encoding is
the skew detection of documents [32], [33]. In this paper, we
use run-length encoding to reduce the computing complexity
of Hough transform. Moreover, we utilize the locality of the
Hough transform and Vote Consolidation to reduce the memory
requirement.
To help later discussions, we summarize some terms used in this paper as follows.

(ρ, θ)        represents a line with an angle θ and distance ρ from the origin.
ρ_θ(x, y)     the ρ value for a line with angle θ which passes through the point at the coordinate (x, y).
ρ_θ(p)        the ρ value for a line with angle θ which passes through the point p.
Vote          a line with (ρ, θ) which passes through a feature point will generate a vote to the specific (ρ, θ) value.
Vote-offset   the difference of the integer parts between ρ_θ(p) and ρ_θ(p1), where p is any point in a block and p1 has the minimum integer part among all points in a block.
III. OBSERVATIONS

In the Hough transform, only feature pixels produce votes. The number of feature pixels in an image is usually much less than that of nonfeature pixels. In order to exploit the pixel-level parallelism of the Hough transform, we divide a W × H image into blocks with a block-size of M by N, so that a relatively small processing element (PE) can process all of the pixels inside a block simultaneously. We call the blocks that do not contain feature pixels nonfeature blocks (i.e., all-zero blocks). The performance of the computation can be significantly improved if the nonfeature blocks are skipped.

Fig. 1. Example image and its zero-run-length encoded symbols.

Fig. 2. Number of symbols versus the block-sizes for several 512 × 512 images.

Fig. 3. Illustration of (2).
A. Run-Length Encoding
A run-length encoding can encode an input binary feature image into a zero-run-length symbol stream in order to skip the nonfeature blocks. We encode the binary image as a list of symbols before the calculation of the ρ values. A symbol is represented as an (rb, code, zl) triplet, where rb is a bit to indicate the beginning of a block-row, code represents the pixel values in a block, and zl is the number of successive zero-blocks after the current block. Fig. 1 gives an example of a binary image and the encoded run-length symbols, where pixels with a value 1 represent feature pixels. In this example, the image size is 16 × 8 and the block-size is 2 × 4. Note that the first block of each block-row is always encoded whether it is a nonfeature block or not. These 16 blocks are encoded into six symbols.
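To make the symbol format concrete, the following Python sketch encodes a binary image into (rb, code, zl) triplets; the bit-packing order inside code and the handling of a saturated zl are our own assumptions, only the triplet structure comes from the description above.

    def encode_blocks(img, bw=2, bh=4, max_zl=15):
        # img: list of rows of 0/1 values; width and height are assumed
        # to be multiples of the block-size (a simplification of this sketch).
        h, w = len(img), len(img[0])
        symbols = []
        for by in range(0, h, bh):
            row_syms = []
            for bx in range(0, w, bw):
                code = 0
                for dy in range(bh):                  # pack the block's pixels into an integer
                    for dx in range(bw):
                        code = (code << 1) | (img[by + dy][bx + dx] & 1)
                rb = 1 if bx == 0 else 0              # first block of a block-row is always kept
                if rb or code != 0:
                    row_syms.append([rb, code, 0])
                elif row_syms[-1][2] < max_zl:
                    row_syms[-1][2] += 1              # one more skipped zero-block
                else:
                    row_syms.append([0, 0, 0])        # zl saturated: start a new symbol
            symbols.extend(row_syms)
        return [tuple(s) for s in symbols]

Counting len(encode_blocks(img, bw, bh)) for different block-sizes reproduces the kind of comparison shown in Fig. 2.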
The maximum number of successive zero-blocks needs to be limited in order to limit the size of a lookup table (LUT) in our proposed architecture, as will be explained later. The coding efficiency depends on both the block-size and the maximum number of successive zero-blocks. Fig. 2 shows the total number of symbols for various gray-level 512 × 512 images with the maximum number of successive zero-blocks set to 15. The original images are preprocessed by the edge function of the MATLAB image processing toolbox: 1) a Sobel operator extracts the horizontal and vertical gradients; 2) the magnitudes of the gradients are calculated; 3) a cutoff value is set as four times the mean of all magnitudes; and 4) the magnitudes are thresholded by the threshold value, which is set as the square root of the cutoff value.
Typically, the number of symbols is roughly inversely proportional to the block-size. However, a larger block is also very likely to contain a few nonzero feature pixels and thus cannot be skipped. In our proposed architecture, we use a PE to compute the ρ values for a specific θ for all of the pixels inside a block simultaneously. In order to limit the complexity of the PE, the block-size cannot be too large. Based on the simulation results in Fig. 2, we choose the block-size of 2 × 4, which gives a good tradeoff between the resulting number of symbols and the PE complexity. The performance is similar to that of 4 × 2, but the block-size of 2 × 4 results in a smaller number of block-rows, which is preferred with our architecture, as will be clear later. It should be noted that in practical applications, preprocessing different from the one we use in this example will be used depending on the application. So, this example only serves to illustrate the parameter selection process and considerations for using our proposed architecture. In the following discussions, without loss of generality, we will use the image shown in Fig. 1 as an example to illustrate the operation of our proposed architecture.
B. Incrementing Property of Hough Transform
We can use parallel PEs to compute different θ in parallel. To further improve the computation speed, we also perform pixel-level parallelism for calculating the Hough transform. For a specific θ, given two points p1 and p2 in the image with coordinates (x1, y1) and (x2, y2) = (x1 + Δx, y1 + Δy), one can directly calculate ρ_θ(p2) by (1) or derive it from ρ_θ(p1) as

    ρ_θ(p2) = ρ_θ(p1) + Δx cos θ + Δy sin θ.    (2)

The equation is illustrated in Fig. 3, where L1 and L2 are lines with angle θ passing through point A (x1, y1) and point B (x2, y2), respectively. The ρ of L1 and L2 is the distance ρ1 and ρ2, respectively. One can directly calculate ρ2 by (1) or compute it by adding Δx cos θ + Δy sin θ to ρ1.

From (2), it is easy to see that the pixel which gives the smallest ρ value is the first (upper leftmost) pixel in the block for 0° ≤ θ < 90° and the upper rightmost pixel in the block for 90° ≤ θ < 180°, respectively. In the proposed architecture, the pixel p1 has the smallest ρ value. If we label the pixels inside a block as shown in Fig. 4, all of the pixels in the block can be calculated from ρ_θ(p1), which is the smallest ρ value in the block. Furthermore, since cos(180° − θ) = −cos θ, and with the labeling in Fig. 4(b) Δx will be negative, the ρ values for 90° ≤ θ < 180° can be calculated using exactly the same circuits as those used for 0° ≤ θ < 90°. So, we will only use 0° ≤ θ < 90° in the following discussion. Note that, from (1), ρ_θ(p1) of the first block in the image is 0 for 0° ≤ θ < 90°, and is the ρ of the block's upper rightmost pixel for 90° ≤ θ < 180°.

Fig. 4. Pixel labels in a block: (a) 0° ≤ θ < 90°. (b) 90° ≤ θ < 180°.

TABLE I. MAXIMUM VOTE-OFFSET FOR DIFFERENT BLOCK-SIZES

Fig. 5. Proposed architecture for Hough transform using FPGA.
C. Locality of Hough Transform


We observe that the differences among the ρ values produced by pixels in the same block are small. Also, pixels in the same block may contribute votes to the same ρ in the parameter space. We call this property the locality of the Hough transform. The votes from the pixels of a block will fall in a small range. Instead of individually accumulating the vote from each feature pixel in the parameter space, pixels giving the same ρ value can be jointly accumulated. We can utilize (2) to determine whether points in the same block give the same ρ value.

Let p1, which has the minimum ρ value among all pixels of a block, be at the coordinate (x1, y1). Let ρ_θ(p1) = i1 + f1, where i1 is the integer part and f1 is the residual fractional part. Let p2 be another pixel in the same block at the coordinate (x1 + Δx, y1 + Δy). The integer part i2 of ρ_θ(p2) can be derived by

    i2 = i1 + ⌊f1 + Δx cos θ + Δy sin θ⌋    (3)

where ⌊f1 + Δx cos θ + Δy sin θ⌋ is called the vote-offset from p1; it represents the contribution from accumulating the fractional part f1 with Δx cos θ + Δy sin θ, and it adds an integer vote-offset to i1. For a given M × N block and a given θ, the maximum value of the vote-offset is

    V(θ) = ⌈(M − 1) cos θ + (N − 1) sin θ⌉.    (4)

Since the Hough transform may be used in different applications, we should consider all possible angles. For any angle, the maximum of (4) can be determined by noting that Δx cos θ + Δy sin θ can be represented as R sin(θ + φ), where R = √(Δx² + Δy²) and φ = arctan(Δx/Δy). Hence, we can compute the maximum vote-offset among all θ as

    V = ⌈√((M − 1)² + (N − 1)²)⌉.    (5)

Table I shows the maximum vote-offset values for different block-sizes. For the block-size of 2 × 4, the largest vote-offset is 4, which means there can only be five different integer values of ρ in the whole block. We will use this locality property in our proposed implementation of the Hough transform to reduce the memory bandwidth requirement.
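To make (2)-(5) concrete, the following Python sketch derives the ρ of a neighboring pixel incrementally, computes the vote-offsets of a 2 × 4 block, and checks the maximum vote-offset bound. The sample coordinates and fractional part are arbitrary, and the bound expression follows the reconstruction above.

    import numpy as np

    def rho_direct(x, y, theta):                 # theta in radians
        return x * np.cos(theta) + y * np.sin(theta)          # (1)

    def rho_incremental(rho_p1, dx, dy, theta):
        return rho_p1 + dx * np.cos(theta) + dy * np.sin(theta)   # (2)

    def vote_offsets(f1, theta, bw=2, bh=4):
        # Vote-offset of every pixel relative to p1, per (3).
        return [int(np.floor(f1 + dx * np.cos(theta) + dy * np.sin(theta)))
                for dy in range(bh) for dx in range(bw)]

    theta = np.deg2rad(30.0)
    r1 = rho_direct(10, 20, theta)
    assert abs(rho_incremental(r1, 1, 3, theta) - rho_direct(11, 23, theta)) < 1e-9

    # Maximum vote-offset over all 0 <= theta < 90 degrees, per (5): 4 for a 2 x 4 block.
    v_max = int(np.ceil(np.hypot(2 - 1, 4 - 1)))
    assert all(max(vote_offsets(0.99, np.deg2rad(a))) <= v_max for a in range(90))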
IV. PROPOSED HOUGH TRANSFORM ARCHITECTURE AND
FPGA IMPLEMENTATION
Based on the above observations, we propose an efficient architecture for implementing the Hough transform. A block diagram of the proposed architecture is shown in Fig. 5, and its components are discussed in detail as follows.

Run-length encoding is a simple process which reads the binary pixel values from the feature image and outputs the (rb, code, zl) triplets. Efficient implementations of the run-length encoder can be found in implementations of standard codecs, such as JPEG [34]. The run-length encoding also reduces the memory bandwidth requirement for the FPGA by a factor determined by the compression ratio it achieves. Since it has little effect on the complexity of the overall circuits and can reduce the data bandwidth, in the system we implemented for our specific application it was implemented by the preprocessor off-chip. The PE is run-time configurable for computing the Hough transform of any angle. Each PE calculates the consolidated votes for all the pixels in a block for a given θ. The number of PEs is adjustable and depends on the performance requirement. The Vote Memory stores all the votes. These functional blocks are described in detail in the following. A block diagram of the PE is shown in Fig. 6.

Fig. 6. Functional blocks in the proposed PE. The inter-block incrementing takes rb and zl as input and computes ρ_θ(p1) of a nonzero block. ρ_θ(p1) is divided into an integer part (i1) and a fractional part (f1). The integer part is used to access the Vote Memory, and the fractional part is used for the intra-block incrementing. Vote Consolidation takes code as input and consolidates the votes. Finally, Vote Alignment aligns the votes for accessing the Vote Memory.
The incrementing property in (2) is utilized for both inter-block and intra-block incrementing. The inter-block incrementing calculates the ρ_θ(p1) of the first blocks in block-rows and of the nonzero blocks in the run-length encoded symbols. The intra-block incrementing calculates the ρ values of the other pixels after the inter-block incrementing. As shown in the above section, the eight ρ values from all the pixels in a block can only have five different integer values, i1 to i1 + 4, where i1 is the integer part of ρ_θ(p1). So, some of the votes from different pixels will have the same values. If the individual votes are directly saved into the memory, it will incur multiple memory accesses. To save the memory bandwidth, Vote Consolidation consolidates the eight single votes from the eight pixels into five consolidated votes. Since the addresses for storing these five different votes are continuous, the Vote Alignment aligns the initial address so that the votes can be stored into the correct memory locations in one clock cycle. The functional blocks of the PE are described in detail as follows.
A. Inter-Block Incrementing
Because we divide an image into blocks with a fixed block-size, Δx and Δy are constants between the corresponding pixels of two blocks. Two accumulators can be used to implement the inter-block incrementing as shown in Fig. 7(a), where the horizontal and vertical increments can be precomputed. In order to skip zero-blocks, a step-table is introduced in the proposed architecture. The step-table stores all possible horizontal increments, one for each possible number of skipped zero-blocks. Two PEs for calculating the votes for θ and 180° − θ can share the same step-table. In our implementation, the maximum number of successive zero-blocks is set to 15 to limit the size of the step-table. Hence, there are only 16 entries in each step-table. The output of the step-table is called step, which is the Δx cos θ component of (2) for computing ρ_θ(p1) based on the ρ_θ(p1) of the previous nonskipped block in the same block-row. Col-reg calculates the ρ_θ(p1) values for the nonzero blocks in a block-row in the x-direction every clock cycle, and row-reg calculates the ρ_θ(p1) values for the first blocks of block-rows in the y-direction every time a block-row processing is completed. At the beginning of the frame, row-reg is initialized to 0 for 0° ≤ θ < 90°, or to the ρ of the upper rightmost pixel of the first block for 90° ≤ θ < 180°. The result is represented in a fixed-point format (i, f), where i is the integer part and f is the fractional part. In Fig. 7(a), one control signal is asserted before processing the first ρ_θ(p1) of a frame, and another control signal is asserted before processing a block-row. The col-reg is only responsible for calculating the ρ value of the first pixel p1 in a block. The ρ values of the other pixels are calculated by the intra-block incrementing to be discussed in Section IV-B.

Fig. 7. (a) Proposed accumulator architecture for the inter-block incrementing. (b) Basic dataflow of the circuit.
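A behavioral Python model of the inter-block incrementing is sketched below; it is not the RTL of Fig. 7, and the step-table contents and the handling of rb are assumptions consistent with the description above (0° ≤ θ < 90°, first block of each block-row always encoded).

    import numpy as np

    def interblock_rhos(symbols, theta, bw=2, bh=4, max_zl=15):
        # Behavioral sketch: rho_theta(p1) of every encoded block, using only
        # additions after initialization.  theta is in radians.
        step_table = [(zl + 1) * bw * np.cos(theta) for zl in range(max_zl + 1)]
        row_step = bh * np.sin(theta)          # vertical increment between block-rows
        row_reg = None                         # rho of the first block of the current block-row
        col_reg = 0.0
        rhos = []
        for rb, code, zl in symbols:
            if rb:                             # first block of a block-row (always encoded)
                row_reg = 0.0 if row_reg is None else row_reg + row_step
                col_reg = row_reg
            rhos.append(col_reg)               # rho_theta(p1) of this block
            col_reg += step_table[zl]          # skip zl zero-blocks to reach the next symbol
        return rhos

Each returned value then splits into the integer part i1 (the Vote Memory base address) and the fractional part f1 used by the intra-block incrementing.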
B. Intra-Block Incrementing and Vote Consolidation
The computed ρ_θ(p1) can be used to calculate all the other ρ values of the pixels in the block simultaneously by using the corresponding Δx, Δy, cos θ, and sin θ values. This will result in seven more values. For the whole block, the eight votes in the memory addressed by the eight ρ values will need to be accumulated. We observe that, based on the locality of the Hough transform discussed in Section III-C, the maximum vote-offset is 4 for the block-size of 2 × 4. So, there are at most five different integer values (i1 to i1 + 4) in the whole block.

The computed ρ_θ(p1) is divided into the integer part i1 and the fractional part f1. To result in an efficient circuit, only the fractional part f1 is used for calculating the vote-offsets relative to i1 for the pixels in the block, as shown in Fig. 8. The first stage of Fig. 8 calculates the vote-offset of each pixel in the block. The vote-offsets range from 0 to 4 and are represented by 3-b numbers. These numbers are decoded by 3:8 decoders as shown in the second stage of Fig. 8. Each decoder output contains eight lines indicating the vote-offset value for each pixel. These lines will be used with combination logic to produce the consolidated votes. In Fig. 8, we eliminate those signals at the output of the decoders which are always zero, and only keep those lines which may be nonzero. The constants in Fig. 8 are precomputed and stored in the registers before activating the PE. They are also shared between the two PEs calculating the angles θ and 180° − θ.

In Fig. 9, the outputs of the decoders are combined with the values of the corresponding pixels (1 for feature pixels and 0 for nonfeature pixels) using a combination logic circuit to determine the consolidated number of votes for each different vote-offset. For example, if the count for vote-offset 2 is three, it represents three votes with ρ = i1 + 2.

In summary, using the locality of the Hough transform and Vote Consolidation, the circuits in Figs. 8 and 9 take the fractional part of ρ_θ(p1) and produce five outputs which represent the numbers of votes with ρ equal to i1 to i1 + 4, respectively. Thus, instead of accessing the memory multiple times to accumulate the votes, we only need to access the memory once (with addresses i1 to i1 + 4), and each time we accumulate the five memory contents in parallel. This is not possible without the Vote Consolidation, since the generated votes from the pixels may refer to the same memory location.
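A behavioral sketch of Figs. 8 and 9 is given below (Python); the pixel ordering inside code and the restriction to 0° ≤ θ < 90° are assumptions of this sketch.

    import numpy as np

    def consolidated_votes(f1, pixel_bits, theta, bw=2, bh=4, v_max=4):
        # pixel_bits: 1 for a feature pixel, 0 otherwise, in row-major block order.
        # theta in radians; returns one counter per vote-offset 0..v_max.
        counts = [0] * (v_max + 1)
        k = 0
        for dy in range(bh):
            for dx in range(bw):
                if pixel_bits[k]:
                    off = int(np.floor(f1 + dx * np.cos(theta) + dy * np.sin(theta)))  # (3)
                    counts[off] += 1
                k += 1
        return counts                          # counts[v] votes belong to rho = i1 + v

Accumulating counts[0]..counts[4] into the words at addresses i1 to i1 + 4 then takes a single wide access instead of up to eight separate ones.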
Fig. 8. Intra-block incrementing.

Fig. 9. Vote consolidation.

C. Vote Alignment
To update the contents of the memory locations i1 to i1 + 4, instead of accessing the memory five times sequentially, we update the five memory contents in one clock cycle. This is possible since, with our vote consolidation scheme, the five memory contents which need updating are stored in continuous memory locations with addresses from i1 to i1 + 4. The i1 is considered as the base address of these votes. We use two 4K RAM blocks in the FPGA to implement the Vote Memory. Because the accumulation requires one read and one write memory operation, the 4K RAM blocks are configured as dual-port. In the FPGA, each 4K RAM can be configured as 128 × 36, 256 × 18, 512 × 9, 1024 × 4, 2048 × 2, or 4096 × 1. The maximum bit-width of a 4K RAM is 36 b; thus, two 4K RAMs can store eight votes (each vote with 9 b). Since we only have five vote-offsets, three 0-value votes are added in Fig. 10(a). In the case that the block-size is larger than 2 × 4 or 4 × 2, the circuit can be modified accordingly, or it can use multiple clock cycles to handle a symbol.

The consolidated votes must be aligned based on the base address before they are accumulated into the Vote Memory. Since the i1 value may not be a multiple of 8, we need to align the five vote-offsets with the correct memory locations before the accumulation. This is achieved with a barrel rotator controlled by the i1 value (which is the base address), as shown in Fig. 10(a), to align the votes with the corresponding Vote Memory contents. The votes are grouped in two groups, each group containing four votes. The lower four votes are stored in one 4K RAM and the upper four votes are stored in the other 4K RAM. Fig. 10(b) shows the two 4K RAM configuration. In the circuit implementation, the upper bits of i1 are used to address the memory and the lower bits of i1 are used to align the votes. If the base address is located in the upper 4K RAM, the address of the lower 4K RAM should be increased by 1 to give the correct addresses.

In our implementation, the image size is 512 × 512. Although the maximum number of votes for a (ρ, θ) can be 724, which is a 10-b number, in practice we limit each vote to a 9-b number, since, if a (ρ, θ) receives more than 511 votes, it is certain that there is a line with that (ρ, θ) value in the input binary feature image.
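The base-address split and barrel rotation of Fig. 10 can be modeled as follows (Python; the lane numbering and the return format are our own choices for illustration and are not taken from the paper):

    def aligned_words(counts, i1, lanes=8):
        # Behavioral sketch: the vote for rho = r lives at word address r // lanes,
        # lane r % lanes; lanes 0..3 sit in the lower 4K RAM, lanes 4..7 in the upper one.
        lane = i1 % lanes                       # lower 3 bits of i1: rotation amount
        word = i1 // lanes                      # upper bits of i1: word address
        rotated = [0] * lanes
        padded = counts + [0] * (lanes - len(counts))        # pad with 0-value votes
        for k, v in enumerate(padded):
            rotated[(lane + k) % lanes] = v                  # barrel rotation
        lower, upper = rotated[:4], rotated[4:]
        lower_addr = word + 1 if lane >= 4 else word         # wrapped lanes go to the next word
        return (lower_addr, lower), (word, upper)

In the hardware, the two (address, four-vote) groups are read, accumulated, and written back through the dual ports of the two 4K RAMs in the same cycle.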


TABLE II
ACCURACY, FPGA RESOURCES, AND MAXIMUM FREQUENCY
UNDER DIFFERENT ACCURACY FOR THE PE

Fig. 10. (a) Vote alignment. (b) Vote memory and accumulators.

D. Initialization
Before a PE starts to process an angle, a host needs to initialize the PE by initializing three components: 1) the row-register; 2) the step-table; and 3) the registers in the intra-block incrementing block that store the precomputed constants. If a processor is used to control the PE, these values can be computed by the processor. Otherwise, a lookup table and an accumulator are required to compute these values. The lookup table stores the cos θ and sin θ values of all angles, and the constants of an angle are computed by accumulating these values.
V. EVALUATION
Because the output of the proposed architecture is identical
to the ideal Hough transform, we do not show the results for the test
images. Here, we evaluate the resource requirement, memory
bandwidth, and computation time of the proposed architecture.
A. Resource Requirement
In the hardware implementation, the cos θ and sin θ values are represented in a fixed-point format. Let the fractional part be represented in F bits. The maximum error introduced by each increment is 2^−F. There are (W/M − 1) steps in the x-direction and H/N steps, including the initialization, in the y-direction. Moreover, the intra-block incrementing involves another increment. Therefore, the maximum error of ρ is ((W/M − 1) + H/N + 1) × 2^−F.
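As a worked example of this bound (using the expression as reconstructed above): for a 512 × 512 image with 2 × 4 blocks and F = 12 fractional bits, the worst-case accumulated error is (512/2 − 1 + 512/4 + 1) × 2^−12 = 384/4096 ≈ 0.094, i.e., less than one tenth of a ρ unit.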
Table II shows the accuracy, FPGA resources, and the maximum achievable frequency for PEs under different accuracy.
ALUT stands for adaptive LUT, and it is the basic cell in Altera
Stratix II FPGAs. The result is reported by the FPGA vendor's
synthesis tool, Quartus II Version 9.1 with SP2, on the device
EP2S180F1508C3. By increasing the number of bits in the
fixed-point implementation, the maximum error is reduced.
However, it also increases the required resources and reduces
the maximum achievable frequency.
In order to evaluate the proposed architecture, we also
implement the architecture of previous works [9], [18], [19]
on the same device. In the implementation, we use 12 b for
the fractional part. Table III compares the throughput of these approaches, where the Throughput per Cycle is measured as the average number of computed ρ values per cycle, and the Throughput (M/s) is measured in millions of ρ values per second. In the comparison of the throughput, we do not consider the effect of the Vote Memory and Accumulators, since

the main purpose of this work is to compare the throughput


of the PE for the different architectures. Vote Memory is a
common part for all architectures to store all of the votes.
Also, the memory access can be speeded up by using multiple
reconfigurable memories on the FPGA. If the PE can run at a
much higher speed than the memory access, it is possible to use
one PE to process multiple angles and use multiple memories
to match the throughput.
The accuracy of the CORDIC algorithm [9] depends on both
the number of fractional bits and the number of iterations. In
our implementation, we keep 9 b for the fractional parts and use 13 iterations. The CORDIC algorithm produces two ρ values per cycle. In the DA architecture [18], the accuracy depends on the fractional part, and we keep 3 b for the fractional parts. The throughput depends on the number of bits of the image coordinates. Since the image coordinates are 9 b, each ρ value is computed in nine clock cycles. The accuracy of Chern's [19] method depends on the number of fractional bits, and it computes one ρ value per cycle. We use two proposed PEs to simultaneously calculate the two angles (θ and 180° − θ) of one 2 × 4 block, since the two PEs could share some resources. So, 16 ρ values are computed per cycle. In general, using more parallel PEs, higher throughput
can be achieved; however, this will also use more resources. So,
the more important number in the comparison in Table III is the
Throughput/ALUTs. As can be seen from Table III, our proposed architecture can achieve much better Throughput/ALUTs
compared to other architectures. Although the maximum frequency of the proposed architecture is lower, it could achieve
higher throughput with the minimum resources.
B. Memory and Bandwidth Requirement
Previous research reported that the Hough transform is not only computation-demanding but also memory-bandwidth-demanding. Here, we analyze the memory bandwidth of the proposed architecture.

The entire set of votes of the Hough transform is usually too large to be stored in internal temporary storage, and so it is stored in an external memory. Hence, the direct implementation requires two external memory accesses for accumulating each nonzero pixel of an angle. In addition, all votes should be initialized before the accumulation. The maximum ρ value of an angle is 724. Hence, the required memory bandwidth is approximately (2rWH + 724) × N_θ memory accesses per frame, where r is the ratio of the number of nonzero pixels to the total number of pixels of the binary feature image and N_θ is the number of angles.
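As a rough worked example of these expressions (a nominal case, not a row of Table IV): for a 512 × 512 image with r = 10% and N_θ = 180, the direct implementation needs about (2 × 0.1 × 512 × 512 + 724) × 180 ≈ 9.6 M external accesses per frame, whereas the proposed architecture needs about 724 × 180 ≈ 0.13 M, roughly a 70× reduction.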


TABLE III
PERFORMANCE COMPARISON AMONG THE PES OF DIFFERENT APPROACHES

TABLE IV
MEMORY BANDWIDTH OF DIRECT IMPLEMENTATION OF HOUGH TRANSFORM
AND THE PROPOSED ARCHITECTURE

TABLE V
SYNTHESIS RESULT OF THE PROPOSED ARCHITECTURE ON EP2S180F1508C3

TABLE VI
IMAGE SPECIFICATIONS AND EXECUTION TIME

The proposed PE uses the Vote Memory to store the votes of an angle. In processing an angle of the Hough transform, votes are temporarily stored in the Vote Memory to avoid accessing the external memory. After an angle of the Hough transform is calculated, a transfer from the Vote Memory to the external memory is initiated. The required memory bandwidth is therefore 724 × N_θ external accesses per frame.
Since the memory bandwidth is dependent on images,
Table IV compares the required memory bandwidth of the test
images between the direct implementation of Hough transform
and the proposed architecture. All image sizes are 512 × 512.
The number of votes for an angle is 724 and the size of a
vote is limited to 9 b. The total number of angles is 180. The
result shows that the proposed architecture requires much less
memory bandwidth than the direct implementation of Hough
transform.
C. Computation Time
We use the proposed PE to compute the accumulated votes of the Hough transform of a 512 × 512 image. Table V shows the synthesis result on the EP2S180F1508C3. The maximum frequency is bounded by the Vote Memory. One M-RAM is used for run-length symbols and one 512-b RAM (M512) is used for the step-table. Four 4K RAMs (M4K), two for each angle, are used for the Vote Memory of each PE. The last 4K RAM (M4K) in the chip is used for storing all cos θ and sin θ values and all other constants.
The total execution time for computing Hough transform depends on the number of symbols and angles. Before accumulating votes in the parameter space, the Vote Memory should
be initialized, and it takes 128 cycles. The time to initialize the
constants of PEs and to output the content of the Vote Memory
is overlapped with the initialization of the Vote Memory. The
angles θ and 180° − θ are computed simultaneously, but 0° and 90° are computed individually. Table VI shows the specification and execution time for each image. With the execution
time shown in Table VI, real-time processing of video can be
easily achieved.
D. Extending the Proposed Architecture to Different Image Sizes
The proposed architecture can be extended to process the Hough transform of different image sizes. The first step is to decide the block-size. Block-sizes affect the number of symbols of an image and the computation time. For different block-sizes, the Vote Memory should be carefully designed as described in Section IV-C to match the bandwidth requirement of the maximum vote-offset (Table I). The proposed PE can process a symbol per clock cycle, and the total number of cycles to process the Hough transform of an image is the number of symbols multiplied by the number of angle passes (N_θ/2 + 1, since two angles are processed per pass while 0° and 90° are processed individually). Since the design of the Vote Memory varies with FPGAs, the clock cycles for the initialization of the Vote Memory and the transmission between the Vote Memory and the external memory are not included.
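As a rough worked example of this cycle count (with a hypothetical symbol count, not one of the test images): an image encoded into 5000 symbols with N_θ = 180 takes about 5000 × (180/2 + 1) = 455 000 cycles, i.e., roughly 2.3 ms at 200 MHz, which is consistent with the 2.07–3.16 ms range reported for the test images.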
The resource requirement of a PE also varies with different image-sizes and block-sizes. The inter-block incrementing requires two adders, and their bit-width depends on the image-size and the precision F. The intra-block incrementing requires M × N − 1 adders, and their bit-width depends on the block-size and the precision F. The Vote Consolidation requires V (the maximum vote-offset) adders to consolidate the votes. The number of votes to consolidate is at most M × N, and the bit-width of these adders is on the order of log2(M × N) bits. The Vote Alignment requires a rotator to rotate the consolidated votes, and its width depends on the number of votes per memory word and the bit-width of a vote.
VI. CONCLUSION
In this paper, we propose a resource-efficient architecture for calculating the Hough transform. The incrementing property for both inter-block and intra-block incrementing is exploited to reduce the resource requirement. We use two accumulators to facilitate the inter-block incrementing, and zero-blocks are skipped by introducing a run-length coding scheme and a step-table. The intra-block incrementing efficiently reduces the resource requirement: instead of computing the ρ of every pixel in a block, the vote-offset is a more efficient way to determine the corresponding votes. We observe that pixels which are in the same block may generate identical votes in the parameter space. The locality of a block is analyzed, and the votes corresponding to an identical ρ are consolidated in order to reduce the memory access and fully utilize the FPGA memory bandwidth. The results show that the proposed PE achieves the best throughput for the same amount of resources compared to previously reported architectures. The proposed PE is implemented on an Altera EP2S180F1508C3 device, and the maximum frequency is 200 MHz. It can compute the Hough transform of a 512 × 512 image with 180 orientations in 2.07–3.16 ms. This performance is sufficient for real-time video processing.
REFERENCES
[1] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Commun. ACM, vol. 15, pp. 11-15, 1972.
[2] F. O'Gorman and M. B. Clowes, "Finding picture edges through collinearity of feature points," IEEE Trans. Comput., vol. C-100, pp. 449-456, 1976.
[3] L. A. F. Fernandes and M. M. Oliveira, "Real-time line detection through an improved Hough transform voting scheme," Pattern Recognit., vol. 41, no. 1, pp. 299-314, 2008.
[4] L. Lin and V. K. Jain, "Parallel architectures for computing the Hough transform and CT image reconstruction," in Proc. Int. Conf. Applic. Specific Array Processors, 1994, pp. 152-163.
[5] R. Strzodka, I. Ihrke, and M. Magnor, "A graphics hardware implementation of the generalized Hough transform for fast object recognition, scale, and 3D pose detection," in Proc. 12th Int. Conf. Image Anal. Process., 2003, pp. 188-193.
[6] A. L. Fisher and P. T. Highnam, "Computing the Hough transform on a scan line array processor," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 3, pp. 262-265, Mar. 1989.
[7] M. Atiquzzaman, "Pipelined implementation of the multiresolution Hough transform in a pyramid multiprocessor," Pattern Recognit. Lett., vol. 15, no. 9, pp. 841-851, 1994.
[8] K. Hanahara, T. Maruyama, and T. Uchiyama, "A real-time processor for the Hough transform," IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 1, pp. 121-125, Jan. 1988.
[9] F. Zhou and P. Kornerup, "A high speed Hough transform using CORDIC," Univ. Southern Denmark, Tech. Rep. PP-1995-27, 1995.
[10] S. M. Karabernou and F. Terranti, "Real-time FPGA implementation of Hough transform using gradient and CORDIC algorithm," Image Vis. Computing, vol. 23, no. 11, pp. 1009-1017, 2005.
[11] J. D. Bruguera, N. Guil, T. Lang, J. Villalba, and E. L. Zapata, "Cordic based parallel/pipelined architecture for the Hough transform," J. VLSI Signal Process., vol. 12, no. 3, pp. 207-221, 1996.
[12] D. D. S. Deng and H. Elgindy, "High-speed parameterisable Hough transform using reconfigurable hardware," in Proc. Pan-Sydney Area Workshop Vis. Inf. Process., Sydney, Australia, 2001, vol. 11, pp. 51-57.
[13] K. Maharatna and S. Banerjee, "A VLSI array architecture for Hough transform," Pattern Recognit., vol. 34, no. 7, pp. 1503-1512, 2001.
[14] E. K. Jolly and M. Fleury, "Multi-sector algorithm for hardware acceleration of the general Hough transform," Image Vis. Computing, vol. 24, no. 9, pp. 970-976, 2006.
[15] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330-334, 1959.
[16] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proc. ACM/SIGDA 6th Int. Symp. Field Programmable Gate Arrays, Monterey, CA, 1998, pp. 191-200.
[17] P. Lee and A. Evagelos, "An implementation of a multiplierless Hough transform on an FPGA platform using hybrid-log arithmetic," in Proc. SPIE, 2008, vol. 6811, p. 68110G.
[18] K. Mayasandra, S. Salehi, W. Wang, and H. M. Ladak, "A distributed arithmetic hardware architecture for real-time Hough-transform-based segmentation," Can. J. Electr. Comput. Eng., vol. 30, no. 4, pp. 201-205, 2005.
[19] M.-Y. Chern and Y.-H. Lu, "Design and integration of parallel Hough-transform chips for high-speed line detection," in Proc. 11th Int. Conf. Parallel Distrib. Syst. Workshops, 2005, vol. 2, pp. 42-46.
[20] S. S. Sathyanarayana, R. K. Satzoda, and T. Srikanthan, "Exploiting inherent parallelisms for accelerating linear Hough transform," IEEE Trans. Image Process., vol. 18, no. 10, pp. 2255-2264, Oct. 2009.
[21] H. Koshimizu and M. Numada, "FIHT2 algorithm: A fast incremental Hough transform," IEICE Trans., vol. E74, pp. 3389-3393, 1991.
[22] S. Tagzout, K. Achour, and O. Djekoune, "Hough transform algorithm for FPGA implementation," Signal Process., vol. 81, no. 6, pp. 1295-1301, 2001.
[23] O. Djekoune and K. Achour, "Incremental Hough transform: An improved algorithm for digital device implementation," Real-Time Imaging, vol. 10, no. 6, pp. 351-363, 2004.
[24] H. Bessalah, S. Seddiki, F. Alim, and M. Bencherif, "On line mode incremental Hough transform implementation on Xilinx FPGAs," in Proc. 8th Conf. Signal, Speech Image Process., Santander, Cantabria, Spain, 2008, pp. 176-179.
[25] Y. He, Z. Zivkovic, R. Kleihorst, A. Danilin, and H. Corporaal, "Real-time implementations of Hough transform on SIMD architecture," in Proc. 2nd ACM/IEEE Int. Conf. Distrib. Smart Cameras, 2008, pp. 1-8.
[26] M. Khan, A. Bais, K. Yahya, G. Hassan, and R. Arshad, "A swift and memory efficient Hough transform for systems with limited fast memory," in Image Analysis and Recognition, Lecture Notes in Computer Science, vol. 5627, 2009, pp. 297-306.
[27] S. R. Geninatti, J. I. B. Benítez, and M. H. Calviño, "FPGA implementation of the generalized Hough transform," in Proc. Int. Conf. Reconfigurable Computing and FPGAs, 2009, pp. 172-177.
[28] Y.-K. Chen, W. Li, J. Li, and T. Wang, "Novel parallel Hough transform on multi-core processors," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2008, pp. 1457-1460.
[29] W. Li and Y.-K. Chen, "Parallelization, performance analysis, and algorithm consideration of Hough transform on chip multiprocessors," ACM SIGARCH Comput. Architecture News, vol. 36, pp. 10-17, 2008.
[30] D. Salomon, Data Compression: The Complete Reference, 4th ed. New York: Springer, 2006.
[31] C. H. Messom, G. Sen Gupta, and S. N. Demidenko, "Hough transform run length encoding for real-time image processing," IEEE Trans. Instrum. Meas., vol. 56, no. 3, pp. 962-967, Jun. 2007.
[32] S. C. Hinds, J. L. Fisher, and D. P. D'Amato, "A document skew detection method using run-length encoding and the Hough transform," in Proc. 10th Int. Conf. Pattern Recognit., 1990, vol. 1, pp. 464-468.
[33] H. Liu, Q. Wu, H. Zha, and X. Liu, "Skew detection for complex document images using robust borderlines in both text and non-text regions," Pattern Recognit. Lett., vol. 29, no. 13, pp. 1893-1900, 2008.
[34] M. Kovac and N. Ranganathan, "JAGUAR: A fully pipelined VLSI architecture for JPEG image compression standard," Proc. IEEE, vol. 83, no. 2, pp. 247-258, Feb. 1995.
Zhong-Ho Chen received the M.S. and Ph.D.
degrees in computer science and information engineering from National Cheng-Kung University,
Tainan, Taiwan, in 2005 and 2011, respectively.
He was a Visiting Student with the University
of Washington, Seattle, from May 2010 to January
2011. Currently, he holds a Postdoctoral position
with the SCREAM Lab, Department of Computer
Science and Information Engineering, National
Cheng-Kung University, Tainan, Taiwan. His research activities include digital signal processing,
VLSI/FPGA circuit design, computer architecture, and embedded systems.

Alvin W. Y. Su was born in Taiwan in 1964.


He received the B.S. degree in control engineering
from National Chiao-Tung University, Hsinchu,
Taiwan, in 1986, and the M.S. and Ph.D. degrees in
electrical engineering from Polytechnic University,
Brooklyn, NY, in 1990 and 1993, respectively.
From 1993 to 1994, he was with the Center
for Computer Research in Music and Acoustics
(CCRMA), Stanford University, Stanford, CA. From
1994 to 1995, he was with Computer Communication Laboratory, Industrial Technology Research
Institute, Taiwan. In 1995, he joined the Department of Information Engineering and Computer Engineering, Chung-Hwa University, Taiwan, where he
serves as an Associate Professor. In 2000, he joined the Department of Computer Science and Information Engineering, National Cheng-Kung University,
Tainan, Taiwan, where he is a Professor. His research interests cover the areas
of digital audio signal processing, musical signal analysis and synthesis, pattern
recognition, data compression, image/video signal processing, and VLSI signal
processing.

Ming-Ting Sun (S'79-M'81-SM'89-F'96) received the B.S. degree from National Taiwan
University, Taipei, Taiwan, in 1976, the M.S. degree
from the University of Texas at Arlington in 1981,
and the Ph.D. degree from University of California,
Los Angeles, in 1985, all in electrical engineering.
He joined the University of Washington, Seattle, in
August 1996, where he is a Professor. Previously, he
was the Director of the Video Signal Processing Research Group at Bellcore. He has been a Chaired/Visiting Professor with Tsinghua University, Tokyo University, National Taiwan University, National Cheng Kung University, National
Chung Cheng University, National Sun Yat-sen University, and Hong Kong University of Science and Technology. He holds ten patents and has published over
200 technical papers, including 14 book chapters in the area of video and multimedia technologies. He coedited a book, Compressed Video Over Networks
(CRC, 2000).
Dr. Sun was the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA
(TMM) and a Distinguished Lecturer of the Circuits and Systems Society from
2000 to 2001. He received an IEEE CASS Golden Jubilee Medal in 2000, and
was the general co-chair of the Visual Communications and Image Processing
2000 Conference. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT) from 1995 to 1997.
He received the TCSVT Best Paper Award in 1993. From 1988 to 1991, he was
the chairman of the IEEE Circuits and Systems Society Standards Committee
and established the IEEE Inverse Discrete Cosine Transform Standard. He received an Award of Excellence from Bellcore for his work on the digital subscriber line in 1987.
