Manuscript received December 18, 2010; revised April 07, 2011; accepted
June 01, 2011. Date of publication July 25, 2011; date of current version June
14, 2012. This work was supported in part by the National Science Council,
Taiwan, under Grant NSC 98-2221-E-006-158-MY3.
Z.-H. Chen and A. W. Y. Su are with the SCREAM Lab, Department of Computer Science and Information Engineering, National Cheng-Kung University, 701 Tainan, Taiwan (e-mail: zhonghochen@gmail.com; alvinsu@mail.ncku.edu.tw).
M.-T. Sun is with the Electrical Engineering Department, University of Washington, Seattle, WA 98195 USA (e-mail: sun@ee.washington.edu).
Digital Object Identifier 10.1109/TVLSI.2011.2160002
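for each feature point $(x, y)$
  for $\theta$ from $0^\circ$ to $180^\circ$, in uniform steps dividing $180^\circ$
    $\rho = x\cos\theta + y\sin\theta$; increment the vote of $(\rho, \theta)$
  end for
end for.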
After the voting process, the $(\rho, \theta)$ pairs whose vote counts are
local maxima are considered candidate lines. In this paper,
we focus only on the voting process of the Hough transform.
Given a CIF (352 × 288) video at 30 frames/second (fps) with
10% feature points, about 109 M multiplications per second
(352 × 288 × 10% × 30 × 180 × 2, i.e., two multiplications per
feature point per angle) are needed to compute the $\rho$ values
for 180 angles. Embedded applications therefore require
hardware accelerators to achieve a real-time Hough
transform. Compared with application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) usually
target smaller markets and require much less development time.
In an FPGA, high throughput is often achieved by exploiting the
parallelism of the design rather than by operating the chip at
a very high clock frequency. In addition, a better architecture
should utilize the function blocks in the FPGA more
efficiently. In this paper, we propose an architecture for, and an
implementation of, the Hough transform on an FPGA that exploits
both angle-level and pixel-level parallelism. The goal is to
achieve the highest throughput with the minimum hardware
resources.
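The voting process and the multiplication count quoted above can be sketched in software as follows. This is only an illustrative reference model, not the proposed hardware: the function names, the rounding of $\rho$ to the nearest integer, and the accumulator layout (indices shifted so that negative $\rho$ values fit) are assumptions made for the example.

```python
import math

def hough_vote(points, width, height, n_angles=180):
    """Reference voting loop: rho = x*cos(theta) + y*sin(theta)."""
    rho_max = math.ceil(math.hypot(width, height))
    # rho can be negative for theta > 90 degrees, so shift indices by rho_max.
    acc = [[0] * (2 * rho_max + 1) for _ in range(n_angles)]
    for x, y in points:
        for k in range(n_angles):
            theta = math.pi * k / n_angles          # uniform angular steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[k][rho + rho_max] += 1              # one vote per (rho, theta)
    return acc, rho_max

def multiplications_per_second(width, height, fps, feature_ratio, n_angles):
    # Two multiplications (x*cos and y*sin) per feature point per angle.
    return width * height * feature_ratio * fps * n_angles * 2

if __name__ == "__main__":
    # CIF video, 30 fps, 10% feature points, 180 angles.
    print(multiplications_per_second(352, 288, 30, 0.10, 180) / 1e6)
```

For the CIF case in the text, the count comes to roughly 109 million multiplications per second, which is why a purely sequential software loop like this one is a poor fit for embedded real-time use.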
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012
Vote: A line with parameters $(\rho, \theta)$ which passes through a
feature point generates a vote for that specific $(\rho, \theta)$ value.
Vote-offset
[Figure panels: (a) $0^\circ \le \theta < 90^\circ$. (b) $90^\circ \le \theta < 180^\circ$.]
TABLE I
MAXIMUM VOTE-OFFSET FOR DIFFERENT BLOCK-SIZES
where […] and […]. Since […], where […] can be represented as […], (5)
we can compute the maximum […] among all […] as […]
Fig. 7. (a) Proposed accumulator architecture for the inter-block incrementing. (b) Basic dataflow of the circuit.
[…] (with addresses […] to […]), and each time we accumulate the
memory contents in parallel. This is not possible without the […].
To update the contents of the five memory locations that hold these votes,
instead of accessing the memory five times sequentially, we update all five memory contents in one clock cycle. This is possible because, with our vote consolidation scheme, the five memory contents that need updating are
stored in contiguous memory locations. The first of these addresses is considered
the base address of the votes. We use two 4K RAM blocks
in the FPGA to implement the Vote Memory. Because the accumulation requires one read and one write memory operation, the
4K RAM blocks are configured as dual-port. In the FPGA, each
4K RAM can be configured as 128 × 36, 256 × 18, 512 × 9,
1024 × 4, 2048 × 2, or 4096 × 1. The maximum bit-width of a
4K RAM is 36 b, so two 4K RAMs can store eight votes per access (each
vote with 9 b). Since we only have five vote-offsets, three zero-valued
votes are added in Fig. 10(a). If the block-size is
larger than 2 × 4 or 4 × 2, the circuit can be modified accordingly, or it can use multiple clock cycles to handle a symbol.
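As a small illustration of the word layout described here, the sketch below packs four 9-b votes into one 36-b word (the widest 4K RAM configuration) and pads five vote-offsets to eight lanes. The helper names and the choice of placing vote 0 in the least-significant bits are assumptions for this example only; the source does not specify the bit ordering.

```python
VOTE_BITS = 9
VOTES_PER_WORD = 4                      # 4 x 9 b = 36 b per 4K RAM word
VOTE_MASK = (1 << VOTE_BITS) - 1        # 0x1FF

def pack_votes(votes):
    """Pack four 9-b votes into one 36-b word (vote 0 in the low bits)."""
    word = 0
    for i, v in enumerate(votes):
        word |= (v & VOTE_MASK) << (i * VOTE_BITS)
    return word

def unpack_votes(word):
    """Split a 36-b word back into its four 9-b votes."""
    return [(word >> (i * VOTE_BITS)) & VOTE_MASK for i in range(VOTES_PER_WORD)]

def pad_offsets(offsets, lanes=8):
    """Append zero-valued votes so the vector fills both RAM words."""
    return list(offsets) + [0] * (lanes - len(offsets))
```

With two such words read in the same cycle, eight vote lanes are available, so five vote-offsets plus three zero lanes fit in a single access.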
The consolidated votes must be aligned based on the base address before they are accumulated into the Vote Memory. Since
the base address may not be a multiple of 8, we need to align the five
vote-offsets with the correct memory locations before the accumulation. This is achieved with a barrel rotator controlled by
the base address, as shown in Fig. 10(a), which
aligns the votes with the corresponding Vote Memory contents.
The votes are divided into two groups of four
votes each. The lower four votes are stored in one 4K RAM and the
upper four votes are stored in the other 4K RAM. Fig. 10(b)
shows the configuration of the two 4K RAMs. In the circuit implementation, the upper bits of the base address are used to address the memory and
the lower bits are used to align the votes. If the base address
is located in the upper 4K RAM, the address of the lower 4K RAM
should be increased by 1 to give the correct addresses.
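The alignment and addressing rule can be checked with a small behavioral model. This is a sketch under assumptions, not the circuit itself: each RAM is modeled as a list of rows holding four vote counters, the base address is split into a 3-bit lane index and a row index, and the names `accumulate` and `read_vote` are hypothetical.

```python
LANES = 8   # eight 9-b votes per row: four in each 4K RAM

def accumulate(lower_ram, upper_ram, base, offsets):
    """Add at most five vote-offsets to addresses base .. base+4 in one step."""
    votes = list(offsets) + [0] * (LANES - len(offsets))  # pad with zero votes
    shift = base % LANES     # lower bits of the base address: alignment
    row = base // LANES      # upper bits of the base address: RAM address
    # Barrel rotation: vote i lands in lane (shift + i) mod 8.
    rotated = [votes[(lane - shift) % LANES] for lane in range(LANES)]
    # If the base lies in the upper RAM (lanes 4..7), the votes that wrap
    # around fall into the next row of the lower RAM.
    lower_addr = row + 1 if shift >= 4 else row
    for lane in range(4):
        lower_ram[lower_addr][lane] += rotated[lane]
        upper_ram[row][lane] += rotated[lane + 4]

def read_vote(lower_ram, upper_ram, addr):
    """Read the vote counter stored at a flat vote address."""
    row, lane = divmod(addr, LANES)
    return lower_ram[row][lane] if lane < 4 else upper_ram[row][lane - 4]
```

Because only five of the eight lanes carry non-zero votes, every lane that would need a different row than the one addressed in its RAM receives a zero vote, so a single address per RAM is sufficient in this model.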
In our implementation, the image size is 512 × 512. Although
the maximum number of votes can be a 10-b number, in practice we limit it to a 9-b number, since, if a $(\rho, \theta)$ pair receives more than
511 votes, it is certain that there is a line with that $(\rho, \theta)$ value in the
input binary feature image.
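One way to keep a vote counter within 9 b is a saturating add, sketched below. The text does not state how the limit is enforced in the circuit, so saturation at 511 is an assumption of this example, and `add_vote` is a hypothetical helper.

```python
VOTE_MAX = (1 << 9) - 1   # 511, the largest 9-b value

def add_vote(count, offset):
    """Add a vote-offset to a counter, clamping at the 9-b maximum (assumed)."""
    return min(count + offset, VOTE_MAX)
```

A counter that has reached 511 already indicates a line, so clamping loses no information that the later peak detection needs.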
TABLE II
ACCURACY, FPGA RESOURCES, AND MAXIMUM FREQUENCY
UNDER DIFFERENT ACCURACY FOR THE PE
Fig. 10. (a) Vote alignment. (b) Vote memory and accumulators.
D. Initialization
Before a PE starts to process an angle, a host needs to initialize three components of the PE: 1) the row-register, 2) the step table, and 3) the registers in the intra-block incrementing block for storing […]. If a processor
is used to control the PE, these values can be computed by the
processor. Otherwise, a lookup table and an accumulator are required to compute these values. The lookup table stores all […]
of all angles, and […] of an angle is computed by […].
V. EVALUATION
Because the output of the proposed architecture is identical
to the ideal Hough transform, we do not show the result of test
images. Here, we evaluate the resource requirement, memory
bandwidth, and computation time of the proposed architecture.
A. Resource Requirement
In the hardware implementation, the […] and […] are
represented in the fixed-point format. Let the fraction part be
represented in […] bits. The maximum error introduced by each
increment is […]. There are $(W/M - 1)$ steps in the $x$-direction
and $H/N$ steps, including the initialization, in the $y$-direction. Moreover, intra-block incrementing involves another increment. Therefore, the maximum error of […] is […].
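The increment count in this argument can be turned into a numeric bound with a short calculation. The sketch assumes that each increment contributes an error of at most $2^{-n}$ for $n$ fraction bits; the per-increment error expression is not recoverable from the text here, so it is exposed as a parameter. The function name and the 4 × 2 block-size in the usage line are likewise illustrative.

```python
def max_accumulated_error(width, height, block_w, block_h, frac_bits,
                          per_increment_error=None):
    """Worst-case error after all increments (assumed per-step error 2**-n)."""
    eps = 2.0 ** -frac_bits if per_increment_error is None else per_increment_error
    # (W/M - 1) steps in x, H/N steps in y (including the initialization),
    # plus one more increment for the intra-block incrementing.
    increments = (width // block_w - 1) + (height // block_h) + 1
    return increments * eps

if __name__ == "__main__":
    # 512 x 512 image, hypothetical 4 x 2 block, 12 fraction bits.
    print(max_accumulated_error(512, 512, 4, 2, 12))
```

Each extra fraction bit halves this bound under the stated assumption, which is the trade-off against resources and clock frequency that the evaluation discusses.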
Table II shows the accuracy, FPGA resources, and the maximum achievable frequency for PEs at different accuracy levels.
ALUT stands for adaptive LUT, and it is the basic cell in Altera
Stratix II FPGAs. The results are reported by the FPGA vendor's
synthesis tool, Quartus II Version 9.1 with SP2, on the device
EP2S180F1508C3. Increasing the number of bits in the
fixed-point implementation reduces the maximum error.
However, it also increases the required resources and reduces
the maximum achievable frequency.
In order to evaluate the proposed architecture, we also
implement the architectures of previous works [9], [18], [19]
on the same device. In the implementation, we use 12 b for
the fractional part. Table III compares the throughput of these
approaches, where the Throughput per Cycle is measured
as the average number of computed $\rho$ values per cycle. The
Throughput (M/s) is measured in millions of $\rho$ values per
second. In the comparisons of the throughput, we do not consider the effect of the Vote Memory and Accumulators, since
TABLE III
PERFORMANCE COMPARISON AMONG THE PES OF DIFFERENT APPROACHES
TABLE IV
MEMORY BANDWIDTH OF DIRECT IMPLEMENTATION OF HOUGH TRANSFORM
AND THE PROPOSED ARCHITECTURE
TABLE V
SYNTHESIS RESULT OF THE PROPOSED ARCHITECTURE ON EP2S180F1508C3
TABLE VI
IMAGE SPECIFICATIONS AND EXECUTION TIME
[32] S. C. Hinds, J. L. Fisher, and D. P. D'Amato, "A document skew detection method using run-length encoding and the Hough transform," in Proc. 10th Int. Conf. Pattern Recognit., 1990, vol. 1, pp. 464–468.
[33] H. Liu, Q. Wu, H. Zha, and X. Liu, "Skew detection for complex document images using robust borderlines in both text and non-text regions," Pattern Recognit. Lett., vol. 29, no. 13, pp. 1893–1900, 2008.
[34] M. Kovac and N. Ranganathan, "JAGUAR: A fully pipelined VLSI architecture for JPEG image compression standard," Proc. IEEE, vol. 83, no. 2, pp. 247–258, Feb. 1995.
Zhong-Ho Chen received the M.S. and Ph.D.
degrees in computer science and information engineering from National Cheng-Kung University,
Tainan, Taiwan, in 2005 and 2011, respectively.
He was a Visiting Student with the University
of Washington, Seattle, from May 2010 to January
2011. Currently, he holds a Postdoctoral position
with the SCREAM Lab, Department of Computer
Science and Information Engineering, National
Cheng-Kung University, Tainan, Taiwan. His research activities include digital signal processing,
VLSI/FPGA circuit design, computer architecture, and embedded systems.
Ming-Ting Sun (S'79–M'81–SM'89–F'96) received the B.S. degree from National Taiwan
University, Taipei, Taiwan, in 1976, the M.S. degree
from the University of Texas at Arlington in 1981,
and the Ph.D. degree from University of California,
Los Angeles, in 1985, all in electrical engineering.
He joined the University of Washington, Seattle, in
August 1996, where he is a Professor. Previously, he
was the Director of the Video Signal Processing Research Group at Bellcore. He has been a Chaired/Visiting Professor with Tsinghua University, Tokyo University, National Taiwan University, National Cheng Kung University, National
Chung Cheng University, National Sun Yat-sen University, and Hong Kong University of Science and Technology. He holds ten patents and has published over
200 technical papers, including 14 book chapters in the area of video and multimedia technologies. He coedited a book, Compressed Video Over Networks
(CRC, 2000).
Dr. Sun was the Editor-in-Chief of the IEEE TRANSACTIONS ON MULTIMEDIA
(TMM) and a Distinguished Lecturer of the Circuits and Systems Society from
2000 to 2001. He received an IEEE CASS Golden Jubilee Medal in 2000, and
was the general co-chair of the Visual Communications and Image Processing
2000 Conference. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT) from 1995 to 1997.
He received the TCSVT Best Paper Award in 1993. From 1988 to 1991, he was
the chairman of the IEEE Circuits and Systems Society Standards Committee
and established the IEEE Inverse Discrete Cosine Transform Standard. He received an Award of Excellence from Bellcore for his work on the digital subscriber line in 1987.