

A Fast Integral Image Computing Hardware Architecture With High Power and Area Efficiency

Article in IEEE Transactions on Circuits and Systems II: Express Briefs · January 2015
DOI: 10.1109/TCSII.2014.2362651




A Parallel Hardware Architecture for Fast Integral
Image Computing
Yuchi Zhang, Shouyi Yin, Peng Ouyang, Leibo Liu and Shaojun Wei
Institute of Microelectronics,
Tsinghua University, Beijing 100084, China
Email: yinsy@tsinghua.edu.cn

Abstract—This paper proposes a method for fast integral image computation in hardware. We propose a highly efficient hardware-oriented algorithm and design a pipelined architecture suited to it. The parallelism and time complexity of the algorithm are analyzed, and the hardware implementation of each operation of the algorithm is presented. Compared with two related works, our architecture is the most efficient: it reaches the highest speed while consuming comparatively few logic resources and little power.

Keywords—Integral Image, Parallel Algorithm, Pipelined Architecture

I. INTRODUCTION

The AdaBoost learning algorithm, proposed by Yoav Freund and Robert Schapire [1], is one of the most widely used algorithms in detection. Haar-like features are commonly used in AdaBoost as simple weak classifiers. AdaBoost based on Haar-like features was proposed and used by Viola and Jones for object detection [2]; it achieves a high detection rate and is widely used in face and pedestrian detection [3][4].

AdaBoost based on Haar-like features is computation intensive, and real-time detection with it is hard to achieve. Several attempts have therefore been made to accelerate it, such as the hardware acceleration in [5]. During the processing of this algorithm, the calculation of the integral image, a critical part of the whole algorithm, usually accounts for more than 50% of the total execution time. The efficiency of detection based on this algorithm is therefore greatly increased if a faster method of integral image computation is adopted.

In embedded applications such as object detection in automotive systems, biomedical systems or portable systems, real-time processing is required within limited power and size budgets. Specialized hardware consumes little power and can be built into small systems, which makes it well suited to embedded use, so implementing fast integral image computation in specialized hardware is of practical significance. Different algorithms have been proposed in related works to perform integral image computation efficiently, such as Kyrkou's [6] and Hiromoto's [7] systems. However, each of these works has its own disadvantages, so room to improve the efficiency of the computation still exists: a higher computation speed can be achieved with comparatively low power and small area.

In this paper, we propose a hardware-oriented algorithm for integral image computation and, based on it, design a parallel hardware architecture. For comparison, we also implement in hardware the integral image computing sub-architectures of the object detection systems in [6] and [7]. Performance metrics including speed, area and power dissipation are presented to show that our architecture is more efficient.

This paper is organized as follows. In Section II, we give the definition of the integral image and present a property that our calculation makes use of. In Section III, we propose an algorithm for fast integral image computation and analyze its time complexity. In Section IV, we present the hardware architecture based on the algorithm of Section III. In Section V, we show the FPGA implementation results and compare our structure with the related works [6] and [7]. Finally, we draw our conclusion and list the references.

II. INTEGRAL IMAGE GENERATION

The integral image, also known as a summed area table, is a data structure that can quickly and efficiently generate the sum of values over any rectangular subset of a grid. It was first introduced to computer graphics in 1984 by Frank Crow [8]. The value ii(x, y) of the integral image at any point (x, y) is defined in formula (1) as the sum of all the pixels of the original image i above and to the left of that point, as shown in Fig. 1.a:

    ii(x, y) = \sum_{x' \le x} \sum_{y' \le y} i(x', y')                    (1)

With the integral image, Haar-like features can be calculated conveniently in constant time [2].

The integral image has a property that our hardware-oriented algorithm makes use of (see Fig. 1.b). Split the image into a left area A of width x_A and a right area B, and let (x, y) be a point in B. The sum of all the pixels above and to the left of (x, y) is the value ii(x, y) of the integral image:

    ii(x, y) = \sum_{x' \le x} \sum_{y' \le y} i(x', y')                    (2)

It is also the sum of those pixels lying in area A plus the sum of those lying in area B:

    ii(x, y) = \sum_{x' \le x_A} \sum_{y' \le y} i(x', y')
             + \sum_{x_A < x' \le x} \sum_{y' \le y} i(x', y')              (3)

The first term on the right of the equals sign in (3) is the value of the integral image at the point (x_A, y). Writing ii_B for the integral image of B regarded as an independent image, we get (4):

    ii(x, y) = ii(x_A, y) + ii_B(x - x_A, y)                                (4)

Therefore, to calculate the integral image values at points in B given A's integral image, we can regard B as an independent image, calculate its own integral image, and then add to each row of that integral image the rightmost value in the same row of A's integral image.

Fig. 1. (a) Definition of Integral Image (b) A property of Integral Image

In other words, if we know the integral image of A, we only need the rightmost column of A's integral image to calculate the integral image value at every point of B.

III. PROPOSED METHOD

To calculate the integral image of an image of height n and width m, a software algorithm needs at least m × n cycles, since every pixel of the image must be accessed at least once.

On hardware, however, pixels in the same row can be operated on simultaneously. Such operation exploits the high parallelism of hardware structures and thus reduces the time complexity of integral image computation.

We propose an integral image computing algorithm that is efficient and suitable for hardware implementation. We use a small image as an example to show each step of our method in Fig. 2; its integral image is obtained in only 6 steps.

Fig. 2. Processing of our proposed algorithm

For such an image, we need n - 1 row operations and m - 1 column operations to calculate its integral image, so m + n - 2 steps are used in total. The time complexity is O(m + n), much lower than the O(mn) of software algorithms, especially when m and n are large, e.g. several hundred or several thousand.

To implement this algorithm on hardware, a pipelined structure is designed. We access the image row by row, so row operations can be performed by adding the incoming row of pixels to the current row in each cycle, and column operations can be achieved using cascaded row registers with adders between them. We illustrate the details of our architecture in Section IV.

For images of large size, operating on a whole row of pixels may be impossible: the output bit width of the memory that stores the image pixels is limited, so a whole row cannot be accessed simultaneously.

Fig. 3. Image is divided into "stripes"

To solve this problem, we divide the image into several "stripes", as shown in Fig. 3. We first calculate the integral image of stripe No. 1 and store the rightmost value of each row of its integral image. We then use the property of Section II to calculate the integral image values at the points in stripe No. 2, store the rightmost value of each row again, and continue with stripe No. 3, and so on. For an n by m image with stripe width w (usually m is exactly divisible by w), it takes n steps to calculate each stripe and there are m/w stripes; since each stripe is calculated with a pipelined structure of cascaded row and column operations, there is an additional delay of w steps in the cascade. Thus, in total, calculating the integral image takes

    n \cdot (m / w) + w

steps. By selecting an appropriate value of w, we can achieve the highest efficiency by trading off speed, area and power dissipation. To store the rightmost column of each stripe, extra registers are added to the system.

IV. HARDWARE ARCHITECTURE

Our system mainly consists of a Control Unit (CU), a Calculation Unit (CALU) and a First-In First-Out structure (FIFO). Pixels of the original image are read from the Original Image Memory (OIM), and the integral image is calculated by the CALU. Because of the high parallelism and speed of the calculation unit, an asynchronous FIFO is used to buffer the output data of the CALU, since the interface width of the Integral Image Memory (IIM) and the speed at which it can be written are limited. Fig. 4 is an overview of our system.
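To make the data flow concrete, the stripe-based computation of Section III can be modeled in software. The following Python sketch is our own illustration, not part of the original design (the function and variable names are ours); it mirrors the row operation, the column accumulation, and the carrying of each stripe's rightmost column into the next stripe:

```python
def integral_image_stripes(img, w):
    """Software model of the stripe-based integral image computation.

    img: image as a list of n rows of m pixel values; w: stripe width
    (assumed, as in the text, to divide m exactly).
    """
    n, m = len(img), len(img[0])
    ii = [[0] * m for _ in range(n)]
    carry = [0] * n  # rightmost integral-image column of the previous stripe
    for s in range(0, m, w):          # process one stripe at a time
        col = [0] * w                 # running column sums inside this stripe
        for y in range(n):
            row = 0                   # row operation: prefix sum along the row
            for x in range(w):
                row += img[y][s + x]
                col[x] += row         # column operation: accumulate down the rows
                # add the previous stripe's rightmost value for this row
                ii[y][s + x] = col[x] + carry[y]
        for y in range(n):
            carry[y] = ii[y][s + w - 1]  # keep the rightmost column for the next stripe
    return ii
```

In the hardware, the two inner loops collapse: one row of a stripe is processed per cycle, and the carry column lives in the extra registers mentioned in Section III.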
Fig. 4. System overview
The FIFO feeds back whether it is nearly full. When it is, the clock of the calculation unit is disabled by the control unit; it is enabled again once the FIFO has sent all its data to the IIM and become empty, so that data overflow is avoided.

Since the whole calculation needs to access pixels row by row within each "stripe", and stripe by stripe (see Fig. 3), we use a dedicated memory structure to store the image. The image is stored in a RAM whose interface width equals the bit width of a whole row of one stripe, so a row of pixels of a stripe can be accessed in one cycle. Rows of the same stripe are stored at consecutive addresses, and stripes are stored one after another. We use a small grayscale image with a stripe width of 3 as an example to show this memory structure (see Fig. 5). The CU accesses the image memory continuously from address 0x00 to the last address (0x11 in Fig. 5) during calculation, one address per cycle.

Fig. 5. Our unique memory accessing structure

The CALU is the core hardware structure of the system and implements our algorithm (see Fig. 6). It consists of w cascaded row registers (we select w = 4 in Fig. 6 as an example).

Fig. 6. Calculation Unit of our system

In each cycle, if the incoming row is the first row of a stripe, the first register updates its value to the incoming row of pixels; otherwise it updates its value to the sum of its current row of pixels and the incoming row. This selection is made by the bus multiplexer in Fig. 6, whose control signal "Select" is generated by the CU. The sum of two rows is defined as a row each of whose pixels is the sum of the two pixels at the same position in the two rows. In each cycle, the pixel at column k of register number k is updated to the sum of the pixels at columns k - 1 and k of register number k - 1, while the other pixels are updated to the pixel at the same column in the previous register (k = 2, 3, …, w). Every pixel of the output row then adds the output of a shift register on the right, which stores the rightmost column of the previous stripe. We output this row as part of the integral image and simultaneously store its rightmost value for the next stripe.

Since pixels of the integral image may be wider than those of the original image, we leave enough width for each pixel so that no overflow occurs during calculation. For an n × m grayscale image with 8-bit pixels (maximum value 255), the maximum width W one pixel may require in the integral image is given in formula (5):

    W = \lceil \log_2(255 \cdot n \cdot m + 1) \rceil                      (5)

For any given image size, the required width of an integral image pixel follows directly from (5).

V. EXPERIMENT

The proposed architecture is implemented on an FPGA; the details of our experiment are given below.

A. Experimental Setup

We use 3000 sample images to evaluate the average performance of the calculation systems. Some of these images are downloaded from the CMU VASC Image Database, some are acquired from the internet, and others are taken with our own camera. Images of multiple sizes, up to ultra high definition, are used in our experiment.

The experimental environment is an Altera Cyclone IV FPGA, which has 114,480 logic elements (LEs), 3,981,312 on-chip memory bits and 532 embedded 9-bit multiplier elements in total. Fig. 7 shows a sample image and our development board (whose LEDs display the value of the integral image at the last row and last column).

Fig. 7. (a) A sample image (b) Our FPGA development board

The architecture proposed in this work is implemented with w = 32, since our experiments show that this value of w yields the least power consumption per unit area. We also implement the computing algorithms of [6] and [7] on our hardware. The clock frequency of the CU is 50 MHz in our experiment. Correctness is verified with a verification tool, Signal Tap II, in our development environment, Quartus II.

We measure the number of clock cycles used during the whole execution, from which the frame rate, i.e. the speed of calculation, is derived given the clock frequency. The area and the power consumption are obtained from the synthesis report.
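The cycle count implied by Section III (n steps for each of the m/w stripes, plus a pipeline delay of w steps) gives a quick way to estimate the achievable frame rate at a given clock. The helper below is our own illustrative sketch under a one-step-per-cycle assumption; real measurements also include memory access and FIFO stalls, so it is an upper bound rather than the measured figure:

```python
def estimated_fps(n, m, w, clock_hz=50_000_000):
    """Rough throughput estimate for an n x m image with stripe width w.

    Steps per frame follow Section III: n steps per stripe, m // w
    stripes, plus a pipeline delay of w steps. Assumes one step per
    clock cycle at the given clock (50 MHz as in the experiment).
    """
    steps = n * (m // w) + w
    return clock_hz / steps
```

For instance, a hypothetical 100 × 100 image with w = 10 takes 100 · 10 + 10 = 1010 steps, i.e. roughly 49,505 frames per second at 50 MHz.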
B. Result

All three systems compute correct results. The measured speed of the three systems when calculating the integral images of the different image sizes is listed in TABLE I (Size 1 through Size 4 denote the four image sizes, from smallest to largest).

TABLE I. COMPUTATION SPEED (a)

    Image Size    Kyrkou [6]    Hiromoto [7]    This Work
    Size 1        1339          163             5191
    Size 2        446           54              1734
    Size 3        198           24              771
    Size 4        46            6               181

    a. The unit of calculation speed in this table is frames per second (fps).

Area (number of LEs) and power dissipation of each architecture are shown in TABLE II.

Our architecture clearly has the highest speed and the best overall performance of the three. Although our area and power dissipation are both the largest, the trade-off is worthwhile: our speed per unit area and speed per unit power are the highest, much higher than those of the two other works (see TABLE III and TABLE IV). We therefore achieve the best trade-off among parallelism, area and power consumption.

TABLE II. PERFORMANCE (b)

                     Kyrkou [6]    Hiromoto [7]    This Work
    Area (# of LEs)  3382          1716            3737
    Power (mW)       8.82          2.53            25.74

    b. All three works consume the same number of memory bits and no multipliers.

There are two reasons for the superior performance of our work. One is that our algorithm is the most efficient: for an n × m image, its time complexity is O(m + n), versus O(2m + n) in Kyrkou's work and O(mn) in Hiromoto's work. The other is that the relationships between speed and area, and between speed and power dissipation, are non-linear; we choose the best combination of the three indicators, so our architecture achieves the fastest processing while consuming comparatively small area and low power.

TABLE III. SPEED PER UNIT AREA (fps/LE)

    Image Size    Kyrkou [6]    Hiromoto [7]    This Work
    Size 1        0.396         0.095           1.389
    Size 2        0.132         0.031           0.464
    Size 3        0.058         0.014           0.206
    Size 4        0.014         0.0035          0.048

Because of our highly efficient algorithm and our optimized design balancing parallelism, area and power, our architecture achieves excellent performance. Our speed per unit area is 2.4~2.5 times higher than Kyrkou's and 12.7~13.6 times higher than Hiromoto's, and our speed per unit power is 33%~34% higher than Kyrkou's and 2~3 times higher than Hiromoto's.

TABLE IV. SPEED PER UNIT POWER (fps/mW)

    Image Size    Kyrkou [6]    Hiromoto [7]    This Work
    Size 1        151.8         64.42           201.7
    Size 2        50.57         21.34           67.37
    Size 3        22.45         9.486           29.95
    Size 4        5.22          2.372           7.03

In addition, even for images of large size we achieve fast processing (>180 fps for ultra-high-definition images). Our architecture is therefore well suited to real-time object detection.

VI. CONCLUSION

In this paper we propose a hardware-oriented algorithm for integral image computation. We implement it on suitable hardware and compare its performance with two related works. Our work is the most efficient of the three and achieves fast processing even for large images. It can be used in embedded systems, and we look forward to its contribution to many exciting applications.

ACKNOWLEDGEMENT

This work is supported in part by the China Major S&T Project (No. 2013ZX01033001-001-003), the International S&T Cooperation Project of China (No. 2012DFA11170), the Tsinghua Indigenous Research Project (No. 20111080997) and the NNSF of China (No. 61274131).

REFERENCES

[1] Freund Y, Schapire R E. A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, Springer Berlin Heidelberg, 1995: 23-37.
[2] Viola P, Jones M. Robust real-time object detection. International Journal of Computer Vision, 2001, 4.
[3] Lienhart R, Maydt J. An extended set of Haar-like features for rapid object detection. Proc. 2002 International Conference on Image Processing, IEEE, 2002, 1: I-900-I-903.
[4] Miyamoto R, Sugano H, Saito H, et al. Pedestrian recognition in far-infrared images by combining boosting-based detection and skeleton-based stochastic tracking. Advances in Image and Video Technology, Springer Berlin Heidelberg, 2006: 483-494.
[5] Wei Y, Bing X, Chareonsak C. FPGA implementation of AdaBoost algorithm for detection of face biometrics. 2004 IEEE International Workshop on Biomedical Circuits and Systems, 2004: S1/6-17-20.
[6] Kyrkou C, Theocharides T. A flexible parallel hardware architecture for AdaBoost-based real-time object detection. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2011, 19(6): 1034-1047.
[7] Hiromoto M, Sugano H, Miyamoto R. Partially parallel architecture for AdaBoost-based detection with Haar-like features. IEEE Transactions on Circuits and Systems for Video Technology, 2009, 19(1): 41-52.
[8] Crow F C. Summed-area tables for texture mapping. ACM SIGGRAPH Computer Graphics, 1984, 18(3): 207-212.

