A Fast Integral Image Computing Hardware Architecture With High Power and
Area Efficiency
Article in Circuits and Systems II: Express Briefs, IEEE Transactions on · January 2015
DOI: 10.1109/TCSII.2014.2362651
All content following this page was uploaded by Yuchi Zhang on 16 October 2016.
Abstract: This paper proposes a method for fast integral image computation in hardware. We propose a highly efficient hardware-oriented algorithm and design a pipelined architecture suited to it. The parallelism and time complexity of the algorithm are analyzed, and the hardware implementation of each operation of the algorithm is presented. Compared with two related works, our architecture is the most efficient: it reaches the highest speed while consuming comparatively the lowest logic resources and power.

Keywords: Integral Image, Parallel Algorithm, Pipelined Architecture

I. INTRODUCTION

The AdaBoost learning algorithm, proposed by Yoav Freund and Robert Schapire [1], is one of the most widely used algorithms in detection. Haar-like features are commonly used in AdaBoost as simple weak classifiers. AdaBoost based on Haar-like features was proposed and used by Viola and Jones for object detection [2]; it achieves a high detection rate and is widely used in face and pedestrian detection [3][4].

AdaBoost based on Haar-like features is computationally intensive, which makes real-time detection hard to achieve. Several attempts have therefore been made to accelerate it, such as the hardware acceleration in [5]. In this algorithm, the calculation of the integral image, a critical part of the whole algorithm, usually accounts for more than 50% of the total execution time. The efficiency of detection based on this algorithm would thus increase greatly if a faster method of integral image computation were adopted.

In embedded applications, such as object detection in automotive, biomedical, or portable systems, real-time processing is required within limited power and size. Because specialized hardware consumes little power and can be built into small systems, it is better suited to embedded use. Implementing fast integral image computation on specialized hardware is therefore of great practical significance. Several related works have proposed algorithms for efficient integral image computation, such as Krykou's [6] and Hiromoto's [7] systems. However, these works have their own disadvantages, so there is still room to accelerate the computation: a higher speed could be achieved with comparatively low power and small area.

In this paper, we propose a hardware-oriented algorithm for integral image computation and, based on it, design a parallel hardware architecture. For comparison, we implement in hardware the integral image computing sub-architectures of the object detection systems in [6] and [7]. Performance metrics including speed, area, and power dissipation are presented to show that our architecture is more efficient.

This paper is organized as follows. Section II gives the definition of the integral image and presents a property that we exploit in our calculation. Section III proposes an algorithm for fast integral image computation and analyzes its time complexity. Section IV presents the hardware architecture based on the algorithm proposed in Section III. Section V shows the FPGA implementation results and compares our structure with the related works in [6] and [7]. Finally, we give our conclusion and list the references.

II. INTEGRAL IMAGE GENERATION

The integral image, also known as the summed area table, is a data structure for quickly and efficiently generating the sum of values in a rectangular subset of a grid. It was first introduced to computer graphics in 1984 by Frank Crow [8]. Let $i(x, y)$ denote the original image and $ii(x, y)$ the integral image. The value at any point $(x, y)$ of the integral image is defined in formula (1) as the sum of all the pixels above and to the left of that point in the original image, as shown in Fig. 1.a.

$ii(x, y) = \sum_{x' \le x,\ y' \le y} i(x', y')$ (1)

With the integral image, Haar-like features can be calculated conveniently in constant time [2].

The integral image has a property that we exploit in our hardware-oriented algorithm (see Fig. 1.b). The sum of all the pixels in the area $A \cup B$ is the value at $(x, y)$ of the integral image.

$ii(x, y) = \sum_{(x', y') \in A \cup B} i(x', y')$ (2)

It is also the sum of all the pixels in the area A plus the sum of the pixels in the area B.

$ii(x, y) = \sum_{(x', y') \in A} i(x', y') + \sum_{(x', y') \in B} i(x', y')$ (3)

The first term on the right of the equal sign in (3) is the value of the integral image at the point $(x_0, y)$, the rightmost point of A in row $y$. Thus we get (4).

$ii(x, y) = ii(x_0, y) + \sum_{(x', y') \in B} i(x', y')$ (4)

Therefore, it is easy to infer that if we want to calculate the integral image values at points in B given A's integral image, we can regard B as an independent image and calculate its
own integral image, and then add to every value in each row of this integral image the rightmost value in the same row of A's integral image.

If we know the integral image of A, we only need the rightmost column of A's integral image to calculate the integral image value at every point of B.

III. PROPOSED METHOD

To calculate the integral image of an image with height n and width m, a software algorithm needs at least m × n cycles, as every pixel of the image must be accessed at least once.

In hardware, however, pixels in the same row can be operated on simultaneously. Doing so exploits the high parallelism of hardware structures and thus reduces the time complexity of integral image computation.

We propose an integral image computing algorithm that is efficient and suitable for hardware implementation. Fig. 2 shows each step of our method on an example image, whose integral image is calculated in only 6 steps.

For large images, operating on a whole row of pixels might be impossible: the output bit width of the memory that stores the image pixels is limited, so a whole row cannot be accessed simultaneously.

Fig. 3. Image is divided into "stripes"

To solve this problem, we divide the image into several "stripes", as shown in Fig. 3. We first calculate the integral image of stripe No. 1 and store the rightmost value of each row of its integral image. We then use the property described in Section II to calculate the integral image values at the points in stripe No. 2 and again store the rightmost value of each row, after which we can calculate the integral image values at the points in stripe No. 3, and so on. For an n by m image with stripes of width w (usually m is exactly divisible by w), it takes n steps to calculate each stripe and there are m/w stripes. As each stripe is calculated using a pipelined structure with cascaded row and column operations, the cascaded structure adds a delay of w steps. Thus, in total it takes $n \cdot m / w + w$ steps to calculate the integral image. By selecting an appropriate value of w, we can achieve the highest efficiency by trading off speed, area, and power dissipation. To store the rightmost column of each stripe, extra registers must be added to the system.
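The stripe-by-stripe procedure can be illustrated with a short software sketch. It is not the hardware design itself: it only checks that treating each stripe as an independent image and adding the stored rightmost value of each row reproduces the ordinary integral image. The function names (`integral_image`, `integral_image_by_stripes`, `total_steps`) are hypothetical. The step count assumes that the n steps per stripe, the m/w stripes, and the delay of w steps simply add up.

```python
from typing import List

Image = List[List[int]]


def integral_image(img: Image) -> Image:
    """Reference integral image: out[y][x] sums img[y'][x'] for y' <= y, x' <= x."""
    width = len(img[0]) if img else 0
    col_sums = [0] * width          # running sum of each column down to the current row
    out: Image = []
    for row in img:
        running = 0                 # running sum of col_sums along the current row
        out_row = []
        for x, pixel in enumerate(row):
            col_sums[x] += pixel
            running += col_sums[x]
            out_row.append(running)
        out.append(out_row)
    return out


def integral_image_by_stripes(img: Image, w: int) -> Image:
    """Compute the integral image stripe by stripe (stripe width w, assumed to divide m)."""
    n, m = len(img), len(img[0])
    if m % w != 0:
        raise ValueError("this sketch assumes the width is exactly divisible by w")
    result = [[0] * m for _ in range(n)]
    carry = [0] * n                 # rightmost integral value of each row so far
    for x0 in range(0, m, w):
        stripe = [row[x0:x0 + w] for row in img]
        local = integral_image(stripe)      # stripe treated as an independent image
        for y in range(n):
            for dx in range(w):
                result[y][x0 + dx] = local[y][dx] + carry[y]
            carry[y] = result[y][x0 + w - 1]  # stored for the next stripe
    return result


def total_steps(n: int, m: int, w: int) -> int:
    """Assumed step count: n steps for each of the m/w stripes plus a delay of w steps."""
    return n * (m // w) + w


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    print(integral_image_by_stripes(image, 2))
    print(total_steps(3, 4, 2))
```

The `carry` list plays the role of the extra registers that hold the rightmost column of the previous stripe. Only n values need to be kept between stripes, whatever the stripe width.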
Fig. 2. Processing of our proposed algorithm

IV. HARDWARE ARCHITECTURE

Our system mainly consists of a Control Unit (CU), a Calculation Unit (CALU), and a First In First Out structure (FIFO). Pixels of the original image are read from the Original Image Memory (OIM), and the integral image is then calculated by the CALU. Because the CALU is highly parallel and fast, while the interface width and write speed of the Integral Image Memory (IIM) are limited, an asynchronous FIFO is used to buffer the output data of the CALU. Fig. 4 is an overview of our system.
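The role of the FIFO can be illustrated with a toy software model. It is only a sketch under assumptions that are not taken from the paper: a single shared step counter stands in for the two clock domains of an asynchronous FIFO, and the rates `produced_per_step` and `written_per_step` are hypothetical parameters, as is the function name `simulate_fifo`. The model shows how values pile up when a wide producer (the CALU) feeds a narrower consumer (the IIM write port), and that they still arrive in order.

```python
from collections import deque
from typing import List, Tuple


def simulate_fifo(num_steps: int,
                  produced_per_step: int,
                  written_per_step: int) -> Tuple[int, List[int]]:
    """Toy model: return (peak FIFO occupancy, values in the order they were written)."""
    fifo: deque = deque()
    written: List[int] = []
    total = num_steps * produced_per_step
    produced = 0
    peak = 0
    while produced < total or fifo:
        # Producer side: the calculation unit emits a batch of values per step.
        if produced < total:
            for _ in range(produced_per_step):
                fifo.append(produced)
                produced += 1
        peak = max(peak, len(fifo))
        # Consumer side: the memory accepts only a few values per step.
        for _ in range(min(written_per_step, len(fifo))):
            written.append(fifo.popleft())
    return peak, written


if __name__ == "__main__":
    peak, order = simulate_fifo(num_steps=4, produced_per_step=4, written_per_step=1)
    print(peak, order == sorted(order))
```

In this model the peak occupancy grows with the gap between the two rates, which gives a rough idea of how deep a buffer would have to be for a given memory interface.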