
HaarFilter: A Machine Learning Tool for Image Processing in Hadoop

Miss Nausheen Khilji
M.Tech Scholar, Dept. of Computer Science Engineering
Jodhpur National University
Jodhpur, India

Mr. Shrawan Ram
Asst. Professor, Dept. of Computer Science Engineering
M.B.M. Engineering College
Jodhpur, India

Abstract - This paper presents HaarFilter, a machine learning tool for analyzing image data in JPEG/PNG formats. It addresses the big data phenomenon of the present age, produced as a side effect of image data generation at an astronomical scale [8]. We combine and harness the power of two rich open source packages, Hadoop and OpenCV [1], using the Hadoop Image Processing Interface. HaarFilter is a Java-based tool that processes and filters the required image data from a given storage location. Next, using the Haar cascading machine learning technique, it performs object detection for analysis work and stores the resultant images in the Hadoop environment in a specialized data format (HIB). Processing at such a large scale on a single machine can be very time-consuming and costly; HaarFilter can provide considerable relief when processing and performing analytics on image data. It facilitates efficient, high-throughput image processing with MapReduce-style parallel programs typically executed on a cluster, and provides a solution for storing a large collection of images on the Hadoop Distributed File System (HDFS) and making them available for efficient machine learning.

Haar feature-based cascade classification is an effective object detection method proposed by Paul Viola and Michael Jones in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features" [2]. It is a machine learning technique in which a cascade function is trained from a large number of positive and negative images; the result is then used to detect the required object in other images. Traditionally, the machine was trained by an algorithm that repeatedly processes positive images, flagged as images of faces, and negative images, flagged as images without faces, and then extracts features from them. Each feature is a single value obtained by subtracting the sum of pixels under the white region from the sum of pixels under the black region. To compute these sums efficiently, integral images are introduced: they reduce the calculation of the sum of pixels over any rectangle, however large, to an operation involving just four lookups, making detection fast. Among all these features, most are irrelevant. For example, in an image containing a human face, the first feature selected may focus on the property that the region of the eyes is often darker than the region of the nose and cheeks.
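The integral-image trick described above can be sketched in plain Java (a minimal illustration independent of OpenCV or HIPI): the entry at (x, y) stores the sum of all pixels above and to the left, so the sum over any rectangle needs only four lookups.

```java
public class IntegralImage {
    // Build the integral image: ii[y][x] holds the sum of all pixels in the
    // rectangle from (0,0) to (x-1,y-1) of the source image.
    static long[][] build(int[][] img) {
        int h = img.length, w = img[0].length;
        long[][] ii = new long[h + 1][w + 1];
        for (int y = 1; y <= h; y++)
            for (int x = 1; x <= w; x++)
                ii[y][x] = img[y - 1][x - 1] + ii[y - 1][x] + ii[y][x - 1] - ii[y - 1][x - 1];
        return ii;
    }

    // Sum of pixels in the rectangle with top-left (x,y), width w, height h,
    // computed from just four entries of the integral image.
    static long rectSum(long[][] ii, int x, int y, int w, int h) {
        return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
    }

    public static void main(String[] args) {
        int[][] img = {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9}
        };
        long[][] ii = build(img);
        // A two-rectangle Haar feature: (sum under white row) - (sum under black row).
        long white = rectSum(ii, 0, 0, 3, 1); // 1+2+3 = 6
        long black = rectSum(ii, 0, 1, 3, 1); // 4+5+6 = 15
        System.out.println(white - black);    // prints -9
    }
}
```

Each `rectSum` call touches exactly four entries regardless of the rectangle's size, which is what makes evaluating thousands of features per window tractable.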

Figure 1: Haar Features

However, the same windows applied to the cheeks or anywhere else may be irrelevant. Thus, each feature must be applied to all the training images to find the best threshold that classifies the faces as positive and negative in the best possible way. This process is very time-consuming. A better idea is to have a simple method to check whether a window is a non-face region: in an image, most of the area is non-face region, so if a window fails the check it can be discarded in a single shot and never processed again, keeping the focus on regions where there may be a face. In this way, a possible face region can be checked quickly. Thus, the concept of cascading was introduced to cut down the unwanted scanning done by classifiers: features are first grouped into different stages of classifiers, which are then applied one by one. A window that passes all stages is a face region.
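The stage-by-stage rejection can be illustrated with a toy Java sketch (the stage predicates and thresholds below are invented for illustration, not trained values): a window must pass every stage in order, and the first failing stage rejects it immediately, so most windows never reach the expensive later stages.

```java
import java.util.List;
import java.util.function.Predicate;

public class CascadeDemo {
    // A window survives only if it passes every stage in order; the first
    // failing stage rejects it immediately (early exit).
    static boolean passesCascade(double[] window, List<Predicate<double[]>> stages) {
        for (Predicate<double[]> stage : stages) {
            if (!stage.test(window)) return false; // rejected: stop scanning here
        }
        return true; // passed all stages: candidate face region
    }

    public static void main(String[] args) {
        // Toy stages: each checks one made-up feature value against a threshold.
        List<Predicate<double[]>> stages = List.of(
            w -> w[0] > 0.2,  // cheap early stage: most non-face windows fail here
            w -> w[1] > 0.5,
            w -> w[2] > 0.8   // expensive late stage: reached by few windows
        );
        System.out.println(passesCascade(new double[]{0.9, 0.9, 0.9}, stages)); // true
        System.out.println(passesCascade(new double[]{0.1, 0.9, 0.9}, stages)); // false
    }
}
```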
There are two applications in the OpenCV library to train a cascade, opencv_haartraining and opencv_traincascade. Local Binary Pattern features have no mode parameter, while Haar-like features have a BASIC (default)/CORE/ALL parameter that selects the type of Haar feature set used in training: BASIC uses only upright features, while ALL uses the full set of upright and 45-degree rotated features. We have therefore selected Haar-like features to train our machine for human face detection.

We have multiple ways to acquire digital data in complex forms from the real world: digital cameras, scanners, computed tomography, and magnetic resonance imaging, to name a few. In every case, what we (humans) see are images; when transferring them to our digital devices, however, what we record are numerical values for each point of the image.
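Cascade training with Haar-like features, as described above, is typically driven from the command line; a representative invocation looks like the following (all paths, window sizes, and sample counts here are placeholders, to be adapted to the actual training set):

```shell
# Pack positive (face) samples into a .vec file; positives.txt lists the
# annotated face images.
opencv_createsamples -info positives.txt -vec faces.vec -w 24 -h 24 -num 1000

# Train a Haar cascade; -mode ALL enables the 45-degree rotated features
# in addition to the upright set.
opencv_traincascade -data cascade_out/ -vec faces.vec -bg negatives.txt \
    -numPos 900 -numNeg 500 -numStages 20 \
    -featureType HAAR -mode ALL -w 24 -h 24
```

The trained stages are written to the `-data` directory as an XML cascade file that OpenCV's detector can load at run time.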
We will demonstrate human face detection using Haar cascading with the Java-based tool HaarFilter. To present the desired results in the most relevant and effective way possible, the tool should be able to adapt the way it interacts with data when it encounters different objects.
HaarFilter performs human face detection here on top of Big Data technologies. It selects image files (JPEG/PNG only) from a given bundle of data in a local repository using a runnable Java archive, HaarFilter.jar, built on the Hadoop Image Processing Interface (HIPI) [4] framework. HaarFilter selects the image data, processes it for human face detection, creates a HipiImageBundle (HIB) from the resultant data, and then writes it to the system's HDFS.
To perform machine learning with the HaarFilter tool, we compiled the OpenCV library along with the HIPI framework. This enables us to explore and harness the capabilities of the OpenCV library and implement data analytics of images, as a complex data type, on the Hadoop platform. OpenCV is an image processing library containing a large collection of image processing functions.

After each image in a HIB is decompressed and decoded, it is forwarded to the map tasks as a FloatImage object. This is a simple representation of the image pixel data as an array of floating-point values, one per pixel component. Before the images in a HIB are distributed to map tasks for parallel processing, a culling step takes place. The CullMapper class is an extension of the Mapper class that allows a MapReduce program to efficiently skip images that do not meet a specified set of criteria: it defines a method that uses only an ImageHeader object to decide whether an image should be processed at all, before the image pixel data is decompressed and decoded.

Modern cameras are capable of producing images with resolutions in the range of tens of megapixels. These images need to be compressed before storage and transfer. We are using the Haar transform, which can be used for image compression. It was proposed in 1910 by the Hungarian mathematician Alfréd Haar [5]. It has been found very effective in applications such as signal and image compression in electrical and computer engineering, as it provides a simple and computationally efficient approach to analysing the local aspects of a signal.
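The header-only culling decision described above can be sketched independently of HIPI (the Header class and field names below are hypothetical stand-ins for HIPI's ImageHeader): the predicate sees only cheap metadata, so rejected images are never decompressed or decoded.

```java
public class CullDemo {
    // Hypothetical stand-in for the metadata HIPI exposes via ImageHeader.
    static class Header {
        final String format;      // e.g. "jpeg", "png", "gif"
        final int width, height;
        Header(String format, int width, int height) {
            this.format = format; this.width = width; this.height = height;
        }
    }

    // Return true if the image should be SKIPPED (culled) before decoding:
    // here, anything that is not JPEG/PNG, or is too small to contain a
    // detectable face (the 64-pixel floor is an illustrative choice).
    static boolean cull(Header h) {
        boolean supported = h.format.equals("jpeg") || h.format.equals("png");
        boolean bigEnough = h.width >= 64 && h.height >= 64;
        return !(supported && bigEnough);
    }

    public static void main(String[] args) {
        System.out.println(cull(new Header("jpeg", 1024, 768))); // false: process it
        System.out.println(cull(new Header("gif", 1024, 768)));  // true: skip it
        System.out.println(cull(new Header("png", 32, 32)));     // true: too small
    }
}
```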
The Haar transform is derived from the Haar matrix. The basic idea is to transfer the image into a matrix in which each element represents a pixel in the image. For example, a 256×256 matrix is saved for a 256×256 image. JPEG image compression involves cutting the original image into 8×8 sub-images, each an 8×8 matrix. The equation of the Haar transform is Bn = Hn An Hn^T, where An is an n×n matrix and Hn is the n-point Haar transform matrix. The inverse Haar transform is An = Hn^T Bn Hn [6].
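For the smallest case, n = 2, the transform pair can be checked by hand; the sketch below (plain Java, for illustration only) forms the 2-point Haar matrix, applies B = H A H^T, and inverts it with A = H^T B H.

```java
public class HaarTransform {
    static final double S = 1.0 / Math.sqrt(2.0);
    // 2-point Haar matrix: the first row averages, the second row differences.
    static final double[][] H = {{S, S}, {S, -S}};

    // Multiply two 2x2 matrices.
    static double[][] mul(double[][] a, double[][] b) {
        double[][] c = new double[2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static double[][] transpose(double[][] m) {
        return new double[][]{{m[0][0], m[1][0]}, {m[0][1], m[1][1]}};
    }

    // Forward transform: B = H A H^T
    static double[][] haar(double[][] a) { return mul(mul(H, a), transpose(H)); }

    // Inverse transform: A = H^T B H
    static double[][] inverseHaar(double[][] b) { return mul(mul(transpose(H), b), H); }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = haar(a);
        // b[0][0] carries the overall average energy; the remaining detail
        // coefficients are small, which is what makes B compressible.
        System.out.println(b[0][0]); // ~5.0, i.e. (1+2+3+4)/2
        double[][] back = inverseHaar(b);
        System.out.println(Math.abs(back[1][1] - 4.0) < 1e-9); // true: A recovered
    }
}
```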
The performance of the HipiImageBundle (HIB) has been studied against three common alternatives for storing images on the HDFS: as individual files, in a SequenceFile, or in the Hadoop Archive (HAR) file format. The comparison measured the running time of a MapReduce program that essentially performs a "nop" (empty tasks), i.e. the time it takes simply to read and decode the images from the HDFS and present them to the mapper [9].

Figure 2: Haar Training

HIPI (Hadoop Image Processing Interface) is a library designed to provide efficient and high-throughput image processing in Hadoop's MapReduce parallel programming framework. The three most frequently used classes in HIPI are HipiImageBundle (HIB), for representing a collection of images on the HDFS; FloatImage, for representing a decoded image in memory; and the CullMapper class. HaarFilter generates its result as a HIPI Image Bundle (HIB), a collection of processed images stored together in one file, somewhat analogous to .tar files in UNIX. HIBs are implemented via the HipiImageBundle class and can be used directly in the Hadoop MapReduce framework for further analytics work. HIBs support several useful operations such as merging, appending, and iteration.

Figure 3: HaarFiltering


We have proposed the working of a Java-based tool, HaarFilter, which detects human faces using the Haar cascading technique in JPEG/JPG/PNG images from a given bundle of data.

For future endeavors, some research areas are: analysis of medical/satellite data and images, mask operations on matrices, bulk image processing such as blending and matching, analysis of real-time image feeds, analysis of real-time and prerecorded video data, device calibration and 3D reconstruction, machine learning with Support Vector Machines (SVM) and optimization, Graphics Processing Unit (GPU) based computations, and visualization and animation of analysis work.


Figure 4: Human Face Detection

Figure 6: Culling vs. Non-Culling

HaarFilter uses the Hadoop Image Processing Interface (HIPI) framework to encapsulate the resultant data into the Hadoop Image Bundle (.hib) file format. On examination, we found a significant improvement in performance when processing images from a .hib on the Hadoop Distributed File System (HDFS). When trying to process 10,000 or more individual files, MapReduce timed out even in its setup phase, so HIBs proved advantageous over storing many small files with other techniques. As a machine learning tool, HaarFilter can in future be used for the detection of various other object types, with the potential advantage of Big Data technology.
In an experiment, we measured the time taken to compute the average image over a collection of photographs downloaded from Flickr, captured with a Canon PowerShot S500 digital camera at a resolution of 2592x1944. The cull() method achieves this goal, as shown by the following graph, which plots the running times (in seconds, along the y-axis) as a function of the input size (in number of images, along the x-axis) for different input sizes [7].
Figure 5: HIB Performance

[2] Paul Viola and Michael Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", Conference on Computer Vision and Pattern Recognition, 2001, pp. 1-3.
[4] Hadoop
[5] Alfred Haar, "Zur Theorie der orthogonalen Funktionensysteme", Mathematische Annalen 63 (3): 331-371, doi:10.1007/BF01456326.
[8] Roger E. Bohn and James E. Short, "How Much Information? 2009 Report on American Consumers", Tech. rep., Global Information Industry Center, University of California, San Diego (2009).
[9] Baodong Jia, Tomasz Wiktor Wlodarczyk, and Chunming Rong, "Performance Considerations of Data Acquisition in Hadoop System", 2nd IEEE International Conference on Cloud Computing Technology and Science, pp. 1-5 (2010).
[10] Bret Swanson and George Gilder, "Estimating the Exaflood", Tech. rep., Discovery Institute, Seattle, Washington (2008).