
PARALLEL PROCESSING OF IMAGES FOR FEATURE EXTRACTION

A
PROJECT REPORT
On
PARALLEL PROCESSING OF IMAGES FOR FEATURE EXTRACTION
Submitted in partial fulfilment of the requirements for the award of the degree

OF
BACHELOR OF TECHNOLOGY
In
SOFTWARE ENGINEERING
By
PIYUSH TYAGI [REG. NO.: 1201210039]
SWATHISH THIYAGU [REG. NO.: 1201210031]
Under the guidance of

Mrs. M.S.ABIRAMI
(Assistant Professor, Department of Software Engineering)

FACULTY OF ENGINEERING AND TECHNOLOGY


DEPARTMENT OF SOFTWARE ENGINEERING
SRM UNIVERSITY, SRM Nagar, Kattankulathur – 603203
Kancheepuram District, Tamil Nadu
MAY 2016
BONAFIDE CERTIFICATE

Certified that this project report titled “PARALLEL PROCESSING OF IMAGES FOR
FEATURE EXTRACTION” is the bonafide work of PIYUSH TYAGI (Reg. No: 1201210039)
and SWATHISH THIYAGU (Reg. No: 1201210031), who carried out the project under my
supervision. Certified further that, to the best of my knowledge, the work reported herein does
not form part of any other project report or dissertation on the basis of which a degree or award
was conferred on an earlier occasion to this or any other candidate.

Signature of the Guide
Mrs. M. S. Abirami
Assistant Professor
Department of Software Engineering
SRM University
Kattankulathur - 603203

Signature of the HOD
Dr. C. Lakshmi
Professor & Head
Department of Software Engineering
SRM University
Kattankulathur - 603203

Submitted for Project Work Viva-Voce Examination held on ......................

PLACE: Kattankulathur

DATE:

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

During the entire tenure of the project we have been fortunate to receive considerable support and
guidance from a number of people who have made positive contributions to its final outcome.

We take this opportunity to thank our Director (E&T), Dr. C. MuthamizhChelvan, for providing us with
the excellent infrastructure required for the development of our project.

We would also like to thank Dr. C. Lakshmi, Professor and Head of the Department of Software
Engineering, for providing us with a conducive ambience for developing our project.

We specially thank our project guide, Mrs. M. S. Abirami, Assistant Professor, Department of Software
Engineering, who has guided us right from the initial stages of the project to its final days. Our guide
provided us with the knowledge, support and motivation to carry out this project with utmost dedication.
We are grateful for the many opportunities she gave us to broaden our perspective, and we gratefully
acknowledge her leadership and brilliant ideas throughout the project.

We also convey our gratitude to all other members of the project panel and to those who have contributed
to this project directly or indirectly.

Last but not least, we would like to thank our parents, friends and acquaintances for their help and
constant support throughout the project.
ABSTRACT

Processing a large set of images on a single machine can be very time consuming and costly. This
project presents an image processing library, HIPI (Hadoop Image Processing Interface), designed to
be used with Apache Hadoop MapReduce, a software framework for sorting and processing big data in
a distributed fashion on large clusters of commodity hardware. HIPI facilitates efficient,
high-throughput image processing with MapReduce-style parallel programs typically executed on a
cluster. It provides a solution for storing a large collection of images on the Hadoop Distributed
File System (HDFS) and making them available for efficient distributed processing.
Table of Contents

i) List of Figures .......................................................................................................................... 2


ii) List of Symbols, Abbreviations or Nomenclature................................................................. 3
1. Introduction ............................................................................................................................. 5
1.1 General ......................................................................................................................... 5
1.2 Objective ...................................................................................................................... 7
1.3 Existing System ............................................................................................................ 8
1.4 Proposed System .......................................................................................................... 8
2. Literature Survey .................................................................................................................. 10
3. System Analysis ..................................................................................................................... 13
3.1 Requirements Elicitation ............................................................................................ 13
3.2 Requirement Analysis ................................................................................... 13
3.2.1 Hardware Requirement .................................................................... 13
3.2.2 Software Requirement ...................................................................... 14
3.3 Software Requirement Specification .......................................................................... 15
4. System Design ........................................................................................................................ 20
4.1 System Architecture ................................................................................................... 20
4.2 Architectural Views and Detailed Design .................................................................. 21
4.2.1 Use Case Diagram ............................................................................ 21
4.2.2 Activity Diagram .............................................................................. 22
5. Coding .................................................................................................................................... 23
5.1 Code ........................................................................................................................... 23
5.2 Screenshots ................................................................................................................. 32
6. Testing .................................................................................................................................... 38
7. Result Analysis ...................................................................................................................... 39
7.1 Experimental Results.................................................................................................. 39
7.2 Evaluation metrics ...................................................................................................... 40
8. Conclusion ................................................................................................................................... 42
9. Future Works ............................................................................................................................... 43
10. References ..................................................................................................................................... 44
11. Paper ............................................................................................................................................. 45

LIST OF FIGURES

No.   Title   Page No.

1 Individual map task of Downloader Module 6

2 Single node running the Downloader Module 7

3 Individual map task of Processor Module 8

4 Individual map task of Extractor Module 9

5 Image-based MapReduce 13

6 Architecture of Proposed System 16

LIST OF NOMENCLATURE AND ABBREVIATIONS

S.No   Abbreviation   Full Form

1. HIPI   Hadoop Image Processing Interface
2. HIB   Hadoop Image Bundle
3. HDFS   Hadoop Distributed File System
4. RGB   Red Green Blue
5. MATLAB   Matrix Laboratory
6. WLAN   Wireless Local Area Network
7. IEEE   Institute of Electrical and Electronics Engineers

CHAPTER 1
INTRODUCTION

1.1 General
With the spread of social media in recent years, a large amount of image data has been
accumulating. When the processing of this massive data resource is limited to single computers,
computational power and storage capacity quickly become bottlenecks. Alternatively, processing
tasks can typically be performed on a distributed system by dividing the task into several
subtasks. The ability to parallelize tasks allows for scalable, efficient execution of resource-
intensive applications. The Hadoop MapReduce framework provides a platform for such tasks.
When considering operations such as face detection, image classification and other types of
processing on images, there are limits on what can be done to improve the performance of single
computers so that they can process information at the scale of social media. Therefore, the
advantages of parallel, distributed processing of a large image dataset using the computational
resources of a cloud computing environment should be considered. In addition, if computational
resources can be secured easily and relatively inexpensively, then cloud computing is suitable for
handling large image datasets at very low cost and with increased performance. Hadoop, as a system
for processing large numbers of images by parallel and distributed computing, seems promising.
In fact, Hadoop is in use all over the world. Studies using Hadoop have been performed on text
data files, on analysing large volumes of DNA sequence data, on converting large numbers of still
images to PDF format, and on carrying out feature selection/extraction in astronomy. These
examples demonstrate the usefulness of the Hadoop system, which can run multiple processes in
parallel for load balancing and task management.
Most of the image processing applications that use the Hadoop MapReduce framework are
highly complex and impose a staggering learning curve. The overhead, in programmer time and
expertise, required to implement such applications is cumbersome. To address this, we present
the Hadoop Image Processing Framework, which hides the highly technical details of the
Hadoop system and allows programmers who can implement image processing algorithms, but
who have no particular expertise in distributed systems, to leverage the advanced resources of a
distributed, cloud-oriented Hadoop system. Our framework provides users with easy access to
large-scale image data, smoothly enabling rapid prototyping and flexible application of complex
image processing algorithms to very large, distributed image databases.
The Hadoop Image Processing Framework is largely a software engineering platform, with the
goal of hiding Hadoop's complexity while providing users with the ability to use the system for
large-scale image processing without becoming expert Hadoop engineers. The framework's ease
of use and Java-oriented semantics further ease the process of creating large-scale image
applications and experiments. This framework is an excellent tool for novice Hadoop users,
image application developers and computer vision researchers, allowing the rapid development
of image software that can take advantage of the huge data stores, rich metadata and global reach
of current online image sources.

1.2 Objective
Processing a large set of images on a single machine can be very time consuming and costly. HIPI
is an image processing library designed to be used with Apache Hadoop MapReduce, a software
framework for sorting and processing big data in a distributed fashion on large clusters of
commodity hardware. HIPI facilitates efficient, high-throughput image processing with
MapReduce-style parallel programs typically executed on a cluster. It provides a solution for
storing a large collection of images on the Hadoop Distributed File System (HDFS) and making
them available for efficient distributed processing.

1.3 Existing System


These days, in different fields of both industry and academia, large amounts of data are generated,
and the use of several frameworks with different techniques is essential for processing and
extracting this data. In the remote sensing field, for example, large volumes of data (satellite
images) are generated over short periods of time, yet the information systems used to process
these kinds of images were not designed with scalability in mind. An extension of the HIPI
framework (Hadoop Image Processing Interface) has been proposed for processing various image
formats.

1.4 Proposed System


The Hadoop Image Processing Framework is intended to provide users with an accessible, easy-
to-use tool for developing large-scale image processing applications. The main goals of the
Hadoop Image Processing Framework are:
• Provide an open source framework over Hadoop MapReduce for developing large-scale image
applications
• Provide the ability to flexibly store images in various Hadoop file formats
• Present users with an intuitive application programming interface for image-based operations
which is highly parallelized and balanced, but which hides the technical details of Hadoop
MapReduce
• Allow interoperability between various image processing libraries
A. Downloading and storing image data

Hadoop uses the Hadoop Distributed File System (HDFS) [15] to store files on various nodes
throughout the cluster. One of Hadoop's significant problems is that of small file storage [16]. A
small file is one which is significantly smaller than the HDFS block size. Large image datasets are
made up of great numbers of small image files, a situation HDFS has a great deal of trouble
handling. This problem can be solved by providing a container to group the files in some way.
Hadoop offers a few options:
• HAR File
• Sequence File
• Map File
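To make the container idea concrete, the minimal sketch below packs a directory of small image files into a single Hadoop SequenceFile keyed by filename, so that HDFS stores one large file instead of many tiny ones. It uses only the standard Hadoop SequenceFile API and illustrates the general approach, not the framework's own bundle format; the local input directory and HDFS output path are illustrative assumptions.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackImagesIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("/user/hduser/images.seq"); // illustrative HDFS destination

    // One container file holds many small images as filename -> encoded-bytes records
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      File[] images = new File("local-images").listFiles(); // illustrative local directory
      if (images != null) {
        for (File img : images) {
          byte[] bytes = Files.readAllBytes(img.toPath());
          writer.append(new Text(img.getName()), new BytesWritable(bytes));
        }
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}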
The Downloader Module of our Hadoop Image Processing Framework performs the following operations:
Step 1: Input a URL list. Initially, users input a file containing the URLs of the images to download.
The input list should be a text file with one image URL per line. The list can be generated by
hand, extracted from a database or provided by a search. The framework provides an
extendable ripper module for extracting URLs from Flickr and Google image searches and from
SQL databases. In addition to the list, the user selects the type of image bundle to be generated
(e.g. HAR, sequence or map). Our system divides the URLs for download across the available
processing nodes for maximum efficiency and parallelism. The URL list is split into several map
tasks of equal size across the nodes. Each node's map task generates several image bundles
appropriate to the selected input list, containing all of the image URLs to download. In the
reduce phase, the Reducer merges these image bundles into a large image bundle.
Step 2: Split URLs across nodes. From the input file containing the list of image URLs and the
type of file to be generated, we distribute the task of downloading images equally across all the
nodes in the cluster. The nodes are managed efficiently so that no memory overflow can occur,
even for terabytes of images downloaded in a single map task. This allows maximum
downloading parallelization. Image URLs are distributed among all available processing nodes,
and each map task begins downloading its respective image set.

Fig. 1. Individual map task of Downloader Module

Step 3: Download image data from URLs. For every URL retrieved in the map task, a
connection is established according to the appropriate transfer protocol (e.g. FTP, HTTP,
HTTPS). Once connected, we check the file type. Valid images are assigned InputStreams
associated with the connection. From these InputStreams, we generate new HImage objects and
add the images to the image bundle. The HImage class holds the image data and provides an
interface for the user's manipulation of image and image header data. The HImage class also
provides interoperability between various image data types (e.g. BufferedImage, Mat).
Step 4: Store images in an image bundle. Once an HImage object is received, it can be added to
the image bundle simply by passing the HImage object to the appendImage method. Each map
task generates a number of image bundles depending on the image list. In the reduce phase, all of
these image bundles are merged into one large bundle.
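To make the shape of such a map task concrete, the sketch below outlines a downloader map task under the API described above. It is a minimal illustration, not the framework's actual code: HImage and appendImage are the names used in this section, while HImageBundleWriter, its createTemporary factory, the bundle's close method and the map input/output types are assumptions made only for the sketch.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DownloaderMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

  private HImageBundleWriter bundle; // hypothetical per-task bundle writer

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Each map task writes its own temporary bundle; the reduce phase merges them (Step 4)
    bundle = HImageBundleWriter.createTemporary(context.getConfiguration());
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String url = value.toString().trim(); // one image URL per input line (Step 1)
    if (url.isEmpty()) {
      return;
    }
    // Open a connection using the appropriate transfer protocol and check the file type (Step 3)
    URLConnection conn = new URL(url).openConnection();
    String type = conn.getContentType();
    if (type == null || !type.startsWith("image/")) {
      return; // skip URLs that do not point to images
    }
    try (InputStream in = conn.getInputStream()) {
      HImage image = new HImage(in);  // wraps image data and header, as described above
      bundle.appendImage(image);      // add the image to this task's bundle (Step 4)
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the temporary bundle's location so the reducer can merge all bundles into one
    context.write(new IntWritable(1), new Text(bundle.close()));
  }
}

In the reduce phase, the temporary bundle locations emitted here would be merged into the single large bundle described in Step 4.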
B. Processing image bundle using MapReduce
Hadoop MapReduce programs handle input and output data very efficiently, but their native data
exchange formats are not convenient for representing or manipulating image data. For instance,
distributing images across map nodes requires translating images into strings and later decoding
these image strings into specified formats in order to access pixel information. This is both
inefficient and inconvenient. To overcome this problem, images should be represented in as many
different formats as possible, increasing flexibility. The framework focuses on bringing familiar
data types directly to the user. As distribution is important in MapReduce, images should be
processed on the same machine where the bundle block resides. In a generic MapReduce system,
the user is responsible for creating InputFormat and RecordReader classes to specify the
MapReduce job and distribute the input among nodes.

Fig. 2. Single node running the Downloader Module (handled by the framework and transparent to the user)

This is a cumbersome and error-prone task; the Hadoop Image Processing Framework provides
such InputFormat and RecordReader classes for ease of use. Images are distributed as various
image data types, and users have immediate access to pixel values. If pixel values are naively
extracted from the image byte formats, valuable image header data (e.g. JPEG EXIF data or
IHDR [17] image headers) is lost. The framework holds the image data in a special HImageHeader
data type before converting the image bytes into pixel values. After processing the pixel data,
image headers are reattached to the processed results. The small amount of storage overhead
required for this functionality is a trade-off worth making to preserve image header data after
processing. The functionality of the framework's Processor module is described below:
Step 1: Devise the algorithm. We assume that the user writes an algorithm which extends the
provided GenericAlgorithm class. This class is passed as an argument to the Processor module.
The framework starts a MapReduce job with the algorithm as an input. The GenericAlgorithm
holds an HImage variable; this allows the user to write an algorithm for a single image data
structure, which the framework then applies across the entire image bundle. In addition to the
algorithm, the user provides the image bundle file that needs to be processed. Depending on the
specifics of the image bundle organization and contents, the bundle is divided across nodes as
individual map tasks. Each map task applies the processing algorithm to each local image and
appends the results to the output image bundle. In the reduce phase, the Reducer merges these
image bundles into a large image bundle.
Step 2: Split the image bundle across nodes. The input image bundle is stored as blocks in the
HDFS. In order to obtain maximum throughput, the framework arranges for each map task to run
on the node where its block resides, using custom input format and record reader classes. This
allows maximum parallelization without the problem of transferring data across nodes. Each map
task then operates on the image data for which it is responsible.

Fig. 3. Individual map task of Processor Module


Step 3: Process each individual image. The processing algorithm devised by the user and provided
as input to the Processor module is applied to every HImage retrieved in the map task. The
HImage provides its image data in the format (e.g. Java BufferedImage, OpenCV Mat) requested
by the user and used by the processing algorithm. Once the image data type is retrieved,
processing takes place. After processing, the preserved image header data from the original image
is appended to the processed image. The processed image is appended to the temporary bundle
generated by the map task.
Step 4: Store processed images in an image bundle. Every map task generates an image bundle
upon completion of its processing. Once the map phase is completed, there are many bundles
scattered across the computing cluster. In the reduce phase, all of these temporary image bundles
are merged into a single large file which contains all the processed images.
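As an illustration of Step 1, a user-supplied algorithm might look like the sketch below, which simply converts each image to grayscale. GenericAlgorithm and HImage are the framework classes described in this section; the processImage method name, the getBufferedImage accessor and the HImage(BufferedImage) constructor are assumptions made only for this example.

import java.awt.image.BufferedImage;

// GenericAlgorithm and HImage are the framework classes described above;
// the processImage signature and accessors are assumed for illustration.
public class GrayscaleAlgorithm extends GenericAlgorithm {

  // Called by the framework once per image in the bundle; image headers are
  // preserved and reattached by the framework after processing.
  public HImage processImage(HImage input) {
    BufferedImage img = input.getBufferedImage(); // request a familiar image data type
    for (int y = 0; y < img.getHeight(); y++) {
      for (int x = 0; x < img.getWidth(); x++) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xff;
        int g = (rgb >> 8) & 0xff;
        int b = rgb & 0xff;
        int gray = (r + g + b) / 3;
        img.setRGB(x, y, (gray << 16) | (gray << 8) | gray);
      }
    }
    return new HImage(img);
  }
}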

Fig. 4. Individual map task of Extractor Module


C. Extracting image bundles using MapReduce
In addition to creating and processing image bundles, the framework provides a method for
extracting and viewing these images. Generally, Hadoop extracts images from an image bundle
iteratively, inefficiently using a single node for the task. To address this inefficiency, we
designed an Extractor module which extracts images in parallel across all available nodes.
Distribution plays a key role in MapReduce; we want to make effective and efficient use of the
nodes in the computing cluster. As previously mentioned in the description of the Processor
module, a user working in a generic Hadoop system must again devise custom InputFormat and
RecordReader classes in order to facilitate distributed image extraction. The Hadoop Image
Processing Framework provides this functionality for the extraction task as well, providing much
greater ease of use for the development of image processing applications. Organizing and
specifying the final location of extracted images in a large distributed task can be confusing and
difficult. Our framework provides this functionality, and allows the user simply to specify
whether images should be extracted to a local file system or reside on the Hadoop DFS. The
process of the Extractor module is explained below:
Step 1: Input the image bundle to be extracted. The image bundle specified for extraction is
passed as a parameter to the Extractor module. In addition, the user can include an optional
parameter specifying the images' final location (defaulting to the local file system). The image
bundle is then distributed across the nodes as individual map tasks. Each map task extracts its
share of images onto the specified filesystem.
Step 2: Split the image bundle across nodes. The input image bundle is split across the available
nodes using the framework's custom input format and record reader classes for maximum
throughput. Once a map task starts, HImage objects are retrieved.
Step 3: Extract each individual image. The image bytes obtained from each HImage object are
stored on the filesystem in the appropriate file format based on the image type. The Extractor
module lacks a reduce phase.
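A minimal sketch of such an extractor map task is shown below. The Hadoop FileSystem calls are standard; the HImage record type and its getName, getFileExtension and getBytes accessors are assumptions that follow this section's description, and the extractor.output.dir configuration key is illustrative.

import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class ExtractorMapper extends Mapper<NullWritable, HImage, NullWritable, NullWritable> {

  public void map(NullWritable key, HImage image, Context context)
      throws IOException, InterruptedException {
    // The target filesystem and output directory come from the job configuration;
    // FileSystem.getLocal(conf) would be used instead when extracting to the local file system.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path out = new Path(context.getConfiguration().get("extractor.output.dir"),
        image.getName() + "." + image.getFileExtension()); // accessors assumed for illustration
    try (OutputStream os = fs.create(out)) {
      // Write the original encoded bytes so the image keeps its native file format
      os.write(image.getBytes());
    }
  }
}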

CHAPTER 2
LITERATURE SURVEY

1. Title: Hadoop Image Processing Framework

Author: Sridhar Vemula, Christopher Crick


Description: With the rapid growth of social media, the number of images being uploaded to the
internet is exploding. Massive quantities of images are shared through multi-platform services
such as Snapchat, Instagram, Facebook and WhatsApp; recent studies estimate that over 1.8
billion photos are uploaded every day. However, for the most part, applications that make use of
this vast data have yet to emerge. Most current image processing applications, designed for
small-scale, local computation, do not scale well to web-sized problems with their large
requirements for computational resources and storage. The emergence of processing frameworks
such as the Hadoop MapReduce [1] platform addresses the problem of providing a system for
computationally intensive data processing and distributed storage. However, learning the
technical complexities of developing useful applications using Hadoop requires a large
investment of time and experience on the part of the developer. As such, the pool of researchers
and programmers with the varied skills to develop applications that can use large sets of images
has been limited. To address this we have developed the Hadoop Image Processing Framework,
which provides a Hadoop-based library to support large-scale image processing. The main aim of
the framework is to allow developers of image processing applications to leverage the Hadoop
MapReduce framework without having to master its technical details and without introducing an
additional source of complexity and error into their programs.

2. Title: Object Recognition In HADOOP Using HIPI

Author: Ankit Kumar Agrawal, Prof. Sivagami M.
Description: The amount of images and videos being shared by users is increasing exponentially,
but applications that perform video analytics are severely lacking or work on limited sets of data.
It is also challenging to perform analytics with low time complexity. Object recognition is the
primary step in video analytics. We implement a robust method to extract objects from data which
is in an unstructured format and cannot be processed directly by relational databases. In this
study, we present our report with results after performance evaluation and compare them with
results from MATLAB.

CHAPTER 3
SYSTEM ANALYSIS

3.1 Requirement Elicitation


Since the system is completely new, the requirements are based on the researchers' perception
and experience, and on useful literature obtained from relevant sources. An understanding of the
operation of the system to be controlled greatly contributes to the elicitation of the system
requirements.
3.2 Requirement Analysis
The requirements for our application are:
• A Linux distribution
• Apache Hadoop
• Gradle build automation
• HIPI libraries
• Image processing libraries
• Java SE 7 or newer

3.2.1. Hardware Requirement

The hardware requirements do not impose significant constraints on the system. A basic
computing platform can serve as a cluster node; a cluster of at least three nodes is preferred for
better performance.

3.2.2. Software Requirement

The system was established on a CentOS-based Linux distribution, specifically Cloudera's
distribution, which includes the repositories critical to the functioning of Hadoop. In addition,
the necessary environment has to be set up for HIPI to function properly.

3.3 Software Requirement Specification

1. Introduction

1.1 Purpose

This project addresses the problem of processing big image data on Apache Hadoop using the
Hadoop Image Processing Interface (HIPI) for storage and efficient distributed processing,
combined with OpenCV, an open-source library of rich image processing algorithms. A program
to count the number of faces in a collection of images is demonstrated.
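A hedged sketch of such a face-counting map task is given below. It assumes that OpenCV's Java bindings and native library are available on every node, that a pre-trained Haar cascade file such as haarcascade_frontalface_default.xml has been distributed alongside the job, and that each record arrives as a HIPI FloatImage as in the HelloWorld example of Chapter 5; the cascade path, the grayscale conversion and the output key choice are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.core.MatOfRect;
import org.opencv.objdetect.CascadeClassifier;

public class FaceCountMapper extends Mapper<HipiImageHeader, FloatImage, Text, IntWritable> {

  static {
    // OpenCV's native library must be loaded once per JVM on every node
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
  }

  public void map(HipiImageHeader key, FloatImage value, Context context)
      throws IOException, InterruptedException {
    if (value == null || value.getNumBands() != 3) {
      return; // skip images that failed to decode or are not RGB
    }

    int w = value.getWidth();
    int h = value.getHeight();
    float[] rgb = value.getData();

    // Convert the HIPI float image (RGB, values in [0,1]) into an 8-bit grayscale OpenCV Mat
    byte[] gray = new byte[w * h];
    for (int i = 0; i < w * h; i++) {
      float lum = 0.299f * rgb[i * 3] + 0.587f * rgb[i * 3 + 1] + 0.114f * rgb[i * 3 + 2];
      gray[i] = (byte) Math.min(255, Math.round(lum * 255));
    }
    Mat img = new Mat(h, w, CvType.CV_8UC1);
    img.put(0, 0, gray);

    // Detect faces with a pre-trained Haar cascade shipped to every node
    // (in practice the classifier would be loaded once in setup())
    CascadeClassifier detector = new CascadeClassifier("haarcascade_frontalface_default.xml");
    MatOfRect faces = new MatOfRect();
    detector.detectMultiScale(img, faces);

    // Emit the per-image face count; a reducer can sum these into an overall total
    context.write(new Text(key.toString()), new IntWritable(faces.toArray().length));
  }
}

A companion reducer would simply sum the emitted counts to obtain the total number of faces in the collection.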

1.2 Intended Audience and Reading Suggestions

All chapters and sections of this document are written primarily for the project guides and
coordinators and describe, in technical terms, the details of the functionality of the product. Both
sections of the document describe the same software product in its entirety, but are intended for
different audiences and thus use different language.

1.3 Product Scope

Processing a large set of images on a single machine can be very time consuming and costly.
HIPI is an image processing library designed to be used with Apache Hadoop MapReduce, a
software framework for sorting and processing big data in a distributed fashion on large clusters
of commodity hardware. HIPI facilitates efficient and high-throughput image processing with
MapReduce-style parallel programs typically executed on a cluster. It provides a solution for
storing a large collection of images on the Hadoop Distributed File System (HDFS) and making
them available for efficient distributed processing.

1.4 References
http://hipi.cs.virginia.edu/gettingstarted.html
http://dinesh-malav.blogspot.in/2015/05/image-processing-using-opencv-on-hadoop.html
https://cs.ucsb.edu/~cmsweeney/papers/undergrad_thesis.pdf

2. Overall Description

2.1 Product Perspective

HIPI was created to empower researchers and present them with a capable tool that enables
research involving image processing and vision to be performed extremely easily. With the
knowledge that HIPI would be used by researchers and as an educational tool, we designed HIPI
with the following goals in mind:
1. Provide an open, extendible library for image processing and computer vision applications in a
MapReduce framework.
2. Store images efficiently for use in MapReduce applications.
3. Allow for simple filtering of a set of images.
4. Set up applications so that they are highly parallelized and balanced, so that users do not have
to worry about such details.

2.2 Product Functions

Fig. 5. Image-based MapReduce

Standard Hadoop MapReduce programs handle the input and output of data very effectively, but
struggle to represent images in a format that is useful for researchers. Current methods involve
significant overhead to obtain a standard float image representation. For instance, to distribute a
set of images to a set of map nodes, a user would have to pass each image as a String and then
decode it in each map task before being able to access pixel information. This is not only
inefficient but inconvenient. These tasks create headaches for users and make the code cluttered,
so that its intent is difficult to interpret. As a result, sharing code is less efficient because the code
is harder to read and harder to debug. Our library focuses on bringing familiar image-based data
types directly to the user for easy use in MapReduce applications.

2.3 Operating Environment

Full-fledged use of the HIPI interface requires:

• A Java SDK 7 environment
• A standard installation of the Apache Hadoop Distributed File System (HDFS) and MapReduce
(tested with Hadoop v2.7.1)
• Gradle: the HIPI distribution uses the Gradle build automation system to manage compilation
and package assembly (tested with Gradle version 2.5)

2.4 Design and Implementation Constraints

With the proliferation of online photo storage and social media on websites such as Facebook,
Flickr, and Picasa, the amount of image data available is larger than ever before and growing
more rapidly every day [Facebook 2010]. This alone provides an incredible database of images
that can scale up to billions of images. Powerful statistical and probabilistic models can be built
from such a large sample source. For instance, a database of all the textures found in a large
collection of images can be built and used by researchers or artists. This information can be
incredibly helpful for understanding relationships in the world. If a picture is worth a thousand
words, we could write an encyclopedia with the billions of images available to us on the internet.

2.5 User Documentation


The official HIPI API Javadoc.

3. External Interface Requirements

3.1 Hardware Interfaces


The minimum hardware specification for a single-node Hadoop cluster is as follows
(NameNode/JobTracker/Standby NameNode nodes; the drive count will fluctuate depending on
the amount of redundancy):

• 4–6 × 1 TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1
for Apache ZooKeeper, and 1 for the Journal node)
• 2 quad-/hex-/octo-core CPUs, running at least 2–2.5 GHz
• 64–128 GB of RAM
• Bonded Gigabit Ethernet or 10 Gigabit Ethernet

4. System Features

4.1 Feature 1

HIPI facilitates efficient and high-throughput image processing with MapReduce style parallel
programs typically executed on a cluster.

4.2 Feature 2

hibDownload takes two required arguments: a path to a directory on the HDFS that contains files
with lists of image URLs and a path to a destination HIB, also on the HDFS. The program
attempts to download the entire set of image URLs (potentially using multiple download nodes
working in parallel) and stores these downloaded images in the output HIB. hibDownload also
accepts several optional arguments that allow specifying the number of download nodes and
using input files from the Yahoo/Flickr CC 100M dataset.

5. Other Nonfunctional Requirements

5.1 Performance Requirements


Performance constraints are not a major concern, as we are dealing with a distributed file system
capable of adding nodes to pre-existing clusters and issuing MapReduce tasks on them.

5.2 Safety Requirements

A backup of all the feature data should be maintained so that, in the event of data corruption, the
backup can replace the corrupt data.

5.3 Security Requirements

The feature data of images that are used during recognition must be secure.

CHAPTER 4
SYSTEM DESIGN

4.1 System Architecture

Fig. 6. Architecture of Proposed System

4.2. Architectural Views and Detailed Design


4.2.1. Use Case Diagram

Fig. 2. Use Case Diagram

CHAPTER 5
CODING
5.1. Code
HelloWorld.java

package org.hipi.examples;

import org.hipi.image.FloatImage;
import org.hipi.image.HipiImageHeader;
import org.hipi.imagebundle.mapreduce.HibInputFormat;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class HelloWorld extends Configured implements Tool {

  public static class HelloWorldMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {
    public void map(HipiImageHeader key, FloatImage value, Context context)
        throws IOException, InterruptedException {
      // Body shown in Section 5.1.1 (Computing the Average Pixel Color)
    }
  }

  public static class HelloWorldReducer extends Reducer<IntWritable, FloatImage, IntWritable, Text> {
    public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
        throws IOException, InterruptedException {
      // Body shown in Section 5.1.1 (Computing the Average Pixel Color)
    }
  }

  public int run(String[] args) throws Exception {

    // Check input arguments
    if (args.length != 2) {
      System.out.println("Usage: helloWorld <input HIB> <output directory>");
      System.exit(0);
    }

    // Initialize and configure MapReduce job
    Job job = Job.getInstance();

    // Set input format class which parses the input HIB and spawns map tasks
    job.setInputFormatClass(HibInputFormat.class);

    // Set the driver, mapper, and reducer classes which express the computation
    job.setJarByClass(HelloWorld.class);
    job.setMapperClass(HelloWorldMapper.class);
    job.setReducerClass(HelloWorldReducer.class);

    // Set the types for the key/value pairs passed to/from map and reduce layers
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(FloatImage.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // Set the input and output paths on the HDFS
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Execute the MapReduce job and block until it completes
    boolean success = job.waitForCompletion(true);

    // Return success or failure
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new HelloWorld(), args);
    System.exit(0);
  }
}

5.1.1 Computing the Average Pixel Color


map() method

public static class HelloWorldMapper extends Mapper<HipiImageHeader, FloatImage, IntWritable, FloatImage> {

  public void map(HipiImageHeader key, FloatImage value, Context context)
      throws IOException, InterruptedException {

    // Verify that the image was properly decoded, is of sufficient size, and
    // has three color channels (RGB)
    if (value != null && value.getWidth() > 1 && value.getHeight() > 1 && value.getNumBands() == 3) {

      // Get dimensions of image
      int w = value.getWidth();
      int h = value.getHeight();

      // Get pointer to image data
      float[] valData = value.getData();

      // Initialize 3-element array to hold the running RGB pixel sum
      float[] avgData = {0, 0, 0};

      // Traverse image pixel data in raster-scan order and update the running sum
      for (int j = 0; j < h; j++) {
        for (int i = 0; i < w; i++) {
          avgData[0] += valData[(j * w + i) * 3 + 0]; // R
          avgData[1] += valData[(j * w + i) * 3 + 1]; // G
          avgData[2] += valData[(j * w + i) * 3 + 2]; // B
        }
      }

      // Create a FloatImage to store the average value
      FloatImage avg = new FloatImage(1, 1, 3, avgData);

      // Divide by the number of pixels in the image
      avg.scale(1.0f / (float) (w * h));

      // Emit record to reducer
      context.write(new IntWritable(1), avg);

    } // if (value != null ...

  } // map()

} // HelloWorldMapper

reduce()

public static class HelloWorldReducer extends Reducer<IntWritable, FloatImage, IntWritable, Text> {

  public void reduce(IntWritable key, Iterable<FloatImage> values, Context context)
      throws IOException, InterruptedException {

    // Create a FloatImage object to hold the final result
    FloatImage avg = new FloatImage(1, 1, 3);

    // Initialize a counter and iterate over the IntWritable/FloatImage records from the mapper
    int total = 0;
    for (FloatImage val : values) {
      avg.add(val);
      total++;
    }

    if (total > 0) {

      // Normalize the sum to obtain the average
      avg.scale(1.0f / total);

      // Assemble the final output as a string
      float[] avgData = avg.getData();
      String result = String.format("Average pixel value: %f %f %f", avgData[0], avgData[1], avgData[2]);

      // Emit the output of the job, which will be written to HDFS
      context.write(key, new Text(result));
    }

  } // reduce()

} // HelloWorldReducer

5.2. Screenshots

Installing HIPI

Downloading images

Importing images into a HIB file

Getting information about the HIB file

Creating jar

Executing jar

Output location

Result

CHAPTER 6
TESTING

Processing a single image in MATLAB and in HIPI demonstrates that the MATLAB environment
achieves better performance.

For bundles of multiple images, on the other hand, the HIPI interface runs with a faster execution
time.

CHAPTER 7
RESULT ANALYSIS

7.1. Experimental Results

1. Processing a single image

2. Processing multiple images

CONCLUSION

This report has described our Hadoop Image Processing Framework for implementing large-scale
image processing applications. The framework is designed to abstract the technical details of
Hadoop's powerful MapReduce system and provide an easy mechanism for users to process
large image datasets. We provide software machinery for storing images in the various Hadoop
file formats and for efficiently accessing the MapReduce pipeline. By providing interoperability
between different image data types, we allow the user to leverage many different open-source
image processing libraries. Finally, we provide the means to preserve image headers throughout
the image manipulation process, retaining useful and valuable information for image processing
and vision applications. With these features, the framework provides a new level of transparency
and simplicity for creating large-scale image processing applications on top of Hadoop's
MapReduce framework. We demonstrate the power and effectiveness of the framework in terms
of performance enhancement and complexity reduction. The Hadoop Image Processing Framework
should greatly expand the population of software developers and researchers able to create
large-scale image processing applications with ease.

FUTURE WORKS

In the near future we hope to extend the framework into a full-fledged multimedia processing
framework. We would like to improve the framework to handle audio and video processing over
Hadoop with similar ease. We also intend to add a CUDA module to allow processing tasks to
make use of machines’ graphics cards. Finally, we intend to develop our system into a highly
parallelized open-source Hadoop multimedia processing framework, providing web-based
graphical user interfaces for image processing applications.

REFERENCES

[1] Ankit Kumar Agrawal and Prof. Sivagami M., "Object Recognition in HADOOP Using HIPI".
[2] Sridhar Vemula and Christopher Crick, "Hadoop Image Processing Framework", Computer
Science Department, Oklahoma State University, Stillwater, Oklahoma. Email:
sridhar.vemula@okstate.edu, chriscrick@cs.okstate.edu.
[3] hipi.cs.virginia.edu
[4] dinesh-malav.blogspot.com
[5] stackoverflow.com
[6] github.com
[7] en.wikipedia.org

