ABSTRACT

The small-file handling problem is a big challenge in the Hadoop framework. Many approaches have been proposed and evaluated to deal with the small-file handling problem in Hadoop, and the file merging strategy is one of the most popular approaches in the literature. To deal with the small-file handling problem, this paper evaluates a merging strategy in the domain of Content-Based Image Retrieval Systems (CBIRS). CBIRS form a perfect application domain for evaluating solutions to the small-file handling problem in Hadoop because they incorporate a huge number of image files (small files). The approach used in this paper is shown to be more efficient than the solution provided by HIPI (Hadoop Image Processing Interface).

Keywords: Hadoop, Data node, Name node, HDFS client, Map reduce, small-file problem, Content-based Image Retrieval, Histogram Intersection

1. INTRODUCTION

Hadoop is among the most favorable high-performance, Java-based, open-source distributed computing platforms designed to store and process big data. Hadoop gives its best performance when handling large-sized files and consists of two components, i.e. HDFS and MapReduce. HDFS is the primary component of Hadoop, with a default data block size of 128 MB meant for managing and storing large-sized files. When the size of a file is much smaller than the default HDFS block size, efficiency is degraded. HDFS supports a master-slave architecture and follows a write-once, read-many-times pattern. One of the most important advantages of HDFS is data replication. MapReduce is regarded as the heart of Hadoop: it is a software framework, a programming model and the processing part of Hadoop that makes use of computing resources and is used to process and generate large datasets in a reliable and fault-tolerant manner. In MapReduce, a Hadoop program performs two separate and distinct tasks, in which a sequence file can be used for the input/output formats. Both job trackers and task trackers are incorporated in MapReduce. It is not necessary to write MapReduce jobs in Java, although Hadoop itself is implemented in Java.

Generally, a Hadoop deployment has three major types of machine roles: the Name node, the Data node and the client machine.
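The cost that small files impose on the Name node can be illustrated with a rough calculation. The sketch below is not from this paper; it assumes the commonly cited figure of roughly 150 bytes of Name node heap per namespace object, which varies across Hadoop versions and configurations.

```python
# Back-of-the-envelope estimate of Name node memory pressure from small files.
# ASSUMPTION: ~150 bytes of Name node heap per namespace object (file or block),
# a commonly cited rule of thumb; real usage varies by Hadoop version.

BYTES_PER_OBJECT = 150          # assumed per-file / per-block metadata cost
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size (128 MB)

def namenode_bytes(num_files: int, avg_file_size: int) -> int:
    """Rough Name node heap needed to track num_files of avg_file_size each."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

# Ten million 100 KB images stored individually vs. merged into one large file:
small = namenode_bytes(10_000_000, 100 * 1024)            # ~3 GB of heap
merged = namenode_bytes(1, 10_000_000 * 100 * 1024)       # ~1 MB of heap
```

Under this assumption, merging collapses twenty million namespace objects into a few thousand block entries, which is the motivation behind merging-based approaches.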
@ IJTSRD | Available Online @ www.ijtsrd.com | Volume – 2 | Issue – 3 | Mar-Apr 2018 Page: 1751
International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN: 2456-6470
Dong Bo et al. [5] proposed an approach which mainly consists of two aspects, prefetching and file merging; the approach, adopted by BlueSky, mainly consists of three layers: the Business Process layer, the User Interface layer and the Storage layer. Chandrasekar S et al. [6] proposed an approach similar to that of Dong et al., extending HDFS into EHDFS, which provides improved prefetching and indexing mechanisms and consists of four techniques, namely prefetching, file mapping, file merging and file extraction, in order to improve efficiency. Mansoor Ahmad Mir & Jawed Ahmad [7] proposed a solution similar to merging in which an index table is maintained for each block. Since the access time of data from memory is high compared to that from cache, a caching mechanism was used for fast retrieval of information. Shubham Bhandari et al. [8] explained the architecture of NHAR to reduce the stress on the Name node and discussed configuring the hashtable of the Name node for NHAR, in which each Name node is confined to three hashtables. Chatuporn Vorapongkitipun & Natawut Nupairoj [9] proposed an approach based on HAR known as NHAR. NHAR enhances HAR with the ability to add additional files, provides the ability to access smaller files in HDFS and also improves the memory utilization of metadata. Guru Prasad M S et al. [10] proposed two techniques, i.e. Map Combine Reduce and File Manager, for handling small files in Hadoop. The File Manager relieves the memory stress on the Name node, manages the metadata, provides a mutable property to HDFS files and distributes files to computing nodes; it performs four functions, and a separate algorithm was proposed for each function. J. Shashank et al. [11], in the case of querying an image from a database, describe content-level access which is kept protected from every person as well as from the database admin; CBIR deals with this type of issue. The paper also addresses users' concerns about data privacy and provides an experimental study of the scalability and efficiency of the algorithm, which is customizable for hierarchical data structures. S. M. Patil et al. [12] store images in the database by their features and deal with technical, large and distributed collections of scientific images which contain complex information and can be retrieved using sophisticated query-based semantics and precise measures of similarity. The paper also describes information related to the low-level properties of the image and the outputs produced. S. Murala et al. [13] proposed a method which gives the relationship between two pixels, i.e. a referenced pixel and its neighboring pixels with respect to direction, and proposed an algorithm for retrieving images for CBIRS using Local Tetra Patterns (LTP) for indexing; it also defines a standard combination between local tetra patterns and binary patterns by calculating the difference in gray level between the referenced pixel and its neighbors. Dewen Zhuang et al. [14] proposed a relevance feedback method to improve retrieval performance and efficiency, reduce the semantic gap and reduce the storage of image signatures. Swapnil et al. [15] proposed a solution for retrieving similar images with respect to a query image from a large database in a secure, effective and efficient manner with the help of Local Tetra Patterns (LTP), calculating the difference in gray level between the center pixel and its neighbors. Hinge Smita et al. [16] made use of the MapReduce framework to retrieve and extract features from images, mainly for parallel processing, returning the result in less time. B. R. Kavitha et al. [17] proposed an approach which consists of three stages for obtaining the final result of a CBIRS. The first stage involves feature extraction and storage of images. The second stage involves the implementation of the algorithm and the extraction of features from a query image. The last stage involves the retrieval of images in the order in which they match the properties of the query image. K. Kranthi Kumar & Dr. T. Venu Gopal [18] proposed matching and comparison algorithms to match and compare the features of one image with another. To calculate the similarity between features, CBIR systems used a separate algorithm for each feature: one for extraction and another for matching. Michael J. Swain & Dana H. Ballard [19] proposed Histogram Intersection in their article "Color Indexing". Ballard & Brown [20] proposed that there are three opponent color axes which are used as the color axes of the histograms, given below:

rg = r - g
by = 2*b - r - g
wb = r + g + b

where r, g and b represent the red, green and blue signals, and rg, by and wb are the three opponent color axes used by the human visual system. This paper makes use of a merging strategy for evaluating Content-Based Image Retrieval Systems using a technique known as Histogram Intersection.
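The opponent color axes and the Swain–Ballard histogram intersection measure above can be sketched as follows. The axis formulas come from the text; the 8-bin quantisation and the value range are illustrative assumptions, not taken from the paper.

```python
# Sketch of the opponent color axes and Swain-Ballard histogram intersection.
# Axis formulas follow the text: rg = r - g, by = 2*b - r - g, wb = r + g + b.
# The 8-bin quantisation and [-510, 765] range are illustrative assumptions.

def opponent_axes(r, g, b):
    """Map one (r, g, b) pixel onto the three opponent color axes."""
    return (r - g, 2 * b - r - g, r + g + b)

def histogram(values, bins=8, lo=-510, hi=765):
    """Quantise a stream of axis values into a fixed-size histogram."""
    h = [0] * bins
    for v in values:
        idx = min(bins - 1, int((v - lo) * bins / (hi - lo)))
        h[idx] += 1
    return h

def intersection(query, model):
    """Swain-Ballard match score: sum of bin-wise minima over the model size."""
    return sum(min(q, m) for q, m in zip(query, model)) / sum(model)
```

Identical histograms score 1.0, and the score decreases as the color distributions diverge, which is what stage 2 of the proposed approach thresholds on.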
4. Proposed Approach

Since the above-stated approaches require a lot of pre-processing overhead, there is a need for an approach with less processing overhead and less communication cost. To achieve this, we evaluate a content-based image retrieval system by implementing a merging strategy. Our approach also overcomes the overhead incurred by the Hadoop Image Processing Interface (HIPI).

Our proposed approach is divided into two stages:

4.1 Algorithm for Stage 1:
1) Take as input an image dataset where each image is a small file.
2) Instead of processing each image file separately, extract the paths to these small files and store them in a single path file.
3) This path file is rendered on the Hadoop platform.
4) Images are read through this path file, and content-based features are extracted from the images, which make up histograms for the different files.
5) These histograms are stored as objects using SequenceFileOutputFormat.

4.2 Algorithm for Stage 2:
1) Extract the features from the input query image in a similar manner, which also makes up a histogram.
2) This input query histogram is compared with the stored histograms through the histogram intersection process.
3) Those histograms whose comparison results clear a given threshold are the output results.
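The two stages above can be sketched, minus the Hadoop plumbing, as a single-process prototype. Here a plain dictionary stands in for the SequenceFile store, and `extract_histogram` is an assumed feature-extraction callback; all names are illustrative, not from the paper.

```python
# Local, single-process sketch of the two-stage approach. The real version runs
# the feature extraction inside a Hadoop mapper and persists histograms via
# SequenceFileOutputFormat; here a dict stands in for the sequence file and
# extract_histogram is an assumed callback that turns an image into a histogram.

import os

def write_path_file(image_dir, path_file):
    """Stage 1, steps 1-2: collect paths of all small image files into one file."""
    with open(path_file, "w") as out:
        for name in sorted(os.listdir(image_dir)):
            out.write(os.path.join(image_dir, name) + "\n")

def build_histogram_store(path_file, extract_histogram):
    """Stage 1, steps 3-5: read images via the path file, store their histograms."""
    store = {}
    with open(path_file) as f:
        for path in f.read().splitlines():
            store[path] = extract_histogram(path)
    return store

def query(store, query_hist, threshold):
    """Stage 2: keep images whose histogram intersection clears the threshold."""
    def score(h):
        return sum(min(a, b) for a, b in zip(query_hist, h)) / sum(h)
    return [p for p, h in store.items() if score(h) >= threshold]
```

Because the mapper only ever reads one small path file plus the images it lists, the per-file overhead of opening many tiny HDFS files is avoided, which is the point of the merging strategy.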
5. EXPERIMENTS

No of image files:                     300
No of mappers:                         4 in total (2 for the first-level configuration and 2 for the second-level configuration)
No of bytes read / written:            88250 / 360600
Map input / output / spilled records:  100 / 0 / 0
Input split bytes:                     112
Total committed heap usage (bytes):    11867648
The dataset used for this experiment is BSDS300, the "Berkeley Segmentation Dataset", consisting of 300 images. The images are divided into a test set of 100 images and a training set of 200 images. Half of the segmentations were obtained by presenting the grayscale image and the other half by presenting the color image. The experiment was run on Hadoop 2.0 with CentOS 7, 8 GB RAM, an Intel Core i7 2.00 GHz processor, a 1 TB hard disk and Java JDK 1.8.0. For performing image processing tasks in the distributed environment, a Hadoop library called HIPI is used, which provides application programming interfaces (APIs) in which multiple files are read by a single mapper; however, after each culling stage during processing, one mapper reads one image file, so the problem is still there. Instead of processing each image file separately, our approach extracts the paths to these small files, stores them in a single file and renders it on the Hadoop platform. Images are read through this path file by a single mapper, content-based features are extracted from the images, making up histograms for the different image files, and these histograms are stored as objects using SequenceFileOutputFormat, which increases efficiency and solves the problem of classical HIPI. Since multiple image files are read by a single mapper, the MapReduce paradigm can be used efficiently.

6. Conclusion and Future Scope

CBIRS is a challenge. The Hadoop platform is best suited for handling large-sized files. Image files are small files, and in order to analyze the contents of various
images, histograms are made which store the color information of the images; the histograms are output in sequence-file output format, i.e. stored as objects. The input query image is processed in the same manner, making up a histogram, and the input query histogram is compared with the stored histograms through the histogram intersection process. In the small-file handling problem and in HIPI, one mapper reads one image file, which is a limitation. This paper therefore makes use of SequenceFileOutputFormat, in which multiple files are read by a single mapper, due to which the time for reading from the hard disks is lowered and the processing time is reduced to a great extent.

Audio and video files also come under the category of small files. As future work, we can explore these types of files too, since they will also suffer the performance issues faced by small files stored in HDFS.

REFERENCES
1. Scholar, U. G. A Selective Approach for Storing Small Files in Respective Blocks of Hadoop.
2. Roshini, R., & Raikar, M. Map Reduce based Analysis of Live Website Traffic Integrated with Improved Performance for Small Files using Hadoop.
3. Jayakar, K. P., & Gaurav, Y. B. Efficient Way for Handling Small Files using Extended HDFS. International Journal of Computer Science & Mobile Computing, June 2014.
4. Gupta, B., Nath, R., & Gopal, G. An Efficient Approach for Storing and Accessing Small Files with Big Data Technology. International Journal of Computer Applications, 2016.
5. Dong, Bo, et al. "A Novel Approach to Improving the Efficiency of Storing & Accessing Small Files on Hadoop: A Case Study by PPT Files." International Conference on Services Computing, 2010.
6. Chandrasekar, S., et al. "A Novel Indexing Scheme for Efficient Handling of Small Files in Hadoop Distributed File System." International Conference on Computer Communication & Informatics (ICCI), 2013.
7. Mir, M. A., & Ahmad, J. "An Optimal Solution for Small File Problem in Hadoop." International Journal of Advanced Research in Computer Science, 2017.
8. Shubham Bhandari, et al. An Approach to Solving a Small File Problem in Hadoop by Using Dynamic Merging & Indexing Scheme. International Journal on Recent & Innovation Trends in Computing & Communication.
9. Chatuporn Vorapongkitipun & Natawut Nupairoj. "Improving Performance of Small-File Accessing in Hadoop." 11th International Joint Conference on Computer Science & Software Engineering (JCSSE), 2014.
10. Prasad, G., Nagesh, H. R., & Deepthi, M. Improving the Performance of Processing for Small Files in Hadoop: A Case Study of Weather Data Analytics. International Journal of Computer Science & Information Technology, 2014.
11. Shashank, J. "Content-Based Image Retrieval Using Color, Texture & Shape Features." International Conference on Advanced Computing & Communications (ICACC), 2007.
12. Patil, Shankar M. "Content-Based Image Retrieval using Color, Texture & Shape." International Journal of Computer Science & Engineering Technology (IJCSET), 2012.
13. Murala, S., Maheswari, R. P., & Balasubramanian, R. "Local Tetra Patterns: A New Feature Descriptor for CBIR." IEEE Transactions on Image Processing, 2012.
14. Kusuma, B., & Arakeri, Megha P. "Survey on Content-Based Image Retrieval using MapReduce over Hadoop." International Journal of Advances in Electronics & Computer Science, 2015.
15. Smita, H., Monika, G., & Shraddha, C. Content-Based Image Retrieval Using Hadoop Map Reduce. International Journal of Computer Science Trends & Technology, 2014.
16. Kavitha, B. R., Kumaresan, P., & Govindaraj, R. Content-Based Image Retrieval using Hadoop & HIPI, 24 Dec 2017.
17. Kavitha, B. R., Kumaresan, P., & Govindaraj, R. Content-Based Image Retrieval using Hadoop and HIPI.
18. Kranthi Kumar, K., & Venu Gopal, T. "CBIR: Content-Based Image Retrieval." Conference paper, 2014.
19. Swain, M. J., & Ballard, D. H. Color Indexing. International Journal of Computer Vision, 1991.
20. Ballard, D. H., & Brown, C. M. Computer Vision. Prentice Hall Professional Technical Reference, 1982.