
Big Data Analysis using Apache Hadoop

Shankar Ganesh Manikandan
Department of Information Technology
Dhanalakshmi College of Engineering
Tambaram, Chennai, India
7401557984
shankarganesh25.m@gmail.com

Siddarth Ravi
Department of Information Technology
Dhanalakshmi College of Engineering
Tambaram, Chennai, India
9943879800
siddarth.siddu5@gmail.com

Abstract—We live in an on-demand, on-command digital universe with data proliferating from institutions, individuals and machines at a very high rate. This data is categorized as "Big Data" due to its sheer volume, variety and velocity. Most of this data is unstructured, quasi-structured or semi-structured, and it is heterogeneous in nature.
The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage Big Data. Traditional data management, warehousing and analysis systems fall short of tools to analyze this data.
Owing to its specific nature, Big Data is stored in distributed file system architectures. Hadoop and HDFS by Apache are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task, as it involves large distributed file systems which should be fault-tolerant, flexible and scalable. MapReduce is widely used for the efficient analysis of Big Data. Traditional DBMS techniques like joins and indexing, and other techniques like graph search, are used for classification and clustering of Big Data, and these techniques are being adapted for use in MapReduce.
In this paper we suggest various methods for catering to these problems through the MapReduce framework over the Hadoop Distributed File System (HDFS). MapReduce is a minimization technique which makes use of file indexing with mapping, sorting, shuffling and finally reducing. The MapReduce techniques studied in this paper are implemented for Big Data analysis using HDFS.

Keywords—Big Data Analysis, Big Data Management, MapReduce, HDFS

I. INTRODUCTION

"Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse.

Big data analytics is the area where advanced analytic techniques operate on big data sets. It is really about two things, Big Data and analytics, and how the two have teamed up to create one of the most profound trends in business intelligence (BI).

MapReduce by itself is capable of analysing large distributed data sets, but due to the heterogeneity, velocity and volume of Big Data, such analysis is a challenge for traditional data analysis and management tools. A further problem is that Big Data commonly lives in NoSQL stores, which have no Data Description Language (DDL) and only limited support for transaction processing. Also, web-scale data is not universal; it is heterogeneous. For analysis of Big Data, database integration and cleaning are much harder than in traditional mining approaches. Parallel processing and distributed computing are becoming standard procedure, yet they are nearly non-existent in RDBMSs.

MapReduce has the following characteristics: it supports parallel and distributed processing, it is simple, and its shared-nothing architecture runs on diverse commodity hardware (big clusters). Its functions are programmed in a high-level programming language (e.g. Java, Python) and it is flexible. Query processing is done through NoSQL integrated with HDFS, using the Hive tool.

Traditional experience in data warehousing, reporting and online analytic processing (OLAP) differs from what advanced forms of analytics require. Organizations are implementing specific forms of analytics, particularly called advanced analytics. These are a collection of related techniques and tool types, usually including predictive analytics, data mining, statistical analysis, complex SQL, data visualization, artificial intelligence and natural language processing. Database analytics platforms such as MapReduce, in-database analytics, in-memory databases and columnar data stores are used to standardize them.

A unique challenge for researchers and academicians is that large datasets need special processing systems. MapReduce over HDFS gives data scientists the techniques through which analysis of Big Data can be done. HDFS is a distributed file system architecture which encompasses the original Google File System. MapReduce jobs use efficient data processing techniques which can be applied in each of the phases of MapReduce, namely Mapping, Combining, Shuffling, Indexing, Grouping and Reducing. All these techniques have been studied in this paper for implementation in MapReduce tasks.

II. HADOOP AND HDFS

Hadoop is a scalable, open-source, fault-tolerant virtual grid operating system architecture for data storage and processing. It runs on commodity hardware and uses HDFS, a fault-tolerant, high-bandwidth, clustered storage architecture. It runs MapReduce for distributed data processing and works with structured and unstructured data. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file, and an application can specify the number of replicas of a file; the replication factor can be set at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode.

Figure 2 shows the architecture of HDFS clusters implemented with Hadoop. It can be seen that HDFS has distributed the task over two parallel clusters, each with one server and two slave nodes. Data analysis tasks are distributed across these clusters.
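The per-file block size and replication factor described above are exposed through the HDFS client API. The following is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem interface; the NameNode address, file path and sizes are assumed values chosen only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with the address of your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events/log-2014.txt");

        // Create a write-once file with a per-file replication factor of 3
        // and a 128 MB block size (both are configurable per file in HDFS).
        FSDataOutputStream out =
            fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("example record");
        out.close();

        // The replication factor can also be changed after creation.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}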



Figure 1. HDFS Architecture

Figure 2. HDFS Clusters

III. BIG DATA ANALYSIS


Much data today is not natively in a structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge. Figure 3, below, gives a glimpse of the Big Data analysis tools which are used for efficient and precise data analysis. For handling the velocity and heterogeneity of data, tools like Hive, Pig and Mahout are used, which are parts of the Hadoop and HDFS framework. It is interesting to note that, for all the tools used, Hadoop over HDFS is the underlying architecture. Oozie and EMR, with Flume and Zookeeper, are used for handling the volume and veracity of data; these are standard Big Data management tools. Together, these layers with their specified tools form the bedrock of the Big Data management and analysis framework.
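As an illustration of how such query processing looks from application code, the sketch below submits a HiveQL aggregation to a Hive installation through its standard JDBC driver (HiveServer2). The host name, database, table and column names are assumptions made only for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://hiveserver:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {
            // Hypothetical table of web-log events stored as files in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}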

Figure 3. Big Data Analysis Tools


IV. MAP REDUCE
MapReduce is a programming model for processing large-scale datasets on computer clusters. The MapReduce programming model consists of two functions, map() and reduce(). Users can implement their own processing logic by specifying customized map() and reduce() functions. The map() function takes an input key/value pair and produces a list of intermediate key/value pairs. The MapReduce runtime system groups together all intermediate pairs based on the intermediate keys and passes them to the reduce() function for producing the final results.

Map: (in_key, in_value) → list(out_key, intermediate_value)
Reduce: (out_key, list(intermediate_value)) → list(out_value)

The signatures of map() and reduce() are as follows:
map (k1, v1) → list(k2, v2) and reduce (k2, list(v2)) → list(v2)
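To make these signatures concrete, the sketch below implements the canonical word-count example against Hadoop's Java MapReduce API; here k1 is a byte offset into the input, v1 is a line of text, k2 is a word and v2 is a count. The class names are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate (k2, v2) pair
            }
        }
    }
}

// reduce(k2, list(v2)) -> list(v2): sum the counts grouped under each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final output pair
    }
}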
Figure 4. Map Reduce Architecture and Working

Large-scale data processing is a difficult task: managing hundreds or thousands of processors while also managing parallelization and distributed environments makes it more difficult. MapReduce provides a solution to these issues, as it supports distributed and parallel I/O scheduling, is fault-tolerant, supports scalability, and has built-in processes for status monitoring of the heterogeneous and large datasets found in Big Data.

A. Map Reduce Components

1. Name Node – manages HDFS metadata; does not deal with files directly.
2. Data Node – stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker – schedules, allocates and monitors job execution on the slaves (Task Trackers).
4. Task Tracker – runs MapReduce operations.

Hadoop MapReduce also comes bundled with a library of generally useful mappers, reducers and partitioners.

Figure 5. Map Reduce Working through Master/Slave

B. Map Reduce Techniques

1. Prepare the Map() input – the MapReduce system designates Map processors, assigns the K1 input key value each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome; a driver sketch wiring these steps together is given below.
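Once a job is submitted, the framework carries out the five steps above; the user supplies a driver that configures the job. The sketch below wires together the illustrative WordCountMapper and WordCountReducer from earlier using the standard org.apache.hadoop.mapreduce.Job API; the input and output HDFS paths are taken from the command line, and using the reducer as a combiner is an optional local optimization drawn from the bundled library of reusable components mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Steps 1-2: the framework splits the input and runs map() per record.
        job.setMapperClass(WordCountMapper.class);
        // Optional local aggregation before the shuffle (step 3).
        job.setCombinerClass(WordCountReducer.class);
        // Step 4: reduce() runs once per intermediate key K2.
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // HDFS input and output paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Step 5: the framework collects the sorted reduce output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}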
V. CONCLUSION

The need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. Big Data analysis tools like MapReduce over Hadoop and HDFS promise to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages. For engineers building information processing tools and applications, large and heterogeneous datasets which generate continuous flows of data lead to more effective algorithms for a wide range of tasks, from machine translation to spam detection. The ability to analyse massive amounts of data may provide the key to unlocking the secrets of the cosmos or the mysteries of life. MapReduce can be exploited to solve a variety of problems related to text processing at scales that would have been unthinkable a few years ago.
