Abstract—We live in an on-demand, on-command digital universe with data proliferating from institutions, individuals and machines at a very high rate. This data is categorized as "Big Data" due to its sheer Volume, Variety and Velocity. Most of this data is unstructured, quasi-structured or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for the present computing infrastructure to manage Big Data. Traditional data management, warehousing and analysis systems fall short of the tools needed to analyze this data. Owing to its specific nature, Big Data is stored in distributed file system architectures. Hadoop and HDFS, by Apache, are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task, as it involves large distributed file systems which should be fault tolerant, flexible and scalable. MapReduce is widely used for the efficient analysis of Big Data. Traditional DBMS techniques such as joins and indexing, and other techniques such as graph search, are used for classification and clustering of Big Data. These techniques are being adapted for use in MapReduce. In this paper we suggest various methods for addressing the problems at hand through the MapReduce framework over the Hadoop Distributed File System (HDFS). MapReduce is a minimization technique which makes use of file indexing with mapping, sorting, shuffling and finally reducing. MapReduce techniques have been studied in this paper and implemented for Big Data analysis using HDFS.

Keywords—Big Data Analysis, Big Data Management, MapReduce, HDFS
I. INTRODUCTION

"Big data" refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse.

Big data analytics is the area where advanced analytic techniques operate on big data sets. It is really about two things, Big Data and analytics, and how the two have teamed up to create one of the most profound trends in business intelligence (BI).

MapReduce by itself is capable of analysing large distributed data sets, but the heterogeneity, velocity and volume of Big Data make such analysis a challenge for traditional data analysis and management tools. A further problem is that Big Data systems typically rely on NoSQL stores, which have no Data Description Language (DDL) and limited support for transaction processing. Also, web-scale data is not universal; it is heterogeneous. For the analysis of Big Data, database integration and cleaning are much harder than in traditional mining approaches. Parallel processing and distributed computing are becoming standard procedures, yet they are nearly non-existent in RDBMSs.

MapReduce has the following characteristics: it supports parallel and distributed processing, it is simple, and its architecture is shared-nothing, built on diverse commodity hardware (large clusters). Its functions are programmed in a high-level programming language (e.g., Java, Python) and it is flexible. Query processing is done through NoSQL tools integrated with HDFS, such as Hive.
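To make these characteristics concrete, the listing below is a minimal sketch of a MapReduce job written against the Hadoop Java API; it is an illustration only, not the specific method proposed in this paper, and the class names and input/output paths are assumptions. A mapper emits (word, 1) pairs and a reducer sums the counts per word, while the framework handles the parallel, shared-nothing execution across the cluster.

```java
// Minimal word-count sketch using the Hadoop MapReduce Java API.
// Class names and argument paths are illustrative, not taken from the paper.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts received for each word after shuffling.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would typically be packaged as a JAR and submitted to the cluster, with both the input and output paths residing in HDFS.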
Traditional experience in data warehousing, reporting and online analytic processing (OLAP) differs from what advanced forms of analytics require. Organizations are therefore implementing specific forms of analytics, collectively called advanced analytics. These are a collection of related techniques and tool types, usually including predictive analytics, data mining, statistical analysis, complex SQL, data visualization, artificial intelligence and natural language processing. Database analytics platforms such as MapReduce, in-database analytics, in-memory databases and columnar data stores are used to standardize them.

A unique challenge for researchers and academicians is that large datasets need special processing systems. MapReduce over HDFS gives data scientists techniques through which the analysis of Big Data can be done. HDFS is a distributed file system architecture which encompasses the original Google File System. MapReduce jobs use efficient data processing techniques which can be applied in each of the phases of MapReduce, namely Mapping, Combining, Shuffling, Indexing, Grouping and Reducing. All these techniques have been studied in this paper for implementation in MapReduce tasks.
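As a sketch of how these phases surface in practice, the driver below shows where each one is configured in the Hadoop Java API, assuming the word-count mapper and reducer from the earlier listing: the mapper (Mapping), an optional combiner (Combining), the partitioner that routes keys during the Shuffle, grouping of values by key before each reduce call (Grouping), and the reducer (Reducing). Hadoop exposes no explicit "Indexing" phase, so that step would be realized inside the map or reduce logic itself.

```java
// Driver-level configuration mapping the MapReduce phases to Hadoop API calls.
// WordCount.TokenMapper / WordCount.SumReducer are assumed from the sketch above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PhasesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "phases demo");
    job.setJarByClass(PhasesDriver.class);

    job.setMapperClass(WordCount.TokenMapper.class);    // Mapping
    job.setCombinerClass(WordCount.SumReducer.class);   // Combining: local pre-aggregation on each mapper
    job.setPartitionerClass(HashPartitioner.class);     // Shuffling: hash keys to reducer partitions
    // Grouping: values with equal keys are grouped before each reduce() call by default;
    // a custom comparator could be supplied via job.setGroupingComparatorClass(...).
    job.setNumReduceTasks(2);                            // parallel Reducing tasks
    job.setReducerClass(WordCount.SumReducer.class);     // Reducing

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Using the reducer as the combiner is possible here only because word-count summation is associative and commutative; in general the combiner is a separate, optional class.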
II. HADOOP AND HDFS

Hadoop is a scalable, open-source, fault-tolerant virtual grid operating system architecture for data storage and processing. It runs on commodity hardware, uses HDFS, a fault-tolerant, high-bandwidth, clustered storage architecture, and runs MapReduce for distributed data processing; it works with both structured and unstructured data. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file, and the replication factor can be set at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode.
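As a hedged illustration of the per-file block size and replication settings described above, the sketch below uses the Hadoop FileSystem Java API; the path, replication factor and block size are assumptions made for the example. It creates a file with an explicit block size and replication factor, then changes the replication factor of the existing file afterwards.

```java
// Sketch: setting block size and replication per file through the HDFS Java API.
// The path, replication factor and block size are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");

    // Create the file with a replication factor of 3 and a 128 MB block size.
    short replication = 3;
    long blockSize = 128L * 1024 * 1024;
    try (FSDataOutputStream out = fs.create(file, true,
        conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
      out.writeUTF("example payload");          // HDFS files are write-once
    }

    // The replication factor can be changed later; the block size cannot.
    fs.setReplication(file, (short) 2);

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication=" + status.getReplication()
        + " blockSize=" + status.getBlockSize());
    fs.close();
  }
}
```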
Figure 2 shows the architecture of HDFS clusters implemented with Hadoop. It can be seen that HDFS has distributed the task over two parallel clusters, each with one server and two slave nodes. Data analysis tasks are distributed across these clusters.