Abstract
We are living in the Internet world, where information is required on
demand, wherever and whenever it is needed. Large amounts of data in
different formats, which cannot be processed with traditional tools such
as database management systems, are termed Big Data. Hadoop MapReduce is
one of the computing tools for Big Data analytics, and the cloud provides
MapReduce as a service. In this paper we investigate and discuss the
challenges and requirements of MapReduce integrity in the cloud. We also
review some of the important integrity assurance frameworks, their
capabilities, and future research directions, and we discuss algorithms
for detecting collusive and non-collusive workers.
processing of data which could be fluctuating in terms of volume and
velocity. All these features of cloud computing have made it a tool of
choice for big-data analytics. Cloud-based big data analytics satisfies
the requirements of complicated statistical analysis, linear regression,
and predictive analysis on big data in multidisciplinary fields such as
healthcare, banking, business, government projects, academia, and many more.

Costs with respect to hardware, software upgrades, maintenance, and
network configuration can be saved when cloud-based big data analytics is
used. An enterprise can concentrate on analyzing data only, not on the
hardware or any other issues related to maintaining it.

HADOOP - HDFS AND MAPREDUCE
The technologies behind cloud computing and big data are distinct and can
operate mutually exclusively, but the cloud has a big role to play in big
data analytics. Cloud computing provides a cost-effective solution for
storing large data sets, and big data analytics can be offered as a
platform-as-a-service within a cloud environment [2, 3, 4, 15, 16]. Hadoop
can be thought of as a platform that runs on cloud computing to provide us
with distributed data mining, needed due to the rate at which data are
growing these days. This chapter gives a detailed explanation of Hadoop's
core storage component, namely the Hadoop Distributed File System, and its
computing paradigm, MapReduce.

Hadoop File system-HDFS
The Hadoop Distributed File System is the default file system; the local
file system, HFTP, FTP, S3, and other mountable distributed file systems
are also compatible with the Hadoop framework. The Google File System is
the foundation for HDFS. It is designed to work on thousands of commodity
(cheaper) machines in a reliable and fault-tolerant manner.

In the HDFS master/slave architecture, a single NameNode becomes the
master node and N (N >= 1) DataNodes become slave nodes. Metadata and
actual data are stored in the master and slave nodes respectively.
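Since clients reach HDFS through Hadoop's FileSystem API, a minimal
client-side sketch may help. This is our illustration, not from the paper;
the cluster address and file path are hypothetical. The client asks the
NameNode (master) to resolve the path, and the bytes are then streamed
from the DataNodes (slaves) that hold the blocks.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/user/data/input.txt"))))) {
            String line;
            // The NameNode supplies block locations; data flows from DataNodes.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}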
The Map task takes the input data and converts individual elements into
key/value pairs. This output becomes the input to the subsequent Reduce
task, which combines these data tuples into a smaller set of tuples [10].
By default, both input and output are saved in the HDFS file system.

MapReduce is highly scalable over thousands of nodes. A data processing
application has to be decomposed into a Mapper and a Reducer once at the
beginning, and the same decomposition is then applied to data residing on
multiple machines; this is the advantage of the MapReduce strategy. The
key idea is sending the computation to where the data resides. Map,
Shuffle, and Reduce are the three major stages of the algorithm. The
Mapper takes its input line by line (from HDFS by default) and processes
it by creating small blocks of data. The next stage is the combination of
the Shuffle and Reduce phases, and the new set of output goes back to
HDFS. During processing, the MapReduce program is sent to the specified
servers in the cluster; the cluster collects the individual results,
reduces them into the final result, and sends it to the HDFS server.

Executing the task
Input to the Mapper has to be converted into <Key, Value> pairs, and the
output is also in this form. The map function is applied to all input
pairs, and zero or more intermediate (key, value) pairs are the output of
the map function. The next stage is grouping of the pairs based on the key
of each pair; the reducer is then called once for each such group, and its
output is the final result.

MasterNode/NameNode is the node where the JobTracker runs and which
accepts job requests from clients. The JobTracker works with the NameNode,
which is the single master node. It manages all Hadoop resources, keeps
track of resource consumption, and schedules jobs to the appropriate slave
machines.

SlaveNode/DataNode is where the Map and Reduce programs run. The
TaskTracker runs on the SlaveNodes; its job is monitoring the tasks
assigned to the SlaveNodes, and re-executing failed tasks is also among
its responsibilities.
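The "job request from a client" described above is, in code, the
submission of a configured Job. The following is a minimal driver sketch
using the standard Hadoop mapreduce Job API; WordCountMapper and
WordCountReducer are our own hypothetical class names, sketched after the
word count figure below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // map phase
        job.setReducerClass(WordCountReducer.class); // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        // Submitting the job hands it to the master, which schedules the
        // map and reduce tasks onto the slave nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}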
Word Count Example
One of the well-known illustrations for understanding MapReduce is the
word count program, depicted in Fig. 3. In a newly set up Hadoop-based
big-data ecosystem, many questions still need to be answered: questions
about the security of the Hadoop ecosystem, the security of data residing
in Hadoop, secure ways of accessing the Hadoop ecosystem, how to enforce
security models, and many more.
Figure 3: Workflow of the MapReduce algorithm in a word count task, for a
given file fragmented into block 1 and block 2.
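A minimal sketch of the word count Mapper and Reducer from the workflow
above, using the standard Hadoop API; the class names are our own. The map
step emits an intermediate (word, 1) pair per token, the framework groups
the pairs by word during the shuffle, and the reduce step sums each group.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One input record is one line of the file block; emit (word, 1).
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together after the shuffle; sum them.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (word, count) pair
    }
}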
Paper [6] presents an "integrity framework for both collusive and
non-collusive mappers". This framework assumes that both the storage and
the master are trusted: the master and special limited verifier workers
execute on trusted nodes, but the mappers are not trusted. The framework
is based on replicating each mapper, which identifies non-collusive
mappers; a non-collusive malicious mapper computes its result without
contacting any other malicious nodes. The next step is identifying
collusive nodes, which communicate with other malicious nodes in order to
send the same type of (wrong) output.

It is a very tedious job to identify collusive nodes. Dedicated
verification nodes send quiz-type verification to each computing node and
verify a randomly chosen portion of the result to establish its origin;
whichever node fails to answer a quiz is considered a malicious node. Each
mapper worker prepares its intermediate result as well as an MD5 hash code
of this result; these are cached for comparing the results of replicated
tasks once both workers clear k quizzes. A future research direction is
making this framework work in the case of untrusted reducer workers. A
second possible enhancement to this work is reusing the verification
node's computation for repeated tasks, which reduces its workload.
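The MD5 caching step lends itself to a short sketch. This is our
illustration of the idea in [6], not the authors' code; the serialized
intermediate-result format is a hypothetical stand-in.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class IntermediateResultDigest {
    // Each mapper caches the MD5 digest of its serialized intermediate
    // result, so replicated executions can be compared cheaply by digest.
    public static String md5Of(String serializedIntermediateResult) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(
                serializedIntermediateResult.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws Exception {
        // Two replicas of the same map task should yield identical digests;
        // a mismatch flags one of the workers as suspicious.
        String replicaA = "(hadoop,1)(cloud,1)(hadoop,1)";
        String replicaB = "(hadoop,1)(cloud,1)(hadoop,1)";
        System.out.println(md5Of(replicaA).equals(md5Of(replicaB))); // true
    }
}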
Table 2 gives a detailed comparison of some of the algorithms for assuring
integrity in MapReduce.

Table 2, row 8: "Achieving accountable MapReduce in cloud computing" [12]
Technique: Replication / double check.
Assumptions: Cloud data is correct; workers are unaware of the existence
of the auditors who replay the tasks; the auditor group performs no
malicious actions; workers cannot reclaim their tasks till completion.
Strengths: It forces each machine to be held responsible for its behavior.
The accountability test is done by auditors, which check all machines and
detect malicious nodes: an auditor replays a task and compares the result
with the original result. Probabilistic accountability reduces the number
of records to be checked.
Limitations: If more than one auditor is involved in testing worker A's
task, the computational cost increases, and the master needs to verify the
correctness of an idle worker before selecting it as an auditor.
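The accountability test summarized above can be stated in a few lines of
code. The following is a minimal sketch under our own simplifications (a
task modeled as a pure function, results compared by equality); it is not
the implementation from [12].

import java.util.Objects;
import java.util.function.Function;

public class AuditorReplay {
    /** Returns true if the worker's recorded result survives replay. */
    public static <I, O> boolean audit(Function<I, O> task, I input, O recordedResult) {
        O replayed = task.apply(input); // the auditor replays the task
        return Objects.equals(replayed, recordedResult);
    }

    public static void main(String[] args) {
        Function<String, Integer> wordCountTask = s -> s.trim().split("\\s+").length;
        // Honest worker: the recorded result matches the replay.
        System.out.println(audit(wordCountTask, "map reduce hadoop", 3)); // true
        // Cheating worker: the recorded result fails the accountability test.
        System.out.println(audit(wordCountTask, "map reduce hadoop", 7)); // false
    }
}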
The authors of [13], Huseyin Ulusoy and others, developed a novel
framework for the computation integrity problem of MapReduce based on a
replication scheme.

In [14], Yan Ding and others proposed a framework for protecting MapReduce
against collusive attackers and assuring integrity without extra
re-computations. Analysis of an "undirected Integrity Attestation Graph"
helps to identify both collusive and non-collusive attackers. The
assumption is that both the master and the reducers are in the trusted
domain, whereas the mappers are untrusted. To verify the workers'
trustworthiness in terms of the consistency of the mappers' results, the
edges of the graph are marked with either zero or one: zero indicates that
the two mappers have given different intermediate results, and one that
they agree. The former case is termed an inconsistency pair of workers and
the latter a consistency pair.
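The marking of consistency and inconsistency pairs can be illustrated with
a small sketch. This is our own representation, with hypothetical worker
ids and serialized results; the actual analysis in [14] of how attackers
are identified from the marked graph is not reproduced here.

import java.util.HashMap;
import java.util.Map;

public class AttestationGraph {
    // Undirected edge marks keyed by an order-independent pair of worker ids.
    private final Map<String, Integer> edges = new HashMap<>();

    private static String key(String workerA, String workerB) {
        return workerA.compareTo(workerB) < 0
                ? workerA + "|" + workerB : workerB + "|" + workerA;
    }

    /** Mark the edge: 1 if both replicas agreed (consistency pair), 0 otherwise. */
    public void attest(String workerA, String workerB,
                       String resultA, String resultB) {
        edges.put(key(workerA, workerB), resultA.equals(resultB) ? 1 : 0);
    }

    public Integer mark(String workerA, String workerB) {
        return edges.get(key(workerA, workerB));
    }

    public static void main(String[] args) {
        AttestationGraph g = new AttestationGraph();
        g.attest("m1", "m2", "(a,1)(b,2)", "(a,1)(b,2)"); // consistency pair
        g.attest("m1", "m3", "(a,1)(b,2)", "(a,9)");      // inconsistency pair
        System.out.println(g.mark("m1", "m2")); // 1
        System.out.println(g.mark("m1", "m3")); // 0
    }
}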
In [14], Yan Ding and others also proposed a scheme that detects attacks
in both the map phase and the reduce phase and is not based on
replication. The first assumption of the attack model is that nodes
communicate in a cryptographically secured way, and only the master node
is assumed to be trusted. It is a probe-injection-based verification
method for achieving the integrity of MapReduce computations and also for
identifying malicious workers. "A probe consists of data injected into the
original input dataset. The result of the probe set can be pre-computed
before the entire input dataset is processed in the MapReduce system. A
probe has two attributes: data value and location. The data value is the
specific value in the computation; the location is its position in the
entire dataset after injection."
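The quoted probe definition suggests the following minimal sketch. It is
our simplification, with hypothetical record types and a stand-in
computation, not the scheme's actual implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ProbeVerifier {
    // location in the dataset after injection -> pre-computed probe result
    private final Map<Integer, String> expectedAt = new HashMap<>();

    public void inject(List<String> dataset, int location,
                       String probeValue, Function<String, String> computation) {
        dataset.add(location, probeValue);
        // The probe's result is pre-computed before the MapReduce run.
        expectedAt.put(location, computation.apply(probeValue));
    }

    /** Check the worker's output at every probe location; false => malicious. */
    public boolean verify(List<String> workerOutput) {
        return expectedAt.entrySet().stream().allMatch(
                e -> e.getValue().equals(workerOutput.get(e.getKey())));
    }

    public static void main(String[] args) {
        Function<String, String> computation = String::toUpperCase; // stand-in task
        List<String> input = new ArrayList<>(List.of("a", "b", "c"));
        ProbeVerifier v = new ProbeVerifier();
        v.inject(input, 1, "probe", computation);
        List<String> output = input.stream().map(computation).toList();
        System.out.println(v.verify(output)); // true for an honest worker
    }
}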
CONCLUSION
MapReduce has become a fault-tolerant, efficient, and scalable data
processing tool for large datasets. But when MapReduce was introduced over
public and hybrid cloud computing, addressing security and privacy became
an important concern, and since then many algorithms and frameworks have
appeared to address these sensitive issues. Some of the important
algorithms and frameworks have been surveyed here in detail, and a
comparison table is also presented. Based on the survey, we listed these
frameworks under three categories, namely hardware/check based, sampling
based, and replication based. During this survey we observed that a lot of
research still needs to be done in areas such as security issues in the
file system and trust in hardware, and not much work addresses the case
where the master is malicious or the cloud provider is untrusted.

Two or more of these algorithms can be integrated to provide a more
sophisticated and secure Big Data MapReduce analytics model in the cloud.

REFERENCES
I. Ajith Bailakare and Meenakshi, "An Introduction to Cloud Computing and
its Security Issues and Challenges - A Literature Review", IJEECS, Vol. 6,
Issue 5, 2017.

II. Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li, "Toward Scalable
Systems for Big Data Analytics: A Technology Tutorial", IEEE Access, Vol.
2, pp. 652-687, 2014.

III. Raghavendra Kune, Pramod Kumar Konugurthi, Arun Agarwal, Raghavendra
Rao Chillarige and Rajkumar Buyya, "The