
International Journal of Software and Computer Science Engineering

Volume 4 Issue 2

Achieving Integrity Assurance of MapReduce in Cloud Computing

Meenakshi*, Ramachandra AC**, Subhrajit Bhattacharya***


*Assistant Professor, Dept. of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore
**Professor, Dept. of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore
***Dept. of Data Science, Career Launcher, Bangalore
Corresponding author's email id: kmeenarao@gmail.com
DOI: http://doi.org/10.5281/zenodo.3375498

Abstract
We live in an Internet world where information is required on demand, wherever and whenever it is needed. Large volumes of data in different formats that cannot be processed with traditional tools such as database management systems are termed Big Data. Hadoop's MapReduce is one of the computing tools for Big Data analytics, and the cloud provides MapReduce as a service. In this paper we investigate and discuss the challenges and requirements of MapReduce integrity in the cloud. We also review some of the important integrity assurance frameworks, focusing on their capabilities and future research directions, and we discuss algorithms for detecting collusive and non-collusive workers.

Keywords: Cloud Computing, Hadoop, MapReduce, Security, Integrity

INTRODUCTION
The National Institute of Standards and Technology defines cloud computing as "a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [1]. Cloud computing is a means of providing software, storage and processing as a service to users for their applications at reasonable cost. Using cloud computing, a user can connect to applications over the Internet. Pooled resources, elasticity, broad network access, accessibility, efficiency and measured services are some of the key features of the cloud. It supports storage and processing of data that may fluctuate in terms of volume and velocity.

All these features of cloud computing have made it a tool of choice for big-data analytics. Cloud-based big data analytics satisfies the requirements of complicated statistical analysis, linear regression and predictive analysis on big data in multidisciplinary fields such as health care, banking, business, government projects, academia and many more.

Costs with respect to hardware, software upgrades, maintenance and network configuration can be saved when cloud-based big data analytics is used. An enterprise can concentrate on analyzing its data only, not on the hardware or any other maintenance issues.

HADOOP: HDFS AND MAPREDUCE
The technologies behind cloud computing and big data are distinct and can operate mutually exclusively, yet the cloud has a big role to play in big data analytics. Cloud computing provides a cost-effective solution for storing large data sets, and big data analytics can be offered as a platform-as-a-service within a cloud environment [2, 3, 4, 15, 16]. Hadoop can be thought of as a platform that runs on cloud computing to provide us with distributed data mining, which is needed because of the rate at which data are growing these days.

This section gives a detailed explanation of Hadoop's core storage component, namely the Hadoop Distributed File System (HDFS), and its computing paradigm, MapReduce.

Hadoop File System (HDFS)
The Hadoop Distributed File System is Hadoop's default file system. The local file system, HFTP, FTP, S3 and other mountable distributed file systems are also compatible with the Hadoop framework. The Google File System is the foundation for HDFS. It is designed to work on thousands of commodity (cheaper) machines while remaining reliable and fault tolerant.

In the HDFS master/slave architecture, a single NameNode becomes the master node and N (N >= 1) DataNodes become the slave nodes. Metadata and actual data are stored in the master and slave nodes respectively.
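As a brief illustration of the client's view of this architecture, the following minimal sketch uses Hadoop's Java FileSystem API (a real API; the NameNode address hdfs://namenode:9000 and the file paths are our hypothetical examples). Metadata operations go through the NameNode, while block contents flow to and from the DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // The NameNode records metadata and block locations; the block
            // contents are written to DataNodes (replicated three times by default).
            fs.copyFromLocalFile(new Path("input.txt"), new Path("/data/input.txt"));
            System.out.println("stored: " + fs.exists(new Path("/data/input.txt")));
            fs.close();
        }
    }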


Figure 1: Main modules of Hadoop. Hadoop Common, Hadoop YARN, HDFS and MapReduce constitute the Hadoop framework.

The NameNode performs the following tasks:

• Determining the mapping of fragmented input files to the slave nodes in terms of a fixed block size (64 MB by default).

• Instructing the data nodes to perform block creation, deletion and replication. The replication factor is three by default.

• Making the data nodes perform read/write operations. HDFS provides commands to interact with the file system that are similar to UNIX shell commands (for example, hadoop fs -ls, hadoop fs -put and hadoop fs -cat).
How Does Hadoop Work?
The first stage is submitting the job, in which a user or an application submits a job to be processed to Hadoop's JobClient. To do so, the user specifies the locations of the input and output files and also provides the Java classes that implement the Map and Reduce algorithms. Different parameters are also given to the job as its job configuration. From the JobClient, these jar executables and the job configuration go to the JobTracker.

Tasks to be handled by the JobTracker are as follows:

• Assigning jar files and configuration to the slaves/TaskTrackers.

• Scheduling tasks.

• Monitoring tasks.

• Reporting diagnostic/status information to the JobClient.

The third and last stage is the execution phase, in which tasks are executed by the TaskTrackers on different nodes as per the MapReduce code written by the user.

Hadoop MapReduce
MapReduce comprises two components, namely the Map task and the Reduce task (Fig. 4).

The Map task, after taking the input data, converts individual elements into key/value pairs. This output becomes the input to the following Reduce task, which combines these data tuples into a smaller set of tuples [10]. By default, both input and output are saved in the HDFS file system.

MapReduce is highly scalable over thousands of nodes. A data processing application has to be decomposed into a Mapper and a Reducer once, at the beginning, and the same pair is then applied to data residing on multiple machines; this is the advantage of the MapReduce strategy. The key idea is to send the computation to the machine where the data resides. Map, Shuffle and Reduce are the three major stages of the algorithm. The Map stage, or Mapper, takes its input line by line from HDFS by default and processes it by creating small blocks of data. The next stage is the combination of the Shuffle and Reduce phases, and the new set of output goes back to HDFS. During processing, the MapReduce code is sent to the specified servers in the cluster; the cluster collects the individual results, reduces them into the final result, and sends it to the HDFS server.

Executing the Task
The input to the Mapper has to be converted into <key, value> pairs, and its output is also in this form. The map function is applied to all pairs and emits zero or more intermediate (key, value) pairs. The next stage groups the pairs by their intermediate key; the reducer is called once for each such group, and its output is the final result.

MasterNode/NameNode is the node where the JobTracker runs and which accepts job requests from clients. The JobTracker works with the NameNode, which is the single master node. It manages all Hadoop resources: it keeps track of resource consumption and also schedules the master and slave jobs to appropriate machines.

SlaveNode/DataNode is where the Map and Reduce programs run. The TaskTracker runs on the slave nodes; its job is monitoring the tasks assigned to the slave nodes, and re-executing failed tasks is an additional responsibility.

Word Count Example
One of the well-known illustrations for understanding MapReduce is the word count program, depicted in Fig. 5 (a condensed code sketch is given at the end of this subsection). In a newly set-up Hadoop-based big-data ecosystem, many questions need to be answered: the security of the Hadoop ecosystem, the security of data residing in Hadoop, secure ways of accessing the Hadoop ecosystem, how to enforce security models, and many more.
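For concreteness, the classic Hadoop WordCount program is sketched below in condensed form, following the standard Hadoop MapReduce Java API: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts grouped under each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token of the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts grouped under each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, the job is launched with HDFS input and output paths as its two arguments.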


MAPREDUCE SECURITY THREATS AND PRIVACY CHALLENGES
When the cloud provides MapReduce as a service, it has to deal with new security and privacy issues; some of these challenges are explained here. MapReduce deals with Big Data, which is large and arrives at high speed from various inputs. A single system holds a single copy of the data in one location only, whereas in MapReduce replicated splits of the data need to be transferred and stored securely. The cloud alone may not apply distributed computing to every task, but MapReduce means computing over distributed, replicated chunks of data, so both the distributed nodes and the replicated data need to be secured. An attack may yield wrong output from an affected mapper or reducer, modify the data, transfer data to a third party, and so on, and data flows occur between clouds, between storage nodes and between computing nodes. Adding security and privacy mechanisms should not burden MapReduce's working efficiency.

When MapReduce is deployed in a public cloud, it is more vulnerable to security threats than in a private deployment. Authentication, authorization and access control are essential requirements for MapReduce computational nodes. Authentication is the process of identifying an adversarial mapper, reducer or user; after successful authentication, the access privileges of mappers and reducers are checked before they are allowed to access the framework. "Availability" means that data, mappers and reducers are always available to authenticated and authorized users without much delay. Replication-based strategies provide an effective solution for detecting malicious workers, but it is difficult to identify misbehavior when all replicated tasks are handled by a collusive group. Table 1 gives a brief overview of MapReduce security threats.

Figure 2: Overview of the Hadoop Map and Reduce algorithm.


Figure 3: Workflow of the MapReduce algorithm in the word count task for a given file fragmented into block 1 and block 2.

RELATED WORK ON INTEGRITY ASSURANCE IN MAPREDUCE
"Data integrity is the assurance that data received are exactly as sent by an authorized entity, that means without containing modification, insertion, deletion, or replay."

In the SecureMR framework [5], the master is assumed to be safe and the workers are not trusted. The Distributed File System (DFS) is incorporated with integrity assurance so that workers are provided with integrity-protected data. "Each worker is having public/private key pair and any worker can generate and verify signatures and no worker can forge other's signatures." Along with the master node, the intermediate results obtained from two different mappers of the same replicated task are also checked for consistency by another worker, which provides scalability and efficiency. A commitment protocol and a verification protocol are implemented to provide security for MapReduce tasks. A future research direction is applying a sampling technique to find inconsistencies and provide integrity when all duplicated tasks are processed by collusive attackers [5]. The strategy provides an effective solution for detecting malicious workers, but it is difficult to identify them when all replicated tasks are handled by a collusive group.
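To make the replication idea concrete, the following minimal sketch (our own illustration under assumed names, not SecureMR's actual protocol) double-checks a replicated map task by comparing digests of the two replicas' intermediate outputs:

    import java.security.MessageDigest;
    import java.util.Arrays;

    // Illustration only: replication-based consistency checking in the
    // spirit of SecureMR [5]. Class and method names are hypothetical.
    public class ReplicaConsistencyChecker {

        // Digest a worker's serialized intermediate output.
        static byte[] digest(byte[] intermediateOutput) throws Exception {
            return MessageDigest.getInstance("SHA-256").digest(intermediateOutput);
        }

        // Two replicas of the same task are consistent iff their digests match.
        static boolean consistent(byte[] replicaA, byte[] replicaB) throws Exception {
            return Arrays.equals(digest(replicaA), digest(replicaB));
        }

        public static void main(String[] args) throws Exception {
            byte[] a = "word 3\ncount 2\n".getBytes();
            byte[] b = "word 3\ncount 2\n".getBytes();
            // A mismatch flags at least one of the two workers as suspicious,
            // but it cannot say which one, and it misses a collusive pair that
            // agrees on the same wrong answer (the limitation noted above).
            System.out.println("consistent = " + consistent(a, b));
        }
    }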

Table 1: Various Attacks on MapReduce

1. Impersonation attack (active). Definition: the attacker pretends to be a legal user by obtaining passwords, for example by brute-forcing weak encryption schemes. Effect on MapReduce: the affected legal user may be charged for cloud usage, and the attacker may leak data or perform wrong computations. Attack on: authentication.

2. Denial-of-Service, DoS (active). Definition: the attacker renders the system non-functional; in the MapReduce context, "system" means nodes, mappers or reducers. Effect on MapReduce: an attacked node may render other working nodes non-functional by sending repeated execution requests, and DoS causes heavy network traffic. Attack on: availability of data, mappers and reducers.

3. Replay attack (active). Definition: the adversary resends a valid message to mappers or reducers. Effect on MapReduce: keeps nodes busy; replayed authentication details can lead to impersonation and DoS. Attack on: authorization.

4. Eavesdropping (passive). Definition: observing the inputs, outputs and intermediate results of nodes without the knowledge of the computation's owner. Effect on MapReduce: the adversary gains knowledge of intensive computations. Attack on: confidentiality of computations and data.

5. Man-in-the-middle attack (active). Definition: the attacker modifies the data communicated between two nodes. Effect on MapReduce: may lead to DoS, impersonation or replay. Attack on: confidentiality of computations and data.

6. Repudiation (active). Definition: a node falsely denies an execution request. Effect on MapReduce: a mapper or reducer falsely denies the execution request of an already accomplished task. Attack on: authorization and authentication.


Paper [6] presents an "integrity framework for both collusive and non-collusive mappers". This framework assumes that both the storage and the master are trusted, and that the special, limited verifier workers execute on trusted nodes while the mappers are not trusted. It is based on replicating each mapper, which identifies non-collusive mappers: a non-collusive malicious mapper computes its result without contacting any other node. The next step is identifying collusive nodes, which communicate with other malicious nodes in order to send the same type of wrong output.

It is a very tedious job to identify collusive nodes. Dedicated verification nodes send quiz-type verification to each computing node, and a portion of each result is verified randomly; whichever node fails to answer a quiz is considered a malicious node. Each mapper worker prepares its intermediate result as well as an MD5 hash code of this result, and these are cached for obtaining the result of the replicated task. Once both workers clear k quizzes, their results are accepted. One future research direction is making this framework work in the case of untrusted reducer workers; a second possible enhancement is reusing the verification node's computation of repeated tasks, which reduces its workload. A minimal sketch of the quiz idea follows; Table 2 then gives a detailed comparison of some of the algorithms for assuring integrity in MapReduce.
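As a rough sketch of the quiz idea (our illustration; the Quiz class and the worker interface are hypothetical, not the framework's actual code), the verifier holds tasks whose correct answers it precomputed, and a worker is flagged as malicious the moment it answers one wrongly:

    import java.util.function.Function;

    // Illustration only: quiz-type verification in the spirit of [6].
    public class QuizVerifier {

        // A quiz is a small task whose correct result the verifier precomputed.
        static final class Quiz {
            final String input;
            final String expectedOutput;
            Quiz(String input, String expectedOutput) {
                this.input = input;
                this.expectedOutput = expectedOutput;
            }
        }

        // A worker is trusted only after answering all k quizzes correctly;
        // any wrong answer marks it as malicious.
        static boolean passesQuizzes(Function<String, String> worker, Quiz[] quizzes) {
            for (Quiz q : quizzes) {
                if (!q.expectedOutput.equals(worker.apply(q.input))) {
                    return false; // failed a quiz: flag as malicious
                }
            }
            return true;
        }
    }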

Table 2: MapReduce Integrity Assurance Frameworks

1. "VC3: Trustworthy Data Analytics in the Cloud using SGX" [7]. Verification scheme: hardware/checkpoint based. Attack model: physical processors ensure the integrity of a memory region of the system. Concept: the user uploads encrypted MapReduce code to work on an encrypted file; a key-exchange protocol is executed to decrypt the MapReduce code inside the workers, and the result is encrypted again. Future research directions: tampering with processor work packages, DoS attacks, traffic analysis and fault injection need to be addressed.

2. "TMR: Towards a Trusted MapReduce Infrastructure" [8]. Verification scheme: hardware/checkpoint based. Attack model: the computing infrastructure is trusted, and the master can be verified periodically by a third party. Concept: a Trusted Platform Module performs remote attestation of the workers, and the programs loaded onto the workers are checked for reliability. Future research directions: the framework can be validated on large-scale infrastructure and real-life practical workloads.

3. "Towards Trusted Services: Result Verification Schemes for MapReduce" [9]. Verification scheme: watermarking/sampling based. Attack model: MapReduce workers cannot decide whether or not the input contains injected data. Concept: the data is first preprocessed by the verifier, which injects watermarks and uses them to verify the correctness of the MapReduce result; random sampling is also applied. Future research directions: besides "text-intensive" tasks, "numerical data-intensive" tasks also need to be considered.

4. "VIAF: Verification-based Integrity Assurance Framework for MapReduce" [6]. Verification scheme: watermarking/sampling based. Attack model: both the storage and the master are trusted. Concept: dedicated verification nodes send quiz-type verification to each computing node and randomly verify a portion of each result; whichever node fails to answer a quiz is considered malicious. Future research directions: making the framework work in the case of untrusted reducer workers, and reusing the verification node's computation of repeated tasks to reduce its workload.

5. "A Result Verification Scheme for MapReduce" [10]. Verification scheme: watermarking/sampling based. Attack model: the master is trusted, and computation providers are malicious without disturbing the accuracy level of the output. Concept: in preprocessing, a secondary cluster is formed with an equal range of data for every mapper, and a small fraction of the total number of workers is selected for verification. Future research directions: some workers may not be verified at all; randomized worker selection can be improved further for complex computations.

6. "SecureMR: A Service Integrity Assurance Framework for MapReduce" [5]. Verification scheme: replication/double-check based. Attack model: only the master is trusted; all workers are in an untrusted domain. Concept: a commitment protocol and a verification protocol are implemented to provide security for MapReduce tasks; along with the master node, the intermediate results obtained from two different mappers of the same replicated task are also checked for consistency by another worker. Future research directions: a sampling technique to be applied to find inconsistencies when all duplicated tasks are processed by collusive attackers.

7. "Distributed Results Checking for MapReduce in Volunteer Computing" [11]. Verification scheme: replication/double-check based. Attack model: the master is trusted and the workers are not; all workers are non-collusive and independent. Concept: the design of "result certification" for MapReduce computation in desktop grids is addressed using a "majority voting method"; verification is decentralized, involving not only the master but also the workers. In the naming scheme, workers attach a key to each computed result, which helps the reducers check the origin of a result. Future research directions: collusive workers may also be present in the system and produce erroneous results; the framework can be redesigned to address this issue.

8. "Achieving Accountable MapReduce in Cloud Computing" [12]. Verification scheme: replication/double-check based. Attack model: the cloud data is correct; workers are unaware of the existence of the auditors, who replay tasks; the auditor group performs no malicious actions; workers cannot reclaim a task until completion. Concept: each machine is forced to be held responsible for its behavior; an accountability test is performed by auditors, which check all machines and detect malicious nodes, replaying a task and comparing the result with the original result; probabilistic accountability reduces the number of records to be checked. Future research directions: if more than one auditor is involved in testing a worker's task, the computational cost increases, and the master needs to verify the correctness of an idle worker before selecting it as an auditor.

The authors of [13], Huseyin Ulusoy and others, developed a novel framework for the computation integrity problem of MapReduce based on a replication scheme. In [14], Yan Ding and others proposed a framework for protecting MapReduce against collusive attackers and assuring integrity without extra re-computations. Analysis of an "undirected Integrity Attestation Graph" helps to identify both collusive and non-collusive attackers. The assumption is that both the master and the reducers are in the trusted domain, whereas the mappers are untrusted. To verify workers' trustworthiness in terms of the consistency of the mappers' results, the edges of the graph are marked either zero or one: zero indicates that the two mappers have produced different intermediate results, and one indicates otherwise. The former case is termed an inconsistency pair of workers and the latter a consistency pair.
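A minimal sketch of such a graph (our own illustration; the paper's actual construction and analysis are more involved) marks each tested pair of mappers and reads off the workers that sit on an inconsistency edge:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustration only: an undirected integrity attestation graph in the
    // spirit of [14]. Edge value 1 = consistency pair (same intermediate
    // result), 0 = inconsistency pair (different results), -1 = untested.
    public class IntegrityAttestationGraph {
        private final int[][] edge;

        IntegrityAttestationGraph(int workers) {
            edge = new int[workers][workers];
            for (int[] row : edge) Arrays.fill(row, -1);
        }

        void mark(int a, int b, boolean consistent) {
            edge[a][b] = edge[b][a] = consistent ? 1 : 0;
        }

        // At least one endpoint of every 0-edge must be malicious, so any
        // worker on a 0-edge is suspicious. Collusive groups need deeper
        // graph analysis, since their internal edges all read 1.
        List<Integer> suspiciousWorkers() {
            List<Integer> out = new ArrayList<>();
            for (int a = 0; a < edge.length; a++) {
                for (int b = 0; b < edge.length; b++) {
                    if (edge[a][b] == 0) { out.add(a); break; }
                }
            }
            return out;
        }
    }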

In [14], Y. Ding and others also proposed a scheme to detect attacks in both the map phase and the reduce phase that is not replication based. The attack model assumes that nodes communicate in a cryptographically secured way and that only the master node is trusted. It is a probe-injection-based verification method for achieving integrity of MapReduce computations and also for identifying malicious workers. "A probe consists of data injected into the original input dataset. The result of the probe set can be pre-computed before the entire input dataset is processed in the MapReduce system. A probe has two attributes: data value and location. The data value is the specific value in the computation; the location is its position in the entire dataset after injection."
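A rough sketch of probe verification follows (our illustration under assumed names, not the authors' code): the verifier remembers each probe's value and location together with its precomputed expected result, and compares them against the job's output.

    import java.util.List;
    import java.util.Map;

    // Illustration only: probe-injection verification in the spirit of [14].
    public class ProbeVerifier {

        // A probe has two attributes: its data value and its position
        // (location) in the dataset after injection.
        static final class Probe {
            final String value;
            final long location;
            final long expectedCount; // result precomputed before the job runs
            Probe(String value, long location, long expectedCount) {
                this.value = value;
                this.location = location;
                this.expectedCount = expectedCount;
            }
        }

        // After the job completes, the output for every probe value must match
        // its precomputed expectation; a mismatch implicates the workers that
        // processed the probe's location.
        static boolean verify(List<Probe> probes, Map<String, Long> jobOutput) {
            for (Probe p : probes) {
                if (!Long.valueOf(p.expectedCount).equals(jobOutput.get(p.value))) {
                    return false;
                }
            }
            return true;
        }
    }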
CONCLUSION
MapReduce has become a fault-tolerant, efficient and scalable data processing tool for large datasets. But when MapReduce is introduced over public and hybrid cloud computing, addressing security and privacy becomes an important concern, and many algorithms and frameworks have since come into the picture to address these sensitive issues. Some of the important algorithms and frameworks have been surveyed here in detail, and a comparison table is also presented. Based on the survey, we listed these frameworks under three categories, namely hardware/checkpoint based, sampling based and replication based. During this survey we observed that a lot of research still needs to be done on security issues in the file system, trust in hardware, and related areas; not much work addresses the case where the master is malicious or the cloud provider is untrusted.

Two or more of these algorithms can be integrated to provide a more sophisticated and secure Big Data MapReduce analytics model in the cloud.

REFERENCES
I. Ajith Bailakare and Meenakshi, "An Introduction to Cloud Computing and its Security Issues and Challenges - A Literature Review", IJEECS, Vol. 6, Issue 5, 2017.

II. Han Hu, Yonggang Wen, Tat-Seng Chua and Xuelong Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial", IEEE Access, Vol. 2, Pp. 652-687, 2014.


III. Raghavendra Kune, Pramod Kumar Konugurthi, Arun Agarwal, Raghavendra Rao Chillarige and Rajkumar Buyya, "The Anatomy of Big Data Computing", Wiley Online Library, 2015.

IV. Shankar Ganesh Manikandan and Siddarth Ravi, "Big Data Analysis Using Apache Hadoop", IEEE Xplore, International Conference on IT Convergence and Security, Pp. 1-4, Oct. 2014.

V. W. Wei, J. Du, T. Yu and X. Gu, "SecureMR: A Service Integrity Assurance Framework for MapReduce", ACSAC, Pp. 73-82, 2009.

VI. Y. Wang and J. Wei, "VIAF: Verification-based Integrity Assurance Framework for MapReduce", International Conference on Cloud Computing, Washington, DC, USA, Pp. 300-307, 2011.

VII. F. Schuster, M. Costa, C. Fournet, C. Gkantsidis, M. Peinado, G. Mainar-Ruiz and M. Russinovich, "VC3: Trustworthy Data Analytics in the Cloud using SGX", IEEE Symposium on Security and Privacy, 2015.

VIII. A. Ruan and A. Martin, "TMR: Towards a Trusted MapReduce Infrastructure", World Congress on Services, Pp. 141-148, 2012.

IX. Chu Huang, Sencun Zhu and Dinghao Wu, "Towards Trusted Services: Result Verification Schemes for MapReduce", International Symposium on Cluster, Cloud and Grid Computing, 2012.

X. G. Pareek, C. Goyal and M. Nayal, "A Result Verification Scheme for MapReduce Having Untrusted Participants", Intelligent Distributed Computing, Advances in Intelligent Systems and Computing, Vol. 321, 2015.

XI. M. Moca, G. C. Silaghi and G. Fedak, "Distributed Results Checking for MapReduce in Volunteer Computing", International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, Shanghai, Pp. 1847-1854, 2011.

XII. Zhifeng Xiao and Yang Xiao, "Achieving Accountable MapReduce in Cloud Computing", Future Generation Computer Systems, Pp. 1-13, 2014.


XIII. Huseyin Ulusoy, Murat Kantarcioglu, Erman Pattuk and Lalana Kagal, "AccountableMR: Toward Accountable MapReduce Systems", IEEE International Conference on Big Data (Big Data 2015), Santa Clara, CA, USA, Pp. 451-460, 2015.

XIV. Yan Ding, Huaimin Wang, Songzheng Chen, Xiaodong Tang, Hongyi Fu and Peichang Shi, "PIIM: Method of Identifying Malicious Workers in the MapReduce System with an Open Environment", International Symposium on Service Oriented System Engineering, 2014.

XV. Meenakshi, A. C. Ramachandra, M. N. Thippeswamy and Ajith Bailakare, "Role of Hadoop in Big Data Handling", ICICI 2018, LNDECT 26, Pp. 482-491, 2019.

XVI. R. Chandana, D. Harshitha, Meenakshi and A. C. Ramachandra, "Big Data Migration and Sentiment Analysis of Real Time Events Using Hadoop Ecosystem", ICICI 2018, LNDECT 26, Pp. 1-7, 2019.

Cite this Article As
Meenakshi, Ramachandra AC, Subhrajit Bhattacharya (2019) Achieving Integrity Assurance of MapReduce in Cloud Computing, International Journal of Software and Computer Science Engineering, 4 (2), 20-33. http://doi.org/10.5281/zenodo.3375498
