You are on page 1of 25

Journal of Intelligent & Fuzzy Systems 44 (2023) 5231–5255 5231

DOI:10.3233/JIFS-223295
IOS Press

A comprehensive study and review of


tuning the performance on database
scalability in big data analytics

PY
M.R. Sundarakumara,∗ , G. Mahadevanb , R. Natchadalingamc , G. Karthikeyand , J. Ashoke ,
J. Samuel Manoharanf , V. Sathyag and P. Velmurugadassh
a Research Scholar, AMC Engineering College, Bangalore, India
b AMC Engineering College, Bangalore, India
c School of Computing and Information Technology, Reva University, Bengaluru, India

CO
d Department of EEE, Sona College of Technology, Salem, Tamil Nadu, India
e Department of ECE, V.S.B Engineering College, Karur, Tamil Nadu, India
f Department of ECE, Sir Isaac Newton College of Engineering and Technology, Nagapattinam, Tamil Nadu,

India
g Department of Artificial Intelligence and Data Science, Panimalar Engineering College, Chennai, India
h Department of Computer Science & Engineering, Kalasalingam Academy of Research and Education,
OR
Tamil Nadu, India
TH

Abstract. In the modern era, digital data processing with a huge volume of data from the repository is challenging due
to various data formats and the extraction techniques available. The accuracy levels and speed of the data processing on
larger networks using modern tools have limitations for getting quick results. The major problem of data extraction on the
repository is finding the data location and the dynamic changes in the existing data. Even though many researchers created
different tools with algorithms for processing those data from the warehouse, it has not given accurate results and gives low
AU

latency. This output is due to a larger network of batch processing. The performance of the database scalability has to be
tuned with the powerful distributed framework and programming languages for the latest real-time applications to process
the huge datasets over the network. Data processing has been done in big data analytics using the modern tools HADOOP
and SPARK effectively. Moreover, a recent programming language such as Python will provide solutions with the concepts
of map reduction and erasure coding. But it has some challenges and limitations on a huge dataset at network clusters. This
review paper deals with Hadoop and Spark features also their challenges and limitations over different criteria such as file
size, file formats, and scheduling techniques. In this paper, a detailed survey of the challenges and limitations that occurred
during the processing phase in big data analytics was discussed and provided solutions to that by selecting the languages and
techniques using modern tools. This paper gives solutions to the research people who are working in big data analytics, for
improving the speed of data processing with a proper algorithm over digital data in huge repositories.

Keywords: HADOOP, SPARK, scalability, batch processing, big-data

∗ Corresponding author. M.R. Sundarakumar, Research


Scholar, AMC Engineering College, Bangalore, India. E-mail:
sundar.infotechh@gmail.com.

ISSN 1064-1246/$35.00 © 2023 – IOS Press. All rights reserved.


5232 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

paradigm introduced by Google, its purpose is to han-


dle the high volume of data by using a map and
reducing functions [8]. The entire input file is split
into small pieces then map the related data as keys
and send it to reduce. Reducers collect those keys
and combined them into the appropriate output. This
entire process is done by data mining algorithms and
various techniques but the time taken to do this task is
not the user’s concern. Because the input of Hadoop
is a batch processing method, it is not suitable for
real-time applications [9].
Fig. 1. Main V’s used in Big Data.

PY
1.1. Literature survey
1. Introduction
For parallel processing of data over networks,
In this digital world, all people are generating a Hadoop Framework is used. The data processing
huge volume of information as data for their real- between the nodes is based on their location and

CO
world applications and needs. Every day plenty of migration of their place. So aligning and arrang-
data were created in various domains like health- ing all individual nodes for performing distributed
care, retail, banking, industries, and companies [1]. data processing at a time is complicated [10] using
The data warehouse has been generated to store and normal client-server or peer-to-peer networks. The
the time taken for getting it on time is miserable. distributed framework was used to disseminate the
Multiple methods and algorithms are used in a data data from the repositories but the accuracy level and
warehouse as a mining process but are not apt for latency are the factors affected in large networks
OR
plenty of situations. Later, data analytics disseminate while accessing huge data sets. To overcome this
to the market for managing large amounts of data problem Hadoop framework came and manage all
from various repositories [2]. Accessing those data is those critical situations easily as commodity hard-
using modern tools like Hadoop and Spark [3], and ware. It is a vertical storage data processing system
after that applying data mining algorithms for analyt- that is fast to recover the data elements even from a
TH

ics. Modified data stored in various places according large dataset or huge repository.
to user requirements. The major problem will occur
during the extraction phase due to data location and 1.2. History of HADOOP
volume of the repository over the network [4]. Fig-
ure 1 explains the main V’s used in big data and how In earlier days, distributed network node files are
AU

it is processed. sending through a client-server architecture system


To provide the solution for the high volume data with limited size. If huge files are sent across the net-
storage, the classic scale-in database storage was work, latency, throughput, and speed of the transfer
developed but not sufficient. Later, scale-out con- are very low [11], and maybe a loss or corrup-
cepts were introduced as commodity hardware by tion of files also happened. When data storage is
Hadoop [5]. So Hadoop will provide better solu- more it is not suitable for data processing within
tions to handle data storage in mega repositories a stipulated period is not possible. So instead of
in the form of clusters. There are single and mul- the scale-in concept working in RAID methodol-
timode clusters in Hadoop with Hadoop Distributed ogy [12], storage has been extended to a different
File System (HDFS) [6] for huge dataset storage over level. But also a lot of difficulties occurred in the
a network. Though the data was stored in HDFS distributed database process. The same concept is
clusters, it will be accessed via many algorithms used online also. Searching for an element from a
and classifications for extracting them into real-world huge database and given to the user is not succeeded
applications. The problem faced by most companies on time. In 2002, nutches were created by yahoo as
while extracting the data is waiting time and access- a web crawler to identify the highest count of ele-
ing time from the repository [7]. Hadoop handles ments searched in browsers using the internet [13].
this problem with Map Reduce concepts for giving The time taken is quite fast when searching the recent
solutions to inadequate time. Map Reduce is a new data whereas old data has to be extracted from the
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5233

Fig. 2. History of Hadoop.

PY
database that was not archived quickly. So Google monitored for data flow access. This will help to
has introduced the concept of Google File System find a minimal or optimal solution for time consump-
(GFS) [14] with a file access index table for the ref- tion issues in the Hadoop framework. Nevertheless,
erence of the files in a network. Based on that index data generation and extraction have to be monitored

CO
searching method instead of web crawlers, the entire using any of the tools in a Hadoop ecosystem that
distributed network finishes the searching element will give an immense result of required data to the
task within optimal time. After that Google intro- user on time among the clusters. Researchers find
duced Map Reduce programming concept to optimize difficulty over the network optimization time of the
the searched elements from a huge database by map ETL (Extraction, Transaction, and Loading) process
() and reduce () functions. Distributed File System normally, because of the CAP (Consistency, Avail-
(DFS) was introduced to store a large volume of data ability, Partition Tolerance) theorem concepts [16]. If
OR
from the commodity hardware nodes using Hadoop. any nodes got failure, then data alterations are quickly
So it is called a Hadoop Distributed File System reflected in the cluster by Hadoop Eco-System Tools.
(HDFS). Yahoo was supported by 1000 individual So this system deals with the entire big data analytics
nodes as a cluster to distribute database parallel. But concept via various tools. Table 1 the entire Hadoop
when Hadoop came into this scenario, all nodes are Eco-System structure.
TH

connected with different commodity clusters by the


scale-out process where data distribution happened
without any clumsiness. Finally, Hadoop was intro-
duced as an open-source framework by Apache and 1.4. HDFS architecture
developed by java programming as a core language.
AU

Hadoop has introduced its commercial product in Hadoop Distributed File System (HDFS) consists
the name of Apache Hadoop with basic versions. of Name Node (NN) and Data Node (DD) in a sin-
Though Hadoop has supported parallel distributed gle node or multi-node cluster setup. Classic Hadoop
databases with Shared Nothing Architecture (SNA) contains Job Tracker by name node and task tracker
[15] principle, it will support some modern tools for by data node to find the flow of the data access.
doing data processing. It is called Hadoop Eco Sys- But the limitations of Hadoop made this architec-
tem which supports all data processing and analytics ture with a new concept called replication. Each end
work. Figure 2 gives a detailed history of Hadoop and every input the job has to complete and the output
its limitations. data will be stored in 3 data nodes as a replication
[17]. The Metadata of the output data has been stored
to avoid software or hardware fault during transmis-
1.3. HADOOP ecosystem sion time. If any node gets failure in a cluster the other
nodes get activated and the data has to transfer with-
Hadoop has supported many data mining algo- out delay. In later versions of Hadoop, a Secondary
rithms and methods for accessing data from a huge Name Node (SNN) was introduced to avoid the fail-
data set with the help of modern tools as a support- ure of the name node and its data has to be copied
ing system. Data collection from different resources as a FSI image. Figure 3 denotes the architecture of
and stored in a warehouse has to be controlled and HDFS and its replication principles.
5234 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

Table 1
Hadoop Eco System
Name Sqoop FLUME HIVE HBASE PIG R
Functions Structured Collect the SQL query Store data in data Latin Refine data from
data logs ware-house. programming the ware- house.
Language DML JAVA JAVA JAVA Latin R
Database Model RDBMS NoSQL JSON JSON NoSQL RDBMS, No SQL
Consistency Concepts Yes Yes Yes Yes Yes Yes
Concurrency Yes Yes Yes Yes Yes Yes
Durability Yes Yes Yes Yes Yes Yes
Replication default default No No No No
Storage Method LOCAL HDFS HDFS HDFS HDFS HDFS

• Hadoop 2. X

PY
Hadoop is Master-Slave architecture by nature and
it is controlled by Name Node (NN) as a Master.
The remaining nodes which are connected to this
Name Node are called Data Nodes (DN) as a slave.

CO
If suppose NN got failure or is disconnected from the
cluster the entire system will get collapsed. In this
critical situation, the Name node has taken a photo-
copy of its data and stored it in a different node called
Secondary Name Node (SNN) over the network to
avail the CAP theorem concepts. This additional fea-
Fig. 3. HDFS Architecture. ture is available in Hadoop 2. X with the name of
OR
YARN (Yet Another Resource Negotiator). Here also
1.5. HADOOP versions replication factor is 3 but the block size is 128 MB for
input data storage [17, 19]. Below Table 2 will give
The Hadoop framework is used to provide parallel the technical differences between these versions.
distributed database access with a basic java program- • Hadoop 3-version
TH

ming paradigm [18]. It emphasizes the work done Hadoop 3. X is the latest version of the Apache
simplified by Map-reduce concepts working among Hadoop developed by Apache to overcome the prob-
the clusters. Hadoop was developed by Apache and lems of previous versions. The problem in previous
the basic version was released with several features versions is mainly lying in the number of blocks allo-
to do data processing within a short time. Initially, the cated for input data. For example, if 6 blocks are
AU

Hadoop framework was designed only for performing needed for storing input data into blocks we need
data processing tasks on a distributed database paral- 6X3 = 18 blocks for replication. So the overhead stor-
lel. The entire framework is running as a cluster-based age value is calculated using extra blocks divided by
network. original blocks and it will be multiplied by 100 which
• Hadoop 1. X gives a 200 percent result. The extra memory space
Hadoop 1. X version is a basic version that is allocation causes more cost usage problems for busi-
explained two major components Map reduce and ness people. So in Hadoop 3. x erasure coding [20,
HDFS storage. Map-reduce is a programming model 21] is used to reduce that extra memory space to 50
that reveals the input file is divided into the number percent overhead. Figure 4 and Fig. 5 will explain it.
of maps and converted into key-value pairs. Com- The above diagram describes erasure coding in the
biner parts get these maps as input and reduce them Hadoop 3. x feature. The replication of 3 nodes can be
according to the keys produced by mappers. Finally, divided and combined with two nodes using the XOR
the reduced data will be stored in HDFS storage. Per- function as parity block storage. The same 6 blocks
haps, this is a reliable storage system and redundant were taken for input file storage, instead of 18 blocks
for a distributed database. It consists of a replication only 9 blocks were allocated for storage which means
factor as 3 by default in master-slave architecture. 3 blocks for extra storage. So the overhead storage
Data nodes created 64 MB of blocks to store input is 3 divided by 6 and multiplied by 100 gives 50%
data in HDFS. only. Here the storage has to be denoted as Data Lake
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5235

Table 2
Hadoop versions differences
HADOOP 1. X HADOOP 2. X
4000 nodes per cluster 10,000 nodes per cluster
Job Tracker work is the bottleneck YARN cluster is used
One namespace Multiple in HDFS
Static maps and reducer Not restricted
Only one job to map-reduce Any applications that integrated with HADOOP
Working based on the number of tasks in a cluster Working based on cluster size

• Differences in Hadoop 2. X vs. 3. X


There are lots of technical features changed in each

PY
version of Hadoop which improves the performance
of the data processing speed in big data analytics.
Table 3 will denote all the technical features of their
versions.

CO
1.6. Schedulers used in HADOOP

In Hadoop, so many clients are sending their jobs


for performing tasks. This can be handled by Job
Tracker or resource Manager by Hadoop. There are
Fig. 4. Replication. two different versions are available in Hadoop named
OR
Hadoop 1. X and Hadoop 2.X.Here X denotes the ver-
sion releases/updates. If Hadoop 1. X is used in the
cluster, then the tasks can be controlled by the Job
Tracker /Resource Manager. If it will be Hadoop 2.
X, it may use the secondary node for the purpose of
replica in the Name Node and will be used for copy-
TH

ing Metadata [24] from the cluster. There are three


main schedulers are available in Hadoop.
Fig. 5. Erasure Coding.
i. FIFO
[22]. Due to this erasure coding number of blocks ii. Capacity
AU

assigned for incoming data is reduced. So memory iii. FAIR


has to be utilized in HDFS is very low. Moreover,
Erasure coding help to get accurate data with low The following Table 4 [25] explains all the sched-
latency, because of using limited memory utilized in ulers and their drawbacks.
HDFS as a block.
One more feature added to this Hadoop 3. x is
yarn architecture has slightly changed to adapt to the 2. Reasons for using HADOOP
reduction of data blocks in HDFS. In this, the resource
manager allocated the jobs to the node manager and The data processing speed is improved using the
it will be monitored by the application master. A con- Hadoop framework because of its features [26]. It has
tainer is a new feature that will give the request of each a lot of advantages over the network. Table 5 explains
node to the Application Master [23] then the request the Hadoop features.
is sent to the name node. If any failure between the
nodes all status will be monitored by the application 2.1. Problem identification
master and the container holds the status of the nodes,
so latency and throughput will be high when using Though Hadoop has many features for huge data
Hadoop 3. x. Figure 6 explains the YARN architec- processing in clusters, it has some drawbacks while
ture. executing the tasks. Because the features may have
5236 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
Table 3
CO
Fig. 6. Hadoop 3. X YARN Architecture.

Hadoop latest and previous version differences


Features Hadoop 2. X Hadoop 3. X
OR
Java Version JDK 7 JDK 8
Fault Tolerance Replication Erasure Coding
Data Balancing HDFS Balancer CLI disk Balancer
Storage Overhead 200% 50%
Data Storage Data skew Data fake
YARN services Scalability issues V2 improves
Container Delay due to feedback Queue
TH

Nodes per cluster 10000 More than 10000


Speed Low High
Single point of Failure Overcome automatically No manual intervention
Heap size memory Configured Auto tuning
Job Monitoring Resource Manager Node Manager
Task Monitoring Resource Manager Application Manager
AU

Secondary Name Node support Only one More than 2

Table 4
Schedulers’ drawbacks
Type of Scheduler Pros Cons Remarks
FIFO Effective Implementation Poor data location Static Allocation
FAIR Short response time Unbalanced workload Homogeneous System
CAPACITY Unused Capacity jobs Complex implementation Homogeneous System, Non-primitive
Delay Simple Scheduling Not work in all situations Homogeneous System, Static
Matchmaking Good Data locality More response time Homogeneous System, Static
LATE Heterogeneity Lack of reliability Homogeneous System & Heterogeneity
Deadline Constraint Optimizing Timing Cost is high Homogeneous System, Heterogeneity, Dynamic
Resource Aware Cluster nodes Monitoring Extra time for monitoring Homogeneous System, Heterogeneity, Dynamic
HPCA High hit rate and redundancy Cluster change state Homogeneous System, Heterogeneity, Dynamic
Round Robin Proper work completion No priority is given Homogeneous System, Heterogeneity, Dynamic

some limitations while distributed data processing mance of Hadoop over distributed data processing
running inside the clusters [27]. Multiple factors will scenarios. Some of the points are discussed below
affect the Hadoop features and reduce the perfor- with their major parameters.
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5237

Table 5
Hadoop Features
Features Usage
Various Data Sources Multiple networks
Availability It has a replication feature which means the data in which stored in a node can replicate in three different
nodes. So there is no problem with availability issues.
Scalable A lot of nodes can be connected in a cluster as a single node and multi-node at anytime, anywhere
concept.
Cost- Effective Hadoop is an open-source framework for the usage of all companies that created a huge volume of data
dynamically.
Low Network Traffic The traffic would not be affecting the data processing task because of connectivity among cluster nodes.
High Throughput The Map-Reduce programming paradigm provides high throughput between the nodes connected in
Hadoop by its divide and conquer method job process.
Compatibility Hadoop is a framework that accepts all platforms of operating systems, programming languages, and
modern tools of the Hadoop ecosystem.

PY
Multiple Language Support Hadoop is suitable for all object-oriented programming languages like java, python, and Scala. Moreover,
it is integrated with Hadoop ecosystem tools effectively

• While accessing the small files [28] due to the of problems with parallel processing among nodes.
default block size their speed has less and the There are some bottlenecks which are affected the
allocation of memory is huge. To avoid this
Merging of small files, HAR extension files
(Hadoop Archives), and H Base tools can be
used.
CO
performance of Hadoop processing over the network.
They are [31]

• All the key resources in the CPU can be utilized


• When big files have handled the speed of properly for Map and Reduce process.
retrieval is slow and can be processed by SPARK • Master-Slave architecture is running in the data
OR
Framework. node as Main memory using RAM.
• Unstructured data processing initiates low • Network and bandwidth traffic due to huge file
latency due to different file formats and this size accessing.
could be handled by SPARK, FLINK, and RDD • The throughput problem of input-output devices
(Resilience Distributed Data set) is used for stor- data storage over the network.
age purposes.
TH

• High-level data storage and network-level prob- Hadoop tuning problems in data processing are
lems are raised when we talk about security discussed below with solutions.
concerns [29] in a larger network that can be
◦ A large volume of source data can be tuned
solved using HDFS ACL for authentication
by Huge I/O input at the map stage [32] with
purposes and YARN (Yet Another Resource
AU

LZ0.LZ4 codex
Negotiation) as Application Manager.
◦ Spilled records in the Partition and Sort phases
• Batch-wise data input processing is working
are using a circular memory buffer using the
but not real-time data accessing. The tools like
formula
SPARK and FLINK is used to handle that.
Sort Size = (16 + R) * N / 1,048,576
• More lines of code (1, 20,000) [30] cannot be
R–number of Map
accessed but using SPARK and FLINK it is pos-
N –dividing the Map output records
sible.
by the number of map tasks are
• It does not support repetitive computations and
mapred.local.dir = 100MB
no delta iterations but the SPARK tool supported
◦ Network Traffic at Map and reduce side can be
all with in-memory analytics technique.
tuned by Writing small snippets to enable or
• No Caching and Abstraction features are run-
disable in the map-reduce program and default
ning in the Hadoop framework whereas SPARK.
replication factor of 1,3,5,7 nodes in the single
and multi-node cluster configuration.
2.2. Tuning Hadoop performance ◦ Insufficient Parallel Tasks [33] in idle resources
are handled by adjusting Map, Reduce Tasks
Hadoop is used to perform parallel distributed numbers and memory. There are 2 map, re-
data processing in different clusters. But it has a lot duce tasks, 1 CPU vcore and 1024MB memory
5238 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

allocated as a default configuration. For exam- 3.2. Phases of map-reduce


ple, 8 CPU cores with 16 GB RAM on Node
Managers, then 4 Map, 2 Reduce Tasks with Multiple phases are working in the map-reduce
memory 1024 MB allocated to each task and programming model because huge files are divided
it leaves 2 CPU cores in a buffer for an- other into independent tasks and each will work parallelly.
works. Separate work has to be done in every stage of the
map reduction.
Hadoop Framework running with java program- The Map-Reduce model is working only on the
ming language by map-reduce model for data data which are stored in HDFS. Because all the oper-
processing from a huge dataset warehouse on real- ations working in Hadoop Cluster are only based on
time applications. For complicated analysis of the HDFS storage data. So the input from various sources
real world, problems can be easily solved by Hadoop has to be given to the map-reduce from HDFS is the

PY
with low-cost open-source. Though data warehouse first step of Map Reduce. According to the data size,
engines work effectively, the speed of data retrieval the entire file is disseminated into individual tasks
is the major problem [34] in analytics. To improve by a splitter. The input text format is changed into
the speed of the data processing in big data analyt- key-value pairs by the record reader function. Com-
ics the above-said tuning parameters of Hadoop can biner is taking care of that key matches and it will

CO
be implemented with any latest algorithms like Deep make partitions over the HDFS disk based on the
Learning, Machine learning, Artificial Intelligence, file size. The partitions are stored in the intermedi-
Genetic Algorithms, Data Mining, Data Warehouse ate data of the mapper function to give the output to
algorithms, and block-chain [35, 36] concepts. Hence the next phase. But alignment is the major problem
the huge dataset of big data is the cause for handling that leads to cause latency or throughput problems.
real-world scenarios in many companies. All their So shuffling of keys and value pairs for each partition
worry is to maintain that with low-cost server con- is running on the HDFS disk. The next important pro-
OR
figuration and consistency should be controlled on cess that happened in Map Reduce is sorting [40, 41]
time. The retrieval of data from the data warehouse based on keys from the HDFS. Using index search-
has to be improved with the Hadoop framework by ing techniques the sorted values are generated for the
high throughput is succeeded. next phase. Reducer is important in map-reduce to
optimize all the values into an appropriate format.
TH

3.3. Problems of map-reduce


3. Map reduces programming model
Map Reduce is designed with java as a program-
Map Reduce is an important programming model ming language platform working on a Hadoop cluster.
AU

used in the Hadoop framework that accesses a high The cluster may vary in their nodes named as a
volume of data in parallel by disseminating the whole single node or multi-node cluster have master-slave
work into individual tasks. So that the input file can architecture. The main problem of Map Reduce is
be accessed by map-reduce functions to minimize the extracting data from a huge dataset within a stipu-
size of the file coming in the output part with com- lated time but that is not achieved because of the input
pression [37]. After this process, the user or client file size of data from HDFS. The challenge in map-
will get the exact files that they expected from the reduce is to minimize or optimize the whole volume
large volume of datasets. of data into compressed format low volume data. But
the time to complete that process is very high. In other
words, latency and throughput are very low. Normal
3.1. Importance of map-reduce data extraction from the data warehouse is a little bit
slower because of the patterns and algorithms used
Map-reduce is used to access a huge dataset that for processing [42].
is stored in HDFS parallelly. Increasing the velocity
and reliability of the cluster map-reduce plays a major 3.4. Read/write operations in map reduce
role in processing. The latency and throughput of the
entire system will be increased because of the time Map Reduce is running with batch processing on
taken to complete the job. Hadoop cluster data input format which means once
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5239

key-value pairs. Finally, it collects the time of occur-


rences of each word from that three sentences and
will be given to the output to the client or user. Here
the final output will be in the compressed form of
input data which leads to data processing with poor
latency best throughput and. The size of the input
file is low like KB means within a few seconds map-
reduce has to be finished. If it will be in MB/GB,
then the number of maps and reduces will be more
for doing the Map-reduce function [49, 50]. Figure 8
gives the example of word count with three sentences.
Finally, the output got by the user is a compressed

PY
Fig. 7. Data Sharing in Map Reduce. number of occurrences as an output. Based on this
word count all the files are handled by batch process-
the input has to be taken another input is waiting ing and perform Map Reduce operations. Figure 8
for the completion of the previous task. This is the summarizes the word count example.
most important problem in Map Reduce and it will

CO
be accessed through iterations [43] in Map Reduce.
Because once the reading operation has taken place 4. Map reduce versions (MRV)
from HDFS it will be processed by the Map-Reduce
phases and write the output on HDFS [44]. The Map Reduce function done in Hadoop cluster by
next iteration has taken the input from these pre- job tracker and task tracker. Classic versions of Map
vious writes on the HDFS disk. Likewise, if more Reducev1 function is working with trackers. But
number of iteration processes is compiled in Map- latest version MRV2 is running with YARN archi-
OR
Reduce [45] then it will store HDFS permanently. If tecture. Because it gives the tracking feature of Map
the user requires particular data from that they have reduce job in every stage [51]. The schedulers and
to write queries using any Data Manipulation Lan- queues are used to give the job status of a given task.
guages (DML) for their results. In this scenario, more MRV1 only deals with output whereas MRV2 gives
iterative operations are not possible by Map Reduce the status of the entire job. Figures 9, 10 illustrate the
TH

because in batch processing only once the input has advantages and disadvantages of MR versions.
to take. If more iterative operations (looping) [46–48]
are running it will not apt for low latency data pro- 4.1. HADOOP map reduces performance tuning
cessing. Because every time the map-reduce model
runs repetitive functions, it may not complete the The Map Reduce performance can be accessed by
AU

task within time. Moreover, latency is also high while several factors of the Hadoop framework and its fea-
doing data processing. Figure 7 explains the read and tures. Map Reduce performance can be affected in
writes operations of the data sharing function in Map terms of speed, latency, throughput, and time taken
reduce. to complete the task. There are several other factors
that may exist during the transmission of data in the
3.5. Map reduce word count example Hadoop cluster that will affect map-reduce [52]. They
are
The best example for Map Reduce is a java based
Word Count Program in the Hadoop cluster. Initially, a. Performance
three sentences have to be taken for input and it b. Programming model & Domain
will be split into different individual tasks as input c. Configuration and automation
split. The next mapping phase takes care of indi- d. Trends
vidual tasks and converts that input split into keys e. Memory
and values which means the number of presence of
the word is calculated. Based on the alphabet cri- 4.1. Performance
teria the keys are shuffled and sorted as an output
of the mapper. Reducer collects those outputs and Initialization of Hadoop and Map Reduce will
gives them as input to the combiner for alignment of affect the performance due to the techniques used in
5240 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
CO
OR
Fig. 8. Word Count Example for Map Reduce.
TH
AU

Fig. 9. Map Reduce Version1.


M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5241

PY
CO
OR

Fig. 10. Map Reduce Version2.

the entire data processing system. Because Hadoop 1. tracked and sent to YARN for monitoring. Finally, any
TH

x gives only the output but cannot give time to com- jobs that want to kill or delete during the processing
plete the task. But Hadoop 2. x overcomes this issue time should be controlled by YARN because of this
and tracks the status of the job throughout the task. coordination. Any data processing model contains a
At last, the latest Hadoop 3. x version describes the single input system for processing whereas here both
advanced MRV2 process for quick response over the inputs are merged together as a tagging method for
AU

network on the Hadoop cluster through its erasure easy access to the huge data sets. Figure 12 gives the
coding techniques [52, 53]. So Hadoop framework issues of performance in Map Reduce.
and Map Reduce installation is the major issue in the
performance of Map Reduce consideration. Figure 11 4.2. Programming model and domain
gives the issues of performance in Map Reduce.
Scheduling of jobs in Map-reduce is an impor- Map Reduce writing map and reduce functions
tant concept in the Hadoop cluster. Continuously jobs using good programming is essential for the users.
are assigned in Hadoop Framework by the clients; There are various programming languages supported
the order of jobs taken for Map Reduce is a typical by Hadoop for performing Map-Reduce operations.
process. So the schedulers are used to perform this Every language is based on platform dependent or
work with the help of queues. Three main schedulers independent employing their characteristics. Some
are available in Hadoop namely FIFO, Capacity, and of the languages that support the Hadoop ecosystem
FAIR [54]. Coordination of jobs between the nodes is, are SQL, NoSQL, Java, Python, Scala, and JSON
coordination between the nodes on the Hadoop clus- [55]. They have their own set of properties to per-
ter disseminates the details of all nodes to consider form operations like join and cross properties of the
as the main factor in tuning the Map-reduce func- dataset. It supports the techniques of running itera-
tion. While accessing a variety of jobs sequentially tions and incremental computations among the nodes
the resource manager. The status of the jobs will keep in Hadoop for accessing distributed databases paral-
5242 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
CO
OR
Fig. 11. Performance issue 1.

adequately is a big challenge. If this work fails auto-


matically Map Reduce will give poor output on the
task. Input-Output disk minimization is the major
TH

drawback in Hadoop MR for accessing data regu-


larly. Their performance is changed due to the size of
input data and methods used for splitting are noted.
If the number of reduces is less may increase MR
performance. The code written in a specific language
AU

supports static code generation [56] and the index


creation method on Map-reduce will increase the per-
formance. Sometimes the specific language doesn’t
adapt to the changes that are made by the client in
the system. The entire system is aware of data opti-
mization principles to provide better performance on
Fig. 12. Performance issue.
Map-reduce.

4.4. Trends
lelly. Perhaps, many iteration operations will affect
Map Reduce performance. Figures 13 and Fig. 14 Data warehouse data are accessed by the database
denotes issues of programming models. engine on Map-reduce. But the data size is very large,
and extraction of small data from that engine made
4.3. Configuration and automation it difficult. The time taken to complete the process is
very high. But instead of disk processing, it should be
Self-tuning of the workload between the nodes done by memory processing directly will improve the
can be balanced by a load balancer on Hadoop and MR performance by I/O disks. Indexing [57] is the
the data flow sharing among the nodes is controlled traditional database technique that is used to search
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5243

PY
Fig. 13. Programming Model issue 1.

CO
OR
TH
AU

Fig. 14. Programming Model issue 2.

the elements in the database or files run in nodes. It to any issues the next job or node will get active and
gives the extracted data to the user very fast. It might start the process over the network without waiting for
not depend on the size of a file, in each file the same manual intervention. The materials required for the
techniques have been used. Memory caching [58] MR process can be verified initially before the start
between the nodes is very important to improve the of the job allocation by the resource manager.
performance in MR. It describes the status of every
job condition and the previous computation level also. 4.5. Memory
Caching helps to identify the location of the data on
the node specifically by its memory allocated by the Map Reduce function fully depends on the num-
jobs. Even though the nodes or jobs are canceled due ber of maps and reducers used for every task in the
5244 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

Table 6
Map Reduce Implementations
Map Reduce Implement Methods Advantages Disadvantages
Google Map Reduce multiple data blocks on different nodes Batch processing-based architecture is not
to avoid fault tolerance problem suitable for real-time applications
Hadoop High scalability Cluster maintenance is difficult.
Grid Grain Subtask distribution and load balancing Does not support non-java applications
Mars Massive Thread Parallelism in GPU Not for atomic operations due to expensive
Tiled-Map Reduce Convergence and Generalization Cost is high
Phoenix Multicore CPU Scalability is less
Twister Tools are used effectively Not possible to break huge data set

Hadoop cluster. If it will get increase immediately the be removed. For example, in the word count

PY
performance of the system goes very slow in terms Map Reduce program written by java only case
of time taken to complete the task. the sensor output is required means making
–DwordCount.case.sensitive = true/ false com-
• Calculation of number of maps mand during the run time will give better
The number of maps assigned for every job by performance than the previous one [59]. Because

CO
a client is too calculated by the size of the input the bad records can be eliminated using these
file [59] and allocated blocks for accessing those commands.
data. The following formula denotes the number • Task execution & environment
of maps required for performing Map Reduce The task tracker in data nodes keeps track
operations. of all information about the jobs and is sent
Number of Maps = Total size of the input to YARN Resource Manager consequently.
OR
But there is a limitation over these oper-
file/Total number of blocks ations in terms of memory allocation in a
(1) map and reduction for task execution. The
By default, minimum of 10 – 100 maps per command –Djava.library.path=<-Xm512M/-
node is assigned for the job. A maximum of 300 Xm1024M executes Map Reduce environment
TH

maps can be allocated to do Map Reduce job. [60] within that memory limit successfully.
For example, 10TB of input file size and 128MB The following Table 6 & provides details of
block size are allocated by Hadoop 2. x means Map Reduce Implementation methods and their
10TB/12b MB = 82,000 maps are approximately applications.
assigned for completing that job.
AU

• Calculation of number of reduces 5. Map reduce job optimization techniques


Normally reducer is allocated for all maps
reduce job is 1. If the number of reducers wants The Map-Reduce job allocated by the resource
to be increased for huge processes, then the con- manager of Hadoop will improve the performance of
figuration file can be changed during installation the data processing speed and accurate results based
or after using speculative tasks. The follow- on the configuration of the cluster and proper allo-
ing formula denotes the number of reduces by cation of map-reduce tasks with their type of input
default required for performing Map Reduce data. Though LZO compression helps to compress
operations. input file size there will be a combiner between map-
per and reducer is a must for improving map-reduce
NumberofReducer job performance optimization [61]. Most of the code
= 0.95or1.35 ∗ numberofnodes (2) data can be reused to avoid searching for data location
time over the cluster.
• Skipping bad records There are some other important aspects used in the
To eliminate the bad records created dur- map-reduce programming model to provide solutions
ing the Map-Reduce process can be changed for map-reduce job performance improvement in the
using configuration files. By enabling the true Hadoop framework. All factors have represented the
or false function in the configuration file it can flow of jobs from resource managers to data nodes and
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5245

Table 7
Map Reduce Implementations
Map Reduce Applications Pros Cons
Distributed Grep Data analysis is generic Less response time
Word Count Massive document collection of occurrences Limited only
Tera Sort Load balancing transparency
Inverted Index Collection of unique posting list Lots of pairs in shuffling & sorting
Term Vector Host analysis search Sequential tasks
Random Forest Scalability is high Low
Extreme Learning Machine union and simplification Uncertainty
Spark Data fit in memory Huge memory needed
Algorithms Data exhaustive applications Time uncontrollable
DNA Fragment Parallel Algorithm Large memory
Mobile sensor data Extracting data is easy Difficult to implement
Social Networks Quick response Need more techniques for analysis

PY
how data can deviate from the flow during run time. • Work sharing
Because these factors are rectified means even a big Map Reduce is specially designed for handling
job running on the Hadoop cluster will give output multiple jobs parallelly. If multiple jobs are running

CO
with low latency. Below Fig. 15 listed the factors for simultaneously, it is recommended to share those jobs
job optimizations. by individual maps [65] in the function. That work
• Operator pipelining was done by a splitter in the map-reduce function. The
It is mainly used in Map reduce concept for aggre- time taken to complete the job is decreased because
gation of databases to utilize the filter data and of this sharing job process.
perform operations like grouping, sorting, and con- • Data reuse
verting [62, 63] output from one form to another form Data that is used for the Map Reduce function
OR
of operators. Pipelining is used to connect two jobs from the HDFS storage can be reused for next-level
simultaneously to complete the job within time. But changes in the same input file. Reusability [66] in the
the issue is extended database lock or tie when read- form of inheritance and will reduce the number of
ing/writing in response to the user request. So the lines of codes in a program.
iterate operations are used at that particular time to • Skew mitigation
TH

improve their performance during pipeline events. Skew Mitigation is the main issue in Map reduce,
• Approximate results solved by different techniques to avoid data trans-
The result of the map-reduce is approximate in mission. Using skew-resilient operators, classical
terms of size, time, and accuracy. Even though the skew-mitigation problems were solved. By reparti-
performance has to be increased during the running tioning the concept, skew mitigation can be handled
AU

time it cannot be predictable by its output. Any files in a big data environment using three major methods.
can be taken as an input format it will provide an out- Minimizing the number of times of repartition to any
put of map reduced function. The output cannot be task can reduce repartitioning overhead. Then min-
accurate or reliable in such cases. imizing repartitioning side effects can be removed
• Indexing and sorting during the struggling time to remove mitigation ambi-
Since Map-Reduce works with key-value pairs, it guity. At last, unnecessary recompilations are used to
is very complicated to align the order of the jobs by minimize the total transparency of skew mitigation
a resource manager. It allocates the task to the data [27, 61, 67].
node which may cause conflicts rapidly [57, 64]. So • Data colocation
indexing techniques are used in this job execution by Same location files will be collocated on the similar
searching the elements based on the index key values locate of nodes is a new concept based on the locator
stored in the index table. The table contains all key of file attribute in the file characteristics. When the
values of the independent task in the mapper task and new file is creating its location, the list of data nodes
will give exact data to the combiner to perform the and the number of files in the same case can be iden-
merging option. But the issue is that merging also tified and stored all those input files in the same set
it is complicated by arranging values in any order. of nodes automatically [17, 62, 68]. It will improve
So sorting is a function used in between these and the map-reduce performance by avoiding duplication
performs reducer value output effectively. and repetitions [69, 70] of files in a Hadoop cluster.
5246 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
CO
Fig. 15. Data colocation.

Figure 15 describes the example of data colocation time. So obviously it is working faster than
OR
in the Hadoop cluster. Hadoop. Approximately 100 times better than
Map Reduce on Hadoop due to memory.
• Stream Processing: It supports stream pro-
6. Map reduction using java and python cessing which means input and output data
are continuously accessed. It is mainly used to
TH

Map Reduce function can be written in java or any access real-time application data processing.
other higher languages, the performance should be • Latency: Resilient Distributed Dataset (RDD) is
changed according to the features of selected lan- used to catch the data using memory in between
guages. Table 8 narrates the differences between java the nodes on the cluster. RDD manages logi-
and python coding languages when map reduce can cal partitions for distributed data processing and
AU

be written. conversion of data format. This is where Spark


does most of the operations such as transfor-
mation and managing the data. RDD is used in
7. Spark framework logical portions [71], which can be manipulated
on the Hadoop cluster.
Apache Spark framework is an open-source used • Lazy Evaluation: Only for needed situations it
for distributed cloud computing clusters. It is working is accessed the real world applications otherwise
with the data processing engine concept meanwhile it will be the idle condition.
to be faster than the Hadoop Map Reduce for data • Less Lines of Code: SAPRK is used SCALA
analytics. Though Hadoop is used to provide big data language for processing data with less number
analytics effectively, it has some drawbacks [70] with of lines when compared to Hadoop.
limited factors which were already discussed in sec- Figure 16 and Fig. 17 are explained the working
tion 4. principles of the Hadoop map-reduce and spark
I. Spark features engine.

• In-memory Processing: This technique is used II. Real world scenarios of SPARK
to capture moving data or processes inside Many companies created terabytes of data through
and outside of the disk without spending more human and machine generation applications. Apache
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5247

Table 8
Map Reduce written in java and python differences
Features Java Python
File size Handling <1 GB is easy >1 GB is easy
Library Files All in JAR format Separate library files
File Extension .java .py
Method of calling Main No main method
Data collection Arrays, Index List, set, dictionary, tuples
Object oriented Required Optional
Case Sensitive Required Optional
Compilation Easy in all platform Easy in Linux
Productivity Less More
Applications Desktop, Mobile, Web Analytics, Mathematical,Calculations
Type of files Batch processing, embedded application Real time processing files also
Functions Return 0 & 1 is used Dict is used for return

PY
Programming concepts Dynamic less Cannot push threads of single processor to another
Syntax Specific types Simple only
Basic programming C,c++ basics(oops) Higher end concepts like ML
Number of codes High Less code size
Input data format Streaming with STDIN,STDOUT by binary not text Both binary and text
Areas Working Architecture, tester, developer, administrator Analytics, manipulation, retrieval, visual reports,

CO
AI, Neural Networks
Speed 25 times greater than python Low due to interpreter
Execution Time High because of code length Easy
Typing Dynamic Static
Verbose Syntax Low Normal
Frameworks Spring, Blade Django, Flask
Gaming Jmonkey Engine Pandas3D,cocos
Ml Libraries Weka, Mallet Tensorflow, pytorch
OR
TH
AU

Fig. 16. Working of Hadoop Map Reduce.

Spark is used to improve the company’s business consistency of data at each second so that the
insights [72]. Few examples of companies using customer relationship is very strong on their
SPARK in real-world applications. feedback.
B. Alibaba: Analyze big data, and extrac-
• E-commerce: To improve consumer satisfac- tion of image data can be handled by Alibaba
tion over competitive problems, a few industries Company using SPARK as an implementa-
are implementing SPARK to handle this situa- tion tool. They are used on a large graph, for
tion. They are: deriving results.
A. eBay: Discounts and or offers for online • Healthcare: MyFitnessPal, which is used to
purchases and any other purchase transaction improve a healthier lifestyle through diet and
SPARK can be developed using real-time to scan through the food calorie data of about
data. It will provide the updating status and 100 million users to find the quality of the
5248 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
Fig. 17. Working of Spark.

food system using SPARK in-memory process- takes replication of every job output data in the HDFS
ing techniques.
• Media and Entertainment: Netflix, for video
streaming uses Apache Spark to control and
monitored its users compared with the earlier
CO
cluster disks.
Spark is a distributed cluster framework for pro-
cessing data on the memory of the nodes by its process
engine. In-memory analytics data processing is used
shows that they have watched. in SPARK, so the output of each step is stored in
between the node memories for clients. For this, it
OR
III. HADOOP AND SPARK SIMILARITIES consumes a lot of memory for storage. One big advan-
tage of SPARK is to access real-time applications
• Stand-alone Mesos and Cloud are the places
frequently. Although it is used for online generated
where Spark can run on Hadoop.
data processing, streaming is mainly used. There is
• Machine Learning algorithms can be executed
plenty of data generated online with every second.
faster inside the memory using Spark’s MLlib
TH

To maintain all those heavy storages and accessing


in order to provide the solutions which are not
engine or machine should be needed. So SPARK is
given easily by Hadoop Map Reduce [73].
used a lot of memory units in between the nodes on
• Cluster Administration and Data Management
the network path. The time to complete the job is also
can be done by combining SPARK and Hadoop
very less by using SPARK [74]. Figure 18 differenti-
because SPARK does not have its own Dis-
AU

ates the working of HDFS and SPARK clearly.


tributed File System (DFS).
• Enhanced security can be provided by Hadoop,
for making workloads. But Spark can be 7.1. Spark architecture
deployed on available resources at all places of
a cluster. So there is no manual allocation and In general, the SPARK framework is used to access
tracking of individual tasks. For the above-said real-time data with its memory analytics processing
features, SPARK is still used by big companies over the big network without any delay or traffic. A
and industries those who are working on real- normal SPARK architecture consists of a software
world applications. driver program that has to be written in SCALA lan-
guage [75] and it will control all the worker nodes.
IV WORKING ON SPARK VS. HADOOP The cluster manager has monitored all these works
Hadoop framework is working under the principle and it is located between the worker node and the
of master-slave architecture where used as name node driver program node. Spark context is a small pro-
and data node with replication principle. The output gram written only for doing the job of data processing
of each step in Hadoop has stored their data in the on the nodes but the difference is mainly in the mem-
HDFS cluster continuously. So if the client needs to ory storage part. The worker node contains the task
retrieve the data from the database it will be very easy assigned by the cluster manager with the executor
to extract in Hadoop. Because the Hadoop framework module. Once the program can be executed by a
M.R. Sundarakumar et al. / Review of tuning performance on database scalability 5249

PY
CO
Fig. 18. Working difference between SPARK and Hadoop.

buffer and are arranged them like an array based on


the index value. So whenever multiple jobs are com-
ing to HDFS stored the output continuously without
any drawback. SPARK initial design is accessing the
OR
data from the input and performing the mapper func-
tion then suddenly storing the output to a separate
partition like a queue. So when the client is required
each step output from the HDFS storage, they can
collect it directly from that. In that Fig. 20, R1, R2,
TH

Fig. 19. Architecture of SPARK. and R3 are the partitions that collect the output of the
mapper and stored it accordingly. During the shuffle
section, c1 is a core that is used to denote mapper 1,
cluster manager, the executor module in the worker
and c2, c3, and c4 denote other mappers. So if shuf-
node access the input data from HDFS and immedi-
fling will happen in SPARK, the mapper output of
AU

ately stores the output in memory. The client wants


the particular mapper is stored in core 1. Likewise, an
to know the intermediate data at every step of the
individual CPU node contains 4 cores [79] by default
execution they will retrieve from that. Figure 19 is
all the other mappers are stored on free cores which
clarifying the architecture of SPARK.
are represented by the mapper. Figure 20 will explain
the shuffles used in Hadoop and SPARK respectively.
7.2. Designing of spark file system Features of big data eco system tools are listed
below in Table 9 for all the tools. There are plenty of
In Hadoop Distributed File System, Map Reduce differences between Hadoop and SPARK. The exper-
can use for data processing by mapper functions imental results of multi-node clusters are displayed
among nodes under the cluster. The input file is dis- in Table 10.
seminated into the number of tasks by the splitter and
each task is working individually for the Map Reduce
operation. Every mapper output is collected as a key- 7.3. HADOOP VS. SPARK
value pair and it will be stored in a circular buffer
[76–78] for alignment then whole files are stored in There are plenty of technical differences between
HDFS by partitions. Figure 20 will explain the work- Hadoop and SPARK. Based on these results anyone
ing nature of Hadoop. The partitions R1, R2, and R3 can conclude that for computing their big data which
have different outputs of the mapper from the circular framework is better to select for data processing?
5250 M.R. Sundarakumar et al. / Review of tuning performance on database scalability

PY
CO
OR

Fig. 20. SPARK files system.

Table 9
Specifications of all tools

Features | Hadoop | SPARK | Flink | Storm | Kafka | Samza
Performance | Slower | 100 times faster than Hadoop | Closed-loop iteration | Fast | Fast | Fast
Language support | Java, Python | Scala; provides API in Scala, Python and R | Java and Python | Works with all languages | Best with Java, works with all languages | JVM languages
Processing | Batch | Stream & batch | Single stream | Native stream | Native stream | Native stream
Latency | High (min) | Low (sec) | Low (sub-sec) | Very low (ms) | Low (1-2 sec) | Low (less than a sec)
Security | Kerberos and ACL | Low, secured using only passwords | Kerberos | Kerberos | TLS, ACL, Kerberos, SASL | No security
Fault tolerance | High | Less (snapshot method) | High | High | High | -
Scalability | Large, 14000 nodes | High, 8000 nodes | High, 1000 nodes | High | Average | Average

7.3. HADOOP VS. SPARK

There are plenty of technical differences between Hadoop and SPARK, and based on these results anyone can conclude which framework is better to select for processing their big data. Moreover, these technical differences convey a message to people who plan to initiate a computing start-up: they have to select the framework that meets their requirements in all aspects. Table 11 summarizes the features of both Hadoop and SPARK.

Table 10
Experimental results of multi-node cluster

Parameters | Hadoop Records | SPARK Records | FLINK Records
Data Size | 102.5 TB | 100 TB | >100 TB
Elapsed Time | 72 min | 23 min | >23 min
Nodes | 2100 | 206 | 190
Cores | 50400 physical | 6592 virtualized | 6080 virtualized
Throughput in cluster | 3150 GB/sec | 618 GB/sec | 570 GB/sec
Network | 10 Gbps | EC2 | >10 Gbps
Sort Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min
Sort rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min
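The per-node figures in Table 10 follow from the cluster-level numbers by simple arithmetic: sort rate = data size / elapsed time, and sort rate per node = sort rate / number of nodes. The short script below reproduces them; the small differences from the published values most likely come from rounding of the elapsed times, and the FLINK row is approximated here because the table only lists lower bounds (>100 TB, >23 min).

```python
# Sanity check of the per-node figures in Table 10 (cluster numbers from the table).
records = {
    # framework: (data size in TB, elapsed minutes, number of nodes)
    "Hadoop": (102.5, 72, 2100),
    "SPARK": (100.0, 23, 206),
    "FLINK": (100.0, 23, 190),  # table lists ">100 TB" and ">23 min"; approximated
}

for name, (size_tb, minutes, nodes) in records.items():
    sort_rate = size_tb / minutes            # TB per minute for the whole cluster
    per_node = sort_rate * 1000 / nodes      # GB per minute per node (1 TB = 1000 GB)
    print(f"{name}: {sort_rate:.2f} TB/min, {per_node:.2f} GB/min per node")
```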

8. Author's contribution

To summarize the contribution of this paper, the authors explain the challenges and limitations faced in modern tools like Hadoop and SPARK for data processing in the following points:

I. The authors surveyed various techniques from many research papers on tuning the performance of databases as scalability increases; all of these papers discuss techniques for extracting data from huge repositories with low latency and high accuracy over large networks.

II. The authors wrote this review around the Hadoop versions and their features for extracting data from repositories, and around the features and latest techniques of the SPARK tool. A detailed review of selecting a tool for extraction, with its advantages and disadvantages, is given in this paper.

III. The authors have suggested ways to improve the performance of database extraction from repositories, along with the difficulties faced in previous methods. Although modern tools are used for data extraction, writing a map-reduce program in Hadoop with a recent algorithm is a challenging task, and while SPARK is an advanced tool, the cost of using it is prohibitive for small-scale companies. Suggestions are therefore given to improve the performance of both tools.

9. Conclusion and future scope

Big data analytics is an important technology in this era used to access huge datasets in parallel in a distributed cluster environment. Based on the requirements of the client or user, every software company decides how to deploy its software and hardware frameworks, and many start-up companies are confused about which infrastructure to build. This paper provides a direction for companies and research-oriented people to select a framework for rapid data processing. The basic factors of data processing projects, such as speed and cost, are considered in all situations, and the technologies and examples above give a transparent view of how big companies build their infrastructure to deal with real-world problems effectively. There is a million-dollar question in the software industry: can real-world problems be solved only by big industries, or by those who are ready to invest more money? Other factors are also considered in the same scene by different industries, and the main problem, extracting value from large datasets with fewer resources, is a challenging one. This paper deals with all of these points to improve the data processing velocity of big data analytics with the well-known frameworks Hadoop and SPARK. Henceforth, the data generated day by day in the real world can be handled with different modern algorithms, and processing huge volumes becomes possible by tuning the already existing methods and trends. Proper analysis and the capacity to identify research problems are needed to implement innovative solutions for real-world problems. Finally, when a user wants to find a solution for their problem with big data analytics, Hadoop and SPARK are the main frameworks that provide solutions, but they have to choose the best one according to their requirements. For example, if a client wants to start a company with low investment but must deal with a complex big data problem, Hadoop is the best choice because of its cost and the type of data it handles. If the same company has the urge to handle real-world application data and is ready to provide a huge investment, SPARK is obviously the best tool for them. When we consider technical aspects like algorithms and methodology, both tools use some common techniques, but the final decision should be taken based on cost and the type of data handling. For anyone handling big data analytics, Hadoop Map Reduce is suitable for low-cost batch processing, whereas SPARK is apt for real-time processing and is a high-cost tool for data processing. There are plenty of tools available for handling big data in the IT world, but only a limited number are popular among companies and industries because of their user-friendliness or cost.

Table 11
Hadoop vs. SPARK

Features | Hadoop | SPARK
File processing method | Batch processing | Batch/real-time/iterative/graph processing
Programming language | Java, Python | Scala
Data storage type | Scale-out | Data lake or pool
Programming model | Map Reduce | In-memory processing
Job scheduler | External | Not required
Cost | Low | High
RAM usage | Less | Lots of RAM
Memory type | Single memory | Execution and storage memory separately
Data size | Up to GB is fine | PB is fine
Latency | High | Low latency
Data taken as input | Text, images, videos | RDD (Resilient Distributed Dataset)
Disk type | HDD (hard disk) | SSD (solid-state disk)
N/w performance | Low | High
Speed rate | <3x | <3x, 1/10 nodes
Algorithm by default | Divide and conquer | ALS (Alternating Least Squares)
Data location details | Index table | Abstraction using MLlib
Data hiding | Low | High, using function calls
Dataset size | Small set | Huge set, >TB
Shuffle speed | Low | High
Storage memory of mapper output | Directly in disk | RAM to disk
Containers usage | Released after every map | Released only after the entire job completes
Dynamic allocation | Not possible | Possible but hectic
Replications | 1, 3, 5 nodes | Pipelines
Delay | High, due to assigning a JVM for each task | Low, due to quick launch
Mechanism for message passing | Parsing and JAR files | Remote Procedure Call (RPC)
Time taken to complete a job | Minutes (small data set) | Hours (big data set)
Allocating memory | Erasure coding | DAG (Directed Acyclic Graph)
Data input method | Hadoop Streaming | SPARK Streaming
Data conversion formats | Text to binary | All forms
Job memory | Large | Low
Input memory | Less | High
Processing type | Parallel and distributed | Parallel and distributed
Data extraction | Disk based | Memory based
I/O processing | Disk | RAM
Resources usage | More | Less
Data status | Stateless | State
Iterative process | Not taken | Taken
Caching | Doesn't support | Supports in RAM
R/W to HDFS | YARN cluster | SPARK engine
Tools supported | Pig, Hive, HBase | All in one
Accessibility | Command User Interface (CUI) | Graphical User Interface (GUI)
Traceability | Easy, by YARN | Not possible
Fault tolerance | High | Low
Security | High (tracking) | Low (no tracking)
Storage architecture | Distributed | Not distributed
Data taken slot from resources | Only one slot | Any slots (real time)
Time lag | Yes | No
Program written | Map Reduce | Driver program
Controller | YARN | Cluster manager
Partition type | Single partition for all map outputs | Separate partition for every map output
Companies used | Industries and companies that do not need real-time data analytics: Cloudera, Hortonworks, IBM, British Airways, Facebook, Twitter, LinkedIn | Where real-time data processing is needed: Yahoo, eBay, Alibaba, Netflix, Oracle, Cisco, Verizon, Microsoft, Databricks and Amazon
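Several rows of Table 11 (RDD input, in-memory processing, DAG-based allocation) come down to one behaviour: SPARK transformations only record lineage in a DAG, and nothing executes until an action is called. The sketch below is a hedged illustration of that behaviour, assuming a local Spark installation; it is not code from the reviewed systems.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDAGDemo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))                  # RDD: the input type listed in Table 11
doubled = numbers.map(lambda x: x * 2)                  # transformation: only recorded in the lineage
multiples_of_4 = doubled.filter(lambda x: x % 4 == 0)   # still no job has run

# The lineage (DAG) that Spark has built so far, before any execution.
lineage = multiples_of_4.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)

# Only the action below triggers execution of the whole DAG.
print("result:", multiples_of_4.collect())

spark.stop()
```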

Hadoop and SPARK are the tools used for very high-speed data processing by various factors. How long will these tools rule the world with their updated versions and techniques? New Apache tools like FLUME, FLINK, and Kafka [80] are also available for both batch and real-time processing in big data analytics; only the techniques vary across the tools. The new FLUME tool is used to collect various logs and events from different resources and store them in HDFS with high throughput and low latency. Apache FLINK accesses huge datasets by the micro-batch method, which runs the data in a single run time with closed-loop operations, so the time to complete tasks is very low and identifying a corrupted portion of data is also easy. Another tool, Apache Kafka, is a modern tool used to handle feeds with high throughput and low latency in social media. Finally, plenty of tools are used in big data analytics for handling huge volumes of data sets with different mechanisms and approaches, and the user has to take the decision very carefully when accessing and protecting their data in the big data analytics world. This paper has covered the challenges and limitations of big data analytic tools in all aspects and has provided solutions to handle those problems in a systematic way.

References

[1] A. Katal, M. Wazid and R.H. Goudar, Big data: issues, challenges, tools and good practices. In 2013 Sixth International Conference on Contemporary Computing (IC3) (2013, August), (pp. 404–409). IEEE.
[2] N. Khan, I. Yaqoob, I.A.T. Hashem, Z. Inayat, M. Ali, W. Kamaleldin, ... and A. Gani, Big data: survey, technologies, opportunities, and challenges, The Scientific World Journal (2014), 2014.
[3] N. Elgendy and A. Elragal, Big data analytics: a literature review paper. In Industrial Conference on Data Mining (2014, July), (pp. 214–227). Springer, Cham.
[4] C.W. Tsai, C.F. Lai, H.C. Chao and A.V. Vasilakos, Big data analytics: a survey, Journal of Big Data 2(1) (2015), 21.
[5] J.F. Weets, M.K. Kakhani and A. Kumar, Limitations and challenges of HDFS and MapReduce. In 2015 International Conference on Green Computing and Internet of Things (ICGCIoT) (2015, October), (pp. 545–549). IEEE.
[6] W. Yu, Y. Wang, X. Que and C. Xu, Virtual shuffling for efficient data movement in MapReduce, IEEE Transactions on Computers 64(2) (2013), 556–568.
[7] D.P. Acharjya and K. Ahmed, A survey on big data analytics: challenges, open research issues and tools, International Journal of Advanced Computer Science and Applications 7(2) (2016), 511–518.
[8] S. Yu, Big privacy: Challenges and opportunities of privacy study in the age of big data, IEEE Access 4 (2016), 2751–2763.
[9] M.A. Wani and S. Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.
[10] A. Oussous, F.Z. Benjelloun, A.A. Lahcen and S. Belfkih, Big Data technologies: A survey, Journal of King Saud University-Computer and Information Sciences 30(4) (2018), 431–448.
[11] N. Khan, M. Alsaqer, H. Shah, G. Badsha, A.A. Abbasi and S. Salehian, The 10 Vs, issues and challenges of big data. In Proceedings of the 2018 International Conference on Big Data and Education (2018, March), (pp. 52–56).
[12] S. Kaisler, F. Armour, J.A. Espinosa and W. Money, Big data: Issues and challenges moving forward. In 2013 46th Hawaii International Conference on System Sciences (2013, January), (pp. 995–1004). IEEE.
[13] S. Kaisler, F. Armour, W. Money and J.A. Espinosa, Big data issues and challenges. In Encyclopedia of Information Science and Technology, Third Edition (2015), (pp. 363–370). IGI Global.
[14] D. Che, M. Safran and Z. Peng, From big data to big data mining: challenges, issues, and opportunities. In International Conference on Database Systems for Advanced Applications (2013, April), (pp. 1–15). Springer, Berlin, Heidelberg.
[15] A. O'Driscoll, J. Daugelaite and R.D. Sleator, 'Big data', Hadoop and cloud computing in genomics, Journal of Biomedical Informatics 46(5) (2013), 774–781.
[16] Y. Demchenko, C. Ngo and P. Membrey, Architecture framework and components for the big data ecosystem, Journal of System and Network Engineering 4(7) (2013), 1–31.
[17] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I.Z. Khalil and A. Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2(3) (2014), 267–279.
[18] Y. Arfat, S. Usman, R. Mehmood and I. Katib, Big Data for Smart Infrastructure Design: Opportunities and Challenges. In Smart Infrastructure and Applications (2020), (pp. 491–518). Springer, Cham.
[19] S.U. Ahsaan, H. Kaur and S. Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.
[20] A. Mohamed, M.K. Najafabadi, Y.B. Wah, E.A.K. Zaman and R. Maskat, The state of the art and taxonomy of big data analytics: view from new big data framework, Artificial Intelligence Review 53(2) (2020), 989–1037.
[21] Y. Arfat, S. Usman, R. Mehmood and I. Katib, Big Data Tools, Technologies, and Applications: A Survey. In Smart Infrastructure and Applications (2020), (pp. 453–490). Springer, Cham.
[22] P.L.S. Kumari, Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.
[23] A. Jaiswal, V.K. Dwivedi and O.P. Yadav, Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.
[24] A. Sharma, G. Singh and S. Rehman, A Review of Big Data Challenges and Preserving Privacy in Big Data. In Advances in Data and Information Sciences (2020), (pp. 57–65). Springer, Singapore.
[25] S. Riaz, M.U. Ashraf and A. Siddiq, A Comparative Study of Big Data Tools and Deployment Platforms. In 2020 International Conference on Engineering and Emerging Technologies (ICEET) (2020, February), (pp. 1–6). IEEE.
[26] N.K. Gupta and M.K. Rohil, Big Data Security Challenges and Preventive Solutions. In Data Management, Analytics and Innovation (2020), (pp. 285–299). Springer, Singapore.
[27] D.K. Tayal and K. Meena, A new MapReduce solution for associative classification to handle scalability and skewness in vertical data structure, Future Generation Computer Systems 103 (2020), 44–57.
[28] P. Abimbola, A. Sanga and S. Mongia, Hadoop Framework Ecosystem: Ant Solution to an Elephantic Data. (2019), Available at SSRN 3463635.
[29] R. Kashyap, Big Data Analytics Challenges and Solutions. In Big Data Analytics for Intelligent Healthcare Management (2019), (pp. 19–41). Academic Press.
[30] S.U. Ahsaan, H. Kaur and S. Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.
[31] P.L.S.K. Kaur and V. Bharti, A Survey on Big Data: Its Challenges and Solution from Vendors. In Big Data Processing Using Spark in Cloud (2019), (pp. 1–22). Springer, Singapore.
[32] P.L.S. Kumari, Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.
[33] M.A. Wani and S. Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.
[34] I. Anagnostopoulos, S. Zeadally and E. Exposito, Handling big data: research challenges and future directions, The Journal of Supercomputing 72(4) (2016), 1494–1516.
[35] G. Kapil, A. Agrawal and R.A. Khan, Big Data Security challenges: Hadoop Perspective, International Journal of Pure and Applied Mathematics 120(6) (2018), 11767–11784.
[36] M. Li, Z. Liu, X. Shi and H. Jin, ATCS: Auto-Tuning Configurations of Big Data Frameworks Based on Generative Adversarial Nets, IEEE Access 8 (2020), 50485–50496.
[37] E. Mohamed and Z. Hong, Hadoop-MapReduce job scheduling algorithms survey. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD) (2016, November), (pp. 237–242). IEEE.
[38] J. Wang, X. Zhang, J. Yin, R. Wang, H. Wu and D. Han, Speed up big data analytics by unveiling the storage distribution of sub-datasets, IEEE Transactions on Big Data 4(2) (2016), 231–244.
[39] S.M. Nabavinejad, M. Goudarzi and S. Mozaffari, The memory challenge in reduce phase of MapReduce applications, IEEE Transactions on Big Data 2(4) (2016), 380–386.
[40] U. Sivarajah, M.M. Kamal, Z. Irani and V. Weerakkody, Critical analysis of Big Data challenges and analytical methods, Journal of Business Research 70 (2017), 263–286.
[41] S. Dolev, P. Florissi, E. Gudes, S. Sharma and I. Singer, A survey on geographically distributed big-data processing using MapReduce, IEEE Transactions on Big Data 5(1) (2017), 60–80.
[42] Y. Guo, J. Rao, D. Cheng and X. Zhou, iShuffle: Improving Hadoop performance with shuffle-on-write, IEEE Transactions on Parallel and Distributed Systems 28(6) (2016), 1649–1662.
[43] K. Wang, Q. Zhou, S. Guo and J. Luo, Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey, IEEE Communications Surveys & Tutorials 20(4) (2018), 3560–3580.
[44] S.S.R.P. Time, Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey.
[45] M. Hajeer and D. Dasgupta, Handling big data using a data-aware HDFS and evolutionary clustering technique, IEEE Transactions on Big Data 5(2) (2017), 134–147.
[46] N.S. Dey and T. Gunasekhar, A comprehensive survey of load balancing strategies using Hadoop queue scheduling and virtual machine migration, IEEE Access 7 (2019), 92259–92284.
[47] Q. Chen, J. Yao, B. Li and Z. Xiao, PISCES: Optimizing Multi-Job Application Execution in MapReduce, IEEE Transactions on Cloud Computing 7(1) (2016), 273–286.
[48] R.H. Hariri, E.M. Fredericks and K.M. Bowers, Uncertainty in big data analytics: survey, opportunities, and challenges, Journal of Big Data 6(1) (2019), 44.
[49] G. Yu, X. Wang, K. Yu, W. Ni, J.A. Zhang and R.P. Liu, Survey: Sharding in blockchains, IEEE Access 8 (2020), 14155–14181.
[50] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García and F. Herrera, Dimensionality Reduction for Big Data. In Big Data Preprocessing (2020), (pp. 53–79). Springer, Cham.
[51] J. Luengo, D. García-Gil, S. Ramírez-Gallego, S. García and F. Herrera, Imbalanced Data Preprocessing for Big Data. In Big Data Preprocessing (2020), (pp. 147–160). Springer, Cham.
[52] A. Chugh, V.K. Sharma and C. Jain, Big Data and Query Optimization Techniques. In Advances in Computing and Intelligent Systems (2020), (pp. 337–345). Springer, Singapore.
[53] S. Vengadeswaran and S.R. Balasundaram, CLUST: Grouping Aware Data Placement for Improving the Performance of Large-Scale Data Management System. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (2020), (pp. 1–9).
[54] M. Naisuty, A.N. Hidayanto, N.C. Harahap, A. Rosyiq, A. Suhanto and G.M.S. Hartono, Data protection on Hadoop distributed file system by using encryption algorithms: a systematic literature review. In Journal of Physics: Conference Series (2020, January), (Vol. 1444, No. 1, p. 012012). IOP Publishing.
[55] M.H. Mohamed, M.H. Khafagy and M.H. Ibrahim, Recommender Systems Challenges and Solutions Survey. In 2019 International Conference on Innovative Trends in Computer Engineering (ITCE) (2019, February), (pp. 149–155). IEEE.
[56] I.A.T. Hashem, N.B. Anuar, A. Gani, I. Yaqoob, F. Xia and S.U. Khan, MapReduce: Review and open challenges, Scientometrics 109(1) (2016), 389–422.
[57] N.M. Elzein, M.A. Majid, I.A.T. Hashem, I. Yaqoob, F.A. Alaba and M. Imran, Managing big RDF data in clouds: Challenges, opportunities, and solutions, Sustainable Cities and Society 39 (2018), 375–386.
[58] S. Pouyanfar, Y. Yang, S.C. Chen, M.L. Shyu and S.S. Iyengar, Multimedia big data analytics: A survey, ACM Computing Surveys (CSUR) 51(1) (2018), 1–34.
[59] M.S. Al-kahtani and L. Karim, Designing an Efficient Distributed Algorithm for Big Data Analytics: Issues and Challenges, International Journal of Computer Science and Information Security (IJCSIS) 15(11) (2017).
[60] Z. Lv, H. Song, P. Basanta-Val, A. Steed and M. Jo, Next-generation big data analytics: State of the art, challenges, and future research topics, IEEE Transactions on Industrial Informatics 13(4) (2017), 1891–1899.
[61] P. Basanta-Val and M. García-Valls, A distributed real-time java-centric architecture for industrial systems, IEEE Transactions on Industrial Informatics 10(1) (2013), 27–34.
[62] P. Basanta-Val, N.C. Audsley, A.J. Wellings, I. Gray and N. Fernández-García, Architecting time-critical big-data systems, IEEE Transactions on Big Data 2(4) (2016), 310–324.
[63] Q. Liu, W. Cai, J. Shen, X. Liu and N. Linge, An adaptive approach to better load balancing in a consumer-centric cloud environment, IEEE Transactions on Consumer Electronics 62(3) (2016), 243–250.
[64] A. Montazerolghaem, M.H. Yaghmaee, A. Leon-Garcia, M. Naghibzadeh and F. Tashtarian, A load-balanced call admission controller for IMS cloud computing, IEEE Transactions on Network and Service Management 13(4) (2016), 806–822.
[65] J. Zhao, K. Yang, X. Wei, Y. Ding, L. Hu and G. Xu, A heuristic clustering-based task deployment approach for load balancing using Bayes theorem in a cloud environment, IEEE Transactions on Parallel and Distributed Systems 27(2) (2015), 305–316.
[66] A.K. Singh and J. Kumar, Secure and energy-aware load balancing framework for cloud data center networks, Electronics Letters 55(9) (2019), 540–541.
[67] D. Shen, J. Luo, F. Dong and J. Zhang, Virto: joint coflow scheduling and virtual machine placement in cloud data centers, Tsinghua Science and Technology 24(5) (2019), 630–644.
[68] M. Bhattacharya, R. Islam and J. Abawajy, Evolutionary optimization: a big data perspective, Journal of Network and Computer Applications 59 (2016), 416–426.
[69] Q. Chen, C. Liu and Z. Xiao, Improving MapReduce performance using a smart speculative execution strategy, IEEE Transactions on Computers 63(4) (2013), 954–967.
[70] H. Wang, Z. Xu, H. Fujita and S. Liu, Towards felicitous decision making: An overview on challenges and trends of Big Data, Information Sciences 367 (2016), 747–765.
[71] A.G. Shoro and T.R. Soomro, Big data analysis: Apache Spark perspective, Global Journal of Computer Science and Technology (2015).
[72] M. Zaharia, R.S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, ... and A. Ghodsi, Apache Spark: a unified engine for big data processing, Communications of the ACM 59(11) (2016), 56–65.
[73] S. Salloum, R. Dautov, X. Chen, P.X. Peng and J.Z. Huang, Big data analytics on Apache Spark, International Journal of Data Science and Analytics 1(3-4) (2016), 145–164.
[74] M.P. Kumar and S. Pattern, Security Issues in Hadoop Associated With Big Data.
[75] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri and E. Nguifo, A comparative study on streaming frameworks for big data (2018, August).
[76] Z. Yu, Z. Bei and X. Qian, Data size-aware high dimensional configurations auto-tuning of in-memory cluster computing. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018, March), (pp. 564–577).
[77] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, ... and I. Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12) (2012), (pp. 15–28).
[78] M. Li, J. Tan, Y. Wang, L. Zhang and V. Salapura, SparkBench: a comprehensive benchmarking suite for in-memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (2015, May), (pp. 1–8).
[79] D. Agrawal, A. Butt, K. Doshi, J.L. Larriba-Pey, M. Li, F.R. Reiss and Y. Xia, SparkBench: a Spark performance testing suite. In Technology Conference on Performance Evaluation and Benchmarking (2015, August), (pp. 26–44). Springer, Cham.
[80] A. Jaiswal, V.K. Dwivedi and O.P. Yadav, Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.