You are on page 1of 21

Overview

Scalable machine-learning
algorithms for big data analytics:
a comprehensive review
Preeti Gupta,1* Arun Sharma1 and Rajni Jindal2

Big data analytics is one of the emerging technologies as it promises to provide


better insights from huge and heterogeneous data. Big data analytics involves
selecting the suitable big data storage and computational framework augmented
by scalable machine-learning algorithms. Despite the tremendous buzz around
big data analytics and its advantages, an extensive literature survey focused on
parallel data-intensive machine-learning algorithms for big data has not been
conducted so far. The present paper provides a comprehensive overview of vari-
ous machine-learning algorithms used in big data analytics. The present work is
an attempt to identify the gaps in the work already performed by researchers,
thus paving the way for further quality research in parallel scalable algorithms
for big data. © 2016 John Wiley & Sons, Ltd
How to cite this article:
WIREs Data Mining Knowl Discov 2016. doi: 10.1002/widm.1194

INTRODUCTION with optimal utilization of computational resources. It


is not the case, however, that earlier parallel distributed

T he advent of various technologies such as Social


Networks and Internet of Things has led to the
generation of a global network for sharing and colla-
mechanisms of processing did not exist. Mechanisms
such as MPI (Message Passing Interface) were available
in the past literature. However, the major advantage of
borating information. The usage and influence of dig- big data frameworks such as MapReduce and Spark
ital media, especially social media, has grown in every over other parallel distributed system is that it relieves
sphere of our life. This proliferation has generated a the programmer from the burden of code and data dis-
huge amount of data, creating a data deluge. It is diffi- tribution, replication, machine failures, and load balan-
cult to effectively extract useful information from all cing, which is taken care of by the underlying runtime
the available online information due to the volume and system automatically. This is the main reason why big
variety (structured/semistructured/unstructured) of data analytics has drawn so much attention from aca-
data. However, the foundation of a good analytical demicians and researchers. It allows the collecting or
framework relies totally on the quality of data; the crowding sense of data from multiple heterogeneous
richer the dataset, the better the inferences. The over- sources and aids in extracting useful insights more eas-
whelming amount of data necessitates mechanisms for ily from the huge and heterogeneous datasets.
efficient information filtering and processing. Distribu- A large amount of research is being conducted in
ted parallelism is the default choice for processing volu- the evolving and multifaceted area of big data analyt-
minous datasets as it enables faster execution of the job ics. However, despite the tremendous interest in big
data analytics and its advantages, an extensive litera-
*Correspondence to: preeti.cool80@gmail.com ture survey focused on parallel data-intensive machine-
1
Department of IT, Indira Gandhi Delhi Technical University for
learning algorithms for big data has not been con-
Women (IGDTUW), New Delhi, India ducted so far. This has encouraged us to provide a
2
Department of CSE, Delhi Technological University (DTU), Delhi, comprehensive survey of distributed parallel machine-
India learning algorithms for big data along with various
Conflict of interest: The authors have declared no conflicts of inter- optimization and performance metrics for evaluation.
est for this article. This is also the motivation of our paper. In the present

© 2016 John Wiley & Sons, Ltd 1


Overview wires.wiley.com/dmkd

paper, an overview of big data analytics is provided. criteria employed by authors for deciding on research
The task of big data analytics is divided into two paper inclusions are represented diagrammatically in
phases. The first phase involves setting the stage for big Figure 1. The relevant material was scattered across
data processing and includes the selection of big data various journals, conferences, and book chapters.
storage and big data computational framework. The The online databases that were referred for extract-
second phase involves the designing and implementa- ing relevant literature were IEEE, Science Direct,
tion of machine-learning tools for extracting insights Hindawi ACM Digital Library, Springer, IGI global,
from the data. The paper is divided into seven sections. and Taylor & Francis Online. The online databases
The first section starts with the general introduction of were searched based on descriptors such as ‘big
the topic. The second section outlines the employed data,’ ‘big data analytics,’ ‘big data challenges,’ ‘big
research methodology, highlighting the basis on which data storage,’ ‘big data and machine learning,’
we selected the research papers for study and the cri- ‘machine learning algorithms employing
teria for their inclusion in our work. The third MapReduce,’ ‘distributed parallel algorithm &
section provides an introduction to big data and big MapReduce,’ and ‘parallel machine learning algo-
data analytics along with various challenges of big data rithms employing Spark,’ which produced various
analytics. The fourth section covers details regarding journal and conference papers from the year 2008
big data infrastructure. The fifth section provides to 2016.
details and a comparative analysis of the various The literature selection was divided into two
distributed data-intensive parallel machine-learning categories. One category included the papers related
algorithms for big data. The sixth section provides to big data infrastructure, and the other included
some limitations of our research. The seventh scalable machine-learning algorithms for big data.
section presents the research implication of our The full text of each article was reviewed to examine
research. The final section contains concluding remarks the applicability of the article for the review, and
and the future scope of our work. only the relevant articles were included. The papers
that have achieved parallelism using vertical scaling
were rejected. Papers that included distributed paral-
RESEARCH METHODOLOGY lel programming without MapReduce or Spark were
Objective also rejected.
The strong need of extracting relevant information
from a data heap containing humungous and heteroge- Key Factors for Evaluation
neous data demands scalable and decentralized analy- On the basis of the selected papers, several optimiza-
sis techniques. This is where big data analytics steps tion and evaluation metrics are used to perform a
in. It helps researchers and data scientists in extracting comparative analysis of scalable machine-learning
useful insights from the data deluge. Traditional algorithms for big data. The optimization metrics
machine-learning algorithms need to be adapted to the used are sampling, indexing, intelligent partitioning,
big data realm for handling big data scalability. The special data structure, in-memory computation, and
objective of this paper is to provide a solid base for asynchronous execution. The performance metrics
scalable machine-learning algorithms for big data ana- used are number of MapReduce jobs, speedup, accu-
lytics. The study covers an overview regarding big data racy, and scalability. The performance metrics also
storage and big data computational frameworks along include the data source validation check and the
with a survey of distributed parallel data-intensive comparative analysis of the algorithm with the base
algorithms for big data. Although this search may not technique, which are the effectiveness indicators of
be completely exhaustive, it provides a good compre- the assumption and results made in the algorithm.
hensive base for understanding distributed scalable Scalable machine-learning algorithms were evaluated
machine-learning algorithms in the emergent and pro- on these metrics to perform a comparative analysis.
found field of big data analytics. The paper also pro-
vides a comparative analysis of scalable machine-
learning algorithms on various metrics. BIG DATA ANALYTICS
AND ITS CHALLENGES
Search Process Big data analytics has led to data explosion where
This paper is a qualitative review study based on the data from sensors, devices, and networks are col-
authors’ perception from the study of various papers lected and stored without discarding anything from
related to big data analytics. Literature selection it. The data are flowing at very high rate, and the

2 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

Online repository
(IEEE electronic library, science direct, ACM digital library,
springer, IGI global, hindwai and taylor and francis online)

Descriptor (‘big data’ , ‘big data analysis’ , ‘‘big data


challenges’ , ‘big data storage’ , ‘big data and machine learning’ ,
‘machine learning algorithms employing mapreduce’ ,
‘Distributed parallel algorithm and mapreduce’ and ‘parallel
machine learning algorithms employing spark’)

Related to
big data
analytics No

Yes

No Related to big Related to big No


data data-parallel
infrastructure machine learning

Yes
Yes

Employing No
horizontal
scaling

Yes

Review full text

Yes
Included in survey Employing big data
(2008–2016) computational framework

No

F I G U R E 1 | Literature selection process.

© 2016 John Wiley & Sons, Ltd 3


Overview wires.wiley.com/dmkd

amount of data generated is humongous and hetero- Big data scalable infrastructure needs to be aug-
geneous in nature. These features account for the mented with machine-learning tools in order to
3Vs of big data, which are Volume, Variety, and extract insights from the data sea. Big data is thus
Velocity. Researchers have proposed additional fea- broad spectrum comprising evolving techniques and
tures or Vs of big data. Owais and Hussein1 have tools in order to handle data-intensive jobs. Tradi-
proposed the 9Vs of big data, comprising Veracity, tional algorithms need to exploit parallelism in order
Value, Visualization, Volatility, Variability, and to efficiently handle big data problems. Traditional
Validity, in addition to the 3Vs, which refers to trust- data analysis algorithms need to be adapted to the
worthiness, significance, graphical representation, big data environment to make them efficient for
retention policy, consistency, and accuracy of data, handling high-volume, heterogeneous, and dynamic
respectively. These Vs must be considered while data. The major challenges imposed by big data on
leveraging big data usage for any domain. Big data data analytics are5,6:
analytics of sensor log or sentiment data may help in
extracting useful or significant insights from huge (a) Scalability: Big data collected from various
data, which was previously unavailable. However, sensors would be voluminous and evolving in
sentiment data or log data may be collected from nature. The data analysis algorithm for big
various varied sources, making it important to con- data thus needs to take into account large or
sider the trustworthiness, consistency, and accuracy complex datasets. They should be scalable
of sources to produce correct and relevant output. enough to handle a growing amount or veloc-
Just collecting all the data without extracting ity of data gracefully.
something interesting is of no use, and this is where (b) Decentralization: The traditional data analysis
big data analytics comes into the picture. Wu et al.2 was designed with a view that they will be run
have defined big data analytics as the process of on a single machine on a limited dataset.
examining big data to uncover hidden patterns, However, sensor data may be spread across
unknown correlations, and other useful information multiple machines or clusters, for which cen-
that can be used to make better decisions. Big data tralized solutions will not work. Thus, big
analytics has been employed by many communities, data algorithms need to be distributed in order
such as government agencies, security, e-commerce, to handle the voluminous nature of big data.
and health organizations.3
(c) Dynamic datasets: The traditional algorithms
The biggest challenge in the big data era is to
were designed to operate on static data and
devise ways to analyze and visualize large datasets
were not able to handle data on the fly or
as just collecting big data does not make it good.
streaming data. However, sensor data are evol-
Parallelism is the default choice for processing volu-
ving in nature and are generated at a very fast
minous dataset as it enables faster execution of the
rate, for which real-time analysis may be
job with optimal utilization of computational
required. Big data algorithms need to operate
resources. Parallelism could be achieved by either
on a dynamic dataset and may also require the
vertical scaling or horizontal scaling. Vertical scaling
learning process to be fast in order to carry
involves increasing the processing power and stor-
out real-time analysis. A dynamic dataset may
age of a single machine, whereas horizontal scaling
also need to exploit a dynamic graph for visu-
involves the distribution of load across multiple sys-
alization, projecting real-time scenarios.
tems. Vertical scaling could be achieved by techni-
ques such as High-Performance Computing Clusters (d) Nonuniform Dataset: The dataset for big data
(HPCC), Graphics Processing Unit (GPU), and mul- may come from multiple disparate sources com-
ticore processors.4 Big data may consist of tera or prising unstructured and heterogeneous data. It
peta bytes or even more data, for which vertical may involve a collection of data from various
scaling may not be sufficient, and horizontal scaling sources, such as web logs, social media, or sen-
would be the best fit. Horizontal scaling distributes sor data, which will have different data formats.
the task among distributed nodes to achieve distrib- Thus, big data algorithms need to handle nonu-
uted processing. Traditional horizontal scaling niform datasets with a lot of noise also.
includes a peer-to-peer network, with MPI as the (e) Privacy: Big data analysis has been widely
communication mechanism.4 However, due to self- adopted in various industries such as health-
managing and a fault-tolerant file system, the care due to its potential in harnessing relevant
default data store for big data is the Hadoop Dis- insights. However, big data has a high volume
tributed File System (HDFS). of data collected from various sources that

4 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

raises more privacy concerns for the people Consequently, a big data infrastructure needs
involved in the data. The traditional methods distributed file system (DFS), Not Only SQL
for protecting data privacy will not work on (NoSQL) databases, computational frameworks pro-
dynamic and huge big data. viding storage, and analytical scalability.

In order to handle these challenges, the realm of big


data analytics in this paper has been divided into two
phases, big data infrastructure and big data inference Distributed File System
engine, as shown in Figure 2. The first phase involves DFS provides scalable data storage and access for
setting the stage for big data processing and involves handling big data. DFS enables parallel processing
the selection of big data storage and big data compu- on a large amount of data. HDFS has become the
tational frameworks, which have been discussed in default choice for big data tasks due to its fault toler-
detail in the fourth section. The second phase ance and its ability to be deployed on low-
involves the designing and implementation of commodity hardware.9 HDFS allows users to store
machine-learning tools in order to extract insights data in files supporting a wide variety of data. The
from the data sea, which have been discussed in files are further split into blocks. The data blocks are
detail in the fifth section. distributed and replicated across nodes, providing
fault tolerance.10 HDFS supports write-once-read-
many semantics on files.9 The data block size is usu-
BIG DATA INFRASTRUCTURE ally 64 MB and is replicated into three copies by
Big data collection, storage, and analysis demand default.11 A HDFS cluster operates in a master–slave
data decentralization and provide a new strategy for architecture.9 It consists of one master NameNode
the traditional data warehousing approach. It fits and a number of DataNodes. NameNode manages
into the MAD model.7 MAD refers to the Magnetic, the filesystem namespace and controls access to the
Agile, and Deep analysis of data. Big data allows the files. NameNode also maps the data blocks to Data-
capture and storage of anything as is, without any Nodes. DataNode creates stores and retrieves data
transformation, acting as a magnet attracting any blocks as per the NameNode’s direction.11
kind of data. It needs agile infrastructure that is fully HDFS is not a database; it is just a filesystem.
scalable to handle data evolution of a varied nature. HDFS is provided as a part of the Hadoop frame-
It needs to provide a deep data repository with deep work. Hadoop is a library of tools for big data. It
analytical capabilities. It has not replaced Data Ware- consists of Hadoop Common Utilities, HDFS, and
house; instead, it has provided an edge to the overall MapReduce. After the introduction of Hadoop 2.0,
strategy by offering schema on query rather than YARN (Yet Another Resource Negotiator) has been
schema on design.8 The Extract, Transform, and added to the Hadoop framework.11 YARN has sepa-
Load (ETL) approach has been transformed into rated the resource management functions from the
ELT, i.e., Extract, Load, and Transform.7 programming mode, which has brought significant

Big Data Big Data


Models Big Data Scalable
Distributed
Computational Machine
File System
Framework Learning
Algorithms

Phase 1: Big Data Infrastructure Phase 2: Big Data Inference Engine

F I G U R E 2 | Big data analytics.

© 2016 John Wiley & Sons, Ltd 5


Overview wires.wiley.com/dmkd

performance improvements such as support of other distributed data storage, large-scale batch-
processing models besides MapReduce.11 oriented data processing, and exploratory and
A number of other alternatives of HDFS are predictive analytics. Some of the well-known
also available, such as Ceph, Tachyon, GlusterFS, implementations of wide-column store are Big-
QFS, GridGain, and XtreemFS.12 table, Cassandra SimpleDB, and DynamoDB
(c) Document-oriented store: It stores the data as
Data Models for Big Data the collections of the documents. These docu-
Big data has led to an explosion of databases due to ments are encoded in a standard data
the inability of traditional databases to scale-up with exchange format such as XML, JSON, or
big data. The big data models could be categorized into BSON. The value column in a document store
NoSQL and NewSQL data models, which could be contains semistructured data and may contain
used in place of, or along with, traditional RDBMS. a large number of attributes. The number and
CAP theorem states that any networked shared-data type of these attributes may vary among rows.
system can have, at most, two of the three desirable Document store allows searching through both
properties of consistency, high availability of data, and keys and values, unlike a key-value store,
tolerance to network partitions.13 A distributed system which allows searching only by key. Document
cannot lose on tolerance to network partitions, leaving stores are good for the storage of literal docu-
a choice between consistency and availability.13 ments such as text documents, email messages,
RDBMS systems are more focused on consistency and XML documents. They are also good for
rather than availability and strictly comply with ACID storing sparse data. Some of the well-known
(Atomicity, Consistence, Isolation, Durability) proper- implementations of document-oriented store
ties. NoSQL, on the other hand, are focused on availa- are CouchDB(JSON) and MongoDB (BSON).
bility rather than consistency and comply with BASE (d) Graph data model. These models use struc-
(Basically Available, Soft-state, Eventually consistent) tured relational graphs of interconnected key-
properties. RDBMS handles both data storage and value pairs. Data is stored in the form of a
data management, whereas NoSQL databases separate graph having nodes and links, with attributes
both. NoSQL databases focus only on high- associated with them. Graph data models are
performance scalable data storage, and data manage- useful when relationships between data are
ment tasks are handled by the application layer.10 more important than the data itself, such as
NoSQL supports schema-free storage or dynamic social networks and recommendations. Some
changing schema, enabling applications to upgrade of the well-known implementations of graph
table structure without table rewriting.10 data store are Neo4j, InfoGrid, and GraphDB.
NoSQL are horizontal scalable databases that
can handle a variety of big data and are categorized NewSQL is another type of big data model and is
into four models. They are key-value store, wide- known as the next generation scalable relational data-
column store, document-oriented store, and graph base management systems.16 NewSQL implements the
database.14,15 best of both traditional databases and NoSQL data-
bases. NewSQL is based on relational model and pro-
(a) Key-value store: These systems store data in vides scalability of NoSQL while still maintaining
key-value pairs where value contains data that ACID properties.16 NewSQL supports SQL query.
could be retrieved with the help of key. The However, the underlying architecture of NewSQL is
value may be a string or complex lists or sets. different from RDBMS and supports scaling techni-
The search can only be performed against keys ques, such as vertical partitioning, horizontal partition-
and is limited to exact matches. The key-value ing, or sharding and clustering.16 Some well-known
store is suitable for the fast retrieval of values, implementations of NewSQL are NuoDB and VoltDB.
such as retrieving product names or user pro- These data models differ significantly from each
files. Some of the well-known implementations other and may be selected based on the
of a key-value store are Dynamo, Voldemort, underlying task.
Redis, and Riak.
(b) Wide-column store: These systems employ a
distributed, column-oriented data structure Big Data Computational Frameworks
that accommodates multiple attributes per There are a number of parallel distributed computa-
key. The wide-column store is suitable for tion models for big data. The choice of these big data

6 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

models depends mainly on the underlying domain or network of spouts and bolts. A spout constitutes a
application. These parallel processing models could source stream, such as input from a Twitter stream,
be further categorized into batch mode, real time, whereas a bolt consists of processing and computa-
and hybrid mode.17 tional logic, such as joins and aggregation.20 The
major difference between storm topologies and
Batch Mode MapReduce jobs as identified by Sundaresan and
In this mode, the data collected are divided into Kandavel17 is that storm topologies keep on execut-
batches and processed in parallel by multiple nodes. ing due to real-time date ingestion of stream data,
MapReduce is a distributed framework which pro- whereas MapReduce jobs eventually finish.
vides data parallelism for processing large datasets Various other storm alternatives, such as Tri-
and is the best fit in this category. MapReduce is dent and S4, have also been proposed in literature.17
based on shared nothing architecture where each
processing node is self sufficient to carry out its task.
Map and reduce are its two main operators. The Hybrid Model
input dataset is divided into data chunks that are A hybrid model can handle both batch data and real-
operated by the mapper to compute intermediate time data. Spark is the most deployed big data hybrid
values. Similar intermediate values are grouped based processing model. It does not have its own file system
on the common key. The reduce function then takes and usually runs on top of HDFS. It employs in-
these grouped values to produce an aggregated result. memory computation, reducing read and write disk
MapReduce fits into SIMD architecture suitable for operations. The main building block in Spark is the
embarrassingly parallel problems. MapReduce jobs Resilient Distributed Dataset (RDD), which provides
take the input from a disk, and all the results, includ- in-memory data caching of intermediate results and
ing intermediate results, are written to disk every fault tolerance without replication.21 Spark also
time, incurring a significant overhead for iterative offers microbatch streaming in which an incoming
jobs in terms of I/O cost and network bandwidth.18 stream is divided into different chunks to be pro-
MapReduce’s major advantage over other parallel cessed by the batch system.22
distributed systems is that it relieves the programmer A comparison of these three big data computa-
from the burden of code and data distribution, repli- tion models is presented below in Table 1.
cation, machine failures, and load balancing, which All the computation models mentioned above
is taken care by the underlying runtime system auto- are parallel and have their own pros and cons. The
matically. Hadoop is the open-source implementation decision to select the best model depends on the
of MapReduce, which also provides its own DFS, domain/application and processing time, i.e., latency.
HDFS. Doulkeridis and Nørvåg19 have pointed out At one end is MapReduce, which is a batch-
various weaknesses and limitations of MapReduce, processing framework, and on the other end is
such as high communication cost, lack of iteration, Storm, which is a real-time processing framework.
and lack of real-time processing. The popularity of Marz and Waren23 have proposed a lambda architec-
MapReduce has now faded mainly due to a high ture that combines different technologies. It combines
overhead for iterative tasks in machine learning.20 the process of batch analytics and real time stream
Several big data projects have been proposed in analytics. Lambda architecture decomposes the prob-
literature that addresses the shortcomings of MapRe- lem into three layers, as shown in Figure 3. The three
duce while still maintaining scalability and fault tol- layers are batch layer, speed layer, and serving layer.
erance capability, such as Incoop, IncMR, Haloop, The data are sent both to the batch layer and the
Twister, iMapReduce, Continous MapReduce, speed layer as soon as it arrives. The batch layer
Shark, Dremel, and Scalding.17,19 appends the new data to the already existing dataset
and performs the computation on it, which is then
Real-Time Processing passed to the serving layer for indexing. The speed
Real-time processing involves processing data or layer compensates the latency in the output of batch
stream as it arrives and produces output almost in layer by providing real-time updates of the new data.
real time. The programming model thus employed The service layer merges the output of both layers to
for batch processing is not suitable for real-time pro- provide the federated view. Hybrid models such as
cessing. Storm is the most deployed computation sys- Spark fit well into lambda architecture, seamlessly
tem for the real-time processing of unbounded integrating both the modes. Stream processing of
streams with a very low processing latency.17 Storm click streams followed by batch processing for the
architecture consists of topologies that are the correlation of clicks is one such example. Vizard24

© 2016 John Wiley & Sons, Ltd 7


Overview wires.wiley.com/dmkd

TABLE 1 | Comparison of Big Data Computation Models


Batch Processing Real Time Hybrid (Batch + Real time)
Popular big data parallel MapReduce Storm Spark
model/framework
Data processing model Data parallel batch processing Real-time task parallel Data parallel generalized batch
model computation model processing model with
microbatch streaming
Big data data structure Key-value pairs having functions Topology consisting of spouts RDD, data frames
and bolts
Big data dimensions Volume Velocity Volume, velocity
Characteristics Fault tolerant, scalable, latency Fault tolerant, scalable, latency Fault tolerant, scalable, in-
of few minutes for few subseconds, suitable memory computation,
for real-time processing latency of seconds, suitable
for near real-time task, RDD
makes them suitable for
iterative tasks
Limitations Selective access to data, high Lack of iteration and graph Selective access to data, Lack
communication cost, lack of processing support of interactive or real-time
iteration, lack of interactive, or processing, Insufficient main
real-time processing memory may result in
reduced performance
Applications Forensics, log analysis, indexing Online machine learning, e- Recommendation
commerce, Security event (Ad Placement), Fraud
monitoring Detection
RDD, resilient distributed dataset.

Batch layer
for batch analytics Ba
tch
(Hadoop, Spark) vie
w
Service layer

Federated view
New
data
Speed layer for real iew
ev
l tim
time analytics Rea
(Storm, Spark)

FI GU RE 3 | Lambda architecture.

has pointed out that Spark has almost replaced been used for a long time for scalability and speedup.
MapReduce and has become the default standard. However, they all suffer from memory and processor
distribution challenges. MapReduce provides the
automatic handling of data distribution, load balan-
BIG DATA INFERENCE ENGINE cing, and fault tolerance, allowing easier and faster
scalability of parallel systems.
Big data scalable infrastructure needs to be aug-
mented with machine-learning tools in order to
extract inference and insights from the data sea. The
machine-learning algorithms for big data are Scalable Machine-learning Algorithms
required to fit into the distributed parallel program- for Big Data
ming paradigms for handling the inherent scalability Several parallel machine-learning algorithms are pro-
of big data. Parallel and distributed techniques have posed for big data, which employs MapReduce or

8 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

other variants of MapReduce. This section presents a points per iteration with the help of mappers in par-
literature review of data-intensive parallel machine- allel, which are sent to reducers. The process is
learning algorithms to see how they are using repeated for O(log n) iterations, resulting in O(klogn)
MapReduce or Spark to achieve the desired objec- points as candidates. These candidates’ points are
tive. Several articles have also achieved data parallel- used to form k clusters using k-means++. Authors
ism with the help of MapReduce and GPUs. have showed that k-means|| leads to faster conver-
However, the use of GPUs may not be fully scalable gence time for iteration.
to big data, due to which they are not considered in Density-based clustering algorithms find the
the literature review. clusters based on the region of density. It identifies
The algorithms are categorized into various dense clusters of points, allowing it to learn clusters of
machine-learning classes, which include unsupervised arbitrary shapes, and helps in identifying the outliers.
learning, supervised learning, semi-supervised learning, Unlike k-means, they do not need to know the num-
and reinforcement learning, including deep learning. ber of clusters in advance. DBSCAN is a well-known
density-based clustering algorithm and has been
Unsupervised Learning employed in a number of applications, such as image
Unsupervised learning finds hidden structure in unla- processing pattern recognition and location-based
beled data. Clustering is one of the important forms service. He et al.29 have proposed MR-DBSCAN, a
of unsupervised learning. Clustering partitions the MapReduce version of DBSCAN. MR-DBSCAN con-
datasets into clusters or groups such that intracluster sists of three stages. In the first stage, the dataset is
similarity between the data points is maximum, and partitioned based on computational cost estimation to
intercluster similarity is minimum. Clustering has a achieve load balancing for heavily skewed data. In the
number of applications, such as customer segmenta- second stage, local clustering, i.e., sequential DBSCAN
tion, document retrieval, image segmentation, and is performed in parallel by mappers on all partitions.
pattern classification. Several clustering algorithms In the third stage, the partial results generated are
have been proposed for the MapReduce framework. aggregated by reducers to generate global clusters.
A brief description of some of them is presented Cludoop is also a density-based clustering algorithm
below. proposed by Yu et al.30 that incorporates CluC as the
K-means clustering has been one of the most serial clustering algorithm utilized by parallel map-
popular clustering algorithms. Zhao et al.25 proposed pers. CluC utilizes the relationships of connected cells
a parallel version of the k-means (PK-means) algorithm around points, instead of an expensive completed
based on MapReduce. Distance calculation is the most neighbor query, significantly reducing the number of
expensive task in the k-means algorithm and could be distance calculations.
executed independently for various data points in par- Hierarchical clustering aims to make a hierar-
allel. Authors have achieved speedup by employing chy of clusters either using divisive clustering, which
MapReduce for parallelism. PK-means uses three func- is a top–down approach, or agglomerative clustering,
tions, map, combine, and reduce. The map function which is a bottom–up approach. Single-linkage hier-
assigns each data point to its closest center. The com- archical clustering (SHC) is an example of agglomer-
bine function then aggregates the intermediate results ative clustering. Achieving parallelism for such an
of the map function and passes it to the reducer. The algorithm is tricky due to inherent data dependency.
reduce function updates the centroids. This MR cycle Jin et al.31 have proposed a Spark-based parallel
is performed iteratively until termination condition has SHC algorithm, SHAS, which obtains a Minimum
been met. A number of MapReduce (MR) jobs would Spanning Tree (MST) for achieving clustering. The
involve immense I/O activity, which could be an issue Divide and Conquer strategy is adopted by SHAS,
for large datasets. Thomas and Annappa26 have also which divides the original dataset into subsets. MSTs
employed parallel k-means MapReduce clustering are calculated locally for each of these subsets. These
algorithms for the prediction of optimal paths in self- intermediate MSTs are combined by employing the
aware Mobile Ad-Hoc Networks. K-way merge until a single MST is left. SHAS uses a
The biggest drawback in k-means is the selec- special data structure(union-find), which records the
tion of initial centroids. K-means++27 is a modifica- information about each vertex and is used to effi-
tion of k-means that selects subsequent centroids ciently combine these partial MSTs. Authors have
(after the first centroid, which is selected at random) compared the performance of the SHAS with its
based on probability proportional to overall error. MapReduce version. They have figured out that
Bahmani et al.28 have proposed a scalable k-means++ Spark performs better than MapReduce for all kinds
(K-means||) on MapReduce. K-means|| samples k of datasets due to in-memory computation and no

© 2016 John Wiley & Sons, Ltd 9


Overview wires.wiley.com/dmkd

overhead in setting and tearing of jobs for each their own merits, the selection of which is determined
iteration. by a cost-based optimization technique. BoW calcu-
Spectral clustering relies on Eigen decomposi- lated the cost of both algorithms on the basis of
tions of affinity, dissimilarity, or kernel matrices to parameters such as file size, network speed, disk
partition data into clusters.32 It has been employed in speed, start-up cost, and plug-in cost. The calculated
various applications, such as speech recognition and cost is then used to select the best algorithm among
image processing. Power iteration clustering (PIC) the two. However, BoW is a hard clustering method,
reduces the computational complexity of spectral which may not work well for overlapping clusters.
clustering.32 PIC replaces the Eigen decomposition of Coclustering, or bi-clustering, or simultaneous
the similarity matrix required by spectral clustering clustering is an unsupervised learning technique that
into a small number of iterative matrix–vector multi- simultaneously clusters objects and features, or in
plications. Yan et al.32 have proposed a parallel other words, it allows parallel clustering of the rows
power iteration clustering (p-PIC) using MPI for and columns of a matrix. In this case, clustering is per-
handling large data, which was further implemented formed row-wise, keeping column assignments fixed
by Jayalatchumy and Thambidurai33 on MapReduce to find the best group. The same process is executed,
due to its better fault tolerance ability. MapReduce is in parallel, column-wise, keeping row assignments
used for affinity matrix calculations in parallel with fixed. It has been employed in a number of applica-
the help of mappers, the results of which are globally tions, such as text mining, collaborative filtering, bio-
aggregated by the reducer. Authors have showed that informatics, and graph mining. Papadimitriou and
the accuracy in producing the clusters is almost the Sun36 have proposed a distributed coclustering model,
same as using a MapReduce framework. DisCo, with MapReduce. In MapReduce distributed
SUB-CLU is a subspace clustering algorithm processing, initially, global parameters consisting of
that aims to finds clusters hidden in subsets of the adjacency matrix, row vector, and column vector are
original feature space by using the DBSCAN algo- formed and broadcasted. Then, mappers perform the
rithm and is employed in high-dimensional datasets. local clustering by reading rows, which are fed to the
Subspace clustering has been employed in a number reducer to perform global clustering. The process is
of applications, such as sensor network, bioinformat- repeated till the cost function stops decreasing (the
ics, network traffic, image processing, and compres- same process is also followed for column-wise).
sion. Zhu et al.34 have proposed CLUS, which has
implemented a subspace clustering algorithm on top
Supervised Learning
of Spark to achieve scalability and parallelism. CLUS
Supervised learning infers a function from labeled
speeds up the SUB-CLU algorithm by executing mul-
trained data, which are used further for verification
tiple density-based clustering (DBSCAN) tasks in par-
and classification. The two broad subcategories in
allel. DBSCAN is performed initially on one-
supervised learning are classification and regression.
dimensional subspace followed by higher-
dimensional subspaces, excluding noise from lower
subspaces. The process ends when no higher- (a) Classification: Classification is a supervised
dimensional subspace could be generated. It also learning technique that is used to determine
employs a data partitioning to induce data locality. the class of variables or to predict future
Cordeiro et al.35 have proposed another subcluster- trends. The classification algorithm usually
ing algorithm on MapReduce, BoW (Best of both consists of a training phase to construct a
Worlds), that automatically spots bottlenecks and training model and a testing phase for the veri-
picks a good strategy to balance the I/O and network fication. Classification has a number of appli-
cost. It can employ most of the serial clustering cations, such as spam filtering, image
method as a plug-in. Authors have proposed two recognition, speech recognition, text categori-
methods, ParC (Parallel Clustering) and SnI (Sample zation, and fraud detection.
and Ignore). ParC performs data partitioning and (b) Regression: Regression is also a supervised
then finds clusters using MapReduce by employing learning technique used for prediction. Regres-
any serial clustering algorithm. SnI performs sam- sion predicts a value from a continuous data-
pling first to get rough clusters and then clusters the set, whereas classification is used with a
remaining points using ParC. ParC is executed on the categorical dataset. The variable to be pre-
whole dataset only once, minimizing disk access. SnI dicted is known as the dependent variable.
performs two MR jobs to reduce communication cost There is only one dependent variable in regres-
at the expense of higher I/O cost. ParC and SnI have sion. The variables used for modeling or

10 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

training are known as independent variables. Salperwyck et al.39 have proposed CurboSpark, a
If the number of independent variables is one, Spark version of their time-series decision tree,
then, it is known as linear regression, and if it i.e., CurboTree. Authors have used hierarchical clus-
involves multiple independent variables, it is tering for the generation of the best-split attribute fol-
known as multiple regression. lowed by a decision tree for classification. Authors
have also tested the framework to understand and
A brief description of some the classification and draw analytical insight about the electrical consump-
regression algorithms adapted to the MapReduce tion of smart meters. Authors have indicated that the
framework are presented below. use of Spark library has resulted in significant
A decision tree generates the classification or improvement due to in-memory computation.
regression rules by recursively partitioning the data- Random forest is based on ensemble learning and
set. Decision trees with categorical target values are is used for classification or regression. It builds a num-
known as classification trees, whereas where target ber of decision tree on subsets of training data, which
values are continuous, they are known as regression together creates a classification model. The classifica-
trees. A decision tree consists of a root, internal node tion model so formed is used for predicting the class of
or decision node, and leaf nodes representing class samples. Han et al.40 have proposed SMRF, a scalable
labels. The decision node represents a test on the random forest algorithm based on MapReduce. It has
attribute, with edge representing the test condition, two phases. The first phase is used for building the clas-
creating partitions of the dataset. The decision rule sification model, and the second phase is used for verifi-
constitutes the path from root to leaf node. Decision cation or prediction. In the training MR phase, data is
tree classification consists of a training phase that is partitioned among nodes that are operated by the map-
used to build the tree in a recursive manner, with an pers to create decision trees. The outputs of mappers
objective to optimize the classification rule. In the test are simply combined by the reducers or combiner. All
phase, a decision tree is used to determine the class decision trees from mappers create a forest or classifica-
to which a new instance belongs. A decision tree has tion model, which is used in next phase for prediction.
been employed in a number of applications, such as Similarly, PLANET41 and MReC4.542 are also based
spam filtering, text categorization, and fraud detec- on ensemble-based approaches for scalable tree learn-
tion. C4.5, an extension of ID3 algorithm, is a well- ing on large data using MapReduce. Singh et al.43 have
known decision tree that uses information gain ratio employed random forest for peer-to-peer botnet detec-
for splitting attributes. Dai and Ji37 have proposed tion on MapReduce.
C4.5 using a MapReduce programming model, Fuzzy Rule-Based Classification Systems
MRC4.5, which transforms the traditional algorithm (FRBCSs) are effective classification systems and can
into a series of Map and Reduce procedures. The handle uncertainty, vagueness, or ambiguity. They
data is vertically partitioned to maximally preserve are used in applications such as pattern recognition
localization characteristics. Before starting the algo- and classification. Rio et al.44 have extended the Chi-
rithm, data preparation is performed to design some FRBCS, a linguistic fuzzy rule-based classification
data structures, such as attribute table, count table, system, to big data, Chi-FRBCS-BigData, which uses
and hash table, to minimize the communication cost. the MapReduce framework to learn and fuse rule
After data preparation, the best splitting attribute is bases. It uses two MR jobs for the algorithm. One
decided using a single MR job. After selecting an MR job is used to build the model from a big train-
attribute, the training dataset is partitioned into sev- ing set. The other MR job is used to estimate the
eral parts such that similar value records values are class of the big data sample set using an earlier
in the same partition preserving benefits of vertical model. The Chi-FRBCS-BigData algorithm has been
partitioning. MR jobs are executed on each partition developed in two different versions, Chi-FRBCS-Big-
recursively to determine the best splitting attribute at Data-Max and Chi-FRBCS-BigData-Ave, which may
the current node until the complete partition is be selected on the basis of speed and accuracy trade-
assigned to the same class. Linkages between nodes off. The Ave version provides better accuracy,
are created to build a decision tree, which is used for whereas the Max version provides faster results.
the classification of new instances. Only map func- Authors observed a reduction in classification accu-
tion is needed for the testing phase. Authors have racy when using a larger number of maps as the rule
indicated that the MapReduce implementation of the weights are calculated from smaller data partitions.
C4.5 algorithm exhibits both time efficiency and scal- Naive Bayes is a supervised classification tech-
ability. Similarly, Purdila and Pentiuc38 have pro- nique based on statistical Bayes’ theorem with inde-
posed MR-tree, a MapReduce version of ID3. pendence assumptions among variables. The Naive

© 2016 John Wiley & Sons, Ltd 11


Overview wires.wiley.com/dmkd

Bayes approach uses joint probability to estimate the data information in data subsets, and major voting
probabilities over a set of classes. It has been helps in generating a strong classifier. The authors have
employed in applications such as text classification, performed the experiment on MapReduce, Haloop,
spam filtering, and sentiment analysis. Dai and Sun45 and Spark computing frameworks. It was found that
have implemented the Naive Bayes text classification among all, MapReduce performs the worst, with a
algorithm based on a rough set using MapReduce. slight improvement in the case of Haloop, and Spark
The authors have combined a rough set theory in soft performs the best due to in-memory computation. Sim-
computing with Naive Bayes to improve its limit of ilarly, Zhang and Xiao51 have implemented a parallel
independence, reducing bias. It selects a closet approx- multilayered neural network employing a BP training
imate independent attributes subset. In the training algorithm on MapReduce (MRBP) on cloud comput-
phase, an MR job is used to normalize the weighted ing clusters for speedup.
frequency of each key. The testing phase is used to Multiple linear regression is a widely adopted
predict the output of test records. Liu et al.46 have also regression technique that has a very high training
implemented a MapReduce version of the Naive Bayes time and may fail in case of a large dataset. Rehab
classifier for sentiment analysis. and Boufarès52 have proposed multiple linear regres-
Support vector machines (SVM) are a classifica- sion on MapReduce (MLR-MR) for speedup and
tion and regression model that attempts to find the scalability over a large dataset. MLR-MR distributes
best possible surface to separate classes in training the QR decomposition and ordinary least squares
samples. The learning ability of SVM is independent computation in parallel using MapReduce to estimate
of the dimensionality of feature space, which makes the relationship between the predictor and the target
them a suitable candidate for achieving parallelism variables. MLR-MR has three steps. In the first step,
with MapReduce. LibSVM is one of the most widely the matrix is divided into small blocks. In the second
adopted SVMs. Sun and Fox47 have proposed a Par- step, mappers compute a local QR factorization for
allel LibSVM on MapReduce to improve the training each such block, and in the third step, the results of
time. In this model, training samples are distributed all mappers (Qi results) are aggregated to have final
into subsections, which are trained in parallel by coefficient results. Authors have showed that MLR-
mappers using LibSVM. Support vectors from the MR is able to train large data and effectively solves
map jobs are collected by the reducer and fed back the out-of-memory problem.
as input until all sub-SVMs are combined into one
SVM. Based on the parallel SVM model on MapRe- Semisupervised Learning
duce (PSMR), Xu et al.48 have proposed an e-mail Semisupervised learning is used where the data con-
classification model using the Enron dataset. Simi- tains small labeled data and large amount of unla-
larly, Khairnar and Kinikar49 have proposed senti- beled data. It uses labeled data and unlabeled data
ment analysis-based mining and summarizing using for training. Semisupervised learning is well suited to
parallel SVM using MapReduce (MRSVM). big data in which majority of data may be unlabeled,
Artifical neural networks (ANNs) are widely and the cost of labeling all data may be too high.
adopted for classification and regression tasks. The Cotraining and active learning are the two important
back-propagation neural network (BPNN) is the most techniques employed for semisupervised learning.
commonly encountered ANN. BPNN is a multilayer Cotraining is a semisupervised learning tech-
feed-forward network. It adjusts the classification para- nique in which each example is partitioned into two
meters by employing an error back-propagation distinct views and where two views are independently
(BP) technique until its parameters are attuned to all sufficient. It learns to separate classifiers for each
input instances.50 The PBPNN proposed by Liu et al.50 view, and then, the most confident prediction of each
presents a parallel version of BPNN using MapReduce classifier on unlabeled data is used for labeled trained
and Spark, which distributed the iterations and compu- data. Hariharan and Shivshankar53 have presented a
tation of BPNN in parallel to achieve speedup. PBPNN cotraining-based multiview learning algorithm under
distributes the data to mappers for parallel processing. semisupervised conditions on MapReduce. The
Each mapper locally tunes subsets of data using BPNN authors have also proposed the use of data structures
iteratively. The outputs of all mappers are then ensem- such as mapping table, label file, and bidirectional
ble by the reducer or joiner to generate classification reducers in order to avoid broadcasting of new labels
rules. Classification accuracy is maintained by merging to all nodes in the MapReduce computation. A map-
weak classifiers into a strong classifier with the help of ping table contains partition details, and a label file
ensemble techniques such as bootstrapping and major- contains the labels of data points in each partition.
ity voting. Bootstrapping helps in maintaining original Bidirectional reducers update the labels in the label

12 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

file directly. Classifiers on multiple views are learnt Metrics Evaluation


locally using the mappers on each node individually, The design of scalable learning systems has enabled
and the results are combined to generate a global us to address larger problems easily. Algorithms
model using a bidirectional reducer. employing MapReduce or RDD may take less train-
Active learning is another semisupervised learn- ing time and may result in an increase of speed on a
ing technique in which learning algorithms interac- huge dataset compared to its serial counterpart.
tively query the source for a certain type of instance Designing a big data algorithm entails two main
to obtain output, reducing learning effort. Zhao challenges, which are also pointed out by Cordeiro
et al.54 have proposed a Punishing-Characterized et al.35 in his research. The two main challenges are
Active Learning Back-Propagation (PCAL-BP) neural to minimize the I/O cost and network communica-
network algorithm for big data to improve the learn- tion cost. Researchers have employed various ways
ing efficiency of the BP model. to overcome these challenges and to achieve high per-
formance. The parallel data-intensive machine algo-
Reinforcement Learning rithm covered in previous section could be thus
Reinforcement learning continuously learns in an compared on various evaluation metrics. The evalua-
environment by trial and error. It discovers the actions tion metrics could be divided in two categories,
that yield the greatest rewards to make better deci- namely, optimization metrics and performance met-
sions. The Markov Decision Process and Q-Learning rics. The brief description of each is as follows:
are two well-known examples of reinforcement learn-
ing. Reinforcement learning has been employed in Optimization Metrics
applications such as gaming and robotics. Optimization metrics refers to the measures that can be
Li and Schuurmans55 applied the MapReduce used to improve the overall performance by reducing
framework for methods such as policy evaluation, pol- I/O cost or network communication cost or both. They
icy iteration, and value iteration by distributing the help in achieving objective such as load balancing,
computation involved in large matrix multiplications. resource utilization, and dimensionality reduction. The
Parallelism was used to speedup large matrix opera- various optimization criteria that could be employed
tions but not to parallelize the learning. To the best of by distributed big data algorithms are:-
our knowledge, the work of Li and Schuurmans55 is
the only paper that has used MapReduce for reinforce- (a) Sampling: Sampling refers to selecting a subset
ment learning of linear representations. Parallel rein- of data from the entire dataset. This helps to
forcement learning in which agents learn in parallel to achieve similar accuracy with reduced network
achieve speedup convergence such as Bayesian Rein- communication costs and processing time.
forcement Learning56 are being proposed in literature. Sampling also helps to handle data skewness.
(b) Indexing: Indexing data may help in selecting
Deep Learning specific data or removing unnecessary data
Deep learning refers to machine-learning techniques from processing, resulting in reduced I/O cost.
that use supervised and/or unsupervised strategies to (c) Intelligent partitioning: Data among different
automatically learn hierarchical representations in nodes should not be partitioned randomly as it
deep architectures for classification.57 Deep learning may lead to increased data shuffling among
has been employed in a number of applications such nodes. In order to reduce the communication
as image search, threat detection, fraud detection, cost, an intelligent data partitioning strategy
sentiment analysis, and log analysis. A Deep Belief may be adopted. Mappers distribute the out-
Network (DBN) with Restricted Boltzman Machine put to reducers using hashing. Data skewness
(RBM) is one of the most widely used frameworks can happen at a reducer when one reducer is
for deep learning. RBMs are used to generate a train- heavily loaded or is taking more time to com-
ing model in an unsupervised manner, which is used pute compared to others. A new job cannot
by DBN subsequently for supervised classification or start until all reducers have finished, resulting
fine tuning. Zhang and Chen58 have proposed a dis- in the improper utilization of resources. This
tributed learning paradigm using MapReduce. Pre- can be reduced by exploiting data locality and
training is achieved by distributed learning, which using a load-aware partitioning strategy,
involves stacking a series of distributions of RBMs instead of blind hashing for distribution
followed by distributed BP for fine tuning using the among reducers.59,60 A dynamic data parti-
MapReduce framework. tioning method can also continuously optimize

© 2016 John Wiley & Sons, Ltd 13


Overview wires.wiley.com/dmkd

the varying size of data for each iteration, pro- compared with other techniques. The values
viding load balancing.34 Thus, an optimal par- assigned are as follows:
titioning may help in handling imbalanced
data skew and load balancing.
(d) Special Data Structure: Big data implies a huge Comparative Analysis Value Assigned
No comparison 0
dataset. Learning or overall computation of a
machine-learning algorithm for big data may Comparison with the base technique or 1
require multiple MR jobs, resulting in increased similar serial technique
I/O costs. It may also be required to share infor- Comparison among parallel techniques 2
mation among nodes, entailing a high commu- (MapReduce vs Spark)
nication cost. Designing a special data structure
to hold specific information may reduce the
computational cost and would help in some (c) Number of MR jobs: It is a measure of the
form of data caching or data aggregation. number of MR jobs required to complete
(e) In-memory computation: In-memory computa- the algorithm. The number of MR jobs
tion helps to cache intermediate data in main determines the amount of I/O cost as every
memory and perform computation. This time data needs to be written to disk and
reduces the number of disk I/O required after read from disk. More number of MR jobs
every step of a typical MR job. may also result in increased network com-
(f) Asynchronous execution: A typical MR job is munication costs. Spark would help to
synchronous in nature, which implies that all achieve better performance in case of multi-
the reducers must finish before a new map ple MR jobs due to in-memory computa-
could be started. This results in the ineffective tion. The values assigned are as follows:
utilization of resources.
No. of MR Jobs Value Assigned
Papers were evaluated for the abovementioned optimi- Low 0
zation metrics, as presented in Table 2. Papers are High 1
assigned a value 1 in case the metric is used and a value
0 in case the metric is not used in the algorithm.
(d) Speedup: Speedup determines the reduction
in time to compute or train by parallel algo-
Performance Metrics rithm as compared to its serial counterpart.
Performance metrics are used to compare the perfor-
Parallelism may result in increased speed,
mance of the algorithm against its serial counterpart.
which may not be always linear due to high
Papers were evaluated on the following mentioned
network communication costs among
performance metrics and is presented in Table 2.
nodes. The values assigned are as follows:

(a) Data source validation: The data are used to


validate the algorithm. A synthetic dataset may
result in poor validation but may help in case Speedup Value Assigned
Good speedup due to MapReduce 0
no real dataset is available or to test for specific
test cases. The values assigned are as follows: Excellent speedup in case of MapReduce 1
due to optimizations metric employed
Excellent speedup due to in-memory 2
Data Source Value Assigned computation such as Spark
Synthetic datasets 0
Real-world datasets 1
Both (Synthetic datasets + Real-world 2
(e) Accuracy: Accuracy determines the quality
datasets)
of data output as compared to its serial
counterpart. The parallel and distributed
big data algorithms may result in the
(b) Comparative analysis: It is used to check decrease of output accuracy due to data dis-
whether the results obtained have been tribution. A balance may be needed

14 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

between output accuracy and speed. The prevalent from the literature review, executes a
values assigned are as follows: serial algorithm for local computation at various
nodes/partitions in parallel, with the help of map-
pers, the results of which are aggregated by the
Data Source Value Assigned reducer/s. This task is often known as an MR job.
Not measured 0
Depending on the learning technique or fine tuning
Approximate match 1 required. a single or multiple MR jobs can be exe-
Perfect match 2 cuted before achieving a global view. The goal of all
the abovementioned algorithms is to make a serial
algorithm scalable by adapting it to run parallel on
(f) Scalability: MapReduce or RDD have the
distributed platforms.
basic aim of providing scalability, which
The various algorithms presented take less time
means that the system would be able to
than their serial counterparts. However, choosing the
gracefully handle the growth in data or
computational framework to employ may be decided
number of nodes. Accuracy may be reduced
by the context of problems. MapReduce is the best
for small datasets or for a large number of
fit where the number of MR jobs is low, i.e., 1 or
data nodes due to network overhead and
2. However, in case of an iterative task, a large num-
data distribution. Thus, an optimal number
ber of MR jobs may be required, for which MapRe-
of data nodes may be required. The values
duce may not be the best computational framework.
assigned are as follows:
It has been found that for such cases, the Spark
framework helps in achieving excellent speedup and
better scalability due to in-memory computation and
Scalability Value Assigned less overhead in setting up jobs for every iteration. It
Almost linear scale-up with increase in the 0 is further found that Spark performs better than
number of nodes and datasets MapReduce in almost all the cases due to its in-
Almost linear scale-up with a 1 memory computation, which reduces the network
consideration of imbalanced/skewed I/O cost. Nonetheless, Spark performs better until the
data time it has enough memory to cache data.
System is scalable and detects lower 2 This paper has also evaluated the various algo-
and/or upper bound for parallelism rithms on optimization metrics such as sampling,
degree intelligent partitioning, indexing, and special user-
defined data structure that reduces the number of
A comparative analysis of the parallel data- MR jobs or data shuffling, resulting in lower I/O
intensive machine-learning algorithm based on the costs and better speedup. Another factor of impor-
above metrics is presented below in Table 2. There tance is imbalanced or skewed datasets, which may
are various machine-learning toolkits such as MLib, cause uneven loads on the data nodes. The considera-
Mahout, scikiy-learn, MLpack, Flink, and Amazon tion of imbalanced data results in load balancing and
Machine Learning available for various big data improved scalability.
computational frameworks.61 These toolkits imple- The various algorithms are further evaluated on
ment some of the machine-learning algorithms that various performance metrics. It was found that the
could also be initially leveraged by the researchers accuracy of parallel data-intensive algorithms may be
for conducting their research. reduced due to small datasets or a large number of
data nodes. The advantage of parallelism may be
diminished due to network and data distribution
overhead. A balance is needed between speed and
DISCUSSION
accuracy, which could be achieved by selecting an
MapReduce or RDD have been adopted by many optimal number of data nodes.
researchers due to its flexibility, scalability, and easi- The power and value of big data analytics pro-
ness in handling data-intensive tasks. They help in vides a wealth of information. However, while
achieving data parallelism on a large volume of data designing or selecting a scalable machine-learning
and may result in good speedup. algorithms for big data, measures should be taken to
A number of scalable machine-learning algo- reduce the I/O and network communication cost.
rithms for big data have been discussed in this arti- Another issue that should also be considered is the
cle. In general, a big data-learning algorithm, as handling of false positives or output validation.

© 2016 John Wiley & Sons, Ltd 15


16
TABLE 2 | Comparative Analysis of Scalable Machine-Learning Algorithms for Big Data
Optimization Metrics Performance Metrics
Overview

No.
Machine Special Data of
Learning Framework Intelligent Data In-Memory Asynchronous Source Comparative MR
Class Paper Adapted From Employed Sampling Indexing Partitioning Structure Computation Execution Validation Analysis Jobs Speedup Accuracy Scalability
Clustering Parallel K- K-means MapReduce 0 0 0 0 0 0 0 0 1 0 0 0
means25
K-means||28 K-means++ MapReduce 1 0 0 0 0 0 1 1 1 1 1 0
29
MR-DBSCAN DBSCAN MapReduce 0 1 1 0 0 0 1 1 0 1 0 1
Cludoop30 CluC MapReduce 0 1 0 0 0 0 1 1 0 1 1 0
31
SHAS SHC (single- Spark 0 0 0 1 1 1 0 2 1 2 0 0
linkage
hierarchical
clustering)
MR-PIC33 pPIC MapReduce 0 0 0 0 0 0 0 1 0 0 1 0
CLUS34 SUB-CLU Spark 0 1 1 0 1 1 2 2 1 2 2 0
(subspace
clustering)
BoW:ParC and Subspace MapReduce 1 0 1 0 0 0 2 1 0 1 1 0
SnI35 clustering
DisCo36 Coclustering MapReduce 0 0 0 0 0 0 1 0 1 1 0 2
37

© 2016 John Wiley & Sons, Ltd


Classification MRC4.5 C4.5 MapReduce 0 0 1 1 0 0 0 1 1 1 0 0
MR-tree38 ID3 MapReduce 0 0 0 0 0 0 1 0 1 0 0 0
39
CurboSpark CourboTree— Spark 0 0 0 0 1 1 0 0 1 2 0 0
a time series
decision tree
SMRF40 Random forest MapReduce 0 0 0 0 0 0 1 1 0 0 2 0
PLANET41 MapReduce 0 0 0 1 0 0 1 0 0 1 1 2
42
MreC4.5 Ensemble C4.5 MapReduce 0 0 0 0 0 0 1 0 0 0 0 0
Chi-FRBCS- Chi-FRBCS MapReduce 0 0 0 0 0 0 1 1 0 0 1 2
BigData44 (Fuzzy rule-
based
classification
systems)
wires.wiley.com/dmkd
TABLE 2 | Continued
Optimization Metrics Performance Metrics
No.
Machine Special Data of
Learning Framework Intelligent Data In-Memory Asynchronous Source Comparative MR
Class Paper Adapted From Employed Sampling Indexing Partitioning Structure Computation Execution Validation Analysis Jobs Speedup Accuracy Scalability
MR version of Naive Bayes MapReduce 0 0 0 0 0 0 1 1 0 0 1 0
naive Bayes with rough
with rough set
set45
MR version of Naive Bayes MapReduce 0 0 0 0 0 0 1 0 0 1 2
naive Bayes classifier
WIREs Data Mining and Knowledge Discovery

Classifier46
MapReduce LibSVM Twister 0 0 0 0 1 0 1 0 1 0 1 2
parallel
SVM47
PSMR48 LibSVM MapReduce 0 0 0 0 0 0 1 1 1 0 1 0
49
MRSVM LibSVM MapReduce 0 0 0 0 0 0 1 1 1 0 1 0
PBPNN50 BPNN MapReduce, 1 0 0 0 1 1 1 2 0 2 1 0
Haloop,
and Spark
MRBP51 BPNN MapReduce 0 0 0 0 0 0 1 1 0 0 1 0
52
MLR–MR MLR MapReduce 0 0 0 0 0 0 2 1 0 0 1 0

© 2016 John Wiley & Sons, Ltd


Semi- Multiview Cotraining MapReduce 0 0 0 1 0 0 2 2 0 1 0 0
supervised learning on
learning MapReduce
using
cotraining53
Deep Large-scale DBN with RBM MapReduce 0 0 0 0 0 0 1 1 1 0 1 0
learning deep belief
nets with
MapReduce58
BPNN, back-propagation neural network; PBPNN, parallel BPNN; DBN, deep belief network; MLR, multiple linear regression; MLR–MR, multiple linear regression on MapReduce; SVM, support vector machine;
PSMR, parallel SVM on MapReduce; MRSVM, MapReduce SVM; MRBP, MapReduce back propagation; RBM, restricted Boltzman machine.

17
Scalable machine-learning algorithms for big data analytics
Overview wires.wiley.com/dmkd

LIMITATIONS digits on a Spark cluster running multiple deep learn-


ing jobs.63
The purpose of this study is to provide a brief overview This research would help the researchers work-
of big data analytics. This study has provided a com- ing in the area of big data analytics to choose the
prehensive literature review of data-intensive parallel right computational framework based on their under-
machine-learning algorithms for big data analytics. A lying context. This research would help other
number of important limitations of the study are: researchers in understanding which machine-learning
techniques have been adapted to MapReduce, mak-
(a) The study has not considered parallel algo- ing them scalable in prior studies. This would also
rithms employing vertical scaling, such as help researchers in developing a new framework or
GPUs, due to their limited scalability. The hybrid model for big data analytics. Furthermore, the
study has also not addressed parallel algo- various evaluation metrics could be used as a bench-
rithms employing MPI. mark in evaluating any new technique for big data.
(b) The study has not covered graph-based analyt-
ical algorithms as they best suit the BSP (Bulk
Synchronous Parallel) model rather than CONCLUSION AND FUTURE WORK
MapReduce or RDD.
The paper presents an overview of big data analytics,
(c) The study has provided just a brief overview with a special focus on data-intensive distributed
of various techniques and algorithms. How- machine-learning algorithms for big data. The pres-
ever, in order to see the full details regarding ent paper also performs a comparative analysis of the
particular techniques or algorithms, the corre- parallel data-intensive machine-learning algorithm on
sponding paper may be referred. optimization and performance metrics. The design of
(d) The study is limited to only data-intensive scalable and fault tolerant learning systems have
batch algorithms and does not consider online enabled us to address larger problems easily due to
streaming algorithms. which big data analytics has drawn attention of
many researchers. This study would provide a base
for the researchers in this area as it provides a wide
RESEARCH IMPLICATIONS
collection of previous research.
Big data analytics is evolving in nature, and its poten- This review is theoretical in nature. However,
tial has been exploited in almost every area and in the future, we would also like to implement scala-
industry, such as telecom, finance, medicine, and gov- ble machine-learning algorithms on real-world pro-
ernment agencies, to produce better outcomes, lead- blems, such as log analysis and social media analysis.
ing to improved growth and success. Big data Based on our survey, most of the previous studies
analytics has been found to be significant in various have focused on making traditional serial techniques
areas such as national development, industrial scalable. A new technique built from scratch for big
upgrade, scientific research, interdisciplinary data is missing in the literature, which could be con-
research, and better prediction.62 It has been used to sidered in future work. Designing a hybrid technique
solve complex problems such as recommendation, based on MapReduce and MPI, utilizing the merits
content filtering, fraud detection, and terrorist track- of both, is also worth researching in future. Visuali-
ing. A typical example of big data analytics demon- zation plays an important role in big data analytics
strating its power is an image recognition system, and has not been explored and is, therefore, also an
which took only 5 min to study 60,000 handwritten important topic of exploration.

REFERENCES
1. Owais SS, Hussein NS. Extract five categories CPIVW 3. Chen H, Chiang RHL, Storey VC. Business intelligence
from the 9V characteristics of the big data. Int J Adv and analytics: from big data to big impact. MIS Q
Comput Sci Appl 2016, 7:254–258. 2012, 36:1165–1188.
2. Wu X, Zhu X, Wu GQ, Ding W. Data mining with 4. Singh D, Reddy CK. A survey on platforms for big
big data. IEEE Trans Knowl Data Eng 2014, data analytics. J Big Data 2014, 2:1–20. doi:10.1186/
26:97–107. s40537.014.0008.6.

18 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

5. Tsai C, Lai C, Chao H, Vasilakos AV. Big data analyt- 19. Doulkeridis C, Nørvåg K. A survey of large-scale ana-
ics: a survey. J Big Data 2015, 2:1–32. doi:10.1186/ lytical query processing in MapReduce. VLDB J 2013,
s40537.015.0030.3. 23:355–380. doi:10.1007/s00778.013.0319.9.
6. Nathalie J, Stefanowski J. Big data analysis: new algo- 20. Landset S, Khoshgoftaar T, Richter A, Hasanin T. A
rithms for a new society. In: A Machine Learning Per- survey of open source tools for machine learning with
spective on Big Data Analysis. Springer International big data in the Hadoop ecosystem. J Big Data 2015,
Publishing: Cham; 2016, 1–31. 2:1–36. doi:10.1186/s40537.015.0032.1.
21. Zaharia M, Chowdhury M, Das T, Dave A, Ma JM,
7. Cohen J, Dolan B, Dunlap M, Hellerstein J, Welton C.
McCauley M, Franklin MJ, Shenker S, Stoica I. Fast
Mad skills: new analysis practices for big data. Proc
and interactive analytics over hadoop data with Spark.
VLDB Endow 2009, 2:1481–1492. doi:10.14778/
USENIX 2012, 37:45–51.
1687553.1687576.
22. Shahrivari S, Jalili S. Beyond batch processing: towards
8. Schmarzo B. Understanding the role of Hadoop in real-time and streaming big data. Computers 2014,
your bi environment. Available at: https://infocus.emc. 3:117–129.
com/williamschmarzo/understandingtheroleofhadoop-
inyourbienvironment/. (Accessed February 6, 2016). 23. Marz N, Warren J. Big Data: Principles and Best Prac-
tices of Scalable Realtime Data Systems. Greenwich,
9. Borthakur D. HDFS architecture guide. Available at: CT: Manning Publications Co; 2015.
https://hadoop.apache.org/docs/r1.2.1/hdfsdesign.pdf. 24. Vizard M. Cloudera aims to replace MapReduce with
(Accessed February 6, 2016). Spark as default hadoop framework. Available at:
10. Bakshi K. Considerations for big data: architecture http://www.datacenterknowledge.com/archives/2015/
and approach. In: Proceedings of the 2012 I.E. Aero- 09/09/cloudera-aims-to-replace-mapr. (Accessed Febru-
space Conference, Big Sky, Montana, US, Mar 3–10, ary 6, 2016).
2012, 1–7. doi:10.1109/AERO.2012.6187357. 25. Zhao W, Ma H, He Q. Parallel K-means clustering
11. Kulkarni AP, Khandewal M. Survey on hadoop and based on MapReduce. In: Proceedings of First Interna-
introduction to YARN. Int J Emerging Technol Adv tional Conference of Cloud Computing (CloudCom),
Eng 2014, 4:82–87. Springer, Beijing, China, 2009, 674–679.
26. Thomas L, Annappa B. Application of parallel K-
12. Roman J. Hadoopecosystemtable. Available at: https://
means clustering algorithm for prediction of optimal
hadoopecosystemtable.github.io. (Accessed February
path in self aware mobile ad-hoc networks with link
6, 2016).
stability. In: Proceedings of First International Confer-
13. Brewer E. CAP twelve years later: how the “rules” ence of Advances in Computing and Communications,
have changed. Computer 2012, 45:23–29. Springer, Kochi, India, July 22–24, 2011, 396–405.
doi:10.1109/MC.2012.37. 27. Arthur D, Vassilvitskii S. K-means++: the advantages
14. Sharma S, Tim US, Gadia SK, Wong JS, Shandilya R, of careful seeding. In: Proceedings of the Eighteenth
Peddoju SK. Classification and comparison of NoSQL Annual ACM-SIAM Symposium on Discrete Algo-
big data models. Int J Big Data Intell 2015, rithms (SODA ’07), Society for Industrial and Applied
2:201–221. Mathematics, Philadelphia, PA, USA, 2007,
1027–1035.
15. Moniruzzaman ABM, Hossain SA. NoSQL database:
new era of databases for big data analytics—classifica- 28. Bahmani B, Moseley B, Vattani A, Kumar R,
tion, characteristics and comparison. Int J Database Vassilvitskii S. Scalable k-means++. PVLDB 2012,
Theor Appl 2013, 6:1–14. 5:622–633.
29. He Y, Tan H, Luo W, Feng S, Fan J. MR-DBSCAN: a
16. Moniruzzaman ABM. NewSQL: towards next-
scalable MapReduce-based DBSCAN algorithm for
generation scalable RDBMS for online transaction pro-
heavily skewed data. Front Comput Sci 2014,
cessing (OLTP) for big data management. International
8:83–99. doi:10.1007/s11704.013.3158.3.
Journal of Database Theory & Application 2014,
7:121–130. 30. Yu Y, Zhao J, Wang X, Wang Q, Zhang Y. Cludoop:
an efficient distributed density-based clustering for big
17. Sundaresan B, Kandavel D. Big languages for big data: data using Hadoop. Int J Distr Sensor Network 2015,
a study and comparison of current trend in data pro- 2015:1–13. doi:10.1155/2015/579391.
cessing techniques for big data. In: Design and Imple-
31. Jin C, Liu R, Chen Z, Hendrix W, Agrawal A,
mentation of Programming Languages Seminar, TU,
Choudhary A. A scalable hierarchical clustering algo-
Darmstadt, Germany, 2014. doi: 10.13140/
rithm using Spark. In: Proceedings of IEEE First Inter-
2.1.5107.8402.
national Conference on Big Data Computing Service
18. Lin J. Mapreduce is good enough? If all you have is a and Applications (BigDataService), San Francisco Bay,
hammer, throw away everything that’s not a nail! Big CA, USA, 2015, 418–426. doi: 10.1109/
Data 2012, 1:28–37. BigDataService.2015.67.

© 2016 John Wiley & Sons, Ltd 19


Overview wires.wiley.com/dmkd

32. Yan W, Brahmakshatriya U, Xue Y, Gilder M, Wise B. J Comput Intell Syst 2015, 8:422–437. doi:10.1080/
p-pic: parallel power iteration clustering for big data. 18756891.2015.1017377.
J Parallel Distr Comput 2013, 73:352–359.
45. Dai Y, Sun H. The naive Bayes text classification algo-
33. Jayalatchumy D, Thambidurai P. Implementation of p- rithm based on rough set in the cloud platform.
pic algorithm in MapReduce to handle big data. Int J J Chem Pharmaceut Res 2014, 6:1636–1643.
Res Eng Technol 2014, 3:113–118.
46. Liu B, Blasch E, Chen Y, Shen D, Chen G. Scalable
34. Zhu B, Mara A, Mozo A. CLUS: parallel subspace sentiment classification for big data analysis using
clustering algorithm on spark. In: Proceedings of New naive Bayes classifier. In: Proceedings of IEEE Interna-
Trends in Databases and Information Systems, Poi- tional Conference on Big Data, Silicon Valley, CA,
tiers, France, 2015, 175–185. USA, 2013, 99–104.
35. Cordeiro RLF, Traina JC, Traina AJM, Lo´pez J, 47. Sun ZQ, Fox GC. Study on parallel SVM based on
Kang U, Faloutsos C. Clustering very large multi- MapReduce. In: International Conference on Parallel
dimensional datasets with MapReduce. In: Proceedings and Distributed Processing Techniques and
of the 17th ACM SIGKDD International Conference Applications, Las Vegas, NV, USA, 2012, 495–561.
on Knowledge Discovery and Data Mining, KDD ’11,
ACM: New York, NY, USA, 2011, 690–698. doi: 48. Xu K, Wen C, Yuan Q, He X, Tie J. A MapReduce
10.1145/2020408.2020516. based parallel SVM for email classification. J Network
2014, 9:1640–1647.
36. Papadimitriou S, Sun J. Disco: distributed co-clustering
with map-reduce: a case study towards petabyte-scale 49. Khairnar J, Kinikar M. Sentiment analysis based min-
end-to-end mining. In: Proceedings of the Eighth IEEE ing and summarizing using SVM- Mapreduce. Int J
International Conference on Data Mining, ICDM ’08, Comput Sci Inform Tech 2014, 5:4081–4085.
IEEE Computer Society: Washington, DC, USA, 2008,
50. Liu Y, Xu L, Li M. The parallelization of back propa-
512–521, doi: 10.1109/ICDM.2008.142.
gation neural network in MapReduce and Spark. Int J
37. Dai W, Ji W. A MapReduce implementation of C4. Parallel Program 2016, 1–20. doi:10.1007/
5 decision tree algorithm. Int J Database Theor Appl s10766.016.0401.1.
2014, 7:49–60.
51. Zhang H, Xiao N. Parallel implementation of multi-
38. Purdila V, Pentiuc S. MR-tree—a scalable MapReduce layered neural networks based on MapReduce on
algorithm for building decision trees. J Appl Computer cloud computing clusters. Soft Comput 2015,
Sci Math 2014, 16:16–19. 20:1471–1483. doi:10.1007/s00500.015.1599.3.
39. Salperwyck C, Maby S, Cubillé J, Lagacherie M. Cour- 52. Rehab MA, Boufarès F. Scalable massively parallel
boSpark: decision tree for time-series on Spark. In: learning of multiple linear regression algorithm with
Proceedings of the 1st International Workshop on MapReduce. In: IEEE International Conference Trust-
Advanced Analytics and Learning on Temporal Data, com/BigDataSE/ISPA, Helsinki, Finland, Aug 20–22,
Porto, Portugal, 2015. 2015, 41–47. doi:10.1109/Trustcom.2015.560.
40. Han J, Liu Y, Sun X. A scalable random forest
53. Hariharan C, Shivashankar S. Large scale multi-view
algorithm based on MapReduce. In: Proceedings of the
learning on MapReduce. In: Proceedings of 19th Inter-
4th IEEE International Conference on Software Engi-
national Conference on Advanced Computing and
neering and Service Science (ICSESS), Beijing, China,
Communications, Chennai, India, 2013.
May 23–25, 2013, 849–852. doi: 10.1109/
ICSESS.2013.6615438. 54. Zhao Q, Ye F, Wang S. A new back-propagation neu-
41. Panda B, Herbach J, Basu S, Bayardo RJ. Planet: mas- ral network algorithm for a big data environment
sively parallel learning of tree ensembles with MapRe- based on punishing characterized active learning strat-
duce. PVLDB 2009, 2:1426–1437. egy. Int J Knowl Syst Sci 2013, 4:32–45. doi:10.4018/
ijkss.2013100103.
42. Wu G, Li H, Hu X, Bi Y, Zhang J, Wu X. MReC4.5:
C4.5 ensemble classification with MapReduce. In: Pro- 55. Li Y, Schuurmans D. Mapreduce for parallel reinforce-
ceedings of the Fourth ChinaGrid Annual Conference, ment learning. In: Proceedings of the 9th European
Yantai, China, Aug 21–22, 2009, 249–255. doi: Conference on Recent Advances in Reinforcement
10.1109/ChinaGrid.2009.39. Learning, EWRL’11, Springer-Verlag: Berlin, Heidel-
berg, Germany, 2012, 309–320.
43. Singh K, Guntuku SC, Thakur A, Hota C. Big data
analytics framework for peer-to-peer botnet detection 56. Barrett E, Duggan J, Howley E. A parallel framework
using random forests. Inform Sci 2014, 278:488–497. for Bayesian reinforcement learning. Connection Sci
doi:10.1016/j.ins.2014.03.066. 2014, 26:7–23. doi:10.1080/09540091.2014.885268.
44. Río SD, López V, Benítez JM, Herrera F. A MapRe- 57. Chen XW, Lin X. Big data deep learning: challenges
duce approach to address big data classification pro- and perspectives. IEEE Access 2014, 2:514–525.
blems based on the fusion of linguistic fuzzy rules. Int doi:10.1109/ACCESS.2014.2325029.

20 © 2016 John Wiley & Sons, Ltd


WIREs Data Mining and Knowledge Discovery Scalable machine-learning algorithms for big data analytics

58. Zhang K, Chen XW. Large-scale deep belief nets with 61. Desale D. Top 15 frameworks for machine learning
MapReduce. IEEE Access 2014, 2:395–403. experts. Available at: http://www.kdnuggets.com/2016/
doi:10.1109/ACCESS.2014.2319813. 04/top-15-frameworks-machine-learning-experts.html.
59. Ibrahim S, Jin H, Lu L, He B, Antoniu G, Wu S. Hand- (Accessed June 30, 2016).
ling partitioning skew in MapRe duce using LEEN. 62. Xiaolong J, Benjamin WW, Xueqi C, Yuanzhuo W. Sig-
Peer-to-Peer Network Appl 2013, 6:409–424. nificance and challenges of big data research. Big Data
doi:10.1007/s12083.013.0213.7. Res 2015, 2:59–64. doi:10.1016/j.bdr.2015.01.006.
60. Chen Q, Yao J, Xiao Z. Libra: lightweight data 63. Riggins J. Apache spark for deep learning: two case
skew mitigation in MapReduce. IEEE Trans Parallel studies. Available at: http://thenewstack.io/two-tales-
Distr Syst 2015, 26:2520–2533. doi:10.1109/ big-data-framework-adoption-apache-spark/.
TPDS.2014.2350972. (Accessed June 30, 2016).

© 2016 John Wiley & Sons, Ltd 21

You might also like