You are on page 1of 22

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/350486051

Machine Learning Techniques for Big Data Analytics in Healthcare: Current


Scenario and Future Prospects

Chapter · August 2022


DOI: 10.1007/978-3-030-99457-0

CITATIONS READS

3 603

3 authors:

Shahid Mohammad Ganie Majid Bashir Malik


Woxsen University Baba Ghulam Shah Badshah University, Rajouri, J & K
23 PUBLICATIONS 191 CITATIONS 41 PUBLICATIONS 384 CITATIONS

SEE PROFILE SEE PROFILE

Tasleem Arif
Baba Ghulam Shah Badshah University
52 PUBLICATIONS 339 CITATIONS

SEE PROFILE

All content following this page was uploaded by Shahid Mohammad Ganie on 30 August 2022.

The user has requested enhancement of the downloaded file.


Machine Learning Techniques for Big
Data Analytics in Healthcare: Current 6
Scenario and Future Prospects

Shahid Mohammad Ganie, Majid Bashir Malik,


and Tasleem Arif

6.1 Introduction

6.1.1 Big Data

Many de nitions are oating around which have raised some confusion, but big
data can be de ned only based on characteristics, dimensions, classi cation along
with the analytical framework for storing and processing data [1]. According to the
Gartner IT Glossary [2] “Big data is a collection of high-volume, high-velocity and
high-variety information assets that require advanced tools and techniques in order
to store and process the data for building superior insight and decision making.”
Big data is also de ned as a collection of complex and voluminous data that are
being generated from different sources including the healthcare system, social net-
working sites, smart meters, sensors, online transactions, and administrative ser-
vices [3]. Among all sectors, healthcare industry data is increasing faster due to the
advancement of smart devices and well-equipped labs [4].

6.1.2 Inclination Towards the Study

The growth of the data can be understood from the fact, Internet Users produce 2.5
quintillion bytes of data each day, 90% of all data today was shaped in the last two
years, in 2016 there was 7.5% of internet population in the world and now estimated
to 3.7 billion people according to recent research cited by Domo [5]. 300+ hours of
videos are being uploaded on YouTube every minute [6]. 40,000 search queries are
being posted every second, 350,000 tweets per minute, and 171 million emails per

S. M. Ganie · M. B. Malik (*) · T. Arif


BGSB University, Rajouri, Jammu and Kashmir, India
e-mail: majidbashirmalik@bgsbu.ac.in

© The Author(s), under exclusive license to Springer Nature 103


Switzerland AG 2022
T. Choudhury et al. (eds.), Telemedicine: The Computer Transformation of
Healthcare, TELe-Health, https://doi.org/10.1007/978-3-030-99457-0_6
104 S. M. Ganie et al.

DATA BEING GENERATED FROM DIFFERENT APPLICATIONS (2018)

Financial services
Other 12%
25%

Resources
2%
Manifacturing
21%

Transportation
4%

Healthcare
7%
Infrastrucure
9%
Retail/WH
13% Media and Entertainment
7%

Fig. 6.1 Generation of “Global Datasphere”

minute [7]. Healthcare data is increasing at a rate of 48% annually, and by 2020, the
Stanford University study projection is that 2314 Exabytes of data will be formed
per year [8]. As represented in Fig. 6.1, most of the data is being generated by dif-
ferent organizations [9].

6.1.3 Characteristics of Big Data

According to the existing literature, characteristics of big data in healthcare give a


brief description of 42 V’s, 17 V’s, 10 V’s, 7 V’s [10] so on and so forth, but it sum-
marizes and divides the concepts into ve dimensions as presented in Table 6.1 [11].

6.2 Literature Review

Different approaches have been carried out by researchers/practitioners using


machine learning techniques in order to integrate, manage, analyze, interpret, etc.
big data in healthcare industry. Technological advancement can be used for better
healthcare outcomes, reducing the cost and complexity of hospitals, and to central-
ize the large volume of data to expand the superiority of the existing healthcare
system [12]. If healthcare data is being created from varied sources with a heteroge-
neity of data formats, to analyze and drive actionable insights is very highly essen-
tial to predict future health conditions. The research work presents the state-of-the-art
examination of data mining techniques using big data in healthcare informatics
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 105

Table 6.1 V’s of big data in healthcare


V’s of big data De nition
• Volume It can be de ned as the parameter that describes the quantity of the data being
generated and stored from different health sources.
• Velocity The rate at which data is being generated and processed along with the
speediness at which data changes especially in case of streaming.
• Variety It is the type and nature of the data. Type refers to different data formats and
generating sources, while nature de nes different forms like structured,
semi-structured, and unstructured data.
• Veracity It is all about the quality of the data. The data that is submitted can either be
incorrect, may have noise or even missing values. How can this data be
con ded for its truthfulness?
• Value It refers to the identi cation and extraction of valuable information for
analysis. Data incorporates information of great bene t and insight for users.

[13]. Prediction of the clinical outcomes using gene expression data and tissue-level
data has been presented for smart healthcare systems. Also, ICU readmissions using
patient-level data, HER’s, etc. have been discussed; social media data regarding
healthcare has been also presented using population-level data for better prediction
of different diseases. In [14], a probabilistic data acquisition mechanism and sto-
chastic prediction framework are designed and implemented for analyzing health-
care data using big data analytics. Intra-cluster correlation evaluation (IaCE) and
inter-cluster correlation evaluation (IeCE) have been developed for healthcare big
data. Besides, the future health condition prediction (FHCP) algorithm is used to
predict the health conditions of a patient based on current status. MapReduce tool
has been used for processing of big data on cloud architecture. The prediction model
gives an accuracy rate of 98% for predicting the current health status correlated with
a patient. The framework has been developed through extensive simulations in the
cloud architecture and 90% of it utilizes the resources ef ciently to reduce analysis
time. The research work provides the detailed, complete, and organized study of
recent methodology of big data healthcare applications in ve sections including
ML, heuristic-based, agent-based, and hybrid approaches [15]. The work surveys
205 research articles from different reputed publishing houses that have been pre-
sented and discussed for big data analytics in healthcare domain. Big data security
and privacy are considered big challenges to secure the information of the patients
in healthcare [16]. The security and privacy issues are discussed and addressed so
that the oating information of different sources can be secured from the third par-
ties in healthcare. Anonymization and encryption techniques have been designed
and implemented in this study. Different stages of big data lifecycle in the health-
care domain like data collection, aggregation, analyze, interpretation, information
gain, etc. are to be protected and encrypted from the end-users by applying different
privacy preservation techniques. The paper highlights the vast impacts of big data in
healthcare system [17]. The different stakeholders like medical practitioners,
patients, physicians, clinicians, pharmaceuticals, and health insurers are taken into
account while dealing with big data lifecycle in healthcare informatics. Big data
technology like Hadoop architecture has been used for different tasks, viz. data
106 S. M. Ganie et al.

collection, data processing, big storage, and data analysis. Also, some of the promi-
nent big data applications in healthcare have been discussed. Big data was catego-
rized into four types: descriptive, prescriptive, predictive, and discovery analytics.
In [4], classi cation of big data in healthcare along with varied sources is discussed.
Big data is being generated from sources like HER’s, EMR’s, clinical reports,
Omics, etc. and to integrate this large volume of data in healthcare analytics to pro-
mote the personalized treatment of different diseases. Various platforms for health-
care data analytics like Hadoop, spark, etc. are being used for the management of
big data. To develop a sophisticated framework for a healthcare system that will
integrate diverse elds like bioinformatics, chemoinformatics, etc. to provide better
decision making in scienti c and social science. This will facilitate healthcare by
new treatments of epidemics, provide early alarming of diseases, and will help the
healthcare providers in drug and medicine discoveries.

6.3 Big Data Healthcare Analytics (BDHA)

In a big data environment, data analytics is a step to perform actionable insights


over data with the help of highly scalable distributed platforms and technologies
that are capable of storing and processing the big data from varied sources [18]. It
is the overall organization of the whole data lifecycle from collection, integration,
storing, analyzing, and leading data so that meaningful information is generated
and presented in the form of reports or visualization for decision support or even
decision-making [19]. The different stages of big data for healthcare analytics
include identifying the problem, data aggregation, preprocessing of data, perform-
ing analytics over data, and then visualization of the results as shown in Fig. 6.2 [20].
BDA helps to make healthier decisions, optimize business operations by analyz-
ing customer behavior, cost reduction, and making smarter and more ef cient sup-
port systems in the organizations [21]. For instance, a healthcare provider uses big
data from various repositories for the better prediction of disease by making action-
able insights from raw data [4]. There are four general categories of big data analyt-
ics [22]. As depicted in Fig. 6.3, as we go upwards from descriptive to prescriptive
analytics, we add to the value of the results generated but at the same time complex-
ity also increases proportionally [22].

6.4 Various Big Data Platforms for Healthcare Analytics

This fast-growing and tremendous amount of health data has far exceeded the
bounds of human interpretation and analytical capabilities [23]. Powerful and ver-
satile DM and ML techniques are the dire need of time so that they can automati-
cally excavate and organize information from heterogeneous and complex healthcare
data [24]. The goal is to link the computational power of the technology to automate
the capability of ingestion, processing, and visualization of results from the huge
volume of data for the betterment of public health [25]. The big data platforms for
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 107

Fig. 6.2 Big data lifecycle


for healthcare Problem
Identification

Big Data
Sources

Big Data
Storage

Big Data
Process

Big Data
Analytics

Big Data
Visualization

healthcare analytics have been categorized based on scaling [26]. Scaling means
adapting to cumulative demands of the system in terms of processing requests.
Scaling can be horizontal scaling and vertical scaling.

6.4.1 Horizontal Scaling Platforms

Horizontal scaling or “scale-out” is distributed data processing that shares workload


among multiple nodes. This increases the processing capability up to a great extent.
The most popular horizontal scale-out platforms are as follows [26]:
108 S. M. Ganie et al.

Big Data Analytical Categories

How can
we make it
happen?

PRESCRIPTIVE
What will ANALYTICS
happen?

V
PREDECTIVE
A
ANALYTICS
L Why did it
U happen?
E
DIAGNOSIS
What ANALYTICS
happened?

DESCRIPTIVE
ANALYTICS

COMPLEXITY

Fig. 6.3 Big data analytics at different phases

6.4.1.1 Peer-to-Peer Networks


One of the decentralized and scattered network architectures having millions of
nodes interconnected with each other [27]. In this architecture, the machines (called
peers) provide as well as ingest resources. Peer to peer is the oldest distributed net-
work platform used for computing big data. Usually the Message Passing Interface
(MPI) standard is the communication system deployed to communicate, transfer,
and exchange the data between the machines (peers) [26]. In addition to this, it is
suitable for iterative processing and capable of storing data instances over a limit-
less boundary (can be millions of nodes). MPI acts as a master as well as a slave to
utilize the resources when deployed in the master-slave model.

6.4.1.2 Apache Hadoop


Hadoop ecosystem [28] is a batch processing open-source software framework that
permits to storage and processes large and complex data sets that cannot be handled
by traditional approaches. Hadoop works in a decentralized environment across a
cluster of computers. It is particularly developed to scale up from single servers to
hundreds and even thousands of machines [17]. Apache Hadoop framework is
highly faulted tolerant with the feature of replicating data [26]. The Hadoop ecosys-
tem comprises an array of related software, which is collectively used for collection
and aggregation of big data up to analysis of required results [17].
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 109

Data Collection
• Sqoop: An open-source framework comprised of SQL and Hadoop systems [29].
Sqoop is designed to provide a command-line interface that is used to transfer
data to-and-fro between HDFS and relational database servers.
• Flume: Flume is a robust tool used for data ingestion in HDFS. It is used to col-
lect, aggregate, and transpose a large volume of real-time data such as log-based
les, social media, and email [30]. It also operates by capturing the real-time/
streaming data from different web services to HDFS.
• Kafka: It is an open-source disseminated platform that communicates the
publish-subscribe messaging system. Originally it was developed at LinkedIn
and later it became part of the Apache project and is now used by many organiza-
tions because of its scalability, durability, and fault tolerance [31].
• Chukwa: It is distributed data collection and rapid processing system. Chukwa is
a powerful and exible platform with different core components as Agents,
Collectors, MapReduce Jobs, and Hadoop Infrastructure Care Center (HICC) [17].

Data Processing
• MapReduce: Dean and Ghemawat proposed MapReduce at Google technology
[32]. MapReduce takes care of processing and computing data present in
HDFS. The working principle of MapReduce is mainly divided into two parts,
called Mappers and Reducers [33]. The Map module takes input data from HDFS
and translates it into some intermediate results to the reducers. The reduced task
is based on map task; it takes the output of Mappers as input and then generates
the nal results by aggregating the intermediate sets which are again re ected in
HDFS [34].
• YARN: Yet Another Resource Negotiator is a framework that provides job
arrangement and cluster resource organization, i.e., the jobs across the cluster
[35]. It is a global scheduler that handles both batch and stream processing [36].
It is also known as MapReduce version 2 and is compatible with MapReduce,
having master-slave architecture with the full support of the virtual distrib-
uted system.
• Storm: An open-source computing system used for real-time processing of big
data analytics [28]. The storm is distributed, reliable, and fault-tolerant in nature,
used for extract, transform, and load (ETL) operations, online machine learning,
and continuous computation [37].
• Flink: It is a data ingestion tool in HDFS [38], which came into the picture for
gathering, combining, and transport a large amount of streaming data. The main
idea behind ink is to capture a large amount of data from varied sources into
HDFS [39].
• Spark: It is a cluster computing framework used for both batch and real-time
processing [40]. Spark was developed at the University of California and is next-
generation big data processing engine. It is a cluster computing framework par-
ticularly designed for the speed up (100 times more than the speed of Hadoop
MapReduce for data that resides in main memory and around 10 times faster
when data is in disk) in terms of processing [41].
110 S. M. Ganie et al.

Big Data Storage


• HDFS: Hadoop Distributed File System (HDFS) [7] is the main module of
Hadoop ecosystem that is used to store big data les across various clusters of
the cost-effective system (commodity hardware) while providing high availabil-
ity and fault tolerance [42].
• HBase: Apache HBase is a NoSQL non-relational distributed columnar database
that is being used to store unstructured and semi-structured data easily and also
provides a real-time read/write access mechanism [43].
• Hcatalog: It is a table storage management system for Hadoop and provides a
shared schema and data type mechanism to read and write data for various tools
of Hadoop ecosystem [17].

Big Data Analysis


• Pig: Pig is a high-level data ow system developed by yahoo, used to write
simple queries that are converted into the MapReduce program and then exe-
cuted over Hadoop cluster [44]. It has been used to overcome the complexity of
Map and Reduce Stages; Pig helps to process bulk large data sets by spending
less time in writing MapReduce programs.
• Hive: It is a tool for big data analytics made on the top of Hadoop, used to ana-
lyze structured and semi-structured data. It runs an interface mechanism for per-
forming many queries transcribed in HQL (Hive Query Language) that are alike
to SQL declarations [45].
• Mahout: Apache Mahout is meant for machine learning that runs on Hadoop
ecosystem, extracts insight from a large volume and complex data [46]. It is used
for various machine learning tasks like clustering, classi cation, collaborative
ltering, and text mining in a scalable and distributed fashion [46].

Hadoop ecosystem comprises a bunch of tools that are used together by various
organizations to handle big data. So, it provides all the services that are required to
process big data for accomplishing the different tasks. The pictorial representation
of Hadoop ecosystem is shown below in Fig. 6.4 [17].

6.4.2 Vertical Scaling Platforms

Vertical scaling or “scale up” is achieved by enhancing the processing and storage
capability of a single server by increasing the number of processors or cores and
enhancing the memory and other required resources. The most popular vertical
scale-up platforms are as follows [26]:

• HPC clusters: The blades of high-performance computing with a big number of


processing cores have a dynamic memory organization, different levels of cache,
and communication mechanisms that are optimized for diverse user require-
ments [47]. To achieve the scalability of such systems is much costlier than
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 111

Data Management
Zookeeper

and workflow
Ambari

Oozie

Sqoop Spark HDFS Pig

Map
Flume HBase Hive
Reduce

Kafka YARN HCatalog Mahout

Chukwa Flink MongoDB Avro

Scribe Storm Cassandra Impala

Data Ingestion Data Processing Data Storage Data Analysis

Fig. 6.4 Hadoop ecosystem for healthcare analytics

Hadoop or Spark. The Message Passing Interface (MPI) is usually used for com-
munication and sharing information in such systems [48].
• Graphics Processing Unit (GPUs): They are used for accelerating the graphic
operations for pictorial representation on display using frame buffer [26].
General-purpose computing on graphics processing units is the result of advanced
enhancements in hardware and algorithms in GPU-like parallel architecture.
• Field Programmable Gate Arrays (FPGA): Field Programmable Gate Arrays are
specialized and custom-built hardware units that can be optimized for high speed
and throughput for a particular application [49]. The customization and develop-
ment cost is typically very high, as compared to other platforms also its feature
of programming using Hardware Descriptive Language (HDL) increases the
overall development cost [26].

6.5 Materials and Methods

Machine learning techniques are being used broadly for the health data to provide
the diagnosis and prognosis of several diseases [50]. Implementation of sophisti-
cated computer algorithms in the area of healthcare is an old story but there are still
some areas in healthcare where advanced computing, scienti c methods, and per-
fect visualization by using machine learning and deep learning paradigms can make
a difference [51]. ML techniques play a dynamic role in the early prediction of
112 S. M. Ganie et al.

Fig. 6.5 Machine learning


techniques for healthcare Machine Learning
analytics Techniques

Supervised Learning

Un-Supervised Learning

Semi-Supervised Learning

Active
Learning

Reinforcement
Learning

diseases and their complications to reduce costs and readmissions in hospitals.


Machine learning techniques can be categorized based on learning mechanisms as
shown in Fig. 6.5 [52].

6.5.1 Machine Learning Techniques

Machine learning techniques are built on the basic principles and concepts of statis-
tics that are being used to solve data mysteries. Machine learning algorithms are
used successfully for data analytics by different organizations, especially in health-
care [53]. The ML algorithms used for data analysis following are some most prom-
inent ML classi ers that are being widely used in big data healthcare analytics to
provide better solutions to health problems [54]. Machine learning techniques can
be explored for various real-world problems and also these problems can be mod-
eled to produce optimal solution [28].

• Logistic Regression: It is a supervised ML model for classi cation that produces


the dichotomous output in terms of binary format which is used to probabilisti-
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 113

cally forecast the outcome of categorical variable by using a sigmoid function


[55]. It forecasts the probability of the event based on the logarithmic function
decided by some threshold. “Suppose there are ‘N’ input independent variables
with their values represented by X1, X2, X3, …………….Xn. Let us assume that Y is
the probability of Yes (1) and 1 − Y is the probability of No (0). Then the nal
logistic regression equation is given as” [56]:

 Y 
log   ==> Y = C + B1 X 1 + B2 X 2 + B3 X 3 + Bn X n (6.1)
1 − Y 
• Naive Bayes: It is a ML-based classi er, mostly used for predictive analytical of
different real-life problems [57]. It is classi cation technique which is based on
Bayesian theorem with an inference among predicate and target variables used
for study. Its working mechanism is simple and is highly used for large datasets.
“Given a hypothesis H and evidence E, Bayes theorem states that the relationship
between the probability of the hypothesis before getting the evidence P(H) and
the probability of the hypothesis after getting the evidence P(H|E) is” [58]:

P E |H .P H
P H |E = (6.2)
P E
In simple English, we write the equation as:
P(H|E) = Posterior likelihood
P(E|H) = Probability
P(H) = Session prior likelihood
P(E) = Forecaster prior likelihood
• Support Vector Machine: It is a ML classi er that is designed by a separative
hyper-plane [59]. Support vector machine algorithm is being used for both clas-
si cation and regression analysis for different real-world problems. The basic
function of this supervised classi er is to segregate or classify the given data
points in the best possible and appropriate way in a multidimensional space. To
work in high and complex dimensions, the SVM classi er uses different versions
of the kernels like linear, polynomial, and radial basis function kernels. The
equations for various kernels are [60]:
Linear Kernel Equation

F X = B 0 + Sum ai ∗ x,xi (6.3)


Polynomial Kernel Equation
b
K X 1 ,X 2 = a + X 1T X 2 (6.4)
where b = degree of kernel and a = constant term.
Radial Basis Function Kernel Equation
2
K X 1 ,X 2 = exponent − X 1, X 2 (6.5)
114 S. M. Ganie et al.

where ||X1, X2|| = Euclidean distance between (X1 and X2)


• Decision Tree: It is a classi cation technique with the graphical representation of
all possible solutions to a decision based on certain conditions [55]. The working
principle behind the DT algorithm is decision-making for a speci c classi cation
problem. The computation procedure of the algorithm is very expensive as far as
training and testing data is being concerned. In this acyclic connected graph at
each node, one feature is selected to make separating decisions that lead to the
prediction. The process of splitting the nodes continues unless and until the leaf
node has optimally fewer data points. The decision tree classi er is shown in
Fig. 6.6 [61].
• Random Forest: It is a well-known supervised ML classi er that is used for
regression and classi cation problems [62]. RF algorithm is made by using many
DT models where it compiles all the results of the decision trees that lead to the
nal outcome. The RF classi er handles the problem of over tting that may
make the results worse by creating the n number of decision trees depending
upon the size and complexity of the dataset. The decision tree classi er is shown
in Fig. 6.7 [63].

Fig. 6.6 The decision tree

Fig. 6.7 The random


forest Random Forest

Tree 1 Tree 2 Tree 3


6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 115

Input Layer

Hidden Layer 1

Hidden Layer 2

Output Layer

Fig. 6.8 The neural network

• Arti cial Neural Networks: Arti cial Neural Networks are algorithms inspired
by human nervous system that have self-capability to produce better results on
voluminous data [64]. The structure of the neural network can be divided into
three layers: input (disparate sources), hidden (accountable for internal process-
ing), and output (desired results). ANN is currently being used in healthcare
analytics for better diagnosis and prognosis of various diseases because of the
diversity of data sources in healthcare like EHR, clinical data, omics data, CT
scans, ultrasound images, MRI images, etc. The graphical representation of the
ANN is shown in Fig. 6.8 [65]:

6.5.2 Description of Dataset

The publicly available PIMA diabetes dataset has been used for the experimental
study. The dataset is sourced from UC Irvine Machine Learning Repository [66]. It
comprised of 768 instances and 9 attributes, 8 predicted variables and 1 target vari-
able in order to classify whether a patient is diabetic or not. The dataset contains the
rich amount of contributing parameters that can be explored to identify the severity
of the diabetic disease. Different ML classi ers have developed realistic health
management system for diabetes. Table 6.2 demonstrates the attribute name,
description, range, and their data type used in the PIMA dataset.

6.5.3 Proposed Framework

The proposed framework shown in Fig. 6.9 presents the working principle of develop-
ing ML techniques for forecast of various chronic diseases showcasing diabetes. In
this framework, ML models have been developed using the PIMA dataset for forecast
of diabetes. The preprocessing techniques have been used to perform various
116 S. M. Ganie et al.

Table 6.2 Parameter description used for the analysis


Data
S. No Attribute name Description Range Type
1 Pregnancies Number of times an individual is [0–17] int64
pregnant
2 Glucose Plasma glucose concentration [0–199] int64
3 BloodPressure Diastolic blood pressure of an [0–122] int64
individual
4 SkinThickness Triceps skin fold thickness (mm) [0–99] int64
5 Insulin 2-hour serum insulin (mu U/ml) [0–846] int64
6 BMI Body Mass Index of individual (kg/ [0– oat64
m2) 67.10]
7 DiabetesPedigreeFunction Diabetes Pedigree Function [0–2.42] oat64
8 Age Age of the individual [21–28] int64
9 Class Diabetic or Non-diabetic [0/1] int64

Machine learning Models

Hyper-parameter
Tuning

Training
Training Models Evaluation
PIMA diabetes Dataset Building Results
Dataset

Testing
Testing Evaluation
Data Pre-processing Dataset Models Results
Techniques

Data splitting and Validation Performance Prediction


Validation Evaluation of
Validation Dataset
Results of Results Diabetes

Fig. 6.9 The proposed framework for prediction of diabetes

statistical measurements over the dataset to improve the quality assessment of the
data. Feature engineering has been done to identify the contribution of the parameters
towards the prediction of disease. Data transformation was performed before building
the various ML models. K-fold cross-validation, where K = 10, has been incorporated
to validate the performance evaluation of various statistical measures of ML models.

6.6 Case Study and Application

The machine learning paradigm is a highly sophisticated technological application


that is being applied almost in every sphere of engineering and science domains,
especially in healthcare [67]. In healthcare domain, the analysis of healthcare
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 117

parameters and based on these parameters predicting the future health conditions of
different chronic diseases are still in the informative stage [14, 68]. Machine learn-
ing libraries help in implementing the whole end-to-end ML process, which includes
building a statistical model that predicts a precise outcome. Different supervised
machine learning algorithms, viz. logistic regression (LR), Naive Bayes (NB), sup-
port vector machine (SVM), decision tree (DT), random forest (RF), and arti cial
neural network (ANN), have been developed by using the PIMA diabetes dataset.
These classi ers were executed using Jupyter and Spyder (Integrated Development
Environment) along with python 3.8 in windows operating system [69].
Exploratory data analysis (EDA) is used to describe the dataset for different tasks
based on inferential statistics that need to be executed before building the ML mod-
els. The report of attributes used in the dataset is depicted in Fig. 6.10. The statisti-
cal measurements of parameters in dataset like count, mean, std., min, max, etc.
have been calculated. The describe() function on pandas DataFrame has been used
to list the statistical measures of each parameter in the dataset.
The data exploratory analysis has been performed to identify the relationship
among the attributes used in the dataset as shown in Fig. 6.11. Different packages
for data visualization like Matplot, Seaborn, etc. have been used to draw the correla-
tion matrix plot between the set of attributes. The association between the reliant on
and autonomous variables lies in between −1 and +1.
The predictive machine learning models, viz. LR, NB, SVM, DT, RF, and ANN,
were employed. The algorithms were developed by splitting the PIMA diabetes
dataset into an 80:20 ratio, i.e., 80% of data has been used for training the classi ers
and 20% of data for testing the developed models. Also, ten-fold cross-validation
has been applied to validate the building models for better prediction of diabetes

Dataset Description
900
800
700
600
500
400
300
200
100
0
Diabetes
Blood Skin
Pregnancies Glucose Insulin BMI Pedigree Age class
Pressure Thinkness
Function
count 768 768 768 768 768 768 768 768 768
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.88416 0.331329 11.760232 0.476951
min 0 0 0 0 0 0 0.078 21 0
25% 1 99 62 0 0 27.3 0.24375 24 0
50% 3 117 72 23 30.5 32 0.3725 29 0
75% 6 140.25 80 32 127.25 36.6 0.62625 41 1
max 17 199 122 99 846 67.1 2.42 81 1

Fig. 6.10 Dataset description used in the experimental study


118 S. M. Ganie et al.

1.0
Pregnancies 1 0.13 1.14 –0.082 –0.073 0.018 –0.034 0.54 0.22

Glucose 0.13 1 0.15 0.056 0.33 1.22 0.14 0.26 0.47 0.8

BloodPressure 0.14 0.15 1 0.21 0089 0.28 0.041 0.24 0.065

0.6
SkinThinkness –0.082 0.056 0.21 1 0.44 0.39 0.18 –0.12 0.073

Insulin –0.073 0.33 0.089 0.44 1 0.2 0.19 –0.41 1.13


0.4

BMI 0.018 0.22 0.28 0.39 0.2 1 0.14 0.036 0.29

DiabetesPedigreeFunction –0.034 0.14 0.041 0.18 0.19 0.14 1 0.033 0.17 0.2

Age 0.54 0.26 0.24 –0.12 –0.041 0.036 0.033 1 0.24


0.0
Class 0.22 0.41 0.065 0.073 0.13 0.29 0.17 0.24 1

Diabetes Pedigree Function

Class
BloodPressure
Pregnancies

Glucose

Age
SkinThinkness

BMI
Insulin

Fig. 6.11 Correlation matrix plot of attributes

Table 6.3 Accuracy of machine learning classi ers

Model Training accuracy Testing accuracy


Logistic regression 76.50% 80.51%
Naïve Bayes 76.01% 74.67%
Support vector machine 76.34% 79.87%
Decision tree 83.36% 75.32%
Random forest 84.99% 79.22%
Arti cial neural network 71.94% 70.77%

diseases. The aggregation accuracy results using ten-fold cross-validation of differ-


ent algorithms are shown in Table 6.3.
Figure 6.12 shows the accuracy results achieved by various classi ers used for
prediction of diabetes. Among all the classi ers, Logistic Regression happened to
be the best model with a testing accuracy rate of 80.51%, followed by support vec-
tor machine model with an accuracy rate of 79.87%, random forest with an accuracy
rate of 79.22%, decision tree with an accuracy rate of 75.32%, Naive Bayes with an
accuracy rate of 74.67%, and lastly arti cial neural network with an accuracy rate
of 70.77%.
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 119

100.00%

90.00%
83.36% 84.99%
80.51% 79.87% 79.22%
80.00% 76.50% 76.01% 76.36%
74.67% 75.32%
71.94% 70.77%
70.00%

60.00%

50.00%

40.00%

30.00%

20.00%

10.00%

0.00%
Training Accuracy Testing Accuracy

Logistic Regression Naïve Bayes Support Vector Machine


Decision Tree Random Forest Artificial Neural Network

Fig. 6.12 Accuracy of various algorithms

Table 6.4 Various statistical measures of algorithms


Algorithm Precision (%) Recall (%) Speci city (%) FPR (%) FNR (%) F1-score (%)
LR 0.880 0.819 0.755 0.224 0.180 0.851
NB 0.845 0.773 0.687 0.312 0.226 0.807
SVM 0.876 0.817 0.760 0.240 0.182 0.845
DT 0.855 0.775 0.702 0.297 0.224 0.813
RF 0.876 0.801 0.750 0.250 0.198 0.837
ANN 0.835 0.736 0.636 0.363 0.236 0.782

The other statistical measures are shown in Table 6.4. The various measurements
have been used to evaluate the performance evaluation of different classi ers
towards the prediction of diabetic disease. However, in terms of precision, LR
model predicts best with the highest precision rate of 0.88%, whereas in terms of
false positive rate and false negative rate, LR predicts with the lowest rate of 0.224%
and 0.180%, respectively.

6.7 Conclusion and Future Scope

This chapter began with a brief outline of big data, its classi cation, and its charac-
teristics. The procedure of BDA in healthcare from data acquisition up to the gen-
eration of required results is presented in this study. Then various platforms for big
data and BDA in healthcare have been discussed, through which we can store, man-
age, analyze, lter, and distribute healthcare data for further analytical procedures
to provide better solutions to real-world healthcare problems. Different challenges
regarding BDA in healthcare have been also discussed. These challenges must be
taken into account while dealing with big data analytics life cycle. In this chapter,
120 S. M. Ganie et al.

various ML classi ers have been discussed that play a signi cant role in healthcare
industry. The study has presented and implemented the framework by using the
PIMA diabetes dataset to forecast whether a patient is diabetic or not. The results
achieved by the framework are accuracy, recall, precision, f1-score, false positive
rate, false negative rate, etc. The classi ers used in this model have been validated
by using k-fold cross-validation technique. To extend the current study, these ML
models shall be used for other related diseases which share communality of data
with diabetes. Also, for ef ciency and reliability of the framework used in this study
ensemble and hybridization ML techniques shall be used for the forecast of various
diseases in early stages to save human lives. Similarly, new mining techniques
should be developed so that sensitive information can be extracted from the huge
volume of health data.

References
1. Wasson M, Buck A, Robe J., Wilson M. Big data architecture style. Azur. Appl. Archit. Guid.
| Microsoft Docs, 2018, pp. 1–7.
2. Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf
Manag. 2015;35(2):137–44. https://doi.org/10.1016/j.ijinfomgt.2014.10.007.
3. Oracle. Oracle: Big Data for the enterprise Oracle White Paper—Big Data for the enterprise,
An Oracle White Pap., no. June, 2013.
4. Dash S, Shakyawar SK, Sharma M, Kaushik S. Big data in healthcare: management, analysis
and future prospects. J Big Data. 2019;6(1) https://doi.org/10.1186/s40537-019-0217-0.
5. How much data do we create every day? The mind-blowing stats everyone should read.
6. 300 Hours of video are uploaded to Youtube every minute..
7. Google Search Statistics—Internet live stats.
8. Infographic: How Big Data will unlock the potential of healthcare.
9. Saracco R. Another shift in content production, 2020. pp. 2019–2020
10. Shafer T. The 42 V’s of Big Data and Data Science, kdnuggets.com Elder Res., pp. 1–3, 2017,
[Online]. Available: https://www.kdnuggets.com/2017/04/42-vs-big-data-data-science.html.
11. Hameed Shnain A, Jasim Hadi H, Hadishaheed S, Haji Ahmad A. Big data and ve V’S
characteristics. Int J Adv Electron Comput Sci. 2015;2:393–2835. Available: https://www.
researchgate.net/publication/332230305
12. Ganie SM, Malik MB. Comparative analysis of various supervised machine learning algo-
rithms for the early prediction of type-II diabetes mellitus. Int J Med Eng Inform. 2021;1(1):1.
https://doi.org/10.1504/ijmei.2021.10036078.
13. Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health infor-
matics. J Big Data. 2014;1(1) https://doi.org/10.1186/2196-1115-1-2.
14. Sahoo PK, Mohapatra SK, Wu SL. Analyzing Healthcare Big data with prediction
for future health condition. IEEE Access. 2016;4:9786–99. https://doi.org/10.1109/
ACCESS.2016.2647619.
15. Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications:
a comprehensive and systematic literature review. J Biomed Inform. 2018;2017(82):47–62.
https://doi.org/10.1016/j.jbi.2018.03.014.
16. Abouelmehdi K, Beni-Hessane A, Khalou H. Big healthcare data: preserving security and
privacy. J. Big Data. 2018;5(1):1–18. https://doi.org/10.1186/s40537-017-0110-7.
17. Bahri S, Zoghlami N, Abed M, Tavares JMRS. BIG DATA for Healthcare: a survey. IEEE
Access. 2019;7:7397–408. https://doi.org/10.1109/ACCESS.2018.2889180.
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 121

18. Chong D, Shi H. Big data analytics: a literature review. J Manag Anal. 2015;2(3):175–201.
https://doi.org/10.1080/23270012.2015.1082449.
19. Tsai CW, Lai CF, Chao HC, Vasilakos AV. Big data analytics: a survey. J Big Data.
2015;2(1):1–32. https://doi.org/10.1186/s40537-015-0030-3.
20. B. T. Erl, P. Buhler, and W. Kha, Big Data adoption on and planning considerations LiveLessons
(Video Training) Big Data analytics lifecycle. This chapter is from the book Business Case
Evaluation This chapter is from the book, 2019, pp. 1–19.
21. Yaqoob I, et al. Big data: From beginning to future. Int J Inf Manag. 2016;36(6):1231–47.
https://doi.org/10.1016/j.ijinfomgt.2016.07.009.
22. Delen D, Ram S. Research challenges and opportunities in business analytics. J Bus Anal.
2018;1(1):2–12. https://doi.org/10.1080/2573234x.2018.1507324.
23. Mazumdar S, Seybold D, Kritikos K, Verginadis Y. A survey on data storage and place-
ment methodologies for cloud-big data ecosystem. J Big Data. 2019;6(1):1–37. Springer
International Publishing
24. Winter G. Machine learning in healthcare. Br J Heal Care Manag. 2019;25(2):100–1. https://
doi.org/10.12968/bjhc.2019.25.2.100.
25. Ganie SM, Malik MB, Arif T. Various platforms and machine learning techniques for Big
Data analytics: a technological survey. Int J Scienti c Res Comput Sci Eng Inform Technol.
2018;3(6):679–87.
26. Singh D, Reddy CK. A survey on platforms for big data analytics. J. Big Data. 2015;2(1):1–20.
https://doi.org/10.1186/s40537-014-0008-6.
27. Irestig M, Hallberg N, Eriksson H, Timpka T. Peer-to-peer computing in health-promoting
voluntary organizations: system design analysis. J Med Syst. 2005;29(5):425–40. https://doi.
org/10.1007/s10916-005-6100-x.
28. Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for
machine learning with big data in the Hadoop ecosystem. J Big Data. 2015;2(1):1–36. https://
doi.org/10.1186/s40537-015-0032-1.
29. Mehta S, Mehta V. Hadoop ecosystem: an introduction. Int J Sci Res. 2016;5(6):557–62.
https://doi.org/10.21275/v5i6.nov164121.
30. Bhagavatula VSN, Raju SS. A survey of hadoop ecosystem as a handler of bigdata, no. August
2016, 2017.
31. Leang B, Ean S, Ryu GA, Yoo KH. Improvement of kafka streaming using partition and multi-
threading in big data environment. Sensors (Switzerland). 2019;19(1) https://doi.org/10.3390/
s19010134.
32. Dean J, Ghemawat S. MapReduce: simpli ed data processing on large clusters. In: OSDI
2004—6th Symp. Oper. Syst. Des. Implement.; 2004. p. 137–49. https://doi.org/10.21276/
ijre.2018.5.5.4.
33. Sun P, Wen Y. Scalable architectures for Big Data analysis. Encycl Big Data Technol.
2019:1446–54. https://doi.org/10.1007/978-3-319-77525-8_281.
34. Kaur I, Kaur N, Ummat A, Kaur J, Kaur N. Research paper on big data and Hadoop. Int J
Comput Sci Technol. 2016;8491(1):50–3.
35. Mathiya BJ, Desai VL. Apache Hadoop Yarn Parameter con guration challenges and optimi-
zation. In: Proceedigs of the IEEE International Conference on Soft-Computing and Networks
Security (ICSNS). IEEE; 2015. https://doi.org/10.1109/ICSNS.2015.7292373.
36. Perwej Y, Kerim B, Adrees MS, Sheta OE. An empirical exploration of the Yarn in Big Data.
Int J Appl Inf Syst. 2017;12(9):19–29. https://doi.org/10.5120/ijais2017451730.
37. Alkatheri S, Abbas SA, Siddiqui MA. Big Data frameworks: a comparative study. Int J Comput
Sci Inf Secur. 2019;17(1)
38. Perwej DY, Omer M, Kerim B. A comprehend the Apache Flink in big data environments. IOSR
J Comput Eng (IOSR-JCE). 2018;20(1):48–58. https://doi.org/10.9790/0661-2001044858.
39. Rabl T, Traub J, Katsifodimos A, Markl V. Apache Flink in current research. IT Inf Technol.
2016;58(4):2–9. https://doi.org/10.1515/itit-2016-0005.
122 S. M. Ganie et al.

40. Benbrahim H, Hachimi H, Amine A. Comparison between Hadoop and Spark. In: Proceedings
of the International Conference on Industrial Engineering and Operations Management, vol.
2019; 2019. p. 690–701.
41. Qureshi NM, et al. Dynamic container-based resource management framework of spark eco-
system. In: 2019 21st International Conference on Advanced Communication Technology
(ICACT). IEEE; 2019. p. 522–6. https://doi.org/10.23919/ICACT.2019.8701970.
42. Basu P. HDFS for big data. J Chem Inf Model. 2013;53(9):1689–99. https://doi.org/10.1017/
CBO9781107415324.004.
43. Jin C, Ran S. The research for storage scheme based on Hadoop. In: Proceedings of the 2015
IEEE International Conference Computer and Communications (ICCC) 2015. IEEE; 2015.
p. 62–6. https://doi.org/10.1109/CompComm.2015.7387541.
44. Swarna C, Ansari Z. Apache Pig—a data ow framework based on Hadoop map reduce. Int
J Eng Trends Technol. 2017;50(5):271–5. https://doi.org/10.14445/22315381/ijett-v50p244.
45. Fuad A, Erwin A, Ipung HP. Processing performance on Apache Pig, Apache Hive and MySQL
cluster. In: Proceedings of the 2014 International Conference on Information, Communication
Technology and System (ICTS), 2014. IEEE; 2014. p. 297–301. https://doi.org/10.1109/
ICTS.2014.7010600.
46. Eluri VR, Ramesh M, Al-Jabri ASM, Jane M. A comparative study of various clustering tech-
niques on big data sets using Apache Mahout. In: 2016 3rd MEC Int. Conf. Big Data Smart
City, ICBDSC 2016. IEEE; 2016. p. 374–7. https://doi.org/10.1109/ICBDSC.2016.7460397.
47. Kumar D, Ali L, Memon S. Design and implementation of high performance computing
(HPC) cluster design and implementation of high performance computing (HPC) Cluster, no.
January, 2018.
48. Yeo CS, Buyya R, Eskicioglu R, Graham P. Handbook of nature-inspired and innovative com-
puting. In: Handbook nature inspired innovative computing, June 2014; 2006. p. 0–24. https://
doi.org/10.1007/0-387-27705-6.
49. Ruiz-Rosero J, Ramirez-Gonzalez G, Khanna R. Field programmable gate array appli-
cations—a scientometric review. Computation. 2019;7(4):63. https://doi.org/10.3390/
computation7040063.
50. Lai H, Huang H, Keshavjee K, Guergachi A, Gao X. Predictive models for diabetes mel-
litus using machine learning techniques. BMC Endocr Disord. 2019;19(1):1–9. https://doi.
org/10.1186/s12902-019-0436-6.
51. Guleria P, Sood M. Intelligent Learning analytics in healthcare sector using machine learn-
ing. 2020.
52. Sarwar MA, Kamal N, Hamid W, Shah MA. Prediction of diabetes using machine learn-
ing algorithms in healthcare. In: ICAC 2018–2018 24th IEEE Int. Conf. Autom. Comput.
Improv. Product. through Autom. Comput., September; 2018. p. 1–6. https://doi.org/10.23919/
IConAC.2018.8748992.
53. Doupe P, Faghmous J, Basu S. Machine learning for health services researchers. Value Heal.
2019;22(7):808–15. https://doi.org/10.1016/j.jval.2019.02.012.
54. Ferdous M, Debnath J, Chakraborty NR. Machine learning algorithms in healthcare: a literature
survey. In: 2020 11th International Conference on Computing, Communication and Networking
Technologies (ICCCNT). IEEE; 2020. https://doi.org/10.1109/ICCCNT49239.2020.9225642.
55. Patil R, Tamane S. A comparative analysis on the evaluation of classi cation algorithms in the
prediction of diabetes. Int J Electr Comput Eng. 2018;8(5):3966–75. https://doi.org/10.11591/
ijece.v8i5.pp3966-3975.
56. Celine S, Dominic MM, Devi MS. Logistic regression for employability prediction. Int J Innov
Technol Explor Eng. 2020;9(3):2471–8. https://doi.org/10.35940/ijitee.c8170.019320.
57. Kaviani P, Dhotre S. International journal of advance engineering and research short survey on
Naive Bayes algorithm. Int J Adv Eng Res Dev. 2017;4(11):607–11.
58. Elkan C. Naive Bayesian learning. 2007, pp. 1–4.
59. Jegan C, Kumari VA, Chitra R. Classi cation of diabetes disease using support vector
machine. Int J Eng Res Appl. 2018;3(2):1797–801. Available: https://www.researchgate.net/
publication/320395340
View publication stats

6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 123

60. Abdillah AA, Suwarno S. Diagnosis of diabetes using support vector machines with radial
basis function kernels. Int J Technol. 2016;7(5):849–58. https://doi.org/10.14716/ijtech.
v7i5.1370.
61. Tree D. Decision trees tutorial (https://opendatascience.com/decision-trees-tutorial/), 2020,
pp. 1–11.
62. Chari KK, Chinna Babu M, Kodati S. Classi cation of diabetes using random forest with
feature selection algorithm. Int J Innov Technol Explor Eng. 2019;9(1):1295–300. https://doi.
org/10.35940/ijitee.L3595.119119.
63. Lateef Z. A comprehensive guide to Random Forest in R, pp. 1–14, 2019 [Online]. Available:
https://www.edureka.co/blog/naive-bayes-in-r/.
64. Santhosh KV, Nayak S. Engineering vibration communication and information processing,
vol. 478. Springer; 2019. p. 523–35. https://doi.org/10.1007/978-981-13-1642-5.
65. Is W, Learning D. what is a neural network? Introduction to arti cial neural networks. 2020,
pp. 1–7.
66. View ALL Data Sets Citation Policy. 2021, p. 2021.
67. Malik MM, Abdallah S, Ala’raj M. Data mining and predictive analytics applications
for the delivery of healthcare services: a systematic literature review. Ann Oper Res.
2018;270(1–2):287–312. https://doi.org/10.1007/s10479-016-2393-z.
68. Nissa N, Jamwal S, Mohammad S. Early detection of cardiovascular disease using machine
learning techniques an experimental study. Int J Recent Technol Eng. 2020;9(3):635–41.
https://doi.org/10.35940/ijrte.c46570.99320.
69. Anaconda Inc., Anaconda Distribution, Anaconda, 2019, [Online]. Available: https://www.
anaconda.com/distribution/.

You might also like