Professional Documents
Culture Documents
net/publication/350486051
CITATIONS READS
3 603
3 authors:
Tasleem Arif
Baba Ghulam Shah Badshah University
52 PUBLICATIONS 339 CITATIONS
SEE PROFILE
All content following this page was uploaded by Shahid Mohammad Ganie on 30 August 2022.
6.1 Introduction
Many de nitions are oating around which have raised some confusion, but big
data can be de ned only based on characteristics, dimensions, classi cation along
with the analytical framework for storing and processing data [1]. According to the
Gartner IT Glossary [2] “Big data is a collection of high-volume, high-velocity and
high-variety information assets that require advanced tools and techniques in order
to store and process the data for building superior insight and decision making.”
Big data is also de ned as a collection of complex and voluminous data that are
being generated from different sources including the healthcare system, social net-
working sites, smart meters, sensors, online transactions, and administrative ser-
vices [3]. Among all sectors, healthcare industry data is increasing faster due to the
advancement of smart devices and well-equipped labs [4].
The growth of the data can be understood from the fact, Internet Users produce 2.5
quintillion bytes of data each day, 90% of all data today was shaped in the last two
years, in 2016 there was 7.5% of internet population in the world and now estimated
to 3.7 billion people according to recent research cited by Domo [5]. 300+ hours of
videos are being uploaded on YouTube every minute [6]. 40,000 search queries are
being posted every second, 350,000 tweets per minute, and 171 million emails per
Financial services
Other 12%
25%
Resources
2%
Manifacturing
21%
Transportation
4%
Healthcare
7%
Infrastrucure
9%
Retail/WH
13% Media and Entertainment
7%
minute [7]. Healthcare data is increasing at a rate of 48% annually, and by 2020, the
Stanford University study projection is that 2314 Exabytes of data will be formed
per year [8]. As represented in Fig. 6.1, most of the data is being generated by dif-
ferent organizations [9].
[13]. Prediction of the clinical outcomes using gene expression data and tissue-level
data has been presented for smart healthcare systems. Also, ICU readmissions using
patient-level data, HER’s, etc. have been discussed; social media data regarding
healthcare has been also presented using population-level data for better prediction
of different diseases. In [14], a probabilistic data acquisition mechanism and sto-
chastic prediction framework are designed and implemented for analyzing health-
care data using big data analytics. Intra-cluster correlation evaluation (IaCE) and
inter-cluster correlation evaluation (IeCE) have been developed for healthcare big
data. Besides, the future health condition prediction (FHCP) algorithm is used to
predict the health conditions of a patient based on current status. MapReduce tool
has been used for processing of big data on cloud architecture. The prediction model
gives an accuracy rate of 98% for predicting the current health status correlated with
a patient. The framework has been developed through extensive simulations in the
cloud architecture and 90% of it utilizes the resources ef ciently to reduce analysis
time. The research work provides the detailed, complete, and organized study of
recent methodology of big data healthcare applications in ve sections including
ML, heuristic-based, agent-based, and hybrid approaches [15]. The work surveys
205 research articles from different reputed publishing houses that have been pre-
sented and discussed for big data analytics in healthcare domain. Big data security
and privacy are considered big challenges to secure the information of the patients
in healthcare [16]. The security and privacy issues are discussed and addressed so
that the oating information of different sources can be secured from the third par-
ties in healthcare. Anonymization and encryption techniques have been designed
and implemented in this study. Different stages of big data lifecycle in the health-
care domain like data collection, aggregation, analyze, interpretation, information
gain, etc. are to be protected and encrypted from the end-users by applying different
privacy preservation techniques. The paper highlights the vast impacts of big data in
healthcare system [17]. The different stakeholders like medical practitioners,
patients, physicians, clinicians, pharmaceuticals, and health insurers are taken into
account while dealing with big data lifecycle in healthcare informatics. Big data
technology like Hadoop architecture has been used for different tasks, viz. data
106 S. M. Ganie et al.
collection, data processing, big storage, and data analysis. Also, some of the promi-
nent big data applications in healthcare have been discussed. Big data was catego-
rized into four types: descriptive, prescriptive, predictive, and discovery analytics.
In [4], classi cation of big data in healthcare along with varied sources is discussed.
Big data is being generated from sources like HER’s, EMR’s, clinical reports,
Omics, etc. and to integrate this large volume of data in healthcare analytics to pro-
mote the personalized treatment of different diseases. Various platforms for health-
care data analytics like Hadoop, spark, etc. are being used for the management of
big data. To develop a sophisticated framework for a healthcare system that will
integrate diverse elds like bioinformatics, chemoinformatics, etc. to provide better
decision making in scienti c and social science. This will facilitate healthcare by
new treatments of epidemics, provide early alarming of diseases, and will help the
healthcare providers in drug and medicine discoveries.
This fast-growing and tremendous amount of health data has far exceeded the
bounds of human interpretation and analytical capabilities [23]. Powerful and ver-
satile DM and ML techniques are the dire need of time so that they can automati-
cally excavate and organize information from heterogeneous and complex healthcare
data [24]. The goal is to link the computational power of the technology to automate
the capability of ingestion, processing, and visualization of results from the huge
volume of data for the betterment of public health [25]. The big data platforms for
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 107
Big Data
Sources
Big Data
Storage
Big Data
Process
Big Data
Analytics
Big Data
Visualization
healthcare analytics have been categorized based on scaling [26]. Scaling means
adapting to cumulative demands of the system in terms of processing requests.
Scaling can be horizontal scaling and vertical scaling.
How can
we make it
happen?
PRESCRIPTIVE
What will ANALYTICS
happen?
V
PREDECTIVE
A
ANALYTICS
L Why did it
U happen?
E
DIAGNOSIS
What ANALYTICS
happened?
DESCRIPTIVE
ANALYTICS
COMPLEXITY
Data Collection
• Sqoop: An open-source framework comprised of SQL and Hadoop systems [29].
Sqoop is designed to provide a command-line interface that is used to transfer
data to-and-fro between HDFS and relational database servers.
• Flume: Flume is a robust tool used for data ingestion in HDFS. It is used to col-
lect, aggregate, and transpose a large volume of real-time data such as log-based
les, social media, and email [30]. It also operates by capturing the real-time/
streaming data from different web services to HDFS.
• Kafka: It is an open-source disseminated platform that communicates the
publish-subscribe messaging system. Originally it was developed at LinkedIn
and later it became part of the Apache project and is now used by many organiza-
tions because of its scalability, durability, and fault tolerance [31].
• Chukwa: It is distributed data collection and rapid processing system. Chukwa is
a powerful and exible platform with different core components as Agents,
Collectors, MapReduce Jobs, and Hadoop Infrastructure Care Center (HICC) [17].
Data Processing
• MapReduce: Dean and Ghemawat proposed MapReduce at Google technology
[32]. MapReduce takes care of processing and computing data present in
HDFS. The working principle of MapReduce is mainly divided into two parts,
called Mappers and Reducers [33]. The Map module takes input data from HDFS
and translates it into some intermediate results to the reducers. The reduced task
is based on map task; it takes the output of Mappers as input and then generates
the nal results by aggregating the intermediate sets which are again re ected in
HDFS [34].
• YARN: Yet Another Resource Negotiator is a framework that provides job
arrangement and cluster resource organization, i.e., the jobs across the cluster
[35]. It is a global scheduler that handles both batch and stream processing [36].
It is also known as MapReduce version 2 and is compatible with MapReduce,
having master-slave architecture with the full support of the virtual distrib-
uted system.
• Storm: An open-source computing system used for real-time processing of big
data analytics [28]. The storm is distributed, reliable, and fault-tolerant in nature,
used for extract, transform, and load (ETL) operations, online machine learning,
and continuous computation [37].
• Flink: It is a data ingestion tool in HDFS [38], which came into the picture for
gathering, combining, and transport a large amount of streaming data. The main
idea behind ink is to capture a large amount of data from varied sources into
HDFS [39].
• Spark: It is a cluster computing framework used for both batch and real-time
processing [40]. Spark was developed at the University of California and is next-
generation big data processing engine. It is a cluster computing framework par-
ticularly designed for the speed up (100 times more than the speed of Hadoop
MapReduce for data that resides in main memory and around 10 times faster
when data is in disk) in terms of processing [41].
110 S. M. Ganie et al.
Hadoop ecosystem comprises a bunch of tools that are used together by various
organizations to handle big data. So, it provides all the services that are required to
process big data for accomplishing the different tasks. The pictorial representation
of Hadoop ecosystem is shown below in Fig. 6.4 [17].
Vertical scaling or “scale up” is achieved by enhancing the processing and storage
capability of a single server by increasing the number of processors or cores and
enhancing the memory and other required resources. The most popular vertical
scale-up platforms are as follows [26]:
Data Management
Zookeeper
and workflow
Ambari
Oozie
Map
Flume HBase Hive
Reduce
Hadoop or Spark. The Message Passing Interface (MPI) is usually used for com-
munication and sharing information in such systems [48].
• Graphics Processing Unit (GPUs): They are used for accelerating the graphic
operations for pictorial representation on display using frame buffer [26].
General-purpose computing on graphics processing units is the result of advanced
enhancements in hardware and algorithms in GPU-like parallel architecture.
• Field Programmable Gate Arrays (FPGA): Field Programmable Gate Arrays are
specialized and custom-built hardware units that can be optimized for high speed
and throughput for a particular application [49]. The customization and develop-
ment cost is typically very high, as compared to other platforms also its feature
of programming using Hardware Descriptive Language (HDL) increases the
overall development cost [26].
Machine learning techniques are being used broadly for the health data to provide
the diagnosis and prognosis of several diseases [50]. Implementation of sophisti-
cated computer algorithms in the area of healthcare is an old story but there are still
some areas in healthcare where advanced computing, scienti c methods, and per-
fect visualization by using machine learning and deep learning paradigms can make
a difference [51]. ML techniques play a dynamic role in the early prediction of
112 S. M. Ganie et al.
Supervised Learning
Un-Supervised Learning
Semi-Supervised Learning
Active
Learning
Reinforcement
Learning
Machine learning techniques are built on the basic principles and concepts of statis-
tics that are being used to solve data mysteries. Machine learning algorithms are
used successfully for data analytics by different organizations, especially in health-
care [53]. The ML algorithms used for data analysis following are some most prom-
inent ML classi ers that are being widely used in big data healthcare analytics to
provide better solutions to health problems [54]. Machine learning techniques can
be explored for various real-world problems and also these problems can be mod-
eled to produce optimal solution [28].
Y
log ==> Y = C + B1 X 1 + B2 X 2 + B3 X 3 + Bn X n (6.1)
1 − Y
• Naive Bayes: It is a ML-based classi er, mostly used for predictive analytical of
different real-life problems [57]. It is classi cation technique which is based on
Bayesian theorem with an inference among predicate and target variables used
for study. Its working mechanism is simple and is highly used for large datasets.
“Given a hypothesis H and evidence E, Bayes theorem states that the relationship
between the probability of the hypothesis before getting the evidence P(H) and
the probability of the hypothesis after getting the evidence P(H|E) is” [58]:
P E |H .P H
P H |E = (6.2)
P E
In simple English, we write the equation as:
P(H|E) = Posterior likelihood
P(E|H) = Probability
P(H) = Session prior likelihood
P(E) = Forecaster prior likelihood
• Support Vector Machine: It is a ML classi er that is designed by a separative
hyper-plane [59]. Support vector machine algorithm is being used for both clas-
si cation and regression analysis for different real-world problems. The basic
function of this supervised classi er is to segregate or classify the given data
points in the best possible and appropriate way in a multidimensional space. To
work in high and complex dimensions, the SVM classi er uses different versions
of the kernels like linear, polynomial, and radial basis function kernels. The
equations for various kernels are [60]:
Linear Kernel Equation
Input Layer
Hidden Layer 1
Hidden Layer 2
Output Layer
• Arti cial Neural Networks: Arti cial Neural Networks are algorithms inspired
by human nervous system that have self-capability to produce better results on
voluminous data [64]. The structure of the neural network can be divided into
three layers: input (disparate sources), hidden (accountable for internal process-
ing), and output (desired results). ANN is currently being used in healthcare
analytics for better diagnosis and prognosis of various diseases because of the
diversity of data sources in healthcare like EHR, clinical data, omics data, CT
scans, ultrasound images, MRI images, etc. The graphical representation of the
ANN is shown in Fig. 6.8 [65]:
The publicly available PIMA diabetes dataset has been used for the experimental
study. The dataset is sourced from UC Irvine Machine Learning Repository [66]. It
comprised of 768 instances and 9 attributes, 8 predicted variables and 1 target vari-
able in order to classify whether a patient is diabetic or not. The dataset contains the
rich amount of contributing parameters that can be explored to identify the severity
of the diabetic disease. Different ML classi ers have developed realistic health
management system for diabetes. Table 6.2 demonstrates the attribute name,
description, range, and their data type used in the PIMA dataset.
The proposed framework shown in Fig. 6.9 presents the working principle of develop-
ing ML techniques for forecast of various chronic diseases showcasing diabetes. In
this framework, ML models have been developed using the PIMA dataset for forecast
of diabetes. The preprocessing techniques have been used to perform various
116 S. M. Ganie et al.
Hyper-parameter
Tuning
Training
Training Models Evaluation
PIMA diabetes Dataset Building Results
Dataset
Testing
Testing Evaluation
Data Pre-processing Dataset Models Results
Techniques
statistical measurements over the dataset to improve the quality assessment of the
data. Feature engineering has been done to identify the contribution of the parameters
towards the prediction of disease. Data transformation was performed before building
the various ML models. K-fold cross-validation, where K = 10, has been incorporated
to validate the performance evaluation of various statistical measures of ML models.
parameters and based on these parameters predicting the future health conditions of
different chronic diseases are still in the informative stage [14, 68]. Machine learn-
ing libraries help in implementing the whole end-to-end ML process, which includes
building a statistical model that predicts a precise outcome. Different supervised
machine learning algorithms, viz. logistic regression (LR), Naive Bayes (NB), sup-
port vector machine (SVM), decision tree (DT), random forest (RF), and arti cial
neural network (ANN), have been developed by using the PIMA diabetes dataset.
These classi ers were executed using Jupyter and Spyder (Integrated Development
Environment) along with python 3.8 in windows operating system [69].
Exploratory data analysis (EDA) is used to describe the dataset for different tasks
based on inferential statistics that need to be executed before building the ML mod-
els. The report of attributes used in the dataset is depicted in Fig. 6.10. The statisti-
cal measurements of parameters in dataset like count, mean, std., min, max, etc.
have been calculated. The describe() function on pandas DataFrame has been used
to list the statistical measures of each parameter in the dataset.
The data exploratory analysis has been performed to identify the relationship
among the attributes used in the dataset as shown in Fig. 6.11. Different packages
for data visualization like Matplot, Seaborn, etc. have been used to draw the correla-
tion matrix plot between the set of attributes. The association between the reliant on
and autonomous variables lies in between −1 and +1.
The predictive machine learning models, viz. LR, NB, SVM, DT, RF, and ANN,
were employed. The algorithms were developed by splitting the PIMA diabetes
dataset into an 80:20 ratio, i.e., 80% of data has been used for training the classi ers
and 20% of data for testing the developed models. Also, ten-fold cross-validation
has been applied to validate the building models for better prediction of diabetes
Dataset Description
900
800
700
600
500
400
300
200
100
0
Diabetes
Blood Skin
Pregnancies Glucose Insulin BMI Pedigree Age class
Pressure Thinkness
Function
count 768 768 768 768 768 768 768 768 768
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.88416 0.331329 11.760232 0.476951
min 0 0 0 0 0 0 0.078 21 0
25% 1 99 62 0 0 27.3 0.24375 24 0
50% 3 117 72 23 30.5 32 0.3725 29 0
75% 6 140.25 80 32 127.25 36.6 0.62625 41 1
max 17 199 122 99 846 67.1 2.42 81 1
1.0
Pregnancies 1 0.13 1.14 –0.082 –0.073 0.018 –0.034 0.54 0.22
Glucose 0.13 1 0.15 0.056 0.33 1.22 0.14 0.26 0.47 0.8
0.6
SkinThinkness –0.082 0.056 0.21 1 0.44 0.39 0.18 –0.12 0.073
DiabetesPedigreeFunction –0.034 0.14 0.041 0.18 0.19 0.14 1 0.033 0.17 0.2
Class
BloodPressure
Pregnancies
Glucose
Age
SkinThinkness
BMI
Insulin
100.00%
90.00%
83.36% 84.99%
80.51% 79.87% 79.22%
80.00% 76.50% 76.01% 76.36%
74.67% 75.32%
71.94% 70.77%
70.00%
60.00%
50.00%
40.00%
30.00%
20.00%
10.00%
0.00%
Training Accuracy Testing Accuracy
The other statistical measures are shown in Table 6.4. The various measurements
have been used to evaluate the performance evaluation of different classi ers
towards the prediction of diabetic disease. However, in terms of precision, LR
model predicts best with the highest precision rate of 0.88%, whereas in terms of
false positive rate and false negative rate, LR predicts with the lowest rate of 0.224%
and 0.180%, respectively.
This chapter began with a brief outline of big data, its classi cation, and its charac-
teristics. The procedure of BDA in healthcare from data acquisition up to the gen-
eration of required results is presented in this study. Then various platforms for big
data and BDA in healthcare have been discussed, through which we can store, man-
age, analyze, lter, and distribute healthcare data for further analytical procedures
to provide better solutions to real-world healthcare problems. Different challenges
regarding BDA in healthcare have been also discussed. These challenges must be
taken into account while dealing with big data analytics life cycle. In this chapter,
120 S. M. Ganie et al.
various ML classi ers have been discussed that play a signi cant role in healthcare
industry. The study has presented and implemented the framework by using the
PIMA diabetes dataset to forecast whether a patient is diabetic or not. The results
achieved by the framework are accuracy, recall, precision, f1-score, false positive
rate, false negative rate, etc. The classi ers used in this model have been validated
by using k-fold cross-validation technique. To extend the current study, these ML
models shall be used for other related diseases which share communality of data
with diabetes. Also, for ef ciency and reliability of the framework used in this study
ensemble and hybridization ML techniques shall be used for the forecast of various
diseases in early stages to save human lives. Similarly, new mining techniques
should be developed so that sensitive information can be extracted from the huge
volume of health data.
References
1. Wasson M, Buck A, Robe J., Wilson M. Big data architecture style. Azur. Appl. Archit. Guid.
| Microsoft Docs, 2018, pp. 1–7.
2. Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf
Manag. 2015;35(2):137–44. https://doi.org/10.1016/j.ijinfomgt.2014.10.007.
3. Oracle. Oracle: Big Data for the enterprise Oracle White Paper—Big Data for the enterprise,
An Oracle White Pap., no. June, 2013.
4. Dash S, Shakyawar SK, Sharma M, Kaushik S. Big data in healthcare: management, analysis
and future prospects. J Big Data. 2019;6(1) https://doi.org/10.1186/s40537-019-0217-0.
5. How much data do we create every day? The mind-blowing stats everyone should read.
6. 300 Hours of video are uploaded to Youtube every minute..
7. Google Search Statistics—Internet live stats.
8. Infographic: How Big Data will unlock the potential of healthcare.
9. Saracco R. Another shift in content production, 2020. pp. 2019–2020
10. Shafer T. The 42 V’s of Big Data and Data Science, kdnuggets.com Elder Res., pp. 1–3, 2017,
[Online]. Available: https://www.kdnuggets.com/2017/04/42-vs-big-data-data-science.html.
11. Hameed Shnain A, Jasim Hadi H, Hadishaheed S, Haji Ahmad A. Big data and ve V’S
characteristics. Int J Adv Electron Comput Sci. 2015;2:393–2835. Available: https://www.
researchgate.net/publication/332230305
12. Ganie SM, Malik MB. Comparative analysis of various supervised machine learning algo-
rithms for the early prediction of type-II diabetes mellitus. Int J Med Eng Inform. 2021;1(1):1.
https://doi.org/10.1504/ijmei.2021.10036078.
13. Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health infor-
matics. J Big Data. 2014;1(1) https://doi.org/10.1186/2196-1115-1-2.
14. Sahoo PK, Mohapatra SK, Wu SL. Analyzing Healthcare Big data with prediction
for future health condition. IEEE Access. 2016;4:9786–99. https://doi.org/10.1109/
ACCESS.2016.2647619.
15. Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications:
a comprehensive and systematic literature review. J Biomed Inform. 2018;2017(82):47–62.
https://doi.org/10.1016/j.jbi.2018.03.014.
16. Abouelmehdi K, Beni-Hessane A, Khalou H. Big healthcare data: preserving security and
privacy. J. Big Data. 2018;5(1):1–18. https://doi.org/10.1186/s40537-017-0110-7.
17. Bahri S, Zoghlami N, Abed M, Tavares JMRS. BIG DATA for Healthcare: a survey. IEEE
Access. 2019;7:7397–408. https://doi.org/10.1109/ACCESS.2018.2889180.
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 121
18. Chong D, Shi H. Big data analytics: a literature review. J Manag Anal. 2015;2(3):175–201.
https://doi.org/10.1080/23270012.2015.1082449.
19. Tsai CW, Lai CF, Chao HC, Vasilakos AV. Big data analytics: a survey. J Big Data.
2015;2(1):1–32. https://doi.org/10.1186/s40537-015-0030-3.
20. B. T. Erl, P. Buhler, and W. Kha, Big Data adoption on and planning considerations LiveLessons
(Video Training) Big Data analytics lifecycle. This chapter is from the book Business Case
Evaluation This chapter is from the book, 2019, pp. 1–19.
21. Yaqoob I, et al. Big data: From beginning to future. Int J Inf Manag. 2016;36(6):1231–47.
https://doi.org/10.1016/j.ijinfomgt.2016.07.009.
22. Delen D, Ram S. Research challenges and opportunities in business analytics. J Bus Anal.
2018;1(1):2–12. https://doi.org/10.1080/2573234x.2018.1507324.
23. Mazumdar S, Seybold D, Kritikos K, Verginadis Y. A survey on data storage and place-
ment methodologies for cloud-big data ecosystem. J Big Data. 2019;6(1):1–37. Springer
International Publishing
24. Winter G. Machine learning in healthcare. Br J Heal Care Manag. 2019;25(2):100–1. https://
doi.org/10.12968/bjhc.2019.25.2.100.
25. Ganie SM, Malik MB, Arif T. Various platforms and machine learning techniques for Big
Data analytics: a technological survey. Int J Scienti c Res Comput Sci Eng Inform Technol.
2018;3(6):679–87.
26. Singh D, Reddy CK. A survey on platforms for big data analytics. J. Big Data. 2015;2(1):1–20.
https://doi.org/10.1186/s40537-014-0008-6.
27. Irestig M, Hallberg N, Eriksson H, Timpka T. Peer-to-peer computing in health-promoting
voluntary organizations: system design analysis. J Med Syst. 2005;29(5):425–40. https://doi.
org/10.1007/s10916-005-6100-x.
28. Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for
machine learning with big data in the Hadoop ecosystem. J Big Data. 2015;2(1):1–36. https://
doi.org/10.1186/s40537-015-0032-1.
29. Mehta S, Mehta V. Hadoop ecosystem: an introduction. Int J Sci Res. 2016;5(6):557–62.
https://doi.org/10.21275/v5i6.nov164121.
30. Bhagavatula VSN, Raju SS. A survey of hadoop ecosystem as a handler of bigdata, no. August
2016, 2017.
31. Leang B, Ean S, Ryu GA, Yoo KH. Improvement of kafka streaming using partition and multi-
threading in big data environment. Sensors (Switzerland). 2019;19(1) https://doi.org/10.3390/
s19010134.
32. Dean J, Ghemawat S. MapReduce: simpli ed data processing on large clusters. In: OSDI
2004—6th Symp. Oper. Syst. Des. Implement.; 2004. p. 137–49. https://doi.org/10.21276/
ijre.2018.5.5.4.
33. Sun P, Wen Y. Scalable architectures for Big Data analysis. Encycl Big Data Technol.
2019:1446–54. https://doi.org/10.1007/978-3-319-77525-8_281.
34. Kaur I, Kaur N, Ummat A, Kaur J, Kaur N. Research paper on big data and Hadoop. Int J
Comput Sci Technol. 2016;8491(1):50–3.
35. Mathiya BJ, Desai VL. Apache Hadoop Yarn Parameter con guration challenges and optimi-
zation. In: Proceedigs of the IEEE International Conference on Soft-Computing and Networks
Security (ICSNS). IEEE; 2015. https://doi.org/10.1109/ICSNS.2015.7292373.
36. Perwej Y, Kerim B, Adrees MS, Sheta OE. An empirical exploration of the Yarn in Big Data.
Int J Appl Inf Syst. 2017;12(9):19–29. https://doi.org/10.5120/ijais2017451730.
37. Alkatheri S, Abbas SA, Siddiqui MA. Big Data frameworks: a comparative study. Int J Comput
Sci Inf Secur. 2019;17(1)
38. Perwej DY, Omer M, Kerim B. A comprehend the Apache Flink in big data environments. IOSR
J Comput Eng (IOSR-JCE). 2018;20(1):48–58. https://doi.org/10.9790/0661-2001044858.
39. Rabl T, Traub J, Katsifodimos A, Markl V. Apache Flink in current research. IT Inf Technol.
2016;58(4):2–9. https://doi.org/10.1515/itit-2016-0005.
122 S. M. Ganie et al.
40. Benbrahim H, Hachimi H, Amine A. Comparison between Hadoop and Spark. In: Proceedings
of the International Conference on Industrial Engineering and Operations Management, vol.
2019; 2019. p. 690–701.
41. Qureshi NM, et al. Dynamic container-based resource management framework of spark eco-
system. In: 2019 21st International Conference on Advanced Communication Technology
(ICACT). IEEE; 2019. p. 522–6. https://doi.org/10.23919/ICACT.2019.8701970.
42. Basu P. HDFS for big data. J Chem Inf Model. 2013;53(9):1689–99. https://doi.org/10.1017/
CBO9781107415324.004.
43. Jin C, Ran S. The research for storage scheme based on Hadoop. In: Proceedings of the 2015
IEEE International Conference Computer and Communications (ICCC) 2015. IEEE; 2015.
p. 62–6. https://doi.org/10.1109/CompComm.2015.7387541.
44. Swarna C, Ansari Z. Apache Pig—a data ow framework based on Hadoop map reduce. Int
J Eng Trends Technol. 2017;50(5):271–5. https://doi.org/10.14445/22315381/ijett-v50p244.
45. Fuad A, Erwin A, Ipung HP. Processing performance on Apache Pig, Apache Hive and MySQL
cluster. In: Proceedings of the 2014 International Conference on Information, Communication
Technology and System (ICTS), 2014. IEEE; 2014. p. 297–301. https://doi.org/10.1109/
ICTS.2014.7010600.
46. Eluri VR, Ramesh M, Al-Jabri ASM, Jane M. A comparative study of various clustering tech-
niques on big data sets using Apache Mahout. In: 2016 3rd MEC Int. Conf. Big Data Smart
City, ICBDSC 2016. IEEE; 2016. p. 374–7. https://doi.org/10.1109/ICBDSC.2016.7460397.
47. Kumar D, Ali L, Memon S. Design and implementation of high performance computing
(HPC) cluster design and implementation of high performance computing (HPC) Cluster, no.
January, 2018.
48. Yeo CS, Buyya R, Eskicioglu R, Graham P. Handbook of nature-inspired and innovative com-
puting. In: Handbook nature inspired innovative computing, June 2014; 2006. p. 0–24. https://
doi.org/10.1007/0-387-27705-6.
49. Ruiz-Rosero J, Ramirez-Gonzalez G, Khanna R. Field programmable gate array appli-
cations—a scientometric review. Computation. 2019;7(4):63. https://doi.org/10.3390/
computation7040063.
50. Lai H, Huang H, Keshavjee K, Guergachi A, Gao X. Predictive models for diabetes mel-
litus using machine learning techniques. BMC Endocr Disord. 2019;19(1):1–9. https://doi.
org/10.1186/s12902-019-0436-6.
51. Guleria P, Sood M. Intelligent Learning analytics in healthcare sector using machine learn-
ing. 2020.
52. Sarwar MA, Kamal N, Hamid W, Shah MA. Prediction of diabetes using machine learn-
ing algorithms in healthcare. In: ICAC 2018–2018 24th IEEE Int. Conf. Autom. Comput.
Improv. Product. through Autom. Comput., September; 2018. p. 1–6. https://doi.org/10.23919/
IConAC.2018.8748992.
53. Doupe P, Faghmous J, Basu S. Machine learning for health services researchers. Value Heal.
2019;22(7):808–15. https://doi.org/10.1016/j.jval.2019.02.012.
54. Ferdous M, Debnath J, Chakraborty NR. Machine learning algorithms in healthcare: a literature
survey. In: 2020 11th International Conference on Computing, Communication and Networking
Technologies (ICCCNT). IEEE; 2020. https://doi.org/10.1109/ICCCNT49239.2020.9225642.
55. Patil R, Tamane S. A comparative analysis on the evaluation of classi cation algorithms in the
prediction of diabetes. Int J Electr Comput Eng. 2018;8(5):3966–75. https://doi.org/10.11591/
ijece.v8i5.pp3966-3975.
56. Celine S, Dominic MM, Devi MS. Logistic regression for employability prediction. Int J Innov
Technol Explor Eng. 2020;9(3):2471–8. https://doi.org/10.35940/ijitee.c8170.019320.
57. Kaviani P, Dhotre S. International journal of advance engineering and research short survey on
Naive Bayes algorithm. Int J Adv Eng Res Dev. 2017;4(11):607–11.
58. Elkan C. Naive Bayesian learning. 2007, pp. 1–4.
59. Jegan C, Kumari VA, Chitra R. Classi cation of diabetes disease using support vector
machine. Int J Eng Res Appl. 2018;3(2):1797–801. Available: https://www.researchgate.net/
publication/320395340
View publication stats
6 Machine Learning Techniques for Big Data Analytics in Healthcare: Current… 123
60. Abdillah AA, Suwarno S. Diagnosis of diabetes using support vector machines with radial
basis function kernels. Int J Technol. 2016;7(5):849–58. https://doi.org/10.14716/ijtech.
v7i5.1370.
61. Tree D. Decision trees tutorial (https://opendatascience.com/decision-trees-tutorial/), 2020,
pp. 1–11.
62. Chari KK, Chinna Babu M, Kodati S. Classi cation of diabetes using random forest with
feature selection algorithm. Int J Innov Technol Explor Eng. 2019;9(1):1295–300. https://doi.
org/10.35940/ijitee.L3595.119119.
63. Lateef Z. A comprehensive guide to Random Forest in R, pp. 1–14, 2019 [Online]. Available:
https://www.edureka.co/blog/naive-bayes-in-r/.
64. Santhosh KV, Nayak S. Engineering vibration communication and information processing,
vol. 478. Springer; 2019. p. 523–35. https://doi.org/10.1007/978-981-13-1642-5.
65. Is W, Learning D. what is a neural network? Introduction to arti cial neural networks. 2020,
pp. 1–7.
66. View ALL Data Sets Citation Policy. 2021, p. 2021.
67. Malik MM, Abdallah S, Ala’raj M. Data mining and predictive analytics applications
for the delivery of healthcare services: a systematic literature review. Ann Oper Res.
2018;270(1–2):287–312. https://doi.org/10.1007/s10479-016-2393-z.
68. Nissa N, Jamwal S, Mohammad S. Early detection of cardiovascular disease using machine
learning techniques an experimental study. Int J Recent Technol Eng. 2020;9(3):635–41.
https://doi.org/10.35940/ijrte.c46570.99320.
69. Anaconda Inc., Anaconda Distribution, Anaconda, 2019, [Online]. Available: https://www.
anaconda.com/distribution/.