
Admas University

Postgraduate office

MSc Program

Big Data Analytics for Predict most Frequent Disease in Addis Ababa

A Thesis Submitted to the Department of Computer Science of Admas University


in Partial Fulfillment of the Requirements for the Degree of MSc in Computer
Science

by Tigistu Bekele

Advisor: Mulugeta Adibaru (Assistant Professor)

Date: July 2020


Declaration
I, Tigistu Bekele, the undersigned, declare that this thesis entitled “Big Data Analytics for
Predict most Frequent Disease in Addis Ababa” is my original work. I have undertaken the
research work independently with the guidance and support of the research advisor. This study
has not been submitted for any degree or diploma program in this or any other institution, and all
sources of materials used for the thesis have been duly acknowledged.

Declared by

Name ________________________

Signature ______________________

Department ___________________

Date__________________________

Signature______________________

Certification of Approval of Thesis

School of Postgraduate Studies

Admas University

This is to certify that the thesis prepared by Tigistu Bekele, entitled “Big Data Analytics for
Predict most Frequent Disease in Addis Ababa” and submitted in partial fulfillment of the
requirements for the Degree of Master of Science in Computer Science (MSc), complies with the
regulations of the University and meets the accepted standards with respect to originality and
quality.

Name of Candidate:_____________________: Signature:_______________Date:____________

Name of Advisor: ______________________: Signature:______________Date:____________

Signatures of the Board of Examiners:

External examiner:_____________________: Signature:_______________Date:____________

Internal examiner:_____________________: Signature:________________Date:____________

Dean, SGS:__________________________: Signature:_______________Date:____________

Table of Contents

Declaration.......................................................................................................................................I

Table of Contents........................................................................................................................III

List of Figures..............................................................................................................................VII

List of Tables...............................................................................................................................IX

List of Abbreviations.....................................................................................................................X

Acknowledgment...........................................................................................................................XI

ABSTRACT.................................................................................................................................XII

CHAPTER ONE..............................................................................................................................1

INTRODUCTION...........................................................................................................................1

1.1. Background of the study........................................................................................................1

1.2. Statement of the problem.......................................................................................................2

1.3. Research Question.................................................................................................................2

1.4. Objectives..............................................................................................................................3

1.4.1. General objectives..........................................................................................................3

1.4.2. Specific objectives..........................................................................................................3

1.5. Significance of the Research.................................................................................................3

1.6. Scope and limitation..............................................................................................................4

1.6.1. Scope of the research......................................................................................................4

1.6.2. Limitations of the research.............................................................................................4

1.7. Conceptual Framework..........................................................................................................5

1.8. Definitions of terms...............................................................................................................5

1.9. Contribution...........................................................................................................................6

1.10. Organization of the Paper....................................................................................................6

CHAPTER TWO.............................................................................................................................7

LITERATURE REVIEW................................................................................................................7

2.1. Introduction............................................................................................................................7

2.2. Big Data (BD)........................................................................................................................7

2.2.1. Data Evolution................................................................................................................8

2.3. Big Data Analysis..................................................................................................................9

2.3.1. Tasks in Big Data Analysis.............................................................................................9

2.3.1.1. Integration of various data sources........................................................................10

2.3.1.2. Integration with the simulation process.................................................................10

2.3.1.3. High-level user interaction....................................................................................10

2.3.1.4. Complex visualization...........................................................................................10

2.4. Big Data Application...........................................................................................................11

2.5. Big Data Application in Health Sector................................................................................12

2.5.1. Sources of Data in Health Sector..................................................................................14

2.6. The algorithm in Data Analytics..........................................................................................14

2.6.1. Mining algorithms for the specific problem.................................................................14

2.6.2. Linear Regression.........................................................................................................16

I.  Logistic Regression.....................................................................................................17

II. Classification and Regression Trees........................................................................17

2.6.3. K-Nearest Neighbors...............................................................................................18

2.6.4. K-Means Clustering.................................................................................................19

2.7. Tools used in big data analysis.......................................................................................19

2.7.1. Hadoop for big data applications.............................................................................21

2.7.2. MapReduce..............................................................................................................22

2.7.3. Distributed File System – HDFS.............................................................................22

2.8. Related Work..................................................................................................................23

CHAPTER THREE.......................................................................................................................25

RESEARCH METHODOLOGY..................................................................................................25

3.1. Overview.........................................................................................................................25

3.2. Research Design..............................................................................................................25

3.3. Population.......................................................................................................................26

3.4. Sample Size and Sampling Techniques..........................................................................26

3.5. Data Source and Collection Methods..............................................................................27

3.6. Data Analysis Tool..........................................................................................................27

3.7. Model..............................................................................................................................27

3.8. Methods Used for Data pre-processing...........................................................................28

3.8.1. Noise identification..................................................................................................28

3.8.2. Data Cleaning..........................................................................................................29

3.8.3. Data normalization...................................................................................................31

3.8.4. Dimension Reduction (normalization)....................................................................31

3.8.5. Data Transformation................................................................................................32

3.8.6. Data Integration.......................................................................................................33

3.9. Method Used to Create a Model.....................................................................................35

3.9.1. Incorporating Big Data for Real-World Solution....................................................36

3.9.2. How to Predict With Regression Models................................................................37

CHAPTER FOUR.........................................................................................................................39

FINDINGS AND DISCUSSION..................................................................................................39

4.1. Data Preprocessing and preparation................................................................................39

4.1.1. Data source..............................................................................................................39

4.1.2. Data Cleaning..........................................................................................................39

4.1.3. Data Normalization..................................................................................................41

4.1.4. Dimension Reduction..............................................................................................42

4.1.5. Data Transformation and Integration.......................................................................42

4.2. Appropriate algorithm for identifying frequently occurred diseases..............................44

4.3. Creating a Model.............................................................................................................44

4.3.1. Machine learning and regression model using Python............................................44

4.3.1.1. Step 1: set up the environment.........................................................................45

4.3.1.2. Step 2: Importing libraries and modules..........................................................45

4.3.1.3. Step 3: Load Health data (our dataset).............................................................47

4.3.1.4. Step 4: Split data into training and test sets......................................................49

4.3.1.5. Step 5: Declare data preprocessing steps.........................................................50

4.3.1.6. Step 6: Declare hyperparameters to tune..........................................................52

4.3.1.7. Step 7: Tune model using a cross-validation pipeline......................................53

4.3.1.8. Step 8: Refit on the entire training set..............................................................55

4.3.1.9. Step 9: Evaluate model pipeline on test data....................................................55

4.3.1.10. Step 10: Save model for future use..................................................................57

4.3.2. Installation...............................................................................................................58

4.3.3. Configuration...........................................................................................................60

4.3.3.1. Managing Hive.................................................................................................60

CHAPTER FIVE...........................................................................................................................66

CONCLUSION AND RECOMMENDATION............................................................................66

4.4. Summary.........................................................................................................................66

4.5. Conclusion......................................................................................................................66

4.6. Recommendation............................................................................................................66

References......................................................................................................................................68

Appendices....................................................................................................................................72

List of Figures
Figure 1. Conceptual Model...........................................................................................................5
Figure 2. Schematic diagram of research design..........................................................................26
Figure 3. Raw data collected from Addis Ababa Health Bureau.................................................29
Figure 4. Data cleaning using python...........................................................................................40
Figure 5. Cleaned data set.............................................................................................................41
Figure 6. Data normalization using python..................................................................................41
Figure 7. Normalized data set.......................................................................................................42
Figure 8. Data integration and transformation using python........................................................43
Figure 9. Transformed and integrated data set.............................................................................44
Figure 10. Confirmation of the proper installation of Scikit-Learn.............................................45
Figure 11. Importing NumPy........................................................................................................45
Figure 12. Importing Pandas.........................................................................................................45
Figure 13. Importing sampling helper..........................................................................................46
Figure 14. Importing the entire preprocessing module.................................................................46
Figure 15. Importing the random forest family............................................................................46
Figure 16. Importing cross-validation pipeline.............................................................................47
Figure 17. Importing evaluation metrics.......................................................................................47
Figure 18. Importing Joblib..........................................................................................................47
Figure 19. Separate target from training features.........................................................................49
Figure 20. Split data into train and test sets..................................................................................49
Figure 21. Fitting the Transformer API........................................................................................50
Figure 22. Confirming scaled data set by using unit variance......................................................51
Figure 23. Applying the transformer to the data set.....................................................................51
Figure 24. Pipeline with preprocessing and model.......................................................................52
Figure 25. Tunable hyperparameters............................................................................................52
Figure 26. Declaration of hyperparameters to tune......................................................................53
Figure 27. Sklearn cross validation with pipeline.........................................................................54
Figure 28. Best set of parameters found using Cross-validation..................................................55
Figure 29. Confirming functionality.............................................................................................55

Figure 30. Predict a new set of data..............................................................................................56
Figure 31. Evaluating model performance...................................................................................56
Figure 32. Saving the model for further use.................................................................................57
Figure 33. Loading the model for further use...............................................................................57
Figure 34. All the code used to create the model..........................................................................58
Figure 35. Cloudera Manager login as admin...............................................................................59
Figure 36. Hive display on the Cloudera Manager.......................................................................59
Figure 37. Create a table using Hive over health_db....................................................................60
Figure 38. Number of patient categorized by age.........................................................................61
Figure 39. Gender with Age.........................................................................................................61
Figure 40. Top 15 diseases by Hive SQL.....................................................................................62
Figure 41. Code used to generate top disease affect the community based on age class.............62
Figure 42. Top hundred diseases from 2010-2012 based on the dataset......................................63
Figure 43. Top 30 disease affect the community from 2010-2012...............................................63
Figure 44. Ten most abundant diseases from 2010-2012 based on the dataset............................64
Figure 45. Diseases affecting the community based on age class from 2010-2012.....................65

List of Tables
Table 1: Definition of terms............................................................................................................5
Table 2. Summary statistics..........................................................................................................48

List of Abbreviations
AI Artificial Intelligence
AIDS Acquired Immunodeficiency Syndrome
B2B Business-to-Business
BD Big Data
CoS Classify or Send for Classification
DV Data Virtualization
DW Data Warehouses
EHRs Electronic Health Records
ETL Extract, Transform, and Load
GPS Global Positioning System
GPU Graphics Processing Unit
GUI Graphical User Interface
HDFS Hadoop Distributed File System
HIMSS Healthcare Information and Management Systems Society
HIV Human Immunodeficiency Virus
IoT Internet of Things
IT Information Technology
MCAR Missing Completely at Random
MSF Multiple Species Flocking
ML Machine learning
NLP Natural Language Processing
RDBMSes Relational Database Management Systems
WF Workflow
WHO World Health Organization
XML Extensible Markup Language

Acknowledgment
First of all, I gratefully express my deepest thanks to the almighty God for his guidance, help,
support, and because He let me see new days of success in my life. Glory to God.

I am heartily thankful to my advisor, Mulugeta Adibaru (Assistant Professor), for his
encouragement, guidance, constructive comments, support, and help that enabled me to develop
an understanding of the subject. I would like to thank the Admas University School of Computer
Science for the overall facilitation of the research from the beginning until the end.

I also thank my classmates Netsanet and Edean for sharing ideas in the course of writing this
thesis and for their comments on my work.

Last, but not least, I would like to express my heartfelt gratitude to all my family members, and
especially to my lovely friend, Mahelet Tsegaye, who supported me in every respect during my
study.

ABSTRACT
Machine learning on administrative data provides a powerful new tool for population health and
clinical hypothesis generation for risk factor discovery, enabling population-level risk
assessment that may help guide interventions to the most at-risk population. With a population of
more than 104 million people, Ethiopians suffer from different health care problems. The limited
number of health institutions, inefficient distribution of medical supplies, and the disparity
between rural and urban areas, due to severe underfunding of the health sector make access to
health-care services to the community very difficult in Ethiopia. The main objective of the study
was to design a big data analytics model that predicts the occurrence of frequent diseases to take
measures for prevention in Addis Ababa. The study was conducted based on the data collected
from 2010 to 2012 by Addis Ababa Health Bureau. Data pre-processing such as Noise
Identification, Data Cleaning, Data Normalization, Dimension Reduction, Data Transformation,
and Data Integration were done using Python. Further data modeling was done using Hadoop.

The results extracted from the data set were the top ten diseases that affected the community, as
well as the top three diseases that affected the community within each age and gender group. The
top three diseases that affect the community without any gender or age variation are Tonsillitis
(Acute pharyngitis, unspecified), Open wound of shoulder and upper arm, and Typhoid Fever.
Therefore, based on this study and the experience of other countries, it is suggested to integrate
data management, use, and analysis into the health care system.

Keywords: Big data, healthcare, Frequent disease, Hadoop, Big data analysis, MapReduce

CHAPTER ONE

INTRODUCTION

1.1. Background of the study

Health care has become an important issue in developed and middle-income countries (Hong et
al., 2018). The public health problems of underdeveloped countries are characterized by
communicable diseases such as HIV (Human Immunodeficiency Virus), malaria, tuberculosis,
etc.

Ethiopia, one of the fastest-growing countries in Africa, has more than 104 million people (the
second most populous country in the region). The main health concerns in Ethiopia include maternal
mortality, malaria, tuberculosis, and HIV/AIDS compounded by acute malnutrition and lack of
access to clean water and sanitation. The limited number of health institutions, inefficient
distribution of medical supplies, and the disparity between rural and urban areas, due to severe
underfunding of the health sector, make access to health-care services very difficult (WHO,
2020).

The name big data does not only describe the size of the data but also implies the data processing
ability, technology, and approaches used for handling the data. On the other hand, big data is a
generic term used to describe structured and unstructured data set that are inadequate to collect,
process, analyze, and store using traditional software, algorithm, and data repositories due to
their extremely large and complex characteristics. Big data analytics has wide application in areas
such as combating crime, business execution, finance, Global Positioning System (GPS),
commerce, travel, urban informatics, meteorology, genomics, complex physics simulations,
biology, environmental research, and health care (Hong et al., 2018).

Big data in the healthcare sector is essential for collecting, analyzing, and leveraging consumer,
patient, physical, and clinical data that is too vast or complex to be understood by traditional
means of data processing. The rise of healthcare big data comes in response to the digitization of
healthcare information and the rise of value-based care, which has encouraged the industry to use
data analytics to make strategic business decisions (Kaur et al., 2018). Given the challenges of
healthcare data – such as volume, velocity, variety, and veracity – health systems need to adopt
technology capable of collecting, storing, and analyzing this information to produce actionable
insights.

1.2. Statement of the problem

The major challenge in big data analytics is to locate the required information from tables and
extract the information contained in the database. The challenge for Big Data analytics is to deal
with this heterogeneous data to generate insights for improved health-care outcomes (Belay,
2019). The most challenging aspect of applying data analytics in the health sector is the fact that
data in health care are disorganized and distributed, since they come from various sources and
have different structures and forms. This kind of data is commonly described as big data in the
data science research community, and Big Data analytics is the treatment to deal with this kind
of data and generate insights for improved health-care outcomes. Governance challenges, such as
the lack of data protocols, the lack of data standards, and data privacy issues, add to this (Kaur et
al., 2018). Thus, the big data approach is used to store the health informatics data that are used
during disease diagnosis and treatment. Big Data in healthcare enables providers to deliver more
accurate and personalized care. By having a detailed picture of patients, it is easier to predict the
response to a specific treatment.

In Ethiopia, there are many problems in the health sector. Since the country is a low-income
country, the problems include infrastructure, human resources, equipment, etc. Even though a
large amount of data, such as physicians’ prescriptions and written notes, laboratory results,
medical imaging, pharmacy records, and patient records, is collected by the different bodies in
the sector, this data is not turned into information that answers the questions raised by the sector.
These questions include which diseases will affect the community in the coming 5 to 10 years,
what kind of measures should be taken to protect society from those diseases, and which diseases
affect the community based on gender and age group. Although Addis Ababa has a large
population and many data sets have been collected, it is not easy to answer the above-mentioned
questions, just as in other cities of the country.

Thus, in this research work, our approach is to study the existing data and to develop a model
that helps to predict the occurrence of frequent diseases based on gender and age group.

1.3. Research Question

The research questions addressed by this study include:

 What kind of method can be used to integrate information from heterogeneous sources into a consistent form?
 Which diseases occur frequently?
 Which disease types affect the community based on age?
 Which disease types affect the community based on gender?
 What kind of model is suitable to determine the occurrence of frequent diseases?
 What are the top ten diseases that affected the community?

1.4. Objectives

1.4.1. General objectives

The main objective of the study was to design a big data analytics model that predicts the
occurrence of frequent diseases, which can be used for future prevention.

1.4.2. Specific objectives

This study has the following specific objectives:

 To integrate information from heterogeneous sources into a consistent form.


 To evaluate which disease affects the community frequently.
 To evaluate which disease affects the community based on age and gender.
 To extract the top ten diseases that affect and will affect the community frequently.
 To develop a model suitable to determine the occurrence of frequent diseases.

1.5. Significance of the Research

By digitizing, combining, and effectively using big data, healthcare organizations ranging from
single-physician offices and multi-provider groups to large hospital networks and accountable
care organizations stand to realize significant benefits. Potential benefits include detecting the
most frequently occurring diseases at earlier stages, when they can be treated more easily and
effectively; managing specific individual and population health; and detecting health care fraud
more quickly and efficiently. Numerous questions can be addressed with big data analytics
(Raghupathi & Raghupathi, 2014).

New analytical frameworks and methods are required to analyze these data in a clinical setting.
These methods address some concerns, opportunities, and challenges such as features from
unstructured data, registration, and segmentation to deliver better recommendations at the
clinical level.

Data scientists are capable of analyzing such complex data for the healthcare industry, helping
various sub-fields perform with better efficiency. Therefore, this study also positively impacts
the health care system, since it predicts the frequent diseases, which helps in taking appropriate
measures to prevent them. It also gives baseline information for researchers and data scientists in
the country who are interested in the health care system. Health care authorities such as the
Ministry of Health and other NGOs can also use the information generated by the model to
prepare the system to protect the community from disease. Besides, the study helps to improve
patient safety, reduce medical errors, improve wellness and prevention, manage diseases better,
optimize supply chains, improve risk management, and achieve better regulatory compliance.

1.6. Scope and limitation

1.6.1. Scope of the research

Analyzing such large and complex data sets requires strong quantitative and analytic skills. The
scope of this research is specifically to conduct big data analysis on the data collected from the
Addis Ababa Health Bureau in order to create a model for predicting frequent diseases, the
diseases that affect the community based on age and gender, and the top ten diseases that have
affected and will affect the community in Addis Ababa. The research only covers the data
collected from Addis Ababa city.

1.6.2. Limitations of the research

‘Big data’ is massive amounts of information that can work wonders. It has become a topic of
special interest for the past two decades because of the great potential that is hidden in it. Various
public and private sector industries generate, store, and analyze big data to improve the services
they provide. In the healthcare industry, various sources for big data include hospital records,
medical records of patients, and results of medical examinations, and devices that are a part of
the internet of things. 

However, when we come to our country, it is difficult to get data in an organized format. The
limitation of this study is that it does not analyze security, cost, the health care management
system, human resource management, geography, or cost reduction, and it does not process
image data.

1.7. Conceptual Framework

A conceptual framework is an analytical tool with several variations and contexts. It can be
applied in different categories of work where an overall picture is needed. It is used to make
conceptual distinctions and organize ideas. Strong conceptual frameworks capture something
real and do this in a way that is easy to remember and apply (Raghupathi & Raghupathi, 2014).

Figure 1. Conceptual Model (Raghupathi & Raghupathi, 2014).

1.8. Definitions of terms

Table 1: Definition of terms.


Hadoop        Framework for distributed storage and processing of huge data sets
MapReduce     A programming model for large-scale data processing
HIVE          Used for performing queries
YARN          Used for scheduling jobs
DBMS          Database Management System
HDFS          Hadoop Distributed File System
JSP           Java Server Pages

1.9. Contribution

This study contributes not only to the health care sector but also to data science studies in the
country. It especially contributes to Big Data Analysis, which is relatively new to the country.

1.10. Organization of the Paper

The research is organized into five chapters, and each chapter deals with different concepts.
Chapter one discusses in detail the background of the study, the statement of the problem, the
research questions, the objectives, the scope and limitation, the significance of the study, the
conceptual framework, the definition of terms, and the contribution of the research. Chapter two
reviews in detail the literature on big data, big data analytics, big data applications in health care,
and related work on big data. Chapter three deals with the overall methodology used, from data
preprocessing to model creation. Chapter four presents the results and discussion generated
using the model created in chapter three. Finally, chapter five presents the summary of the
results, the conclusion derived from the results, and the recommendations given based on the
results.

CHAPTER TWO

LITERATURE REVIEW

2.1. Introduction

The rapidly expanding field of big data analytics has started to play a pivotal role in the
evolution of healthcare practices and research. It has provided tools to accumulate, manage,
analyze, and assimilate large volumes of disparate, structured, and unstructured data produced by
current healthcare systems (Kulkarni et al., 2020).

Diverse healthcare data sources include clinical text, biomedical images, EHRs, genomic data,
biomedical signals, sensing data, and social media (Ta et al., 2016). The analysis of genomic
data lets people have a much broader
understanding of the relationships among different genetic markers, mutations, and disease
conditions. Furthermore, transforming genetic discoveries into personalized medicine practice is
a task with many unresolved challenges. Clinical text mining transforms data from clinical notes
that are organized in an unstructured format to useful information. Information retrieval and
natural language processing (NLP) are methods that extract useful information from large
volumes of clinical text. Social network analysis helps discover knowledge and new patterns
which can be leveraged to model and predict global health trends (e.g., outbreaks of infectious
epidemics) based on various social media resources including a collection of social media
resources such as weblogs, Twitter, Facebook, social networking sites, search engines, etc. (Ta
et al., 2016).

2.2. Big Data (BD)

The concept of “big data” is not new; however, the way it is defined is constantly changing. The
term "Big Data" refers to the evolution and use of technologies that provide the right user at the
right time with the right information from a mass of data that has been growing exponentially for
a long time in our society. The challenge is not only to deal with rapidly increasing volumes of
data but also the difficulty of managing increasingly heterogeneous formats as well as
increasingly complex and interconnected data. Being a complex polymorphic object, its
definition varies according to the communities that are interested in it as a user or provider of
services. Invented by the giants of the web, Big Data presents itself as a solution designed to
provide everyone with real-time access to giant databases. Big Data is a very difficult concept to
define precisely since the very notion of big in terms of volume of data varies from one area to
another. It is not defined by a set of technologies, on the contrary, it defines a category of
techniques and technologies. This is an emerging field, and as we seek to learn how to
implement this new paradigm and harness the value, the definition is changing (Riahi and Riahi,
2018).

Big data was defined for the first time in 2001, when Doug Laney defined the 3Vs model, i.e.,
Volume, Variety, and Velocity. Even though the 3Vs model was not originally used to define big
data, Gartner and many other organizations, like IBM and Microsoft, still use the “3Vs” model to
define big data (Hemlata and Preeti, 2016).

Volume: Big data is any set of data that is so large that the organization that owns it faces
challenges related to storing or processing it. In reality, trends like e-commerce, mobility, social
media, and the Internet of Things (IoT) are generating so much information, that nearly every
organization probably meets this criterion.

Velocity: If your organization is generating new data at a rapid pace and needs to respond in
real-time, you have the velocity associated with big data. Most organizations that are involved in
e-commerce, social media or IoT satisfy this criterion for big data.

Variety: If your data resides in many different formats, it has the variety associated with big data.
For example, big data stores typically include email messages, word processing documents,
images, video, and presentations, as well as data that resides in structured relational database
management systems (RDBMSes) (Harvey, 2017).

2.2.1. Data Evolution

To better understand what Big Data is and where it comes from, it is crucial to first understand
some history of data storage, repositories, and tools to manage them. There has been a huge
increase in data volume during the last three decades (Belay, 2019).

In the decade of the 1990s, the data volume was measured in terabytes. Relational databases and
data warehouses representing structured data in rows and columns were the typical technologies
used to store and manage enterprise information. In the subsequent decade, data management
started dealing with different kinds of data sources driven by productivity and publishing tools,
such as content-managed repositories and network-attached storage systems. Consequently, the
data volume started being measured in petabytes (Belay, 2019).

2.3. Big Data Analysis

Big Data generally refers to data that exceeds the typical storage, processing, and computing
capacity of conventional databases and data analysis techniques. As a resource, Big Data
requires tools and methods that can be applied to analyze and extract patterns from large-scale
data (Najafabadi et al., 2015). The analysis of structured data evolves due to the variety and
velocity of the data manipulated. Therefore, it is no longer enough to analyze data and produce
reports; the wide variety of data means that the systems in place must be capable of assisting in
the analysis of data. The analysis consists of automatically determining, within a variety of
rapidly changing data, the correlations between the data to help in the exploitation of it. Big Data
Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover
different patterns and other useful information. Big data analytics is a set of technologies and
techniques that require new forms of integration to disclose large hidden values from large
datasets that are different from the usual ones, more complex, and of an enormous scale. It
mainly focuses on solving new problems, or old problems, in better and more effective ways
(Riahi and Riahi, 2018).

Big Data Analysis mainly involves analytical methods for big data, the system architecture of big
data, and big data mining and software for analysis. Data investigation is the most important step
in big data, for exploring meaningful values and giving suggestions and decisions. Possible values
can be explored by data analysis. However, the analysis of data is a wide area, which is dynamic
and very complex (Hemlata and Preeti, 2016).

2.3.1. Tasks in Big Data Analysis

In the Big Data era, everybody wants to concentrate on extracting key values and information
from the huge dataset to achieve the objectives of their organization (Hemlata and Preeti, 2016).
2.3.1.1. Integration of various data sources.

Today there is a great diversity of data sources containing datasets that differ in structure,
format, origin (forecasts, estimations, measurements, etc.), access protocol, and veracity (the last
one is often included in the definition of Big Data). All these data should be accessed according
to their nature and semantic meaning within e-Science solutions. At the same time, the specifics
of formatting, access, and structural decomposition should be hidden, as well as the technological
aspects of distributed data processing (Chen et al., 2014).

2.3.1.2. Integration with the simulation process

The data analytics tasks should be integrated with simulation tasks in two ways. First, they can
be considered as part of composite scientific applications used for simulation. One way to
perform this is to extend the workflow (WF) structure with specific nodes calling data analytics
subroutines. Second, the task may require local simulation tasks to be solved during the data
analysis (e.g., for classification of the data or to estimate additional characteristics). As an
additional complication, the task can require local calls of software packages to perform some
complex data processing (e.g., forecasting simulation) (Belay, 2019).

2.3.1.3. High-level user interaction.

To support the user during task definition, the developed technology should use domain-specific
semantics to describe the high-level task. These semantics can be used to build expressive
languages with textual or graphical notation. Such languages allow building composition
interfaces (more powerful with graphical notation) as well as parameter definition interfaces or
interfaces for result representation (which can be automatically designed using a domain-specific
description) (Belay, 2019).

2.3.1.4. Complex visualization.

Large data visualization should support interactive exploration of data arrays with cognitive
support and appropriate spatiotemporal scene rendering. Moreover, the visualization should be
tightly interconnected with simulation and data analysis tasks. To support this kind of data
visualization in an automatic way, the semantic description related to the data and interconnected
processes should be used during the building of the visual scene. To support automatic task
processing and data analytics integration, formalized domain-specific knowledge can be used
(Belay, 2019).

2.4. Big Data Application

Since 2010, big data has been and remains in the spotlight. Big data has been increasingly used in
different industrial, social, and professional sectors (Boinepelli, 2015). Some of the major
categories and applications of big data are public-sector tax reduction, social security, energy
exploration, environmental protection, power investigation, and public safety.

In the healthcare industry, big data helps the sector through cost reduction in medical treatments,
prediction of diseases, elimination of the risk factors associated with diseases, improved
preventive care, and analysis of drug efficiency.

In education and learning, big data applications help the system by providing students’ preferred
learning modes, tracking students’ performance, providing guidance, giving real-time feedback
and updates, improving learning materials, cross-checking assignments, and enabling digital
student assessment.

In the insurance industry, big data applications help by predicting customer behavior, evaluating
the risk of insuring, monitoring real-time claims, retaining customers, managing premiums for
policies, and managing fraudulent claims.

In the transportation sector, traffic control, route planning, intelligent transport systems,
congestion management, revenue management in the private sector, technological enhancement,
and forecasting of routes to reduce the cost of petroleum can be achieved by applying big data.

Big data also has wide application in industry and natural resources, by integrating geospatial,
temporal, graphical, and text data and by analyzing the consumption of utilities. Big data
applications also help the banking sector through prognostic analytics, analysis of customers’
shopping patterns, analysis of competitors’ CRM tactics, and alteration of customer statistics
(Kaur et al., 2018).

2.5. Big Data Application in Health Sector

The primary objective of big data analytics in healthcare is to support healthcare industries and
to enable more knowledgeable decisions about patients’ wellbeing and healthcare. The main
opportunity for big data scientists in the health sector is the availability of a huge amount of data.
Despite the inherent complexities of healthcare data, there is potential and benefit in developing
and implementing big data solutions within this realm. Discovering new methods for
understanding and associating patterns within data is a big requirement in health care analytics.
Thus, big data analytics applications in healthcare take advantage of the explosion in data to
extract insights for making better-informed decisions and as a research category (Jeba Praba and
Srividhya, 2016).

Big data in the healthcare sector has access to huge amounts of complex data, but efforts have
often been overwhelmed by failures due to inadequate or unavailable electronic data. Apart from
this, the databases holding health-related information are difficult to link to other databases and
to devices in the field of medicine, so patterns and information cannot be surfaced on time.
Nevertheless, the data can be used to curb the rising cost of healthcare and the inefficient systems
that stifle faster and better healthcare benefits across the board (Kulkarni et al., 2020). Most
experts define big data in terms of the three Vs described earlier (Harvey, 2017).

Predictive analytics supports health care sectors to achieve a high level of effective overall care
and preventive care, as predictive systems’ results allow treatments and actions to be taken when
all the risks are recognized in early stages, which aids in minimizing costs (Conley et al., 2008).
Big data often has high values in volume, velocity, variety, variability, value, complexity, and
sparseness. Big data has the potential of applications in healthcare which include disease
surveillance, epidemic control, clinical decision support, population health management, etc.
(Sabharwal et al., 2016). Big Data in healthcare can provide
significant benefits such as detecting diseases at an early stage. The inclusion of Big Data
analytics in smart healthcare systems brings innovative electronic and mobile health (e/m-health)
that increase efficiency and save medical costs (Pramanik et al., 2017).

In addition to the above-mentioned applications of big data in healthcare, other applications
include research and development, public health, genomic analytics, pre-adjudication fraud
analysis, and patient profile analytics.

Research & development: 1) predictive modeling to lower attrition and produce a leaner, faster,
more targeted R & D pipeline in drugs and devices; 2) statistical tools and algorithms to improve
clinical trial design and patient recruitment to better match treatments to individual patients, thus
reducing trial failures and speeding new treatments to market; and 3) analyzing clinical trials and
patient records to identify follow-on indications and discover adverse effects before products
reach the market.

Public health: 1) analyzing disease patterns and tracking disease outbreaks and transmission to
improve public health surveillance and speed response; 2) faster development of more accurately
targeted vaccines, e.g., choosing the annual influenza strains; and 3) turning large amounts of
data into actionable information that can be used to identify needs, provide services, and predict
and prevent crises, especially for the benefit of populations (Manyika et al., 2011). IBM (2012)
also suggests that big data analytics in healthcare can contribute to the following.

Evidence-based medicine: Combine and analyze a variety of structured and unstructured data –
EMRs, financial and operational data, clinical data, and genomic data – to match treatments with
outcomes, predict patients at risk for disease or readmission, and provide more efficient care.

Genomic analytics: Execute gene sequencing more efficiently and cost-effectively, and make
genomic analysis a part of the regular medical care decision process and the growing patient
medical record (IBM, 2012).

Pre-adjudication fraud analysis: Rapidly analyze large numbers of claim requests to reduce
fraud, waste, and abuse.

Device/remote monitoring: Capture and analyze, in real time, large volumes of fast-moving data
from in-hospital and in-home devices, for safety monitoring and adverse event prediction.

Patient profile analytics: Apply advanced analytics to patient profiles (e.g., segmentation and
predictive modeling) to identify individuals who would benefit from proactive care or lifestyle
changes, for example, those patients at risk of developing a specific disease (e.g., diabetes) who
would benefit from preventive care (Raghupathi & Raghupathi, 2014).

2.5.1. Sources of Data in Health Sector

In the healthcare industry, various sources for big data include hospital records, medical records
of patients, results of medical examinations, and devices that are a part of the internet of things.
Biomedical research also generates a significant portion of big data relevant to public
healthcare (Dash et al., 2019).

The main sources of health statistics are surveys, administrative and medical
records, claims data, vital records, surveillance, disease registries, and peer-reviewed literature.
We will take a look at these sources, and the pros and cons of using each to create health
statistics (Nlm.nih, 2020).

2.6. The algorithm in Data Analytics

Advanced algorithms are required to implement ML and AI approaches for big data analysis on
computing clusters. A programming language suitable for working on big data (e.g., Python, R,
or other languages) can be used to write such algorithms or software. Therefore, a good
knowledge of both biology and IT is required to handle the big data from biomedical research;
such a combination of the two trades usually falls under bioinformatics. The most common
platforms used for working with big data include Hadoop and Apache Spark (Belay, 2019).
These platforms are briefly introduced below.
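
As a minimal illustration only (not the implementation used in this thesis), and assuming a
hypothetical CSV file named patient_visits.csv with one row per patient visit and a "diagnosis"
column, the kind of disease-frequency counting this study is concerned with could be sketched in
Python with pandas as follows:

    import pandas as pd

    # Hypothetical input: one row per patient visit, with a "diagnosis" column.
    visits = pd.read_csv("patient_visits.csv")

    # Count how often each diagnosis appears and keep the ten most frequent ones.
    top_ten = visits["diagnosis"].value_counts().head(10)
    print(top_ten)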

2.6.1. Mining algorithms for the specific problem

Although big data issues have appeared only in roughly the last ten years, Fan and Bifet pointed
out that the terms “big data” and “big data mining” were both first presented in 1998. That big
data and big data mining appeared at almost the same time shows that finding something useful
in big data has been one of the major tasks in this research domain from the start. Data mining
algorithms for data analysis also play a vital role in big data analysis, in terms of computation
cost, memory requirement, and accuracy of the results. In this section, we give a brief discussion
from the perspective of analysis and search algorithms to explain their importance for big data
analytics (Chiang et al., 2011).

Clustering algorithms: in the big data age, traditional clustering algorithms become even more
limited than before because they typically require that all the data be in the same format and be
loaded into the same machine to find something useful from the whole data set. Although the
problem (Chiang et al., 2011) of analyzing large-scale and high-dimensional datasets has
attracted many researchers from various disciplines since the last century, and several solutions
(Xu and Wunsch, 2009) have been presented in recent years, the characteristics of big data still
bring up several new challenges for data clustering. Among them, how to reduce the data
complexity is one of the important issues for big data clustering. Shirkhorshidi et al. (2014)
divided big data clustering into two categories: single-machine clustering (i.e., sampling and
dimension reduction solutions) and multiple-machine clustering (parallel and MapReduce
solutions). This means that traditional reduction solutions can also be used in the big data age,
because the complexity and memory space needed for the process of data analysis will be
decreased by using sampling and dimension reduction methods. More precisely, sampling can be
regarded as reducing the “amount of data” entered into a data analyzing process, while dimension
reduction can be regarded as “downsizing the whole dataset” because irrelevant dimensions will
be discarded before the data analyzing process is carried out.
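
The single-machine strategy just described (sampling followed by dimension reduction before
clustering) can be sketched in Python as follows, assuming only that NumPy and scikit-learn are
available; this is an illustrative sketch, not the specific method of any of the cited works:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for a large, high-dimensional data set.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100_000, 50))

    # Sampling: reduce the "amount of data" by clustering a random subset.
    sample = data[rng.choice(len(data), size=10_000, replace=False)]

    # Dimension reduction: "downsize the whole dataset" by discarding
    # less informative dimensions with PCA.
    reduced = PCA(n_components=10).fit_transform(sample)

    # Mini-batch k-means keeps memory use low during the clustering itself.
    labels = MiniBatchKMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)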

CloudVista (X, L, G, Chen) is a representative solution for clustering big data that used cloud
computing to perform the clustering process in parallel. Birch(Zhang T, Ramakrishnan R, Livny
M. BIRCH, 1996) and sampling method were used in CloudVista to show that it can handle
large-scale data, e.g., 25 million census records. Using GPU to enhance the performance of a
clustering algorithm is another promising solution for big data mining. The multiple species
flocking (MSF) was applied to the CUDA platform from NVIDIA to reduce the computation
time of the clustering algorithm (Cui X, Charles JS, Potok T., 2013). The simulation results show
that the speedup factor can be increased from 30 up to 60 by using GPU for data clustering.
Since most traditional clustering algorithms (e.g., k-means) require centralized computation, how to make them capable of handling big data clustering problems is the major concern of Feldman et al. (Feldman D, Schmidt M, Sohler C, 2013), who use a tree construction for generating the coresets in parallel, the so-called "merge-and-reduce" approach. Moreover, Feldman et al. pointed out that by using this solution for clustering, the update time per datum and the memory footprint of traditional clustering algorithms can be significantly reduced.

Classification algorithms. As with clustering algorithms for big data mining, several studies have attempted to modify traditional classification algorithms to make them work in a parallel computing environment, or to develop new classification algorithms that work naturally in a parallel computing environment. In (Tekin C, van der Schaar M., 2013), the design of the classification algorithm took into account input data that are gathered by distributed data sources and processed by a heterogeneous set of learners. In this study, Tekin et al.
presented a novel classification algorithm called “classify or send for classification” (CoS). They
assumed that each learner can be used to process the input data in two different ways in a
distributed data classification system. One is to perform a classification function by itself while
the other is to forward the input data to another learner to have them labeled. The information
will be exchanged between different learners. In brief, this kind of solution can be regarded as
cooperative learning to improve accuracy in solving the big data classification problem. An
interesting solution uses quantum computing to reduce the memory space and computing cost of
a classification algorithm.

A recent study (Ma C, Zhang HH, Wang X., 2014) shows that some traditional mining
algorithms, statistical methods, preprocessing solutions, and even the GUI’s have been applied to
several representative tools and platforms for big data analytics. The results show clearly that
machine learning algorithms will be one of the essential parts of big data analytics. One of the
problems in using current machine learning methods for big data analytics is that, like most traditional data mining algorithms, they are designed for sequential or centralized computing. One of the most promising solutions is therefore to adapt them for parallel computing. Fortunately, some machine learning algorithms (e.g., population-based algorithms) lend themselves naturally to parallel computing, as has been demonstrated for several years by parallel versions of genetic algorithms (Belay, 2019).

2.6.2. Linear Regression

Linear regression is one of the most basic algorithms of advanced analytics. This also makes it
one of the most widely used. People can easily visualize how it is working and how the input
data is related to the output data. Linear regression uses the relationship between two sets of
continuous quantitative measures. The first set is called the predictor or independent variable.
The other is the response or dependent variable. The goal of linear regression is to identify the
relationship in the form of a formula that describes the dependent variable in terms of the
independent variable. Once this relationship is quantified, the dependent variable can be
predicted for any instance of an independent variable. One of the most common independent
variables used is time. Whether your independent variable is revenue, costs, customers, use, or
productivity, if you can define the relationship it has with time, you can forecast value with
linear regression(Hong, L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).
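
To make this concrete, a minimal scikit-learn sketch of fitting a linear trend over time is given below. It is an illustration only; the monthly figures and variable names are invented placeholders, not thesis data.

# Minimal sketch: linear regression of a quantity against time (placeholder data).
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)            # independent variable: time
visits = np.array([110, 115, 123, 130, 128, 140,
                   145, 150, 149, 158, 163, 170])   # dependent variable

model = LinearRegression().fit(months, visits)
print(model.coef_, model.intercept_)                # slope and intercept of the fitted line
print(model.predict([[13]]))                        # forecast for the next time step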

I.  Logistic Regression

Logistic regression sounds similar to linear regression but is focused on problems involving
categorization instead of quantitative forecasting. Here the output variable values are discrete
and finite rather than continuous and with infinite values as with linear regression. The goal of
logistic regression is to categorize whether an instance of an input variable either fits within a
category or not. The output of logistic regression is a value between 0 and 1. Results closer to 1
indicate that the input variable more clearly fits within the category. Results closer to 0 indicate
that the input variable likely does not fit within the category. Logistic regression is often used to
answer clearly defined yes or no questions. Will a customer buy again? Is a buyer creditworthy?
Will the prospect become a customer? Predicting the answer to these questions can spawn a
series of actions within the business process which can help drive future revenue(Hong, L., Luo,
M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).
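
As an illustration of this yes/no setting, the hedged scikit-learn sketch below fits a logistic regression and returns a probability between 0 and 1; the features and labels are invented placeholders.

# Minimal sketch: binary classification with logistic regression (placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[22, 0], [35, 1], [47, 1], [19, 0], [52, 1], [28, 0]])  # e.g. age, prior purchase
y = np.array([0, 1, 1, 0, 1, 0])                                      # 1 = fits the category

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[40, 1]]))   # class probabilities, each between 0 and 1
print(clf.predict([[40, 1]]))         # hard yes/no (1/0) decision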

II. Classification and Regression Trees

Classification and regression trees use a sequence of decisions to categorize data. Each decision is based on a
question related to one of the input variables. With each question and corresponding response,
the instance of data gets moved closer to being categorized in a specific way. This set of
questions and responses and subsequent divisions of data create a tree-like structure. At the end
of each line of questions is a category. This is called the leaf node of the classification tree(Hong,
L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).

These classification trees can become quite large and complex. One method of controlling the
complexity is through pruning the tree or intentionally removing levels of questioning to balance
between exact fit and abstraction. A model that works well with all instances of input values,
both those that are known in training and those that are not, is paramount. Preventing the
overfitting of this model requires a delicate balance between exact fit and abstraction(Hong, L.,
Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).

A variant of classification and regression trees is called random forests. Instead of constructing a
single tree with many branches of logic, a random forest is a culmination of many small and
simple trees that each evaluate the instances of data and determine a categorization. Once all of
these simple trees complete their data evaluation, the process merges the individual results to
create a final prediction of the category based on the composite of the smaller categorizations.
This is commonly referred to as an ensemble method. These random forests often do well at
balancing exact fit and abstraction and have been implemented successfully in many business
cases(Hong, L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).

In contrast to logistic regression, which focuses on a yes or no categorization, classification and regression trees can be used to predict multi-value categorizations. They also make it easier to visualize and trace the definitive path that guided the algorithm to a specific categorization (Hong, L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L., 2018).
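
A hedged sketch of the contrast just described is shown below: a single pruned decision tree next to a random forest ensemble, both predicting a multi-value category on invented placeholder data.

# Minimal sketch: a single decision tree versus a random forest ensemble (placeholder data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 20], [2, 35], [1, 50], [3, 28], [2, 61], [3, 45]])
y = np.array(["A", "B", "A", "C", "B", "C"])        # more than two categories

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)         # max_depth acts like pruning
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

print(tree.predict([[2, 40]]))       # category from one tree of questions
print(forest.predict([[2, 40]]))     # ensemble vote over many small trees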

2.6.3. K-Nearest Neighbors

K-nearest neighbor is also a classification algorithm. It is known as a "lazy learner" because the
training phase of the process is very limited. The learning process is composed of the training set
of data being stored. As new instances are evaluated, the distance to each data point in the
training set is evaluated and there is a consensus decision as to which category the new instance
of data falls into based on its proximity to the training instances. This algorithm can be
computationally expensive depending on the size and scope of the training set. As each new
instance has to be compared to all instances of the training data set and a distance derived, this
process can use many computing resources each time it runs. This categorization algorithm
allows for multivalued categorizations of the data. However, noisy training data tends to skew classifications. K-nearest neighbors is often chosen because it is easy to use, easy to train, and easy to interpret the results. It is often used in search applications when you are trying to find
similar items(Hong, L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. , 2018).
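
The following minimal sketch, with invented placeholder points, shows the "lazy" behaviour described above: training only stores the data, and distances are computed when a new instance is classified.

# Minimal sketch: k-nearest neighbours classification (placeholder data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5], [7.0, 8.5]])
y_train = np.array([0, 0, 1, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)   # "training" just stores the data
print(knn.predict([[5.5, 8.2]]))       # consensus of the 3 closest training points
print(knn.kneighbors([[5.5, 8.2]]))    # distances to, and indices of, those neighbours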

2.6.4. K-Means Clustering

K-means clustering focuses on creating groups of related attributes. These groups are referred to
as clusters. Once these clusters are created, other instances can be evaluated against them to see
where they best fit. This technique is often used as part of data exploration. To start, the analyst
specifies the number of clusters. The K-means cluster process breaks the data into that number of
clusters based on finding data points with similarities around a common hub, called the centroid.
These clusters are not the same as categories because initially, they do not have business
meaning. They are just closely related instances of input variables. Once these clusters are
identified and analyzed, they can be converted to categories and provided a name that has
business meaning. K-means clustering is often used because it is simple to use and explain and
because it is fast. One area to note is that k-means clustering is extremely sensitive to outliers.
These outliers can significantly shift the nature and definition of these clusters and ultimately the
results of the analysis. These are some of the most popular algorithms in use in advanced
analytics initiatives. Each has pros and cons and different ways in which it can be effectively
utilized to generate business value. The end target with the implementation of these algorithms is
to further refine the data to a point where the information that results can be applied to business
decisions. It is this process of feeding downstream processes with more refined and higher-value data that is fundamental to companies truly harnessing the value of their data and achieving the results that they desire (Belay, 2019).
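
To illustrate the analyst-chosen number of clusters and the centroids, a minimal scikit-learn sketch with invented placeholder points follows.

# Minimal sketch: k-means clustering with two analyst-specified clusters (placeholder data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                     # cluster assignment of each training point
print(kmeans.cluster_centers_)            # the centroid (common hub) of each cluster
print(kmeans.predict([[0, 0], [12, 3]]))  # assign new instances to the nearest centroid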

2.7. Tools used in big data analysis

Different commercial and open-source software packages are available for Big Data mining and analysis. Apache Hadoop emerged as the most popular open-source platform supporting an implementation of MapReduce, with all the features mentioned previously such as fault tolerance, failure recovery, data partitioning, and inter-job communication. However, MapReduce (and Hadoop) do not scale well when dealing with streaming data that are generated in iterative and online processes, as used in machine learning and analytics.

As an alternative to Hadoop, Apache Spark was created, capable of performing faster distributed computing tasks by using in-memory techniques. Spark overcame MapReduce's problems with iterative and online processing and its frequent I/O to disk: it loads data into memory and reuses it repeatedly (Dindokar, Shantanu, 2018). Spark is a general-purpose framework. This
allows us to implement several distributed programming models on top of it (like Pregel or
Hadoop). Spark is built on top of a new abstraction model called Resilient Distributed Datasets
(RDDs). This versatile model allows controlling the persistence and managing the partitioning of
data, among other features.
Some competitors to Apache Spark have emerged lately, especially for data stream processing.
Apache Storm is a prime candidate. It is an open-source distributed real-time processing
platform, which is capable of processing millions of tuples per second per node in a fault-tolerant
way. Apache Flink is a recent top-level Apache project designed for distributed stream and batch
data processing. Spark employs mini-batch stream processing rather than a truly online or real-time processing approach; this gap is now filled by Storm and Flink (Dindokar, Shantanu, 2018).

The metastore database is an important aspect of the Hive infrastructure. It is a separate database,
relying on a traditional RDBMS such as MySQL or PostgreSQL, that holds metadata about Hive
databases, tables, columns, partitions, and Hadoop-specific information such as the underlying
data files and HDFS block locations. The metastore database is shared by other components. For
example, the same tables can be inserted into, queried, altered, and so on by both Hive and
Impala. Although you might see references to the "Hive metastore", be aware that the metastore
database is used broadly across the Hadoop ecosystem, even in cases where you are not using
Hive itself. The metastore database is relatively compact, with fast-changing data. Backup, replication, and other kinds of management operations affect this database (Dindokar, Shantanu,
2018).

Cloudera recommends that you deploy the Hive metastore, which stores the metadata for Hive tables and partitions, in "remote mode." In this mode, the metastore service runs in its own JVM process, and other services, such as HiveServer2, HCatalog, and Apache Impala, communicate with the metastore using the Thrift network API. Hive traditionally uses MapReduce behind the
scenes to parallelize the work and perform the low-level steps of processing a SQL statement
such as sorting and filtering. Hive can also use Spark as the underlying computation and
parallelization engine. Apache HBase is a NoSQL database that supports real-time read/write
access to large datasets in HDFS(Dindokar, Shantanu, 2018).

2.7.1. Hadoop for big data applications

Big Data are collections of information that would have been considered gigantic, impossible to
store and process, a decade ago. The processing of such large quantities of data imposes
particular methods. A classic database management system is unable to process as much
information. Hadoop is an open-source software product (or, more accurately, a "software library framework") that is collaboratively produced and freely distributed by the Apache Foundation; effectively, it is a developer's toolkit designed to simplify the building of Big Data solutions (Fujitsu, 2011).

Although Big Data principles and approaches are frequently discussed, there are not many
technologies which are convenient to deal with such data. Due to the definitions of the volume
and the velocity, the tools which are supposed to deal with Big Data have to offer a distributed
computing approach. There are the following approaches: multiple data and a single program,
and single data and multiple programs(Belay, 2019).

In the first case, there is a single program, which is run on more nodes, where all nodes process
different data. On the contrary, the second case is considered to have only one dataset, which is
processed by a program divided into small tasks that are run on different nodes in parallel. Due
to it, some tools try to abstract from the physical distribution as much as possible. Since the
Apache Company released its new implementation of the Map-Reduce paradigm, a whole
ecosystem called Hadoop has started evolving. The MapReduce paradigm offers the means to
break a large task into smaller tasks, run in parallel, and consolidate the outputs of the individual
tasks into the final output. The significant expansion of the ecosystem was driven by the use of simple programming models to process large datasets across clusters, and was amplified by the fact that the whole solution started as open-source software. Hadoop, as the first publicly known and widely discussed Big Data processing technology, has been used as the base of open-source and commercial extensions. In other words, most sets of Big Data tools are based on the Hadoop solution. These solutions offer methods and approaches to load, pre-process, store, query, and analyze data. In the following, the Hadoop ecosystem is described along with other technologies that have evolved from it or that make use of its components (Belay, 2019).

The overwhelming trend towards digital services, combined with cheap storage, has generated
massive amounts of data that enterprises need to effectively gather, process, and analyze. Data
analysis techniques from the data warehouse and high-performance computing communities are invaluable for many enterprises; however, their cost or the complexity of scaling up often discourages the accumulation of data without an immediate need. As valuable knowledge may nevertheless
be buried in this data, related scaled-up technologies have been developed. Examples include
Google’s MapReduce, and the open-source implementation, Apache Hadoop(Belay, 2019).

Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's
contributors work for some of the world’s biggest technology companies. That diverse,
motivated community has produced a collaborative platform for consolidating, combining, and
understanding data. Technically, Hadoop consists of two key services: data storage using the
Hadoop Distributed File System (HDFS) and large-scale parallel data processing using a
technique called MapReduce.

After completing this hands-on lab, you will be able to:

• Use Hadoop commands to explore HDFS on the Hadoop system

• Use Web Console to explore HDFS on the Hadoop system

2.7.2. MapReduce

As mentioned earlier, the MapReduce paradigm provides the means to break a large task into
smaller tasks, run the tasks in parallel and consolidate the outputs of the individual tasks into the
final output. MapReduce consists of two basic parts: a map step and a reduce step(Belay, 2019).

 Map – operates on a piece of data and generates some intermediate output.


 Reduce – gathers the intermediate outputs from the map steps, processes them, and produces the consolidated final output.

The main advantage of MapReduce is the distribution of the workload over a cluster of computers so that tasks run in parallel. In particular, MapReduce provides a technique in which the processing of one portion of the input can be run independently of the other input portions. In other words, the workload can be easily distributed over the cluster (Belay, 2019).
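
To make the two steps concrete, the following pure-Python sketch imitates the MapReduce flow for a simple counting task (here, counting disease names); it illustrates the paradigm only and is not Hadoop code, and the input records are invented.

# Illustrative MapReduce-style flow in plain Python (not actual Hadoop code).
from collections import defaultdict

records = ["UTI", "Dyspepsia", "UTI", "Tonsilitis", "UTI"]     # placeholder input split

# Map step: each record produces an intermediate (key, 1) pair.
mapped = [(disease, 1) for disease in records]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce step: consolidate each group into a final (key, count) output.
reduced = {key: sum(values) for key, values in groups.items()}
print(reduced)        # {'UTI': 3, 'Dyspepsia': 1, 'Tonsilitis': 1}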

2.7.3. Distributed File System – HDFS

The Hadoop Distributed File System (HDFS) is a file system that provides the capability to
distribute data across a cluster to take advantage of the parallel processing of MapReduce. HDFS
is designed to run on common low-cost hardware. Consequently, it means there is no need to
deploy it only on supercomputers. Although it is implemented in Java, HDFS can be deployed on
a wide range of machines. HDFS has a master/slave architecture: a single master server, the NameNode, manages the filesystem namespace and regulates access to files by clients, and it is dedicated to managing the namespace services. In addition, a number of DataNodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. A file in HDFS is split into one or more blocks that are stored by a set of DataNodes, which are also responsible for serving read and write requests from the file system (Belay, 2019).

2.8. Related Work

Some countries are applying big data analysis for different sectors for example; Japan has
already started using Big Data technologies to improve medical treatment and healthcare for
elderly people. Big Data analytics can be used to achieve valuable information from large and
complicated datasets via data mining (Wang, Lidong & Alexander, Cheryl. , 2019).

To analyze diversified medical data, the healthcare domain describes analytics in four categories: descriptive, diagnostic, predictive, and prescriptive analytics.

Descriptive analytics refers to describing and commenting on current medical situations, whereas diagnostic analytics explains the reasons and factors behind the occurrence of certain events,
for example, choosing a treatment option for a patient based on clustering and decision trees.

Predictive analytics focuses on the predictive ability of future outcomes by determining trends
and probabilities. These methods are mainly built up of machine learning techniques and are
helpful in the context of understanding complications that a patient can develop.

Prescriptive analytics performs analysis to propose an action towards optimal decision making, for example, the decision to avoid a given treatment for a patient based on observed side effects and predicted complications. To improve the performance of current medical systems, the integration of big data into healthcare analytics can be a major factor; however, sophisticated strategies need to be developed. An architecture of best practices for the different kinds of analytics in the healthcare domain is required for integrating big data technologies to improve outcomes. However, there are many challenges associated with the implementation of such strategies (Dash, S., Shakyawar, S., Sharma, M., and Kaushik, S., 2019).

Because it deals with large, complex, multi-source, incomplete, and heterogeneous data, obtaining valid and robust diagnostic forecasting predictions from Big Data analytics requires an approach that starts with data cleaning, data integration, dimension reduction, and data normalization for the heterogeneous data. This includes methods for the identification of missing patterns, data wrangling, imputation, conversion, fusion, and cross-linking. Next, mechanisms need to be introduced for the automated extraction of structured data elements representing biomedical signature vectors associated with unstructured data. Logistic regression has been employed in the statistical computing environment for model fitting, parameter estimation, and machine learning classification. The final component of this protocol requires a computational platform that enables each of these steps to be implemented, integrated, and validated (Belay, 2019).

Big data is collected from a large assortment of sources, such as social networks, videos, digital images, and sensors. The major aim of Big Data Analytics is to discover new patterns and relationships which might otherwise be invisible, and it can provide new insights about the users who created the data. There are many tools available for mining and analysis of Big Data, both professional and non-professional. In their paper, Hemlata and Preeti summarise different big data analytic methods and tools (Hemlata and Preeti, 2016).

Big data analytics is rapidly expanding in all areas of science and engineering, including the physical, biological, and biomedical sciences. It involves the collection of large volumes of structured and unstructured data. Big data arises from business applications, agriculture, healthcare, social media, sensor data, and so on. The data in healthcare is increasing quickly and is expected to grow substantially in the coming years. The adoption of an evidence-based approach with big data solutions will play a critical role in the outcomes of the healthcare industry and in providing patient-centric treatment. Big data analytics is an expanding area with the potential to provide useful insight into healthcare. With big data and analytics tools and technologies, a predictive system can be designed to identify increased risk. That work describes big data analytics, its characteristics, and the potential and implementation of big data analytics in healthcare, and ends with a survey of challenges and future directions in the healthcare sector (J. Jeba Praba and V. Srividhya, 2016).

CHAPTER THREE

RESEARCH METHODOLOGY

3.1. Overview

To obtain valid and robust diagnostic forecasting predictions from large, complex, multi-source, incomplete, and heterogeneous data, the Big Data analytics approach used in this work passed through several stages. Most approaches start with data cleaning, data integration, dimension reduction, and data normalization. This also includes methods for the identification of missing patterns, data wrangling, imputation, conversion, fusion, and cross-linkage. The next step in big data analysis is to introduce mechanisms for the automated extraction of structured data elements, the biomedical signature vectors associated with unstructured data. Then logistic regression needs to be employed in the statistical computing environment for model fitting, parameter estimation, and machine learning classification. The final component of this protocol requires a computational platform that enables each of these steps to be implemented, integrated, and validated.

The healthcare industry is quite slow when dealing with data, especially data sharing and
integration. Besides the pure technical challenge of clinical data integration, there’s a problem of
the willingness and ability to collaborate between players, healthcare providers, and patients. So
data collection, storage, integration, and analysis still are a broken process. Therefore this
chapter deals with the different steps that were carried out to overcome the above-mentioned
problems in this research work in detail.

3.2. Research Design

As shown in Figure 2, a stepwise process was applied to the data set to achieve the objective.

Figure 2. Schematic diagram of research design.

3.3. Population

As stated in the data collection section, the data were collected from the Addis Ababa Health Bureau. The data set contains around 30.6 million patient records from 2010 to 2012.

3.4. Sample Size and Sampling Techniques

As stated in the data collection section, the data were collected from the Addis Ababa Health Bureau. The data set contains around 30.6 million patient records from 2010 to 2012. All the data were used in this study, but for predicting the most frequent diseases the top one hundred diseases were selected from the data set.

3.5. Data Source and Collection Methods

The dataset was obtained from the Addis Ababa Health Bureau. The data set contains data
collected between 2010 and 2012, three consecutive years. Besides the disease type, age and gender are included in the data.

3.6. Data Analysis Tool

For data analysis, Python was used for preprocessing tasks such as noise identification, data cleaning, data normalization, dimension reduction, data transformation, and data integration. For model development and disease prediction, Hadoop was used.

3.7. Model

The focus of this thesis is adaptations of logistic regression (LR) which is well-understood and
widely used in the statistics, machine learning, and data analytics communities. Its benefits
include a firm statistical foundation and a probabilistic model useful for “explaining” the data.
There is a perception that LR is slow, unstable, and unsuitable for large learning or classification tasks. Through fast approximate numerical methods, regularization to avoid numerical instability, and an efficient implementation, it can be shown that LR outperforms modern algorithms like Support Vector Machines (SVM) on a variety of learning tasks. Such implementations, which use a modified iteratively re-weighted least squares estimation procedure, can compute model parameters for sparse binary datasets with hundreds of thousands of rows and attributes, and millions or tens of millions of nonzero elements, in just a few seconds.

Why LR?

A wide variety of classification algorithms exist in the literature. Probably the most popular and
among the newest is support vector machines (SVM). Older learning algorithms such as k-
nearest-neighbor (KNN), decision trees (DTREE) or Bayes’ classifier (BC) are well understood
and widely applied. One might ask why we are motivated to use LR for classification instead of
the usual candidates. That LR is suitable for binary classification is made clear in Chapter 4. Our
motivation for exploring LR as a fast classifier to be used in data mining applications is its
maturity. LR is already well understood and widely known. It has a statistical foundation which,
in the right circumstances, could be used to extend classification results into a deeper analysis.
We believe that LR is not widely used for data mining because of the assumption that LR is
unsuitably slow for high-dimensional problems. In Zhang and Oles (2010), the authors observe
that many information retrieval experiments with LR lacked regularization or used too few
attributes in the model. Though they address these deficiencies, they still report that LR is
“noticeably slower” than SVM. We believe we have overcome the stability and speed problems
reported by other authors.
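
As a rough illustration of this claim (not the implementation used in the cited work), the sketch below fits an L2-regularized logistic regression to a synthetic sparse binary-label problem; the sizes and parameters are arbitrary assumptions.

# Minimal sketch: regularized logistic regression on a sparse matrix (synthetic data).
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = sparse_random(10000, 500, density=0.01, format="csr", random_state=rng)  # sparse inputs
y = rng.randint(0, 2, size=10000)                                            # binary labels

# C controls the strength of the L2 penalty, which guards against numerical instability.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
print(clf.score(X, y))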

3.8. Methods Used for Data pre-processing

In this stage, six data pre-processing techniques were applied:

 Noise Identification

 Data Cleaning

 Data Normalization

 Dimension Reduction

 Data Transformation

 Data Integration

3.8.1. Noise identification

Random noise detection is one of the main tasks in signal and data processing. The terms random and noise have wide-ranging physical, epistemological, philosophical, and mathematical meanings. Consequently, random noise can be defined, described, and analyzed in many ways.
In the colloquial meaning, the random noise can be treated as an unpredictable and undesirable
signal generated by source or factors out of our control. In practice, the random noise might be
interpreted from two fundamental perspectives: the randomness as some mathematical
characteristic, and the noise as an uncontrolled physical factor. The frequent problem concerning
noises is the fact that the mathematical definition and interpretation can be far from physical
reality, especially when "physical" means economic, social, or medical phenomena. In general, noise handling can be interpreted as a kind of filtered data regression; however, presenting it in the prediction context allows the method's motivation to be explained in a simple way.

As can be seen from the data set in Figure 3, there is no noise that would affect the data analysis process; therefore, the analysis proceeded to the next preprocessing stage.

Figure 3. Raw data collected from the Addis Ababa Health Bureau.

3.8.2. Data Cleaning

Data cleaning is the process of preparing data for analysis by removing or modifying data that is
incorrect, incomplete, irrelevant, duplicated, or improperly formatted. This data is usually not
necessary or helpful when it comes to analyzing data because it may hinder the process or
provide inaccurate results. There are several methods for cleaning data depending on how it is
stored along with the answers being sought. Data cleaning is not simply about erasing
information to make space for new data, but rather finding a way to maximize a data set’s
accuracy without necessarily deleting information. For one, data cleaning includes more actions
than removing data, such as fixing spelling and syntax errors, standardizing data sets, and
correcting mistakes such as empty fields, missing codes, and identifying duplicate data points.
Data cleaning is considered a foundational element of the data science basics, as it plays an
important role in the analytical process and uncovering reliable answers. Most importantly, the
goal of data cleaning is to create data sets that are standardized and uniform to allow business
intelligence and data analytics tools to easily access and find the right data for each query.
(sisense, 2020). To modify or delete such data and improve data quality, the data must first be identified as incomplete, inaccurate, or unreasonable (Chen, M., Mao, S., & Liu, Y, 2014). Because laboratory information is examined in the out-patient department, the multisource and multimodal nature of healthcare data results in high complexity and noise problems. Besides, there are also problems with missing values and impurity in the high-volume data. Since data quality determines information quality, which will eventually affect the decision-making process, it is critical to develop efficient big data cleansing approaches to improve data quality for making accurate and effective decisions (Fang, R., Pouyanfar, S., Yang, Y., Chen, S. C., & Iyengar, S. S, 2016).

A missing value for a variable is one that has not been entered into a dataset, but an actual value
exists (PYLE, P., SYDEMAN, W., & HESTER, M., 2001 ). Simple (non-stochastic) imputation
is often used. In simple imputation, missing values in a variable are replaced with a single value
(for example, mean, median, or mode). However, simple imputation produces biased results for
data that aren’t missing completely at random (MCAR). If there are moderate to large amounts
of missing data, simple imputation is likely to underestimate standard errors, distort correlations
among variables, and produce incorrect p-values in statistical tests. This approach should be
avoided for most missing data problems. The study of linear correlations enabled us to fill in some of the unknown values. To handle a dataset with missing values, the most common strategies are to: 1) remove the cases with unknowns; 2) fill in the unknown values by exploring the similarity between cases; 3) fill in the unknown values by exploring the correlations between variables; or 4) use tools that can handle these values. A database also
contains irrelevant attributes. Therefore, relevance analysis in the form of correlation analysis
and attribute subset selection can be used to detect attributes that do not contribute to the
classification or prediction task. Including such attributes may otherwise slow down and possibly
mislead the learning step. Typically, data cleaning and data integration are performed as a pre-
processing step. Inconsistencies in attribute or dimension naming can cause redundancies in the
resulting dataset. Data cleaning can be performed to detect and remove redundancies that may
have resulted from data integration. The removal of redundant data is often regarded as a kind of
data cleaning as well as data reduction(Belay, 2019).
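
A minimal pandas / scikit-learn sketch of strategy 1) (dropping cases) and of simple imputation follows; the tiny frame and column names are invented placeholders, and simple mean or median imputation carries the bias caveats noted above.

# Minimal sketch: common missing-value strategies (placeholder data).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [25, np.nan, 40, 31],
                   "Cases2010": [120, 95, np.nan, 60]})

dropped = df.dropna()                                  # 1) remove the cases with unknowns
df["Age"] = df["Age"].fillna(df["Age"].median())       # simple (median) imputation

imputer = SimpleImputer(strategy="mean")               # column-wise mean imputation
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)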

3.8.3. Data normalization

There are some goals in mind when undertaking the data normalization process. The first one is
to get rid of any duplicate data that might appear within the data set. This goes through the
database and eliminates any redundancies that may occur. Redundancies can adversely affect the
analysis of data since they are values that aren’t exactly needed. Expunging them from the
database helps to clean up the data, making it easier to analyze. The other goal is to logically
group data together. You want data that relates to each other to be stored together. This will
occur in a database which has undergone data normalization. If data items depend on each other, they should be stored close together in the data set. With that general overview in mind, let's take a closer look at the process itself. While the process can vary depending on the type of database you have and what type of information you collect, it usually involves several steps. One such step is eliminating duplicate data, as discussed above. Another step is resolving any conflicting data: sometimes datasets contain information that conflicts, so data normalization is
the data (import.io., 2020). This takes data and converts it into a format that allows further
processing and analysis to be done. Finally, data normalization consolidates data, combining it
into a much more organized structure. Consider the state of big data today and how much of it
consists of unstructured data. Organizing it and turning it into a structured form is needed now
more than ever, and data normalization helps with that effort.

Clean, normalized sets of data can:

 Clarify drug data used in a medical system to automatically map local data to standard
codes used in the healthcare industry

 Unify disparate EHRs and patient data of unrelated providers to facilitate data sharing
and make full, meaningful use of it

 Unify financial and administrative electronic documents to facilitate KPI analysis.
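
To make the deduplication, conflict resolution, and grouping steps described above concrete, a minimal pandas sketch on an invented placeholder table follows.

# Minimal sketch: standardize formats, drop duplicates, and group related records (placeholder data).
import pandas as pd

df = pd.DataFrame({"Disease": ["UTI", "UTI", "Dyspepsia", "uti"],
                   "Gender": ["Female", "Female", "Male", "Female"],
                   "Cases": [100, 100, 80, 40]})

df["Disease"] = df["Disease"].str.strip().str.upper()  # resolve conflicting spellings/formats
df = df.drop_duplicates()                              # eliminate duplicate records
grouped = df.groupby(["Disease", "Gender"], as_index=False)["Cases"].sum()  # logical grouping
print(grouped)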

3.8.4. Dimension Reduction (normalization)

High-dimensionality reduction has emerged as one of the significant tasks in data mining applications. For example, you may have a dataset with hundreds of features (columns in your database). Dimensionality reduction means reducing those features or attributes by combining or merging them in such a way that not much of the significant character of the original dataset is lost.

One of the major problems that occur with high dimensional data is widely known as the “Curse
of Dimensionality”. The curse of dimensionality refers to the phenomena that occur when
classifying, organizing, and analyzing high dimensional data that does not occur in low
dimensional spaces, specifically the issue of data sparsity and “closeness” of data. This pushes us
to reduce the dimensions of our data if we want to use them for analysis.

To overcome the above-mentioned challenges that come with higher dimensional data, there is
this need of reducing the dimensions of the data that is planned to be analyzed and visualized.
Dimensionality reduction is accomplished based on either feature selection or feature extraction
techniques. Feature selection is based on omitting those features from the available measurements which do not contribute to class separability. In other words, redundant and irrelevant features are ignored. Feature extraction, on the other hand, considers the whole
information content and maps the useful information content into a lower-dimensional feature
space. One can differentiate the techniques used for dimensionality reduction as linear
techniques and non-linear techniques as well.
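
As one hedged example of the feature-extraction route, the linear technique PCA can map a high-dimensional table to a handful of components; the data here are random placeholders.

# Minimal sketch: feature extraction with PCA, a linear dimensionality-reduction technique.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(500, 100)                       # 500 rows, 100 features (placeholder data)

pca = PCA(n_components=10)                   # keep the 10 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (500, 10)
print(pca.explained_variance_ratio_.sum())   # share of variance retained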

3.8.5. Data Transformation

Data transformation is the process of converting data from one format or structure into another
format or structure. Data transformation is critical to activities such as data integration and data
management. Data transformation can include a range of activities: you might convert data types,
cleanse data by removing nulls or duplicate data, enrich the data, or perform aggregations,
depending on the needs of your project. Data transformation can increase the efficiency of
analytic and business processes and enable better data-driven decision-making. The first phase of
data transformations should include things like data type conversion and flattening of
hierarchical data. These operations shape data to increase compatibility with analytics systems.
Data analysts and data scientists can implement further transformations additively as necessary
as individual layers of processing. Each layer of processing should be designed to perform a
specific set of tasks that meet a known business or technical requirement.
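
A minimal pandas sketch of the first-phase transformations named above (type conversion and flattening of hierarchical records) is shown below; the nested records are invented placeholders.

# Minimal sketch: flattening nested records and converting data types (placeholder data).
import pandas as pd

raw = [{"patient": {"age": "34", "gender": "F"}, "disease": "Dyspepsia"},
       {"patient": {"age": "7", "gender": "M"}, "disease": "Tonsilitis"}]

flat = pd.json_normalize(raw)                           # flatten the hierarchical structure
flat["patient.age"] = flat["patient.age"].astype(int)   # data type conversion
summary = flat.groupby("disease").size()                # a simple aggregation layer
print(flat.dtypes)
print(summary)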

3.8.6. Data Integration

As we mentioned before, new standards adopted by the healthcare system can facilitate the
electronic exchange of information. They can also decrease the cost and complexity of building
interfaces between different systems.

The key element of the healthcare industry’s data-sharing puzzle is data interoperability. Truly
integrated systems must be easily understood by users, i.e. these systems must be able to
exchange data and subsequently present it through a comprehensive and user-friendly interface.
According to Healthcare Information and Management Systems Society (HIMSS), healthcare
data interoperability is “the ability of different information technology systems and software
applications to communicate, exchange data, and use the information that has been exchanged”.
Data interoperability is a multi-faceted concept capturing the uniform movement and
presentation of healthcare data, uniform safeguarding data security and integrity, protection of
patient confidentiality, and uniform assurance of a common degree of system service quality.

Information technology interoperability can be divided into three levels of complexity: foundational, structural, and semantic. Semantic interoperability is the one that allows disparate
data systems to share data in a useful way, as it requires both structuring the data exchange and
codification of the data, including vocabulary. Semantic interoperability deals with the common
vocabulary for accurate and reliable communication between computers.

It’s clear enough that data integration can be beneficial for the whole pool of stakeholders in a
healthcare facility – the healthcare enterprise, payers, employees, and patients (import.io., 2020).

In the case of data integration or aggregation, datasets are matched and merged based on shared
variables and attributes. Advanced data processing and analysis techniques allow us to mix both
structured and unstructured data for eliciting new insights; however, this requires “clean” data.
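
A minimal pandas sketch of matching and merging two sources on a shared variable follows; both tables and the code values are invented placeholders.

# Minimal sketch: merging two data sets on a shared variable (placeholder data).
import pandas as pd

visits = pd.DataFrame({"Disease": ["UTI", "Dyspepsia"],
                       "AgeGroup": ["15-29", "30-64"],
                       "Cases": [120, 80]})
codes = pd.DataFrame({"Disease": ["UTI", "Dyspepsia", "Tonsilitis"],
                      "Code": ["D001", "D002", "D003"]})

merged = visits.merge(codes, on="Disease", how="left")  # match and merge on the shared variable
print(merged)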

Data integration tools are evolving towards the unification of structured and unstructured data
and will begin to include semantic capabilities. It is often required to structure unstructured data
and merge heterogeneous information sources and types into a unified data layer. Most data
integration platforms use a primary integration model based on either relational or XML data
types. Advanced-Data Virtualization Platforms have been proposed which use an extended
integration data model with the ability to store and read/write all types of data in their native
format such as relational, multidimensional, semantic data, hierarchical, and index files, etc.

Integrating heterogeneous data sources is challenging. One of the reasons is that unique
identifiers between records of two different datasets often do not exist. Determining which data
should be merged may not be clear at the outset. Working with heterogeneous data is often an
iterative process in which the value of data is discovered along the way and the most valuable
data are then integrated more carefully(Rudin, C., Dunson, D., Irizarry, R., Ji, H., Laber, E.,
Leek, J., & Wasserman, L, 2014).

For heterogeneous data sources, data fusion techniques are used to match and aggregate
heterogeneous datasets for creating or enhancing a representation of reality that helps data
mining. Mid-level data fusion methodologies that merge structured and machine-produced data
work well. On the other hand, high-level data fusion tasks for merging multiple unstructured
analog sensor inputs remains challenging. On the other hand for data heterogeneity, the
following two integration was proposed by(Shirkhorshidi AS, Aghabozorgi SR, Teh YW,
Herawan T, 2014):

1) Schema integration: - the essential step of the schema integration process is to identify
correspondences between semantically identical entities of the schemas;

2) Catalogue integration: in Business-to-Business (B2B) applications, trade partners store information about their products in electronic catalogues. Finding correspondences among
entries of the catalogues is referred to as the catalogue matching problem.

Extracting structured information from unstructured data is a fundamental step. Some frameworks are mature for certain classes of information extraction problems, although their adoption remains limited to early adopters. For integrating unstructured and structured data, entity recognition and linking can be used: part of the problem can be resolved by information extraction techniques such as entity recognition, relation extraction, and ontology extraction. Entities in open datasets can be used to identify named entities (people, organizations, places), which can then be used to categorize and organize text content. Named entity recognition and linking tools such as DBpedia Spotlight can be used to link structured and unstructured data. These tools help to automatically build semi-structured knowledge.

While bringing together data from heterogeneous systems, there are three sources of data errors:
data entry errors, data type incompatibilities, and semantics incompatibilities in business entity
definitions. Enterprise data standardization mostly avoids data type mismatches and semantic
incompatibilities in data. Traditionally enterprises used ETL (Extract, Transform, and Load) and
data warehouses (DW) for data integration. However, a technology known as "Data Virtualization" (DV) has found some acceptance as an alternative data integration solution in the last few years. Data Virtualization is a federated database, also termed a composite database. Data Virtualization and enterprise data standardization promise to reduce the cost and implementation time of data integration. Unlike a DW, DV defines data cleaning, data joins, and transformations programmatically using logical views. DV allows for extensibility and reuse by allowing logical views to be chained. But DV is not a replacement for a DW; DV can only offload certain analytical workloads from the DW, because regression analysis, multi-dimensional data structures, and the analysis of large amounts of data mostly require a DW (Pullokkaran, L. J, 2013).

Data lakes are an emerging and powerful approach to the challenges of data integration as
enterprises increase their exposure to mobile and cloud-based applications and the sensor-driven
Internet of Things (IoT). Data lakes are repositories for large quantities and varieties of data,
both structured and unstructured. But data lakes are more suitable for the less-structured data that
companies need to process. However, difficulties associated with the data lakes integration
challenges include, but are not limited to 1) developing advanced metadata management over
raw data extracted from heterogeneous data sources; 2) dealing with the structural metadata from
the data sources and annotating data and metadata with semantic information to avoid
ambiguities. Without any metadata or metadata management, dumping all data into a data lake
would lead to a ‘data swamp’ and the data lake is hardly usable because the structure and
semantics of the data are not known (Stein, B., & Morrison, A, 2014).

3.9. Method Used to Create a Model

Imagine collecting years’ worth of data. What is valuable in this data? Often, to analyze the data
and create an intelligent and predictive model, machine learning is required. Machine learning
uses computational methods to “learn” information directly from data without relying on a
predetermined equation as a model. It turns out this ability to train models using the data itself
opens up many use cases for predictive modeling, such as predictive health for complex machinery and systems, physical and natural behaviors, energy load forecasting, and financial credit scoring. Machine learning is broadly divided into two types of methods, supervised and unsupervised learning, each of which contains several algorithms tailored for different problems. Supervised learning uses a training data set which maps input data to previously known response values. Unsupervised learning draws inferences from data sets with input data that does not map to a known output response. Incorporating models into products or services is typically done in
conjunction with enterprise application developers and system architects, but this can create a
challenge. Developing models in traditional programming languages is difficult for engineers
and scientists, while recoding models can be time-consuming and error-prone, especially if the
models require periodic updates. To alleviate this issue, enterprise application developers should
look for data analysis and modeling tools that are familiar to their engineers and scientists, while also providing production-ready tooling such as application servers and code generation for deploying models into their applications, products, and services (Belay, 2019).

The classification of patients' data has been intensively researched for years, but some limitations have stood out, such as the small-sample dilemma, the "black box" problem, and a lack of prediction strength (Caragea, D., 2004). In this work, Logistic Regression has been used to build the prediction models for a binary outcome. The underlying probability of labels and the contribution of predictor variables can be provided explicitly in Logistic Regression models, which is helpful in discovering the frequent diseases that interact and cause the occurrence of disease.

3.9.1. Incorporating Big Data for Real-World Solution

There are several platforms available to IT organizations for storing and processing big data that
fall into two categories: 1) batch processing of large, historical sets of data, and 2) real-time or
near real-time processing of data that is continuously collected from devices.

Batch applications, such as Spark or MapReduce, are commonly used to analyze and process
historical data that has been collected over long periods or across many different devices or
systems. These applications are typically used to look for trends in data and develop predictive
models.

Streaming applications that process in real- or near-real time, such as Kafka, may be coupled
with a predictive model to add more intelligence and adaptive capabilities to a product or service
such as predictive maintenance, optimizing equipment fleets, and monitoring manufacturing
lines.

3.9.2. How to Predict With Regression Models

Regression is a supervised learning problem where, given input examples, the model learns a
mapping to suitable output quantities, such as “0.1” and “0.2”, etc.

The functions demonstrated for making regression predictions apply to all of the regression models available in scikit-learn. We can predict quantities with the finalized regression model by calling the predict() function on it. As with classification, the predict() function takes a list or array of one or more data instances.
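
A minimal sketch of this call pattern on a finalized scikit-learn regression model follows; the training pairs and new instances are placeholders.

# Minimal sketch: predicting quantities with a finalized regression model (placeholder data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([0.1, 0.2, 0.3, 0.4])

final_model = LinearRegression().fit(X, y)   # the "finalized" model
Xnew = [[5], [6]]                            # a list of one or more new data instances
print(final_model.predict(Xnew))             # predicted quantities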

One of the tools used to categorize the top 100 or top 10 diseases is the Hive data warehouse software. This software enables reading, writing, and managing the top 100 or top 10 diseases in distributed storage. Using the Hive query language (HiveQL), which is very similar to SQL, queries are converted into a series of jobs that execute on a Hadoop cluster through MapReduce or Apache Spark.

Users can also run batch processing workloads with Hive while also analyzing the same data for
interactive SQL or machine-learning workloads using tools like Apache Impala or Apache Spark
—all within a single platform.

As part of CDH, Hive also benefits from:

 Unified resource management provided by YARN

 Simplified deployment and administration provided by Cloudera Manager

 Shared security and governance to meet compliance requirements provided by Apache Sentry
and Cloudera Navigator

Because Hive is a petabyte-scale data warehouse system built on the Hadoop platform, it is a
good choice for environments experiencing phenomenal growth in data volume. The underlying
MapReduce interface with HDFS is hard to program directly, but Hive provides an SQL
interface, making it possible to use existing programming skills to perform data preparation.

Hive on MapReduce or Spark is best-suited for batch data preparation or ETL. In this case:

 You must run scheduled batch jobs with very large ETL sorts with joins to prepare data for
Hadoop. Most data served to BI users in Impala is prepared by ETL developers using Hive.
 You run data transfer or conversion jobs that take many hours. With Hive, if a problem occurs
partway through such a job, it recovers and continues.
 You receive or provide data in diverse formats, where the Hive SerDes and variety of UDFs
make it convenient to ingest and convert the data. Typically, the final stage of the ETL
process with Hive might be to a high-performance, widely supported format such as Parquet.

CHAPTER FOUR

FINDINGS AND DISCUSSION

4.1. Data Preprocessing and preparation

Random noise detection is one of the main tasks in signal and data processing. The terms random and noise have wide-ranging physical, epistemological, philosophical, and mathematical meanings. Consequently, random noise can be defined, described, and analyzed in many ways.
In the colloquial meaning, the random noise can be treated as an unpredictable and undesirable
signal generated by source or factors out of our control. In practice, the random noise might be
interpreted from two fundamental perspectives: the randomness as some mathematical
characteristic, and the noise as an uncontrolled physical factor. A frequent problem concerning
noises is the fact that the mathematical definition and interpretation can be far from physical
reality, especially when “physical” means economic, social, or medical phenomena. In general, it
can be interpreted as a kind of filtered data regression.

4.1.1. Data source

The dataset was obtained from the Addis Ababa Health Bureau.

4.1.2. Data Cleaning

As described in the paragraphs above, data cleaning plays a significant role in minimizing interruptions and redundancies in the data integration step. Therefore, to overcome the above-mentioned problems in the subsequent data analysis stages, data cleaning was performed using the method shown in Figure 4.

Figure 4. Data cleaning using Python.
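
Since the cleaning code appears only as a screenshot in Figure 4, a hypothetical pandas sketch of the kind of steps it describes (dropping empty key fields, removing duplicates, fixing stray formatting) is given below; the file path and column names are assumptions, not the actual thesis code.

# Hypothetical sketch of the cleaning steps illustrated in Figure 4 (assumed file and columns).
import pandas as pd

data = pd.read_csv("health_bureau_records.csv")             # placeholder path

data = data.dropna(subset=["Diseases", "Age", "Gender"])    # drop rows with empty key fields
data = data.drop_duplicates()                               # remove redundant records
data["Diseases"] = data["Diseases"].str.strip()             # fix stray whitespace
data.to_csv("cleaned_records.csv", index=False)             # save the cleaned data set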

For better data analysis, a cleaned data set was obtained after the data were subjected to the process shown in Figure 4. As shown in Figure 5, the data were cleaned and redundant records were removed.

Figure 5. Cleaned data set.

4.1.3. Data Normalization

To achieve redundancy-free and logically grouped data, the cleaned data set was processed using the method shown in Figure 6.

Figure 6. Data normalization using Python.

To obtain data free of duplicates, normalization was performed as shown in Figure 6, and a duplicate-free data set was obtained as shown in Figure 7.

Figure 7. Normalized data set.

4.1.4. Dimension Reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. High-dimensionality data reduction is extremely important in many real-world applications as part of the data pre-processing step.

4.1.5. Data Transformation and Integration

Since there was a need to remove nulls and duplicate data and to increase the efficiency of the analysis, data transformation was performed, and duplicate-free and null-free data were achieved by using the method shown in Figure 8.

Data integration is the process of combining data from different sources into a single, unified
view. Integration begins with the ingestion process and includes steps such as cleansing, ETL
mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, actionable business intelligence (Pearlman, 2020). Therefore, the method shown in Figure 8 was applied using Python.

Figure 8. Data integration and transformation using Python.


The transformed, cleaned, and normalized data set in this work underwent data integration to achieve a single combined data set and meet the objectives of the research; the method shown in Figure 8 was used for this purpose. An integrated data set, unified and structured, was thus obtained for further data modeling. As shown in Figure 9, the data were converted from an unstructured to a structured data set.

Figure 9. Transformed and integrated data set.

4.2. Appropriate algorithm for identifying frequently occurring diseases

Logistic Regression was used to build the prediction models for a binary outcome. The underlying probability of labels and the contribution of predictor variables can be explicitly provided in Logistic Regression models, which is helpful for discovering the frequent diseases that interact and cause the occurrence of disease.

4.3. Creating a Model

For creating a model, a tool called Hadoop was used. Besides the tool, the method shown in Figure 10 below was used to generate a meaningful view based on the data sets collected in 2010, 2011, and 2012 respectively. Here the model is developed based on the top 100 diseases which affected the community from 2010 to 2012 E.C.

4.3.1. Machine learning and regression model using Python

Here a stepwise process was used.

4.3.1.1. Step 1: set up the environment

The first step undertaken here was installing Python 3, NumPy, Pandas, and Scikit-Learn (a.k.a. sklearn). Since installing Python through the Anaconda distribution is strongly recommended, the installation was performed based on those guidelines.

Then the installation of Scikit-Learn was confirmed using the code shown in Figure 10.

Figure 10. Confirmation of the proper installation of Scikit-Learn.
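
A simple check of the kind Figure 10 refers to can be written as follows (the exact code in the figure may differ).

# Confirm that Scikit-Learn is installed and print its version.
import sklearn
print(sklearn.__version__)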


4.3.1.2. Step 2: Importing libraries and modules

To begin the process, NumPy needs to be imported, since it provides support for more efficient numerical computation. Importing NumPy was carried out as shown in Figure 11.

Figure 11. Importing NumPy.


The next step was importing Pandas. Pandas is a convenient library that supports dataframes. Pandas is technically optional because Scikit-Learn can handle numerical matrices directly, but it makes working with the data easier. Importing Pandas was carried out as shown in Figure 12.

Figure 12. Importing Pandas.

Then we started importing functions for machine learning. The first one was the train_test_split() function from the model_selection module, as shown in Figure 13. As its name implies, this module contains many utilities that help us choose between models.

Figure 13. Importing sampling helper.

Next, the entire preprocessing module was imported (Figure 14). This contains utilities for scaling, transforming, and wrangling data.

Figure 14. Importing the entire processing module.

The next step was importing the families of models needed. A "family" of models is a broad type of model, such as random forests, SVMs, or linear regression models. Within each family, an actual model is obtained after fitting and tuning its parameters to the data. Here the random forest family was imported as shown in Figure 15.

Figure 15. Importing the random forest family.

Then we imported the tools that help us perform cross-validation, as shown in Figure 16.

Figure 16. Importing cross-validation pipeline.


In addition, some metrics that can be used to evaluate model performance later were imported, as shown in Figure 17.

Figure 17. Importing evaluation metrics.

Finally, we imported a way to persist our model for later use, as shown in Figure 18. Joblib is an alternative to Python's pickle package, and we used it because it is more efficient for storing large NumPy arrays.

Figure 18. Importing Joblib.
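
Figures 11 through 18 are screenshots; in a typical setup, the imports they describe correspond to something like the following sketch (consistent with the text above, but not copied from the figures; newer Scikit-Learn versions ship joblib as a separate package, as assumed here).

    import numpy as np                                         # numerical computation (Figure 11)
    import pandas as pd                                        # dataframes (Figure 12)

    from sklearn.model_selection import train_test_split      # sampling helper (Figure 13)
    from sklearn import preprocessing                          # scaling and wrangling utilities (Figure 14)
    from sklearn.ensemble import RandomForestRegressor        # random forest model family (Figure 15)
    from sklearn.pipeline import make_pipeline                 # cross-validation pipeline (Figure 16)
    from sklearn.model_selection import GridSearchCV          # cross-validation pipeline (Figure 16)
    from sklearn.metrics import mean_squared_error, r2_score  # evaluation metrics (Figure 17)
    import joblib                                              # model persistence (Figure 18)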


4.3.1.3. Step 3: Load Health data (our dataset)

Now we were ready to load our data set. The Pandas library imported in Step 2 comes with a whole suite of helpful input/output tools; using it, we can read data from CSV, Excel, SQL, SAS, and many other data formats. The convenient tool used here is the read_csv() function, which can load any CSV file, even from a remote URL. This loading step was already performed in the methodology above.
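
As a sketch of this loading step (the file path is an assumption), the prepared data set can be read as follows:

    import pandas as pd

    # Assumed path to the cleaned and integrated data set prepared earlier.
    data = pd.read_csv("health_records_integrated.csv")
    print(data.head())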

Table 2. Summary statistics.


print(data.describe())

Diseases                                          Age    Gender   2010    2011    2012
Failed or difficult intubation during pregnancy   15-29  Female   47325   67617   45506
Urinary tract infection                           15-29  Female   39068   71695   45375
Urinary tract infection                           30-64  Female   31475   63626   40409
Dyspepsia                                         15-29  Female   33280   60197   36644
Tonsilitis                                        1-4    Male     22701   57322   37959
Failed or difficult intubation during pregnancy   30-64  Female   34957   48319   32156
Tonsilitis                                        1-4    Female   22701   57322   37959
Dyspepsia                                         30-64  Female   24263   45207   27223

mean    8.3196   0.5278   0.2709
std     1.7411   0.1791   0.1948
min     4.6000   0.1200   0.0000
25%     7.1000   0.3900   0.0900
50%     7.9000   0.5200   0.2600
75%     9.2000   0.6400   0.4200
max    15.9000   1.5800   1.0000

The features are Age, Gender, Years, and Diseases. Most of the features are numeric, which is convenient. However, because they have very different scales, the data need to be standardized during preprocessing.

4.3.1.4. Step 4: Split data into training and test sets

Splitting the data into training and test sets at the beginning of the modeling workflow is crucial for getting a realistic estimate of the model's performance. Therefore, separating our target (Y) feature from our input (X) features was done as shown in Figure 19.

Figure 19. Separate target from training features.

This allows us to take advantage of Scikit-Learn's useful train_test_split function, as shown in Figure 20.

Figure 20. Split data into train and test sets.


As can be seen, we set aside 20% of the data as a test set for evaluating our model. We also set
an arbitrary “random state” so that we can reproduce our results.

Finally, it is good practice to stratify the sample by the target variable. This ensures that the training set looks similar to the test set, making the evaluation metrics more reliable.
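
Figures 19 and 20 are screenshots; a sketch of the corresponding split, with 20% held out, a fixed random state, and stratification by the target, is shown below (the file path and target column name are assumptions).

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("health_records_integrated.csv")  # assumed path

    # Separate the target (Y) from the input (X) features; the column name is assumed.
    y = data["Diseases"]
    X = data.drop("Diseases", axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,      # hold out 20% of the data for evaluation
        random_state=123,   # arbitrary seed so the split is reproducible
        stratify=y,         # keep the target distribution similar in both sets
    )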

4.3.1.5. Step 5: Declare data preprocessing steps

Remember, in Step 3, we made the mental note to standardize our features because they were on
different scales.

Standardization is the process of subtracting the means from each feature and then dividing by
the feature standard deviations.Standardization is a common requirement for machine learning
tasks. Many algorithms assume that all features are centered around zero and have approximately
the same variance.

So instead of directly invoking the scale function, we used a feature in Scikit-Learn called the Transformer API. The Transformer API allows us to "fit" a preprocessing step using the training data in the same way we would fit a model, and then reuse the same transformation on future data sets.

The process was:

I. The transformer was fit on the training set (the means and standard deviations were saved).
II. The transformer was applied to the training set (the training data were scaled).
III. The transformer was applied to the test set (using the same means and standard deviations).

This makes the final estimate of model performance more realistic, and it allowed us to insert the preprocessing steps into a cross-validation pipeline.

Here, fitting the Transformer API was done as shown in Figure 21 below.

Figure 21. Fitting the Transformer API.

The scaler object has now saved the means and standard deviations for each feature in the training set. It was then confirmed that the scaled data set is indeed centered at zero with unit variance, as shown in Figure 22.

Figure 22. Confirming that the scaled data set is centered at zero with unit variance.


The scaler object was then used to transform the training set. Later, we transformed the test set using the same means and standard deviations that were used to transform the training set, as shown in Figure 23.

Figure 23. Applying the transformer to the data set.


The scaled features in the test set were not perfectly centered at zero with unit variance. This was
exactly what we were expecting, as we are transforming the test set using the means from the
training set, not from the test set itself.
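
Figures 21 to 23 are screenshots; the fit-then-apply pattern of the Transformer API that they illustrate can be sketched as follows (X_train and X_test below are random placeholders for the numeric feature matrices).

    import numpy as np
    from sklearn import preprocessing

    X_train = np.random.rand(80, 3)  # placeholder numeric training features
    X_test = np.random.rand(20, 3)   # placeholder numeric test features

    # Fit the transformer on the training set (means and standard deviations are saved).
    scaler = preprocessing.StandardScaler().fit(X_train)

    # Apply it to the training set: roughly zero mean and unit variance per feature.
    X_train_scaled = scaler.transform(X_train)
    print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))

    # Apply the same means and standard deviations to the test set; its scaled
    # features will not be perfectly centered, which is the expected behavior.
    X_test_scaled = scaler.transform(X_test)
    print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))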

In practice, when we set up the cross-validation pipeline, we do not even need to fit the Transformer API manually. Instead, we simply declared the class object, as shown in Figure 24.

Figure 24. Pipeline with preprocessing and model.
This is exactly what it looks like: a modeling pipeline that first transforms the data using StandardScaler() and then fits a model using a random forest regressor.
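
A sketch of the pipeline declared in Figure 24 (the exact random forest settings are assumptions):

    from sklearn import preprocessing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import make_pipeline

    # Modeling pipeline: standardize first, then fit a random forest regressor.
    pipeline = make_pipeline(
        preprocessing.StandardScaler(),
        RandomForestRegressor(n_estimators=100, random_state=123),
    )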

4.3.1.6. Step 6: Declare hyperparameters to tune

Now it is time to consider the hyperparameters we wanted to tune for our model. There are two types of parameters to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot.

Hyperparameters express "higher-level" structural information about the model, and they are
typically set before training the model.

Within each decision tree, the computer can empirically decide where to create branches based
on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch
locations are model parameters. However, the algorithm does not know which of the two criteria,
MSE or MAE, that it should use. The algorithm also cannot decide how many trees to include in
the forest. These are examples of hyperparameters that the user must set.

We listed the tunable hyperparameters as shown in Figure 25.

Figure 25. Tunable hyperparameters.
The hyperparameters to tune through cross-validation were declared as shown in Figure 26.

Figure 26. Declaration of hyperparameters to tune.


As you can see, the format should be a Python dictionary (data structure for key-value pairs)
where keys are the hyperparameter names and values are lists of settings to try.
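
A sketch of such a dictionary for the random forest pipeline is shown below; the specific values are plausible examples, not necessarily those used in Figure 26. Keys follow Scikit-Learn's <stepname>__<parameter> convention for pipelines.

    # Hyperparameters to try during cross-validation.
    hyperparameters = {
        "randomforestregressor__max_features": ["sqrt", "log2", None],
        "randomforestregressor__max_depth": [None, 5, 3, 1],
    }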

4.3.1.7. Step 7: Tune model using a cross-validation pipeline

Cross-validation is one of the most important skills in all of machine learning because it helps maximize model performance while reducing the chance of overfitting. It is a process for reliably estimating the performance of a method for building a model by training and evaluating the model multiple times using the same method. Practically, that "method" is simply a set of hyperparameters in this context.

The cross-validation steps were:

I. Splitting the data into k equal parts, or "folds" (typically k=10).


II. Training the model on k-1 folds (e.g. the first 9 folds).

III. Evaluating the model on the remaining "hold-out" fold (e.g. the 10th fold).
IV. Performing steps (2) and (3) k times, each time holding out a different fold.
V. Aggregating the performance across all k folds.

Cross-validation is important in machine learning when we want to train a random forest regressor. One of the hyperparameters we must tune is the maximum depth allowed for each decision tree in our forest. That is where cross-validation comes in. Using only the training set, we can use CV to evaluate different hyperparameters and estimate their effectiveness. This allows us to keep the test set "untainted" and save it for a true hold-out evaluation when we are finally ready to select a model. We can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbor model using only the training set, and then still have the untainted test set to make the final selection between the model families.

The best practice when performing CV was to include the data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting the training folds with influential data from the test fold.

Here's how the CV pipeline looks after including preprocessing steps:

I. Splitting our data into k equal parts, or "folds" (typically k=10).


II. Preprocessing k-1 training folds.
III. Training our model on the same k-1 folds.
IV. Preprocessing the hold-out fold using the same transformations from step (2).
V. Evaluating our model on the same hold-out fold.
VI. Performing steps (2) - (5) k times, each time holding out a different fold.
VII. Aggregating the performance across all k folds. This is our performance metric.

Fortunately, Scikit-Learn makes it simple to set this up, as shown in Figure 27.

Figure 27. Sklearn cross-validation with pipeline.
GridSearchCV essentially performs cross-validation across the entire "grid" (all possible permutations) of hyperparameters. It takes in our model (in this case, a model pipeline), the hyperparameters we want to tune, and the number of folds to create.

Obviously, there's a lot going on under the hood. We've included the pseudo-code above.

Now, we can see the best set of parameters found using CV in Figure 28.

Figure 28. The best set of parameters found using cross-validation.


It turns out that the default parameters win out for this data set.
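
Putting the pipeline and the hyperparameter grid together, the grid search of Figures 27 and 28 can be sketched as follows; the training data here are random placeholders so the snippet is self-contained, and the settings are assumptions rather than the thesis's exact values.

    import numpy as np
    from sklearn import preprocessing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline

    X_train = np.random.rand(100, 3)  # placeholder training features
    y_train = np.random.rand(100)     # placeholder training target

    pipeline = make_pipeline(
        preprocessing.StandardScaler(),
        RandomForestRegressor(n_estimators=50, random_state=123),
    )
    hyperparameters = {
        "randomforestregressor__max_features": ["sqrt", "log2", None],
        "randomforestregressor__max_depth": [None, 5, 3, 1],
    }

    # Grid search over all hyperparameter combinations with 10-fold cross-validation.
    # GridSearchCV refits the best model on the whole training set (refit=True by default).
    clf = GridSearchCV(pipeline, hyperparameters, cv=10)
    clf.fit(X_train, y_train)

    # Best set of parameters found by cross-validation (Figure 28).
    print(clf.best_params_)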

4.3.1.8. Step 8: Refit on the entire training set

After tuning the hyperparameters appropriately using cross-validation, we generally get a small performance improvement by refitting the model on the entire training set. Conveniently, GridSearchCV from sklearn automatically refits the model with the best set of hyperparameters using the entire training set.

This functionality was confirmed as shown in Figure 29.

Figure 29. Confirming the refit functionality.
As shown in Figure 29, we can simply use the clf object as our model when applying it to other sets of data.

4.3.1.9. Step 9: Evaluate model pipeline on test data

Figure 30 shows how to predict on a new set of data.

Figure 30. Predict a new set of data.


Then we used the metrics imported earlier to evaluate the model performance, as can be seen in Figure 31.

Figure 31. Evaluating model performance.
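
Figures 30 and 31 are screenshots; prediction on held-out data and evaluation with the imported metrics can be sketched as below. Random placeholder data are used so the snippet runs on its own; in the study, the fitted clf object and the real test set would be used instead.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    X_train, y_train = np.random.rand(100, 3), np.random.rand(100)  # placeholders
    X_test, y_test = np.random.rand(25, 3), np.random.rand(25)

    clf = RandomForestRegressor(n_estimators=50, random_state=123).fit(X_train, y_train)

    # Predict a new set of data (here, the held-out test set).
    y_pred = clf.predict(X_test)

    # Evaluate model performance with the metrics imported earlier.
    print("R^2:", r2_score(y_test, y_pred))
    print("MSE:", mean_squared_error(y_test, y_pred))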


Now we need to check whether the performance is good enough. The rule of thumb is that the very first model will probably not be the best possible model. However, we recommend a combination of three strategies to decide whether we are satisfied with the model performance.

I. Starting with the goal of the model: if the model was tied to a specific problem, check whether that problem has been successfully solved.
II. Looking at the academic literature to get a sense of the current performance benchmarks for this type of data.
III. Trying to find low-hanging fruit in terms of ways to improve the model.

There are various ways to improve the model; a detailed treatment is beyond the scope of this work, but here are a few quick things to try:

 Try other regression model families (e.g. regularized regression, boosted trees,
etc.).
 Collect more data if it's cheap to do so.
 Engineer smarter features after spending more time on exploratory analysis.
 Speak to a domain expert to get more context.

As a final note, when you try other families of models, we recommend using the same training
and test set as you used to fit the random forest model. That's the best way to get a true apples-to-
apples comparison between your models.

4.3.1.10. Step 10: Save model for future use

So far, we have done the hard part; the final step was to save our work so that the model can be used in the future. This is easy to do, as shown in Figure 32.

Figure 32. Saving the model for further use.


When we want to load the model again, we simply use the function shown in Figure 33.

Figure 33. Loading the model for further use.
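
Figures 32 and 33 show saving and reloading the model with Joblib; a sketch of those two calls is given below (the file name and the small placeholder model are assumptions).

    import joblib
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Small placeholder model standing in for the tuned pipeline.
    clf = RandomForestRegressor(n_estimators=10).fit(np.random.rand(50, 3), np.random.rand(50))

    # Save the trained model to disk for future use (Figure 32).
    joblib.dump(clf, "rf_disease_model.pkl")

    # Later, load the model again and use it to predict (Figure 33).
    clf2 = joblib.load("rf_disease_model.pkl")
    print(clf2.predict(np.random.rand(2, 3)))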
The complete code, from start to finish, is shown in Figure 34.

Figure 34. All the code used to create the model.
Hive was one of the tools used to categorize the diseases; therefore, its installation and the steps to process the data with it are shown below.

4.3.2. Installation

On a cluster managed by Cloudera Manager, Hive comes along with the base CDH installation
and does not need to be installed separately. With Cloudera Manager, we can enable or disable
the Hive service, but the Hive component always remains present on the cluster.

Figure 35. Cloudera Manager login as admin.

Figure 36. Hive display on the Cloudera Manager.

On an unmanaged cluster, we can install Hive manually, using packages or tarballs with the
appropriate command based on our operating system.

Here the appropriate Hive packages were installed using the appropriate command for our distribution. The packages are:

 hive  – base package that provides the complete language and runtime
 hive-metastore  – provides scripts for running the metastore as a standalone service
(optional)
 hive-server2  – provides scripts for running HiveServer2
 hive-hbase – optional; install this package to use Hive together with HBase.

4.3.3. Configuration

Hive offers several configuration settings related to performance, file layout and handling, and options to control SQL semantics. Depending on the cluster size and workloads, HiveServer2 memory, table locking behavior, and authentication for connections need to be configured. The Hive metastore service, which stores the metadata for Hive tables and partitions, must also be configured.

4.3.3.1. Managing Hive

Cloudera recommends using Cloudera Manager to manage Hive services; such setups are called managed deployments. If the deployment is not managed, the HiveServer2 Web UI can be used to manage the Hive services.

Using Hive, we created a database and a table and imported data from HDFS.

Figure 37. Create a table using Hive over health_db.
Using Hive, we obtained the top 10 diseases by age, gender, or disease type, as illustrated in Figures 38, 39, and 40.

Figure 38. Number of patients categorized by age.

Figure 39. Gender with Age.

Figure 40. Top 15 diseases by Hive SQL.

Figure 41. The code used to generate the top diseases affecting the community based on age class.
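
The Hive queries of Figures 37 to 41 appear only as screenshots. As a hedged sketch, an equivalent top-10 query could also be issued from Python through the PyHive library; PyHive, the connection parameters, and the table and column names below are assumptions for illustration, not the exact setup used in this work.

    from pyhive import hive  # one way to query Hive from Python; assumed, not necessarily the tool used here

    # Connection parameters and table/column names are assumptions.
    conn = hive.Connection(host="localhost", port=10000, username="cloudera", database="health_db")
    cur = conn.cursor()

    # Top 10 diseases by total patient count over 2010-2012.
    cur.execute(
        "SELECT diseases, SUM(y2010 + y2011 + y2012) AS total "
        "FROM health_records "
        "GROUP BY diseases "
        "ORDER BY total DESC "
        "LIMIT 10"
    )
    for disease, total in cur.fetchall():
        print(disease, total)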

The results shown above concern the preprocessing of the data; here we present the results obtained from the data set. The first step was to obtain the top hundred diseases that affected the community in 2010, 2011, and 2012. As shown in Figure 42, the data extracted from the dataset show that for the age groups 0-4 and 5-14, Typhus fever is dominant next to Urinary Tract Infection. For the age group 15-64, Dyspepsia (inability to swallow), Diarrhea, and Failed or difficult intubation during pregnancy are the top three diseases consecutively. On the other hand, the Common cold is a common disease in the age group sixty-five and above.

Figure 42. Top hundred diseases from 2010-2012 based on the dataset.

Of the diseases affecting the community, the top 30 are shown in Figure 43. As can be seen from the figure, Tonsillitis (acute pharyngitis, unspecified), Open wound of shoulder and upper arm, and Typhoid Fever are the top three diseases that affected the community regardless of age class.

Figure 43. Top 30 diseases affecting the community from 2010-2012.

Figure 44 shows the top ten diseases affecting the community in 2010, 2011, and 2012. These top ten diseases are Urinary Tract Infection, Failed or difficult intubation during pregnancy, Dyspepsia, Tonsillitis, Hypertension, Typhus and Typhoid fever, Diarrhea, and Common cold.

Figure 44. Ten most abundant diseases from 2010-2012 based on the dataset.

Age, environmental conditions, way of life, stress, etc. are some of the factors cited by health care workers as causes of disease in humans. It is also well known that some diseases mainly affect certain age groups in the community. Figure 45 therefore shows the diseases affecting the community by age class from 2010-2012. In this study, five age classes were used: <1 year, 1-4 years, 5-14 years, 30-64 years, and >=65 years. Children between 1 and 4 years are more highly affected by disease than the other age groups. Next to children between 1 and 4 years, elders aged 65 years and above are the second most affected group. Children from 5 to 14 years of age are the third most affected group, followed by the <1 year and 30-64 age groups consecutively.

Figure 45. Diseases affecting the community based on age class from 2010-2012.

CHAPTER FIVE

CONCLUSION AND RECOMMENDATION

5.1. Summary

The top ten diseases that affected the community from 2010-2012 are Urinary Tract Infection, Failed or difficult intubation during pregnancy, Dyspepsia, Tonsillitis, Hypertension, Typhus and Typhoid fever, Diarrhea, and Common cold. On the other hand, the top three diseases that affect the community regardless of age or gender are Tonsillitis (acute pharyngitis, unspecified), Open wound of shoulder and upper arm, and Typhoid Fever. However, the diseases affecting each age group differ.

5.2. Conclusion

Machine learning on administrative data provides a powerful new tool for population health and
clinical hypothesis generation for risk factor discovery, enabling population-level risk
assessment that may help guide interventions to the most at-risk population. Using the approach
described herein, it is possible to identify diseases that are affecting the community in terms of
age and gender.

Finally, the approach used in this study is applicable for identifying the diseases that most frequently affect the community and the high-risk groups in terms of age and gender. Therefore, to address the problem, responsible bodies such as the Ministry of Health can use this kind of tool to prepare the necessary facilities and raise community awareness so that people can protect themselves.

5.3. Recommendation

Based on the work that has been done in this study, it is strongly recommended that further research be carried out to improve the health sector. Furthermore, research should focus on creating a model to facilitate and improve the sector's environment. Different big data analysis tools that are applicable and user-friendly should be tested to assess their applicability in our country's context. Country policy developers should give high priority to big data analysis, especially in the health sector.

References
Belay, A. (2019). Big data analytics to predict cancer based on diagnosed clinical data.
Unpublished Ph.D. thesis.

Boinepelli, H. (2015). Applications of Big Data. In: Mohanty H., Bhuyan P., Chenthati D. (eds) Big Data. Studies in Big Data, vol 11. Springer.

Caragea, D. (2004). Learning classifiers from distributed, semantically heterogeneous, autonomous data sources.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile networks and applications.

Chiang M-C, Tsai C-W, Yang C-S. (2011). A time-efficient pattern reduction algorithm for k-
means clustering.

Cui X, Charles JS, Potok T. (2013). GPU enhanced parallel computing for large scale data
clustering. Future Gener Comp Syst.

Dash, S., Shakyawar, S., Sharma, M., and Kaushik, S. (2019 ). Big data in healthcare:
management, analysis, and future prospects. Journal of Big Data.

Dindokar, Shantanu. (2018, May 24). Preprocessing of Big Data. Synerzip. Retrieved from https://www.synerzip.com/expertise/

Fang, R., Pouyanfar, S., Yang, Y., Chen, S. C., & Iyengar, S. S. (2016). Computational health informatics in the big data age: A survey. ACM Computing Surveys.

Feldman D, Schmidt M, Sohler C. (2013). Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. ACM-SIAM Symposium on Discrete Algorithms.

Fujitsu. (2011). Big Data- The definitive guide to the revolution in business analytics.

Harvey, C. (2017). big-data-challenges. Retrieved from www.datamation.com:
https://www.datamation.com/big-data/big-data-challenges.html.

Hemlata and Preeti. (2016). Big Data Analytics. Research Journal of Computer and Information
Technology Sciences.

Hong, L., Luo, M., Wang, R., Lu, P., Lu, W., & Lu, L. . (2018). Big Data in Health Care:
Applications and Challenges. Data and Information Management.

IBM. (2012).
http://public.dhe.ibm.com/common/ssi/ecm/en/ims14398usen/IMS14398USEN.PDF.
Retrieved from IBM: http://public.dhe.ibm.com

IBM. (2012). Large Gene interaction Analytics at University at Buffalo. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/en/imc14675usen/

import.io. (2020, May 29). What is data normalization and why is it important? Retrieved from https://www.import.io/post/what-is-data-normalization-and-why-is-it-important/

J. Jeba Praba and V. Srividhya. (2016). BIG DATA ANALYTICS IN HEALTHCARE. 9th
National Level Science Symposium.

Kaur, P., Sharma, M., & Mittal, M. . (2018). Big Data and Machine Learning-Based Secure
Healthcare Framework. Procedia Computer Science.

Kulkarni, A., Siarry, P., Singh, P., Abraham, A., Zhang, M., Zomaya, A., and Baki, F. (2020).
Big Data Analytics In Healthcare. Cham: Springer.

Ma C, Zhang HH, Wang X. (2014). Machine learning for big data analytics in plants. Trends
Plant Sci.

Manyika J, Chui M, Brown B, Buhin J, Dobbs R, Roxburgh C, Byers AH. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

Najafabadi et al. (2015). Deep learning applications and challenges in big data analytics. Journal
of Big Data.

Nlm.nih. (2020, March 12). Retrieved from https://www.nlm.nih.gov/nichsr/stats_tutorial/section3/index.html

Pearlman, S. (2020, May 29). What is Data Integration? Retrieved from https://www.talend.com/resources/what-is-data-integration/

Pullokkaran, L. J. (2013). Analysis of data virtualization & enterprise data standardization in business intelligence (Doctoral dissertation, Massachusetts Institute of Technology).

PYLE, P., SYDEMAN, W., & HESTER, M. ( 2001 ). Effects of age, breeding experience, mate
fidelity, and site fidelity on breeding performance in a declining population of Cassin's
auklets.

Raghupathi, Wullianallur & Raghupathi, Viju. (2014). Big data analytics in healthcare: Promise
and potential. Health Information Science and Systems.

Riahi, Y. and Riahi, S. (2018 ). Big Data and Big Data Analytics: concepts, types, and
technologies. International Journal of Research and Engineering.

Rudin, C., Dunson, D., Irizarry, R., Ji, H., Laber, E., Leek, J., & Wasserman, L. (2014).
Discovery with data: Leveraging statistics with computer science to transform science
and society. In American Statistical Association (Vol. 1).

Sabharwal, S., Gupta, S., & Thirunavukkarasu. (2016). Insight of big data analytics in the
healthcare industry. In Computing, Communication, and Automation. International
Conference .

Shirkhorshidi AS, Aghabozorgi SR, Teh YW, Herawan T. (2014). Big data clustering: a review.
International Conference on Computational Science and Its Applications.

sisense. (2020, May 29). What is Data Cleaning? Retrieved from https://www.sisense.com/glossary/data-cleaning/

Stein, B., & Morrison, A. (2014). The enterprise data lake: Better integration and deeper
analytics. . PwC Technology Forecast: Rethinking integration.

Ta, V. D., Liu, C. M., & Nkabinde, G. W. (2016). Big data stream computing in healthcare real-
time analytics. In Cloud Computing and Big Data Analysis.

Tekin C, van der Schaar M. (2013). Distributed online big data classification using context
information . Allerton Conference on Communication, Control, and Computing.

Wang, Lidong & Alexander, Cheryl. (2019). Big Data Analytics in Healthcare Systems. International Journal of Mathematical, Engineering, and Management Sciences.

WHO. (2020, Feb). Retrieved from https://www.who.int/hac/donorinfo/callsformobilisation/eth/en/

Xu R, Wunsch D. (2009). Clustering. Hoboken: Wiley-IEEE Pres.

Zhang T, Ramakrishnan R, Livny M. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD International Conference on Management of Data.

Appendices
Appendix Ⅰ: Hardware used for installing and using the Cloudera tool

Hardware component | Requirement | Notes
CPU | 16+ CPU (vCPU) cores | Allocate at least 1 CPU core per session; 1 CPU core is often adequate for light workloads.
Memory | 32 GB RAM | As a general guideline, Cloudera recommends nodes with RAM between 60 GB and 256 GB; allocating less than 2 GB of RAM can lead to out-of-memory errors from many applications.
Disk | Root Volume: 100 GB; Application Block Device or Mount Point (Master Host-Only): 1 TB; Docker Image Block Device: 1 TB | SSDs are strongly recommended for the application data storage.

Appendix Ⅱ: Installing the Cloudera virtual machine in VirtualBox.

Appendix Ⅲ: After the installation, the Cloudera tool opens the home page

Launch Cloudera for the first time

Appendix Ⅳ: Launch Cloudera Manager for the first time

Cloudera manager server using terminal

Appendix Ⅴ: Start the Cloudera Manager using the terminal

Start Cloudera Manager server using terminal

Appendix Ⅵ: Run the Java package service to verify that all services are running

Running all Java package services

Appendix Ⅶ: Open Cloudera manager in the web

Open Cloudera Manager in the web

Appendix Ⅷ: Cloudera service options

Cloudera environment used for big data analysis

The Cloudera requirements are:
• HDFS
• HBase
• Hive
• Hue
• Impala
These tools are used to process the data for analysis.
Appendix Ⅸ: Top hundred diseases affecting the community

Top hundred diseases in 2010.

Top hundred diseases in 2011.

Top hundred diseases in 2012.

