Introduction
The term “Big Data” has created a lot of buzz over the past few years. Initially it was shaped by organizations that had to handle fast-growing data such as web data, data resulting from scientific or business simulations, or other data sources (sensors, clinical trials etc.). The data was no longer measured in mere terabytes; its rapid growth took us to sizes like exabytes or even more. Traditional databases were becoming incapable of handling and managing this data. The pressure to handle the growing volumes of data on the web, in databases etc. thus led companies like Google to develop a file system that could handle this unstoppable growth. This marked the development of the Google File System [45] and MapReduce [45]. Efforts were made to rebuild those technologies as open source software. This resulted in Apache Hadoop and the Hadoop File System [45] and laid the foundation for the technologies summarized today as ‘Big Data’.
With the foundation laid down by Google, other information companies stepped in and started to invest in extending their software portfolios and building new solutions aimed especially at Big Data analysis. Among those companies were IBM, Oracle, Microsoft and SAP. The effort made by software companies to become part of the Big Data story is not surprising considering the trends which analysts predict and the praise they heap on ‘Big Data’ and its impact on business and society as a whole. IDC predicts in its ‘The Digital Universe’ study that the digital data created and consumed per year will grow to 40 zettabytes by 2020, of which a third will promise value to organizations if processed using Big Data technologies [45]. IDC also states that in 2012 only 0.5% of potentially valuable data were analysed, calling this the ‘Big Data Gap’. The McKinsey Global Institute also predicts that the data generated globally is growing by around 40% per year, and furthermore describes Big Data trends in monetary terms. It projects the yearly value of Big Data analytics for the US health care sector to be around 300 billion US dollars. It also predicts a possible value of around 250 billion euros for the European public sector and a potential improvement of margins in the retail industry by 60%. [46]
Big Data differs from normal data by virtue of its three main characteristics: Volume, Velocity and Variety, usually referred to as the 3 V’s of Big Data. The Volume of Big Data ranges from hundreds of terabytes up to exabytes or more, demanding vast storage capacity. The Velocity of Big Data is the frequency at which the data is generated. The mountains of data present today have been generated in the last couple of years alone. With the increase in sensors, mobile phones, social media feeds, web log files, clinical data, forecasting data etc., hardly a second passes without gigabytes of data being generated. Data arriving at such a pace adds further to the volume of Big Data and becomes a problem in itself. The Variety refers to the broad range of data that is being generated. Data comes in three forms: structured, semi-structured and unstructured. Traditional database management software can deal only with structured data, i.e. data with a definite schema, which can be stored in proper relational tables. To some extent, semi-structured data is also manageable. But it is practically impossible for traditional databases to manage and handle Big Data in a way that allows it to be analysed. We have a wide variety of data available (text formats, audio files, video files etc.), data that cannot be stored in a row-and-column format. Analysing this type of data will bring better outputs for organisations and will improve their performance.
Big Data analytics is fruitful to society in many aspects. It has been proving its worth in various fields, among them political campaigns, healthcare, weather forecasting, the education sector and social media analysis. This thesis provides an outlook on the “usage of Big Data analytics in healthcare”. The healthcare sector is a major source of unstructured data. The IDC Health Insights Study states that worldwide health care data grew to 500 petabytes in 2012 and is projected to reach 25,000 petabytes by 2020. Data generated by EHRs (Electronic Health Records), clinical trials, health surveys, laboratory testing, genomics, and wearable sensors such as ECGs and accelerometers forms a large variety of data that can be analysed to predict future trends in health issues. X-ray images, CT scans, MRIs, ultrasound and similar tests produce a lot of unstructured data. This data, if well analysed, will certainly provide better insights and will improve healthcare departments. Employing Big Data and Big Data technology in healthcare will improve patient outcomes and be cost effective. Big Data has great potential to transform health and health care.
This Big Data needs to be handled and evaluated to make the most of it. Companies are upgrading their infrastructure and have started implementing “Big Data technologies” to extract predictions from these heaps of data. There are various Big Data technologies on the market: Hadoop, Spark, SAP HANA and High Performance Cluster Computing (HPCC), among which Hadoop is the most widely used. It is not wrong to say that the term “Hadoop” is used almost synonymously with Big Data. Hadoop is an Apache Software Foundation project originally developed by Doug Cutting. Hadoop uses the MapReduce model developed at Google and an improved file system called the Hadoop Distributed File System.
Our analysis was based solely on Apache Hadoop. The Hadoop Distributed File System is used to store and process large amounts of data in a distributed manner. To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. Environments such as Hive, Pig and R were used to analyse the data further.
There are varieties of data that can be put to use with the help of Big Data technologies like Hadoop. For instance, genomic data can be analysed to make predictions about genes and gene trends. Similarly, EHR data can be collected and analysed to keep a tight vigil on patient activities and flows, helping to prevent epidemic outbreaks such as H1N1 or hepatitis. According to the IDC Health Insights Study (Feb 14, 2014), ‘Status of Clinical Decision Support in China’, clinical efficiency in institutions that have adopted Big Data technology in their CDS systems has increased compared with those that do not use this technology at all. [43] The implementation of Big Data analytics will help in reducing costs and expenses, for example by picking up warning signs of serious illness at an early enough stage that treatment is far simpler (and less expensive) than if it had not been spotted until later. Non-communicable diseases like diabetes are proving more fatal and also come with many health hazards. This can be addressed by analysing the previous health records of various patients and informing patients with diabetes about the most probable complications. Even a highly experienced doctor can sometimes be wrong in a diagnosis, but with better access to the patient’s previous records and better analysis, he can provide better solutions to the problems faced by the patient. A Big Data analytics solution, if attached to a Clinical Decision Support system, will help the clinician by comparing years of historical detail with the current situation of the patient, thus providing even better case-specific advice.
Healthcare data can be collected from hi-tech hospitals with EHRs, from websites which provide open access to large datasets, and from laboratories which provide multivariate data for analysis; health survey data is beneficial for seeing the patterns of diseases over the past 10-30 years. However, it may not be straightforward to find the desired datasets because of security and privacy reasons. But if the datasets can be obtained, there is a vast amount of data stored over the years in hospitals and other health centres which can be put to good use to deal with health issues to a major extent.
This thesis focuses on solutions for health and healthcare using Big Data analytics. The results and conclusions are based on the analysis of survey data from the Global Burden of Disease study, covering 1990 – 2010, 21 regions and 291 diseases. The data was collected from the GHDx (Global Health Data Exchange). The dataset is semi-structured, with more than 8,000,000 rows and 31 columns. We manipulated the data by dividing these 291 diseases into 34 categories and assigning each category a ‘category code’, thus adding 2 more columns to the dataset. To make the dataset more unstructured, so as to justify the usage of Big Data technologies on unstructured data, we converted the datasets into tab-delimited .txt format. The dataset had many null and NaN (Not a Number) values, so it needed to be cleaned. This cleaning was done using MapReduce programming. After cleaning, the final, fully structured data was put into Hive data tables. Finally, the plotting and calculations were done using the R environment.
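The recategorisation step described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the code actually used in the thesis; the cause names, the three-column layout and the category mapping are hypothetical stand-ins for the real 31-column dataset and its 34 categories.

```python
import csv
import io

# Hypothetical mapping from a cause of death to a broader category and
# its 'category code' (the real study used 34 such categories).
CATEGORY_CODES = {
    "Lung cancer": ("Cancerous", 1),
    "Tuberculosis": ("Contagious", 2),
    "Diabetes mellitus": ("Diabetes Mellitus", 3),
}

def add_category_columns(csv_text):
    """Append two columns (category name, category code) to every CSV row
    and emit the result as tab-delimited text."""
    out_rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        category, code = CATEGORY_CODES.get(row[0], ("Other", 0))
        out_rows.append("\t".join(row + [category, str(code)]))
    return "\n".join(out_rows)

sample = "Lung cancer,East Asia,1990\nTuberculosis,South Asia,2010"
print(add_category_columns(sample))
```

The same idea scales to the full dataset: the mapping is applied row by row, so it can equally be run inside a map phase over a file stored in HDFS.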
We calculated the mortality caused by various diseases and disease categories, across different regions, genders and age groups. This gave us an idea of which sectors need the largest efforts and which diseases require immediate attention. By indicating which gender is being affected in a particular region, and in which age group, the analysis took us to the minutest details of the effects of diseases on the population, helping us pinpoint the areas, genders and age groups requiring the most immediate attention.
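The kind of aggregation underlying these calculations can be sketched as follows. In the thesis itself the grouping was done with Hive queries; the field names and sample figures below are purely illustrative.

```python
from collections import defaultdict

def deaths_by_group(rows):
    """Sum deaths for each (region, gender, age-group) key, the same
    grouping used for the mortality plots."""
    totals = defaultdict(float)
    for region, gender, age_group, deaths in rows:
        totals[(region, gender, age_group)] += deaths
    return dict(totals)

# Illustrative records only; the real dataset has millions of rows.
sample = [
    ("South Asia", "Male", "0-30", 120.0),
    ("South Asia", "Male", "0-30", 80.0),
    ("East Asia", "Female", "30-60", 60.0),
]
print(deaths_by_group(sample))
```

In HiveQL the equivalent would be a SELECT with SUM and GROUP BY over the region, gender and age-group columns.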
CHAPTER 2
2.1 Big Data
Big Data is being generated by everything around us at all times. Every digital process and social media exchange produces it; systems, sensors and mobile devices transmit it. [1] We create around 2.5 quintillion bytes of data every day. [2] The 2011 IDC Digital Universe study reports that around 130 exabytes of data were created and stored in 2005. With rapid increase, this grew to 1227 exabytes in 2010 and is projected to grow at a rate of 45% to around 7910 exabytes in 2015 (the current year). [3] The data is growing rapidly and will never stop. This rapid growth lays the ground for the “Big Data” phenomenon: a technological phenomenon brought about by rapid data growth and collateral advancements in technology that have given rise to an ecosystem of hardware and software products enabling users to analyse this data to produce new and granular levels of insight. [3]
There are many definitions of Big Data. The Oxford English Dictionary defines Big Data as: “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” [4] Wikipedia defines the term Big Data as: “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” [5] The “2011 Big Data study” by McKinsey, however, defines Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyse”. [6]
Another definition of Big Data, by O’Reilly, states:
“Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the existing database architectures. To gain value from this data, there must be an alternative way to process it.” [7]
It should be noted here that there is no definitive threshold for how big data must be in order to be considered Big Data. New and innovative technologies need to be in place for managing this Big Data phenomenon. The International Data Corporation (IDC) defines Big Data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. [8]
Concluding from the above definitions, a new definition can be formulated for Big Data: “an extremely large amount of data of a broad range, which is increasing rapidly and will never stop growing, and for which we will always need a parallel technology efficient enough to handle it and bring out better insights for a user”.
2.2 Characteristics of Big Data
Big Data is not only about the Volume or Size of the data, but also includes the data Variety,
and data Velocity. These three attributes together form the 3 V’s of the Big Data.
The data to be processed becomes the input to the map function. The output, i.e. the intermediate results (key/value pairs) generated by mapping, eventually becomes the input to the reducer.
Fig. 5.2: MapReduce
The reducer aggregates the input from the mapper and produces the final output. As the sequence in the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it makes it easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
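As a minimal illustration of this model, the classic word-count job can be sketched in plain Python (not the Hadoop Java API) with a map function, a shuffle step that groups intermediate values by key, and a reduce function:

```python
from collections import defaultdict

def map_func(line):
    # Map phase: emit an intermediate (key, value) pair per word.
    for word in line.split():
        yield (word, 1)

def reduce_func(key, values):
    # Reduce phase: aggregate all values seen for one key.
    return (key, sum(values))

def run_job(lines):
    # Shuffle: group intermediate values by key before reducing.
    # In Hadoop this grouping is done by the framework across nodes.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_func(line):
            groups[key].append(value)
    return dict(reduce_func(k, v) for k, v in groups.items())

print(run_job(["big data", "big hadoop"]))  # {'big': 2, 'data': 1, 'hadoop': 1}
```

Because map_func and reduce_func only see one record or one key at a time, the same program can be partitioned across many machines without changing its logic, which is exactly the scalability property described above.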
5.4.3 Hadoop YARN
This is a framework for job scheduling and cluster resource management. YARN has the
master/worker architecture. The master (Resource manager) manages all the resources on the
workers and schedules work in the cluster. Furthermore, the resource manager handles all the
client interactions.[40]
5.4.5 Hadoop common utilities:
These are Java libraries and utilities required by other Hadoop modules.
5.5 Hive
Hive is a data-warehousing solution built on top of the Hadoop environment. It brings
familiar relational-database concepts, such as tables, columns, and partitions to the
unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs executed
using Hadoop. [41] Hive provides an SQL dialect, called Hive Query Language (abbreviated
HiveQL or just HQL) for querying data stored in a Hadoop cluster. Hive communicates with
the JobTracker to initiate the MapReduce job. Hive does not have to be running on the same
master node with the JobTracker. In larger clusters, it’s common to have edge nodes where
tools like Hive run. They communicate remotely with the JobTracker on the master node to
execute jobs. Usually, the data files to be processed are in HDFS, which is managed by the
NameNode.
After cleaning, the data had to be stored in some order and given a shape (schema); Hive was used for this purpose. However, an alternative was kept to tackle situations where Hive might fall short, for instance if the data were coming from multiple sources and complex transformations were required, e.g. one dataset coming from sensors and another from the global survey data.
Examples of Hive queries:
Hive query to create a database:
hive> CREATE DATABASE Hive1;
Hive query to use a database:
hive> USE Hive1;
Hive query to create a table:
hive> CREATE TABLE table1 (field1 DataType, field2 DataType, field3 DataType)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Hive query to load a file into a Hive table:
hive> LOAD DATA LOCAL INPATH '<File_Path>' INTO TABLE <Table_Name>;
Hive query to display the contents of a table:
hive> SELECT * FROM <Table_Name>;
5.6 Pig:
Pig is described as a data flow language, rather than a query language. It is a platform that provides a high-level language for expressing programs that analyse large datasets. Pig is equipped with a compiler that translates Pig programs into sequences of MapReduce jobs that the Hadoop framework executes. [42] Yahoo, one of the heaviest users of Hadoop (and a backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig. Twitter is another well-known user of Pig. [42]
CHAPTER 6
6.1 Analysis methodology
The medical data sets for analysis were collected from the GHDx (Global Health Data
Exchange). The Global Health Data Exchange is a data catalogue created and supported
by IHME (Institute for Health Metrics and Evaluation).[94]
This data set comes from the “Global Burden of Disease” study for the years 1990 and 2010. The GBD Study 2010 estimated the burden of diseases, injuries, and risk factors globally and for 21 regions for 1990 and 2010.
This dataset, delimited with commas (.csv), provides four metrics with uncertainty
intervals for 291 diseases and injuries: deaths, years of life lost (YLLs), years lived
with disability (YLDs), and disability-adjusted life years (DALYs) by region, age, and
sex.
6.1.1 Type and Structure of the data:
The data was in .csv format; we made it more unstructured so as to check the reliability of Hadoop when used on unstructured datasets, re-saving it in tab-delimited format. Below is a screenshot of the data to depict its structure and format.
Fig. 6.1: Screenshot of the Data
6.1.3 Mapping and Reducing process
To clean the data, the MapReduce paradigm was used. It helped in removing all the obstructions that would have kept us from the visualisation process. The MapReduce program was written in Java (which comes integrated with Cloudera). Shown below is a screenshot of the MapReduce cleaning process.
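The cleaning itself was a Java MapReduce program; as a rough, language-independent sketch of the same filtering logic (the field layout and marker strings here are assumptions, not the thesis code), it amounts to discarding every tab-delimited record that contains an empty, null, or NaN field:

```python
def is_clean(record):
    """Return True if no field of a tab-delimited record is empty,
    'null', or 'NaN' (the markers that had to be removed)."""
    fields = record.rstrip("\n").split("\t")
    return all(f not in ("", "null", "NaN") for f in fields)

def clean_records(records):
    # In a MapReduce job, a filter like this would typically run in the
    # map phase, each mapper dropping its own share of bad records.
    return [r for r in records if is_clean(r)]

raw = [
    "Tuberculosis\tSouth Asia\t1990\t120",
    "Lung cancer\tEast Asia\tNaN\t60",
    "Diabetes mellitus\t\t2010\t15",
]
print(clean_records(raw))  # only the first record survives
```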
Fig. 6.4: Screenshot after the addition of the two new columns
We selected three disease categories, viz.
Cancerous
Contagious
Diabetes Mellitus
in the following three regions:
North Africa and Middle East (Egypt, Iran, Iraq, Saudi Arabia, Morocco etc.)
South Asia (India, Pakistan, Afghanistan etc.)
East Asia (Mongolia, China, South Korea, Japan etc.)
We calculated and plotted the mortality rate caused by these three diseases in the mentioned
regions by:
Age Group (0-30, 30-60 and 60-77)
Gender (Male, Female and Both)
Year (1990, 2005 and 2010)
The calculations were done in Hive and the plotting was done using R.
Syntax for plotting:
$ R
> res <- read.csv("DataFileName", sep = ",", header = FALSE)
> barplot(Y_Axis_Data_Frame, names.arg = c(X_Axis_Data_Frame), col = c("Red", "Blue"))
6.3 GRAPHS:
Following are the different graphs with the corresponding variables and constants:
The bar chart above shows the mortality caused by six disease categories, for different age groups.
Result 1:
In the three age groups “0-13, 52-65 and 65-77”, the maximum deaths are caused by heart diseases, with age group “0-13” being the most affected. After heart diseases, the leading cause of mortality in age group “0-13” is contagious diseases, while in the age groups “52-65” and “65-77” cancerous diseases remain the threat after heart disease.
It is to be noted here that cancer and heart disease begin to affect people from the age of 39.
Region: Global
Result 2:
The graph shows the mortality caused by the 3 disease categories from 1990 – 2010. As indicated, contagious diseases were dominant in 1990, followed by cancer. They have been brought under control from 2005 onwards, but cancer remains a matter of concern.
Region: Global
Years: All
Result 3:
This graph indicates the deaths caused in both genders. As indicated, males were affected the most by contagious diseases: more than 250000x3 males died of contagious diseases, against 20000 deaths caused by cancer, across the various regions. In females, contagious diseases again caused the most mortality, around 200000x3.
Region: Global
Years: all
Gender: all
Result 4:
The above graph indicates the mortality caused in three aggregated age groups, i.e. “0-30”, “30-60” and “60-77”. The visualisation shows that contagious disease remains fatal from infancy into the 30s; after 30 years, however, mortality caused by cancerous disease shows an increase, and at 60-77 years most of the deaths are caused by cancer and communicable diseases.
Region: East Asia, North Africa & Middle East (MENA), South Asia
Years: All
Gender: All
Result 5:
The above visualisation shows the mortality caused in three regions. As indicated, East Asian countries are most affected by cancerous diseases, followed by contagious ones; around 50000x3 deaths were caused from 1990-2010 by these two categories.
In MENA, the category causing the most deaths is contagious diseases, followed by cancer. Compared with East Asia, all three disease categories show higher mortality in MENA.
In South Asian countries, like India, Pakistan and Afghanistan, contagious diseases have caused around 150000 deaths. However, the cancerous diseases and diabetes mellitus show almost the same range as in the other regions.
Region: North Africa and Middle East. (Iraq, Iran, Algeria, Arab countries, Sudan etc.)
Age Group: All
Gender: All
Result 6:
As shown, contagious diseases caused more deaths in 1990, but over time the death rate has come down at a good rate. Deaths caused by cancerous diseases, however, have remained approximately the same.
[32] Rajesh Vargheese, “Dynamic Protection for Critical Health Care Systems Using Cisco CWS”, Fifth International Conference on Computing for Geospatial Research and Application (2014), IEEE.
[33] Simon Tripp and Martin Grueber, (2011), available at: http://battelle.org/docs/default-
document-library/economic_impact_of_the_human_genome_project.pdf.
[34] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, Ananth Grama, “Trends in big data analytics”, J. Parallel Distrib. Comput. 74 (2014) 2561–2573, available at ScienceDirect.
[35] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, Angela Hung Byers, (2011), ”Big data: The next frontier for innovation,
competition, and productivity”, McKinsey Global, available at:
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_in
novation.
[36] “What is Apache Hadoop?”, https://hadoop.apache.org/
[37] Sujata A. Pardeshi, Pooja K. Akulwar, “Escalation the Power of Big Data”, ISBN: 978-81-
927230-0-6
[38] Suresh Lakavath, Ramlal Naik L., “A Big Data Hadoop Architecture for Online
Analysis”, - International Journal of Computer Science and Information Technology &
Security (IJCSITS), ISSN: 2249-9555, Vol. 4, No.6,(2014)
[39] Dr. M Moorthy, R. Baby, S. Senthamaraiselvi, “An Analysis for Big Data and its
Technologies”, IJCSET(2014), ISSN: 2231-0711
[40] “How Apache Hadoop Yarn works”, http://blog.cloudera.com/blog/2014/05/how-
apache-hadoop-yarn-ha-works/.
[41] Praveen Murthy, Anurag Bhardwaj, P. A. Subrahmanyam, Arnab Roy, Sree Rajan, “Big Data Taxonomy”, (2014), Cloud Security Alliance; O’Reilly, “Programming Hive”.
[42] Anurag Bhardwaj et al., “Big Data Taxonomy”, (2014), Cloud Security Alliance, available at www.cloudsecurityalliance.org; “How Pig works”.
[43] Pig Tutorial, available at: www.tutorialspoint.com
[44] “R: The R project for statistical computing”, http://www.r-project.org/.
[45] Markus Maier, Master’s Thesis, (2013), “Towards a Big data reference architecture”,
Eindhoven University of Technology.
[46] Bernard Marr, (2015), Forbes, “How Big Data is Changing Healthcare”, available at
http://www.forbes.com/sites/bernardmarr/2015/04/21/how-big-data-is-changing-healthcare/
[47] IBM study of 1,734 chief marketing officers from 64 countries, “Everyday we create 2.5
quintillion data”, available at: http://www.storagenewsletter.com/rubriques/market-
reportsresearch/ibm-cmo-study/
[48] Ming Yang, Melody Kiang, Wei Shang, “Filtering big data from social media – Building an early warning system for adverse drug reactions”, available at ScienceDirect.
[49] Raghunath Nambiar, Adhiraaj Sethi, Ruchie Bhardwaj, Rajesh Vargheese, “A Look at
Challenges and Opportunities of Big Data Analytics in Healthcare”, 2013 IEEE International
Conference on Big Data.
[50] Lin Li, et al., “Risk Adjustment of Patient Expenditures: A Big Data Analytics Approach”, 2013 IEEE International Conference on Big Data.
[51] Ping Jiang, et al., “An Intelligent Information Forwarder for Healthcare Big Data Systems with Distributed Wearable Sensors”, IEEE Systems Journal.
[52] Ruchie Bhardwaj, Adhiraaj Sethi, Raghunath Nambiar, “Big Data in Genomics: An
Overview”, 2014, IEEE International Conference on Big Data.
[53] Marco Viceconti, Peter Hunter, and Rod Hose, “Big data, big knowledge: big data for
personalised healthcare”. JBHI-00566-2014, IEEE Journal of Biomedical and Health
Informatics.