
“Analysis of multi diseases using Big Data for

better healthcare”
Introduction
The term “Big Data” has generated a great deal of buzz over the past few years. It was initially shaped by organizations that had to handle fast-growing data, such as web data, data resulting from scientific or business simulations, and other data sources (sensors, clinical trials etc.). Data was no longer a matter of a certain number of terabytes; growth was rapid and pushed sizes to exabytes and beyond. Traditional databases were becoming incapable of handling and managing this data. The pressure to handle the growing amounts of data on the web, in databases and elsewhere thus led companies like Google to develop a file system that could cope with relentlessly growing data. This marked the development of the Google File System [45] and MapReduce [45]. Efforts were then made to rebuild those technologies as open source software. This resulted in Apache Hadoop and the Hadoop Distributed File System [45] and laid the foundation for the technologies summarized today as ‘Big Data’.
With the foundation laid by Google, other information technology companies stepped in and started to invest in extending their software portfolios and building new solutions aimed especially at Big Data analysis. Among those companies were IBM, Oracle, Microsoft and SAP. The effort software companies have made to become part of the Big Data story is not surprising, considering the trends analysts predict, the praise they heap on ‘Big Data’, and its expected impact on business and society as a whole. IDC predicts in its ‘The Digital Universe’ study that the digital data created and consumed per year will grow to 40 zettabytes by 2020, and that a third of this data could yield value to organizations if processed using Big Data technologies [45]. IDC also states that in 2012 only 0.5% of potentially valuable data was analysed, calling this the ‘Big Data Gap’. The McKinsey Global Institute likewise predicts that globally generated data is growing by around 40% per year, and it furthermore describes Big Data trends in monetary terms: it projects the yearly value of Big Data analytics for the US health care sector at around 300 billion US dollars, predicts a possible value of around 250 billion euros (€) for the European public sector, and foresees a potential improvement of margins in the retail industry of 60%. [46]
Big Data differs from ordinary data by virtue of its three main characteristics: Volume, Velocity and Variety, usually referred to as the 3 Vs of Big Data. The Volume of Big Data ranges from hundreds of terabytes to exabytes and beyond, demanding vast storage capacity. The Velocity of Big Data is the frequency at which the data is generated. Most of the mountains of data present today have been generated in the last couple of years alone. With the growth in sensors, mobile phones, social media feeds, web log files, clinical data, forecasting data and so on, hardly a second passes without gigabytes of data being generated. Data arriving at this pace continually adds to the volume of Big Data and becomes a problem in itself. Variety means the broad range of data that is being generated. Data comes in three forms: structured, semi-structured and unstructured. Traditional database management software deals with structured data only, i.e. data with a definite schema, data that can be stored in proper relational tables or databases. To some extent, semi-structured data is also manageable. But it is practically impossible for traditional databases to manage and handle Big Data so that it can be analysed. The data available today comes in a variety of formats (text, audio files, video files and so on) that cannot be stored in rows and columns. Analysing this type of data will bring better outputs for organisations and will improve their performance.
Big Data analytics benefits society in every aspect. It has been proving its worth in various fields, among the most important of which are political campaigns, healthcare, weather forecasting, education and social media analysis. This thesis provides an outlook on the usage of Big Data analytics in healthcare. The healthcare sector is a major source of unstructured data. The IDC Health Insights study states that worldwide health care data grew to 500 petabytes in 2012 and is projected to reach 25,000 petabytes by 2020. Data generated by EHRs (Electronic Health Records), clinical trials, health surveys, laboratory testing, genomics, and wearable sensors such as ECGs and accelerometers constitutes a large variety of data that can be analysed to predict future trends in health issues. X-ray images, CT scans, MRIs, ultrasound and similar tests produce a great deal of unstructured data. This data, if well analysed, will certainly provide better insights and improve health care departments. Employing Big Data and Big Data technology in healthcare will improve patient outcomes and be cost effective. Big Data has great potential to transform health and health care.
This Big Data needs to be handled and evaluated to make the most of it. Companies are upgrading their infrastructure and have started implementing Big Data technologies to extract predictions from these heaps of data. There are various Big Data technologies on the market, including Hadoop, Spark, SAP HANA and High Performance Computing Cluster (HPCC); among these, Hadoop is the most widely used. It is not wrong to say that the term “Hadoop” is used almost synonymously with Big Data. Hadoop is an Apache Software Foundation project created by Doug Cutting. It uses the MapReduce model developed at Google together with an improved file system called the Hadoop Distributed File System.
Our analysis was based solely on Apache Hadoop. The Hadoop Distributed File System is used to store and process large amounts of data in a distributed manner. To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. Environments such as Hive, Pig and R were used to analyse the data further.
There are many varieties of data that can be put to use with the help of Big Data technologies like Hadoop. For instance, genomic data can be analysed to draw predictions about genes and gene trends. Similarly, EHR data can be collected and analysed to keep a tight vigil on patient activities and flows, helping to prevent epidemic outbreaks such as H1N1 or hepatitis. According to the IDC Health Insights study Status of Clinical Decision Support in China (Feb 14, 2014), clinical efficiency in organisations that have adopted Big Data technology in their CDS systems has increased compared with those that do not use the technology at all. [43] The implementation of Big Data analytics will also help reduce costs and expenses, for instance by picking up warning signs of serious illness at an early enough stage that treatment is far simpler (and less expensive) than if the illness had not been spotted until later. Non-communicable diseases like diabetes are proving ever more fatal and come with many health hazards. This can be countered by analysing the previous health records of many patients and informing patients with diabetes about the most probable complications. Even a doctor with great experience can sometimes be wrong in a diagnosis, but with better access to a patient’s previous records, backed by better analysis, he can provide better solutions to the problems the patient faces. A Big Data analytics solution, if attached to a Clinical Decision Support system, will give the clinician better help by comparing years of historical detail with the patient’s current situation, thus providing even better case-specific advice.
Healthcare data can be collected from hi-tech hospitals with EHRs, from websites that provide open access to large datasets, and from laboratories that provide multivariate data for analysis; health survey data is useful for seeing disease patterns over the past 10-30 years. However, finding the desired datasets may be no cakewalk, for security and privacy reasons. But where datasets can be obtained, there is a vast amount of data stored over the years in hospitals and other health centres that can be put to good use to improve and deal with health issues to a major extent.
This thesis focuses on solutions for health and healthcare using Big Data analytics. The results and conclusions are based on an analysis of survey data from the Global Burden of Disease study, covering 1990 to 2010, 21 regions and 291 diseases. The data was collected from the GHDx (Global Health Data Exchange). The dataset is semi-structured, with more than 8,000,000 rows and 31 columns. We manipulated the data by dividing the 291 diseases into 34 categories and assigning each category a ‘category code’, thereby adding 2 more columns to the dataset. To make the dataset fully unstructured, so as to justify the usage of Big Data technologies on unstructured data, we converted the datasets into tab-delimited .txt format. The dataset had many null and NaN (Not a Number) values, so it needed to be cleaned. This cleaning was done using MapReduce programming. After cleaning and restructuring, the final data was loaded into Hive tables. Finally, the plotting and calculations were done in the R environment.
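As a rough illustration of this cleaning step, the sketch below shows a minimal map-only Hadoop job (written against the standard Hadoop Java MapReduce API) that drops any tab-delimited row containing an empty, null or NaN field. The class name and the exact filtering rules are illustrative assumptions, not the verbatim code used in this work.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanRecords {

    // Map-only job: emit a row only if none of its fields is empty, "null" or "NaN".
    public static class CleanMapper
            extends Mapper<Object, Text, Text, NullWritable> {

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", -1);
            for (String f : fields) {
                if (f.isEmpty() || f.equalsIgnoreCase("null") || f.equalsIgnoreCase("NaN")) {
                    return; // drop the whole row
                }
            }
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "clean GBD records");
        job.setJarByClass(CleanRecords.class);
        job.setMapperClass(CleanMapper.class);
        job.setNumReduceTasks(0);            // map-only: no reduce phase needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A map-only job (zero reducers) is enough here, because each row can be kept or dropped independently of all others.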
We calculated the mortality caused by the various diseases and disease categories, across regions, genders and age groups. This told us where to concentrate our efforts and which diseases require immediate attention. By showing which gender is affected in a particular region and in which age group, the analysis took us down to the finest details of the effects of diseases on the population, helping us identify the area, gender and age group requiring the most urgent attention.
CHAPTER 2
2.1 Big Data
Big Data is being generated by everything around us, at all times. Every digital process and social media exchange produces it; systems, sensors and mobile devices transmit it [1]. We create around 2.5 quintillion bytes of data every day [2]. The 2011 IDC Digital Universe study reports that around 130 exabytes of data were created and stored in 2005. Growing rapidly, this reached 1,227 exabytes in 2010 and is projected to grow at around 45% per year, to roughly 7,910 exabytes in 2015 (the current year) [3]. The data is growing rapidly and will never stop. This rapid growth laid the ground for the “Big Data” phenomenon: a technological phenomenon brought about by rapid data growth and parallel advancements in technology, which have given rise to an ecosystem of hardware and software products enabling users to analyse this data to produce new and deeper levels of insight. [3]
There are many definitions of Big Data. The Oxford English Dictionary defines Big Data as: “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” [4] Wikipedia defines the term Big Data as: “An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” [5] The 2011 Big Data study by McKinsey, however, defines Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyse”. [6]
Another definition of Big Data, by O’Reilly, states:
“Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the existing database architectures. To gain value from this data, there must be an alternative way to process it.” [7]
It should be noted that there is no explicit definition of how big data must be in order to be considered Big Data. New and innovative technologies need to be in place to manage this Big Data phenomenon. The International Data Corporation (IDC) defines Big Data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and analysis. [8]
Drawing on the above definitions, a new definition of Big Data can be formulated: “an extremely large amount of data of a broad range, which is increasing rapidly and will never stop growing, and for which we will always need parallel technology efficient enough to handle it and bring out better insights for the user”.
2.2 Characteristics of Big Data
Big Data is not only about the volume or size of the data; it also includes the data’s variety and velocity. These three attributes together form the 3 Vs of Big Data.

Figure 1.1: The 3 Vs of Big Data


2.2.1 Volume:
Every day, in the digital universe, we create around 2.5 exabytes of data. Every mouse click, every phone call we make, every text message we send, every web search, every purchase transaction and even every single “like” on a social networking site is stored and catalogued in the Big Data cloud [1]. “Volume” is effectively synonymous with the “big” in the term Big Data. Data volume will always tend to grow, regardless of the profile of an organisation. As mentioned earlier, data is increasing rapidly and has shown enormous growth over the past few decades. Every organisation creates and stores data on every process it runs. Data is no longer a matter of a few terabytes; storage keeps being upgraded, and we will soon be dealing with “zettabytes” and “yottabytes”. To store these huge amounts of data we will always be in desperate need of new technologies, implemented time and again. As mentioned at the beginning of the chapter, more than 7,900 exabytes of data are expected to have been generated by the end of 2015. There is a natural tendency for companies to store all sorts of data: medical data, financial data, environmental data and so on. Many of the smaller companies are still within the range of terabytes, but they could soon exceed petabytes, exabytes or even more [9]. The primary goal of Big Data is to make this large volume of data useful for companies and consumers alike, to optimize future results. [1]
2.2.2 Velocity:
Data is being generated at a faster pace than ever before. The more digitized mankind becomes, the faster data generation gets. In fact, most of the data present in the world today has been created in the last two years alone [10]. Social networking sites, health centres, banks and the like are accessed and used every second, and in the blink of an eye tens of thousands of data records are generated. Some key facts to note are as follows [2]:
 950 million people generate 2.7 billion “likes” per day on Facebook.
 400 million tweets are created by active users each day on Twitter.
 72 hours of video are uploaded to YouTube every minute.
From these facts and figures one can imagine the frequency of data generation. The Velocity of Big Data is the speed at which the data flows. A conventional understanding typically considers how quickly the data arrives and is stored, and how quickly it can be retrieved. Increasing digitization and the growing deployment of sensor networks have led to a constant flow of data at a pace that has made it impossible for traditional systems to handle.
2.2.3 Variety
Data comes from different sources and in different forms/types. The third ‘V’ of Big Data stands for the variety of data an organisation can receive. The explosion of sensors, smart devices (phones) and social networking sites has increased the complexity of the data. The variety of data generated can be categorised as:
 Structured Data
 Semi-Structured Data
 Unstructured Data
2.2.3a Structured Data
A simple definition of structured data is “any data that resides in, or can be grouped into, a relational schema (e.g. rows and columns within a standard database)”. In other words, “data that resides in a fixed field within a record or a file is called structured data”. Structured data has the edge over the other two types in that it can be handled and manipulated easily: it can readily be entered, stored, queried and analysed. A typical relational database is an example of structured data.
2.2.3b Semi-Structured Data
Semi-structured data lies between structured and unstructured data. It is a type of structured data, but one that does not conform to an explicit fixed schema. Certain “tags” and “markers” are used to identify the elements within semi-structured data. Data in this category lacks the rigid structure from which complete meaning can be extracted without much further processing. Web log files, social media files and .xml files are examples of semi-structured data.
2.2.3c Unstructured Data
Unstructured data is data that cannot easily be indexed into relational tables for any kind of analysis or querying. Examples include image files, audio files, video files, PDF files, PowerPoint presentations and so on. It is always tricky to extract information from unstructured “raw data”, but the information extracted from this type of data regularly proves fruitful for companies.
2.2.4 “Value” of the Big Data
Many IT professionals consider “Value” the fourth “V” of Big Data. This characteristic, however, is achieved only after proper processing of the three Vs of Volume, Velocity and Variety. The value of Big Data lies in predicting likely future occurrences and the actions most likely to succeed for each; in keeping an eye on what is happening in real time (or close to real time); and in determining the action to take. [1]
2.3 Big Data Infrastructure
Infrastructure is the foundation of the Big Data technology stack. Another unique characteristic of Big Data is that, unlike the large data sets that have historically been stored and analysed, often through data warehousing, Big Data is made up of discretely small, incremental data elements with real-time additions or modifications. It does not work well in traditional online transaction processing (OLTP) data stores or with traditional SQL analysis tools. Big Data requires a flat, horizontally scalable database, often with unique query tools that work in real time with actual data (as opposed to time-delineated snapshots). Table 1.1 compares traditional data with Big Data. [11]
Components          Traditional Data               Big Data
Architecture        Centralized                    Distributed
Data Volume         Terabytes                      Petabytes or exabytes
Data Type           Structured or transactional    Semi-structured or unstructured
Data Relationship   Known relationships            Complex/unknown relationships
Data Model          Fixed schema                   Schema-less

Table 1.1: Big Data Vs Traditional Data Types
Figure 1.2 provides a general view of the Big Data infrastructure, which includes the general infrastructure for data management, typically cloud based, and a Big Data analytics part that requires high-performance computing clusters, which in turn require a high-performance, low-latency network. [12]
Figure 1.2: General Big Data Infrastructural Components [12]
General BDI services and components include
 Big Data Management tools
 Registries, indexing/search, semantics, namespaces
 Security infrastructure (access control, policy enforcement, confidentiality, trust,
availability, privacy)
 Collaborative environment (groups management) [12]
The source of Figure 1.2 defines the Federated Access and Delivery Infrastructure (FADI) as an important component of the general BDI; it interconnects the different components of the cloud/Intercloud-based infrastructure, combining dedicated network connectivity provisioning with federated access control. [12]
Big Data infrastructure thus needs to be highly scalable and as fast as possible. A company might not be able to buy highly configured servers and other required equipment time and again to keep up with streaming data. It can, however, turn to cloud technology as provided by IT companies such as IBM (Bluemix) and Amazon.
2.4 Big Data handling tools (Big Data technologies)
To handle Big Data and obtain better insights and judgments, we will always need efficient tools to manage and process this giant. When we talk about Big Data technology, the first thing that comes to mind is “Hadoop”. Hadoop is almost synonymous with the term “Big Data” in the industry and is popular for handling huge volumes of unstructured data. Hadoop and Big Data go hand in hand: the Hadoop Distributed File System enables a highly scalable, redundant data storage and processing environment that can be used to execute many types of large-scale computing projects. For large-volume structured data processing, enterprises use analytical databases such as EMC’s Greenplum and Teradata’s Aster Data systems. Many of these appliances offer connectors or plug-ins for integration with Hadoop systems. [1]
Big Data technology can be broken down into two major components – the hardware
component and the software component, as shown in the figure below. The hardware
component refers to the component and infrastructure layer. The software component can be
further divided into data organisation and management software, analytics and discovery
software, and decision support and automation software.[1]

Figure 1.3: Big Data Technological Stack [1]


Following are the major technologies that are being used to process the Big Data:
2.4.1 Hadoop
Hadoop was created by Doug Cutting (who also developed Apache Lucene) for scalable, reliable distributed processing. Hadoop consists mainly of HDFS and MapReduce. The Hadoop Distributed File System is used to store and process large amounts of data in a distributed manner. To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. [13]
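To make this key-value contract concrete, a minimal sketch in Hadoop’s Java API follows; the concrete type parameters shown are just one common choice, assumed here for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: (byte offset, line of text) -> intermediate (word, count) pairs.
class ExampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Override map(LongWritable key, Text value, Context context) here.
}

// Reduce phase: (word, all counts for that word) -> final (word, total) pairs.
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Override reduce(Text key, Iterable<IntWritable> values, Context context) here.
}

The framework handles the shuffle between the two phases; only these two methods carry application logic.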
2.4.2 Spark
Hadoop provides a cluster storage approach, whereas Spark provides a scalable data analytics platform with “in-memory computing”. In-memory computing has been shown to provide faster data access by eliminating I/O overhead. Spark’s open source environment increases computing power, eventually giving it superiority over Hadoop. It has been designed for specific applications such as machine learning algorithms and natural language processing. Spark runs on “Apache Mesos”, a cluster manager through which Spark applications can coexist with Hadoop. Spark programs use two types of operation: (1) actions and (2) transformations. An action is similar to reduce, while a transformation is similar to map and cache operations. Spark is developed in Scala and supports it natively; Scala is a functional programming language used to provide a distributed and iterative environment. [13]
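As a small sketch of the transformation/action distinction, using Spark’s Java API (the input path is a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkOps {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("transformations-vs-actions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: they only describe a new dataset.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/records.txt");
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());
        nonEmpty.cache();                       // keep the filtered RDD in memory

        // Actions trigger the actual computation on the cluster.
        long count = nonEmpty.count();
        System.out.println("non-empty lines: " + count);

        sc.stop();
    }
}

Nothing is computed until count() runs; this laziness is what lets Spark keep intermediate results in memory between steps.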
2.4.3 Storm
Storm was released as open source by Twitter in September 2011. Storm implements a concept similar to Hadoop’s MapReduce. Ruby and Python are supported for building Storm applications. The key idea is that it is used for stream processing. Storm uses no storage concept and can handle semi-structured, unstructured and structured data alike. [14]
2.4.4 HPCC (High Performance Computing Cluster)
High Performance Computing Cluster also uses the MapReduce framework for the analysis of large amounts of data. It works with “Enterprise Control Language (ECL)”, a declarative programming language. ECL provides an entire programming paradigm in which a high degree of parallelism is achieved. Two clusters are mainly involved: Thor and Roxie. Thor simplifies the ETL (Extract-Transform-Load) process, and Roxie provides data delivery through highly concurrent procedures. HPCC uses its own distributed file system. The two major advantages HPCC provides over Hadoop are scalability and speed. The HPCC platform supports intricate physics applications and in-depth visualisation of simulations. Elsevier uses HPCC to boost the analytical and critical capabilities of SciVal in support of decision making. [13]
2.4.5 SAP-HANA
SAP HANA is another “in-memory computing” tool, devised by SAP for Big Data; it processes blocks of data using advanced parallel architectures and algorithms for greater speed. [13]
2.5 Applications of Big Data
Before giving a brief description of the applications of Big Data, let us examine the following comparison table, which presents some of the implications of Big Data for business intelligence: the move from traditional decision making to data-driven decision making. [14]
Traditional decision-making environment          Big Data extension
Determine and analyse the current                Provide complete answers, predict the future
business situation.                              business situation and investigate new opportunities.
Integrated data sources.                         Virtualised and blended data sources.
Supports only structured data.                   Supports multi-structured data.
Aggregated and detailed data within limits.      Large volumes of (detailed) data without limits.
One-size-fits-all data management.               Flexible and optimized data management.

Table 1.2: Traditional (vs) Big Data environments [14]


Big Data covers an extensively vast area of application and implementation. The following are five major domains in which Big Data is playing a transformative role; some of them are briefly described below:
 Political Campaigns
 Education
 Healthcare
 Pervasive Computing
 Weather Forecasting
2.5.1 Role of Big Data in Political Campaigning:
In India, Big Data analysis was, in part, responsible for the BJP and its allies winning the highly successful 2014 Indian general election. [15]
Big Data has the power to transform our lives and environment. Data scientists are well aware that the use and combination of different data sources can create a competitive advantage when running for public office. In his book “The Victory Lab”, author Sasha Issenberg gives a glimpse of the techniques used to apply data science to winning political campaigns. [14]
2.5.2 Role of Big Data analytics in Education:
Education systems at all levels stand to benefit from Big Data analytics. There are two main ways to achieve goals and value in modernizing our education system and taking it to new heights. These are:
 Predictive models
 Analytics
These two application areas deliver more personalized, higher-quality education by increasing collaboration between students, teachers, administrators and parents, resulting in greater satisfaction and better performance. [14]
2.5.3 Role of Big Data Analysis in Healthcare:
New insights derived from Big Data analytics will serve to advance personalized care, improve patient outcomes and help avoid unnecessary costs. EHRs (Electronic Health Records) coupled with new analytical tools will significantly open doors to mining information for the most effective outcomes across large populations [16]. Surveys spanning more than a decade can be processed and analysed to curb diseases that recur at particular periods. Epidemic outbreaks can be controlled with the help of data scientists, who can predict an outbreak well before it shows up.
2.5.4 Role of Big Data in weather forecasting:
Weather changes have always been an uphill task for forecasters. Big Data solutions can help meteorological departments analyse the unstructured data produced by the sensors they work with. In September 2012, Palau, Japan, Korea, China and Russia were hit by Typhoon Sanba; the damage was estimated at around $378.8 million (2012 USD). [17]
Menaced by such destructive weather phenomena, South Korea upgraded its national weather information system with the goal of understanding weather patterns better and predicting the location and ferocity of weather events more accurately. The upgrade, installed by the Korea Meteorological Administration, increased the agency’s data storage capacity by nearly 1000% to 9.3 petabytes, making it Korea’s most capable storage system; IBM provided the storage hardware and software. The KMA project dramatically illustrates the Big Data phenomenon and its impact on weather forecasting. Thanks to the rapid spread of sensors and satellites, and to the increase in computer number-crunching speeds, it is possible to forecast weather changes more accurately and in greater detail, potentially saving thousands of lives and safeguarding property. [18]
Increasing evidence of climate change worldwide is prompting governments and scientists to take action to protect people and property from its effects. But to take effective action, they need to understand a lot more about the weather, from what is going to happen tomorrow to what is coming next year [18]. This can happen only through Big Data analytics: weather data has been stored in repositories since day one, and all it needs is the implementation.
CHAPTER 3
3.1 REVIEW OF LITERATURE
The term Big Data has become the talk of the town over the last couple of years. According to the IDC white paper titled “Big Data: What It Is and Why You Should Care” by Richard L. Villars et al. (2011), the world generated over 1 ZB of data in 2010; by 2014, we will generate 7 ZB a year. Much of this data explosion is the result of a dramatic increase in devices located at the periphery of the network, including embedded sensors, smartphones and tablet computers. All of this data creates new opportunities to "extract more value" in human genomics, healthcare, oil and gas, search, surveillance, finance and many other areas [9]. The data will never stop growing.
Another study, carried out by IBM (2011), states that every day we create 2.5 quintillion (10^18) bytes of data, so much that 90 percent of the world's data today has been created in the last two years alone. The increasing volume, variety and velocity of data available from new digital sources like social networks, in addition to traditional sources such as sales data and market research, tops the list of challenges. The difficulty is how to analyse these vast quantities of data to extract meaningful insights, and to use them effectively to improve products, services and the customer experience. [47]
Alejandra Zarate Santovena (2013), in her research on Big Data, concludes that the 3 Vs of Big Data will grow constantly in complexity: data will arrive faster, in much more complex forms and in increasingly higher volumes. Much of this data will be useful only for a limited time, in some cases just a few seconds or even less. [14]
When we talk about the Variety of Big Data, three characteristics come to mind: structured, semi-structured and unstructured form. Big Data comprises a variety of data, such as web log files and audio and video files. With the rapid increase in data, this Big Data needs to be handled and managed for better insights.
Karthik Kambatla et al. mention that the volume of data operated upon by modern applications is growing at a tremendous rate, posing intriguing challenges for parallel and distributed computing platforms. These challenges range from building storage systems that can accommodate these large datasets, to collecting data from vastly geographically distributed sources into storage systems, to running a diverse set of computations on the data. [34]
Companies have started to integrate Big Data technologies to improve their performance. The internet giants Yahoo, Facebook, Google and Twitter are already using them. Another major data-generating sector is health care.
Dr Saravana Kumar N M et al. state that the healthcare industry is moving from reporting facts to discovering insights, toward becoming data-driven healthcare organizations. Big Data holds great potential to change the whole healthcare value chain, from drug analysis to the quality of patient care. Given the growing unstructured nature of Big Data from the health industry, it is necessary to structure it and distil its size into nominal values with possible solutions. The healthcare industry faces many challenges that make clear the importance of developing data analytics. [25]
Srivathsan, M and Yogesh, Arjun K state that technological advancement in healthcare has reached saturation, and that a breakthrough can be achieved by prognotive computing. Prognotive computing is related to Big Data analytics in that the process may require the collection, processing and analysis of extremely large volumes of structured and unstructured biomedical data stemming from a wide range of experiments and surveys collected by hospitals, laboratories, pharmaceutical companies or even social media, implemented using existing Big Data tools. The results of prognosis will improve efficiency in providing a better life to people. [26]
Aisling O’Driscoll et al. carried out a study of Big Data analytics for genomic data. They state that human DNA comprises approximately 3 billion base pairs, with a personal genome representing approximately 100 gigabytes (GB) of data, the equivalent of 102,400 photos. The Big Data processing capability of cloud computing, facilitating the analysis of all variables at once, is a significant enabler of the new area of systems biology. Big Data technologies such as the Apache Hadoop project, which provides distributed and parallelised data processing and analysis of petabyte (PB) scale data sets, will be very helpful in analysing genomic data. [28]
Ming Yang et al. proposed an early-warning system for adverse drug reactions, using social media feeds. They selected more than 500 discussion threads and collected all the original posts and comments on these drugs, using an automatic Web-spidering program, as the text corpus. Various classifiers were trained by varying the number of positive examples and the number of topics. The trained classifiers were applied to 3,000 posts published over 60 days. Top-ranked posts from each classifier were pooled, and the resulting set of 300 posts was reviewed by a domain expert to evaluate the classifiers [48]. Their design provided satisfactory performance in identifying ADR-related posts for post-marketing drug surveillance. The overall design of the system also points to a potentially fruitful direction for building other early-warning systems that need to filter Big Data from social media networks.
Raghunath Nambiar et al. researched Big Data in health care and conclude that personalized medicine is being promoted as the future of the healthcare industry. Today, medicines are made for the masses, not for the individual. Going forward, with the help of Big Data, more personalized medicines that use patient-specific data such as genomics and proteomics can be created, based on the profiling of similar patients and their responses to such approaches. [49]
Lin Li, Asif Hassan et al. evaluated the effectiveness of a random forest-based risk adjustment model and the efficiency of a distributed computing framework. The framework they used was Apache Hadoop, with one master server and three slave servers in the initial prototype. They used Apache Mahout’s random forest implementation to train the risk adjustment model on the cluster, building an enormous number of decision trees in parallel. The results show that the random forest significantly outperforms a linear regression model, which indicates the effectiveness of the random forest in identifying complex patterns in high-dimensional patient data and thus illustrates its capability of enhancing risk-adjustment model performance. This also indicates the efficiency of using HDFS and the applications integrated with it. [50]
Ping Jiang, Jonathan Winkley et al. studied and analysed the data produced by wearable sensors, presenting a “Big Data healthcare system for elderly people”. The system connects to remote wrist sensors through mobile phones to monitor the wearers’ well-being. Because of the tremendous number of users involved, collecting real-time sensor information on centralized servers becomes very expensive and difficult. Nevertheless, such a Big Data system can provide rich information to healthcare providers about individuals’ health conditions and their living environment, which indicates the need for Big Data technology for collecting and handling the data produced. [51]
Ruchie Bhardwaj et al. summarise their study of Big Data in genomics by stating that as technology advances, the productivity of the healthcare industry increases and the number of people benefiting from it grows. The five value pathways (right living, right care, right providers, right value and right innovation) define the framework of the new industry. It has been shown that by adapting to this new approach to healthcare, a total of $1 billion can be saved in just one healthcare facility and up to $450 billion across the United States. Additionally, the introduction of devices that gather large amounts of data is key to the industry’s ongoing shift toward evidence-based and preventive medicine. These approaches lead to more successful treatment for patients. The future is bright for this newest intersection of technology and healthcare. [52]
Prof. Jigna Ashish Patel and Dr. Priyanka Sharma write, in their study of Big Data for better health planning, that for healthcare usage and applications, ample patient information and historical data (which enclose rich and significant insights) can be exposed using advanced tools and techniques as well as the latest machine learning algorithms. [13]
Marco Viceconti, Peter Hunter and Rod Hose also put emphasis on using Big Data technologies to analyse stored data and bring out better insights. They state that Big Data technologies have great potential in the domain of computational biomedicine, but that their development should take place in combination with other modelling strategies, not in competition. This will minimise the risk of research investments and ensure a constant improvement of in silico medicine, favouring its clinical adoption. [53]
Apart from the literature above, the websites of IBM, Forbes, IDA, IDC, Amazon and others were browsed extensively, and articles published on these websites by experts have been drawn on in this thesis.
CHAPTER 4
4.1 Big Data in Healthcare:
Chapter 2 gave us a notion of what Big Data is and how Big Data analytics is capable of enhancing, indeed transforming, our lives. This chapter elaborates on Big Data services in health care.
We are already familiar with the fact that the Big Data phenomenon means very large amounts of data (petabytes, exabytes or even greater). For data scientists and analysts it is quite fruitful to use larger datasets and apply better tactics to bring out extraordinary results.
Professor Alex Pentland [19], director of the Human Dynamics Laboratory at MIT (Massachusetts Institute of Technology), says Big Data is turning the process of decision making inside out. Big Data covers multi-structured data formats, and the tools handling Big Data are capable of analysing unstructured data as well. Implementing Big Data in healthcare and in HIT (Health Information Technology) apps would therefore definitely enhance an organisation’s outcomes and change the trend. With the data becoming available, innovators have taken the initiative to build applications that make it easier to share and analyse the information. These advances are starting to improve health care quality and reduce costs. [20]
Employing Big Data and Big Data technology in healthcare will improve patient outcomes and be cost effective; Big Data has great potential to transform health and health care. According to the IDC Health Insights study Status of Clinical Decision Support in China (Feb 14, 2014), clinical efficiency in organisations that have adopted Big Data technology in their CDS systems has increased compared with those that do not use the technology at all. [21]
Leon Xiao, Senior research manager of “Vertical Industry Research and Consulting”, IDC
China, says “Big Data can create a dynamic clinical knowledge base and can help deploy
more complicated models for artificial intelligence in inferring, which makes Clinical
Decision Support Systems more credible and effective”. [22]
The data from search engines and social networks can help to gather people’s reactions and
monitor the conditions of epidemic diseases.[23]
4.2 Big Data issues in Healthcare:
According to the IDC Health Insights study, worldwide health care data grew to 500 petabytes in 2012 and is estimated to reach 25,000 petabytes by 2020. In addition to volume, the data is increasing in variety and complexity, with unstructured data such as medical imaging, videos and social media feeds (tweets, posts etc.) comprising around 85% of the information today. [24]
After researching and studying the various journals and references following is the list of
major issues in healthcare which need to be harmonized:
 A disease like diabetes may be associated with severe conditions such as heart attacks, strokes, eye disease and kidney disease. Analysing the level of risk in a patient’s health condition can be used by physicians at remote locations to serve people. [25]
 Diseases detected at earlier stages can be treated more easily and effectively. In developing countries such as India, it is essential to manage individual and population health and to detect health care fraud more quickly. [25]
 Non-communicable diseases like diabetes are among the major health hazards in India. Transforming the health records of many diabetic patients into useful analysed results will help patients understand the complications likely to occur. [25]
 Medicine involves studying patients’ health conditions and studying different diseases and their patterns. A normal database system is not enough, because the task requires not only retrieving but also interpreting large amounts of data. [26]
 A doctor normally predicts diseases from a patient’s previous records and his own past experience, but he is always liable to make mistakes.
 Health data describing the phenotypes and treatment of patients covers multiple data sources, including medication, laboratory, imaging and narrative data, which are underused and have much greater potential to be unleashed than is currently realized. [27]
 Advancements in new sequencing technologies have resulted in the generation of unprecedented levels of sequencing data [28]. Human DNA comprises approximately 3 billion base pairs, with a personal genome representing approximately 100 gigabytes (GB) of data, the equivalent of 102,400 photos. By the end of 2011, global annual sequencing capacity was estimated at 13 quadrillion bases and counting, enough data to fill a stack of DVDs two miles high. Modern biology therefore presents new challenges in terms of data management, query and analysis. [28]
 Biology’s Big Data sets are now more expensive to store, process and analyse
than they are to generate.
 With the increasing digitization of healthcare, a large amount of healthcare data has been accumulated, and its size is increasing at an unprecedented rate. Discovering deep knowledge and value in this big healthcare data is the key to delivering the best evidence-based, patient-centric and accountable care. [29]
 There are wide varieties of health-related datasets that play a critical role in healthcare. These datasets differ widely in their volume, variety and velocity, from patient-focused sets such as electronic medical records, to population-focused sets such as public health data, and knowledge-focused sets such as drug-to-drug, drug-to-disease and disease-to-disease interaction registries. [30]
 A well-designed predictive analysis system for diabetes treatment, with enhanced data and analytics, may yield the greatest results in healthcare. [25]
 According to the McKinsey report on Big Data, an estimated 150M patients in the
U.S. in 2010 were chronically ill with diseases such as diabetes, congestive heart
failure and hypertension, and they accounted for more than 80% of health system
costs that year. Engaging and educating consumers to make informed decisions
about preventive care and provider networks can improve health and reduce
demand and waste in healthcare.[31]
Another important class of Big Data applications in the healthcare domain is Medical Body Area Networks (MBANs). MBANs enable continuous monitoring of a patient’s condition by sensing and transmitting measurements such as heart rate, electrocardiogram (ECG), body temperature, respiratory rate, chest sounds and blood pressure. MBANs will allow:
 Real-time and historical monitoring of patient’s health;
 Infection control;
 Patient identification and tracking; and
 Geo-fencing and vertical alarming.
To manage and analyse such massive MBAN data from millions of patients in real-time,
healthcare providers will need access to an intelligent and highly secure ICT
infrastructure.[30]
These issues can be managed and analysed by deploying Big Data technologies in healthcare systems. Deploying Big Data analytics alongside health IT applications such as EHR (Electronic Health Records), EMR (Electronic Medical Records), CDSS (Clinical Decision Support Systems) and PHR (Personalized Health Records) will improve health systems.

4.3 Availability of Medical Data/Data collection


Healthcare organizations are laying the foundation for enterprise health analytics enriched with Big Data, to gain a more insightful understanding of patients in the context of who they are and to drive effective resource utilization across the healthcare ecosystem. [31]
Large-scale data can be collected from various sources. Specific to health care, the types of data anticipated to be available for use by BDA (Big Data Analytics) include: [31]
 Genomic data – Represents significant amounts of new gene sequencing data.
 Streamed data – Home monitoring, telehealth, and handheld, sensor-based wireless and smart devices are new data sources and types. They represent significant amounts of real-time data available for use by the health system.
 Web and social networking-based data – Web-based data comes from Google and
other search engines, consumer use of the Internet, as well as data from social
networking sites.
 Health publication and clinical reference data – This includes text-based
publications (clinical research and medical reference material) and clinical text
based reference practice guidelines and health product (e.g., drug information)
data.
 Clinical data – Eighty per cent of health data is unstructured: documents, images, and clinical or transcribed notes. These semi-structured to unstructured clinical records and documents represent new data sources.
 Global health survey data – This includes data collected through periodic global surveys for different types of disease, e.g. the Global Burden of Disease study. Such data can be collected from the open-access Global Health Data Exchange website.
 Hospitals – Multi-facility hospitals can be approached for collecting larger data sets, including cost data and patients’ medical records. Hospitals are the most reliable source offering different varieties of data under one roof.
 WHO (World Health Organisation) – WHO maintains a global data repository that is always accessible and available to researchers. The data from the Global Health repository is in fact a great resource for data analysis.
 Techniques such as deep content analysis, structural content investigation, virtualized script emulation and leveraging machine-learned knowledge can help identify threats early in the cycle, before they attack critical care assets. [32]
Thus the four main sources of data can be categorised as follows:
i. Internet: browsing history, search history, shopping history and social networks.
ii. Mobile Phones: Smartphones embed a remarkable number of sensors, such as GPS, accelerometer, gyroscope, camera and microphone. The ubiquity of the mobile phone and the large amount of data generated by embedded sensors offer new opportunities to characterize and understand users’ real-life behaviours (e.g., human mobility and interaction patterns).
iii. Social Media: Online social networks such as Facebook, Twitter and LinkedIn have gained remarkable attention in the past few years. These online social networks are extremely rich in content. Data from social networks can be collected in the following three ways:
(1) By retrieving data shared on social network websites;
(2) By asking participants about their behaviour; and
(3) Through deployed applications.
iv. Biomedical data from hospitals and the scientific community: Patient data, such as CT (Computed Tomography) images, medical histories and genetic test data, is the foundation for personalized medicine and large cohort studies. In the biomedical area, with the rapid development of sequencing and microarray technologies, tons of sequencing data and molecular profiling data have been generated at low cost, with high accuracy, fast speed and minimal sample requirements. Some large projects, such as the 1000 Genomes Project and ENCODE (Encyclopaedia of DNA Elements), have accumulated large-scale genomic data which is publicly available. These data sets provide benchmarks for method development and performance evaluation in Big Data analysis. [33]
CHAPTER 5
5.1 Big Data Handling and analytics in Healthcare
Analytics is increasingly weaving itself into the fabric of healthcare and will fundamentally shape the future of medicine and care delivery. With opportunities such as improving healthcare efficiency while enhancing the quality of care, mining genetic data, reducing costs, responding effectively to disasters and achieving numerous other goals, the application of analytics is broad and far-reaching. Analysis depends on the context in which it is performed. Clinical care and performance improvement can require very different data perspectives and use the data in unique ways. Clinical analytics involves improving patient care; this type of data is very different from process-oriented data and may include genetic data as well as clinical records, which are often narrative and may be more difficult to analyse on a large scale. Performance data, on the other hand, may be subject to the aforementioned issues of availability and quality. Analysing big medical data sets requires an enormous amount of work, from the collection process through cleaning to the study itself. In addition to clinical data, healthcare data also includes pharmaceutical data (drug molecules and structures, drug targets, other biomolecular data, high-throughput screening data from microarrays, mass spectrometers and sequencers, and clinical trials), data on personal practices and preferences (including dietary habits, exercise patterns and environmental factors), and financial/activity records. Effectively integrating all of this data holds the key to significant improvements in interventions, delivery and well-being. [34]
The cost benefits of Big Data analytics in the healthcare domain are also well acknowledged. A recent study by the McKinsey Global Institute estimates that healthcare analytics could create more than $300 billion in value every year; similar cost efficiencies could also be realized in Europe (estimated at $149 billion). [35]
Predictive analysis can help healthcare providers accurately anticipate and respond to patient needs, providing the ability to make financial and clinical decisions based on predictions made by the system. Such a system can use predictive-analysis algorithms in a Hadoop/MapReduce environment to predict and classify the type of diabetes mellitus (DM), the complications associated with it and the type of treatment to provide. [25]
The platform used for handling and analysing the healthcare data in this work was the open source framework “Hadoop”.
5.2 Hadoop
Hadoop was created by Doug Cutting, the maker of Apache Lucene, and is extensively used as a framework for Big Data [13]. Hadoop is an Apache open source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage [36]. Hadoop is a framework for processing, storing and analysing massive amounts of distributed unstructured data. As its distributed file storage subsystem, the Hadoop Distributed File System (HDFS) was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. [11]

5.3 Working of Hadoop:


Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 64 MB or 128 MB (preferably 128 MB).
 These files are then distributed across various cluster nodes for further processing.
 HDFS, sitting on top of the local file system, supervises the processing.
 Blocks are replicated to handle hardware failure.
 Hadoop checks that the code executed successfully.
 It performs the sort that takes place between the map and reduce stages.
 It sends the sorted data to a certain computer.
 It writes debugging logs for each job.
5.4 Hadoop Architecture

Fig. 5.1 Hadoop Architecture


The Hadoop architecture comprises the following components:
a) Name Node
b) Job Tracker
c) Data Node
d) Task Tracker
e) HDFS
1. Datanode:
The data nodes are the repositories for the data; they consist of multiple smaller database infrastructures horizontally scaled across compute and storage resources through the infrastructure [11]. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system and perform the following operations:
 Perform read-write operations on the file systems, as per client requests.
 Perform operations such as block creation, deletion and replication according to the instructions of the namenode.
 Help reduce data loss and prevent corruption of the file system: the namenode monitors the number of blocks in each datanode, and if any block in a datanode’s replica set is lost or fails, the namenode creates another replica of the same block. [38]
2. Namenode:
It is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The system hosting the namenode acts as the master server and performs the following tasks:
 Manages the file system namespace.
 Maintains the index and location of every data node.
 Regulates clients’ access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
3. JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
4. TaskTracker - Tracks the tasks and reports status to the JobTracker.
5. Block:The data is stored in the files of the HDFS. This file in the file system will be
divided into one or more segments and/or stored in individual data nodes. These file
segments are called as blocks. In other words, the minimum amount of data that
HDFS can read or write is called a Block. The default block size is 64MB, but can be
increased as per the need to change in HDFS configuration.
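For illustration, both the block size and the replication factor are plain HDFS
configuration properties set in hdfs-site.xml. The values below are examples rather than
the settings used in this work; note that the property is named dfs.block.size in older
Hadoop releases and dfs.blocksize in Hadoop 2.x:
<property>
<name>dfs.block.size</name>
<value>134217728</value> <!-- 134217728 bytes = 128 MB -->
</property>
<property>
<name>dfs.replication</name>
<value>3</value> <!-- keep three copies of every block -->
</property>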
5.4.1 HDFS (Hadoop Distributed File System)
HDFS is a fault-tolerant storage system that can store huge amounts of information, scale
up incrementally, and survive storage failures without losing data. HDFS manages storage on
the cluster by breaking files into small blocks and storing duplicated copies of them across
the pool of nodes (commodity hardware/systems). HDFS supports clusters of more than 1,000
nodes under a single operator. HDFS offers two key advantages: first, it requires no special
hardware, as it can be built from commodity hardware; second, it enables an efficient
technique of data processing in the form of MapReduce [37]. Collectively, a group of nodes
is called a "rack", and multiple racks form a "cluster".
5.4.2 MapReduce:
MapReduce is a data processing algorithm that uses a parallel programming implementation.
It is a programming paradigm that distributes a task across multiple nodes by
running a "map" function. The map function takes the problem, splits it into sub-parts and
sends them to different machines so that all the sub-parts can run concurrently. The results
from the parallel map functions are then collected and distributed to a set of servers running
"reduce" functions, which take the results from the sub-parts and recombine them into a
single result (output) [39]. Put more simply, there is a chain of inputs and outputs.

The data to be processed becomes the input for the map function. The output it produces,
i.e. the intermediate results (key/value pairs) generated by mapping, in turn becomes the
input for the reducer.
Fig. 5.2: MapReduce
The reducer aggregates the input from the mapper and produces the final output. As the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it makes it easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives are
called mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to use the MapReduce model. A minimal word-count sketch of
the model follows.
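The sketch below is written against the standard Hadoop Java MapReduce API; the class
names are illustrative, and this is not the program used in this thesis:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: each input line is split into words; every word is emitted with a count of 1
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }
    // Reduce: after the shuffle/sort, all counts for one word arrive together and are summed
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}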
5.4.3 Hadoop YARN
This is a framework for job scheduling and cluster resource management. YARN has a
master/worker architecture: the master (the ResourceManager) manages all the resources on
the workers and schedules work in the cluster. Furthermore, the ResourceManager handles
all the client interactions [40].
5.4.4 Hadoop Common utilities:
These are Java libraries and utilities required by other Hadoop modules.

5.5 Hive
Hive is a data-warehousing solution built on top of the Hadoop environment. It brings
familiar relational-database concepts, such as tables, columns, and partitions to the
unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs executed
using Hadoop. [41] Hive provides an SQL dialect, called Hive Query Language (abbreviated
HiveQL or just HQL) for querying data stored in a Hadoop cluster. Hive communicates with
the JobTracker to initiate the MapReduce job. Hive does not have to be running on the same
master node as the JobTracker. In larger clusters, it is common to have edge nodes where
tools like Hive run. They communicate remotely with the JobTracker on the master node to
execute jobs. Usually, the data files to be processed are in HDFS, which is managed by the
NameNode.
The data, after cleaning, had to be stored in some order and with some shape (schema);
Hive was used for this purpose. However, an alternative was kept ready to handle Hive
failure situations, for example when data comes from multiple sources and complex
transformations must be performed, as was the case here with two datasets: one from
sensors and one from a global survey.
Examples of Hive queries:
Hive query to create a database:
Hive> CREATE DATABASE Hive1;
Hive query to use a database:
Hive> USE Hive1;
Hive query to create a table:
Hive> CREATE TABLE table1 (Field1 DataType, Field2 DataType, Field3 DataType)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Hive query to load a file into a Hive table:
Hive> LOAD DATA LOCAL INPATH '<File_Path>' INTO TABLE <Table_Name>;
Hive query to display the contents of a table:
Hive> SELECT * FROM <Table_Name>;
5.6 Pig:
Pig is described as a data flow language, rather than a query language. It is a platform that
provides a high-level language for expressing programs that analyse large datasets. Pig is
equipped with a compiler that translates Pig programs into sequences of MapReduce jobs that
the Hadoop framework executes [42]. Yahoo, one of the heaviest users of Hadoop (and a
backer of both the Hadoop Core and Pig), runs 40 percent of all its Hadoop jobs with Pig.
Twitter is another well-known user of Pig [42].

Pig has two major components: [43]

1. A high-level data processing language called Pig Latin.
2. A compiler that compiles and runs your Pig Latin script in a choice of evaluation
mechanisms. The main evaluation mechanism is Hadoop; Pig also supports a local
mode for development purposes.

Pig queries to load and display a file:
> a = LOAD '<File_Name>' USING PigStorage(',') AS (Field1:DataType, Field2:DataType,
Field3:DataType);
> DUMP a; (displays the contents of the file)
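Continuing the alias 'a' loaded above, the short hedged sketch below illustrates the
data-flow style; the field names and the filter value are placeholders, not the thesis's
actual script:
> b = FILTER a BY Field1 == 'SomeValue';
> c = FOREACH b GENERATE Field1, Field2;
> DUMP c;
Each statement derives a new relation from the previous one, which is what makes Pig a
data flow language rather than a query language.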

5.7 The “R” Environment[44]


R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. Among other things it has
 an effective data handling and storage facility,
 a suite of operators for calculations on arrays, in particular matrices,
 a large, coherent, integrated collection of intermediate tools for data analysis,
 graphical facilities for data analysis and display, either directly at the computer or on
hardcopy, and
 a well-developed, simple and effective programming language (called 'S') which
includes conditionals, loops, user-defined recursive functions, and input and output
facilities.
R is very much a vehicle for newly developing methods of interactive data analysis. It has
developed rapidly, and has been extended by a large collection of packages.
Graphical facilities are an important and extremely versatile component of the R
environment. R plotting commands can be used to produce a variety of graphical displays and
to create entirely new kinds of display.
Plotting commands are divided into three basic groups (a small sketch follows the list):
 High-level plotting functions create a new plot on the graphics device, possibly with
axes, labels, titles and so on.
 Low-level plotting functions add more information to an existing plot, such as extra
points, lines and labels.
 Interactive graphics functions allow you to interactively add information to, or extract
information from, an existing plot, using a pointing device such as a mouse.
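The following small R sketch illustrates the three groups; the numbers are invented purely
for illustration:
> x <- c(1990, 2005, 2010)
> y <- c(150000, 90000, 70000)      # hypothetical death counts
> plot(x, y, type = "b")            # high-level: creates a new plot
> abline(h = mean(y), col = "blue") # low-level: adds a line to the existing plot
> identify(x, y)                    # interactive: click points to label them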

CHAPTER 6
6.1 Analysis methodology
The medical data sets for analysis were collected from the GHDx (Global Health Data
Exchange). The Global Health Data Exchange is a data catalogue created and supported
by IHME (Institute for Health Metrics and Evaluation).[94]
This data set comes from the "Global Burden of Diseases" (GBD) study for the years 1990
and 2010. The GBD Study 2010 estimated the burden of diseases, injuries, and risk factors
globally and for 21 regions for 1990 and 2010.
This dataset, delimited with commas (.csv), provides four metrics with uncertainty
intervals for 291 diseases and injuries: deaths, years of life lost (YLLs), years lived
with disability (YLDs), and disability-adjusted life years (DALYs) by region, age, and
sex.
6.1.1 Type and structure of the data:
The data was in .csv format; we made it more unstructured in order to test the reliability
of Hadoop on unstructured datasets, and saved it again in tab-delimited format. Below is a
screenshot of the data depicting its structure and format.
Fig. 6.1: Screenshot of the Data

6.1.2 Processing and cleaning in Hadoop:


The data was first copied into the Cloudera environment and then loaded into HDFS. The
analysis process comprises three main steps:
 Pre-processing
 Cleaning
 Visualisation
The data set contained values with symbols such as commas (","), parentheses "()" and
hyphens "-", and certain characters that made it impossible to plot the graphs. Moreover,
after cleaning, the data was to be stored in the Hive database in tabular form. (The HDFS
load step is sketched below.)
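Loading a local file into HDFS is typically done with the hadoop fs utility; the directory
and file names below are illustrative rather than the exact paths used:
]$ hadoop fs -mkdir /user/cloudera/gbd
]$ hadoop fs -put GBD_data.txt /user/cloudera/gbd/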
6.1.3 Mapping and reducing process
To clean the data, the MapReduce paradigm was used. It removed all the obstructions that
would have hindered the visualisation process. The MapReduce program was written in Java
(which comes integrated with Cloudera). Shown below is a screenshot of the MapReduce
cleaning program.

Fig. 6.2: MapReduce program for Cleaning
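Since the cleaning program appears here only as a screenshot, the following is a hedged
reconstruction of what such a mapper might look like rather than the thesis's exact code:
a map-only job whose mapper strips the offending symbols from each record (imports as in
the earlier word-count sketch, plus org.apache.hadoop.io.NullWritable):

public static class CleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed cleaning rule: drop the commas, parentheses and hyphens that block plotting
        String cleaned = value.toString().replaceAll("[(),-]", "");
        context.write(new Text(cleaned), NullWritable.get());
    }
}
// In the driver, job.setNumReduceTasks(0) makes this a map-only job,
// so the cleaned records are written straight back to HDFS.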

6.1.4 Execution of the .jar file in Hadoop

The .jar file of the MapReduce program was then executed in the Hadoop terminal. The data
in HDFS was changed accordingly and looked as shown below.
Syntax:
]$ hadoop jar <Path_Of_Jar_File> <Package_Name>.<ClassName> <Input_File>
<Output_File>
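A hypothetical invocation following this syntax (every name below is a placeholder) would
be:
]$ hadoop jar /home/cloudera/Clean.jar com.example.clean.CleanDriver
/user/cloudera/gbd/GBD_data.txt /user/cloudera/gbd/cleaned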
Fig. 6.3: Screenshot of the HDFS
The GBD dataset contained data values for 291 diseases, globally and for 21 regions. It
was necessary to categorise these diseases on the basis of their propagation mode and
the organ they affect. A separate table was created in MS Excel, with two columns
containing 34 disease categories and a category code assigned to each disease
respectively.

6.2 Manipulation in Hive:

This table was then joined to the data table in Hive using an HQL data manipulation
command; a hedged sketch of such a join is given below. The final data set in Hive
contained the 291 diseases with a corresponding disease category and category code; these
two columns took our data from 31 columns to 33 columns.
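In the sketch below, the table and column names (gbd_data, disease_categories,
disease_name, etc.) are invented for illustration and are not the exact schema used:
Hive> CREATE TABLE gbd_categorised AS
SELECT g.*, c.disease_category, c.category_code
FROM gbd_data g JOIN disease_categories c
ON (g.disease_name = c.disease_name);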
The screenshot of the data set after all the manipulation in Hive is as follows:

Fig. 6.4: Screen shot after the addition of the two other columns
We selected three disease categories, viz.:
 Cancerous
 Contagious
 Diabetes mellitus
in the following three regions:
 North Africa and Middle East (Egypt, Iran, Iraq, Saudi Arabia, Morocco etc.)
 South Asia (India, Pakistan, Afghanistan etc.)
 East Asia (Mongolia, China, South Korea, Japan etc.)
We calculated and plotted the mortality caused by these three disease categories in the
mentioned regions by:
 Age group (0-30, 30-60 and 60-77)
 Gender (male, female and both)
 Year (1990, 2005 and 2010)
The calculation was done in Hive and the plotting was done using R.
Syntax for plotting:
]$ R
> res <- read.csv("DataFileName", sep = ",", header = FALSE)
> barplot(Y_Axis_Data_Frame, names.arg = c(X_Axis_Data_Frame), col = c("red", "blue"))
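For the grouped bar charts shown in the next section, a fuller hedged sketch is the
following; the matrix values are invented for illustration, and beside = TRUE draws the
three categories side by side within each year:
> deaths <- matrix(c(150000, 40000, 20000,
+                    90000, 45000, 22000,
+                    70000, 47000, 25000),
+                  nrow = 3, byrow = TRUE,
+                  dimnames = list(c("1990", "2005", "2010"),
+                                  c("Contagious", "Cancerous", "Diabetes")))
> barplot(t(deaths), beside = TRUE, legend.text = TRUE,
+         col = c("pink", "red", "blue"))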

6.3 GRAPHS:
Following are the different graphs with the corresponding variables and constants:

The above bar chart shows the mortality caused by six disease categories for different age
groups.
Result 1:
In three age groups, viz. "0-13", "52-65" and "65-77", the maximum deaths are caused by
heart diseases, with the age group "0-13" being the most affected. Next to heart diseases,
the diseases causing the most mortality in the age group "0-13" are contagious diseases,
while in the age groups "52-65" and "65-77" the cancerous diseases remain a threat after
heart disease.
It is to be noted here that cancer and heart disease begin to affect people from the age of 39.
Region: Global
Result 2:
The graph shows the mortality caused by the three disease categories from 1990 to 2010. As
indicated, contagious diseases were dominant in 1990, followed by cancer. This has been
brought under control from 2005 onwards, but cancer remains a matter of concern.
Region: Global
Years: All
Result 3:
This graph indicates the deaths caused in both genders. As indicated, males were affected
the most by contagious diseases: more than 250000x3 males died of contagious diseases,
with 20000 deaths caused by cancer in the various regions. In females, contagious diseases
again caused the most mortality, around 200000x3.

Region: Global
Years: All
Gender: All
Result 4:
The above graph indicates the mortality caused in three aggregated age groups, i.e. "0-30",
"30-60" and "60-77". The visualisation shows that contagious diseases remain fatal from
infancy to the 30s; after 30 years, however, mortality caused by cancerous diseases
increases, and in the 60-77 age group most of the deaths are caused by cancer and
communicable diseases.

Region: East Asia, North Africa & Middle East (MENA), South Asia
Years: All
Gender: All
Result 5:
The above visualisation shows the mortality caused in the three regions. As indicated, East
Asian countries are most affected by cancerous diseases, followed by contagious ones;
around 50000x3 deaths were caused by these two categories from 1990 to 2010.
In MENA, the category affecting the region the most is contagious diseases, followed by
cancer. Compared to East Asia, all three disease categories show higher mortality in MENA.
In South Asian countries, such as India, Pakistan and Afghanistan, contagious diseases have
caused around 150000 deaths. However, the cancerous diseases and diabetes mellitus show
almost the same range as in the other regions.

Region: North Africa and Middle East. (Iraq, Iran, Algeria, Arab countries, Sudan etc.)
Age Group: All
Gender: All
Result 6:
As shown, contagious diseases caused more deaths in 1990, but over time the death rate has
come down at a good rate, while the deaths caused by cancerous diseases have remained
approximately the same.

Region: MENA (Iraq, Iran, Algeria, Arab countries, Sudan etc.)


Age Group: All
Years: All
Result 7:
This bar plot shows the deaths caused in the different genders. As we saw in Graph 5,
contagious diseases have caused the most deaths. Here it is shown that males were the most
affected, by both communicable and cancerous diseases.
Region: MENA (Iraq, Iran, Algeria, Arab countries, Sudan etc.)
Years: All
Gender: All
Result 8:
We have seen the deaths caused in particular years and in particular genders in MENA. This
bar plot gives a notion of the death rate in each age group. As indicated, the population
of the age group "0-30" is the most affected, while in the third age group cancer is the
dominant disease category, causing the most deaths.
Region: East Asia ( China, Mongolia, Japan, South Korea etc.)
Age Group: All
Gender: All
Result 9:
The bar plot shows the mortality caused in East Asian countries in different years. As we
saw in Graph 5, cancer remains the largest cause of death, with 1990 most affected by
cancerous diseases, followed by contagious diseases. Mortality caused by contagious
diseases has since decreased at a good rate, but cancerous diseases have not shown a
comparable decrease. Another thing to be noted here is that diabetes shows an increase
after 2005.
Region: East Asia
Age group: All
Years: All
Result 10:
As indicated, the male population in East Asian countries has been the most affected, by
both cancerous and contagious diseases. Diabetes shows a constant range.
Region: East Asia
Gender: All
Years: All
Result 11:
As indicated by the earlier graphs for the East Asian countries, the cancerous diseases,
which have caused the most deaths in the region, affect the "30-60" and "60-77" age groups,
whereas the contagious diseases affect the "0-30" age group, followed by "60-77" and, to a
lesser extent, "30-60". The death rate caused by diabetes mellitus is almost constant for
the first two age groups but rises in the last one.
Region: South Asia (India, Pakistan, Afghanistan etc.)
Age group: All
Years: All
Result 12:
The bar plot indicates that the mortality caused by contagious diseases has been
tremendously high in South Asian countries like India and Pakistan. The deaths caused by
this category soared in 1990 but have shown a gradual decrease from 2005. The mortality
caused by the other two disease categories, i.e. cancerous and diabetes, is almost equal
across all three years.
Region: South Asia (India, Pakistan, Afghanistan)
Age Group: All
Years: All
Result 13:
As shown in the above bar plot, the male population has been affected to a higher degree.
The female death rate, though lower, is not too far from the male death rate. The cancerous
and diabetes categories are almost the same in both genders.
Region: South Asia
Gender: All
Years: All
Result 14:
Shown above is the death rate in South Asian countries for the three disease categories. We
have already analysed that contagious diseases have caused the most deaths in South Asian
countries, with the male population affected worst. This graph indicates that in the age
group of 0-30 yrs. the maximum deaths have been caused by communicable diseases. Next
comes the "60-77" age group, where again the pink bar has the larger share of deaths,
followed by the cancerous diseases. The "30-60" age group has not been affected as much as
the other two.
CONCLUSION
Big Data is an emerging technology for analytics in many fields. It is mostly used for
business analytics, but more and more research is under way on its use in other fields
where the data is huge.
I have taken healthcare as the field for analytics using Big Data. Within healthcare, I have
taken data on diseases in different parts of the world. Knowing the different parameters of
the diseases, apart from medical diagnosis and causal agents, Big Data lets us examine the
other factors due to which a disease is more frequent in one age group than another, the
occurrence of a disease in a particular area, and regional and other factors.
I have used Hadoop as the tool for the multi-disease analytics. A few parameters were taken
and the results are displayed in the thesis, towards better healthcare.
REFERENCES
[1] “What is Big Data”, http://www.ibm.com/big-data/us/en/, last accessed 03-04-2015.
[2] “Big Data 3 V’s: Volume, Variety, Velocity (Infographic)”,
http://whatsthebigdata.com/2013/07/25/big-data-3-vs-volume-variety-velocity-infographic/,
last accessed 03-04-2015.
[3] “Big Data”,
https://www.ida.gov.sg/~/media/Files/Infocomm%20Landscape/Technology/
TechnologyRoadmap/BigData.pdf. Last accessed 05-04-2015.
[4] “What is the big data? 12 Definitions”, http://whatsthebigdata.com/2014/09/08/whats-the-
big-data-12-definitions/, last accessed 05-04-2015.
[5] http://en.wikipedia.org/wiki/Big_data, last accessed 05-04-2015.
[6] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, Angela Hung Byers, (2011), “Big data: The next frontier for innovation,
competition, and productivity”,
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_inn
ovation., last accessed on 05-05-2015
[7] Edd Dumbill. What is big data? [Online] Available from:
http://radar.oreilly.com/2012/01/what-is-big-data.html.
[8] Carl W. Olofson & Dan Vesset, (2012), “Big Data: Trends, Strategies, and SAP
Technology”, http://www.sap.com/bin/sapcom/de_at/downloadasset.2012-09-sep-
26-13.idc-report--big-data-trends-strategies-and-sap-technology-pdf.html, last accessed
06-04-2015.
[9] Richard L. Villars, Carl W. Olofson, Matthew Eastwood, (2011), “Big Data: What It Is
and Why You Should Care”,
http://www.admin-magazine.com/HPC/content/download/5604/49345/file/IDC_Big
%20Data_whitepaper_final.pdf last accessed 07-01-2015.
[10] IBM, “What Is Big Data”, http://www-01.ibm.com/software/data/bigdata/what-is-big-
data.html.
[11] Juniper Networks, (2012), “Introduction to Big data: Infrastructure and network
considerations. http://www.juniper.net/us/en/local/pdf/whitepapers/2000488-en.pdf , last
accessed 08-04-2015
[12] Yuri Demchenko, Cees de Laat, Peter Membrey, “Defining Architecture Components of
the Big Data Ecosystem”, IEEE.
[13] Prof. Jigna Ashish Patel, Dr. Priyanka Sharma, “Big Data for better health planning”,
IEEE International Conference on Advances in Engineering & Technology Research
(ICAETR- 2014), August 01-02, 2014, Dr. Virendra Swarup Group of institutions, Unnao,
India.
[14] Alejendro Zarate Sentovena,(2013) “Big Data: evolution, components, challenges and
opportunities”, Massachusetts Institute of Technology Libraries.
[15]Wikipedia, “Applications of Big Data”,
http://en.wikipedia.org/wiki/Big_data#Applications. Last accessed 10-04-2015
[16] IHTT (Institute for Health Technology Transformation), “Transforming Health Care
Through Big Data: Strategies for leveraging big data in the health care industry”, (2013)
[17] Wikipedia, “Typhoon Sanba (2012)”, http://en.wikipedia.org/wiki/Typhoon_Sanba_
%282012%29. Last accessed 25-05-2015
[18] Steve Hamm, IBM,(2013), “How Big Data Can Boost Weather Forecasting”,
http://www.wired.com/2013/02/how-big-data-can-boost-weather-forecasting. Last accessed
25-05-2015
[19] Paul Nannette, “The deciding factor: Big Data and Decision Making”, Capgemini.
[20] Peter Groves, Basel Kayyali, David Knott, Steve Van Kuiken, (2013), “The Big Data
revolution in Healthcare”, McKinsey and Company.
[21] IDC, “Health Insight Studies of China”, available at www.idc.com
[22] Leon Xiao, Judy Hanover, Sash Mukherjee, (2014), IDC, “Big Data Enables Clinical
Decision Support in Hospital Settings”, available at:
http://www.idc.com/getdoc.jsp?containerId=CN245651, last accessed 25-05-2015.
[23] WHO (World Health Organisation), “Global Infectious disease surveillance”, available
at http://www.who.int/mediacentre/factsheets/fs200/en/
[24] John Carew, Bob Gladden, Graham Hughes, Charles Schmitt ,(2013), Intel, “Care
Customization: applying big data to clinical analytics and life sciences”. Available at
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/health-
innovation-summit-big-data-white-paper.pdf.
[25] Dr Saravana Kumar N M, Eswari T, Sampath P & Lavanya S, “Predictive Methodology
for Diabetic Data Analysis in Big Data”, 2nd International Symposium on Big Data and
Cloud Computing (ISBCC’15), Science Direct.
[26] Srivathsan M,Yogesh Arjun K, “Health Monitoring System by Prognotive Computing
using Big Data Analytics”, 2nd ISBCC 2015, Procedia Computer Science 50 ( 2015 ) 602 –
609 available at Science Direct.
[27] Ji-Jiang Yang, Jianqiang Li, Jacob Mulder , Yongcai Wang, Shi Chen ,
Hong Wu, Qing Wang, Hui Pan, “Emerging information technologies for enhanced
healthcare”, Computers in Industry, available at www.sciencedirect.com.
[28] Aisling O’Driscoll, Jurate Daugelaite, Roy D. Sleator, “‘Big data’, Hadoop and cloud
computing in genomics”, Journal of Biomedical informatics 46 (2013) 774–781, available at
science direct.
[29] IEEE, “Mobile and Cloud Systems – Challenges and Applications”, MediComp 2015:
2nd IEEE International Workshop on Medical Computing.
[30] “Software Tools and Techniques for Big Data Computing in
Healthcare Clouds”, Future Generation Computer Systems 43–44 (2015) 38–39, available at
www.sciencedirect.com
[31] IBM, “Harness your Data resources in Healthcare”,
http://www-01.ibm.com/software/data/bigdata/industry-healthcare.html. Last accessed 25-
05-2015

[32] Rajesh Vargheese, “Dynamic Protection for Critical Health Care Systems Using Cisco
CWS”, Fifth International Conference on Computing for Geospatial Research and
Application (2014), IEEE.
[33] Simon Tripp and Martin Grueber, (2011), available at: http://battelle.org/docs/default-
document-library/economic_impact_of_the_human_genome_project.pdf.
[34] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, Ananth Grama, “Trends in big data
analytics”, J. Parallel Distrib. Comput. 74 (2014) 2561–2573, available at Science Direct.

[35] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, Angela Hung Byers, (2011), ”Big data: The next frontier for innovation,
competition, and productivity”, McKinsey Global, available at:
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_in
novation.
[36] “What is Apache Hadoop?”, https://hadoop.apache.org/
[37] Sujata A. Pardeshi, Pooja K. Akulwar, “Escalation the Power of Big Data”, ISBN: 978-81-
927230-0-6

[38] Suresh Lakavath, Ramlal Naik L., “A Big Data Hadoop Architecture for Online
Analysis”, - International Journal of Computer Science and Information Technology &
Security (IJCSITS), ISSN: 2249-9555, Vol. 4, No.6,(2014)
[39] Dr. M Moorthy, R. Baby, S. Senthamaraiselvi, “An Analysis for Big Data and its
Technologies”, IJCSET(2014), ISSN: 2231-0711
[40] “How Apache Hadoop Yarn works”, http://blog.cloudera.com/blog/2014/05/how-
apache-hadoop-yarn-ha-works/.
[41] Praveen Murthy, Anurag Bhardwaj, P. A. Subrahmanyam, Arnab Roy, Sree Rajan, “Big
Data Taxonomy”, (2014), Cloud Security Alliance; O’Reilly, “Programming Hive”.
[42] Anurag Bhardwaj et al., “Big Data Taxonomy”, (2014), Cloud Security Alliance,
available at www.cloudsecurityalliance.org; “How Pig Works”.
[43] Pig Tutorial, available at : www.tutorialspoint.com
[44] “R: The R project for statistical computing”, http://www.r-project.org/.
[45] Markus Maier, Master’s Thesis, (2013), “Towards a Big data reference architecture”,
Eindhoven University of Technology.
[46] Bernard Marr, (2015), Forbes, “How Big Data is Changing Healthcare”, available at
http://www.forbes.com/sites/bernardmarr/2015/04/21/how-big-data-is-changing-healthcare/

[47] IBM study of 1,734 chief marketing officers from 64 countries, “Every day we create
2.5 quintillion bytes of data”, available at: http://www.storagenewsletter.com/rubriques/market-
reportsresearch/ibm-cmo-study/
[48] Ming Yang, Melody Kiang, Wei Shang, “Filtering big data from social media –
Building an early warning system for adverse drug reactions”, available at science direct.
[49] Raghunath Nambiar, Adhiraaj Sethi, Ruchie Bhardwaj, Rajesh Vargheese, “A Look at
Challenges and Opportunities of Big Data Analytics in Healthcare”, 2013 IEEE International
Conference on Big Data.
[50] Lin Li et al., “Risk Adjustment of Patient Expenditures: A Big Data Analytics
Approach”, 2013 IEEE International Conference on Big Data.
[51] Ping Jiang et al., “An Intelligent Information Forwarder for Healthcare Big Data
Systems with Distributed Wearable Sensors”, IEEE Systems Journal.
[52] Ruchie Bhardwaj, Adhiraaj Sethi, Raghunath Nambiar, “Big Data in Genomics: An
Overview”, 2014, IEEE International Conference on Big Data.
[53] Marco Viceconti, Peter Hunter, and Rod Hose, “Big data, big knowledge: big data for
personalised healthcare”. JBHI-00566-2014, IEEE Journal of Biomedical and Health
Informatics.
