
IEEE Sponsored 2nd International Conference on Innovations in Information Embedded and Communication Systems

ICIIECS’15

Big Data Solutions in Healthcare: Problems and Perspectives

Prabha Susy Mathew
School of Computing Sciences, Hindustan University, Chennai, India
prabhasm@hindustanuniv.ac.in

Dr. Anitha S. Pillai
School of Computing Sciences, Hindustan University, Chennai, India
anithasp@hindustanuniv.ac.in

Abstract- Data in the healthcare sector is growing beyond the handling capacity of health care organizations and is expected to increase significantly in the coming years. The majority of healthcare data is unstructured, exists in silos, and resides in imaging systems, medical prescription notes, insurance claims data, EPRs (Electronic Patient Records), etc. Integrating these heterogeneous data and factoring them into advanced analytics is critical to improving healthcare outcomes. Either because data are isolated in disparate or incompatible formats, or because they lack the processing capability to load and query large datasets in a timely fashion, healthcare organizations are not in a position to leverage the vast data they have. With the convergence of advanced computing and numerous Big Data technology options such as commercial solutions, open source and cloud, it is now possible to attain high performance and scalability at a relatively low cost. Big Data solutions often come with a set of innovative data management solutions and analytical tools which, when effectively implemented, can transform healthcare outcomes.

Index Terms: Analytics, Big Data, Cloud, Data Management, Healthcare, Open Source.

I. INTRODUCTION

Until recently, the health care industry relied on a conservative approach to diagnosis and treatment, in which most doctors depended on their individual knowledge and skills to diagnose diseases, resulting in care that was less precise and less patient centric. Digitization, rising rates of chronic disease, population growth, advances in technology, the need for evidence based medicine, and the inability to process and gain insight from ever increasing heterogeneous medical data are some of the drivers for adopting Big Data solutions.

Adoption of Big Data solutions will play an important role in transforming the outcomes of the health care industry by promoting evidence based reasoning in treatment and by enabling patient centric treatment through a 360-degree view of each patient. Most stakeholders are now starting to embrace the concept of evidence based medicine, a system in which treatment decisions are based on the available scientific evidence as well as on the doctor's skill and knowledge, providing a measurable outcome for treatment.

Digitized data in the health care sector is growing massively, with data coming in from internal as well as external sources: mobile devices, wearable sensor devices [1, 3], Electronic Health Records (EHR), radiology images, videos, clinical notes, social media, blogs, remote health monitoring devices, etc. Newer forms of big data, such as imaging and sensor readings, are also fueling the need for Big Data solutions to manage the massive, siloed data available in the healthcare industry. The health care industry needs to work on prediction, prevention and personalization to improve its outcomes.

The sheer amount of structured, semi-structured and unstructured data is a characteristic that makes health care data especially challenging. Most of the data in health care comes from sources such as X-ray images, MRI scan reports, blood test values, handwritten prescriptions, and real time data such as operating theater (OT) monitors for anesthesia, heart monitors and blood pressure readings [8]. Health care data comes from both internal and external sources, chiefly:
1. Providers: medical data (EHRs, EPRs)
2. Payers: claims and cost data
3. Researchers: academic, independent
4. Consumers and Marketers: patient behavior and sentiment data
5. Government: population and public health data
6. Developers: pharmacy and medical device R&D

The five key characteristics that define big data are:
1. Volume: Data is continuously generated in large volumes from real time health monitoring systems, EHRs, EPRs, labs, sensor devices, etc.
2. Velocity: Streaming data, such as that from remote patient monitoring, sensor devices and telemedicine, needs to be processed in real time.
3. Variety: Data can be structured, semi-structured or unstructured, and is collected from different sources such as patient/member conversations, health community blogs, social media, etc.
4. Veracity: Deals with the quality of the data being captured.
5. Value: The most important V of big data; deals with extracting value from the data.
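To make the Velocity characteristic above more concrete, the following is a minimal sketch, not a clinical implementation, of processing readings one at a time as they arrive from a remote monitoring stream. The function name `rolling_alerts`, the window size and the threshold are illustrative assumptions and are not taken from any system cited in this paper.

```python
from collections import deque
from statistics import mean


def rolling_alerts(readings, window=3, threshold=100.0):
    """Yield (index, window_mean) whenever the mean of the last
    `window` readings exceeds `threshold`.

    `window` and `threshold` are hypothetical values chosen for
    illustration only; they carry no clinical meaning.
    """
    recent = deque(maxlen=window)  # keeps only the newest readings
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            window_mean = mean(recent)
            if window_mean > threshold:
                yield i, window_mean


# Made-up heart-rate samples standing in for a sensor stream.
heart_rate = [72, 75, 90, 110, 120, 95, 80]
alerts = list(rolling_alerts(heart_rate))
alert_positions = [i for i, _ in alerts]
```

Because the generator holds only the last few readings in memory, the same loop could in principle be fed from an unbounded stream rather than a list.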

978-1-4799-6818-3/15/$31.00 © 2015 IEEE


This paper focuses on big data problems, perspectives and solutions with respect to the health care industry. It discusses different big data solutions as well as a framework for handling heterogeneous clinical data which, if implemented effectively, could prove helpful in decision making for patient care and in improving overall health care outcomes.

II. RELATED WORK

The role of Big Data in sectors such as banking and retail is well established, and health care systems have recently shown a readiness to transform themselves and leverage the benefits that Big Data solutions provide.

Ping Jiang et al. [1] presented a system of wearable sensors capable of continuously monitoring the elderly and forwarding the data to big data systems. A big data system must manage a high volume, velocity and variety of information from different sources. To address this challenge, they introduced a wearable sensor system with an intelligent data forwarder that adopts a Hidden Markov model for human behavior recognition, so that only meaningful data is fed to the system. The sensor readings and states of a user are sent to Big Data analytics to improve and personalize the quality of care.

Abdelhmid Salih Mohmed Salih et al. [3] evaluate ensemble design and the combination of different algorithms to develop a novel intelligent ensemble health care and decision system that monitors health using wearable sensors. Their Novel Intelligent Ensemble method is based on meta classifier voting that combines three base classifiers: the J48, Random Forest and Random Tree algorithms.

Varun Chandola et al. [12] analyzed health care data using social network analysis, temporal analysis, text mining and higher order feature construction to understand how each of these contributes to understanding the healthcare domain. Temporal analysis methods are used because they do not require a trained classifier to identify anomalies, and they offer a timely technique for detecting transient billing practices that are anomalous.

Keith Feldman et al. [2] focus on the performance limitations of the prediction engine CARE (Collaborative Assessment and Recommendation Engine). To address the computation time of the CARE algorithm, two methods were devised: the first uses a single patient version of CARE to perform disease risk ranking on demand with a high degree of accuracy, and the second distributes the computation of the CARE algorithm for nightly batch jobs on large patient datasets.

Gautham Vemuganti [13] discusses the importance of metadata management in leveraging big data analytics and briefly describes a framework for metadata management.

Muni Kumar N et al. [4] discuss how Big Data Analytics can help transform rural healthcare by gaining insights from clinical data and supporting the right decisions.

III. BIG DATA TOOLS AND TECHNOLOGIES

The big data domain offers two options: open source solutions or commercial solutions. Some of the key product types are Hadoop based analytics, data warehouses for operational insights, and stream computing software for real time analysis of streaming data. The Apache Hadoop framework is an open source solution, followed by NoSQL databases such as Cassandra, MongoDB, DynamoDB, Neo4j (from Neo Technology) and Couchbase [5], [8], [9], [16], [22].
• HDFS, the Hadoop Distributed File System, enables storage of large files by distributing data among pools of data nodes.
• MapReduce is a programming model that allows job execution to scale massively across a cluster of servers.
• Apache Hadoop YARN (Yet Another Resource Negotiator) separates the resource management and processing components.
• HBase is a non-relational data management solution that distributes massive datasets over the underlying Hadoop framework. HBase, derived from Google's BigTable, uses a column-oriented data layout and provides a fault tolerant storage solution.
• Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster.
• Zookeeper and Chukwa are used to manage and monitor distributed applications that run on Hadoop.
• PIG consists of a "Perl-like" language, instead of a "SQL-like" language, that allows for query execution over data stored on a Hadoop cluster.
• Sqoop is used for transferring data between relational databases and Hadoop.
• Mahout is an Apache project used to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop framework.
• SkyTree is a high-performance machine learning and data analytics platform that focuses specifically on handling Big Data.
• Storm can be used for stream processing of information coming from social media feeds or sensor data.
• Apache Solr can be used for searching semi-structured data and documents.
• Apache Drill is a distributed system for interactive ad-hoc analysis of large-scale datasets. Designed to handle data spread across thousands of servers, Drill aims to respond to ad-hoc queries with low latency.
• SAP HANA and Hadoop can be used to manage massive and complex data volumes at high speeds.
• Visual analytics tools include Splunk, Datameer, Jaspersoft, Tableau, Karmasphere, Pentaho, Hadapt, HP Vertica, Teradata Aster solutions, etc.
• Predictive analytics tools include Revolution Analytics R, Zementis with Datameer, Spotfire Miner, Mahout, Rapid Miner, Oracle Data Miner, Statistica, IBM SPSS, SAS Enterprise Miner, etc.
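As a rough illustration of the MapReduce programming model named in the list above, the sketch below counts diagnoses across a handful of patient records using explicit map, shuffle and reduce steps. It runs in a single Python process and does not use the Hadoop API; the record layout and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions made for this example. On a real cluster the framework would run the map and reduce functions in parallel across nodes and perform the shuffle itself.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (diagnosis, 1) pair for every diagnosis in a record."""
    for diagnosis in record["diagnoses"]:
        yield diagnosis, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


# Hypothetical records; the field names are illustrative only.
records = [
    {"patient": "p1", "diagnoses": ["diabetes", "hypertension"]},
    {"patient": "p2", "diagnoses": ["hypertension"]},
    {"patient": "p3", "diagnoses": ["asthma", "hypertension"]},
]

pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

The point of the model is that `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, which is what lets the work be split across many servers.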
NoSQL Databases: NoSQL databases handle unstructured and semi-structured data. The main types are:

Key Value Pair Databases: These store data as a record structure comprising a key and its content. They are used for applications that involve large volumes of data, flexible data structures and high-speed transactions. Some of the most widely used key value pair databases are Redis, MemcacheDB, Berkeley DB, Voldemort, Amazon's Dynamo, and Riak.

Document Store Databases: These specialize in storing, parsing and processing objects, typically using a lightweight structure such as JSON. Because document style databases are schema-less, adding fields to a JSON document is easy. CouchDB and MongoDB are the most popular document based databases.

Column Store Databases: Columnar databases store data in columns instead of rows. The advantages are that the data can be highly compressed, is self-indexing, and uses less space than in an RDBMS. Examples of columnar databases are HBase, Cassandra and Google's BigTable.

Graph Based Databases: These do not follow the rigid formats of SQL or the table and column representation. Instead, they use nodes and edges to represent and store data. Some of the popular graph based databases are Neo4j, FlockDB, AllegroGraph, and InfiniteGraph. Graph based databases are well suited to Internet of Things applications.

Unified information access: UIA is a technology that allows ad hoc interrogation of multiple content sources in a highly targeted and specific manner. It is a paradigm for querying structured and unstructured data. UIA technology converges and integrates previously isolated structured and unstructured data, enabling better insights from the data available.

Other: There are several other NoSQL databases that vary in how they store and process data or in the types of applications they are intended to support.

Many tools for data mining and analysis are available in the market because of the constant demand to analyze large data sets and gain valuable insights from them. This section gives a brief review of top analytics tools as listed by KDNuggets [24] and the Predictive Analytics survey [25].

R Tool: a language and environment for statistical computation. It is free and open source software. R has very powerful graphics capabilities and works very well with Hadoop. It compiles and runs on a wide variety of platforms such as UNIX, Windows and MacOS, and performs complex data analysis at low cost.

Rattle GUI: a data mining suite based on the open source statistical language R that includes graphics, clustering, modeling, and more. The Rattle package allows data miners to use R without needing to know the associated programming language.

Excel: Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. Many data mining tasks can be accomplished in Excel with the help of a suitable add-on. Some of the add-on tools for Excel are the following. Ant Model Builder can be used with minimum training to predict patterns in data. Alyuda ForecasterXL implements neural networks in Excel and provides various graphical and analytical displays. DataMinerXL is a good option for people familiar with data mining techniques; it supports the creation of predictive models using a wide variety of techniques. XLMiner provides a complete data mining capability with data preparation tools, support for time series analysis, and visualization tools.

Rapid Miner: written in the Java programming language. It provides a template based framework for advanced data analytics. It is a code free advanced analytics platform that can execute in-memory, in-cloud, in-database, in-stream and in-Hadoop.

Weka: a collection of machine learning algorithms for solving data mining problems. It is based on Java and can run on almost any platform. It supports many data mining tasks such as preprocessing, clustering, classification, regression and visualization.

KNIME: Konstanz Information Miner is an open source data analytics, reporting and integration platform. KNIME covers the main steps of data preprocessing, such as extraction, transformation and loading. It is written in Java and works on Eclipse.

Orange: a Python based tool. It is component based data mining and machine learning software, and it supports add-ons for bioinformatics and text mining.

Apache Mahout: used to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop framework.

NLTK: the Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing in Python. It supports research and teaching in areas such as machine learning, cognitive science and artificial intelligence.

Graph Lab: a graph based, high performance, distributed computation framework written in C++.

GNU Octave: a high level programming language, primarily intended for numerical computation and largely compatible with MATLAB.

IBM SPSS Modeler: a visual and powerful data mining workbench.

GhostMiner: a complete data mining suite, including k-nearest neighbors, neural nets, decision trees, neurofuzzy, clustering, and visualization.

Python: Python is a leading general purpose scripting and web development language. Data analysis and scientific programming are supported by packages such as NumPy and SciPy, along with the visualization package Matplotlib.

IV. BIG DATA ANALYTICS IN HEALTHCARE

The volume of data being generated, and the need to get valuable insights from it for effective decision making, have led to the use of analytics tools that analyze the data so that inferences can be drawn. Analytics can be classified into three types: Predictive Analytics, Descriptive Analytics and Prescriptive Analytics [18, 10, 23].
Predictive Analytics can be used to forecast what might happen in the future. It uses statistical approaches to search through large patient data sets and analyzes the data to predict individual patient outcomes. In health care, the predictions can range from medication to the readmission rate of a patient.

Descriptive Analytics is generally used to summarize what happened. It uses past and current healthcare data to make quality healthcare decisions. Examples of what descriptive analytics can do in healthcare are identifying high-risk patients, targeting patients for promotional drug trials, and providing insights for carrying out health management programs.

Prescriptive Analytics is a type of predictive analytics used to prescribe actions for decision makers to act upon. In healthcare, prescriptive analytics is used in evidence based medicine to improve patient care and to prescribe better business practices.

Evidence Based Medicine is the intersection of individual clinical expertise, external evidence, and value to the patient. Evidence can be generated from both extant medical literature and practice-based evidence. Practice-based evidence is generated from the day-to-day data collected in the hospital while treating patients (the electronic health record). Additional sources of practice-based evidence include claims data, insurance data, and other administrative hospital data.

V. OPPORTUNITIES FOR BIG DATA SOLUTIONS IN HEALTH CARE

Big data solutions can be used in health care to achieve innovative outcomes in the following areas:
• Clinical decision support - Big data analytics (BDA) technologies can be used to predict outcomes [6] or recommend alternative treatments to clinicians and patients at the point of care.
• Personalized care - Predictive data mining or analytic solutions may offer early detection and diagnosis before a patient develops disease symptoms. Pattern detection through real time wearable sensors can alert physicians to any change in the vital parameters of elderly or disabled patients, and post-market monitoring of drug effectiveness can also be done.
• Public and population health - BDA solutions can mine web-based and social media data to predict flu outbreaks and future trends.
• Fraud detection - Fraud in medical claims can increase the burden on society. Predictive models such as decision trees, neural networks and regression can be used to predict and prevent fraud at the point of transaction [17].
• Secondary usage of health data [8] - Deals with the aggregation of clinical data from finance, patient care and administrative records to discover valuable insights such as identification of patients with rare diseases, therapy choices, clinical performance measurement, etc.
• Evidence based medicine [8] - Evidence based medicine involves the use of statistical studies and quantified research by doctors to form a diagnosis. This practice enables doctors to base decisions not only on their own perceptions but also on the best available evidence.

VI. ADOPTION CHALLENGES OF BIG DATA SOLUTIONS IN HEALTH CARE ORGANIZATIONS

Some of the adoption challenges of Big Data in the healthcare industry are [14]:
• No fixed standards are followed for health care data - A vast amount of healthcare data is generated and collected by different agents in health care today, from insurance claims to general practitioner notes within the medical record, images from patient scans, conversations about health in social media, and information from wearable sensors and other health monitoring devices.
• Integration of heterogeneous data sources - Data is spread across labs, hospital systems, operation theaters, financial IT systems, and electronic health records (EHRs). This fragmentation is a significant obstacle to merging data into an integrated database system.
• Skilled resources - Handling big data solutions requires certain skill sets and competencies, and globally there is a shortage of data scientists and data analysts.
• Privacy and security - Traditional privacy and security measures work on smaller data sets. The ability to enforce the same measures on massive and streaming data sets is a concern, especially when dealing with patients' health information.
• Infrastructure issues - Hospitals already run legacy systems, and their compatibility with new technologies remains an issue.
• Insufficient real time processing - Delays in processing complex data models could lead to lower quality patient care.
• Interpretation of the analytical results - After the clinical data is analyzed, the analysts must get the right clinical support for interpreting the results, as correct interpretation is essential for getting the desired outcomes.
• Data quality - To get reliable insights from the data for making decisions about patients' health care, the quality of the data is very important.

VII. ARCHITECTURE TO HANDLE HEALTH CARE DATA

The health care industry has become extremely data intensive, with data coming from multiple sources. Data integration across heterogeneous data sources has been identified as the biggest challenge; solving it would contribute to greater insights from the data available [15, 20]. There are some viable solutions for dealing with heterogeneous data in health care.

One is to implement a three-tier architecture [15], with a client tier providing access to the system and a middle tier for defining rules and processing, which can be implemented using XML following the SOAP protocol and the HL7 Reference Information Model (RIM). The implementation enables the user to search and retrieve information that has been integrated using HL7 v3-RIM technology from disparate health care systems [21], and
a database tier with NoSQL, which provides a schema-less database design [27, 19] and is also a more scalable option than a traditional RDBMS.

Integrating heterogeneous clinical data and getting meaningful insight from it can be achieved by modifying and adapting the architecture proposed in [21].

[Figure: layered diagram, top to bottom. Client tier: Data Interpretation, Data Analysis. Middle tier: Standardization of Data (XML & HL7 v3 RIM). Database tier: Data Extraction (NoSQL), Data Collection. Inputs: Patient Record, Lab Reports, Image Data.]
Fig 1: Three Tier Architecture for handling Heterogeneous data

• Data collection - Heterogeneous health care data from different sources are collected.
• Data extraction - The data are extracted from the multiple sources and stored in a single NoSQL database.
• Converting the extracted data into a standard format - XML and the HL7 Reference Information Model are used to convert clinical data into a standard format.
• Data analysis - The data is analyzed to gain valuable insights from the healthcare data using various analytical methods and technologies, such as data mining algorithms and SAP HANA with in-memory computing to increase the data processing speed.
• Data interpretation - The results of analytics on healthcare data need to be interpreted with appropriate clinical support; otherwise the inferences drawn from the results could be misleading.

VIII. CONCLUSION

The health care industry is set for a transformation to improve patient care and make it cost effective by leveraging the latest technologies. Digitization has led to a large amount of digital data, especially in the health care industry. This paper proposes a method for handling heterogeneous health care data using the right technology and architecture, which has the potential to transform healthcare outcomes and the quality of patient care at optimal cost. This paper also presents various analytical tools that can be used to derive benefits from the huge body of healthcare data. Proper selection of tools for analytics on health care data can provide promising results. The adoption challenges highlighted in the paper are open issues and can be considered an area of research for future work.

REFERENCES

[1] Ping Jiang, Jonathan Winkley, Can Zhao, Robert Munnoch, Geyong Min, and Laurence T. Yang, “An Intelligent Information Forwarder for Healthcare Big Data Systems with Distributed Wearable Sensors”, IEEE Systems Journal, 2014.
[2] Keith Feldman, Nitesh V. Chawla, “Scaling Personalized Healthcare with Big Data”, 2nd International Conference on Big Data and Analytics in Healthcare, Singapore, 2014.
[3] Abdelhmid Salih Mohmed Salih and Ajith Abraham, “Novel Ensemble Decision Support and Health Care Monitoring System”, Journal of Network and Innovative Computing, ISSN 2160-2174, Volume 2 (2014), pp. 041-051, 2014.
[4] Muni Kumar N, Manjula R, “Role of Big Data Analytics in Rural Health Care - A Step Towards Svasth Bharath”, International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 5, 2014.
[5] Han Hu, Yonggang Wen, Tat-Seng Chua and Xuelong Li, “Toward Scalable Systems for Big Data Analytics: A Technology Tutorial”, IEEE Access, Object Identification number 1109, 2014.
[6] A Working Group of the American Statistical Association, “Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society”, July 2, 2014.
[7] Meetali Bageshwari, Pradnesh Adurkar, Ankit Chandrakar, “Clinical Database: RDBMS V/S Newer Technologies (NoSQL and XML Database); Why Look Beyond RDBMS and Consider the Newer”, International Journal of Computer Engineering & Technology (IJCET), Volume 5, Issue 3, pp. 73-83, March 2014.
[8] Wullianallur Raghupathi, Viju Raghupathi, “Big Data Analytics in Healthcare: Promise and Potential”, Health Information Science and Systems 2014, 2:3, doi: 10.1186/2047-2501-2-3, http://www.hissjournal.com/content/2/1/3
[9] Wayne Eckerson, “Big Data and its Impact on Data Warehousing”, Beyenetworks, 2014.
[10] Min Chen, Shiwen Mao, Yunhao Liu, “Big Data: A Survey”, Springer Science+Business Media New York, 22 January 2014.
[11] Omar El-Gayar, “Opportunities for Business Intelligence and Big Data Analytics in Evidence Based Medicine”, 47th Hawaii International Conference on System Science, 2014.
[12] Varun Chandola, Sreenivas R. Sukumar, Jack Schryver, “Knowledge Discovery from Massive Healthcare Claims Data”, ACM 978-1-4503-2174-7/13/08, 2013.
[13] Gautham Vemuganti, “Metadata Management in Big Data”, Infosys Lab Briefing, vol. 11, 2013.
[14] Timothy Schultz, “Turning Healthcare Challenges into Big Data Opportunities: A Use-Case Review across the Pharmaceutical Development Lifecycle”, Bulletin of the Association for Information Science and Technology, Volume 39, Number 5, June/July 2013.
[15] Zhang, Z., Sarcevic, A., & An, Y., “A prototype system for heterogeneous data management and medical devices
integration in trauma resuscitation”, iConference Proceedings (pp. 785-789), doi: 10.9776/13388, 2013.
[16] Michael Hausenblas and Jacques Nadeau, “Apache Drill: Interactive Ad-Hoc Analysis at Scale”, MapR Technologies, Big Data, June 2013, DOI: 10.1089/big.2013.0011
[17] Venkata Reddy Konasani, Mukul Biswas, Praveen Krishnan Keloth, “Healthcare Fraud Management using Big Data Analytics”, A Whitepaper by Trendwise Analytics, 2012.
[18] Houser SH, Colquitt S, Clements K, Hart-Hester S, “The impact of electronic health record usage on cancer registry systems in Alabama”, Perspect Health Inf Manag. 2012;9:1f.
[19] Ken Ka-Yin Lee, Wai-Choi Tang, Kup-Sze Choi, “Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage”, Centre for Integrative Digital Health, School of Nursing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; Information Technology Committee, Hong Kong Doctors Union, Hong Kong, 31 October 2012.
[20] Aniket Bochare, “Heterogeneous Data Integration for Clinical Decision Support System”, 2011.
[21] Teeradache Viangteeravat, Matthew N Anyanwu, Venkateswara Ra Nagisetty, Emin Kuscu, Mark Eijiro Sakauye, Duojiao Wu, “Clinical data integration of distributed data sources using Health Level Seven (HL7) v3-RIM mapping”, Journal of Clinical Bioinformatics 1:32, 2011.
[22] http://www.techrepublic.com/blog/big-data-analytics/10-emerging-technologies-for-big-data/
[23] http://www.informationweek.com/big-data/big-data-analytics
[24] http://www.kdnuggets.com/software/suites.html
[25] http://www.predictiveanalyticstoday.com/top-15-free-data-mining-software/
