
Healthcare Data Extraction and Analysis

Thesis submitted in partial fulfillment of the
requirements for the degree of
Master of Engineering - ME (Internet of Things)

Manju S

Reg. No : 201048007

Master of Engineering

ME (IoT)

Project Start Date: 01/05/2022

Under the guidance of

Dr. Deepak Rao
Associate Professor
MSIS
MAHE, Manipal

Prof. Samarendranath Bhattacharya
Assistant Professor
MSIS
MAHE, Manipal

MANIPAL SCHOOL OF INFORMATION SCIENCES


(A Constituent unit of MAHE, Manipal)

MANIPAL SCHOOL OF INFORMATION SCIENCES
(A Constituent unit of MAHE, Manipal)
CERTIFICATE
This is to certify that this thesis work titled
Healthcare Data Extraction and Analysis
is a bonafide record of the work done by
Manju S
Reg. No. 201048007
in partial fulfillment of the requirements for the award of the degree of Master of Engineering
- ME (Internet of Things) under MAHE, Manipal, and the same has not been submitted
elsewhere for the award of any other degree. The dissertation does not contain any part / chapter
plagiarized from other sources.

Dr. Deepak Rao
Assistant Professor
MSIS
MAHE, Manipal

Prof. Samarendranath Bhattacharya
Assistant Professor
MSIS
MAHE, Manipal

Dr. Keerthana Prasad


Director,
Manipal School of Information Sciences
MAHE, Manipal

ACKNOWLEDGEMENT

My sincere thanks to my mentor Mr. Vivek Kumar, Project Manager, SLP Technologies Pvt

Ltd for his technical and professional guidance during the project tenure.

I thank Dr. Keerthana Prasad, Director, Manipal School of Information Sciences, Manipal

Academy of Higher Education, Manipal for her valuable support.

I extend my immense gratitude to my guide Prof. Samarendranath Bhattacharya (Assistant

Professor, Manipal School of Information Sciences), Manipal Academy of Higher Education,

Manipal for his constant guidance and support.

I thank all the teaching and non-teaching staff of MSIS for their assistance during the course.

INDEX

Abstract…………………………………………………………………….….7

1.Introduction …………………………………………………………………8

2.Introduction to Electronic Health Care Records…………………………….9

2.1 General themes in healthcare IT:…………………………..……..13

2.2 Industrialisation of medicine is driving IT………………...………13

2.3 Consumerization of healthcare information…………………..….14

2.4 Potential Benefits of a ‘Single’ EHR……………………………..15

2.5 EHR Maturity Staircase:………………………………….………15

2.5.1 EHR Stages and Benefits……………………………………....18

3.Data Extraction from EHR…………………………………………..…….22

3.1 Rule-based extractions……………………………..……………22

3.1.1 Regular Expressions …………………………………………..23

3.1.2 Ontologies……………………………………………………..23

3.2 Named Entity Recognition…………………………...………….24

3.2.1 Rule-based approaches……………………………...…………25

4.Data Extraction from EHR using Python……………………………..…..26

4.1 Various Python Libraries to extract data………………………..26

4.2 Information extraction using regular expressions………………34

5.Outcomes of Internship………………………………………………...…40

6.Conclusion and Future Work………………………………………..……41

7.Bibliography……………………………………………………………..42

List of Figures

Page No

Figure 1: Electronic Health Records System…………………………………………11

Figure 2: Evolution of EHR………………………………………………………….13

Figure 3: EHR Maturity Staircase……………………………………………… …..16

Figure 4: EHR Benefits…………………………………………………………..….21

Figure 5. Unstructured EHR extraction ………………………………….………....34

Figure 6. Sample data……………………………………………………………….36

Figure 7. Blood sugar levels of a patient……………………………………………37

Figure 8. Body mass index (BMI) of patients in the

comprehensive analytics database………………………………..38

Figure 9: Sample EHR marked with Headers, Sub-headers……………….………39


Figure 10: Sample XML Coordinate position……………………………………..39

Figure 11: Sample marked X,Y Coordinates for PDF………………….………….40

MANIPAL SCHOOL OF INFORMATION SCIENCES

(A Constituent unit of MAHE, Manipal)

Healthcare Data Extraction and Analysis

SLP Technologies

Manju S

Reg. No : 201048007

Master of Engineering

ME (IoT)

Project Start Date: 01/05/2022

Akash Gowda
Project Manager
SLP Technologies
Bengaluru

Mr. Samarendranath Bhattacharya
Assistant Professor
MSIS
MAHE, Manipal

Abstract:
One of the most important tasks for a medical practitioner treating a patient is to study
the patient's complete medical history by going through all records, from test results to doctor's
notes. With the increasing use of technology in medicine, these records are mostly digital,
which alleviates the problem of searching through a stack of easily misplaced papers; however,
many of these records remain unstructured. Large parts of clinical reports are free-form written
text and are tedious to use directly without appropriate pre-processing. In medical research,
such health records can be a rich and convenient source of medical data; however, the lack of
structure makes the data unfit for statistical evaluation. In this project, we introduce a system
to extract, store, retrieve, and analyse information from health records, with a focus on the
Indian healthcare scene.

Methods: A Python-based tool, Healthcare Data Extraction and Analysis (HCDEA), has been
designed to extract structured information from various medical records using a regular
expression-based approach.

1.Introduction

1.1 Organization

SLP Technologies is an Indian multinational company headquartered in Bengaluru,
Karnataka, India. SLP Technologies is focused entirely on a single goal: to help life sciences
companies compete more effectively by managing their Labeling, Disclosure and
Documentation more efficiently, while improving compliance.

1.2 Objective

Pharmacovigilance and drug-safety surveillance are crucial for monitoring adverse
drug events (ADEs), but the main ADE-reporting systems, such as the Food and Drug
Administration Adverse Event Reporting System, face challenges such as underreporting.
Therefore, as complementary surveillance, data on ADEs are extracted from electronic health
record (EHR) notes via data extraction, data mining and natural language processing (NLP).
As NLP develops, many up-to-date machine-learning techniques are being introduced in this
field, such as deep learning and multi-task learning (MTL).

The goal of the thesis is to develop models for the automatic detection of Adverse Drug
Reactions in clinical data. We address the problem with a supervised approach divided into
Named Entity Recognition and Relation Extraction tasks, using a model able to learn patterns
during training from annotated clinical notes. We base the model on recently proposed Deep
Learning methods, and we exploit contextual information and different features of clinical
notes to classify entities into categories defined by the labelled data, and finally to extract
relations between the entities with the trained model. Given a clinical note as input, the trained
model returns pairs of entities and their relations, such as the Adverse Drug Reaction relation
between Adverse Drug Events and medications.
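As a sketch of the intended interface, the model's output for a clinical note could be represented with simple entity and relation records. The label names ("DRUG", "ADE", "ADR") and the example note below are illustrative assumptions, not the thesis's actual schema:

```python
from dataclasses import dataclass

# Illustrative output structures for a NER + Relation Extraction pipeline.
# The labels "DRUG", "ADE" and "ADR" are assumed for this sketch.
@dataclass
class Entity:
    text: str
    label: str  # entity category, e.g. "DRUG" or "ADE"
    start: int  # character offset in the note
    end: int

@dataclass
class Relation:
    head: Entity  # the medication
    tail: Entity  # the adverse drug event
    label: str    # relation type, e.g. "ADR"

note = "Patient developed a rash after starting amoxicillin."

# What a trained model would be expected to return for this note.
drug = Entity("amoxicillin", "DRUG",
              note.index("amoxicillin"),
              note.index("amoxicillin") + len("amoxicillin"))
ade = Entity("rash", "ADE",
             note.index("rash"),
             note.index("rash") + len("rash"))
relations = [Relation(head=drug, tail=ade, label="ADR")]

for r in relations:
    print(f"{r.label}: {r.head.text} -> {r.tail.text}")  # ADR: amoxicillin -> rash
```

Character offsets are kept on each entity so extracted mentions can always be traced back to their exact position in the source note.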

2.Introduction to Electronic Health Care Records

An Electronic Health Record (EHR) is an electronic version of a patient's medical history
that is maintained by the provider over time, and may include all of the key administrative and
clinical data relevant to that person's care under a particular provider, including demographics,
progress notes, problems, medications, vital signs, past medical history, immunizations,
laboratory data and radiology reports. The EHR automates access to information and has the
potential to streamline the clinician's workflow. The EHR also has the ability to support other
care-related activities directly or indirectly through various interfaces, including evidence-based
decision support, quality management, and outcomes reporting.

EHRs are the next step in the continued progress of healthcare that can strengthen the relationship
between patients and clinicians. The data, and the timeliness and availability of it, will enable
providers to make better decisions and provide better care.

For example, the EHR can improve patient care by:

 Reducing the incidence of medical error by improving the accuracy and clarity of medical
records.

 Making health information available, reducing duplication of tests, reducing delays in
treatment, and keeping patients well informed so they can take better decisions.

Figure 1: Electronic Health Records System

The history of Health Information Systems and the evolution of the EHR are shown in Fig. 2.
Medical records are an essential feature of any Hospital Management System (HMS) and have
evolved over the past several years. The latest innovations in healthcare Information
Technology have changed the way health information is recorded. In the 1920s, all medical
records were maintained and managed manually with paper-based documentation.
Patient visit details, history, diagnosis, lab results, medication and all other details were written
on paper and manually managed in the medical record room.

The medical record contained key clinical information about the patient and his care. An
accrediting body, the Joint Commission on Accreditation of Hospitals (Affeldt 1980), started
surveying hospitals and other care facilities on a regular basis to check the quality of medical
care, using the medical record as a tool. With this initiative, great improvements began in
hospitals in standardizing the medical record section under the defined regulations. With
this goal in mind, the American College of Surgeons (ACS) initiated the use of standards for
hospitals in the United States and Canada to enhance the clinical care setting. The
association of professionals exists today under the name of the American Health Information
Management Association (AHIMA).

Medical records are documented in several ways, and one concept used for documenting
patient health information is the Problem Oriented Medical Record (POMR) (Takeda 1999),
which mainly focuses on the specific problems that patients have. POMR was introduced by
Dr. Lawrence Weed in the 1950s and represents one of the most strategic approaches to
documenting patient care.

SOAP (Subjective, Objective, Assessment, and Plan) is a technique used to capture patient
data in the POMR. SOAP improves patient care and helps healthcare professionals provide
structured care and treatment. It clearly shows what is happening to the patient in an
organized way (Jaroudi and Payne 2019).

Advancements in technology, such as the use of computer systems in the healthcare sector,
have led to an organized way of recording medical records electronically. Electronic storage
of health information is beneficial to the organization in this period of technological
advancement. With increased usage of computers in hospitals, individual departments in
healthcare organizations started using legacy Health Information Systems (HIS). One such
system is the EMR (Hersh 1995), which allows the transfer of health-related data between
healthcare service providers of the same organization, and with which healthcare
professionals quickly and easily get patient data at their fingertips (Giaedi 2008).

The major change from EMR to EHR occurred in the year 2009, during President Barack
Obama's term, and is named the Health Information Technology for Economic and Clinical
Health Act (HITECH Act) (Gold and McLaughlin 2016). The act is part of the American
Recovery and Reinvestment Act of 2009 (ARRA). The prime focus of the HITECH Act is to
motivate the implementation of Electronic Health Records and other supporting health IT in
the United States, so that patients can get better care (Aocnp 2015).

Figure 2: Evolution of EHR

2.1 General themes in healthcare IT:

Rising healthcare costs are shaping IT investments

Other countries face challenges similar to India's, in that healthcare costs keep rising. In
developed countries, where acute care and institutional long-term care services are widely
available, the use of healthcare services by adults rises with age, and per capita expenditure
on health care is relatively high among older age groups. The rise is driven not just by
ageing populations, but also by medical inflation, longer life expectancies, chronic long-term
conditions and diseases, as well as increased consumer demand.

Facing these challenges, a general increase in the use of IT is also observed across
OECD countries, as well as non-OECD countries. Emerging market nations often seek to
‘leapfrog’ western countries, as they stamp entire new hospital systems, aged-care or other
healthcare systems out of the ground. These new facilities tend to focus on full digitization
and virtually paperless environments.

2.2 Industrialisation of medicine is driving IT

Frequently decried as ‘cook-book’ medicine, the industrialisation of clinical care and


service delivery is progressing rapidly in most developed nations. It is a function of having to
decrease the variability of services, quality and outcomes, as well as needing to deliver more
for less. This can take the form of strong quality management with financial sanctions – such
as Joint Commission on Accreditation of Healthcare Organizations (JCAHO) accreditation
increasing or reducing fee levels in the US. It can also be driven through strong clinical
governance and a desire to deliver the best possible service with appropriate clinical
protocols, such as at the Kaiser or Mayo clinics.

In the United States, there has been a strong shift from volume to Value-Based-Care
(VBC). As delivery systems mature there typically are widespread efforts to control/reduce
costs, improve outcomes, and obtain more value for money spent through different
contracting arrangements. Where Accountable Care Organizations (ACOs) have been
established, there is a heavy focus on standardizing clinical care and guidelines, to ensure
that patients are well managed and supported with evidence based protocols.

This is done across the entire healthcare continuum from primary care through to
specialist services, hospital care and long-term residential care facilities.

In integrated delivery systems, the delivery of care is not just reviewed
retrospectively, but concurrently – i.e. during the process of service delivery, outliers,
exceptions and reasons for variation are noted and captured. As a result, the IT systems that
support the delivery of care at the coal face have become increasingly sophisticated. The
success of VBC or ACO models in delivering integrated care depends heavily on clinical
ownership and control. There has to be buy-in to the harmonisation of processes, the way that
care is coordinated and the way in which the overall system is rewarded and managed. Clinical
leadership and a clear system perspective are key to ensuring that healthcare professionals
can play their integral role in healthcare delivery. To boost clinician and patient
participation in such systems, designers build systems that support clinical and business
resources, and bring the advanced automation and decision support capabilities necessary
for day-to-day service delivery. The resultant solutions use technology for communications
and information exchange, provide robust functionality and allow both clinicians
(physicians) and non-clinical staff to coordinate patient care.

2.3 Consumerization of healthcare information

There is a general acknowledgement that EHRs need to be consumer-centric, with the
person at the centre. Policy developers, Health Management Organisations (HMOs), insurers
and payers all recognise that personal involvement in healthcare delivery, participation in
wellness programmes and engaged consumers are key to success. The proliferation of
Internet-enabled medical devices, home health appliances, and biometric data captured by
wearable devices means that healthcare IT systems are being challenged with an influx of
data from many different and disparate sources.

New technologies enabling healthcare agencies to leverage “big data” in healthcare
are requiring healthcare agencies globally to reposition how they maximise the use of
information. The entry of EHRs, smartphone technology, wearables and sophisticated
analytics tools into patient behaviour is driving the exponential growth of healthcare data. As
a result, the future “core” of health information records needs to accommodate the rise in
core clinical information, the rise in patient-generated information, and the importance of
cross-data sets (e.g. welfare and education outcome information).

It has been estimated that there are already 15 Exabytes of health data in the world –
three times the number of words that have ever been spoken. Therefore, EHRs are seeing a
shift in the centre of gravity from provider-generated data to consumer-generated and
captured data over time.

2.4 Potential Benefits of a ‘Single’ EHR

A significant lesson from international experience is the importance of clinical
leadership. This is fundamental to driving the harmonisation of clinical processes and
workflow, without which many of the benefits of an EHR cannot be realised. A study
by the NHS in the UK noted that organisations with strong clinical leadership were more
successful in delivering the change needed to capture EHR benefits. Clinical leaders are
proactive in the reception, design, development, and implementation of an EHR, and play a
critical role in creating an organizational culture that allows for the efficient and accurate
flow of data.

Furthermore, it has also been shown that hospitals with the greatest clinician
involvement in management scored 50% higher on key measures of organisational
performance. In particular, getting alignment around the respective roles in an integrated
delivery setting, for care coordination across different disciplines, can be challenging.

2.5 EHR Maturity Staircase:

The benefits that arise from a health system-wide implementation of a ‘Single’ EHR
depend on the maturity of the underlying EHR and its functional scope. Certain benefits such
as improved clinical outcomes, better care coordination and a better patient experience,
come from functional ‘depth’. Other benefits such as allocative efficiency, population risk
management, advanced analytics capabilities and development of a ‘learning system’, are
driven by the ‘width’ of the data included in the scope of the EHR.

As outlined in international experience, the ‘Single’ EHR is typically constructed out
of the same functional building blocks as an EMR – either through a single system or by
integrating and linking multiple EMRs into a joined-up architecture. Therefore, the underlying
EMRs used in a healthcare system have a material impact on the shape and capabilities of the
overarching EHR at national or state-wide level.

The Maturity Staircase for an EHR is illustrated below and shows a series of
progressive capabilities that must be built up in order to reach the highest level of
maturity:

Figure 3: EHR Maturity Staircase

As the diagram also illustrates, the system-wide benefits of an EHR become much
more substantial and material as healthcare systems move up the staircase. HIMSS research
describes Level 5 as a ‘glass ceiling’, after which those systems that move higher start to reap
substantial rewards.

Some of the EHR benefits are entirely achievable with a ‘Virtual’ EHR – particularly at
the early stages of a healthcare system's maturity journey. However, as the functionality
becomes more advanced and the data sets become broader, virtual approaches can no
longer keep up. That is why the most advanced healthcare systems have gravitated towards
rationalising their platforms into ‘Single’ EHR architectures.

At higher levels of maturity, the required level of fidelity for information to automate
workflow, trigger alerts or leverage decision support is very high. This is difficult to achieve
if the underlying source data was captured with a ‘Virtual’ EHR through different clinical
workflows, with different processes and different systems: there is often too much
variability in the context in which the data was captured to be able to assemble it in machine-
readable form so that it can drive automation or decision support. Performance issues also
make it challenging to assemble information ‘on-the-fly’ through software and then use it to
drive automation or assist with decision support.

For real-time system interactions and automation, the laws of physics require the underlying
health information to be stored in a single physical repository. If workflow tasks are to trigger
events in real time and drive straight-through processing, then the rules associated with
those tasks, as well as the assembly of the underlying information, cannot be done ‘on-the-fly’
anymore. There is too much latency in processing and network connectivity to process
complex data sets which are distributed across multiple systems. Advanced decision-support
rules and alerts that are to trigger in real time across a broad array of underlying
information require the underlying data sets to be pre-assembled.

This is also one of the key drivers behind ‘monolithic’ EMR solutions being deployed
in hospitals in particular. Hospitals require as much processing efficiency, workflow
automation, and alert functionality in real time as possible. They tend to run tightly coupled
business processes that involve many different departments and disciplines interacting with
each other. If for example the nurse hand-over sheet (or mobile pad equivalent) had to look
into 40 different departmental systems to check what ‘Allergies’ are noted for a patient, then
it would be very difficult to automatically run a medication check.

2.5.1 EHR Stages and Benefits:

Each of the EHR stages provides incremental benefits over and above what is possible
with the previous stages. Unfortunately, it is not possible to ‘leap-frog’ straight to Level 4 – i.e.
the foundational capabilities must be in place before the more advanced capabilities can be
delivered. Each of the stages is briefly described below:

Stage 1 - Common Identifiers & Information Exchange:

At this stage a healthcare system has established the base capabilities necessary to exchange
information between healthcare practitioners in a secure and reliable manner. That means
that interfaces are in place, as well as common identifiers so that communications are not
misdirected. Often this early stage goes hand-in-hand with administrative harmonisation and
defines common service definitions. These can then be used for both tracking as well as
billing purposes. Key benefits at this stage include:

 An ability to track service utilisation and furnish management information about the
overall system. Especially when this is combined with billing / costing information,
there are gains to be made with regard to allocative efficiencies as well as resource
optimisation.
 An ability to re-allocate resources and configure services for optimal usage and
coverage. Often this starts with hospital planning, but then also extends into high-cost
imaging (e.g. MRI) and diagnostic service planning and configuration.
 Visibility on billing, revenue and costs across the healthcare system. Once the
activities (service definitions) and units of measure have been standardised, the
healthcare system can start to track Key Performance Indicators (KPIs) to manage
overall health system performance from a cost and efficiency perspective.
 For individual providers in the healthcare system, key benefits arise from the ability
to better manage their scheduling and appointments. With the ability to book services
and exchange basic information, providers gain administrative efficiencies, since they
do not have to repeatedly enter the same data. Patients also experience an enhanced
service, when they can choose booking slots at their convenience.

Stage 2 – Clinical Information Access:

At this stage a healthcare system has established the base capabilities necessary to actually
assemble a more integrated view around the patient on a single screen. That means different
information, such as diagnostic data, personal details of the patient and diagnostic or treatment
information, is accessible from one place. Having this information assembled also allows
patients to start to engage in their own care and wellness activities – especially if they can
contribute personally generated information into the joined-up record. Often this stage focuses
on results reporting and diagnostic imaging in the first instance. Medication management, with
a joined-up list of medicines across the continuum of care, tends to be a close second. Further
enrichment of the data can come from discharge summaries as well as referrals generated across
the continuum of care. Key benefits at this stage include:

 An ability to capture core clinical and encounter data on the patient. This includes
personal information that is core to service delivery, as well as key information
specific to a particular health service. By capturing such information once and re-
using it many times, there are efficiency gains across the system and time is saved by
each provider delivering services.

Stage 3 – Service Management & Collaboration:

At this stage a healthcare system has established a common vision and clear goals for
what it seeks to achieve – typically defined as some form of integrated care. That means the
various roles of the participants in the healthcare system are agreed, and there are agreed
protocols in place for collaborative care and service coordination. The underlying EHR solution
starts to provide a rich set of transactional services that help get things done. The EHR system
moves from passively providing information to actively automating key delivery processes.
Work orders can be created, including complex order sets that straddle different modalities and
services, so that the right tasks are allocated to the right participants. Practitioners receive and
review task lists and get system support that assists them in their delivery of services.

Stage 4 – Decision Support & Risk Management:

At this stage a healthcare system starts to think about working ‘smarter’ rather than
‘harder’ and has established a vision around continuous learning and improvement. That means
that health informatics standards and governance are in place to look for variability in clinical
practice and to shape what ‘best practice’ or ‘evidence-based care’ actually looks like. The
consolidated EHR data provides a rich set of information on what has taken place in a person's
lifetime and what the respective outcomes were. This can be used to understand population
risk, analyse the needs of the enrolled population and identify optimisation opportunities. As
new patterns are discovered, they can be embedded through advanced decision-support rules
into day-to-day service delivery. Real-time alerts can be generated at the point of care to ensure
clinical decisions are well informed and to proactively schedule particular services or contact
patients.

 At this stage in the evolution of a healthcare system, there is not just agreement around
the key care paths across the continuum of care, but also what constitutes ‘best practice’.
End-to-end care paths with agreed outcome measures and KPIs can be implemented, to
fashion a learning system. This may for example involve implementing intelligent
guidelines with decision-support to assist in the management of cardiac events
including the post-acute follow-up in the community (with dietary advice, patient
engagement in lifestyle adjustments, on-going education and behavioural modification
programmes that target the family as well).

 A common use case for EHR systems at the national level tends to be screening
programmes that seek to identify key risk factors early and then encourage appropriate
interventions. Examples include cervical screening, prostate checks, colon-cancer
screening, breast-cancer screening, etc. In a Stage 4 system, the interventions do not
just stop with the screening, but actually start with this process – i.e. they drive on-
going workflow and follow-up to ensure that the identified risks are proactively
managed by the overall healthcare system.

Figure 4: EHR Benefits

3.Data Extraction from EHR

As mentioned already, the ability to extract specific information from medical records
is crucial for medical treatment and research. Extracted and structured data can be used as the
main data source or as enrichment for all kinds of biomedical data. In this chapter I will go
through the different NLP techniques for IE and discuss how they are implemented, to see
which method is best suited for IE from EHRs.

I will first introduce rule-based extraction using Regular Expressions (REGEX) and
Ontologies. Then I will focus on a specific NLP task called Named Entity Recognition (NER)
and different implementations of this task. Finally, I will give an overview of various general
biomedical systems, how they work and whether they can be used for EHRs.

3.1 Rule-based extractions

The first and older approach to IE is the rule-based one. Using this approach,
we define specific rules to extract certain information. These can be syntactic rules defined by
Regular Expressions, semantic rules defining a sequence of Part-of-Speech tags, or specific
domain-based Ontologies or vocabularies. These rules can be defined manually by a domain
expert, they can be outlined and then fine-tuned using Machine Learning (ML) [20], or they can
be created strictly using supervised ML methods [23]. These techniques have been shown to
work well on various types of EHR. A study from 2013 [24] compared rule-based IE and ML-
based IE in research and in industry. The results showed a big discrepancy: over 96%
of NLP papers from 2003 to 2012 dealt with ML-based or hybrid IE, but only 33% of
commercial systems used ML-based or hybrid methods. One of the reasons for this discrepancy
stated in the study is that in industry, IE tasks are generally ill-defined and prone
to change relatively fast, while ML-based IE models require a careful up-front definition of the
IE task.

3.1.1 Regular Expressions

The concept of Regular Expressions was first introduced in the 1950s. A Regular
expression (REGEX) is a sequence of characters that defines a pattern to be searched for in a
given text. REGEX is used in most text processors and editors and is supported by most
programming languages. REGEX is usually used to extract "standardised" text such as dates or
measurements, but it can also be used to extract text that is strictly set by some
definition (e.g. standards such as ICD-10 [25] or TNM-7 [26]). Regular expressions can also be
used to extract more complex words or phrases using the Regular Expression Pattern Discovery
Algorithm [27] or rule-induction algorithms like WHISK [1]. These approaches use "seed
patterns" that are created manually and then adjusted in ML training, or learn the patterns from
scratch using an annotated text.
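As a minimal illustration of this kind of extraction, a few lines of Python can pull dates and measurements out of free text. The note text, field names and units below are invented for the example:

```python
import re

# A hypothetical clinical note; fields and formats are illustrative only.
note = (
    "Visit date: 12/05/2022. BP: 130/85 mmHg. "
    "Fasting glucose: 110 mg/dL. Weight: 72.5 kg."
)

# Pattern for dd/mm/yyyy dates.
date_pattern = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")

# Pattern for a numeric value immediately followed by one of a few known units.
measure_pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(mmHg|mg/dL|kg)")

dates = date_pattern.findall(note)
measures = measure_pattern.findall(note)

print(dates)     # ['12/05/2022']
print(measures)  # [('85', 'mmHg'), ('110', 'mg/dL'), ('72.5', 'kg')]
```

Note how standardised surface forms (dates, value-unit pairs) make the rules short, whereas free narrative text would need far more elaborate patterns, which is exactly the limitation the pattern-discovery and rule-induction algorithms above try to address.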

3.1.2 Ontologies

An ontology is a formal representation of knowledge. This representation
consists of classes, their properties, individual entities that belong to these classes, and the
relationships between them. Ontologies can then be used for automated knowledge
extraction from unstructured text. Ontologies can be created manually or automatically, or
some hybrid approach can be used. Ontologies can also be used to extract data from medical
records. A paper from 2009 [22] uses a special ontology for IE from Polish EHRs (mammography
reports and diabetes records).

The model consists of several parts: a basis for structuring textual information, domain
lexicons (containing specialised keywords for IE) and a processing platform. The ontology is
built using the available data and expert knowledge. Entities like Diagnosis, Physical features,
Person or Time are extracted. The results show an overall F-score of 99.58%, with over 80%
F-score for nearly all attributes. Achieving such extremely high results comes at a cost:
creating such rule-based systems requires enormous domain knowledge and an enormous
amount of time.

3.2 Named Entity Recognition

NER is one of the tasks of IE. It consists of two parts: identifying Named entities in
unstructured text and correctly classifying those entities. A Named entity is an object in the
real world that one can assign a name to. Typical examples of Named entities are the name of
a person, a country or a company, but Named entities can also be a date or a drug name. The
methods of implementation have varied over time and across domains, from rule-based and
Machine Learning approaches to state-of-the-art Deep Learning approaches. I will describe
these different methods later in this chapter, specifically in the medical domain.

Medical Named Entity Recognition is a branch of NER that focuses on EHRs. Common
entities like names or companies can be extracted, but researchers also work with domain-
specific Named entities like diseases and symptoms [29], [30], medication [31] or information
about tumours [32].

3.2.1 Rule-based approaches

I mentioned rule-based approaches above (3.1) in the context of IE. Rule-based
systems can also be used (though they are probably the least common now) in NER and even
Medical NER. Balász Gődény from Meltwater Group [33] used rule-based NER to identify
names of different products. The system is based on generating a large number of custom-made
rules. The algorithm has two parts. The first part removes tokens that are unlikely to
refer to products; this part is implemented using simple, custom rules. In the second phase the
remaining tokens undergo sequence recognition and disambiguation using a product database.
The results of this strictly rule-based system are quite low, achieving an 18.78% F-score. The
Machine Learning and Deep Learning methods that follow outperform these rule-based
methods significantly.

3.2.2 Machine learning approaches

Improving on the rule-based extractions are the statistical ML models. Systems using different
types of statistical models like Naive Bayes, Bayesian Networks, Support Vector Machines or
Conditional Random Fields (CRF) quickly outperform the rule-based models when it comes to
NER and Medical NER. Their main disadvantage is that (unlike the rule-based model shown
before) they are all supervised ML techniques; thus, we need a (relatively big) corpus of
annotated data to train these ML models successfully. One paper compares different ML
models (Naive Bayes, CRF and Maximum Entropy) and different segment representations of
Named entities. The paper classifies three Named entities: Problem (a medical problem in the
human body), Treatment (medical procedures taken to treat the patient) and Test (a type of
examination, e.g. a chest X-ray). The results show CRF outperforming the other two (around
70% F-score) by a big margin, followed by Maximum Entropy (from 20% to 70% F-score
depending on the entity and selected segment representation) and Naive Bayes last of the
three (from 10% to 40% F-score).

3.2.3 Deep learning approaches

Deep learning encompasses an enormous number of machine learning methods. I will
specifically write about deep neural networks and how they can be used for Medical NER.

Convolution Neural Networks (CNN)

A CNN is a deep neural network with at least one convolution layer. The convolution
layer consists of matrices or tensors that convolve an input from the previous layer and pass it
to the next layer. Although this type of deep neural network is especially successful in
computer vision [35], it can also be used in NLP and medical NER. Zhao et al. [36] focused on
medical NER and used CNNs for character representation, feeding the output into a
Bidirectional Long Short-Term Memory (BI-LSTM) network. Using the CNN as an additional
layer shows an improvement of around 2% F-score in recognition and normalisation.

4. Data Extraction from EHR using Python

4.1 Various Python Libraries to extract data

Here, to perform the given tasks, I have tried various Python packages. In this section, I
summarize the performance of these packages, each with its pros and cons.

Below is the list of packages I used for extracting text from PDF files.

1. PyPDF2

2. Tika

3. Textract

4. PyMuPDF

5. PDFtotext

6. PDFminer

7. Tabula

We will go through each package in detail along with Python code.

PyPDF2

PyPDF2 is a pure-Python package that can be used for many different types of PDF
operations. PyPDF2 can be used to perform the following tasks.

· Extract document information from a PDF in Python

· Rotate pages

· Merge PDFs

· Split PDFs

· Add watermarks

· Encrypt a PDF

Shown below is the code for extracting full text and the number of pages using PyPDF2
along with Input PDF and output extracted text.

path = r"\....Downloads\Ruchaar.pdf"

# Using PyPDF2

# importing the required module
import PyPDF2

# creating a pdf file object
pdfFileObj = open(path, 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# printing the number of pages in the pdf file
print(pdfReader.numPages)

# extracting text from every page
pypdf2_text = ""
for i in range(pdfReader.numPages):
    pypdf2_text += pdfReader.getPage(i).extractText()

# closing the pdf file object
pdfFileObj.close()

Cons of using the PyPDF2 package:

1. This package extracts text but does not preserve the structure of the text in the original
PDF.

2. Unnecessary spaces and newlines are included in the extracted text.

3. It does not preserve the table structure.

When I used a PDF created with LaTeX, the text was extracted with no spaces, which
means some information is potentially lost.

Tika

Tika is a Java-based package. Tika-Python is a Python binding to the Apache Tika REST
services which allows Tika to be called natively from Python. To use the Tika package in
Python, we need to have Java installed on the system. When the code runs for the first time,
it initiates a connection with the Java server. This delays the first extraction of text from a
PDF using the Tika package.

Below are some additional tasks performed while extracting texts from PDF.

1. Extract contents of the PDF file

2. Extract Meta-Data of PDF file

3. Extract keys (metadata and content for dictionary)

4. To know the Tika server status

Shown below is the code for extracting full text from PDF using the Tika package along with
Input PDF and output extracted text.

path = r"\....Downloads\Rucar.pdf"

# using Tika
# pip install tika
from tika import parser

raw = parser.from_file(path)
tika_text = raw['content']

Some major disadvantages of using the Tika package are:

1. Needs Java installed

2. The Java server connection is time-consuming

3. Does not preserve table structure

So, if you are comfortable installing Java on your system, then you may use this package.

Textract

While several packages exist for extracting content from particular file formats on their own,
the Textract package provides a single interface for extracting content from any type of file,
without any irrelevant markup.

Textract is used to extract text from PDF files as well as other file formats, including csv,
doc, eml, epub, json, jpg, mp3, msg, xls, etc.

The most noteworthy point about the Textract package is that it returns the extracted content
in byte format. To convert the byte data into a string we need another Python module for
decoding, such as codecs.

Shown below is the code for extracting text from PDF using Textract along with Input PDF
and output extracted text.

path = r"\....Downloads\Rurkar.pdf"

# for decoding
import codecs
# using Textract
import textract

# extract text in byte format
textract_text = textract.process(path)
# convert bytes to string
textract_str_text = codecs.decode(textract_text)
After using this package for text extraction there is no loss of information, and the structure
of the original document is maintained. However, the table structure is not preserved.

Overall, this package is a good option for text extraction not only from PDFs but also from
other types of files.

PyMuPDF

PyMuPDF is a Python binding for MuPDF, a lightweight PDF viewer. PyMuPDF is
not entirely Python based. This package is known for both its top performance and high
rendering quality.

With PyMuPDF, we can access files with extensions like *.pdf, *.xps, *.oxps, *.epub, *.cbz
or *.fb2 from Python scripts. Several popular image formats are supported as well,
including multi-page TIFF images.

PyMuPDF also extracts information from multipage documents, and it lets us extract
information for a particular page by entering the page number.

Below is the code to extract text from PDF using PyMuPDF along with Input PDF and output
extracted text.

path = r"\....Downloads\warkar.pdf"

# Using PyMuPDF
import fitz  # this is PyMuPDF

# extract text page by page
pymupdf_text = ""
with fitz.open(path) as doc:
    for page in doc:
        pymupdf_text += page.getText()

In general, PyMuPDF is a choice you can consider when extracting text from PDF files. It
removes unnecessary spaces from the text, so the text-cleaning step of pre-processing is
automatically done by this package.

It maintains the original structure of the document. However, as with the other packages, the
problem of extracting tables in their original format still exists. We have to use some other
package to preserve information in tables.

PDFtotext

PDFtotext is a simple package that can be used to extract text from PDF files.
As the name suggests, it supports only PDF files; other file formats are not supported.

The data is extracted in the form of an object. The structure of the PDF is preserved.

Below is the code to extract text from PDF using PDFtotext package along with Input PDF
and output extracted text.

path = r"\....Downloads\222.pdf"

# Using PDFtotext
import pdftotext

# Load your PDF
with open(path, "rb") as f:
    pdf = pdftotext.PDF(f)

# Read all the text into one string
pdftotext_text = "\n\n".join(pdf)

Unlike all the previously discussed packages, the main advantage of this package is that it
preserves both the structure of the PDF text and the table structure format.

PDFminer

This is yet another purely Python-based package, and it works only with PDF files. It can
also convert PDF files into other file formats like HTML/XML. There are various versions of
PDFminer; the latest is compatible with Python 3.6 and above.

PDFminer exposes a relatively low-level programming interface, so extraction takes slightly
more code and time than with the other purely Python-based packages.

The code used to extract text from PDF using the PDFminer package is tedious and longer
compared to the simple code used for the other packages; it is given below along with the
input PDF and output extracted text.

path = r"\....Downloads\234.pdf"

# Using PDFminer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

pdf_miner_text = convert_pdf_to_txt(path)

Tabula

This Java-based package is mainly used to read tables in a PDF. It is a simple Python wrapper
for tabula-java.

The extracted information is stored in a pandas DataFrame, which can later be converted into
csv, tsv, excel, or json file format.

Shown below is the code to extract the table into DataFrame from a PDF file using Tabula
Package along with Input PDF and output extracted text.

path = r"\....Downloads\Ruwarkar.pdf"

# using Tabula
import tabula

df = tabula.read_pdf(path, pages='all')

This package is useful for extracting table information. Using Tabula along with the other
packages mentioned above can be useful to extract full text from PDFs.

Conclusion on the various Python libraries

In this section, I have compared various Python packages for extracting text from the PDF
file format, and included code snippets for each package.

In summary:

1. PyPDF2 — Less preferred as compared to the others

2. Tika — Needs Java installed and a Java server connection, but is good for extracting
contents, keys and metadata

3. Textract — Returns a byte object that needs to be converted into a string

4. PyMuPDF — Extracts text from PDF files, removes unnecessary spaces from the
text, maintains the original structure of the document

5. PDFminer — Preserves the structure of the PDF text but not the table structure

6. PDFtotext — Comparatively most preferred as it preserves both the table and original
structure

4.2 Information extraction using regular expressions

The information extraction unit was written in Python 3, and it uses regular expressions
to extract information from text with a specified format. Data corresponding to attributes such
as date of examination, weight, height, symptoms, and prescribed medicine are extracted from
the file and stored along with the patient’s ID number in a file for each visit. Similarly, blood
and urine test results with the date and patient ID are also extracted and stored in the database.

Figure 5. Unstructured extraction using regular expressions and distance scoring.

The information set used to design the information extraction unit was created by collecting a
set of 90 healthcare reports from different patients with their consent. It contains a mix of
doctors’ notes, blood and urine test reports, and prescription files, from a number of healthcare
institutes.

The input text files are scanned for matching patterns, and the relevant information is extracted.
This is done with the help of regular expressions. For semi-structured data, the regular
expressions directly extract the required data because the relevant data is expected to be
labelled to a reference keyword, albeit in different formats. The data value should follow a
specified format or type, or it should be in the vicinity of a reference keyword, as defined in
the list of regular expressions.
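A minimal sketch of this keyword-anchored extraction is shown below. The field names and report text are illustrative, not the project's actual rule set:

```python
import re

record = """Date of Examination: 14/06/2022
Weight: 72 kg   Height: 168 cm
Prescribed Medicine: Metformin 500 mg"""

# Each attribute is tied to a reference keyword followed by a value in a known format
patterns = {
    "date":   re.compile(r"Date of Examination\s*:\s*(\d{2}/\d{2}/\d{4})"),
    "weight": re.compile(r"Weight\s*:\s*(\d+(?:\.\d+)?)\s*kg", re.IGNORECASE),
    "height": re.compile(r"Height\s*:\s*(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE),
}

extracted = {}
for field, pattern in patterns.items():
    match = pattern.search(record)
    extracted[field] = match.group(1) if match else None

print(extracted)  # {'date': '14/06/2022', 'weight': '72', 'height': '168'}
```

In the real unit, one such dictionary of patterns exists per attribute and per report format, and the extracted values are stored with the patient ID.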

For unstructured data, generally found in discharge notes and the like, another approach using
regular expressions is used. A data value in the unstructured text should follow a specified
format or type, or it should be part of a list of keywords. As an example, for the reference
keyword 'blood pressure', keywords such as 'high', 'low', or 'normal' are allowed, as are
numerical strings such as '110/90' or '120/80'. The data value should lie in the vicinity of the
reference keyword.
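For the 'blood pressure' case, the allowed candidate values can be captured with a single alternation. This is a sketch; the digit counts in the numeric branch are assumptions:

```python
import re

# A candidate blood-pressure value is either a qualitative keyword
# or a systolic/diastolic numeric string such as 120/80
bp_value = re.compile(r"\b(high|low|normal|\d{2,3}/\d{2,3})\b", re.IGNORECASE)

note = "BP was 110/90 earlier today; now it is normal."
print(bp_value.findall(note))  # ['110/90', 'normal']
```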

Each probable data value is assigned a score based on its distance from the keyword, which is
the difference between the total number of words in the sentence containing the keyword and
the number of words occurring between the two. The score is reduced by a large factor if the
keyword and data value occur in different sentences, and augmented slightly for numerical
data values. For example, as shown in Figure 5, '120/80' will have a higher score than the
word 'normal'. Thus, the attribute 'blood pressure' is assigned the value '120/80' in the
database for that particular tuple.
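The scoring rule just described can be sketched as follows. The penalty and bonus factors are illustrative values, not the ones used in the actual system:

```python
def score_candidate(sentence_words, keyword_idx, value_idx,
                    same_sentence=True, is_numeric=False):
    """Score = words in the keyword's sentence minus words between
    the keyword and the candidate value, so nearer candidates score higher."""
    words_between = abs(value_idx - keyword_idx) - 1
    score = len(sentence_words) - words_between
    if not same_sentence:
        score /= 10          # large reduction across sentence boundaries
    if is_numeric:
        score += 1           # slight boost for numerical values
    return score

sentence = "blood pressure was 120/80 and pulse normal".split()
# '120/80' sits two positions after the keyword; 'normal' is further away
num_score = score_candidate(sentence, keyword_idx=1, value_idx=3, is_numeric=True)
word_score = score_candidate(sentence, keyword_idx=1, value_idx=6)
print(num_score > word_score)  # True
```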

Some extracted information, such as dates and IDs, is processed before storage to maintain
consistency in data formats, making operations like searching and sorting of data easier.
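Date normalisation of the kind mentioned above can be done with the standard library. The input formats listed are assumptions about what appears in the reports:

```python
from datetime import datetime

def normalise_date(raw):
    """Try several common report formats and store a single ISO form."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review

print(normalise_date("14/06/2022"))   # 2022-06-14
print(normalise_date("14 Jun 2022"))  # 2022-06-14
```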

4.3 Analysis and output unit

(a) Patient history

The output is an HTML page generated by the HCDEA analysis and output (A&O) unit,
which was developed using Common Gateway Interface (CGI) scripting. The output,
formatted as tables, has attributes as columns and the corresponding records, with dates, as
rows. Thus, any fluctuation in a patient's vital signs and test results can be easily detected,
and relevant medical action can be taken. The page shows results in three categories:
prescriptions, blood reports, and urine reports. An example output is shown in Figure 6.

Figure 6. Sample data.

This is now a completely structured transformation of the unstructured data available in the
reports. Patient history can also be analysed using charts that can be generated on any
numerical attribute. This has been demonstrated using the blood sugar levels of a patient over
four weeks in Figure 7. HCDEA enables data aggregation and visualization even if the tests
were performed at different pathological laboratories.

Figure 7. Blood sugar levels of a patient.

(b) Analytics for medical research

HCDEA can be used to generate analysis reports and charts using information in the
central database without the patient's identifying information. This information is only
provided if the patient allows it to be used for medical research purposes by allowing the said
information to be stored in the CAD. An example chart is shown in Figure 8. This chart uses
the body mass index (BMI) calculated for all patients with relevant details in the CAD, using
the latest weight and height information, and presents the count of people falling in each
segment.

Figure 8. Body mass index (BMI) of patients in the comprehensive analytics database.

Workflow of data extraction

First of all, we need to convert the PDF into Extensible Markup Language (XML) or HTML,
which includes the data and metadata of a given PDF page. XML defines a set of rules for
encoding the document in a format that is both human-readable and machine-readable. It
includes both data and metadata (e.g., text box coordinates, height, width, etc.).

Figure 9: Sample EHR marked with Headers, Sub-headers

Figure 10: Sample XML coordinate position

The values inside the text box, [195, 735, 243, 745] in the snippet, refer to the "Left,
Bottom, Right, Top" coordinates of the text box. You can think about the PDF page in terms
of X-Y coordinates: the X-axis spans the width of the page and the Y-axis spans its height.
Every element has its bounds defined by a bounding box consisting of 4 coordinates. These
coordinates (X0, Y0, X1, Y1) represent the left, bottom, right and top of the text box, which
gives us the location of the data we are interested in on the PDF page.

Figure 11: Sample marked X, Y coordinates for PDF

Next, using the text box coordinates from the XML file, we can extract each piece of relevant
information individually using its corresponding text box coordinates, and then combine all
scraped information into a single observation. We write a function that uses
"pdf.pq('LTTextLineHorizontal:overlaps_bbox("#, #, #, #")').text()" to extract the data
inside each text box, and then use pandas to construct a dataframe.
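The `overlaps_bbox` selector used above relies on a standard rectangle-intersection test. Its logic can be reproduced as a self-contained sketch, independent of any particular library:

```python
def overlaps_bbox(element_box, query_box):
    """Boxes are (x0, y0, x1, y1) = (left, bottom, right, top).
    Two boxes overlap unless one lies entirely to one side of the other."""
    ex0, ey0, ex1, ey1 = element_box
    qx0, qy0, qx1, qy1 = query_box
    return not (ex1 < qx0 or ex0 > qx1 or ey1 < qy0 or ey0 > qy1)

# The header text box from the XML snippet above
header = (195, 735, 243, 745)
print(overlaps_bbox(header, (190, 730, 250, 750)))  # True
print(overlaps_bbox(header, (300, 100, 400, 200)))  # False
```

A text line whose box passes this test is considered "inside" the query region, and its text is collected into the observation for that attribute.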

5. Outcomes of Internship

 Learning the Python programming language.

 Information extraction using REGEX and a rule-based approach.

 Getting familiarized with the pharmacovigilance process.

 Documentation of the project and work done.

6. Conclusion and Future Work

In this project, we introduced an information extraction and presentation system that was
designed to recognize and classify basic attributes present in medical records. We proposed a
natural language processing model involving keyword-based and rule-based approaches to
cope with the inherent complexity and structure of these records. A rich set of features is
extracted using regular expression template patterns. At the retrieval step, only the necessary
information is displayed for the relevant user, with all personal information removed from
data used for medical research, and only patient-authorised information available to the
medical practitioner, with authentication based on Aadhaar ID.

For data extraction from unstructured text, an additional layer of model-based search using a
convolutional neural network can be used to verify the obtained results.

Future Work

A mobile application can be designed for easier access to patients in comparison to the
current web-based solution, which will help encourage adoption of the system.

7. Bibliography

1. Dinu V, Nadkarni P. Guidelines for the effective use of entity-attribute-value modeling for
biomedical databases. Int J Med Inform 2007;76(11-12):769-79.

2. Harkema H, Roberts I, Gaizauskas R, Hepple M. Information extraction from clinical
records. Proceedings of the 4th UK e-Science All Hands Meeting; 2005 Sep 19-22;
Nottingham, UK.

3. Fette G, Ertl M, Worner A, Kluegl P, Stork S, Puppe F. Information extraction from
unstructured electronic health records and integration into a data warehouse. Proceedings of
the 57th Annual Meeting of the German Society for Medical Informatics, Biometry and
Epidemiology (GMDS); 2012 Sep 16-20; Braunschweig, Germany. p. 1237-51.

4. Atzmueller M, Beer S, Puppe F. A data warehouse-based approach for quality management,
evaluation and analysis of intelligent systems using subgroup mining. Proceedings of the
22nd International Florida Artificial Intelligence Research Society Conference (FLAIRS);
2009 May 19-21; Sanibel Island, FL. p. 372-7.

5. Black N. Why we need observational studies to evaluate the effectiveness of health care.
BMJ 1996;312(7040):1215-8.

6. Kamal J, Pasuparthi K, Rogers P, Buskirk J, Mekhjian H. Using an information warehouse
to screen patients for clinical trials: a prototype. AMIA Annu Symp Proc 2005;2005:1004.

7. Aberdeen J, Bayer S, Yeniterzi R, Wellner B, Clark C, Hanauer D, et al. The MITRE
Identification Scrubber Toolkit: design, training, and assessment. Int J Med Inform
2010;79(12):849-59.
