
Research and applications


1 Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
2 Department of Linguistics, University of Colorado, Boulder, Colorado, USA
3 Group Health Research Institute, Seattle, Washington, USA
4 Boston Children's Hospital, Harvard University, Boston, Massachusetts, USA
5 Homer Warner Center for Informatics Research, Intermountain Healthcare, Salt Lake City, Utah, USA
6 Agilex Technologies, Chantilly, Virginia, USA
7 School of Biomedical Informatics, University of Texas Health Sciences Center, Houston, Texas, USA
Correspondence to
Dr Jyotishman Pathak, Mayo
Clinic College of Medicine,
Rochester, MN 55902, USA;
Pathak.Jyotishman@mayo.edu
Received 17 April 2013
Revised 7 October 2013
Accepted 11 October 2013
Published Online First
5 November 2013

To cite: Pathak J, Bailey KR, Beebe CE, et al. J Am Med Inform Assoc 2013;20:e341-e348.

Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium
Jyotishman Pathak,1 Kent R Bailey,1 Calvin E Beebe,1 Steven Bethard,2
David S Carrell,3 Pei J Chen,4 Dmitriy Dligach,4 Cory M Endle,1 Lacey A Hart,1
Peter J Haug,5 Stanley M Huff,5 Vinod C Kaggal,1 Dingcheng Li,1 Hongfang Liu,1
Kyle Marchant,6 James Masanz,1 Timothy Miller,4 Thomas A Oniki,5 Martha Palmer,2
Kevin J Peterson,1 Susan Rea,5 Guergana K Savova,4 Craig R Stancl,1
Sunghwan Sohn,1 Harold R Solbrig,1 Dale B Suesse,1 Cui Tao,7 David P Taylor,5
Les Westberg,6 Stephen Wu,1 Ning Zhuo,5 Christopher G Chute1
ABSTRACT
Research objective To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction.
Materials and methods Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems, Mayo Clinic and Intermountain Healthcare, were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine.
Results Using CEMs and open-source natural language processing and terminology services engines, namely Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2), we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL on a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria.
Conclusions End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent formats for high-throughput phenotyping to identify patient cohorts.


INTRODUCTION
The Office of the National Coordinator for Health Information Technology (HIT) in 2010 established the Strategic Health IT Research Program (SHARP) to address research and development challenges in wide-scale adoption of HIT tools and technologies for improved patient care and a cost-effective healthcare ecosystem. SHARP has four areas of focus1: security in HIT (SHARPs2; led by University of Illinois, Urbana-Champaign), patient-centered cognitive support (SHARPc3; led by University of Texas, Houston), healthcare applications and network design (SMART4; led by Harvard University), and secondary use of electronic health records (EHRs) (SHARPn5; led by Mayo Clinic).

The Mayo Clinic-led SHARPn program aims to enhance patient safety and improve patient medical outcomes by enabling the use of standards-based EHR data in research and clinical practice. A core component for achieving this goal is the ability to transform heterogeneous patient health information, typically stored in multiple clinical and health IT systems, into standardized, comparable, consistent, and queryable data.6 7 To this end, over the last 3 years, SHARPn has developed a suite of open-source and publicly available applications and resources that support ubiquitous exchange, sharing and reuse of EHR data collected during routine clinical processes in the delivery of patient care. In particular, the research and development efforts have focused on four main areas: (1) clinical information modeling and terminology standards; (2) a framework for normalization and standardization of clinical data, both structured and unstructured data extracted from clinical narratives; (3) a platform for representing and executing patient cohort identification and phenotyping logic; and (4) evaluation of data quality and utility of the SHARPn resources.

In this article, we describe current research progress and future plans for these foci of the SHARPn program. We illustrate our work via use cases in high-throughput phenotype extraction from EHR data, specifically for meaningful use (MU) clinical quality measures (CQMs), and discuss the challenges to shared, secondary use of EHR data relative to the work presented.


METHODS
SHARPn organizational and functional framework
The SHARPn vision is to develop and foster robust, scalable and pragmatic open-source informatics resources and applications that can facilitate large-scale use of heterogeneous EHR data for secondary purposes. In the past 3 years, via this highly collaborative project with team members from seven different organizations, we have developed an organizational and functional framework for SHARPn that is structured around informatics infrastructure for (1) information modeling and terminology standards, (2) clinical natural language processing (NLP), (3) clinical phenotyping, and (4) data quality. We describe these infrastructural components briefly in the following sections.

Clinical information modeling and terminology standards
Clinical concepts must be normalized if decision support and analytic applications are to operate reliably on heterogeneous EHR data. To achieve this goal, we adopted Clinical Element Models (CEMs)8 as target information models, which have been designed to provide a consistent architecture for representing clinical information in EHR systems. The goal of CEMs has been to specify granular, computable models of all elements that might be stored in an EHR system. The intent is to use these models in validating data during persistence and in developing the services and applications that exchange data, thus promoting interoperability between applications and systems. For the secondary use needs of SHARPn, a set of generic CEMs for capturing core clinical information, such as Medication, Labs, and Diagnosis, have been defined. CEMs are distributed in the following formats: Tree, Constraint Definition Language (CDL, the language GE Healthcare developed for authoring CEMs), and Extensible Markup Language (XML) Schema Definitions (XSDs). Figure 1 shows the demographics model, SecondaryUsePatient, in Tree format, where the bracket shows the cardinality of the corresponding feature. For example, a SecondaryUsePatient instance has at least one entry for PersonName ("unbounded" indicates that a demographics instance can have multiple entries for PersonName), while all other entries are optional. The CEM is authored in CDL and processed by a compiler, which validates the CDL and can generate various outputs, one of which is a SHARPn-specific XSD structure used to convey patient data instances conforming to CEMs.

Structured codes and terminologies are a key component in defining and instantiating the CEMs. In particular, the CEMs used by SHARPn adopt MU standards and value sets (use-case-specific subsets of standard vocabularies) for specifying diagnosis, medications, laboratory results, and other classes of data. Terminology services, especially mapping services, are essential, and we have adopted Common Terminology Services 2 (CTS2)9 as our core terminology infrastructure.

Figure 1 SecondaryUsePatient Clinical Element Model.
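To make the cardinality constraints of the Tree representation concrete, the sketch below models a SecondaryUsePatient-like instance in plain Python and checks that PersonName carries at least one entry while the remaining demographic fields stay optional. The field names and validation rule are illustrative assumptions drawn from the description above, not the official CEM definitions or compiler output.

```python
# Illustrative sketch only: field names and cardinalities are assumptions
# based on the Figure 1 description, not the official CEM XSD.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SecondaryUsePatient:
    # PersonName is 1..* ("unbounded"): at least one entry is required.
    person_names: List[str] = field(default_factory=list)
    # Remaining demographic entries are modeled here as optional (0..1).
    birth_date: Optional[str] = None          # eg, '1956-03-14'
    administrative_gender: Optional[str] = None
    administrative_race: Optional[str] = None

    def validate(self) -> None:
        """Enforce the 1..* cardinality on PersonName."""
        if not self.person_names:
            raise ValueError("SecondaryUsePatient requires at least one PersonName")


# Usage: a conforming instance passes validation; an empty one would not.
patient = SecondaryUsePatient(person_names=["Doe, Jane"], birth_date="1956-03-14")
patient.validate()  # no exception: cardinality satisfied
```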

Clinical data normalization and standardization
Syntactic and semantic data normalization
Figure 2 shows the data normalization pipeline architecture. The SHARPn data normalization pipeline adopts Mirth Connect, an open-source healthcare integration engine, as the interface engine, and Apache Unstructured Information Management Architecture (UIMA) as the software platform.10 It takes advantage of Mirth's ability to support the creation of interfaces between disparate systems, and UIMA's resource configuration ability to enable the transformation of heterogeneous EHR data sources (including clinical narratives) to common clinical models and standard value sets. The behavior of the normalization pipeline is driven through UIMA's resource configuration, that is, the source and target information models (syntactic) and their associated value sets (semantic), as well as the mapping information from source to target.

As discussed in the previous section, we adopt the CEMs and the CTS2 infrastructure for defining the information models and access to terminologies and value sets, respectively, thereby enabling syntactic and semantic normalization. In particular, syntactic normalization specifies where (in the source data or another location) to obtain the values that will fill the target model; these are structural mappings, mapping the input structure (eg, an HL7 message) to the output structure (a CEM). For semantic normalization, we draw value sets for problems, diagnoses, laboratory observations, medications and other classes of data from MU terminologies. The entire normalization process relies on the creation or identification of syntactic and semantic mappings, and we adopt the CTS2 Mapping Service (http://www.omg.org/spec/CTS2/1.0/) to specify these mappings. For brevity, additional details about the normalization process, along with examples, can be reviewed in the article by Kaggal et al.11
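As a rough illustration of the two mapping steps just described, the sketch below applies a structural (syntactic) mapping from fields of a parsed HL7-like message to CEM attribute names, and then a semantic mapping from a local medication code to a standard code through a hypothetical value-set lookup. The dictionaries stand in for mappings that the real pipeline resolves through UIMA resource configuration and the CTS2 Mapping Service; the field names and codes are invented for illustration.

```python
# Hypothetical mapping tables; in SHARPn these are resolved via UIMA
# resource configuration and the CTS2 Mapping Service, not hard-coded.
SYNTACTIC_MAP = {             # source field -> target CEM attribute
    "RXA-5": "drug_code",     # assumed HL7 field holding the coded drug
    "RXA-3": "start_date",
}
SEMANTIC_MAP = {              # local code -> MU-conformant standard code
    ("LOCAL", "D001234"): ("RxNorm", "197361"),  # invented example codes
}


def normalize(source_message: dict) -> dict:
    """Produce a CEM-like dict from a parsed source message."""
    cem = {}
    # Syntactic normalization: move values into the target model's slots.
    for source_field, cem_attribute in SYNTACTIC_MAP.items():
        if source_field in source_message:
            cem[cem_attribute] = source_message[source_field]
    # Semantic normalization: rewrite local codes against the value set.
    code = cem.get("drug_code")
    if code in SEMANTIC_MAP:
        system, standard_code = SEMANTIC_MAP[code]
        cem["drug_code"] = {"code_system": system, "code": standard_code}
    return cem


print(normalize({"RXA-5": ("LOCAL", "D001234"), "RXA-3": "20130417"}))
```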

Normalization of the clinical narrative using NLP
The clinical narrative within the EHR consists primarily of care providers' notes describing the patient's status, disease or tissue/image. The normalization of textual notes requires NLP methods that deal with the variety and complexity of human language. We have developed sophisticated information extraction methods to discover a relevant set of normalized summary information for a given patient in a disease- and use-case-agnostic way. We have defined six templates (abstractions of CEMs), which are populated by processing the textual information and then mapping it to the models. Table 1 lists the six templates along with their attributes. The anchors for each template are a Medication, a Sign/symptom, a Disease/disorder, a Procedure, a Laboratory result and an Anatomic site. Some attributes are relevant to all templates (for example, negation_indicator), while others are specific to a particular template (for example, dosage is specific to Medications).
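As one way to picture a populated template, the sketch below fills a Medication template with the common attributes plus a few of the medication-specific attributes from table 1, using a hypothetical extraction result. The attribute names mirror table 1, but the values and the upstream extraction are invented for illustration and are not cTAKES output.

```python
# Hypothetical populated Medication template; attribute names follow
# table 1, but the values and the upstream extraction are invented.
common_attributes = {
    "associated_code": "RxNorm:197361",   # invented code for illustration
    "negation_indicator": False,          # the mention is not negated
    "uncertainty_indicator": False,
    "subject": "patient",
    "conditional": False,
    "generic": False,
}
medication_specific = {
    "dosage": "1 tablet",
    "strength": "20 mg",
    "frequency": "once daily",
    "route": "oral",
    "start_date": "2013-04-17",
}
medication_template = {"anchor": "Medication", **common_attributes, **medication_specific}

# A downstream step would map this template onto the corresponding
# Secondary Use Noted Drug CEM before persistence.
for attribute, value in sorted(medication_template.items()):
    print(f"{attribute}: {value}")
```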
Figure 2 SHARPn clinical data normalization pipeline. (1) Data to be normalized are read from the file system. These data can also be transmitted on NwHIN via TCP/IP from an external entity. (2) Mirth Connect invokes the normalization pipeline using one of its predefined channels and passes the data (eg, HL7, CCD, tabular data) to be normalized. (3) The normalization pipeline goes through initialization of the components (including loading resources from the file system or other predefined resources such as the Common Terminology Services 2 (CTS2)) and then performs syntactic parsing and semantic normalization to generate normalized data in the form of a Clinical Element Model (CEM). (4) Normalized data are handed back to Mirth Connect. (5) Mirth Connect uses one of the predefined channels to serialize the normalized CEM data to CouchDB or MySQL, based on the configuration. cTAKES, clinical Text Analysis and Knowledge Extraction System; DB, database; NLP, natural language processing; UIMA, Unstructured Information Management Architecture.

The main methods used for the normalization of the clinical narrative are rules and supervised machine learning. As with any supervised machine learning technique, the algorithms require labeled data points from which to learn the patterns and evaluate the output. Therefore, a corpus representative of the EHR from two SHARPn institutions (Mayo Clinic and Seattle Group Health) was sampled and deidentified following Health Insurance Portability and Accountability Act guidelines. The deidentification process was a combination of automatic output from the MITRE Identification Scrubber Tool (MIST)12 and manual review. The corpus size is 500K words, which we intend to share with the research community under data use agreements with the originating institutions. The corpus was annotated by experts for several layers following carefully extended or newly developed annotation guidelines conformant with established standards and conventions in the clinical NLP community to allow interoperability. The annotated layers (syntactic and semantic) allow learning structures over increasingly complex language representations, thus enabling state-of-the-art information extraction from the clinical narrative. A detailed description of the syntactic and select semantic layers is provided in Albright et al.13

Table 1 Template generalizations of the Clinical Element Model used for the normalization of clinical narrative

Common attributes (all templates): AssociatedCode, Conditional, Generic, Negation_indicator, Subject, Uncertainty_indicator

Template-specific attributes:
  Medication: Change_status, Dosage, Duration, End_date, Form, Relative_temporal_context, Frequency, Route, Start_date, Strength
  Sign/symptom: Alleviating_factor, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Severity, Start_date
  Disease/disorder: Alleviating_factor, Associated_sign_symptom, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Start_date
  Procedure: Body_laterality, Body_location, Body_side, Device, Duration, End_date, Method, Relative_temporal_context, Start_date
  Laboratory results: Abnormal_interpretation, Delta_flag, End_date, Lab_value, Ordinal_interpretation, Reference_range_narrative, Start_date
  Anatomic site: Body_laterality, Body_side



The best-performing NLP methods are implemented as part of the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES),14 which is built using UIMA. It comprises a variety of modules, including sentence boundary detection, syntactic parsers, named entity recognition, negation/uncertainty discovery, and coreference resolution, to name a few. For method details and formal evaluations, see Wu et al,15 Albright et al,13 Choi and Palmer,16 17 Clark et al,18 Savova et al,19 Sohn et al,20 and Zheng et al.21 Additional details on cTAKES are available from http://ctakes.apache.org.
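As a conceptual picture of how such modules compose, the sketch below chains a few annotator stages over a snippet of narrative text, each stage adding annotations that later stages can use. It is a generic illustration of the pipeline idea, not the cTAKES or UIMA API, and the toy annotators are far simpler than the real components.

```python
# Generic illustration of chained annotator stages; not the cTAKES/UIMA API.
import re


def sentence_detector(doc):
    # Naive sentence splitting on terminal punctuation.
    doc["sentences"] = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s.strip()]
    return doc


def named_entity_recognizer(doc):
    # Toy dictionary lookup standing in for a trained NER component.
    lexicon = {"aspirin": "Medication", "chest pain": "Sign/symptom"}
    doc["entities"] = [(term, label) for term, label in lexicon.items()
                       if term in doc["text"].lower()]
    return doc


def negation_detector(doc):
    # Crude cue-based check standing in for negation/uncertainty discovery.
    doc["negated"] = [term for term, _ in doc["entities"]
                      if re.search(rf"(no|denies)\s+(\w+\s+)?{re.escape(term)}", doc["text"].lower())]
    return doc


pipeline = [sentence_detector, named_entity_recognizer, negation_detector]
document = {"text": "Patient denies chest pain. Continue aspirin daily."}
for stage in pipeline:
    document = stage(document)
print(document["entities"], document["negated"])
```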

Data transfer, processing, storage and retrieval infrastructure
Data transfer/processing
We have primarily leveraged two industry-standard, open-source tools for data exchange and transfer: Mirth Connect and Aurion (NwHIN). Aurion provides gateway-to-gateway data exchange using the NwHIN XDR (Document Submission) protocol, enabling participating partners to push clinical documents in a variety of forms (HL7 2.x, Clinical Document Architecture (CDA), and CEM). Mirth Connect software and channels have been developed to provide Sender, Receiver, Transformation, and Persistence channels in support of the variety of use cases, as well as to enable the interconnectivity of the various SHARPn systems.

Data storage/retrieval
The SHARPn teams have developed two different technology solutions for the storage and retrieval of normalized data. The first is an open-source SQL-based solution. This solution enables each individual CEM instance record to be stored in a standardized SQL data model. Data can be queried directly from the SQL database tables and extracted as individual fields or as complete XML records. The second is a document storage solution that leverages the open-source CouchDB22 repository to store the CEM XML data as individual JavaScript Object Notation (JSON) documents, providing a more document-centric view into the data. Both solutions leverage toolsets built around the CEM data models for a variety of clinical data including NotedDrugs, Labs, Administrative Diagnosis, and other clinically relevant data.
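To illustrate the document-centric option, the sketch below stores and retrieves one CEM-like instance as a JSON document through CouchDB's standard HTTP API. The host, database name, and document structure are assumptions made for illustration; in the SHARPn platform, persistence is performed through Mirth Connect channels rather than ad hoc scripts.

```python
# Minimal sketch of document-style persistence via CouchDB's HTTP API.
# Host, database name, and document shape are illustrative assumptions.
import requests

COUCHDB = "http://localhost:5984"
DATABASE = "cem_store"          # hypothetical database name

cem_document = {
    "cem_type": "SecondaryUseNotedDrug",
    "patient_id": "example-0001",                       # invented identifier
    "drug_code": {"code_system": "RxNorm", "code": "197361"},
    "start_date": "2013-04-17",
}

# Create the database if it does not already exist.
requests.put(f"{COUCHDB}/{DATABASE}")

# Store the CEM instance as a JSON document keyed by an explicit id.
doc_id = "noted-drug-example-0001"
response = requests.put(f"{COUCHDB}/{DATABASE}/{doc_id}", json=cem_document)
print(response.status_code)      # 201 on first insert

# Retrieve it back for a document-centric view of the normalized data.
print(requests.get(f"{COUCHDB}/{DATABASE}/{doc_id}").json())
```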

High-throughput cohort identification and phenotype extraction
SHARPn overloads the term "phenotyping" to imply the algorithmic recognition of any cohort within an EHR for a defined purpose, including case-control cohorts for genome-wide association studies, clinical trials, quality metrics, and clinical decision support. In the recent past, several projects, including eMERGE,23 PGRN,24 and i2b2,25 have developed tools and technologies for identifying patient cohorts using EHRs. A key aspect of this process is to define inclusion and exclusion criteria involving EHR data fields (eg, diagnoses, procedures, laboratory results, and medications) and logical operators. We refer to these criteria as phenotyping algorithms; they are typically represented as pseudocode with varying degrees of formality and structure. CQMs, for instance in the MU program, are examples of phenotyping algorithms maintained by the National Quality Forum (NQF), based on the Quality Data Model (QDM)26 and represented in the HL7 Health Quality Measures Format (HQMF27 or eMeasure). The QDM is an information model and grammar intended to represent data collected during routine clinical care in EHRs as well as the basic logic required to articulate the algorithmic criteria for phenotype definitions.

While the NQF, individual measure developers, the National Library of Medicine, and others have made improvements in the clarity of eMeasure logic and coded value sets for the 2014 set of CQMs28 for MU stage 2, at least in MU stage 1 these measures often required human interpretation and translation into local queries in order to extract necessary data and produce calculations. There have been two challenges in particular: (1) local data elements in an EHR may not be natively represented in a format consistent with the QDM, including the required code systems and value sets; and (2) an EHR typically does not natively have the capability to automatically consume and execute eMeasure logic. In other words, work has been needed locally to translate the logic (eg, using an approach such as SQL) and to map local data elements and codes (eg, mapping an element using a proprietary code to SNOMED). This greatly erodes the advantages of the eMeasure approach, but has been the reality for many vendors and institutions in the first stage of MU.29

To address these challenges, the SHARPn project has investigated aligning the representation of data with the QDM along with automatic interpretation and execution of eMeasures using the open-source JBoss Drools30 rules management system. Specifically, using the Apache UIMA platform, we have developed a translator tool that converts QDM-defined phenotyping algorithm criteria into executable Drools rules scripts. In essence, the tool takes the HQMF XML file and the relevant value sets as inputs, maps the QDM categories and criteria to CEMs, and, at run-time, generates JavaScript queries to the JSON-based CouchDB CEM datastore containing patient information. Figure 3 shows the overall architecture of the conversion tool, with additional details presented in Li et al.31

Figure 3 Architecture for Quality Data Model (QDM) to Drools translator system. AE, annotation engine; CAS, common analysis system; CEM, Clinical Element Model; DB, database; UIMA, Unstructured Information Management Architecture; XSLT, extensible stylesheet transformation language.
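As a rough sketch of the translation step, the code below maps QDM data elements (category plus value set) to the CEM that would be queried in the datastore, in the spirit of table 3, and emits a simple query description for each criterion. The mapping table, criterion objects, and query format are simplified assumptions for illustration; the actual tool parses HQMF XML and generates Drools rules and CouchDB JavaScript queries.

```python
# Simplified stand-in for the QDM-to-Drools translator's mapping step.
# The category-to-CEM table follows the spirit of table 3; the criterion
# objects and the query format are assumptions for illustration.
QDM_TO_CEM = {
    "Diagnosis, active": "AdministrativeDiagnosis",
    "Laboratory test, result": "SecondaryUseStandardLabObsQuantitative",
    "Medication, active": "SecondaryUseNotedDrug",
    "Patient characteristic: birth date": "SecondaryUsePatient",
}


def criterion_to_query(criterion: dict) -> dict:
    """Turn one QDM criterion into a CEM-level query description."""
    cem = QDM_TO_CEM[criterion["category"]]
    return {
        "cem": cem,                            # document type to query
        "code_in": criterion["value_set"],     # codes drawn from the value set
        "during": criterion.get("measurement_period"),
    }


nqf0064_denominator = [
    {"category": "Diagnosis, active",
     "value_set": ["250.00", "250.02"],        # illustrative ICD-9-CM codes
     "measurement_period": ("2012-01-01", "2012-12-31")},
]
for criterion in nqf0064_denominator:
    print(criterion_to_query(criterion))
```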

Data quality and consistency
Accurate and representative EHR data are required for effective secondary use in health improvement measures and research.32 There are often multiple sources with the same information in the EHR, such as medications from an application that supports the printing or messaging of prescriptions versus medication data that were processed from progress notes using NLP. Identical data elements from hospital, ambulatory, and nursing home care, for example, have variations in their original use context. We take a two-pronged view toward the requirement of providing accurate and correct processing of source data. The first is to validate correct functionality of the SHARPn components and services. The second is to develop methods and components to monitor and report end-to-end data validity for intended users of the SHARPn deliverables. Our approach to validating the normalization pipeline was to construct an environment in which representative samples of various types of clinical data are transmitted across the internet, the data are persisted as normalized CEM-based data instances, and the CEM-based instances are reconciled with the originally transmitted data. Figure 4 illustrates this approach.

Figure 4 Conceptual diagram of validation processes. (1) Use-case data are translated to HL7 and submitted to the MIRTH interface engine, and then (2) processed and stored as normalized data objects. (3) A Java application pulls data from the source database and the Clinical Element Model (CEM) database, compares them, then (4) prints inconsistencies for manual review.
In particular, we have validated a stream of 1000 HL7 V.2.5 medication messages end-to-end: from the source data submitted through an NwHIN gateway at Intermountain Healthcare to normalized Secondary Use Noted Drug CEM instances retrieved from the platform datastores at Mayo Clinic. Once a validation procedure had been successfully tested in the SQL datastore environment, the evaluation process was applied to the output from the CouchDB repository. The content of all messages was matched correctly between input and output data, with the exception of one medication RxNorm code that was not populated in the CTS2 terminology server (more details in the Results section). Other SHARPn CEMs are in various stages of validation as the pipeline components continue to evolve. Our continuing objective is to develop standard procedures to identify transmission/translation errors as data from new sites are added to a centralized repository. These errors expose the modifications to underlying system resources within the normalization pipeline that are necessary to produce an accurate conversion of the new data into the canonical CEM standard.
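The reconciliation step itself is conceptually simple; the sketch below compares selected fields of a source medication record with the corresponding persisted CEM instance and reports any mismatches for manual review, much as the Java application in figure 4 does. The field list follows the medication attributes compared in the Results section, but the record shapes and values are assumptions for illustration, not the validation tool itself.

```python
# Sketch of the reconciliation step from figure 4: compare selected
# fields of the submitted source record with the persisted CEM instance.
# Field names and the example records are assumptions for illustration.
FIELDS_TO_CHECK = ["ingredient", "strength", "frequency", "route", "dose", "dose_unit"]


def reconcile(source: dict, cem_instance: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for field_name in FIELDS_TO_CHECK:
        if source.get(field_name) != cem_instance.get(field_name):
            problems.append(
                f"{field_name}: source={source.get(field_name)!r} cem={cem_instance.get(field_name)!r}"
            )
    return problems


source_record = {"ingredient": "salmeterol", "strength": "50 mcg",
                 "frequency": "BID", "route": "inhalation", "dose": "1", "dose_unit": "puff"}
persisted_cem = {"ingredient": None,          # eg, a code that failed to populate
                 "strength": "50 mcg", "frequency": "BID",
                 "route": "inhalation", "dose": "1", "dose_unit": "puff"}

for inconsistency in reconcile(source_record, persisted_cem):
    print(inconsistency)   # flagged for manual review
```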

RESULTS
Medication CEMs and terminology mappings
As discussed above, for testing our data normalization pipeline, 1000 ambulatory medication order instances were evaluated for the correctness of structural and semantic mapping from HL7 V.2.3 Medication Order messages to CEMs. The medication data compared included ingredient, strength, frequency, route, dose and dose unit. Specifically, 266 distinct First Data Bank GCN sequence codes (GCN-SEQNO) were used in the test messages. These generic clinical medication terms were a random selection from Intermountain's EHR data. All but one (GCN-SEQNO 031417, Serevent Diskus (Salmeterol) 50 µg/Dose Inhalation Disk with Device) were successfully mapped to an RxNorm ingredient code in the pipeline. It has been noted in previous evaluations of RxNorm coverage of actual medication data that metered dose inhalers are, in general, challenging to map.33 34

MU quality measures
To demonstrate the applicability of our phenotyping infrastructure, we experimented with one of the MU phase 1 eMeasures: NQF 0064 (Diabetes: Low Density Lipoprotein (LDL) Management and Control35; figure 5). NQF 0064 determines the percentage of patients between 18 and 75 years with diabetes whose most recent LDL-cholesterol (LDL-C) test result during the measurement year was <100 mg/dL. Thus, the denominator criterion identifies patients between 18 and 75 years of age who had a diagnosis of diabetes (type 1 or type 2), and the numerator criterion identifies patients with an LDL-C measurement of <100 mg/dL. The patient is not numerator-compliant if the result for the most recent LDL-C test during the measurement period is ≥100 mg/dL or is missing, or if an LDL-C test was not performed during the measurement time period. NQF 0064 also excludes patients with diagnoses of polycystic ovaries, gestational diabetes or steroid-induced diabetes.

Figure 5 Denominator, numerator and exclusion criteria for NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control.
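Read as executable logic, the measure boils down to the sketch below, which applies the denominator (age 18-75 years with a diabetes diagnosis), numerator (most recent LDL-C during the measurement year <100 mg/dL) and exclusion checks to simplified patient records. The field names, codes and toy records are invented for illustration; the actual execution happens through the generated Drools rules over CEM instances.

```python
# Sketch of NQF 0064 logic over simplified, already-normalized records.
# Field names, codes and the toy records are invented for illustration.
DIABETES_CODES = {"250.00", "250.02"}      # illustrative ICD-9-CM diabetes codes
EXCLUSION_CODES = {"256.4"}                # illustrative exclusion code (eg, polycystic ovaries)


def diagnosis_codes(patient: dict) -> set:
    return {dx["code"] for dx in patient.get("diagnoses", [])}


def ldl_controlled(patient: dict, year: str = "2012") -> bool:
    """Most recent LDL-C result in the measurement year is <100 mg/dL."""
    results = [r for r in patient.get("ldl_results", []) if r["date"].startswith(year)]
    if not results:
        return False                       # missing test -> not numerator-compliant
    most_recent = max(results, key=lambda r: r["date"])
    return most_recent["value_mg_dl"] < 100


def evaluate(patients: list) -> dict:
    denominator = [p for p in patients
                   if 18 <= p["age"] <= 75 and diagnosis_codes(p) & DIABETES_CODES]
    excluded = [p for p in denominator if diagnosis_codes(p) & EXCLUSION_CODES]
    eligible = [p for p in denominator if p not in excluded]
    numerator = [p for p in eligible if ldl_controlled(p)]
    return {"denominator": len(denominator), "numerator": len(numerator),
            "excluded": len(excluded)}


print(evaluate([{"age": 63, "diagnoses": [{"code": "250.00"}],
                 "ldl_results": [{"date": "2012-08-02", "value_mg_dl": 88}]}]))
```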
As the goal is to execute such an algorithm in a semi-automatic fashion, we first invoked the SHARPn data normalization and NLP pipelines in UIMA to populate the CEM database. Specifically, we extracted demographics, diagnoses, medications, clinical procedures and laboratory measurements for 273 patients from Mayo's EHR systems (table 2) and normalized the data to HL7 value sets, International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), RxNorm, Current Procedural Terminology (CPT)-4 and Logical Observation Identifiers Names and Codes (LOINC) codes, respectively. The NLP pipeline was particularly invoked in processing the medication data, as demonstrated in prior work.36

For translating the QDM criteria into Drools rules, a key component comprised mapping from QDM data elements to CEMs. This enables the mapping and extraction from disparate EHRs to this normalized representation of the source data. Additionally, CEMs contain attributes of provenance, such as date/time, originating clinician, clinic, and entry application, and other model-specific attributes, such as the order status of a medication. These care process-related attributes are required in QDM data specifications. CEMs are bound to a standard value set. The key code and the data code are generally the CEM qualifiers that enable a mapping from a QDM data element specification to one or more CEMs. Table 3 shows a sample of the mappings developed for NQF 0064. Two QDM data categories, Diagnosis and Encounter, may be satisfied by one or more CEMs.
Table 2 Basic demographics for patients (N=273) evaluated for NQF 0064

Category                                    Number of patients
Gender
  Male                                      132 (48%)
  Female                                    141 (52%)
Race
  White                                     236 (85%)
  Black                                     8 (3%)
  Alaskan Indian                            10 (4%)
  Asian                                     5 (2%)
  Pacific Islander                          4 (2%)
  Other                                     10 (4%)
Ethnicity
  Hispanic, Latino or Spanish origin        7 (4%)
  Non-Hispanic, Latino or Spanish origin    235 (85%)
  Unknown                                   31 (11%)
Age (years)
  ≤18                                       31 (11%)
  19-30                                     32 (11%)
  31-50                                     70 (26%)
  51-75                                     105 (38%)
  >75                                       35 (14%)

Both of these map to an AdministrativeDiagnosis CEM: the QDM data element will be instantiated based on matching the data qualifier to the QDM value set. All QDM data specifications for NQF 0064 were successfully mapped to CEMs. Executing NQF 0064 on this population of 273 patients identified 21 and 18 patients for the denominator and numerator, respectively (table 4). Nineteen patients were excluded because they did not meet either the denominator or numerator criterion for NQF 0064. Furthermore, we have implemented an open-source platform (http://phenotypeportal.org) that provides a library of such cohort identification algorithms, as well as tools for visualization, reporting and analysis of algorithm results.

Table 4 Criteria results for patients (N=273) evaluated for NQF 0064

NQF 0064 population          Number of patients
Initial patient population   273
Denominator                  21
Numerator                    18
Exclusion                    19

DISCUSSION

We have briefly presented the SHARPn project, whose core mission is to develop methods and modular open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent formats. A key aspect to achieving this is normalization of both structured and unstructured clinical data into a unified, concept-based formalism. Using CEMs and Apache UIMA, we have developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. This platform leverages and builds on our existing work on the cTAKES and CTS2 open-source tools, which have been widely adopted in the health informatics community. With phenotyping as SHARPn's major use case, we have collaborated, and continue to collaborate, with NQF on further development and adoption of QDM and HQMF for standardized representation of cohort definition criteria. As illustrated in the previous section, our goal is to create a national library of standardized phenotyping algorithms, and to provide publicly available infrastructure that facilitates their execution in a semi-automatic fashion. Finally, this article describes a systematic implementation for an informatics infrastructure.


Table 3 Sample QDM category to CEM mapping for NQF 0064

QDM data element: Diagnosis, active: diabetes
  QDM code system: ICD-9-CM, SNOMED-CT
  CEM: Administrative Diagnosis; Secondary Use Assertion
  CEM qualifier(s): Data, Encounter Date; Data, Date Of Onset or Observed Start Time

QDM data element: Encounter: non-acute inpatient and outpatient
  QDM code system: CPT, ICD-9-CM
  CEM: Administrative Procedure; Administrative Diagnosis
  CEM qualifier(s): Data, Encounter Date; Data, Encounter Date

QDM data element: Laboratory test, result: LDL test
  QDM code system: LOINC
  CEM: Secondary Use Standard LabObs Quantitative
  CEM qualifier(s): Key, data

QDM data element: Medication, active: Meds indicative of diabetes
  QDM code system: RxNorm
  CEM: Secondary Use Noted Drug
  CEM qualifier(s): data, Start Time

QDM data element: Patient characteristic: birth date
  QDM code system: LOINC
  CEM: Secondary Use Patient
  CEM qualifier(s): Birth Date

QDM data element: Race
  QDM code system: CDCREC
  CEM: Secondary Use Patient
  CEM qualifier(s): Administrative Race

QDM data element: ONC administrative Sex
  QDM code system: Administrative Sex
  CEM: Secondary Use Patient
  CEM qualifier(s): Administrative Gender

CDCREC, US Centers for Disease Control and Prevention Race and Ethnicity Code Set; CEM, Clinical Element Model; CPT, Current Procedural Terminology; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; LDL, low-density lipoprotein; LOINC, Logical Observation Identifiers Names and Codes; ONC, Office of the National Coordinator; QDM, Quality Data Model.

While there are alternative commercial (eg, Oracle Cohort Explorer37) and open-source (eg, PopHealth38) tools that achieve similar functionality, to our knowledge ours is the first system that leverages an emerging national standard, namely QDM, for representing phenotypic criteria, as well as a rules-based system (Drools) for executing such criteria.

Several limitations and challenges remain to full achievement of the SHARPn vision. First, to achieve any form of practical semantic normalization, we are faced with the unavoidable requirement for well-curated semantic mapping tables between native data representations and their target standards. While several semi-automated methods for mapping are being investigated,39-42 terminology mapping remains a largely labor-intensive process. Second, within the context of secondary use, metadata and other means of understanding the provenance and meaning of source EHR data are highly pertinent. While CEMs do accommodate provenance details, such information is typically either not readily available or, in a few cases, not adequately processed by the data normalization framework, suggesting the need for future enhancement. Furthermore, at present, for structured clinical data, we use HL7 2.x and CCD messages as the payload of an NwHIN Document Submission message, which are then transformed and normalized using the UIMA pipeline. However, the emergence of more comprehensive standards, such as the Consolidated Clinical Document Architecture,43 will warrant further evaluation, including investigation of the use of Mirth Connect and Aurion NwHIN for transferring the message payload. Further, our existing evaluations with eMeasures have been limited to MU phase 1 as well as to criteria with core logical operators (eg, OR, AND). However, QDM defines a comprehensive list of operators including temporal logic (eg, NOW, WEEK, CURRTIME), mathematical functions (eg, MEAN, MEDIAN), and qualifiers (eg, FIRST, SECOND, RELATIVE FIRST) that require a deeper understanding of the underlying data semantics. Our plan is to incorporate these additional capabilities, as well as the 2014 MU phase 2 CQMs, in future releases of the QDM-to-Drools translator.

CONCLUSION
End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and ontologies, as well as robust information
models for storing, discovering, and processing that information. The main objective of SHARPn is to develop methods and
modular open-source resources for enabling secondary use of
EHR data through normalization to standards-based, comparable, and consistent formats for high-throughput phenotyping.
In this article, we present our current research progress and
plans for future work.
Correction notice This paper has been corrected since it was published Online First. The middle initial of the fifth author has been corrected.

Contributors JP and CGC designed the study and wrote the manuscript. All other authors checked the individual sections in the manuscript and discussion. All authors contributed to and approved the final manuscript.

Funding This research and the manuscript were made possible by funding from the Strategic Health IT Advanced Research Projects (SHARP) Program (90TR002) administered by the Office of the National Coordinator for Health Information Technology.

Competing interests None.

Patient consent Obtained.

Ethics approval This study was approved by the Mayo Clinic Institutional Review Board (IRB#: 12-003424) as a minimal risk protocol.

Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1. Office of the National Coordinator. Strategic Health IT Advanced Research Projects: SHARP. 2010. http://healthit.hhs.gov/sharp (accessed 7 Sep 2010).
2. SHARPs: Strategic Health IT Advanced Research Project on Security. http://sharps.org (accessed 18 Dec 2012).
3. SHARPc: Strategic Health IT Advanced Research Project on Cognitive Informatics and Decision Making. http://sharpc.org (accessed 18 Dec 2012).
4. Mandl KD, Mandel JC, Murphy SN, et al. The SMART Platform: early experience enabling substitutable applications for electronic health records. J Am Med Inform Assoc 2012;19:597-603.
5. Rea S, Pathak J, Savova G, et al. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform 2012;45:763-71.

6. Kern LM, Malhotra S, Barrón Y, et al. Accuracy of electronically reported meaningful use clinical quality measures: a cross-sectional study. Ann Intern Med 2013;158:77-83.
7. Blumenthal D, Tavenner M. The meaningful use regulation for electronic health records. N Engl J Med 2010;363:501-4.
8. Oniki T, Coyle J, Parker C, et al. Lessons learned in detailed clinical modeling at Intermountain Healthcare. American Medical Informatics Association (AMIA) Annual Symposium; 2012:1638-9.
9. Common Terminology Services 2 (CTS 2). http://informatics.mayo.edu/cts2 (accessed 25 Dec 2012).
10. Ferrucci D, Lally A. Building an example application with the Unstructured Information Management Architecture. IBM Syst J 2004;43:455-75.
11. Kaggal V, Oniki T, Marchant K, et al. SHARPn data normalization pipeline. AMIA Annual Symposium; 2013. (Accepted for publication).
12. Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE identification scrubber toolkit: design, training, and assessment. Int J Med Inform 2010;79:849-59.
13. Albright D, Lanfranchi A, Fredriksen A, et al. Towards syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc 2013;20:922-30.
14. Savova G, Masanz J, Ogren P, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507-13.
15. Wu S, Kaggal V, Dligach D, et al. A common type system for clinical Natural Language Processing. J Biomed Semantics 2013. In press.
16. Choi JD, Palmer M. Getting the most out of transition-based dependency parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, Volume 2; Portland, Oregon: 2011.
17. Choi JD, Palmer M. Transition-based semantic role labeling using predicate argument clustering. Proceedings of the ACL 2011 Workshop on Relational Models of Semantics; Portland, Oregon: 2011.
18. Clark C, Aberdeen J, Coarr M, et al. MITRE system for clinical assertion status classification. J Am Med Inform Assoc 2011;18:563-7.
19. Savova G, Ogren P, Duffy P, et al. Mayo Clinic NLP system for patient smoking status identification. J Am Med Inform Assoc 2008;15:25-8.
20. Sohn S, Murphy SN, Masanz J, et al. Classification of medication status change in clinical narratives. American Medical Informatics Association (AMIA) Annual Symposium; 2010.
21. Zheng J, Chapman WW, Miller TA, et al. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc 2012;19:660-7.
22. Apache CouchDB. http://couchdb.apache.org (accessed 12 Dec 2012).
23. McCarty C, Chisholm R, Chute C, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011;4:13.
24. Long RM, Berg JM. What to expect from the pharmacogenomics research network. Clin Pharmacol Ther 2011;89:339-41.

25. Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012;19:181-5.
26. National Quality Forum (NQF) Quality Data Model (QDM). http://www.qualityforum.org/Projects/h/QDS_Model/Quality_Data_Model.aspx (accessed 23 May 2012).
27. HL7 Health Quality Measure Format (HQMF). http://code.google.com/p/hqmf/ (accessed 23 May 2012).
28. Center for Medicare and Medicaid Services: 2014 Clinical Quality Measures. http://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/2014_ClinicalQualityMeasures.html (accessed 25 Dec 2012).
29. Fu PC Jr, Rosenthal D, Pevnick JM, et al. The impact of emerging standards adoption on automated quality reporting. J Biomed Inform 2012;45:772-81.
30. JBoss Drools Business Logic Integration Platform. http://www.jboss.org/drools (accessed 23 May 2012).
31. Li D, Shrestha G, Murthy S, et al. Modeling and executing electronic health records driven phenotyping algorithms using the NQF quality data model and JBoss drools engine. American Medical Informatics Association (AMIA) Annual Symposium; 2012.
32. Hammond WE, Bailey C, Boucher P, et al. Connecting information to improve health. Health Aff (Millwood) 2010;29:284-8.
33. Bell DS, O'Neill SM, Reynolds KA, et al. Evaluation of RxNorm in ambulatory electronic prescribing. Santa Monica, California: RAND Health, 2011.
34. O'Neill SM, Bell DS. Evaluation of RxNorm for representing ambulatory prescriptions. AMIA Annu Symp Proc 2010;2010:562-6.
35. NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control. http://ushik.ahrq.gov/ViewItemDetails?system=mu&itemKey=122574000 (accessed 18 Dec 2012).
36. Pathak J, Murphy SN, Willaert B, et al. Using RxNorm and NDF-RT to classify medication data extracted from electronic health records: experiences from the Rochester Epidemiology Project. American Medical Informatics Association (AMIA) Annual Symposium; 2011.
37. Oracle Health Sciences Cohort Explorer. http://docs.oracle.com/cd/E24441_01/doc.10/e25021/toc.htm (accessed 6 Sep 2013).
38. Project popHealth. http://projectpophealth.org/ (accessed 13 Mar 2012).
39. Ghazvinian A, Noy N, Jonquet C, et al. What four million mappings can tell you about two hundred ontologies. 8th International Semantic Web Conference; 2009.
40. Ghazvinian A, Noy N, Jonquet C, et al. Creating mappings for ontologies in biomedicine: simple methods work. American Medical Informatics Association (AMIA) Annual Symposium; 2009.
41. Wang Y, Patrick J, Miller G, et al. Linguistic mapping of terminologies to SNOMED-CT. Semantic Mining Conference on SNOMED CT; 2006.
42. Choi N, Song I, Han H. A survey on ontology mapping. ACM SIGMOD Record 2006;35:34-41.
43. HL7 Consolidated Clinical Document Architecture (CCDA). http://www.hl7.org/implement/standards/product_brief.cfm?product_id=258 (accessed 11 Dec 2012).
