1 Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
2 Department of Linguistics, University of Colorado, Boulder, Colorado, USA
3 Group Health Research Institute, Seattle, Washington, USA
4 Boston Children's Hospital, Harvard University, Boston, Massachusetts, USA
5 Homer Warner Center for Informatics Research, Intermountain Healthcare, Salt Lake City, Utah, USA
6 Agilex Technologies, Chantilly, Virginia, USA
7 School of Biomedical Informatics, University of Texas Health Sciences Center, Houston, Texas, USA
Correspondence to Dr Jyotishman Pathak, Mayo Clinic College of Medicine, Rochester, MN 55902, USA; Pathak.Jyotishman@mayo.edu
Received 17 April 2013
Revised 7 October 2013
Accepted 11 October 2013
Published Online First 5 November 2013
INTRODUCTION
The Office of the National Coordinator for Health Information Technology (HIT) in 2010 established the Strategic Health IT Advanced Research Projects (SHARP) Program to address research and development challenges in the wide-scale adoption of HIT tools and technologies for improved patient care and a cost-effective healthcare ecosystem. SHARP has four areas of focus1: security in HIT (SHARPS2; led by the University of Illinois, Urbana-Champaign), patient-centered cognitive support (SHARPC3; led by the University of Texas, Houston), healthcare applications and network design (SMART4; led by Harvard University), and secondary use of electronic health records (EHRs) (SHARPn5; led by Mayo Clinic).

The Mayo Clinic-led SHARPn program aims to enhance patient safety and improve patient medical outcomes by enabling the use of standards-based EHR data in research and clinical practice. A core component for achieving this goal is the ability to transform heterogeneous patient health information, typically stored in multiple clinical and health IT systems, into standardized, comparable, consistent, and queryable data.6 7 To this end, over the last 3 years, SHARPn has developed a suite of open-source and publicly available applications and resources that support the ubiquitous exchange, sharing, and reuse of EHR data collected during routine clinical processes in the delivery of patient care. In particular, the research and development efforts have focused on four main areas: (1) clinical information modeling and terminology standards; (2) a framework for normalization and standardization of clinical data, both structured data and unstructured data extracted from clinical narratives; (3) a platform for representing and executing patient cohort identification and phenotyping logic; and (4) evaluation of data quality and utility of the SHARPn resources.

In this article, we describe current research progress and future plans for these foci of the SHARPn program. We illustrate our work via use cases in high-throughput phenotype extraction from EHR data, specifically for meaningful use (MU) clinical quality measures (CQMs), and discuss the challenges to shared, secondary use of EHR data relative to the work presented.
METHODS
SHARPn organizational and functional framework
The SHARPn vision is to develop and foster robust, scalable, and pragmatic open-source informatics resources and applications that can facilitate large-scale use of heterogeneous EHR data for secondary purposes. In the past 3 years, via this highly collaborative project with team members from seven different organizations, we have developed an organizational and functional framework for SHARPn that is structured around informatics infrastructure for (1) information modeling and terminology standards, (2) clinical natural language processing (NLP), (3) clinical phenotyping, and (4) data quality. We describe these infrastructural components briefly in the following sections.
Figure 2 SHARPn clinical data normalization pipeline. (1) Data to be normalized are read from the file system. These data can also be transmitted on NwHIN via TCP/IP from an external entity. (2) Mirth Connect invokes the normalization pipeline using one of its predefined channels and passes the data (eg, HL7, CCD, tabular data) to be normalized. (3) The normalization pipeline goes through initialization of the components (including loading resources from the file system or other predefined resources such as the Common Terminology Services 2 (CTS2)) and then performs syntactic parsing and semantic normalization to generate normalized data in the form of a Clinical Element Model (CEM). (4) Normalized data are handed back to Mirth Connect. (5) Mirth Connect uses one of the predefined channels to serialize the normalized CEM data to CouchDB or MySQL based on the configuration. cTAKES, clinical Text Analysis and Knowledge Extraction System; DB, database; NLP, natural language processing; UIMA, Unstructured Information Management Architecture.
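To make the caption's flow concrete, the sketch below walks through steps 1-5 in plain Python. It is illustrative only: the actual pipeline is built from Mirth Connect channels and UIMA analysis engines, the toy lookup table stands in for a CTS2 terminology service, and all function and field names here are hypothetical.

# Illustrative walk-through of the figure 2 flow; the real pipeline uses
# Mirth Connect and UIMA, simplified here to plain Python functions.
import json

LOCAL_TO_STANDARD = {"GLU": ("2345-7", "LOINC")}  # toy stand-in for a CTS2 lookup

def parse_message(raw):
    # Step 3a: syntactic parsing of a toy pipe-delimited lab result.
    local_code, value, unit = raw.split("|")
    return {"local_code": local_code, "value": float(value), "unit": unit}

def normalize_to_cem(parsed):
    # Step 3b: semantic normalization into a CEM-like record.
    code, system = LOCAL_TO_STANDARD[parsed["local_code"]]
    return {"cem": "LabObsQuantitative", "code": code, "codeSystem": system,
            "value": parsed["value"], "unit": parsed["unit"]}

def persist_cem(cem, store):
    # Steps 4-5: hand the record back for serialization (CouchDB or MySQL in SHARPn).
    store.append(json.dumps(cem))

store = []
persist_cem(normalize_to_cem(parse_message("GLU|95|mg/dL")), store)
print(store[0])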
evaluate the output. Therefore, a corpus representative of the EHR from two SHARPn institutions (Mayo Clinic and Seattle Group Health) was sampled and de-identified following Health Insurance Portability and Accountability Act guidelines. The de-identification process was a combination of automatic output from the MITRE Identification Scrubber Tool (MIST)12 and manual review. The corpus size is 500 K words that we intend to share with the research community under data use agreements with
Table 1 Template generalizations of the Clinical Element Model used for the normalization of clinical narrative

Common attributes (shared by all templates): AssociatedCode, Conditional, Generic, Negation_indicator, Subject, Uncertainty_indicator

Template-specific attributes:
Medication: Change_status, Dosage, Duration, End_date, Form, Frequency, Relative_temporal_context, Route, Start_date, Strength
Procedure: Body_laterality, Body_location, Body_side, Device, Duration, End_date, Method, Relative_temporal_context, Start_date
Sign/symptom: Alleviating_factor, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Severity, Start_date
Laboratory results: Abnormal_interpretation, Delta_flag, End_date, Lab_value, Ordinal_interpretation, Reference_range_narrative, Start_date
Disease/disorder: Alleviating_factor, Associated_sign_symptom, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Start_date
Anatomic site: Body_laterality, Body_side
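To illustrate how a template combines its specific attributes with the common ones, a normalized mention such as "no chest pain" could populate a Sign/symptom instance roughly as follows (a hypothetical record for illustration, not actual SHARPn output):

# Hypothetical Sign/symptom instance for the mention "no chest pain";
# attribute names follow table 1, values are illustrative only.
sign_symptom = {
    "template": "Sign/symptom",
    "associated_code": {"system": "SNOMED-CT", "code": "29857009"},  # chest pain
    # Template-specific attributes
    "body_location": "chest",
    "severity": None,
    # Common attributes shared by all templates
    "negation_indicator": True,    # the symptom is asserted absent
    "uncertainty_indicator": False,
    "subject": "patient",
}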
Data storage/retrieval
The SHARPn teams have developed two different technology solutions for the storage and retrieval of normalized data. The first is an open-source SQL-based solution. This solution enables each individual CEM instance record to be stored in a standardized SQL data model. Data can be queried directly from the SQL database tables and extracted as individual fields or as complete XML records. The second is a document storage solution that leverages the open-source CouchDB22 repository to store the CEM XML data as individual JavaScript Object Notation (JSON) documents, providing a more document-centric view into the data. Both solutions leverage toolsets built around the CEM data models for a variety of clinical data, including NotedDrugs, Labs, Administrative Diagnosis, and other clinically relevant data.
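As a sketch of the document-centric option, a normalized CEM record can be written to CouchDB as a JSON document through its standard HTTP API; the database name cem_store, the document ID, and the record fields below are assumptions for illustration.

# Write one normalized CEM record to CouchDB over its HTTP document API.
# Assumes a local CouchDB at the default port; database name, document ID,
# and fields are illustrative only.
import json
import urllib.request

doc = {"cem": "SecondaryUseNotedDrug",
       "code": "197361",  # illustrative RxNorm code, not verified output
       "codeSystem": "RxNorm"}
req = urllib.request.Request(
    "http://localhost:5984/cem_store/pt123-med-001",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",  # CouchDB creates or updates a document with PUT
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 201 Created on success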
RESULTS
Medication CEMs and terminology mappings
As discussed above, for testing our data normalization pipeline, 1000 ambulatory medication order instances were evaluated for the correctness of structural and semantic mapping from HL7 V.2.3 Medication Order messages to CEMs. The medication data compared included ingredient, strength, frequency, route, dose, and dose unit. Specifically, 266 distinct First Data Bank GCN sequence codes (GCN-SEQNO) were used in the test
MU quality measures
To demonstrate the applicability of our phenotyping infrastructure, we experimented with one of the MU phase 1 eMeasures: NQF 0064 (Diabetes: Low Density Lipoprotein (LDL) Management and Control35; figure 5). NQF 0064 determines the percentage of patients between 18 and 75 years of age with diabetes whose most recent LDL-cholesterol (LDL-C) test result during the measurement year was <100 mg/dL. Thus, the denominator criterion identifies patients between 18 and 75 years of age who had a diagnosis of diabetes (type 1 or type 2), and the numerator criterion identifies patients with an LDL-C measurement of <100 mg/dL. The patient is not numerator-compliant if the result for the most recent LDL-C test during the measurement period is ≥100 mg/dL, or is missing, or if an LDL-C test was not performed during the measurement time period. NQF 0064 also excludes patients with diagnoses of polycystic ovaries, gestational diabetes, or steroid-induced diabetes.
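The measure logic is straightforward once the data are normalized. The sketch below captures the denominator and numerator criteria in Python for illustration; the SHARPn implementation expresses this logic as JBoss Drools rules over the CEM database, and exclusion handling is elided here.

# Minimal sketch of the NQF 0064 logic; SHARPn expresses this as Drools
# rules over CEM data, not Python. Exclusions (polycystic ovaries,
# gestational or steroid-induced diabetes) are elided for brevity.

def in_denominator(patient):
    # Patients aged 18-75 with a diagnosis of type 1 or type 2 diabetes.
    return 18 <= patient["age"] <= 75 and patient["has_diabetes"]

def in_numerator(patient):
    # Most recent LDL-C result in the measurement period is <100 mg/dL;
    # a missing or never-performed test means not numerator-compliant.
    ldl_results = patient["ldl_results"]  # [(yyyymmdd, value_mg_dl), ...]
    if not ldl_results:
        return False
    most_recent_value = max(ldl_results)[1]
    return most_recent_value < 100

patients = [{"age": 54, "has_diabetes": True, "ldl_results": [(20120301, 92.0)]}]
eligible = [p for p in patients if in_denominator(p)]
rate = sum(in_numerator(p) for p in eligible) / len(eligible)
print(f"NQF 0064 performance rate: {rate:.0%}")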
As the goal is to execute such an algorithm in a semiautomatic fashion, we first invoked the SHARPn data normalization and NLP pipelines in UIMA to populate the CEM database. Specifically, we extracted demographics, diagnoses, medications, clinical procedures, and laboratory measurements for 273 patients from Mayo's EHR systems (table 2) and normalized the data to HL7 value sets, International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), RxNorm, Current Procedural Terminology (CPT)-4, and Logical Observation Identifiers Names and Codes (LOINC) codes, respectively. The NLP pipeline was invoked in particular to process the medication data, as demonstrated in prior work.36
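This category-to-terminology routing can be pictured as a simple table (illustrative only; the pipeline resolves codes through its terminology services rather than a static map):

# Illustrative routing of extracted data categories to the standard
# terminologies named in the text.
TARGET_TERMINOLOGY = {
    "demographics": "HL7 value sets",
    "diagnoses": "ICD-9-CM",
    "medications": "RxNorm",
    "procedures": "CPT-4",
    "laboratory": "LOINC",
}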
For translating the QDM criteria into Drools rules, a key component comprised mapping from QDM data elements to CEMs. This enables mapping and extraction from disparate EHRs into this normalized representation of the source data.
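For example, such bindings can be thought of as entries pairing a QDM data element with the CEM that supplies it; the pairs below are illustrative, drawing CEM names from table 3, and are not the complete SHARPn mapping.

# Illustrative QDM data element to CEM bindings consulted when translating
# QDM criteria into Drools rules; example pairs, not the full mapping.
QDM_TO_CEM = {
    "Patient characteristic: birth date": "Secondary Use Patient",
    "Diagnosis active: diabetes": "Administrative Diagnosis",
    "Laboratory test result: LDL-C": "LabObs Quantitative",
    "Medication active": "Secondary Use Noted Drug",
}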
Additionally, CEMs contain attributes of provenance, such as date/time, originating clinician, clinic, and entry application, and other model-specific attributes, such as the order status of a medication. These care process-related attributes are required in
DISCUSSION

[Tables 2 and 3 survive only as fragments and are not reproduced here. Table 2 reported counts and percentages of the 273-patient cohort by demographic category; table 3 mapped the NQF 0064 QDM criteria to CEMs and CEM qualifiers with their value set terminologies (ICD-9-CM, SNOMED-CT, CPT, LOINC, RxNorm, CDCREC). CDCREC, US Centers for Disease Control and Prevention Race and Ethnicity Code Set; CEM, Clinical Element Model; CPT, Current Procedural Terminology; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; LDL, low-density lipoprotein; LOINC, Logical Observation Identifiers Names and Codes; ONC, Office of the National Coordinator; QDM, Quality Data Model.]
CONCLUSION
End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and ontologies, as well as robust information
models for storing, discovering, and processing that information. The main objective of SHARPn is to develop methods and
modular open-source resources for enabling secondary use of
EHR data through normalization to standards-based, comparable, and consistent formats for high-throughput phenotyping.
In this article, we present our current research progress and
plans for future work.
Correction notice This paper has been corrected since it was published Online First. The middle initial of the fifth author has been corrected.
Contributors JP and CGC designed the study and wrote the manuscript. All other authors checked the individual sections in the manuscript and discussion. All authors contributed to and approved the final manuscript.
Funding This research and the manuscript were made possible by funding from the Strategic Health IT Advanced Research Projects (SHARP) Program (90TR002) administered by the Office of the National Coordinator for Health Information Technology.
Competing interests None.
Patient consent Obtained.
Ethics approval This study was approved by the Mayo Clinic Institutional Review
Board (IRB#: 12-003424) as a minimal risk protocol.
Provenance and peer review Not commissioned; externally peer reviewed.
REFERENCES
[Entries 1-24 were not recovered in extraction.]
25 Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012;19:181-5.
26 National Quality Forum (NQF) Quality Data Model (QDM). http://www.qualityforum.org/Projects/h/QDS_Model/Quality_Data_Model.aspx (accessed 23 May 2012).
27 HL7 Health Quality Measure Format (HQMF). http://code.google.com/p/hqmf/ (accessed 23 May 2012).
28 Centers for Medicare and Medicaid Services: 2014 Clinical Quality Measures. http://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/2014_ClinicalQualityMeasures.html (accessed 25 Dec 2012).
29 Fu PC Jr, Rosenthal D, Pevnick JM, et al. The impact of emerging standards adoption on automated quality reporting. J Biomed Inform 2012;45:772-81.
30 JBoss Drools Business Logic Integration Platform. http://www.jboss.org/drools (accessed 23 May 2012).
31 Li D, Shrestha G, Murthy S, et al. Modeling and executing electronic health records driven phenotyping algorithms using the NQF Quality Data Model and JBoss Drools engine. American Medical Informatics Association (AMIA) Annual Symposium; 2012.
32 Hammond WE, Bailey C, Boucher P, et al. Connecting information to improve health. Health Aff (Millwood) 2010;29:284-8.
33 Bell DS, O'Neill SM, Reynolds KA, et al. Evaluation of RxNorm in ambulatory electronic prescribing. Santa Monica, California: RAND Health, 2011.
34 O'Neill SM, Bell DS. Evaluation of RxNorm for representing ambulatory prescriptions. AMIA Annu Symp Proc 2010;2010:562-6.
35 NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control. http://ushik.ahrq.gov/ViewItemDetails?system=mu&itemKey=122574000 (accessed 18 Dec 2012).
36 Pathak J, Murphy SN, Willaert B, et al. Using RxNorm and NDF-RT to classify medication data extracted from electronic health records: experiences from the Rochester Epidemiology Project. American Medical Informatics Association (AMIA) Annual Symposium; 2011.
37 Oracle Health Sciences Cohort Explorer. http://docs.oracle.com/cd/E24441_01/doc.10/e25021/toc.htm (accessed 6 Sep 2013).
38 Project popHealth. http://projectpophealth.org/ (accessed 13 Mar 2012).
39 Ghazvinian A, Noy N, Jonquet C, et al. What four million mappings can tell you about two hundred ontologies. 8th International Semantic Web Conference; 2009.
40 Ghazvinian A, Noy N, Jonquet C, et al. Creating mappings for ontologies in biomedicine: simple methods work. American Medical Informatics Association (AMIA) Annual Symposium; 2009.
41 Wang Y, Patrick J, Miller G, et al. Linguistic mapping of terminologies to SNOMED CT. Semantic Mining Conference on SNOMED CT; 2006.
42 Choi N, Song I, Han H. A survey on ontology mapping. ACM SIGMOD Record 2006;35:34-41.
43 HL7 Consolidated Clinical Document Architecture (CCDA). http://www.hl7.org/implement/standards/product_brief.cfm?product_id=258 (accessed 11 Dec 2012).