Using Association Rule Mining For Phenotype Extraction From EHRs

Dingcheng Li, PhD (1)
Gyorgy Simon, PhD (2)
Christopher G. Chute, MD, DrPH (1)
Jyotishman Pathak, PhD (1)

(1) Mayo Clinic, Rochester
(2) University of Minnesota, Twin Cities
2013 AMIA Clinical Research Informatics Summit
High-Throughput Phenotyping from EHRs
Outline
Clinical phenotyping from electronic health records (EHRs)
Machine learning techniques for phenotyping
Association Rule Mining and T2DM
Results
Discussion
EHR-driven Phenotyping: The Process
[Figure: process diagram. Labels: Data; Transform (twice); Phenotype Algorithm; Visualization; Evaluation; NLP, SQL; Rules; Mappings. Source: eMERGE Network]
Example: Hypothyroidism Algorithm
[Figure: algorithm diagram with inputs labeled Drugs, Labs, Diagnosis, NLP, and Procedures. Conway et al. AMIA 2011: 274-83]
http://gwas.org

[eMERGE Network]
Genotype-Phenotype Association Results
[Figure: forest plot of observed vs. published odds ratios (axis ticks 0.5, 1.0, 2.0, 5.0), listed by disease, gene / region, and marker. Ritchie et al. AJHG 2010; 86(4):560-72]
Diseases: Atrial fibrillation; Crohn's disease; Multiple sclerosis; Rheumatoid arthritis; Type 2 diabetes
Markers (gene / region): rs2200733 (Chr. 4q25); rs10033464 (Chr. 4q25); rs11805303 (IL23R); rs17234657 (Chr. 5); rs1000113 (Chr. 5); rs17221417 (NOD2); rs2542151 (PTPN22); rs3135388 (DRB1*1501); rs2104286 (IL2RA); rs6897932 (IL7RA); rs6457617 (Chr. 6); rs6679677 (RSBN1); rs2476601 (PTPN22); rs4506565 (TCF7L2); rs12255372 (TCF7L2); rs12243326 (TCF7L2); rs10811661 (CDKN2B); rs8050136 (FTO); rs5219 (KCNJ11); rs5215 (KCNJ11); rs4402960 (IGF2BP2)
EHR-driven Phenotyping: The Process
[Figure: the same process diagram (Data; Transform; Phenotype Algorithm; Visualization; Evaluation; NLP, SQL; Rules; Mappings), annotated "Time consuming!". Source: eMERGE Network]
Our research agenda
Develop effective machine learning methods for automatic phenotype extraction, to reduce the workload of manually developing phenotyping algorithms
Explore effective ways to extract features from EHR data and generate highly predictive models
Study phenotype extraction methods from EHRs to facilitate population-based studies for clinical and translational research
Common Modeling Approaches
Logistic regression / survival analysis
    No ability to discover interactions
Decision trees / random forest / gradient-boosted trees
    Greedy approach to discovering interactions
Association Rule Mining (ARM)
    Specifically designed to discover interactions
Association rule mining
Proposed by Agrawal et al., VLDB 1994
An important data mining model, studied extensively by the database and data mining community
Assumes all data are categorical
No good algorithm for numeric data
Initially used for market basket analysis, to find how items purchased by customers are related
The model: data
$I = \{i_1, i_2, \ldots, i_m\}$: a set of items.
Transaction $t$: $t$ is a set of items, and $t \subseteq I$.
Transaction database $T$: a set of transactions, $T = \{t_1, t_2, \ldots, t_n\}$.
The model: data
Market basket transactions:
$t_1$: {bread, cheese, milk}
$t_2$: {apple, eggs, salt, yogurt}
...
$t_n$: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: items purchased in a basket; it may have a TID (transaction ID)
A transactional dataset: a set of transactions
The model: rules
A transaction $t$ contains $X$, a set of items (itemset) in $I$, if $X \subseteq t$.
An association rule is an implication of the form $X \rightarrow Y$, where $X, Y \subset I$ and $X \cap Y = \emptyset$.
An itemset is a set of items.
E.g., X = {milk, bread, cereal} is an itemset.
A k-itemset is an itemset with k items.
E.g., {milk, bread, cereal} is a 3-itemset.
Rule strength measures
Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain $X \cup Y$.
$\mathit{sup} = \Pr(X \cup Y)$
Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
$\mathit{conf} = \Pr(Y \mid X)$
An association rule is a pattern stating that when X occurs, Y occurs with a certain probability.
An example
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
...
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
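The support and confidence figures in this example can be checked with a few lines of Python. This is a minimal sketch: the helper names `support` and `confidence` are illustrative and not part of the talk, and exact fractions are used so the result reads as 3/7 rather than a rounded decimal.

```python
from fractions import Fraction

# The seven example transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y, transactions):
    """Pr(Y | X): support of X union Y divided by support of X."""
    return support(x | y, transactions) / support(x, transactions)

# Rule: Clothes -> Milk, Chicken
rule_sup = support({"Clothes", "Milk", "Chicken"}, transactions)
rule_conf = confidence({"Clothes"}, {"Milk", "Chicken"}, transactions)
print(rule_sup, rule_conf)  # 3/7 1
```

Both rules listed for the itemset clear the thresholds minsup = 30% and minconf = 80%, since 3/7 is about 43% and the confidence is 100%.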
Distributional Association Rule Mining
Distributional association rules associate an itemset with a continuous outcome.
[Simon et al. KDD 2011; 823-831]
Based on the Apriori algorithm (Agrawal, VLDB 1994)
Algorithm Apriori(T)
    C_1 ← init-pass(T);
    F_1 ← {f | f ∈ C_1, f.count/n ≥ minsup};    // n: no. of transactions in T
    for (k = 2; F_{k-1} ≠ ∅; k++) do
        C_k ← candidate-gen(F_{k-1});
        for each transaction t ∈ T do
            for each candidate c ∈ C_k do
                if c is contained in t then
                    c.count++;
            end
        end
        F_k ← {c ∈ C_k | c.count/n ≥ minsup}
    end
    return F ← ∪_k F_k;
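The pseudocode above can be turned into a short runnable sketch. This is an illustrative, unoptimized Python version, not the implementation used in the talk: the function name `apriori` and the join-and-prune form of candidate generation are assumptions, and the data are the seven market-basket transactions from the earlier example with minsup = 30%.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count/n >= minsup."""
    n = len(transactions)
    # Init pass: count single items (C_1), then keep the frequent ones (F_1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
frequent_itemsets = apriori(transactions, minsup=0.3)
print(frequent_itemsets[frozenset({"Chicken", "Clothes", "Milk"})])  # 3
```

The pruning step relies on the downward-closure property: an itemset can only be frequent if all of its subsets are frequent, which is what keeps the candidate sets small on sparse data.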
Use Case: Type 2 Diabetes
[Excerpt shown on slide: Mayo Genome Consortia. Mayo Clin Proc. July 2011;86(7):606-614. doi:10.4065/mcp.2011.0178. Page 607]
genetic research within EMR systems.[1,2] Successful use of this approach in the eMERGE Network has inspired the creation of the intramural Mayo Genome Consortia (MayoGC). The goal of the MayoGC is to assemble a large cohort of participants from research studies across Mayo Clinic with high-throughput genetic data and to use EMR for phenotype extraction for cost-effective genetic research.

Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. As a test of the genetic research capability of the MayoGC, we conducted a GWA study to identify genetic variants associated with total bilirubin levels. Bilirubin levels have a large variability in the population, with heritability of roughly 0.50.[3] Two previous GWA studies identified variants from similar genomic locations with strong and moderate effects on bilirubin levels,[4,5] making this phenotype an ideal candidate for testing. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.
PARTICIPANTS AND METHODS
MayoGC is a large cohort of Mayo Clinic patients with EMR and genotype data. Eligible participants include those who gave general research (ie, not disease-specific) consent in the contributing studies to share high-throughput genotyping data with other investigators. This cohort is being built in 2 phases. Phase 1, which has been completed, includes participants from 3 studies funded by the National Institutes of Health, which sought to identify genetic determinants of peripheral arterial disease (PAD), venous thromboembolism, and pancreatic cancer, respectively, with a combined total sample size of 6307 unique participants (Table 1). The eMERGE study contributed genotype data for 3336 participants with PAD and control participants recruited from Mayo Clinic's noninvasive vascular and exercise stress testing laboratories, respectively.[2] Peripheral arterial disease was defined by documentation of at least 1 of the following: (1) an ankle-brachial index (ABI) of 0.9 or less at rest or 1 minute after exercise, (2) the presence of poorly compressible arteries, or (3) a normal ABI but history of revascularization for PAD. Control participants had a normal ABI and no history of PAD.[2]

The GENEVA (Gene Environment Association Studies) Study of Venous Thromboembolism of the National Human Genome Research Institute enrolled consecutive Mayo Clinic outpatients with objectively diagnosed deep venous thrombosis and/or pulmonary embolism who resided in the upper Midwest and had been referred by a Mayo Clinic physician to the Mayo Clinic Special Coagulation Laboratory or to the Mayo Clinic Thrombophilia Center.[6] A deep venous thrombosis or pulmonary embolism was categorized as objectively diagnosed (1) when it was confirmed by venography or pulmonary angiography or via a pathology examination of a thrombus removed at surgery or (2) if findings on at least 1 noninvasive test (compression duplex ultrasonography, lung scan, computed tomography, magnetic resonance imaging) were positive. Persons with venous thromboembolism related to active cancer were excluded. A control group was prospectively recruited for this study. Control participants were frequency-matched
TABLE 1. MayoGC Phase 1 Studies [a,b]

                                 eMERGE Network (PAD)[2]                GENEVA (VTE)[6]                        PANC[7,8]
Characteristics                  Cases (n=1612)     Controls (n=1585)   Cases (n=1233)     Controls (n=1264)   Controls (n=613)
Age (y), mean ± SD               66.0 ± 10.7        61.0 ± 7.4          55.0 ± 16.2        56.0 ± 15.8         66.0 ± 10.0
Female (%)                       36                 40                  50                 52                  45
Medical record length (y)
  Mean ± SD                      23.4 ± 20.0        26.1 ± 20.3         13.7 ± 16.3        21.1 ± 15.4         30.2 ± 16.5
  Median (range)                 18.7 (1.0-78.6)    23.0 (1.0-79.2)     6.3 (1.0-71.8)     17.8 (1.0-70.2)     29.8 (1.0-75.0)
White (%)                        94                 94                  96                 99                  100
Geographic location, No. (%) [c]
  Olmsted County                 328 (20)           590 (37)            7 (1)              10 (1)              64 (10)
  Southeast Minnesota            191 (12)           62 (4)              205 (17)           378 (30)            107 (17)
  Greater Minnesota              393 (24)           343 (22)            314 (25)           317 (25)            135 (22)
  Iowa                           212 (13)           97 (6)              176 (14)           191 (15)            65 (11)
  South and North Dakota         50 (3)             31 (2)              79 (6)             71 (6)              19 (3)
  Wisconsin                      128 (8)            68 (4)              121 (10)           138 (11)            32 (5)
  Other states or international  309 (19)           394 (25)            330 (27)           159 (13)            191 (31)

[a] eMERGE = Electronic Medical Records and Genomics; GENEVA = Gene Environment Association Studies; MayoGC = Mayo Genome Consortia; PAD = peripheral arterial disease; PANC = Mayo Clinic Molecular Epidemiology of Pancreatic Cancer Study; VTE = venous thromboembolism.
[b] Percentages may not total 100% because of rounding.
[c] Southeast Minnesota includes 7 counties in the southeast corner of Minnesota: Dodge, Goodhue, Wabasha, Winona, Houston, Fillmore, and Mower. Olmsted County, Minnesota, is a mutually exclusive category.
Use Case: Type 2 Diabetes

Find all item sets I of co-morbid conditions such that the distribution of risk R differs significantly between the patient population with I and the population without I
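One way to make "significantly different distribution" concrete is sketched below. The talk does not state which statistical test is used, so this example assumes a simple two-sided permutation test on the difference in mean risk; it is a stand-in, not the method of Simon et al. The function name `permutation_p_value` is hypothetical and the risk values are made up purely for illustration.

```python
import random
from statistics import mean

def permutation_p_value(with_i, without_i, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean risk.

    Assumption: difference in means is the test statistic; a real analysis
    might compare the full distributions instead.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(with_i) - mean(without_i))
    pooled = list(with_i) + list(without_i)
    k = len(with_i)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        # Small tolerance guards against floating-point summation order.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical risk scores for patients with and without some item set.
risk_with = [0.90, 0.80, 0.85, 0.95, 0.70]
risk_without = [0.20, 0.30, 0.10, 0.25, 0.40, 0.15]
p_value = permutation_p_value(risk_with, risk_without)
print(p_value < 0.05)  # True
```

An item set would be reported only when this kind of test rejects the hypothesis that the two populations share the same risk distribution; with many candidate item sets, some correction for multiple testing would also be needed.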
Items and Frequencies (based on AHRQ CCS)
Item     Times   Diagnosis meaning
V48      10      Diabetes mellitus without complication
V82080   10      Hemoglobin, A1c
V86      8       Hypertension with complications and secondary hypertension
V56      6       Deficiency and other anemia
V217     4       Other fractures
V52      3       Gout and other crystal arthropathies
V245     3       Residual codes; unclassified
V246     3       Adjustment disorders
V221     3       Open wounds of head; neck; and trunk
V73      3       Retinal detachments; defects; vascular occlusion; and retinopathy
V244     2       Other screening for suspected conditions (not mental disorders or infectious diseases)
V216     2       Fracture of lower limb
V143     1       Chronic renal failure
V142     1       Acute and unspecified renal failure
Rule Ranking Top 5
Rank   Support   SupportD   Precision   Item Set
1      281       270        0.961       V48 V86 V142 V245 V82080
2      280       269        0.96        V48 V57 V74 V86 V245 V82080
3      274       263        0.95        V48 V52 V57 V74 V244 V246 V82080
4      278       263        0.94        V48 V52 V57 V87 V82080
5      278       263        0.94        V48 V57 V86 V216 V221 V82080
Confusion Matrix
(Rows: actual class; columns: predicted class at each cutoff)

Model    Actual class   Predicted: False  True    False  True    False  True
ARM      Cutoff                    0.93            0.92            0.95
         N                         801    6        752    55       736    71
         Y                         429    54       51     432      17     466
D-Tree   Cutoff                    0.88            0.75            0.70
         N                         766    42       754    54       747    60
         Y                         393    393      53     429      36     447
LR       Cutoff                    0.95            0.7             0.6
         N                         772    35       755    48       752    52
         Y                         149    335      98     384      88     395
SVM      Cutoff                    0.7             0.6             0.55
         N                         768    40       758    50       751    57
         Y                         104    378      68     414      55     424
Measure Metrics for All Models
Model    Cutoff   Precision   Recall   F-score
ARM      0.95     0.868       0.966    0.914
         0.92     0.887       0.894    0.895
         0.93     0.9         0.112    0.199
D-Tree   0.88     0.903       0.812    0.855
         0.75     0.888       0.889    0.889
         0.70     0.881       0.925    0.902
LR       0.95     0.904       0.693    0.785
         0.7      0.889       0.796    0.840
         0.6      0.883       0.819    0.849
SVM      0.7      0.901       0.784    0.839
         0.6      0.893       0.858    0.875
         0.55     0.881       0.878    0.879
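The metrics follow from the confusion matrix by the standard definitions of precision, recall, and F-score. The sketch below works through one row, ARM at cutoff 0.93, assuming the confusion matrix is read with actual class in rows and predicted class in columns (so TP = 54, FP = 6, FN = 429). The function name `precision_recall_f` is illustrative.

```python
def precision_recall_f(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# ARM at cutoff 0.93:
#   actual N -> predicted False 801, predicted True 6
#   actual Y -> predicted False 429, predicted True 54
p, r, f = precision_recall_f(tp=54, fp=6, fn=429)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.112 0.199
```

At this cutoff nearly every positive prediction is correct, but only about 11% of the actual positives are found, which is why the F-score collapses compared with the other two ARM cutoffs.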
ROC for ARM and SVM
[Figure: ROC curves for the ARM and SVM models]
Discussion
The space of all association rules is exponential, $O(2^m)$, where m is the number of items in I.
The mining exploits the sparseness of the data, together with high minimum support and high minimum confidence values.
Still, it can produce a huge number of rules: thousands, tens of thousands, millions, ...
Discussion
A machine learning framework for semi-automatic phenotype extraction from EHRs
Initial results on DM classification with ARM seem encouraging: scalable, robust and efficient
Item sets and association rules are human-interpretable
Next steps will explore more complex phenotypes and incorporate additional items (e.g., medications, procedures)
Acknowledgment
Material adapted from Agrawal and Liu
Mayo Clinic SHARP project on Secondary Use of EHR data (90TR002)
Mayo Clinic eMERGE project (HG006379)
Mayo Clinic Career Development Award (FP00058504)

Thank You!

Pathak.Jyotishman@mayo.edu
