Health Records

Dingcheng Li, PhD 1; Gyorgy Simon, PhD 2; Christopher G. Chute, MD, DrPH 1; Jyotishman Pathak, PhD 1
1 Mayo Clinic, Rochester; 2 University of Minnesota, Twin Cities
2013 AMIA Clinical Research Informatics Summit

High-Throughput Phenotyping from EHRs

Outline
  Clinical phenotyping from electronic health records (EHRs)
  Machine learning techniques for phenotyping
  Association Rule Mining and T2DM
  Results
  Discussion
2013 MFMER | slide-2

EHR-driven Phenotyping: The Process
  [Figure: process diagram with the elements Data, Transform, Phenotype Algorithm, Transform, Evaluation, and Visualization; labels: NLP, SQL, Rules, Mappings]
  [eMERGE Network]

Example: Hypothyroidism Algorithm
  [Figure: algorithm flowchart; labeled inputs: Drugs, Labs, Diagnosis, NLP, Procedures]
  [Conway et al. AMIA 2011: 274-83]

http://gwas.org
[eMERGE Network]

Genotype-Phenotype Association Results
  [Figure: odds ratios (axis from 0.5 to 5.0), observed vs. published, by disease, gene/region, and marker]
  Diseases: Atrial fibrillation, Crohn's disease, Multiple sclerosis, Rheumatoid arthritis, Type 2 diabetes
  Markers (gene/region): rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25), rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22), rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA), rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22), rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2)
  [Ritchie et al. AJHG 2010; 86(4):560-72]

EHR-driven Phenotyping: The Process
  [Figure: the process diagram again (Data, Transform, Phenotype Algorithm, Transform, Evaluation, Visualization; NLP, SQL, Rules, Mappings), annotated "Time consuming!"]
  [eMERGE Network]

Our research agenda
  Develop effective machine learning methods for automatic phenotype extraction, to reduce the manual workload of developing phenotyping algorithms
  Explore effective ways to extract features from EHR data and generate highly predictive models
  Study methods for phenotype extraction from EHRs to facilitate population-based studies for clinical and translational research

Common Modeling Approaches
  Logistic regression / Survival analysis
    No ability to discover interactions
  Decision trees / Random Forest / Gradient-boosted trees
    Greedy approach to discovering interactions
  Association Rule Mining (ARM)
    Specifically designed to discover interactions

Association rule mining
  Proposed by Agrawal et al., VLDB 1994
  An important data mining model, studied extensively by the database and data mining communities
  Assumes all data are categorical
  No good algorithm for numeric data
  Initially used for market basket analysis, to find how items purchased by customers are related

The model: data
  $I = \{i_1, i_2, \ldots, i_m\}$: a set of items.
  Transaction $t$: a set of items, with $t \subseteq I$.
  Transaction database $T$: a set of transactions, $T = \{t_1, t_2, \ldots, t_n\}$.

The model: data
  Market basket transactions:
    $t_1$: {bread, cheese, milk}
    $t_2$: {apple, eggs, salt, yogurt}
    ...
    $t_n$: {biscuit, eggs, milk}
  Concepts:
    An item: an item/article in a basket
    $I$: the set of all items sold in the store
    A transaction: the items purchased in a basket; it may have a TID (transaction ID)
    A transactional dataset: a set of transactions

The model: rules
  A transaction $t$ contains $X$, a set of items (itemset) in $I$, if $X \subseteq t$.
  An association rule is an implication of the form $X \rightarrow Y$, where $X, Y \subset I$ and $X \cap Y = \emptyset$.
  An itemset is a set of items. E.g., $X$ = {milk, bread, cereal} is an itemset.
  A k-itemset is an itemset with k items. E.g., {milk, bread, cereal} is a 3-itemset.

Rule strength measures
  Support: the rule holds with support sup in $T$ (the transaction data set) if sup% of transactions contain $X \cup Y$. $\mathrm{sup} = \Pr(X \cup Y)$.
  Confidence: the rule holds in $T$ with confidence conf if conf% of transactions that contain $X$ also contain $Y$. $\mathrm{conf} = \Pr(Y \mid X)$.
  An association rule is a pattern stating that when $X$ occurs, $Y$ occurs with a certain probability.

An example
  Transaction data: t1 to t7, listed after the rules
  Assume: minsup = 30%, minconf = 80%
  An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
  Association rules from the itemset:
    Clothes $\rightarrow$ Milk, Chicken [sup = 3/7, conf = 3/3]
    Clothes, Chicken $\rightarrow$ Milk [sup = 3/7, conf = 3/3]
  Transactions:
    t1: Beef, Chicken, Milk
    t2: Beef, Cheese
    t3: Cheese, Boots
    t4: Beef, Chicken, Cheese
    t5: Beef, Chicken, Clothes, Cheese, Milk
    t6: Chicken, Clothes, Milk
    t7: Chicken, Milk, Clothes

Distributional Association Rule Mining
  Distributional association rules associate an itemset with a continuous outcome.
  [Simon et al. KDD 2011; 823-831]

Based on Apriori Algorithm (Agrawal, VLDB 1994)
  Algorithm Apriori($T$)
    $C_1 \leftarrow$ init-pass($T$);
    $F_1 \leftarrow \{f \mid f \in C_1,\ f.\mathrm{count}/n \geq \mathit{minsup}\}$;   // n: no. of transactions in T
    for ($k = 2$; $F_{k-1} \neq \emptyset$; $k$++) do
      $C_k \leftarrow$ candidate-gen($F_{k-1}$);
      for each transaction $t \in T$ do
        for each candidate $c \in C_k$ do
          if $c$ is contained in $t$ then
            $c$.count++;
        end
      end
      $F_k \leftarrow \{c \in C_k \mid c.\mathrm{count}/n \geq \mathit{minsup}\}$
    end
    return $F \leftarrow \bigcup_k F_k$;

Use Case: Type 2 Diabetes
  [Slide image: page 607 of Mayo Clin Proc. July 2011;86(7):606-614, doi:10.4065/mcp.2011.0178, running head "Mayo Genome Consortia". The text of the page follows.]

... genetic research within EMR systems. [1,2] Successful use of this approach in the eMERGE Network has inspired the creation of the intramural Mayo Genome Consortia (MayoGC). The goal of the MayoGC is to assemble a large cohort of participants from research studies across Mayo Clinic with high-throughput genetic data and to use EMR for phenotype extraction for cost-effective genetic research.

Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. As a test of the genetic research capability of the MayoGC, we conducted a GWA study to identify genetic variants associated with total bilirubin levels. Bilirubin levels have a large variability in the population, with heritability of roughly 0.50.
[3] Two previous GWA studies identified variants from similar genomic locations with strong and moderate effects on bilirubin levels, [4,5] making this phenotype an ideal candidate for testing. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.

PARTICIPANTS AND METHODS

MayoGC is a large cohort of Mayo Clinic patients with EMR and genotype data. Eligible participants include those who gave general research (ie, not disease-specific) consent in the contributing studies to share high-throughput genotyping data with other investigators. This cohort is being built in 2 phases. Phase 1, which has been completed, includes participants from 3 studies funded by the National Institutes of Health, which sought to identify genetic determinants of peripheral arterial disease (PAD), venous thromboembolism, and pancreatic cancer, respectively, with a combined total sample size of 6307 unique participants (Table 1).

The eMERGE study contributed genotype data for 3336 participants with PAD and control participants recruited from Mayo Clinic's noninvasive vascular and exercise stress testing laboratories, respectively. [2] Peripheral arterial disease was defined by documentation of at least 1 of the following: (1) an ankle-brachial index (ABI) of 0.9 or less at rest or 1 minute after exercise, (2) the presence of poorly compressible arteries, or (3) a normal ABI but history of revascularization for PAD. Control participants had a normal ABI and no history of PAD. [2]

The GENEVA (Gene Environment Association Studies) Study of Venous Thromboembolism of the National Human Genome Research Institute enrolled consecutive Mayo Clinic outpatients with objectively diagnosed deep venous thrombosis and/or pulmonary embolism who resided in the upper Midwest and had been referred by a Mayo Clinic physician to the Mayo Clinic Special Coagulation Laboratory or to the Mayo Clinic Thrombophilia Center. [6]
A deep venous thrombosis or pulmonary embolism was categorized as objectively diagnosed (1) when it was confirmed by venography or pulmonary angiography or via a pathology examination of a thrombus removed at surgery or (2) if findings on at least 1 noninvasive test (compression duplex ultrasonography, lung scan, computed tomography, magnetic resonance imaging) were positive. Persons with venous thromboembolism related to active cancer were excluded. A control group was prospectively recruited for this study. Control participants were frequency-matched

TABLE 1. MayoGC Phase 1 Studies (a, b)

  Characteristics                | eMERGE Network (PAD) [2]: Cases (n=1612) | eMERGE Network (PAD) [2]: Controls (n=1585) | GENEVA (VTE) [6]: Cases (n=1233) | GENEVA (VTE) [6]: Controls (n=1264) | PANC [7,8]: Controls (n=613)
  Age (y), mean ± SD             | 66.0 ± 10.7     | 61.0 ± 7.4      | 55.0 ± 16.2    | 56.0 ± 15.8     | 66.0 ± 10.0
  Female (%)                     | 36              | 40              | 50             | 52              | 45
  Medical record length (y)
    Mean ± SD                    | 23.4 ± 20.0     | 26.1 ± 20.3     | 13.7 ± 16.3    | 21.1 ± 15.4     | 30.2 ± 16.5
    Median (range)               | 18.7 (1.0-78.6) | 23.0 (1.0-79.2) | 6.3 (1.0-71.8) | 17.8 (1.0-70.2) | 29.8 (1.0-75.0)
  White (%)                      | 94              | 94              | 96             | 99              | 100
  Geographic location, No. (%) (c)
    Olmsted County               | 328 (20)        | 590 (37)        | 7 (1)          | 10 (1)          | 64 (10)
    Southeast Minnesota          | 191 (12)        | 62 (4)          | 205 (17)       | 378 (30)        | 107 (17)
    Greater Minnesota            | 393 (24)        | 343 (22)        | 314 (25)       | 317 (25)        | 135 (22)
    Iowa                         | 212 (13)        | 97 (6)          | 176 (14)       | 191 (15)        | 65 (11)
    South and North Dakota       | 50 (3)          | 31 (2)          | 79 (6)         | 71 (6)          | 19 (3)
    Wisconsin                    | 128 (8)         | 68 (4)          | 121 (10)       | 138 (11)        | 32 (5)
    Other states or international | 309 (19)       | 394 (25)        | 330 (27)       | 159 (13)        | 191 (31)

  (a) eMERGE = Electronic Medical Records and Genomics; GENEVA = Gene Environment Association Studies; MayoGC = Mayo Genome Consortia; PAD = peripheral arterial disease; PANC = Mayo Clinic Molecular Epidemiology of Pancreatic Cancer Study; VTE = venous thromboembolism.
  (b) Percentages may not total 100% because of rounding.
  (c) Southeast Minnesota includes 7 counties in the southeast corner of Minnesota: Dodge, Goodhue, Wabasha, Winona, Houston, Fillmore, and Mower. Olmsted County, Minnesota, is a mutually exclusive category.

Use Case: Type 2 Diabetes
  Find all itemsets $I$ of comorbid conditions such that the distribution of risk $R$ differs significantly between the patient populations with and without $I$.

Items and Frequencies (based on AHRQ CCS)
  Item   | Times | Diagnosis meaning
  V48    | 10    | Diabetes mellitus without complication
  V82080 | 10    | Hemoglobin, A1c
  V86    | 8     | Hypertension with complications and secondary hypertension
  V56    | 6     | Deficiency and other anemia
  V217   | 4     | Other fractures
  V52    | 3     | Gout and other crystal arthropathies
  V245   | 3     | Residual codes; unclassified
  V246   | 3     | Adjustment disorders
  V221   | 3     | Open wounds of head; neck; and trunk
  V73    | 3     | Retinal detachments; defects; vascular occlusion; and retinopathy
  V244   | 2     | Other screening for suspected conditions (not mental disorders or infectious diseases)
  V216   | 2     | Fracture of lower limb
  V143   | 1     | Chronic renal failure
  V142   | 1     | Acute and unspecified renal failure

Rule Ranking Top 5
  Rank | Support | SupportD | Precision | Item Set
  1    | 281     | 270      | 0.961     | V48 V86 V142 V245 V82080
  2    | 280     | 269      | 0.96      | V48 V57 V74 V86 V245 V82080
  3    | 274     | 263      | 0.95      | V48 V52 V57 V74 V244 V246 V82080
  4    | 278     | 263      | 0.94      | V48 V52 V57 V87 V82080
  5    | 278     | 263      | 0.94      | V48 V57 V86 V216 V221 V82080

Confusion Matrix
  Model  | Cutoff | Actual N: predicted False | Actual N: predicted True | Actual Y: predicted False | Actual Y: predicted True
  ARM    | 0.93   | 801 | 6  | 429 | 54
  ARM    | 0.92   | 752 | 55 | 51  | 432
  ARM    | 0.95   | 736 | 71 | 17  | 466
  D-Tree | 0.88   | 766 | 42 | 393 | 393
  D-Tree | 0.75   | 754 | 54 | 53  | 429
  D-Tree | 0.70   | 747 | 60 | 36  | 447
  LR     | 0.95   | 772 | 35 | 149 | 335
  LR     | 0.7    | 755 | 48 | 98  | 384
  LR     | 0.6    | 752 | 52 | 88  | 395
  SVM    | 0.7    | 768 | 40 | 104 | 378
  SVM    | 0.6    | 758 | 50 | 68  | 414
  SVM    | 0.55   | 751 | 57 | 55  | 424

Evaluation Metrics for All Models
  Model  | Cutoff | Precision | Recall | F-score
  ARM    | 0.95   | 0.868     | 0.966  | 0.914
  ARM    | 0.92   | 0.887     | 0.894  | 0.895
  ARM    | 0.93   | 0.9       | 0.112  | 0.199
  D-Tree | 0.88   | 0.903     | 0.812  | 0.855
  D-Tree | 0.75   | 0.888     | 0.889  | 0.889
  D-Tree | 0.70   | 0.881     | 0.925  | 0.902
  LR     | 0.95   | 0.904     | 0.693  | 0.785
  LR     | 0.7    | 0.889     | 0.796  | 0.840
  LR     | 0.6    | 0.883     | 0.819  | 0.849
  SVM    | 0.7    | 0.901     | 0.784  | 0.839
  SVM    | 0.6    | 0.893     | 0.858  | 0.875
  SVM    | 0.55   | 0.881     | 0.878  | 0.879

ROC for ARM and SVM
  [Figure: ROC curves for ARM and SVM]

Discussion
  The space of all association rules is exponential, $O(2^m)$, where $m$ is the number of items in $I$.
  The mining exploits the sparseness of the data, together with high minimum support and high minimum confidence values.
  Still, it typically produces a huge number of rules: thousands, tens of thousands, millions, ...

Discussion
  A machine learning framework for semi-automatic phenotype extraction from EHRs
  Initial results on DM classification with ARM seem encouraging: scalable, robust, and efficient
  Itemsets and association rules are human interpretable
  Next steps will explore more complex phenotypes and incorporate additional items (e.g., medications, procedures)

Acknowledgment
  Material adapted from Agrawal and Liu
  Mayo Clinic SHARP project on Secondary Use of EHR data (90TR002)
  Mayo Clinic eMERGE project (HG006379)
  Mayo Clinic Career Development Award (FP00058504)
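
The Apriori pseudocode and the market-basket example from the earlier slides (seven transactions, minsup = 30%, minconf = 80%) can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation used in the study: the function names (support, candidate_gen, apriori, rules) are hypothetical, supports are recomputed by scanning all transactions rather than by incrementing counters, and exact fractions are an assumption made here so that threshold comparisons such as conf = 4/5 against minconf = 80% are not affected by floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# The seven transactions t1..t7 from the example slide.
TRANSACTIONS = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support(itemset, transactions):
    """Exact fraction of transactions that contain every item in `itemset`."""
    hits = sum(itemset <= t for t in transactions)
    return Fraction(hits, len(transactions))


def candidate_gen(prev_frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune every
    candidate that has an infrequent (k-1)-subset."""
    joined = {a | b for a in prev_frequent for b in prev_frequent
              if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev_frequent
                   for s in combinations(c, k - 1))}


def apriori(transactions, minsup):
    """Return {frequent itemset: support} for all itemsets with sup >= minsup."""
    level = {frozenset([i]) for t in transactions for i in t}  # C_1
    frequent = {}
    k = 1
    while level:
        counted = {c: support(c, transactions) for c in level}
        current = {c: s for c, s in counted.items() if s >= minsup}  # F_k
        frequent.update(current)
        k += 1
        level = candidate_gen(set(current), k)  # C_k for the next pass
    return frequent


def rules(frequent, minconf):
    """Yield (X, Y, sup, conf) for each rule X -> Y with conf >= minconf."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[lhs] is always present.
                conf = sup / frequent[lhs]
                if conf >= minconf:
                    yield lhs, itemset - lhs, sup, conf


frequent = apriori(TRANSACTIONS, minsup=Fraction(30, 100))
found = list(rules(frequent, minconf=Fraction(80, 100)))
for lhs, rhs, sup, conf in found:
    print(sorted(lhs), "->", sorted(rhs), f"sup={sup} conf={conf}")
```

Among the printed rules are the two from the example slide, Clothes -> Milk, Chicken and Clothes, Chicken -> Milk, each with sup = 3/7 and conf = 1. The pruning step in candidate_gen is what keeps the search away from the full $O(2^m)$ space mentioned in the discussion.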
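
The precision, recall, and F-score values reported for the T2DM models can be derived from the confusion-matrix counts. The sketch below assumes the balanced F-score (harmonic mean of precision and recall), which the slides do not state explicitly but which is consistent with the reported ARM values at cutoff 0.93. The Confusion class and its field names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for one model at one cutoff (actual class x predicted class)."""
    tn: int  # actual N, predicted False
    fp: int  # actual N, predicted True
    fn: int  # actual Y, predicted False
    tp: int  # actual Y, predicted True

    @property
    def precision(self) -> float:
        # Share of predicted positives that are actual positives.
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        # Share of actual positives that were predicted positive.
        return self.tp / (self.tp + self.fn)

    @property
    def f_score(self) -> float:
        # Harmonic mean of precision and recall (assumed definition).
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


# ARM at cutoff 0.93, using the counts from the confusion-matrix slide.
arm_093 = Confusion(tn=801, fp=6, fn=429, tp=54)
print(f"precision={arm_093.precision:.3f} "
      f"recall={arm_093.recall:.3f} "
      f"F={arm_093.f_score:.3f}")
```

For this row the sketch gives precision 0.900, recall 0.112, and F-score 0.199, matching the metrics table: at cutoff 0.93 almost every predicted positive is correct, but only 54 of the 483 actual positives are found.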