You are on page 1of 2

IDR Snapshot: Quantitative Assessment Methodology Evaluating Size and

Comprehensiveness of an Integrated Data Repository


Vojtech Huser, MD, PhD, James J Cimino, MD
Laboratory for Informatics Development, National Institutes of Health, Bethesda
Abstract
We have developed a methodology for evaluating size and comprehensiveness of an integrated data repository
(IDR). Our method uses an extensible set of measured computed with a standard database query language. It can be
applied at multiple institutions to facilitate comparison of institutional repositories or tracking a single repository
size and data coverage through time.
Introduction:
Many institutions have implemented an integrated data repository (IDRs), which is defined as a data warehouse with
clinical, administrative, clinical trial, and -omics data optimized for research purposes rather than clinical care. For
federation of single IDRs into larger research networks (such as DartNet, STRIDE, HMORN, or OMOP), it is
necessary to systematically assess the size and comprehensives of data of each contributing data partner. We have
previously piloted a set of such quantitative measures [1] and in this study we present application of this assessment
methodology to IDR at National Institutes of Health (NIH) and comparison of results with an IDR at Marshfield
Clinic (MC).
Methods and Preliminary Results
Methods: The IDR assessment methodology consists of computing a set of quantitative measures referred to as IDR
Snapshot. Our pilot study included a set of 8 measures that were specifically designed to be intuitive to interpret and
would facilitate continuous monitoring. For example, the measure D3 assessed the total number of patients with at
least one diagnosis, one laboratory result and one prescription. The IDR Snapshot also includes very basic measures,
such as the total number of events (G1) and the total number of patients (G2) within the repository. We used the
database of the Biomedical Translational Research Information System (BTRIS)[2] as the key data source evaluated
in this study. It integrates data from the NIH Clinical Center EHR system (Allscripts Sunrise Clinical Manager) and
numerous other systems across several NIH institutes (e.g., National Cancer Institutes C3D clinical trials data
management system). The IDR Snapshot is an open source project (available at http://code.google.com/p/idrsnapshot) that uses Structured Query Language (ANSI SQL:2008) and can be executed on all major database platforms.
Results: We have successfully applied existing IDR Snapshot measures to the NIH repository and a full results table
listing all measures will be included in the poster. Comparison of the results from NIH and MC shows similar drops
in data comprehensiveness when multiple patient data aspects are considered. NIH and MC had similar total event
counts (G1 measure, NIH: 0.48B events; MC: 0.57B); however, the comparison of the G2 measure (NIH: 0.3M
patients, MC: 1.7M) reveals some differences in data patterns (NIH measures assessed in September 2011). Lifetime
dimension measures within the IDR Snapshot clearly showed differences in covered time periods.
Conclusion, Discussion and Future Work
We have made several modifications to the methodology and the measure set based on the input from two IDRs.
Whereas version 1 used a simple custom event data model, version 2 provides scripts for the i2b2 schema and, via
data transformations present at several institutions, it can be applied to numerous institutions that are direct i2b2
adopters or institutions with indirect mapping to i2b2, such as NIH HMO Collaboratory sites. We also plan to
approach the CTSA Informatics Affinity Group for Data Repository and even implement the IDR Snapshot as an
i2b2 plug-in. The advantage of our measure set is that it does not place any arbitrary value on individual measure
components (e.g., value of 10 years of claims data vs. 10 years of detailed EHR data) and it is easy to extend the
methodology to accommodate additional data characteristics based on user feedback. The methodology has several
limitations based on heterogeneity of data repositories, such as storage of data elements as self standing facts or
event attributes, or different data schemes for sub-events (e.g., lab orderlab results event pairs).
References
1. Huser V, A Methodology for Quantitative Measurement of Quality and Comprehensiveness of a Research Data
Repository, Proc of 16th Annual HMO Research Network Conference 2010
2. Cimino JJ, Ayres EA. The Clinical Research Data Repository of the US National Institutes of Health.
Proceedings of Medinfo 2010:1299-1303.