
Standardizing Phenotype Variables in the Database of Genotypes and Phenotypes (dbGaP) based on Information Models

Ko-Wei Lin, DVM, PhD


Alexander Hsieh, Seena Farzaneh, BS, Son Doan, PhD, Hyeoneui Kim, RN, MPH, PhD
Division of Biomedical Informatics, University of California, San Diego, La Jolla, CA

2013 AMIA Summit on Translational Bioinformatics March 18, 2013

Overview
Introduction and Background
- Challenges in dbGaP database
- Objective of the study

Materials, Methods and Results
- Phase I: Test of CEM template models
- Phase II: Build information model

Conclusions
Current Status and Future Direction

dbGaP: database of Genotypes and Phenotypes


- Developed by the National Center for Biotechnology Information (NCBI)
- Data repository of studies such as Genome-Wide Association Studies (GWAS) that allows researchers to investigate associations between genotype and phenotype
- GWAS: examine common genetic variants in different individuals to see if any variant is associated with a trait
- Phenotypes: diseases, signs and symptoms, clinical attributes, etc.
- Currently hosts 400+ studies, 2500+ datasets, 130,000+ phenotype variables
- Reuse of dbGaP data: promote research discovery, validate existing findings, reduce time and cost, advance translational medical research.

[Figure: genotype and phenotype]


Challenges in current dbGaP

Unstandardized representation of phenotype variables results in incomplete and inaccurate data retrieval.
- No specific naming convention
- No specific meaning in abbreviated codes
- Idiosyncratic ways of representing the variable height

Introduction
With the advancements in genome-wide association studies (GWAS), public repositories of genotype and phenotype data, such as the database of Genotypes and Phenotypes (dbGaP), have become increasingly available online (1). The proper use or reuse of GWAS data could promote exploratory research, novel scientific discovery, validation of existing findings, and reduction of cost and time for research. However, data in such public repositories are not collected in a standardized or harmonized way, and hence it is challenging to reuse them. For example, as illustrated in Table 1, variables are often named without following a specific naming convention, or are labeled with abbreviated codes that do not convey specific meaning. Many of these variables are accompanied by variable descriptions that can help users understand what data the variable intends to represent. However, keyword searches applied to variable descriptions do not always provide accurate results due to many syntactic and lexical complexities associated with narrative text, such as use of negation and synonyms (3).

Table 1. Phenotype variable height in dbGaP

Variable ID          Variable Name          Variable Description
phv00071000.v1       htcm                   Standing height at follow up visit
phv00165340.v1.p2    ESP_HEIGHT_BASELINE    Standing height in cm at baseline
phv00083471.v1.p2    lunghta4               HEIGHT (cm)
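The naming problem in Table 1 can be made concrete with a small sketch. The three rows are copied from the table; the `search` helper is a hypothetical illustration of a plain keyword search and is not part of dbGaP or PhenDisco.

```python
# Rows copied from Table 1: (variable ID, variable name, variable description).
VARIABLES = [
    ("phv00071000.v1", "htcm", "Standing height at follow up visit"),
    ("phv00165340.v1.p2", "ESP_HEIGHT_BASELINE", "Standing height in cm at baseline"),
    ("phv00083471.v1.p2", "lunghta4", "HEIGHT (cm)"),
]


def search(keyword, field):
    """Hypothetical keyword search: return IDs whose chosen field contains the keyword."""
    index = {"name": 1, "description": 2}[field]
    needle = keyword.lower()
    return [row[0] for row in VARIABLES if needle in row[index].lower()]


# Searching the names finds only the variable that spells out "height".
by_name = search("height", "name")
# Searching the descriptions finds all three height variables.
by_description = search("height", "description")

print(by_name)
print(by_description)
```

In this toy case the descriptions recover every variable, which is why the study focuses on variable descriptions; real descriptions still suffer from the negation and synonym problems noted in the introduction.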

Idiosyncrasies in variable names are a major challenge to utilizing the data included in dbGaP. Standardizing phenotype variables in such a way that supports an accurate and complete search against dbGaP data is one of the main purposes of the Phenotype Finder in Data Resource (PFINDR) program funded by the National Heart, Lung, and Blood Institute (NHLBI). As the first step towards standardizing the phenotype variables in dbGaP, we tested the feasibility of using an existing information model for clinical data, the Clinical Element Models (CEM) developed by GE

Challenges in current dbGaP

http://www.ncbi.nlm.nih.gov/gap

Challenges in current dbGaP

- Idiosyncrasies of phenotype variables make it difficult to identify relevant data with a sufficient level of accuracy.
- Standardization of phenotype variables is important.
- Focus on variable descriptions.

PFINDR program (Phenotype Finder IN Data Resources)


PhenDisco: Phenotype Discoverer
[Figure: PhenDisco (PhD) system data flow, showing the original workflow and the new workflow. Labels: Data submitter; feedback/confirmation; semiautomated standardization & annotation; Study Description Annotator; Phenotype Variable Annotator; Demographics variables (DIVER); Other variables; Clustering; Query support (Query Parser, Structured Query Interface, Ranking Algorithms); Data user; Free text search (advanced search); Unsorted, flat list results; Structured search; Ranked results/Relevance feedback]

Objective
Goal: Investigate an information model based approach to standardizing phenotype variables in dbGaP.

Phase I: Test the feasibility of CEM template models to formally represent the phenotype variable descriptions in dbGaP.

Phase II: Develop our own information models and apply them to variable standardization.

The ultimate goal is to develop a Natural Language Processing (NLP) based system that algorithmically standardizes the phenotype variables in PhenDisco.


The Clinical Element Model (CEM)


- Developed by GE Health/Intermountain Healthcare Data Modeling and Terminology Team
- Support sharing computable meaning during data exchange between different systems
- A logical structure for representing detailed clinical data models

CEM Template Models


- Serve as basis for creating a CEM
- 6 domains: Disease and Disorders, Procedures, Signs and Symptoms, Medications, Anatomical Sites, and Laboratory Tests

Signs and Symptoms CEM Template Model:

Alleviating_factor          UMLS relations {manages, treats, prevents}
associatedCode              SNOMED CT, UMLS CUI
Body_laterality             {superior, inferior, medial, lateral, distal, proximal, dorsal, ventral}
Body_location               UMLS relation {location_of}
Body_side                   {left, right, bilateral, unmarked}
Conditional                 {true, false}
Course                      {unmarked, changed, increased, decreased, improved, worsened, resolved}
Duration                    Temporal Link
End_time                    Temporal Link
Exacerbating_factor         UMLS relations {complicates, disrupts}
Generic                     {true, false}
Negation_indicator          {negationAbsent, negationPresent}
Relative_temporal_context   Temporal Link
Severity                    UMLS relation {degree_of}
Start_time                  Temporal Link
Subject                     {patient, familyMember, donorFamilyMember, donorOther, other}
Uncertainty_indicator       {indicatorPresent, indicatorAbsent}
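To show how such a template constrains an annotation, here is a minimal Python sketch. The attribute names and value sets are copied from the Signs and Symptoms template above; the dictionary layout and the `validate` helper are assumptions made for illustration and are not the CEM specification or any official CEM tooling. Attributes defined by UMLS relations, codes, or temporal links are treated as free-form here.

```python
# Attributes with a closed value set, copied from the template listing.
ENUMERATED = {
    "Body_laterality": {"superior", "inferior", "medial", "lateral",
                        "distal", "proximal", "dorsal", "ventral"},
    "Body_side": {"left", "right", "bilateral", "unmarked"},
    "Conditional": {"true", "false"},
    "Course": {"unmarked", "changed", "increased", "decreased",
               "improved", "worsened", "resolved"},
    "Generic": {"true", "false"},
    "Negation_indicator": {"negationAbsent", "negationPresent"},
    "Subject": {"patient", "familyMember", "donorFamilyMember",
                "donorOther", "other"},
    "Uncertainty_indicator": {"indicatorPresent", "indicatorAbsent"},
}

# Attributes bound to UMLS relations, codes, or temporal links;
# this sketch accepts any value for them (an assumption).
FREE_FORM = {
    "Alleviating_factor", "associatedCode", "Body_location", "Duration",
    "End_time", "Exacerbating_factor", "Relative_temporal_context",
    "Severity", "Start_time",
}


def validate(instance):
    """Hypothetical check: return a list of problems found in an annotation dict."""
    errors = []
    for attribute, value in instance.items():
        if attribute in ENUMERATED:
            if value not in ENUMERATED[attribute]:
                errors.append(f"{attribute}: value {value!r} not allowed")
        elif attribute not in FREE_FORM:
            errors.append(f"{attribute}: unknown attribute")
    return errors


ok = validate({"Body_side": "left", "Body_location": "lower back",
               "Negation_indicator": "negationAbsent"})
bad = validate({"Body_side": "upper", "Mood": "sad"})
print(ok)
print(bad)
```

With 17 attributes per template, most of which stay empty for a short variable description, the sketch also hints at why the template models were judged overly complex for this task.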

Phase I: Representing phenotype variable descriptions in dbGaP using CEM template models

Material and Methods
1. Randomly retrieve 200 non-demographic phenotype variable descriptions from two phenotype data dictionaries in dbGaP.
2. Manually conduct the modeling using the six CEM template models.

Results
1. 115 unique variables
2. CEM template models represented 70% of the phenotype variable descriptions and are overly complex.

TABLE 5. CATEGORIES OF THE PHENOTYPE VARIABLE AND RELEVANT CEM TEMPLATE MODELS USED

Topics                                      Number of variables (%)   CEM template models used
Diseases and Disorders                      1 (0.87)                  Diseases and Disorders
Findings (excluding Disease or Disorder)    70 (60.87)                Signs and Symptoms
Medications                                 2 (1.74)                  Medication, Signs and Symptoms
Laboratory tests                            8 (6.96)                  Laboratory Tests, Signs and Symptoms
Not applicable                              30 (26.09)                ---
Unknown                                     4 (3.48)
Total number                                115

Mapping results

TABLE 4. RESULTS OF MAPPING PHENOTYPE NAMES TO CEM

Phenotype categories     Exact  Broad  Narrow  Related  Not mapped  Total
Diseases and Disorders       0    116       0        5           7    128
Procedures                   0      0       0        0           0      0
Signs and Symptoms           2     19       2        2          56     81
Medications                  0      0       0        0           0      0
Anatomical Sites             0      0       0        0           0      0
Labs                        20      2      44       10          21     97
Other                        3      6       2        7          32     50
Unknown                      0      0       0        0          23     23
Total number                25    143      48       24         139    379

Exact, Broad, Narrow, and Related together form the Mapped group (N=240).
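The 70% figure can be checked against the Table 5 counts. The sketch below assumes that a variable counts as "represented" when its topic is neither "Not applicable" nor "Unknown"; the slides do not state this definition explicitly, so treat the grouping as an assumption.

```python
# Variable counts per topic, copied from Table 5.
TOPIC_COUNTS = {
    "Diseases and Disorders": 1,
    "Findings (excluding Disease or Disorder)": 70,
    "Medications": 2,
    "Laboratory tests": 8,
    "Not applicable": 30,
    "Unknown": 4,
}

# Assumption: these two topics are the ones no CEM template model covered.
NOT_REPRESENTED = {"Not applicable", "Unknown"}

total = sum(TOPIC_COUNTS.values())
represented = sum(count for topic, count in TOPIC_COUNTS.items()
                  if topic not in NOT_REPRESENTED)
coverage_percent = round(100 * represented / total, 1)

print(total, represented, coverage_percent)
```

Under that assumption, 81 of 115 unique variables (about 70.4%) are covered, which is consistent with the reported 70%.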



Phase II Methods

[Workflow diagram. Labels: Randomly select 300 variable descriptions; Mapping; MetaMap; eHost; Information model (semantic roles); Develop rules; Algorithmic process; Test generalizability of the model]

eHost: South BR et al. BioNLP 2012, pages 130-139. http://code.google.com/p/ehost/

Phase II Results
Our information model was constructed with 10 semantic role classes.

No.  Semantic role class name   Examples
1    Topic                      Disease, Signs and symptoms
2    Subject of information     Patient, family members
3    Informer                   Doctor
4    Certainty                  Diagnosed, confirmed
5    Situational Context        While sleeping, after birth
6    Temporal modifier          Last month, since last visit
7    Extent modifier            Loudly, excessive
8    Health outcomes            Hospitalization
9    Body site                  Right leg, lower back
10   Quantity Qualifier         How many, count

Our model fully represented the key concepts in the 600 phenotype variable descriptions.
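One way to hold an annotation under this model is a record with one optional slot per semantic role class. The sketch below is an assumption about representation only: the field names follow the table, while the `SemanticRoles` class, its `filled` method, and the sample description are hypothetical and do not come from the PhenDisco code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SemanticRoles:
    """One optional slot per semantic role class, in the order of the table."""
    topic: Optional[str] = None
    subject_of_information: Optional[str] = None
    informer: Optional[str] = None
    certainty: Optional[str] = None
    situational_context: Optional[str] = None
    temporal_modifier: Optional[str] = None
    extent_modifier: Optional[str] = None
    health_outcomes: Optional[str] = None
    body_site: Optional[str] = None
    quantity_qualifier: Optional[str] = None

    def filled(self):
        """Return only the roles that carry a value."""
        return {f.name: getattr(self, f.name)
                for f in fields(self)
                if getattr(self, f.name) is not None}


# Hypothetical description: "Hospitalization of patient confirmed by doctor last month"
example = SemanticRoles(
    subject_of_information="patient",
    informer="doctor",
    certainty="confirmed",
    temporal_modifier="last month",
    health_outcomes="hospitalization",
)
print(example.filled())
```

Compared with the 17-attribute CEM template, a description typically fills only a few of the ten slots, and the rest simply stay empty.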

Mapping Example 1
Mom has lung cancer diagnosed by doctor last year

Topic                    lung cancer
Subject of Information   Mom
Informer                 doctor
Certainty                diagnosed
Situational Context      (none)
Temporal modifier        last year
Extent modifier          (none)
Health outcomes          (none)
Body site                (none)
Quantity Qualifier       (none)

Mapping Example 2
Minor pain in lower back after running

Topic                    pain
Subject of Information   Subject
Informer                 (none)
Certainty                (none)
Situational Context      after running
Temporal modifier        (none)
Extent modifier          minor
Health outcomes          (none)
Body site                lower back
Quantity Qualifier       (none)
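The two mapping examples can be reproduced with a toy dictionary lookup. This is only a sketch of the idea of rule-based role assignment: the `LEXICON` entries are taken from the two examples, while the `tag` function, the longest-phrase-first matching, and the default of "subject" when no subject is mentioned are assumptions and are not the authors' rules or NLP pipeline.

```python
import re

# Phrase-to-role lexicon built only from the two mapping examples (illustrative).
LEXICON = {
    "lung cancer": "Topic",
    "pain": "Topic",
    "mom": "Subject of information",
    "doctor": "Informer",
    "diagnosed": "Certainty",
    "last year": "Temporal modifier",
    "minor": "Extent modifier",
    "lower back": "Body site",
    "after running": "Situational context",
}


def tag(description):
    """Hypothetical tagger: assign roles by whole-phrase lookup, longest phrases first."""
    text = description.lower()
    roles = {}
    for phrase in sorted(LEXICON, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            # Keep the first (longest) phrase found for each role.
            roles.setdefault(LEXICON[phrase], phrase)
    # Assumption: with no explicit subject, the study subject is implied,
    # as in Mapping Example 2.
    roles.setdefault("Subject of information", "subject")
    return roles


example_1 = tag("Mom has lung cancer diagnosed by doctor last year")
example_2 = tag("Minor pain in lower back after running")
print(example_1)
print(example_2)
```

Values are lower-cased by the lookup. A real system would need a far larger vocabulary (for instance concepts identified by a tool such as MetaMap) and rules for negation and synonyms, which this sketch does not attempt.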

Conclusions
- We developed an information model for a simple NLP algorithm to standardize phenotype variables.
- Our experience showed that direct analysis of the phenotype variable descriptions in dbGaP is an important component for developing a workable information model.

Current Status and Future Direction


- We have developed a system for tagging the phenotype variables with two main semantic roles, topic and subject of information, and the system achieved 69% accuracy in semantic tagging.
- We plan to process all phenotype variables in dbGaP and add them into the pipeline. We will evaluate whether it improves the accuracy of phenotype queries in PhenDisco.
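For readers unfamiliar with how a tagging accuracy such as the 69% above is typically obtained, here is a minimal sketch. It assumes accuracy means the fraction of variables whose predicted (topic, subject of information) pair exactly matches a manual gold annotation; the slides do not define the measure, and the four data points are invented purely for illustration.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of items whose predicted tags exactly match the gold tags."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Invented (topic, subject of information) pairs, not results from the study.
gold = [
    ("pain", "subject"),
    ("lung cancer", "mom"),
    ("height", "subject"),
    ("asthma", "subject"),
]
predicted = [
    ("pain", "subject"),
    ("lung cancer", "subject"),  # subject mis-tagged
    ("height", "subject"),
    ("asthma", "subject"),
]

accuracy = tagging_accuracy(gold, predicted)
print(accuracy)
```

Scoring each role separately instead of the pair would give a more lenient figure, so the choice of unit matters when comparing systems.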

Acknowledgements
University of California San Diego
Division of Biomedical Informatics
- Lucila Ohno-Machado, MD, PhD
- Wendy Chapman, PhD
- Mike Conway, PhD
- Jihoon Kim, MS
- Mindy Ross, MD, MBA
- Melissa Tharp, BS

Current and past PFINDR team members: Dr. Xiaoqian Jiang, Dr. Neda Alipanah, Stephanie Feudjio Feupe, Rebacca Walker, Asher Garland, Jing Zhang, Ustun Yildiz, Karen Truong, Vinay Venkatesh, Rafael Talavera

Collaborator:
Hua Xu, PhD (Vanderbilt University)

NIH/NHLBI (The National Heart, Lung, and Blood Institute) grant UH2HL108785