Chapter 11: Genome-Wide Association Studies

Article  in  PLoS Computational Biology · December 2012


DOI: 10.1371/journal.pcbi.1002822 · Source: PubMed


2 authors:

William S Bush Jason H Moore


Case Western Reserve University University of Pennsylvania


Education

Chapter 11: Genome-Wide Association Studies


William S. Bush1*, Jason H. Moore2
1 Department of Biomedical Informatics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, Tennessee, United States of America,
2 Departments of Genetics and Community Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth Medical School, Lebanon, New Hampshire, United
States of America

Abstract: Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

This article is part of the "Translational Bioinformatics" collection for PLOS Computational Biology.

Citation: Bush WS, Moore JH (2012) Chapter 11: Genome-Wide Association Studies. PLoS Comput Biol 8(12): e1002822. doi:10.1371/journal.pcbi.1002822
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Bush, Moore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIH grants R01-LM010098, R01-LM009012, R01-AI59694, R01-EY022300, and R01-LM011360. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: william.s.bush@vanderbilt.edu

1. Important Questions in Human Genetics

A central goal of human genetics is to identify genetic risk factors for common, complex diseases such as schizophrenia and type II diabetes, and for rare Mendelian diseases such as cystic fibrosis and sickle cell anemia. There are many different technologies, study designs and analytical tools for identifying genetic risk factors. We will focus here on the genome-wide association study or GWAS that measures and analyzes DNA sequence variations from across the human genome in an effort to identify genetic risk factors for diseases that are common in the population. The ultimate goal of GWAS is to use genetic risk factors to make predictions about who is at risk and to identify the biological underpinnings of disease susceptibility for developing new prevention and treatment strategies. One of the early successes of GWAS was the identification of the Complement Factor H gene as a major risk factor for age-related macular degeneration or AMD [1–3]. Not only were DNA sequence variations in this gene associated with AMD but the biological basis for the effect was demonstrated. Understanding the biological basis of genetic effects will play an important role in developing new pharmacologic therapies.

While understanding the complexity of human health and disease is an important objective, it is not the only focus of human genetics. Accordingly, one of the most successful applications of GWAS has been in the area of pharmacology. Pharmacogenetics has the goal of identifying DNA sequence variations that are associated with drug metabolism and efficacy as well as adverse effects. For example, warfarin is a blood-thinning drug that helps prevent blood clots in patients. Determining the appropriate dose for each patient is important and believed to be partly controlled by genes. A recent GWAS revealed DNA sequence variations in several genes that have a large influence on warfarin dosing [4]. These results, and more recent validation studies, have led to genetic tests for warfarin dosing that can be used in a clinical setting. This type of genetic test has given rise to a new field called personalized medicine that aims to tailor healthcare to individual patients based on their genetic background and other biological features. The widespread availability of low-cost technology for measuring an individual's genetic background has been harnessed by businesses that are now marketing genetic testing directly to the consumer. Genome-wide association studies, for better or for worse, have ushered in the exciting era of personalized medicine and personal genetic testing. The goal of this chapter is to introduce and review GWAS technology, study design and analytical strategies as an important example of translational bioinformatics. We focus here on the application of GWAS to common diseases that have a complex multifactorial etiology.

2. Concepts Underlying the Study Design

2.1 Single Nucleotide Polymorphisms
The modern unit of genetic variation is the single nucleotide polymorphism or SNP. SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome [5]. For the purposes of genetic studies, SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems. SNPs can have functional consequences, however, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity [6]. SNPs are by far the most abundant form of genetic variation in the human genome.

SNPs are notably a type of common genetic variation; many SNPs are present in a large proportion of human populations [7]. SNPs typically have two alleles, meaning within a population there are two commonly occurring base-pair possibilities for a SNP location. The frequency of a SNP is given in terms of the minor allele frequency or the frequency of the less common allele. For example, a SNP with a minor allele (G) frequency of 0.40 implies that 40% of the chromosomes in a population carry the G allele, while the more common allele (the major allele) is found on the remaining 60%.
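The minor allele frequency described in Section 2.1 can be illustrated with a short calculation. The sketch below is only an illustration: the function name, the two-character genotype strings, and the ten-person sample are hypothetical and do not come from the chapter.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for a biallelic SNP.

    `genotypes` is a list of two-character strings such as "AG",
    one per individual (a hypothetical input format).
    """
    allele_counts = Counter()
    for genotype in genotypes:
        allele_counts.update(genotype)  # each individual contributes two alleles
    if len(allele_counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(allele_counts) < 2:
        return None, 0.0  # monomorphic site: no minor allele observed
    total = sum(allele_counts.values())
    minor, count = min(allele_counts.items(), key=lambda item: item[1])
    return minor, count / total

# Hypothetical sample of 10 individuals: 4 AA, 4 AG, 2 GG.
# G is seen on 4 + 2*2 = 8 of the 20 chromosomes.
sample = ["AA"] * 4 + ["AG"] * 4 + ["GG"] * 2
allele, maf = minor_allele_frequency(sample)
print(allele, maf)  # G 0.4
```

In this sample 6 of the 10 individuals carry at least one G allele, so the allele frequency (40% of chromosomes) and the carrier frequency (60% of individuals) are different quantities.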

PLOS Computational Biology | www.ploscompbiol.org 1 December 2012 | Volume 8 | Issue 12 | e1002822


What to Learn in This Chapter

- Basic genetic concepts that drive genome-wide association studies
- Genotyping technologies and common study designs
- Statistical concepts for GWAS analysis
- Replication, interpretation, and follow-up of association results

Commonly occurring SNPs lie in stark contrast to genetic variants that are implicated in more rare genetic disorders, such as cystic fibrosis [8]. These conditions are largely caused by extremely rare genetic variants that ultimately induce a detrimental change to protein function, which leads to the disease state. Variants with such low frequency in the population are sometimes referred to as mutations, though they can be structurally equivalent to SNPs: single base-pair changes in the DNA sequence. In the genetics literature, the term SNP is generally applied to common single base-pair changes, and the term mutation is applied to rare genetic variants.

2.2 Failures of Linkage for Complex Disease
Cystic fibrosis (and most rare genetic disorders) can be caused by multiple different genetic variants within a single gene. Because the effect of the genetic variants is so strong, cystic fibrosis follows an autosomal recessive inheritance pattern in families with the disorder. One of the major successes of human genetics was the identification of multiple mutations in the CFTR gene as the cause of cystic fibrosis [8]. This was achieved by genotyping families affected by cystic fibrosis using a collection of genetic markers across the genome, and examining how those genetic markers segregate with the disease across multiple families. This technique, called linkage analysis, was subsequently applied successfully to identify genetic variants that contribute to rare disorders like Huntington disease [9]. When applied to more common disorders, like heart disease or various forms of cancer, linkage analysis has not fared as well. This implies the genetic mechanisms that influence common disorders are different from those that cause rare disorders [10].

2.3 Common Disease Common Variant Hypothesis
The idea that common diseases have a different underlying genetic architecture than rare disorders, coupled with the discovery of several susceptibility variants for common disease with high minor allele frequency (including alleles in the apolipoprotein E or APOE gene for Alzheimer's disease [11] and PPARγ gene in type II diabetes [12]), led to the development of the common disease/common variant (CD/CV) hypothesis [13].

This hypothesis states simply that common disorders are likely influenced by genetic variation that is also common in the population. There are several key ramifications of this for the study of complex disease. First, if common genetic variants influence disease, the effect size (or penetrance) for any one variant must be small relative to that found for rare disorders. For example, if a SNP with 40% frequency in the population causes a highly deleterious amino acid substitution that directly leads to a disease phenotype, nearly 40% of the population would have that phenotype. Thus, the allele frequency and the population prevalence are completely correlated. If, however, that same SNP caused a small change in gene expression that alters risk for a disease by some small amount, the prevalence of the disease and the influential allele would be only slightly correlated. As such, common variants almost by definition cannot have high penetrance.

Secondly, if common alleles have small genetic effects (low penetrance), but common disorders show heritability (inheritance in families), then multiple common alleles must influence disease susceptibility. For example, twin studies might estimate the heritability of a common disease to be 40%, that is, 40% of the total variance in disease risk is due to genetic factors. If the allele of a single SNP incurs only a small degree of disease risk, that SNP only explains a small proportion of the total variance due to genetic factors. As such, the total genetic risk due to common genetic variation must be spread across multiple genetic factors. These two points suggest that traditional family-based genetic studies are not likely to be successful for complex diseases, prompting a shift toward population-based studies.

The frequency with which an allele occurs in the population and the risk incurred by that allele for complex diseases are key components to consider when planning a genetic study, impacting the technology needed to gather genetic information and the sample size needed to discover statistically significant genetic effects. The spectrum of potential genetic effects is sometimes visualized and partitioned by effect size and allele frequency (figure 1). Genetic effects in the upper left are more amenable to smaller family-based studies and linkage analysis, and may require genotyping relatively few genetic markers. Effects in the lower right are typical of findings from GWAS, requiring large sample sizes and a large panel of genetic markers. Effects in the upper right, most notably CFH, have been identified using both linkage analysis and GWAS. Effects in the lower left are perhaps the most difficult challenge, requiring genomic sequencing of large samples to associate rare variants to disease.

Over the last five years, the common disease/common variant hypothesis has been tested for a variety of common diseases, and while much of the heritability for these conditions is not yet explained, common alleles certainly play a role in susceptibility. The National Human Genome Research Institute GWAS catalog (http://www.genome.gov/gwastudies) lists over 3,600 SNPs identified for common diseases or traits, and in general, common diseases have multiple susceptibility alleles, each with small effect sizes (typically increasing disease risk between 1.2–2 times the population risk) [14]. From these results we can say that for most common diseases, the CD/CV hypothesis is true, though it should not be assumed that the entire genetic component of any common disease is due to common alleles only.

3. Capturing Common Variation

3.1 The Human Haplotype Map Project
To test the common disease/common variant hypothesis for a phenotype, a systematic approach is needed to interrogate much of the common variation in the human genome. First, the location and density of commonly occurring SNPs is needed to identify the genomic regions and individual sites that must be examined by genetic studies. Secondly, population-specific differences in genetic variation must be cataloged so that studies of phenotypes in different populations can be conducted with the proper design. Finally, correlations among common genetic variants must be determined so that genetic studies do not collect redundant information. The International HapMap Project was designed to identify variation



Figure 1. Spectrum of Disease Allele Effects. Disease associations are often conceptualized in two dimensions: allele frequency and effect size.
Highly penetrant alleles for Mendelian disorders are extremely rare with large effect sizes (upper left), while most GWAS findings are associations of
common SNPs with small effect sizes (lower right). The bulk of discovered genetic associations lie on the diagonal denoted by the dashed lines.
doi:10.1371/journal.pcbi.1002822.g001
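The argument of Section 2.3, that a common allele of small effect explains little of the heritable variance and that many such alleles are therefore needed, can be made concrete with a rough calculation. The sketch below rests on simplifying assumptions that are not part of the chapter: a quantitative trait standardized to unit variance, Hardy-Weinberg genotype proportions, purely additive effects, and a hypothetical per-allele effect of 0.10 standard deviations. Under those assumptions a SNP with allele frequency p and per-allele effect β explains 2p(1-p)β² of the trait variance.

```python
import math

def variance_explained(maf, beta):
    """Fraction of trait variance explained by one additive SNP.

    Assumes Hardy-Weinberg proportions and a trait standardized to unit
    variance; `beta` is the per-allele effect in standard-deviation units.
    """
    return 2.0 * maf * (1.0 - maf) * beta ** 2

def loci_needed(heritability, maf, beta):
    """Number of independent, equally sized loci needed to reach `heritability`."""
    return math.ceil(heritability / variance_explained(maf, beta))

# Hypothetical common variant: 40% allele frequency, 0.10 SD per-allele effect.
per_snp = variance_explained(0.40, 0.10)
print(round(per_snp, 4))              # 0.0048, i.e. under half a percent
print(loci_needed(0.40, 0.40, 0.10))  # 84 such loci to reach 40% heritability
```

Disease risk is usually a binary outcome rather than a standardized quantitative trait, so the numbers are illustrative only; the point is the order of magnitude, which places such variants in the lower right of figure 1.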

across the genome and to characterize correlations among variants.

The International HapMap Project used a variety of sequencing techniques to discover and catalog SNPs in European-descent populations, the Yoruba population of African origin, Han Chinese individuals from Beijing, and Japanese individuals from Tokyo [15,16]. The project has since been expanded to include 11 human populations, with genotypes for 1.6 million SNPs [7]. HapMap genotype data allowed the examination of linkage disequilibrium.

3.2 Linkage Disequilibrium
Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population. The term linkage disequilibrium was coined by population geneticists in an attempt to mathematically describe changes in genetic variation within a population over time. It is related to the concept of chromosomal linkage, where two markers on a chromosome remain physically joined on a chromosome through generations of a family. In figure 2, two founder chromosomes are shown (one in blue and one in orange). Recombination events within a family from generation to generation break apart chromosomal segments. This effect is amplified through generations, and in a population of fixed size undergoing random mating, repeated random recombination events will break apart segments of contiguous chromosome (containing linked alleles) until eventually all alleles in the population are in linkage equilibrium or are independent. Thus, linkage between markers on a population scale is referred to as linkage disequilibrium.

The rate of LD decay is dependent on multiple factors, including the population size, the number of founding chromosomes in the population, and the number of generations for which the population has existed. As such, different human sub-populations have different degrees and patterns of LD. African-descent populations are the most ancestral and have smaller regions of LD due to the accumulation of more recombination events in that group. European-descent and Asian-descent populations were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, the population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups.

Many measures of LD have been proposed [17], though all are ultimately related to the difference between the observed frequency of co-occurrence for two alleles (i.e. a two-marker haplotype) and the frequency expected if the two markers are independent. The two commonly used measures of linkage disequilibrium are D' and r² [15,17] shown in equations 1 and 2. In these equations, $p_{ab}$ is the frequency of the ab haplotype, $p_a$ is



Figure 2. Linkage and Linkage Disequilibrium. Within a family, linkage occurs when two genetic markers (points on a chromosome) remain
linked on a chromosome rather than being broken apart by recombination events during meiosis, shown as red lines. In a population, contiguous
stretches of founder chromosomes from the initial generation are sequentially reduced in size by recombination events. Over time, a pair of markers
or points on a chromosome in the population move from linkage disequilibrium to linkage equilibrium, as recombination events eventually occur
between every possible point on the chromosome.
doi:10.1371/journal.pcbi.1002822.g002
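The gradual move from linkage disequilibrium to linkage equilibrium shown in figure 2 can be sketched with the textbook deterministic decay model, D_t = D_0 (1 - c)^t, where c is the recombination fraction between the two markers and t is the number of generations. This model is an idealization brought in for illustration, not a method from the chapter: it assumes random mating in an effectively infinite population, so it ignores the effects of population size and founder events discussed in Section 3.2. The function names and parameter values are hypothetical.

```python
import math

def ld_after_generations(d0, recomb_fraction, generations):
    """Expected LD coefficient D after some generations of random mating.

    Idealized model (an assumption, not from the chapter): infinite
    population, no selection, constant recombination fraction per generation.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - recomb_fraction) ** generations

def generations_to_halve(recomb_fraction):
    """Generations needed for D to fall to half of its starting value."""
    if recomb_fraction <= 0.0:
        return math.inf  # markers that never recombine stay in LD
    return math.log(0.5) / math.log(1.0 - recomb_fraction)

# Hypothetical marker pairs: tightly linked (c = 0.001) and loosely linked (c = 0.1).
for c in (0.001, 0.1):
    trajectory = [ld_after_generations(0.25, c, t) for t in (0, 10, 100)]
    print(c, [round(d, 4) for d in trajectory], round(generations_to_halve(c), 1))
```

Under this model tightly linked markers keep most of their LD for hundreds of generations, while loosely linked markers lose it within a few dozen, which is consistent with older populations showing shorter stretches of LD.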

the frequency of the a allele, and $p_b$ is the frequency of the b allele. Terms with other subscripts ($p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_A$, $p_B$) are the corresponding haplotype and allele frequencies for the alternative alleles A and B.

$$
D' =
\begin{cases}
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_b,\; p_a p_B)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} > 0 \\[2ex]
\dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_B,\; p_a p_b)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} < 0
\end{cases}
\tag{1}
$$

$$
r^2 = \frac{(p_{AB}\,p_{ab} - p_{Ab}\,p_{aB})^2}{p_A\, p_B\, p_a\, p_b}
\tag{2}
$$

D' is a population genetics measure that is related to recombination events between markers and is scaled between 0 and 1. A D' value of 0 indicates complete linkage equilibrium, which implies frequent recombination between the two markers and statistical independence under principles of Hardy-Weinberg equilibrium. A D' of 1 indicates complete LD, indicating no recombination between the two markers within the population. For the purposes of genetic analysis, LD is generally reported in terms of r², a statistical measure of correlation. High r² values indicate that two SNPs convey similar information, as one allele of the first SNP is often observed with one allele of the second SNP, so only one of the two SNPs needs to be genotyped to capture the allelic variation. There are dependencies between these two statistics; r² is sensitive to the allele frequencies of the two markers, and can only be high in regions of high D'.

One often forgotten issue associated with LD measures is that current technology does not allow direct measurement of haplotype frequencies from a sample because each SNP is genotyped independently and the phase or chromosome of origin for each allele is unknown. Many well-developed and documented methods for inferring haplotype phase and estimating the subsequent two-marker haplotype frequencies exist, and generally lead to reasonable results [18].

SNPs that are selected specifically to capture the variation at nearby sites in the genome are called tag SNPs because alleles for these SNPs tag the surrounding stretch of LD. As noted before, patterns of LD are population specific and as such, tag SNPs selected for one population may not work well for a different population. LD is exploited to optimize genetic studies, preventing genotyping SNPs that provide redundant information. Based on analysis of data from the HapMap project, >80% of commonly occurring SNPs in European-descent populations can be captured using a subset of 500,000 to one million SNPs scattered across the genome [19].

3.3 Indirect Association
The presence of LD creates two possible positive outcomes from a genetic association study. In the first outcome, the SNP influencing a biological system that ultimately leads to the phenotype is directly genotyped in the study and found to be statistically associated with the trait. This is referred to as a direct association, and the genotyped SNP is sometimes referred to as the functional SNP. The second possibility is that the influential SNP is not directly typed, but instead a tag SNP in high LD with the influential SNP is typed and statistically associated to the phenotype (figure 3). This is referred to as an indirect association [10]. Because of these two possibilities, a significant SNP association from a GWAS should not be assumed as the causal variant and may require



Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linkage disequilibrium with an influential allele. The genotyped SNP
will be statistically associated with disease as a surrogate for the disease SNP through an indirect association.
doi:10.1371/journal.pcbi.1002822.g003
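How well a genotyped SNP can stand in for an untyped influential SNP, as in figure 3, depends on the LD between them. The sketch below computes the D' and r² measures of equations 1 and 2 from the four two-marker haplotype frequencies. The function name and the example frequencies are hypothetical, and the function returns the absolute value of D' so that the result lies between 0 and 1 as described in Section 3.2; that is a presentational choice made here.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D_prime, r_squared) from two-marker haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    denom = p_A * p_a * p_B * p_b
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_AB * p_ab - p_Ab * p_aB
    # The scaling term depends on the sign of d, as in equation 1.
    d_max = min(p_A * p_b, p_a * p_B) if d >= 0 else min(p_A * p_B, p_a * p_b)
    return abs(d) / d_max, d * d / denom

# Hypothetical haplotype frequencies (AB, Ab, aB, ab).
print(ld_measures(0.5, 0.0, 0.0, 0.5))      # perfect proxies
print(ld_measures(0.6, 0.0, 0.2, 0.2))      # no recombinant haplotype, unequal allele frequencies
print(ld_measures(0.25, 0.25, 0.25, 0.25))  # linkage equilibrium
```

In the second example one haplotype is absent, so D' is 1, yet r² is only about 0.375 because the two SNPs have different allele frequencies. This is the dependency noted in Section 3.2: r² can be high only where D' is high, but a high D' does not guarantee that one SNP is a good tag for the other.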

additional studies to map the precise location of the influential SNP.

Conceptually, the end result of GWAS under the common disease/common variant hypothesis is that a panel of 500,000 to one million markers will identify common SNPs that are associated to common phenotypes. To conduct such a study practically requires a genotyping technology that can accurately capture the alleles of 500,000 to one million SNPs for each individual in a study in a cost-effective manner.

4. Genotyping Technologies

Genome-wide association studies were made possible by the availability of chip-based microarray technology for assaying one million or more SNPs. Two primary platforms have been used for most GWAS. These include products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA). These two competing technologies have been recently reviewed [20] and offer different approaches to measure SNP variation. For example, the Affymetrix platform prints short DNA sequences as a spot on the chip that recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by differential hybridization of the sample DNA. Illumina, on the other hand, uses a bead-based technology with slightly longer DNA sequences to detect alleles. The Illumina chips are more expensive to make but provide better specificity.

Aside from the technology, another important consideration is the SNPs that each platform has selected for assay. This can be important depending on the specific human population being studied. For example, it is important to use a chip that has more SNPs with better overall genomic coverage for a study of Africans than Europeans. This is because African genomes have had more time to recombine and therefore have less LD between alleles at different SNPs. More SNPs are needed to capture the variation across the African genome.

It is important to note that the technology for measuring genomic variation is changing rapidly. Chip-based genotyping platforms such as those briefly mentioned above will likely be replaced over the next few years with inexpensive new technologies for sequencing the entire genome. These next-generation sequencing methods will provide all the DNA sequence variation in the genome. It is time now to retool for this new onslaught of data.

5. Study Design

Regardless of assumptions about the genetic model of a trait, or the technology used to assess genetic variation, no genetic study will have meaningful results without a thoughtful approach to characterize the phenotype of interest. When embarking on a genetic study, the initial focus should be on identifying precisely what quantity or trait genetic variation influences.

5.1 Case Control versus Quantitative Designs
There are two primary classes of phenotypes: categorical (often binary case/control) or quantitative. From the statistical perspective, quantitative traits are preferred because they improve power to detect a genetic effect, and often have a more interpretable outcome. For some disease traits of interest, quantitative disease risk factors have already been identified. High-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol levels are strong predictors of heart disease, and so genetic studies of heart disease outcomes can be conducted by examining these levels as a quantitative trait. Assays for HDL and LDL levels, being already useful for clinical practice, are precise and ubiquitous measurements that are easy to obtain. Genetic variants that influence these levels have a clear interpretation: for example, a unit change in LDL level per allele or by genotype class. With an easily measurable ubiquitous quantitative trait, GWAS of blood lipids have been conducted in numerous cohort studies. Their results were also easily combined to conduct an extremely well-powered massive meta-analysis, which revealed 95 loci associated to lipid traits in more than 100,000 people [21]. Here, HDL and LDL may be the primary traits of interest or can be considered intermediate quantitative traits or endophenotypes for cardiovascular disease.

Other disease traits do not have well-established quantitative measures. In these circumstances, individuals are usually classified as either affected or unaffected, a binary categorical variable. Consider the vast difference in measurement error associated with classifying individuals as either "case" or "control" versus precisely measuring a quantitative trait. For example, multiple sclerosis is a complex clinical phenotype that is often diagnosed over a long period of time by ruling out other possible conditions. However, despite the "loose" classification of case and control, GWAS of multiple sclerosis have been enormously successful, implicating more than 10 new genes for the disorder [22]. So while quantitative outcomes are preferred, they are not required for a successful study.

5.2 Standardized Phenotype Criteria
A major component of the success with multiple sclerosis and other well-conducted case/control studies is the definition of rigorous phenotype criteria, usually presented as a rule list based on clinical variables. Multiple sclerosis studies often use the McDonald criteria for establishing case/control status and defining clinical subtypes [23]. Standardized methods like the McDonald criteria establish a concise, evidence-based approach that can be uniformly applied by multiple diagnosing clinicians to ensure that consistent phenotype definitions are used for a genetic study.

Standardized phenotype rules are particularly critical for multi-center studies to prevent introducing a site-based effect into the study. And even when established phenotype criteria are used, there may be variability among clinicians in how those criteria are used to assign case/control status. Furthermore, some quantitative traits are susceptible to bias in measurement. For example, with cataract severity, lens photographs are used to assign cases to one of three types of lens opacity. In situations where there may be disagreement among clinicians, a subset of study records is often examined by clinicians at multiple centers to assess interrater agreement as a measure of phenotyping consistency [24]. High interrater agreement means that phenotype rules are being consistently applied across multiple sites, whereas low agreement suggests that criteria are not uniformly interpreted or applied, and may indicate a need to establish more narrow phenotype criteria.

5.3 Phenotype Extraction from Electronic Medical Records
The last few years of genetic research have seen the growth of large clinical bio-repositories that are linked to electronic medical records (EMRs) [25]. The development of these resources will certainly advance the state of human genetics research and foster integration of genetic information into clinical practice. From a study design perspective, identifying phenotypes from EMRs can be challenging. Electronic medical records were established for clinical care and administrative purposes, not for research. As such, idiosyncrasies arise due to billing practices and other logistical reasons, and great care must be taken not to introduce biases into a genetic study.

The established methodology for conducting "electronic phenotyping" is to devise an initial selection algorithm (using structured EMR fields, such as billing codes, or text mining procedures on unstructured text), which identifies a record subset from the bio-repository. In cases where free text is parsed, natural language processing (NLP) is used in conjunction with a controlled vocabulary such as the Unified Medical Language

billing and procedure codes, along with free text are necessary. Because every medical center has its own set of policies, care providers, and health insurance providers, some algorithms developed in one clinical setting may not work as well in another.

Once a manageable subset of records is obtained by an algorithm, the accuracy of the results is examined by clinicians or other phenotype experts as a gold standard for comparison. The positive predictive value (PPV) of the initial algorithm is assessed, and based on feedback from case reviewers, the selection algorithm is refined. This process of case-review followed by algorithmic refinement is continued until the desired PPV is reached.

This approach has been validated by replicating established genotype-phenotype relationships using EMR-derived phenotypes [16], and has been applied to multiple clinical and pharmacogenomic conditions [26–28].

6. Association Test

6.1 Single Locus Analysis
When a well-defined phenotype has been selected for a study population, and genotypes are collected using sound techniques, the statistical analysis of genetic data can begin. The de facto analysis of genome-wide association data is a series of single-locus statistical tests, examining each SNP independently for association to the phenotype. The statistical test conducted depends on a variety of factors, but first and foremost, statistical tests are different for quantitative traits versus case/control studies.

Quantitative traits are generally analyzed using generalized linear model (GLM) approaches, most commonly the Analysis of Variance (ANOVA), which is similar to linear regression with a categorical predictor variable, in this case genotype classes. The null hypothesis of an ANOVA using a single SNP is that there is no difference between the trait means of any genotype group. The assumptions of GLM and ANOVA are 1) the trait is normally distributed; 2) the trait variance within each group is the same (the groups are homoskedastic); 3) the groups are independent.

Dichotomous case/control traits are generally analyzed using either contingency table methods or logistic regression.

chi-square test (and the related Fisher's exact test).

Logistic regression is an extension of linear regression where the outcome of a linear model is transformed using a logistic function that predicts the probability of having case status given a genotype class. Logistic regression is often the preferred approach because it allows for adjustment for clinical covariates (and other factors), and can provide adjusted odds ratios as a measure of effect size. Logistic regression has been extensively developed, and numerous diagnostic procedures are available to aid interpretation of the model.

For both quantitative and dichotomous trait analysis (regardless of the analysis method), there are a variety of ways that genotype data can be encoded or shaped for association tests. The choice of data encoding can have implications for the statistical power of a test, as the degrees of freedom for the test may change depending on the number of genotype-based groups that are formed. Allelic association tests examine the association between one allele of the SNP and the phenotype. Genotypic association tests examine the association between genotypes (or genotype classes) and the phenotype. The genotypes for a SNP can also be grouped into genotype classes or models, such as dominant, recessive, multiplicative, or additive models [29].

Each model makes different assumptions about the genetic effect in the data. Assuming two alleles for a SNP, A and a, a dominant model (for A) assumes that having one or more copies of the A allele increases risk compared to a (i.e. Aa or AA genotypes have higher risk). The recessive model (for A) assumes that two copies of the A allele are required to alter risk, so individuals with the AA genotype are compared to individuals with Aa and aa genotypes. The multiplicative model (for A) assumes that if there is 3× risk for having a single A allele, there is a 9× risk for having two copies of the A allele: in this case if the risk for Aa is k, the risk for AA is k². The additive model (for A) assumes that there is a uniform, linear increase in risk for each copy of the A allele, so if the risk is 3× for Aa, there is a 6× risk for AA; in this case the risk for Aa is k and the risk for AA is 2k. A common practice for GWAS is to exam-
System (UMLS) to relate text to more Contingency table tests examine and ine additive models only, as the additive
structured and uniform medical con- measure the deviation from independence model has reasonable power to detect
cepts. In some instances, billing codes that is expected under the null hypothesis both additive and dominant effects, but it
alone may be sufficient to accurately that there is no association between the is important to note that an additive
identify individuals with a particular phenotype and genotype classes. The most model may be underpowered to detect
phenotype, but often combinations of ubiquitous form of this test is the popular some recessive effects [30]. Rather than

PLOS Computational Biology | www.ploscompbiol.org 6 December 2012 | Volume 8 | Issue 12 | e1002822


choosing one model a priori, some studies evaluate multiple genetic models coupled with an appropriate correction for multiple testing.

6.2 Covariate Adjustment and Population Stratification
In addition to selecting an encoding scheme, statistical tests should be adjusted for factors that are known to influence the trait, such as sex, age, study site, and known clinical covariates. Covariate adjustment reduces spurious associations due to sampling artifacts or biases in study design, but adjustment comes at the price of using additional degrees of freedom, which may impact statistical power. One of the more important covariates to consider in genetic analysis is a measure of population substructure. There are often known differences in phenotype prevalence due to ethnicity, and allele frequencies are highly variable across human subpopulations, meaning that in a sample with multiple ethnicities, ethnic-specific SNPs will likely be associated with the trait due to population stratification. To control for population stratification, the ancestry of each sample in the dataset is measured using STRUCTURE [31] or EIGENSTRAT [32] methods that compare genome-wide allele frequencies to those of HapMap ethnic groups. The results of these analyses can be used either to exclude samples with similarity to a non-target population or as a covariate in association analysis. EIGENSTRAT is commonly used in this circumstance, where principal component analysis is used to generate principal component values that could be described as an “ethnicity score”. When used as covariates, these scores adjust for minute ancestry effects in the data.

6.3 Corrections for Multiple Testing
A p-value, which is the probability of seeing a test statistic equal to or greater than the observed test statistic if the null hypothesis is true, is generated for each statistical test. In effect, a low p-value indicates that, if there is no association, the chance of seeing a result at least this extreme is small.

Statistical tests are generally called significant and the null hypothesis is rejected if the p-value falls below a predefined alpha value, which is nearly always set to 0.05. This means that when the null hypothesis is true, it is still rejected 5% of the time, and we detect a false positive. This probability is relative to a single statistical test; in the case of GWAS, hundreds of thousands to millions of tests are conducted, each one with its own false positive probability. The cumulative likelihood of finding one or more false positives over the entire GWAS analysis is therefore much higher. For a somewhat morbid analogy, consider the probability of having a car accident. If you drive your car today, the probability of having an accident is fairly low. However, if you drive every day for the next five years, the probability of you having one or more accidents over that time is much higher than the probability of having one today.

One of the simplest approaches to correct for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the alpha value from α = 0.05 to α = (0.05/k), where k is the number of statistical tests conducted. For a typical GWAS using 500,000 SNPs, statistical significance of a SNP association would be set at 1e-7 (0.05/500,000). This correction is the most conservative, as it assumes that each association test of the 500,000 is independent of all other tests, an assumption that is generally untrue due to linkage disequilibrium among GWAS markers.

An alternative to adjusting the false positive rate (alpha) is to determine the false discovery rate (FDR). The false discovery rate is an estimate of the proportion of significant results (usually at alpha = 0.05) that are false positives. Under the null hypothesis that there are no true associations in a GWAS dataset, p-values for association tests would follow a uniform distribution (evenly distributed from 0 to 1). Originally developed by Benjamini and Hochberg, FDR procedures essentially correct for this number of expected false discoveries, providing an estimate of the number of true results among those called significant [33]. These techniques have been widely applied to GWAS and extended in a variety of ways [34].

Permutation testing is another approach for establishing significance in GWAS. While somewhat computationally intensive, permutation testing is a straightforward way to generate the empirical distribution of test statistics for a given dataset when the null hypothesis is true. This is achieved by randomly reassigning the phenotypes of each individual to another individual in the dataset, effectively breaking the genotype-phenotype relationship of the dataset. Each random reassignment of the data represents one possible sampling of individuals under the null hypothesis, and this process is repeated a predefined number of times N to generate an empirical distribution with resolution N, so a permutation procedure with an N of 1000 gives an empirical p-value with a resolution of 1/1000 (0.001). Several software packages have been developed to perform permutation testing for GWAS studies, including the popular PLINK software [35], PRESTO [36], and PERMORY [37].

Another commonly used approach is to rely on the concept of genome-wide significance. Based on the distribution of LD in the genome for a specific population, there are an “effective” number of independent genomic regions, and thus an effective number of statistical tests that should be corrected for. For European-descent populations, this threshold has been estimated at 7.2e-8 [38]. This reasonable approach should be used with caution, however, as the only scenario where this correction is appropriate is when hypotheses are tested on the genome scale. Candidate gene studies or replication studies with a focused hypothesis do not require correction to this level, as the number of effective, independent statistical tests is much, much lower than what is assumed for genome-wide significance.

6.4 Multi-Locus Analysis
In addition to single-locus analyses, genome-wide association studies provide an enormous opportunity to examine interactions among genetic variants throughout the genome. Multi-locus analysis, however, is not nearly as straightforward as conducting single-locus tests, and presents numerous computational, statistical, and logistical challenges [39].

Because most GWAS genotype between 500,000 and one million SNPs, examining all pair-wise combinations of SNPs is a computationally intractable approach, even for highly efficient algorithms. One approach to this issue is to reduce or filter the set of genotyped SNPs, eliminating redundant information. A simple and common way to filter SNPs is to select a set of results from a single-SNP analysis based on an arbitrary significance threshold and exhaustively evaluate interactions in that subset. This can be perilous, however, as selecting SNPs to analyze based on main effects will prevent certain multi-locus models from being detected: so-called “purely epistatic” models with statistically undetectable marginal effects. With these models, a large component of the heritability is concentrated in the interaction rather than in the main effects. In other words, a specific combination of markers
(and only the combination of markers) incurs a significant change in disease risk. The benefit of filtering on main effects is that it allows an unbiased analysis for interactions within the selected set of SNPs. It is also far more computationally and statistically tractable than analyzing all possible combinations of markers.

Another strategy is to restrict examination of SNP combinations to those that fall within an established biological context, such as a biochemical pathway or a protein family. As these techniques rely on electronic repositories of structured biomedical knowledge, they generally couple a bioinformatics engine that generates SNP-SNP combinations with a statistical method that evaluates combinations in the GWAS dataset. For example, the Biofilter approach uses a variety of public data sources with logistic regression and multifactor dimensionality reduction methods [40,41]. Similarly, INTERSNP uses logistic regression, log-linear, and contingency table approaches to assess SNP-SNP interaction models [42].

7. Replication and Meta-Analysis
7.1 Statistical Replication
The gold standard for validation of any genetic study is replication in an additional independent sample. That said, there are a variety of criteria involved in defining “replication” of a GWAS result. This was the subject of an NHGRI working group, which outlined several criteria for establishing a positive replication [43]. These criteria are discussed in the following paragraphs.

Replication studies should have sufficient sample size to detect the effect of the susceptibility allele. Often, the effects identified in an initial GWAS suffer from winner’s curse, where the detected effect is likely stronger in the GWAS sample than in the general population [44]. This means that replication samples should ideally be larger to account for the over-estimation of effect size. With replication, it is important for the study to be well-powered to identify spuriously associated SNPs where the null hypothesis is most likely true; in other words, to confidently call the initial GWAS result a false positive.

Replication studies should be conducted in an independent dataset drawn from the same population as the GWAS, in an attempt to confirm the effect in the GWAS target population. Once an effect is confirmed in the target population, other populations may be sampled to determine if the SNP has an ethnic-specific effect. Replication of a significant result in an additional population is sometimes referred to as generalization, meaning the genetic effect is of general relevance to multiple human populations.

Identical phenotype criteria should be used in both GWAS and replication studies. Replication of a GWAS result should be thought of as the replication of a specific statistical model: a given SNP predicts a specific phenotype effect. Using even slightly different phenotype definitions between GWAS and replication studies can cloud the interpretation of the final result.

A similar effect should be seen in the replication set from the same SNP, or a SNP in high LD with the GWAS-identified SNP. Because GWAS typically use marker SNPs chosen based on LD patterns, it is difficult to say which SNP within the larger genomic region is mechanistically influencing disease risk. With this in mind, the unit of replication for a GWAS should be the genomic region, and all SNPs in high LD are potential replication candidates. However, continuity of effect should be demonstrated across both studies, with the magnitude and direction of effect being similar for the genomic region in both datasets. If SNPs in high LD are used to demonstrate the effect in replication, the direction of effect must be determined using a reference panel to determine two-SNP haplotype frequencies. For example, if allele A is associated in the GWAS with an odds ratio of 1.5, and allele T of a nearby SNP is associated in the replication set with an odds ratio of 1.46, it must be demonstrated that allele A and allele T carry effects in the same direction. The most straightforward way to assess this is to examine a reference panel, such as the HapMap data, for a relevant population. If this panel shows that allele A from SNP 1 and allele T from SNP 2 form a two-marker haplotype in 90% of the sample, then this is a reasonable assumption. If, however, the panel shows that allele A from SNP 1 and allele A from SNP 2 form the predominant two-marker haplotype, the effect has probably flipped in the replication set. Mapping the effect through the haplotype would be equivalent to observing an odds ratio of 1.5 in the GWAS and 0.685 (1/1.46) in the replication set.

In brief, the general strategy for a replication study is to repeat the ascertainment and design of the GWAS as closely as possible, but examine only specific genetic effects found significant in the GWAS. Effects that are consistent across the two studies can be labeled replicated effects.

7.2 Meta-Analysis of Multiple Analysis Results
The results of multiple GWAS studies can be pooled together to perform a meta-analysis. Meta-analysis techniques were originally developed to examine and refine significance and effect size estimates from multiple studies examining the same hypothesis in the published literature. With the development of large academic consortia, meta-analysis approaches allow the synthesis of results from multiple studies without requiring the transfer of protected genotype or clinical information to parties who were not part of the original study approval; only statistical results from a study need be transferred. For example, a recent publication examining lipid profiles was based on a meta-analysis of 46 studies [21]. A study of this magnitude would be logistically difficult (if not impossible) without meta-analysis. Several software packages are available to facilitate meta-analysis, including STATA products and METAL [45,46].

A fundamental principle in meta-analysis is that all studies included examined the same hypothesis. As such, the general design of each included study should be similar, and the study-level SNP analysis should follow near-identical procedures across all studies (see Zeggini and Ioannidis [47] for an excellent review). Quality control procedures that determine which SNPs are included from each site should be standardized, along with any covariate adjustments, and the measurement of clinical covariates and phenotypes should be consistent across multiple sites. The sample sets across all studies should be independent, an assumption that should always be examined as investigators often contribute the same samples to multiple studies. Also, an extremely important and somewhat bothersome logistical matter is ensuring that all studies report results relative to a common genomic build and reference allele. If one study reports its results relative to allele A and another relative to allele B, the meta-analysis result for this SNP may be non-significant because the effects of the two studies nullify each other.

With all of these factors to consider, it is rare to find multiple studies that match perfectly on all criteria. Therefore, study heterogeneity is often statistically quantified in a meta-analysis to determine the degree to which studies differ. The most popular measures of study heterogeneity are the Q statistic and the I² index [48], with the I² index favored in more recent studies. Coefficients resulting from a meta-analysis have variability (or error) associated with
them, and the I² index represents the approximate proportion of this variability that can be attributed to heterogeneity between studies [49]. I² values fall into low (<25), medium (>25 and <75), and high (>75) heterogeneity, and have been proposed as a way to identify studies that should perhaps be removed from a meta-analysis. It is important to note that these statistics should be used as a guide to identifying studies that perhaps examine a different underlying hypothesis than others in the meta-analysis, much like outlier analysis is used to identify unduly influential points. Just as with outliers, however, a study should only be excluded if there is an obvious reason to do so based on the parameters of the study, not simply because a statistic indicates that this study increases heterogeneity. Otherwise, agnostic statistical procedures designed to reduce meta-analysis heterogeneity will increase false discoveries.

7.3 Data Imputation
To conduct a meta-analysis properly, the effect of the same allele across multiple distinct studies must be assessed. This can prove difficult if different studies use different genotyping platforms (which use different SNP marker sets). As this is often the case, GWAS datasets can be imputed to generate results for a common set of SNPs across all studies. Genotype imputation exploits known LD patterns and haplotype frequencies from the HapMap or 1000 Genomes project to estimate genotypes for SNPs not directly genotyped in the study [50].

The concept is similar in principle to haplotype phasing algorithms, where the contiguous set of alleles lying on a specific chromosome is estimated. Genotype imputation methods extend this idea to human populations. First, a collection of shared haplotypes within the study sample is computed to estimate haplotype frequencies among the genotyped SNPs. Phased haplotypes from the study sample are compared to reference haplotypes from a panel with a much denser set of SNPs, such as the HapMap data. The matched reference haplotypes contain genotypes for surrounding markers that were not genotyped in the study sample. Because the study sample haplotypes may match multiple reference haplotypes, surrounding genotypes may be given a score or probability of a match based on the haplotype overlap. For example, rather than assign an imputed SNP a single allele A, the probability of possible alleles is reported (0.85 A, 0.12 C, 0.03 T) based on haplotype frequencies. This information can be used in the analysis of imputed data to take into account uncertainty in the genotype estimation process, typically using Bayesian analysis approaches [51]. Popular algorithms for genotype imputation include BimBam [52], IMPUTE [53], MaCH [54], and Beagle [55].

Much like conducting a meta-analysis, genotype imputation must be conducted with great care. The reference panel (e.g. the 1000 Genomes data or the HapMap project) must contain haplotypes drawn from the same population as the study sample in order to facilitate a proper haplotype match. If a study was conducted using individuals of Asian descent, but only European descent populations are represented in the reference panel, the genotype imputation quality will be poor as there is a lower probability of a haplotype match. Also, the reference allele for each SNP must be identical in both the study sample and the reference panel. Finally, the analysis of imputed genotypes should account for the uncertainty in genotype state generated by the imputation process.

8. The Future
Genome-wide association studies have had a huge impact on the field of human genetics. They have identified new genetic risk factors for many common human diseases and have forced the genetics community to think on a genome-wide scale. On the horizon is whole-genome sequencing. Within the next few years we will see the arrival of cheap sequencing technology that will replace one million SNPs with the entire genomic sequence of three billion nucleotides. Challenges associated with data storage and manipulation, quality control and data analysis will be many times more complex, thus challenging computer science and bioinformatics infrastructure and expertise. Merging sequencing data with that from other high-throughput technology for measuring the transcriptome, the proteome, the environment and phenotypes such as the massive amounts of data that come from neuroimaging will only serve to complicate our goal to understand the genotype-phenotype relationship for the purpose of improving healthcare. Integrating these many levels of complex biomedical data along with their coupling with experimental systems is the future of human genetics.

9. Exercises
1. True or False: Common diseases, such as type II diabetes and lung cancer, are likely caused by mutations to a single gene. Explain your answer.
2. Will the genotyping platforms designed for GWAS of European Descent populations be of equal utility in African Descent populations? Why or why not?
3. When conducting a genetic study, what additional factors should be measured and adjusted for in the statistical analysis?
4. True or False: SNPs that are associated to disease using GWAS design should be immediately considered for molecular studies. Explain your answer.

Answers to the Exercises can be found in Text S1.

Supporting Information
Text S1 Answers to Exercises (DOCX)

Acknowledgments
Thanks are extended to Ms. Davnah Urbach for her editorial assistance.
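
As a hedged illustration of the genotype encodings discussed in Section 6.1, the Python sketch below recodes genotypes (stored as the number of A alleles: 0 = aa, 1 = Aa, 2 = AA) under the dominant, recessive, and additive models, and then computes a Pearson chi-square statistic, one standard contingency-table test, on a 2 × 3 case/control table. The function names (`encode`, `collapse`, `chi_square`) and all genotype counts are invented for illustration; they are not taken from the chapter or from any GWAS package. The closed-form p-value exp(−χ²/2) is valid only for 2 degrees of freedom.

```python
import math

def encode(n_copies_of_A, model):
    """Recode a genotype (0 = aa, 1 = Aa, 2 = AA) under a genetic model for allele A."""
    if model == "dominant":    # Aa and AA grouped together versus aa
        return 1 if n_copies_of_A >= 1 else 0
    if model == "recessive":   # AA versus Aa and aa grouped together
        return 1 if n_copies_of_A == 2 else 0
    if model == "additive":    # one unit per copy of A
        return n_copies_of_A
    raise ValueError(f"unknown model: {model}")

def collapse(counts, model):
    """Merge genotype counts [aa, Aa, AA] into the classes that a model defines."""
    classes = {}
    for genotype, count in enumerate(counts):
        key = encode(genotype, model)
        classes[key] = classes.get(key, 0) + count
    return [classes[k] for k in sorted(classes)]

def chi_square(table):
    """Pearson chi-square statistic for a rows x columns table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts of aa, Aa, AA genotypes (illustration only).
cases = [30, 50, 20]
controls = [50, 40, 10]

# Genotypic test: three genotype classes, so 2 degrees of freedom.
stat = chi_square([cases, controls])
p_value = math.exp(-stat / 2)  # chi-square upper tail, valid for df = 2 only

# Dominant-model test: two classes per group, so 1 degree of freedom.
dominant_stat = chi_square([collapse(cases, "dominant"),
                            collapse(controls, "dominant")])
```

Collapsing genotypes into fewer classes changes the degrees of freedom of the test, which is one way the choice of encoding can affect statistical power.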

Further Reading

- 1000 Genomes Project Consortium, Altshuler D, Durbin RM, Abecasis GR, Bentley DR, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
- Haines JL, Pericak-Vance MA (2006) Genetic analysis of complex disease. New York: Wiley-Liss. 512 p.
- Hartl DL, Clark AG (2006) Principles of population genetics. Sunderland (Massachusetts): Sinauer Associates, Inc. 545 p.
- NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655–660.
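
To make the reference-allele problem from Sections 7.1 and 7.2 concrete, the following Python sketch inverts an odds ratio reported against the opposite allele and then computes Cochran's Q and the I² index with inverse-variance weights. The formulas used here, Q = Σ wᵢ(bᵢ − b̄)² and I² = max(0, (Q − df)/Q) × 100, are the standard textbook definitions rather than anything quoted from the chapter, and the function names and standard errors are hypothetical.

```python
import math

def flip_odds_ratio(odds_ratio):
    """Express an odds ratio relative to the other allele of a biallelic SNP."""
    return 1.0 / odds_ratio

def heterogeneity(log_odds_ratios, standard_errors):
    """Return (pooled log odds ratio, Cochran's Q, I2 in percent) for k studies."""
    weights = [1.0 / se ** 2 for se in standard_errors]  # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, log_odds_ratios)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_odds_ratios))
    df = len(log_odds_ratios) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Odds ratios from the text: 1.5 in the GWAS, 0.685 in the replication set
# when the effect is reported against the opposite allele.
gwas_or, replication_or = 1.5, 0.685
se = [0.1, 0.1]  # hypothetical standard errors of the log odds ratios

# Pooling without aligning alleles: the two effects largely cancel.
mismatched = heterogeneity([math.log(gwas_or), math.log(replication_or)], se)

# Pooling after aligning the replication result to the GWAS allele.
aligned = heterogeneity(
    [math.log(gwas_or), math.log(flip_odds_ratio(replication_or))], se
)
```

In this toy setting the mismatched pooled estimate sits near zero with very high I², while the aligned estimate stays positive with I² of zero, which mirrors the warning that unaligned reference alleles can nullify a real effect.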



Glossary

GWAS: genome-wide association study; a genetic study design that attempts to identify commonly occurring genetic variants that
contribute to disease risk
Personalized Medicine: the science of providing health care informed by individual characteristics, such as genetic variation
SNP: single nucleotide polymorphism; a single base-pair change in the DNA sequence
Linkage Analysis: the attempt to statistically relate transmission of an allele within families to inheritance of a disease
Common disease/Common variant hypothesis: The hypothesis that commonly occurring diseases in a population are caused in part
by genetic variation that is common to that population
Linkage disequilibrium: the degree to which an allele of one SNP is observed with an allele of another SNP within a population
Direct association: the statistical association of a functional or influential allele with a disease
Indirect association: the statistical association of an allele to disease that is in strong linkage disequilibrium with the allele that is
functional or influential for disease
Population stratification: the false association of an allele to disease due to both differences in population frequency of the allele and
differences in ethnic prevalence or sampling of affected individuals
False positive: from statistical hypothesis testing, the rejection of a null hypothesis when the null hypothesis is true
Genome-wide significance: a false-positive rate threshold established by empirical estimation of the independent genomic regions
present in a population
Replication: the observation of a statistical association in a second, independent dataset (often the same population as the first
association)
Generalization: the replication of a statistical association in a second population
Imputation: the estimation of unknown alleles based on the observation of nearby alleles in high linkage disequilibrium
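
The three multiple-testing strategies described in Section 6.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure implemented by PLINK, PRESTO, or PERMORY: the function names are hypothetical, the FDR routine is the classic Benjamini-Hochberg step-up rule, and the permutation statistic (absolute difference in mean A-allele count between cases and controls) was chosen only for simplicity.

```python
import random

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of tests called significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])

def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Empirical p-value for one SNP; phenotypes are 1 (case) or 0 (control)."""
    def statistic(phen):
        cases = [g for g, y in zip(genotypes, phen) if y == 1]
        controls = [g for g, y in zip(genotypes, phen) if y == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = statistic(phenotypes)
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the genotype-phenotype relationship
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / n_perm  # resolution is 1 / n_perm

# Bonferroni threshold for the 500,000-SNP example in the text.
threshold = bonferroni_alpha(0.05, 500_000)

# Hypothetical p-values (illustration only).
significant = benjamini_hochberg([0.01, 0.04, 0.03, 0.045], q=0.05)
```

Because the step-up rule keeps every test ranked at or below the largest passing rank, all four hypothetical p-values are called significant here even though two of them miss their own rank-specific cutoff. Shuffling the phenotype labels keeps the number of cases and controls fixed, so only the pairing with genotypes is randomized.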

References
1. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308: 419–421. doi: 10.1126/science.1110359
2. Edwards AO, Ritter R, III, Abel KJ, Manning A, Panhuysen C, et al. (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308: 421–424. doi: 10.1126/science.1110189
3. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–389. doi: 10.1126/science.1109557
4. Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112: 1022–1027. doi: 10.1182/blood-2008-01-134247
5. 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073. doi: 10.1038/nature09534
6. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008) ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107-D113. doi: 10.1093/nar/gkm967
7. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58. doi: 10.1038/nature09298
8. Kerem B, Rommens JM, Buchanan JA, Markiewicz D, et al. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073–1080.
9. MacDonald ME, Novelletto A, Lin C, Tagle D, Barnes G, et al. (1992) The Huntington’s disease candidate region exhibits many different haplotypes. Nat Genet 1: 99–103. doi: 10.1038/ng0592-99
10. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108. doi: 10.1038/nrg1521
11. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer’s disease in late onset families. Science 261: 921–923.
12. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, et al. (2000) The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26: 76–80. doi: 10.1038/79216
13. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502–510.
14. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367. doi: 10.1073/pnas.0903103106
15. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299–1320. doi: 10.1038/nature04226
16. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, et al. (2010) Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86: 560–572. doi: 10.1016/j.ajhg.2010.03.003
17. Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311–322. doi: 10.1006/geno.1995.9003
18. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67: 947–959. doi: 10.1086/303069
19. Li M, Li C, Guan W (2008) Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet 16: 635–643. doi: 10.1038/sj.ejhg.5202007
20. Distefano JK, Taverna DM (2011) Technological issues and experimental design of gene association studies. Methods Mol Biol 700: 3–16. doi: 10.1007/978-1-61737-954-3_1
21. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707–713. doi: 10.1038/nature09270
22. Habek M, Brinar VV, Borovecki F (2010) Genes associated with multiple sclerosis: 15 and counting. Expert Rev Mol Diagn 10: 857–861. doi: 10.1586/erm.10.77
23. Polman CH, Reingold SC, Edan G, Filippi M, Hartung HP, et al. (2005) Diagnostic criteria for multiple sclerosis: 2005 revisions to the “McDonald Criteria”. Ann Neurol 58: 840–846. doi: 10.1002/ana.20703
24. Chew EY, Kim J, Sperduto RD, Datiles MB, III, Coleman HR, et al. (2010) Evaluation of the age-related eye disease study clinical lens grading system AREDS report No. 31. Ophthalmology 117: 2112–2119. doi: 10.1016/j.ophtha.2010.02.033
25. Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, et al. (2010) Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation 122: 2016–2021. doi: 10.1161/CIRCULATIONAHA.110.948828
26. Wilke RA, Berg RL, Linneman JG, Peissig P, Starren J, et al. (2010) Quantification of the clinical modifiers impacting high-density lipoprotein cholesterol in the community: Personalized Medicine Research Project. Prev Cardiol 13: 63–68. doi: 10.1111/j.1751-7141.2009.00055.x
27. Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, et al. (2010) Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc 17: 568–574. doi: 10.1136/jamia.2010.004366
28. McCarty CA, Wilke RA (2010) Biobanking and pharmacogenomics. Pharmacogenomics 11: 637–641. doi: 10.2217/pgs.10.13



29. Lewis CM (2002) Genetic association studies: design, analysis and interpretation. Brief Bioinform 3: 146–153.
30. Lettre G, Lange C, Hirschhorn JN (2007) Genetic model testing and statistical power in population-based association studies of quantitative traits. Genet Epidemiol 31: 358–362. doi: 10.1002/gepi.20217
31. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
32. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909. doi: 10.1038/ng1847
33. Hochberg Y, Benjamini Y (1990) More powerful procedures for multiple significance testing. Stat Med 9: 811–818.
34. van den Oord EJ (2008) Controlling false discoveries in genetic studies. Am J Med Genet B Neuropsychiatr Genet 147B: 637–644. doi: 10.1002/ajmg.b.30650
35. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi: 10.1086/519795
36. Browning BL (2008) PRESTO: rapid calculation of order statistic distributions and multiple-testing adjusted P-values via permutation for one and two-stage genetic association studies. BMC Bioinformatics 9: 309. doi: 10.1186/1471-2105-9-309
37. Pahl R, Schafer H (2010) PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics 26: 2093–2100. doi: 10.1093/bioinformatics/btq399
38. Dudbridge F, Gusnanto A (2008) Estimation of significance thresholds for genomewide association scans. Genet Epidemiol 32: 227–234. doi: 10.1002/gepi.20297
39. Moore JH, Ritchie MD (2004) STUDENTJAMA. The challenges of whole-genome approaches to common diseases. JAMA 291: 1642–1643. doi: 10.1001/jama.291.13.1642
40. Grady BJ, Torstenson ES, McLaren PJ, de Bakker PI, Haas DW, et al. (2011) Use of biological knowledge to inform the analysis of gene-gene interactions involved in modulating virologic failure with efavirenz-containing treatment regimens in art-naive actg clinical trials participants. Pac Symp Biocomput 253–264.
41. Bush WS, Dudek SM, Ritchie MD (2009) Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput 368–379.
42. Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T (2009) INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics 25: 3275–3281. doi: 10.1093/bioinformatics/btp596
43. Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655–660. doi: 10.1038/447655a
44. Zollner S, Pritchard JK (2007) Overcoming the winner’s curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80: 605–615. doi: 10.1086/512821
45. Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, et al. (2008) Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet 40: 198–203. doi: 10.1038/ng.74
46. Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, et al. (2008) Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet 40: 161–169. doi: 10.1038/ng.76
47. Zeggini E, Ioannidis JP (2009) Meta-analysis in genome-wide association studies. Pharmacogenomics 10: 191–201. doi: 10.2217/14622416.10.2.191
48. Huedo-Medina TB, Sanchez-Meca J, Marin-Martinez F, Botella J (2006) Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 11: 193–206. doi: 10.1037/1082-989X.11.2.193
49. Higgins JP (2008) Commentary: Heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol 37: 1158–1160. doi: 10.1093/ije/dyn204
50. Li Y, Willer C, Sanna S, Abecasis G (2009) Genotype imputation. Annu Rev Genomics Hum Genet 10: 387–406. doi: 10.1146/annurev.genom.9.081307.164242
51. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913. doi: 10.1038/ng2088
52. Guan Y, Stephens M (2008) Practical issues in imputation-based association mapping. PLoS Genet 4: e1000279. doi: 10.1371/journal.pgen.1000279
53. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. doi: 10.1371/journal.pgen.1000529
54. Biernacka JM, Tang R, Li J, McDonnell SK, Rabe KG, et al. (2009) Assessment of genotype imputation methods. BMC Proc 3 Suppl 7: S5.
55. Browning BL, Browning SR (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84: 210–223. doi: 10.1016/j.ajhg.2009.01.005
