
Functional and cross-trait genetic architecture of common diseases and complex traits


by
Hilary Kiyo Finucane
Submitted to the Department of Mathematics
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017
© Massachusetts Institute of Technology 2017. All rights reserved.

Author ....
Signature redacted
Department of Mathematics
April 28, 2017

Certified by..
Signature redacted
Alkes Price
Associate Professor of Statistical Genetics, Harvard T.H. Chan School of Public
Health
Thesis Supervisor

Accepted by.
Signature redacted
Jonathan Kelner
Chairman, Applied Mathematics Committee
Functional and cross-trait genetic architecture of common diseases and
complex traits
by
Hilary Kiyo Finucane

Submitted to the Department of Mathematics


on April 28, 2017, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy

Abstract
In this thesis, I introduce new methods for learning about diseases and traits from genetic data.
First, I introduce a method for partitioning heritability by functional annotation from genome-wide
association summary statistics, and I apply it to 17 diseases and traits and many different functional
annotations. Next, I show how to apply this method to use gene expression data to identify disease-
relevant tissues and cell types. I next introduce a method for estimating genetic correlation from
genome-wide association summary statistics and apply it to estimate genetic correlations between
all pairs of 24 diseases and traits. Finally, I consider a model of disease subtypes and I show how to
determine a lower bound on the sample size required to distinguish between two disease subtypes
as a function of several parameters.

Thesis Supervisor: Alkes Price


Title: Associate Professor of Statistical Genetics

Acknowledgments
First, I would like to thank my advisor Alkes Price, who has been a fantastic mentor and teacher
during my PhD. I would also like to thank Brendan Bulik-Sullivan, my main collaborator on much
of the work in this thesis, as well as Ben Neale and Jennifer Listgarten, who were valuable mentors
and collaborators. I am grateful to my family, who have been an incredible source of energy and
positivity during this period. And I would like to thank Yakir Reshef in particular for pointing me
in this direction and then providing unending support and encouragement.
I am also grateful to the Hertz Foundation and to the NIH for supporting my PhD.

Contents

1 Introduction

2 Overview of technical contributions

3 Partitioning heritability by functional category using genome-wide association summary statistics

4 Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types

5 An atlas of genetic correlations across human diseases and traits

6 A statistical framework for gauging when disease subtypes can be detected from principal components analysis of genotype data

A Supplementary information for Chapter 3

B Supplementary information for Chapter 4

C Supplementary information for Chapter 5

D Bibliography

Chapter 1

Introduction

Many common diseases are heritable [1]. Studying the genetic basis of these diseases (for example,
identifying particular genetic variants or genes that are associated with increased risk of disease)
is a way to advance the understanding of disease biology. One common approach to studying the
genetics of disease is the genome-wide association study (GWAS), a study design in which a large
number of cases and controls for a disease are genotyped, and each single-nucleotide polymorphism
(SNP) measured is then tested marginally for association with case-control status [2]. SNPs that
are significantly associated with the phenotype after correcting for multiple testing are
identified for potential follow-up.

Many diseases and traits are highly polygenic; i.e., there are a large number of SNPs that each
have a small effect on the phenotype [3,4]. Because of this, for most common diseases and complex
traits, there are a large number of SNPs that are truly associated with the phenotype but are not
identified by GWAS at current sample sizes. In this thesis, instead of trying to identify SNPs
that are associated with a phenotype, we ask questions about the overall pattern of association:
for example, do SNPs that are causal for one trait tend to be causal for another trait? Are SNPs in
regions of the genome that are "active" in a given cell type (e.g., neurons, liver cells, etc.) more
likely to be causal than SNPs outside of these active regions? To approach these questions, we
model the effects of SNPs on phenotype as random and then infer parameters of the distribution,
without identifying any individual SNPs as causal.

The first half of this thesis focuses on combining GWAS data with other types of data such
as functional genomics data and gene expression data. Functional genomics data like that of the
ENCODE project [5] and Roadmap Epigenomics Consortium [6] tell us, for example, about regions of
the genome that have particular signs of activity such as histone marks or DNase I hypersensitivity
in particular tissues or cell types. Gene expression data can tell us about the expression levels
of different genes in different tissues and cell types. To integrate these types of data with GWAS
data, a fundamental question we would like to ask is: given a genome annotation, can we quantify,
in aggregate, how much SNPs in this annotation contribute to a given phenotype using GWAS data?
Here, we define a genome annotation to be a set of SNPs, and we focus on large sets of SNPs
comprising at least 1% of the SNPs we model.

In Chapter 3 (previously published as Finucane*, Bulik-Sullivan* et al. 2015 Nat Genet [7]), we
introduce a method, stratified LD score regression (S-LDSC), to answer this question. S-LDSC
differs from previous methods because instead of requiring full genotype and phenotype data, which
are available for few traits, it requires only summary statistics; i.e., the z-scores, one for each
SNP, that are computed during the GWAS. This allows us to apply S-LDSC to a much larger set of
traits at much larger sample sizes than was previously possible. In Chapter 3, we evaluate S-LDSC in
simulations, and then we apply it to summary statistics from 17 traits, with a range of genome
annotations from ENCODE [5], Roadmap [6], and other sources. We find, for example, that conserved
regions contribute disproportionately to many different phenotypes, and that a particular type of
enhancer contributes disproportionately to immunological diseases but not to other phenotypes.
We also find enrichments of cell-type-specific annotations that are consistent with known biology
for many different diseases and traits, including immune enrichment for immunological diseases,
brain enrichment for psychiatric disease, and liver enrichment for lipid traits. Finally, we find new
cell-type-specific enrichments, including an enrichment in the central nervous system for body-mass
index, smoking behavior, and educational attainment. One of the implications of our results is that
S-LDSC applied to cell-type-specific genome annotations can be an effective way to identify the cell
types and tissues that are relevant to disease.

In Chapter 4 (currently posted as Finucane et al. 2017 bioRxiv [8] and under revision at Nat
Genet), we focus specifically on the problem of identifying disease-relevant cell types and tissues,
and we introduce a new type of cell-type-specific genome annotation to use towards this end. In
particular, we introduce a method for converting gene expression data into genome annotations
that are informative about differences among cell types and tissues. Gene expression data are widely
available, and so the resulting genome annotations allow us to differentiate among, for example,
different brain cell types, something that is not possible with the annotations from ENCODE
and Roadmap. We apply S-LDSC to these annotations and again find many enrichments that
confirm known biology as well as several enrichments not seen before from genetic data, including
an enrichment of the cortex for schizophrenia and the striatum for migraine, and an enrichment in
inhibitory neurons over excitatory neurons for bipolar disorder. We also compare gene expression
to the previously used chromatin data as a source of cell-type-specific annotations.

In the second half of this thesis, we turn our attention to the relationships between traits.
In Chapter 5 (previously published as Bulik-Sullivan*, Finucane* et al. 2015 Nat Genet [9]), we
focus on estimating genetic correlation, a quantification of the extent to which two traits have a
shared genetic basis. Genetic correlation has been estimated from GWAS data before, but previous
methods [4] again required full genotype and phenotype data. Moreover, those methods required that
the GWAS for the two traits have disjoint sets of individuals, something that is difficult to obtain,
for example because studies often have shared controls. We introduce a new method, cross-trait
LD score regression (CT-LDSC), to estimate genetic correlation that again requires only GWAS
summary statistics, and that allows for overlapping samples in the GWAS for the two traits. We
validate CT-LDSC in simulations, show that on real data it gives results similar to those obtained
using the standard method that requires full genotype and phenotype data on disjoint samples
instead of just summary statistics [4], and apply it to estimate genetic correlations between all
pairs in a set of 24 phenotypes. We confirm several known results and find several novel results,
including a genetic correlation between anorexia and schizophrenia.

In Chapter 6, we move away from considering two separate diseases that are known to have shared
genetic architecture, and instead consider a single disease that is suspected to have subtypes.
Identifying latent disease subtypes is important both for elucidating the causes of disease and for
effective clinical treatment. We consider the problem of identifying these subtypes when the data
we observe are genotypes for a mixture of cases of both subtypes, together with controls. A natural
first approach to using these genotype data to identify disease subtypes is to perform Principal
Components Analysis (PCA) on the genotype matrix of cases, in the hope that the top eigenvector
will correspond to the assignment of individuals to disease subtypes. The main result of Chapter 6
is a lower bound for the sample size that will be needed for PCA to reflect the presence of disease
subtypes as a function of how heritable the two traits are and other parameters of our model. We
use our result to give a lower bound on the sample size needed to distinguish schizophrenia and
bipolar disorder, if they were considered subtypes of a single disease. We determine that 180,000
combined cases would be needed, far from what is currently available.
We present an overview of the technical contributions of the thesis in Chapter 2.

Chapter 2

Overview of technical contributions

Here, we give an overview of the technical contributions in the subsequent chapters.

Chapter 3
The model. GWAS data consist of genotypes and phenotypes for a set of randomly chosen
individuals. We will describe a generative model for GWAS data for a quantitative phenotype such
as height; for binary phenotypes, see Chapter 6 and Appendix C. The model we describe here is a
slightly modified version of a model introduced in previous work [3,4,10,11].
In our generative model for a GWAS dataset, we first sample $N$ i.i.d. vectors
$x_1, \ldots, x_N \in \mathbb{R}^M$, the genotype vectors of $N$ randomly chosen individuals each
genotyped at $M$ SNPs. Here, the $j$-th entry of the vector $x_i$ denotes the genotype of
individual $i$ at SNP $j$. Genotypes are typically $\{0, 1, 2\}$-valued, but here we will assume
that each SNP has been normalized to mean zero and variance one in the population. We will let $R$
denote the covariance matrix of $x_i$; i.e., the linkage disequilibrium (LD) matrix.
Given $\beta \in \mathbb{R}^M$, a vector that tells us the effect of each SNP on the phenotype, we
will generate a phenotype for every individual via

$$y_i = x_i^T \beta + \epsilon_i, \tag{2.1}$$

where $\epsilon_1, \ldots, \epsilon_N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_e^2)$. We
also assume, without loss of generality, that $y_i$ has mean zero and variance one in the population.
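
As a rough illustration of the generative model in Equation (2.1), the following sketch simulates a small dataset. It makes assumptions that are not part of the model as stated: SNPs are drawn as independent standard normals (so $R$ is the identity), the effects are Gaussian with a common variance $h^2/M$, and the function name `simulate_gwas` and all sizes are hypothetical.

```python
import math
import random


def simulate_gwas(n_individuals, n_snps, h2, seed=0):
    """Simulate genotypes, effects and phenotypes following Equation (2.1).

    Assumptions made for this sketch only: SNPs are independent standard
    normals (no LD) and every effect is Gaussian with variance h2 / M.
    """
    rng = random.Random(seed)
    # Each beta_j has mean 0 and variance h2 / M, so genetic variance is about h2.
    beta = [rng.gauss(0.0, math.sqrt(h2 / n_snps)) for _ in range(n_snps)]
    # Environmental noise has variance 1 - h2, so Var(y) is about 1.
    sigma_e = math.sqrt(1.0 - h2)
    genotypes, phenotypes = [], []
    for _ in range(n_individuals):
        x_i = [rng.gauss(0.0, 1.0) for _ in range(n_snps)]
        y_i = sum(x * b for x, b in zip(x_i, beta)) + rng.gauss(0.0, sigma_e)
        genotypes.append(x_i)
        phenotypes.append(y_i)
    return genotypes, beta, phenotypes


X, beta, y = simulate_gwas(n_individuals=500, n_snps=20, h2=0.5, seed=1)
print(len(X), len(X[0]), len(y))
```

Real genotypes are correlated and discrete, so this is only a way to see the roles of $x_i$, $\beta$, and $\epsilon_i$, not a realistic simulator.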
Because $M$ is large, we will model the entries of $\beta$ as random. We won't assume a particular
distribution for the $\beta_j$, but we will suppose that they are independent and have mean 0. We
will then incorporate functional genomics data into our model by allowing the variance of $\beta_j$
to depend on the genome annotations that contain SNP $j$. More precisely, let
$C_1, \ldots, C_P \subseteq \{1, \ldots, M\}$ be a set of (potentially overlapping) genome
annotations whose union is the set of all SNPs. We will model

$$\mathrm{Var}(\beta_j) = \sum_{p : j \in C_p} \tau_p \tag{2.2}$$

for constants $\tau_1, \ldots, \tau_P$. Thus the parameter $\tau_p$ represents the contribution of
category $C_p$ to the variance of the effects of the SNPs in $C_p$; it allows us to ask, for
example, whether membership in the category $C_p$ increases or decreases the variance of the
effects of the SNPs in the category.

This model can be interpreted in terms of heritability. In particular, if we define SNP-heritability
$h_g^2$ as the proportion of variance explained by the SNPs included in the model after
marginalizing out $\beta$, then it is simple to show that

$$h_g^2 = \mathrm{Var}\Big(\sum_j x_{ij} \beta_j\Big) / \mathrm{Var}(y_i) = \sum_j \mathrm{Var}(\beta_j). \tag{2.3}$$

This naturally gives a definition of the heritability of a particular category $C_p$:

$$h_g^2(C_p) = \sum_{j \in C_p} \mathrm{Var}(\beta_j). \tag{2.4}$$
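
To make Equations (2.2) and (2.4) concrete, here is a minimal sketch with two overlapping toy annotations. The annotations, the values of the category coefficients, and the function names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def per_snp_variance(n_snps, categories, tau):
    """Var(beta_j) as the sum of tau_p over categories containing SNP j (Eq. 2.2)."""
    variance = [0.0] * n_snps
    for members, tau_p in zip(categories, tau):
        for j in members:
            variance[j] += tau_p
    return variance


def category_heritability(variance, members):
    """Heritability of a category: sum of Var(beta_j) over its SNPs (Eq. 2.4)."""
    return sum(variance[j] for j in members)


# Hypothetical toy annotations on M = 6 SNPs; the first category holds all SNPs.
categories = [{0, 1, 2, 3, 4, 5}, {0, 1}]
tau = [0.05, 0.1]
variance = per_snp_variance(6, categories, tau)
h2_total = category_heritability(variance, range(6))
h2_c2 = category_heritability(variance, categories[1])
print(variance, h2_total, h2_c2)
```

In this toy setting the second category contains one third of the SNPs but accounts for roughly 0.3 of the total 0.5 in heritability, which is the kind of disproportion the model is designed to quantify.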

There are many questions that can be approached by fitting a model like this one. For example,
Gusev et al. fit this model where the $C_p$ formed a disjoint partition of the genome into six
basic categories (coding, promoter, UTR, intron, DHS, and other), and estimated that 79% of
heritability, on average across 11 diseases, was explained by the 16% of SNPs that fell into DHS
regions [11].

Typically, this model is fit via maximum likelihood (ML), or, in the case that there are additional
covariates such as age and sex, via restricted maximum likelihood (REML), as implemented in a set
of tools called GCTA [4,12]. These methods assume that $Y = (y_1, \ldots, y_N)^T$ and
$X = (x_1, \ldots, x_N)^T$ are observed. Let $X^{(p)}$ denote the matrix $X$ restricted to columns
corresponding to SNPs in category $C_p$. Under the added assumption that the distribution of
$\beta_j$ is normal, we can marginalize out the $\beta_j$ to obtain

$$Y \mid X \sim \mathcal{N}\Big(0, \; \sum_p \tau_p X^{(p)} X^{(p)T} + \sigma_e^2 I\Big),$$

and then maximize this likelihood as a function of $\tau_1, \ldots, \tau_P$ and $\sigma_e^2$.


However, for many diseases and traits we do not have access to $Y$ and $X$, for example because of
patient privacy. Instead, the researchers who collected $Y$ and $X$ test each SNP $j$
for non-zero marginal correlation of the $j$-th column of $X$, denoted $X_j$, with $Y$, and they
release the Z-scores for this test. Therefore, instead of having access to $X$ and $Y$, we often
have access only to these summary statistics, which are approximately equal to
$Z_j = X_j^T Y / \sqrt{N}$.
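
The summary statistics described above can be computed in a few lines. The sketch below uses a tiny hand-made dataset (hypothetical values, with columns already standardized) and represents the genotype matrix as a list of rows.

```python
import math


def z_scores(X, Y):
    """Z_j = X_j^T Y / sqrt(N) for each column j of the N x M genotype matrix X."""
    n = len(Y)
    n_snps = len(X[0])
    return [sum(X[i][j] * Y[i] for i in range(n)) / math.sqrt(n) for j in range(n_snps)]


# Toy data: 4 individuals, 2 SNPs; each column has mean 0 and variance 1.
X = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
Y = [1.0, -1.0, 1.0, -1.0]
Z = z_scores(X, Y)
chi2 = [z * z for z in Z]
print(Z, chi2)  # [2.0, 0.0] [4.0, 0.0]
```

The first SNP is perfectly aligned with the phenotype and the second is orthogonal to it, which is why the two statistics are 2 and 0.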

The method. In Chapter 3, we describe a way to fit this model using summary statistics. The
method, based on the previously described LD score regression [13], is called stratified LD score
regression, or S-LDSC. S-LDSC takes as input the squares of the Z-scores described above; these
squares are also known as chi-square statistics. Let $\chi_j^2$ denote the chi-square statistic for
the $j$-th SNP, equal to $(X_j^T Y)^2 / N$. The main equation on which S-LDSC is based is

$$E[\chi_j^2] \approx 1 + N \sum_{p=1}^{P} \tau_p \ell(j, C_p), \tag{2.5}$$

where

$$\ell(j, C) = \sum_{k \in C} R_{jk}^2.$$

(Recall that $R$ is the covariance matrix of $x_i$ and has ones on the diagonal, and so $R_{jk}$ is
the correlation between SNPs $j$ and $k$ in the underlying population.)
Because $\ell(j, C)$ does not depend on any phenotype, it can be estimated from an external reference
panel such as 1000 Genomes [14], as long as the correlation structure in the reference panel matches
the correlation structure of the sample. Thus, we can estimate $\tau_p$ by regressing $\chi_j^2$ on
our estimates of $\ell(j, C_p)$. To evaluate significance, we break the genome into 200 blocks and
jackknife the blocks.

Further details of the method are described in Chapter 3.
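
The regression step can be sketched as follows. This is a simplified illustration rather than the released implementation: it uses a hypothetical 4-SNP LD matrix, generates noise-free chi-square statistics directly from Equation (2.5), fixes the intercept at 1, and uses unweighted least squares. All function names are invented for the example.

```python
def ld_scores(R, category):
    """l(j, C): sum over k in C of R[j][k]^2, for every SNP j."""
    return [sum(R[j][k] ** 2 for k in category) for j in range(len(R))]


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def estimate_tau(chi2, scores, n_samples):
    """Least-squares fit of (chi2_j - 1) / N on the LD scores (intercept fixed at 1)."""
    n_cat = len(scores)
    target = [(c - 1.0) / n_samples for c in chi2]
    gram = [[sum(a * b for a, b in zip(scores[p], scores[q])) for q in range(n_cat)]
            for p in range(n_cat)]
    rhs = [sum(a * t for a, t in zip(scores[p], target)) for p in range(n_cat)]
    return solve_linear_system(gram, rhs)


# Hypothetical LD matrix with two LD blocks, and two overlapping categories.
R = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.2],
     [0.0, 0.0, 0.2, 1.0]]
categories = [[0, 1, 2, 3], [0, 1]]
scores = [ld_scores(R, c) for c in categories]

# Noise-free chi-square statistics generated directly from Equation (2.5).
true_tau, n_samples = [1e-4, 3e-4], 1000
chi2 = [1.0 + n_samples * sum(t * s[j] for t, s in zip(true_tau, scores)) for j in range(4)]
tau_hat = estimate_tau(chi2, scores, n_samples)
print(tau_hat)
```

Because the inputs are exact, the fit recovers the coefficients up to floating-point error; with real summary statistics the estimates are noisy, which is why the block jackknife is needed.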

Derivation sketch. I will give here a sketch of the derivation of Equation (2.5). Starting from
the definition of $Z_j$ and plugging in Equation (2.1) for $Y$, it can be shown that

$$Z_j = \sqrt{N} \sum_k \hat{R}_{jk} \beta_k + \epsilon_j', \tag{2.6}$$

where $\hat{R}_{jk} = \frac{1}{N} X_j^T X_k$, $\epsilon = (\epsilon_1, \ldots, \epsilon_N)^T$, and
$\epsilon_j' = \frac{1}{\sqrt{N}} X_j^T \epsilon$. Squaring and marginalizing out $\beta$, $X$,
and $\epsilon$, we get

$$E[\chi_j^2] = N \sum_k E[\hat{R}_{jk}^2] \, \mathrm{Var}(\beta_k) + \sigma_e^2. \tag{2.7}$$

Most pairs of SNPs $j, k$ have $R_{jk} = 0$, and in these cases, $E[\hat{R}_{jk}^2]$ is equal to
$R_{jk}^2 + 1/N$. On the other hand, when $R_{jk} = 1$, then $E[\hat{R}_{jk}^2]$ is equal to
$R_{jk}^2$. For other values of $R_{jk}$, $E[\hat{R}_{jk}^2]$ falls between $R_{jk}^2$ and
$R_{jk}^2 + 1/N$. Recalling that $1 = \mathrm{Var}(y_i) = \sum_k \mathrm{Var}(\beta_k) + \sigma_e^2$,
we re-write Equation (2.7) as:

$$E[\chi_j^2] = 1 + N \sum_k R_{jk}^2 \, \mathrm{Var}(\beta_k) + N \sum_k \big(E[\hat{R}_{jk}^2] - R_{jk}^2 - 1/N\big) \mathrm{Var}(\beta_k). \tag{2.8}$$

We seek to show that the final term in the above equation is negligible. We know that
$|E[\hat{R}_{jk}^2] - R_{jk}^2 - 1/N| \le 1/N$ for all $j, k$, and that it is equal to zero for
most SNPs. Let $W$ denote the number of SNPs $k$ for which $R_{jk} \neq 0$; i.e., the number of
SNPs $k$ for which $E[\hat{R}_{jk}^2] - R_{jk}^2 - 1/N \neq 0$. Because
$\mathrm{Var}(\beta_k) < 1/M$, we have

$$\Big| N \sum_k \big(E[\hat{R}_{jk}^2] - R_{jk}^2 - 1/N\big) \mathrm{Var}(\beta_k) \Big| < W/M.$$

Because $W/M$ is very small, we can neglect the third term on the right hand side of Equation (2.8).
(Note: when there is population structure, $\hat{R}_{jk}$ is inflated and this term becomes
non-negligible and goes into the intercept of the regression.) Finally, using Equation (2.2), it is
straightforward to derive Equation (2.5).
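
The sampling inflation used in this sketch, namely that the expected squared sample correlation for two uncorrelated SNPs is about $1/N$, can be checked numerically. The simulation below is a simplification: genotypes are standard normal and are not re-standardized within each sample, and the function name and sizes are hypothetical.

```python
import random


def mean_squared_sample_correlation(n_samples, n_reps, seed=0):
    """Average squared sample correlation over replicates for two independent SNPs.

    Simplification: genotypes are standard normal and are not re-standardized
    within the sample, so each replicate computes (1/N) * sum_i x_i * z_i.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        r_hat = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0)
                    for _ in range(n_samples)) / n_samples
        total += r_hat ** 2
    return total / n_reps


n = 50
estimate = mean_squared_sample_correlation(n_samples=n, n_reps=2000, seed=1)
print(estimate, 1.0 / n)  # the two values should be close
```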

Chapter 4

In Chapter 4, we apply S-LDSC to gene expression data to identify disease-relevant tissues and cell
types. To do this, we introduce a simple method for converting a matrix of gene expression data
into a genome annotation for each tissue. The matrices of gene expression data that we consider
have samples for several different tissues, and for each sample and each gene, a measure of the
expression of that gene in that sample. We use this matrix to compute, for each gene and each
tissue, a t-statistic reflecting whether the expression of that gene is higher in that tissue than in
other reference tissues. We then take the 10% of genes with the highest t-statistics, and add a
100kb window around the transcribed region to obtain a genome annotation. Alternate approaches
involving, for example, a continuous score for each gene instead of a threshold, did not outperform
this approach.
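
The conversion from an expression matrix to an annotation can be sketched as below. Several details are assumptions made for illustration: a Welch two-sample t-statistic stands in for whatever t-statistic is actually used in Chapter 4, the 100kb window is applied on each side of the transcribed region, and the data layout, tissue names, and function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(focal, reference):
    """Two-sample t-statistic for 'expression is higher in the focal tissue'."""
    standard_error = math.sqrt(variance(focal) / len(focal)
                               + variance(reference) / len(reference))
    return (mean(focal) - mean(reference)) / standard_error


def top_genes(expression, focal_tissue, fraction=0.10):
    """Return the top `fraction` of genes ranked by t-statistic for one tissue.

    `expression` maps gene -> tissue -> list of per-sample expression values.
    """
    t_stats = {}
    for gene, by_tissue in expression.items():
        focal = by_tissue[focal_tissue]
        reference = [v for tissue, values in by_tissue.items()
                     if tissue != focal_tissue for v in values]
        t_stats[gene] = welch_t(focal, reference)
    n_keep = max(1, round(fraction * len(t_stats)))
    return sorted(t_stats, key=t_stats.get, reverse=True)[:n_keep]


def add_window(start, end, window=100_000):
    """Extend a transcribed region by `window` base pairs on each side."""
    return max(0, start - window), end + window


# Ten toy genes with similar expression everywhere, except gene3 in liver.
expression = {
    f"gene{i}": {"liver": [1.0, 1.2, 0.8], "brain": [1.0, 1.1, 0.9], "blood": [1.1, 0.9, 1.0]}
    for i in range(10)
}
expression["gene3"]["liver"] = [5.0, 5.2, 4.8]
print(top_genes(expression, "liver"))   # ['gene3']
print(add_window(150_000, 160_000))     # (50000, 260000)
```

Changing the set of reference tissues passed in through `expression` changes the t-statistics, and therefore the resulting annotation.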
The approach can result in different annotations for the same tissue if the t-statistics are
computed with respect to different sets of reference tissues. For example, we use gene expression
data from many different tissues, including 13 brain regions, to obtain genome annotations, and the
13 resulting brain annotations are highly correlated with each other and show similar enrichments
(e.g., all are enriched for schizophrenia). However, we then use gene expression data only from
these 13 brain regions to compute t-statistics, for each tissue using the remaining 12 brain tissues
as the reference tissues. This results in annotations with much less correlation, and the enrichment
patterns differ more among the brain regions (e.g., only the annotations corresponding to cortex
are enriched for schizophrenia).

Chapter 5

The model and method in Chapter 5 are analogous to the model and method described above for
Chapter 3.

The model. In the generative model for GWAS data for two traits $A$ and $B$, we draw
$x_1^{(A)}, \ldots, x_{N_A}^{(A)}$ i.i.d. as above, to be the genotypes for the GWAS of trait $A$.
As before, let $X^{(A)} = (x_1^{(A)}, \ldots, x_{N_A}^{(A)})^T$.
We allow for sample overlap in the two GWAS, letting $N_s$ be the number of overlapping samples.
Let $x_i^{(B)} = x_i^{(A)}$ for $i = 1, \ldots, N_s$, and let
$x_{N_s+1}^{(B)}, \ldots, x_{N_B}^{(B)}$ be additional i.i.d. genotype vectors,
independent of $X^{(A)}$. Let $X^{(B)} = (x_1^{(B)}, \ldots, x_{N_B}^{(B)})^T$.

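We will model phenotype as depending linearly on genotype, as above. Let

$$y_i^{(q)} = (x_i^{(q)})^T \beta^{(q)} + \epsilon_i^{(q)} \tag{2.9}$$

for $q \in \{A, B\}$, with

$$(\beta_j^{(A)}, \beta_j^{(B)}) \overset{\text{i.i.d.}}{\sim} \left[ (0, 0), \; \frac{1}{M} \begin{pmatrix} h_{g,A}^2 & \rho_g \\ \rho_g & h_{g,B}^2 \end{pmatrix} \right] \tag{2.10}$$

and

$$(\epsilon_i^{(A)}, \epsilon_i^{(B)}) \mid X \overset{\text{i.i.d.}}{\sim} \left[ (0, 0), \; \begin{pmatrix} \sigma_{e,A}^2 & \rho_e \\ \rho_e & \sigma_{e,B}^2 \end{pmatrix} \right], \tag{2.11}$$

where the notation $[\mu, V]$ denotes an arbitrary distribution with mean $\mu$ and covariance $V$.
Let $y^{(q)} = (y_1^{(q)}, \ldots, y_{N_q}^{(q)})^T$ and
$\epsilon^{(q)} = (\epsilon_1^{(q)}, \ldots, \epsilon_{N_q}^{(q)})^T$ for $q \in \{A, B\}$.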

The method. We are interested in estimating the genetic covariance, $\rho_g$, and the corresponding
genetic correlation, $\rho_g / \sqrt{h_{g,A}^2 h_{g,B}^2}$, again using summary statistics. We will
observe the z-scores from GWAS for two different traits; i.e.,
$Z_j^{(A)} = X_j^{(A)T} y^{(A)} / \sqrt{N_A}$ and $Z_j^{(B)} = X_j^{(B)T} y^{(B)} / \sqrt{N_B}$ for
$j = 1, \ldots, M$.

To estimate genetic correlation, we show that

$$E[Z_j^{(A)} Z_j^{(B)}] = \frac{\sqrt{N_A N_B}}{M} \ell_j \rho_g + \frac{N_s}{\sqrt{N_A N_B}} \rho, \tag{2.12}$$

where $\rho := E[y_i^{(A)} y_i^{(B)}] = \rho_g + \rho_e$ and $\ell_j := \sum_k R_{jk}^2$ is the LD
score of SNP $j$. This is the main equation that cross-trait LD score regression, or
CT-LDSC, is based on. The method, analogously to S-LDSC, consists again of estimating the LD
scores $\ell_j$ from a reference panel, and this time regressing the product of z-scores on the LD
scores, with the slope giving genetic covariance and the intercept reflecting sample overlap.
Details of the method are given in Chapter 5.
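
The way the slope and intercept map back to the parameters of Equation (2.12) can be illustrated with a short sketch. It uses noise-free products generated from the equation itself, ordinary unweighted least squares, and hypothetical study sizes and parameter values; the function names are invented for the example.

```python
import math


def expected_z_product(n_a, n_b, n_s, n_snps, ld_score, rho_g, rho):
    """Right-hand side of Equation (2.12) for one SNP."""
    slope = math.sqrt(n_a * n_b) / n_snps * rho_g
    intercept = n_s / math.sqrt(n_a * n_b) * rho
    return slope * ld_score + intercept


def simple_regression(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return slope, y_bar - slope * x_bar


# Hypothetical study sizes and parameters, chosen only for illustration.
n_a, n_b, n_s, n_snps = 10_000, 8_000, 2_000, 1_000
rho_g, rho = 0.3, 0.4
ld = [1.0, 5.0, 20.0, 50.0, 120.0]
products = [expected_z_product(n_a, n_b, n_s, n_snps, l, rho_g, rho) for l in ld]

slope, intercept = simple_regression(ld, products)
rho_g_hat = slope * n_snps / math.sqrt(n_a * n_b)      # genetic covariance from the slope
rho_hat = intercept * math.sqrt(n_a * n_b) / n_s       # phenotypic correlation from the intercept
print(rho_g_hat, rho_hat)
```

Setting `n_s` to zero removes the intercept term entirely, which is one way to see that the intercept carries the information about sample overlap.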

Intuition for the main equation. To get an intuition for Equation (2.12), consider the following.
As in Equation (2.6), for $q \in \{A, B\}$ we have

$$Z_j^{(q)} = \sqrt{N_q} \sum_k \hat{R}_{jk}^{(q)} \beta_k^{(q)} + \epsilon_j'^{(q)},$$

where $\hat{R}_{jk}^{(q)} = X_j^{(q)T} X_k^{(q)} / N_q$ and
$\epsilon_j'^{(q)} = X_j^{(q)T} \epsilon^{(q)} / \sqrt{N_q}$. If there are no shared samples, then
$\epsilon'^{(A)}$ and $\epsilon'^{(B)}$ are independent, and
$E[\hat{R}_{jk}^{(A)} \hat{R}_{jk}^{(B)}] = R_{jk}^2$. In this case, it is easy to see that

$$E[Z_j^{(A)} Z_j^{(B)} \mid N_s = 0] = \frac{\sqrt{N_A N_B}}{M} \rho_g \ell_j,$$

where $\ell_j = \sum_k R_{jk}^2$. On the other hand, if $N_s = N_A = N_B$ (i.e., there is full
sample overlap), then a derivation very similar to the one given above for S-LDSC shows that

$$E[Z_j^{(A)} Z_j^{(B)} \mid N_s = N_A = N_B = N] = \frac{N}{M} \ell_j \rho_g + \rho.$$

For $0 < N_s < \max\{N_A, N_B\}$, Equation (2.12) can be derived by writing $Z^{(A)}$ and $Z^{(B)}$
each as a sum of one component from the shared samples and one component from the non-shared
samples, and then applying arguments from these two examples to the relevant components of the
product of z-scores.

Chapter 6

We model two latent disease subtypes using the model for two traits described above for CT-LDSC,
though instead of specifying only the moments of $(\beta_j^{(A)}, \beta_j^{(B)})$, we specify that
they come from a mixture of a point mass at $(0, 0)$ and a bivariate normal distribution. Let $X$
denote the $N \times M$ genotype matrix of cases for two disease subtypes, with each column
mean-centered and normalized by the standard deviation of the genotype in the population. To derive
a lower bound for the sample size that will be needed for PCA on $XX^T/M$ to reflect the presence of
disease subtypes, we show that $E[XX^T]$ can be approximated by the identity plus a rank-one matrix,
where the top eigenvector corresponds to the assignment of individuals to disease subtypes. We then
apply a result from random matrix theory [15-19] that has been applied to genetic data before in the
context of identifying geographic population structure from genotype data [20]. This result tells us
that the top eigenvector of $XX^T$ will have negligible correlation with the top eigenvector of
$E[XX^T]$ as long as $N \times M$ is below a threshold called the "BBP threshold." The BBP threshold
depends on the parameters of the model, and we derive an approximation that allows us to lower bound
the sample size required for $N \times M$ to pass this threshold.
We then consider the question of using data from controls to choose a subset of SNPs that is
most informative about disease subtypes. We propose an approach that we show has the
potential to greatly decrease the required sample size in many cases.
Finally, we derive a moment-based method for using summary statistics to estimate the parameters
of the distribution of $(\beta_j^{(A)}, \beta_j^{(B)})$. We estimate these parameters for
schizophrenia and bipolar disorder and use our estimates to determine that even with our proposed
approach to choosing SNPs, at least 180,000 combined cases would be required to pass the BBP
threshold.
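
The qualitative behavior described above, where the top eigenvector either does or does not track a planted rank-one structure, can be seen in a toy simulation. The spiked model below is an assumption of this sketch and is not the disease-subtype model of Chapter 6; the function name, sizes, and signal values are hypothetical. In this toy model the standard spiked-covariance result places the transition near signal^2 x N x M = 1, which is a property of the sketch rather than a formula taken from the thesis.

```python
import math
import random


def top_eigenvector_alignment(n_cases, n_snps, signal, seed=0, n_iter=100):
    """|correlation| between the top eigenvector of X X^T and planted subtype labels.

    Toy spiked model (an assumption of this sketch, not the thesis model):
    X = noise + sqrt(signal) * u z^T, where u holds +1/-1 subtype labels.
    """
    rng = random.Random(seed)
    u = [1.0 if i < n_cases // 2 else -1.0 for i in range(n_cases)]
    z = [rng.gauss(0.0, 1.0) for _ in range(n_snps)]
    X = [[rng.gauss(0.0, 1.0) + math.sqrt(signal) * u[i] * z[j] for j in range(n_snps)]
         for i in range(n_cases)]
    # Power iteration on X X^T, computed as X (X^T v) to avoid forming the N x N matrix.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    for _ in range(n_iter):
        w = [sum(X[i][j] * v[i] for i in range(n_cases)) for j in range(n_snps)]
        v = [sum(X[i][j] * w[j] for j in range(n_snps)) for i in range(n_cases)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    # u has norm sqrt(N) and v has norm 1, so this is the absolute cosine between them.
    return abs(sum(a * b for a, b in zip(u, v))) / math.sqrt(n_cases)


no_signal = top_eigenvector_alignment(n_cases=40, n_snps=200, signal=0.0, seed=1)
strong_signal = top_eigenvector_alignment(n_cases=40, n_snps=200, signal=0.1, seed=1)
print(no_signal, strong_signal)
```

With no signal the top eigenvector is essentially unrelated to the labels, while with a signal far above the transition it recovers them almost perfectly.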

Chapter 3

Partitioning heritability by functional


category using genome-wide association
summary statistics

Recent work has demonstrated that some functional categories of the genome contribute
disproportionately to the heritability of complex diseases. Here, we analyze a broad set of
functional elements, including cell-type-specific elements, to estimate their polygenic
contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and
traits with an average sample size of 73,599. To enable this analysis, we introduce a new method,
stratified LD score regression, for partitioning heritability from GWAS summary statistics while
accounting for linked markers. This new method is computationally tractable at very large sample
sizes, and leverages genome-wide information. Our results include a large enrichment of heritability
in conserved regions across many traits; a very large immunological disease-specific enrichment of
heritability in FANTOM5 enhancers; and many cell-type-specific enrichments including significant
enrichment of central nervous system cell types in body mass index, age at menarche, educational
attainment, and smoking behavior.¹

¹ The material in this chapter previously appeared in the September 2015 edition of Nature Genetics
as "Partitioning heritability by functional annotation using genome-wide association summary
statistics" by Hilary Finucane*, Brendan Bulik-Sullivan* et al. [7] (* = co-first).

Introduction

In GWAS of complex traits, much of the heritability lies in single-nucleotide polymorphisms (SNPs)
that do not reach genome-wide significance at current sample sizes [21,22]. However, many current
approaches that leverage functional information [5,6] and GWAS data to inform disease biology use
only SNPs in genome-wide significant loci [22], assume only one causal SNP per locus [26], or do not
account for linkage disequilibrium (LD) [27]. We aim to improve power by estimating the proportion
of genome-wide SNP-heritability [21] attributable to various functional categories, using
information from all SNPs and explicitly modeling LD.

Previous work on partitioning SNP-heritability has used restricted maximum likelihood (REML)
as implemented in GCTA [4,11,12,28]. REML requires individual genotypes, but many of the largest
GWAS analyses are conducted through meta-analysis of study-specific results, and so typically
only summary statistics, not individual genotypes, are available for these studies. Even when
individual genotypes are available, using REML to analyze multiple functional categories becomes
computationally intractable at sample sizes in the tens of thousands. Here, we introduce a method
for partitioning heritability, stratified LD score regression, that requires only GWAS summary
statistics and LD information from an external reference panel that matches the population studied
in the GWAS.

We apply our novel approach to 17 complex diseases and traits with an average sample size
of 73,599. We first analyze non-cell-type-specific annotations and identify heritability enrichment
in many of these functional annotations, including a large enrichment in conserved regions across
many traits and a very large immunological disease-specific enrichment in FANTOM5 CAGE-defined
enhancers. We then analyze cell-type-specific annotations and identify many cell-type-specific
heritability enrichments, including enrichment of central nervous system (CNS) cell types in body
mass index, age at menarche, educational attainment, and smoking behavior.

Results

Overview of methods

Our method for partitioning heritability from summary statistics, called stratified LD score
regression, relies on the fact that the $\chi^2$ association statistic for a given SNP includes the
effects of all SNPs that it tags [13,29]. Thus, for a polygenic trait, SNPs with high LD score will
have higher $\chi^2$ statistics on average than SNPs with low LD score [13]. This might be driven
either by the higher likelihood of these SNPs to tag an individual large effect, or by their ability
to tag multiple weak effects. If we partition SNPs into functional categories with different
contributions to heritability, then LD to a category that is enriched for heritability will increase
the $\chi^2$ statistic of a SNP more than LD to a category that does not contribute to heritability.
Thus, our method determines that a category of SNPs is enriched for heritability if SNPs with high
LD to that category have higher $\chi^2$ statistics than SNPs with low LD to that category.
More precisely, under a polygenic model [21], the expected $\chi^2$ statistic of SNP $j$ is

$$E[\chi_j^2] = N \sum_C \tau_C \ell(j, C) + N a + 1, \tag{3.1}$$

where $N$ is sample size, $C$ indexes disjoint categories, $\ell(j, C)$ is the LD Score of SNP $j$
with respect to category $C$ (defined as $\ell(j, C) := \sum_{k \in C} r^2(j, k)$), $a$ is a term
that measures the contribution of confounding biases [13], and $\tau_C$ is the per-SNP heritability
in category $C$ (Methods). Equation (3.1) allows us to estimate $\tau_C$ via a (computationally
simple) multiple regression of $\chi_j^2$ against $\ell(j, C)$.
The method easily generalizes to overlapping categories; when there are overlapping categories then
$\tau_C$, instead of being the per-SNP heritability in category $C$, is the contribution of category
$C$ to the per-SNP heritability of SNPs in category $C$ (i.e.,
$\mathrm{Var}(\beta_j) = \sum_{C : j \in C} \tau_C$) and Equation (3.1) still
holds (Methods). The method can also be applied to case-control studies (Methods). We define
the enrichment of a category to be the proportion of SNP-heritability explained divided by the
proportion of SNPs. We estimate standard errors with a block jackknife [13], and use these standard
errors to calculate z-scores, P-values, and FDRs (Methods). We have released open-source software
implementing the method (Web Resources).
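
The enrichment definition and the jackknife standard error can be illustrated with a small sketch. The jackknife shown is the generic delete-one-block estimator applied to a toy statistic, not the released software; the function names and toy blocks are hypothetical. The enrichment example reuses the DHS figures quoted from Gusev et al. in Chapter 2 (79% of heritability in 16% of SNPs).

```python
import math


def enrichment(prop_heritability, prop_snps):
    """Proportion of SNP-heritability explained divided by proportion of SNPs."""
    return prop_heritability / prop_snps


def block_jackknife_se(blocks, estimator):
    """Delete-one-block jackknife standard error of estimator(list of blocks)."""
    n_blocks = len(blocks)
    leave_one_out = [estimator(blocks[:i] + blocks[i + 1:]) for i in range(n_blocks)]
    center = sum(leave_one_out) / n_blocks
    return math.sqrt((n_blocks - 1) / n_blocks
                     * sum((t - center) ** 2 for t in leave_one_out))


def pooled_mean(blocks):
    """Toy estimator: mean of all values pooled across blocks."""
    values = [v for block in blocks for v in block]
    return sum(values) / len(values)


# 79% of heritability in 16% of SNPs corresponds to an enrichment of about 4.9.
dhs_enrichment = enrichment(0.79, 0.16)
# Three toy blocks stand in for the much larger number of genomic blocks used in practice.
se = block_jackknife_se([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], pooled_mean)
print(dhs_enrichment, se)
```

A z-score for any estimated quantity is then the estimate divided by its jackknife standard error.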

To apply stratified LD score regression (or REML) we must first specify which categories we
include in our model. We created a "full baseline model" from 24 publicly available main
annotations that are not specific to any cell type (Table A.1; see Methods and Web Resources).
Below, we show that including many categories in our model leads to more accurate estimates of
enrichment. The 24 main annotations include: coding, UTR, promoter, and intron [11,30]; histone
marks H3K4me1, H3K4me3, H3K9ac [5,6,23] and two versions of H3K27ac [31,32]; open chromatin reflected
by DNase I hypersensitivity site (DHS) regions [11,23]; combined chromHMM/Segway predictions [33],
which make use of many ENCODE annotations to produce a single partition of the genome into
seven underlying "chromatin states"; regions that are conserved in mammals [34,35]; super-enhancers,
which are large clusters of highly active enhancers [32]; and enhancers with balanced bidirectional
capped transcripts identified using cap analysis of gene expression (CAGE) in the FANTOM5 panel
of samples, which we call FANTOM5 CAGE-defined enhancers [36] or FANTOM5 enhancers. For the
histone marks and other annotations that differ among cell types, we combined the different cell
types into a single annotation for the full baseline model by taking a union (except for Repressed,
where we took an intersection). To prevent our estimates from being biased upwards by enrichment
in nearby regions [11], we also included 500bp windows around each functional category as separate
functional categories in the full baseline model, as well as 100bp windows around ChIP-seq peaks
when appropriate (see Methods). This yielded a total of 53 (overlapping) functional categories in
the full baseline model, including a category containing all SNPs.

In addition to the analyses using the full baseline model, we also performed analyses using
cell-type-specific annotations of four of the histone marks: H3K4me1, H3K4me3, H3K9ac, and H3K27ac.
Each cell-type-specific annotation corresponded to a histone mark in a cell type (for example,
H3K27ac in liver cells), and there were 220 such annotations in total. We compared these cell types
by adding them individually to the baseline model and ranking them by the z-score of the coefficient
$\tau_c$ of the cell type. We also grouped the 220 cell-type-specific annotations into 10 groups and took
a union within each group, resulting in 10 new cell-type-group annotations (for example, SNPs in
regions with any of the four histone modifications in any immune cell type), and we repeated the
analysis with these 10 cell-type groups.
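
The one-at-a-time ranking can be sketched as follows. This is a simplified illustration rather than the ldsc implementation: it fits ordinary least squares with a naive standard error, whereas the thesis reports jackknife standard errors, and the names `coefficient_z`, `rank_cell_types`, `ld_baseline`, and `ld_by_cell_type` are hypothetical.

```python
import numpy as np

def coefficient_z(chi2, ld_baseline, ld_candidate):
    """z-score of the candidate coefficient in chi2 ~ 1 + baseline + candidate (OLS)."""
    design = np.column_stack([np.ones(len(chi2)), ld_baseline, ld_candidate])
    coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
    resid = chi2 - design @ coef
    dof = design.shape[0] - design.shape[1]
    # Naive OLS covariance; an assumption made here to keep the sketch short.
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef[-1] / np.sqrt(cov[-1, -1])

def rank_cell_types(chi2, ld_baseline, ld_by_cell_type):
    """Add each cell-type annotation to the baseline model on its own, then sort by z-score."""
    scores = {name: coefficient_z(chi2, ld_baseline, ld)
              for name, ld in ld_by_cell_type.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Each candidate annotation is fitted jointly with the baseline columns but never with the other candidates, which mirrors the choice to control for the baseline model only.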

Simulation results: power and lack of bias

In order to assess the power and bias of the method, we performed a variety of simulations. For
these simulations, we used genotypes from the Wellcome Trust Case Control Consortium,37 which
after quality control included 14,526 individuals and 162,575 SNPs (Methods). In our first set of
simulations, we let heritability vary between 0.1 and 0.9, with the proportion of causal SNPs equal
to 0.05 and 0.005 (i.e., 8,129 and 813 causal SNPs on average, respectively), and we simulated
quantitative phenotypes from an additive model. For each simulation, effect sizes for causal SNPs
were drawn from a normal distribution with mean zero and variance (i.e., average per-SNP heri-
tability) determined by functional categories. We simulated realistic enrichment of categories in the
baseline model and CNS cell-type-group (see Methods). For each simulation, we used stratified LD
score regression to estimate total heritability, the heritability of the CNS cell-type-group, and the
proportion of heritability in the CNS cell-type-group.
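
A minimal sketch of this kind of simulation, assuming standardized genotypes and a single enriched category. The function name, its arguments, and the way heritability is split between the category and the rest of the genome are illustrative assumptions, not the settings used in the thesis (see Methods).

```python
import numpy as np

def simulate_phenotype(X, in_category, h2, category_share, p_causal, rng):
    """Simulate a quantitative trait from an additive model.

    X: (N, M) genotypes standardized to mean 0 and variance 1 per SNP.
    in_category: boolean (M,) mask for the enriched category (hypothetical).
    category_share: proportion of heritability assigned to the category.
    """
    n, m = X.shape
    m_cat = int(in_category.sum())
    # Average per-SNP heritability inside and outside the category.
    per_snp = np.where(in_category,
                       h2 * category_share / m_cat,
                       h2 * (1.0 - category_share) / (m - m_cat))
    causal = rng.random(m) < p_causal
    # Causal effects are normal with mean zero; dividing by p_causal keeps the
    # expected total heritability equal to h2.
    beta = np.where(causal, rng.normal(0.0, np.sqrt(per_snp / p_causal)), 0.0)
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return X @ beta + noise, beta
```

The mask must select at least one SNP and leave at least one SNP outside the category.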
[Figure 3-1: three panels, (a), (b), and (c).]

Figure 3-1: Simulation results: power and type 1 error. We simulated null genetic architectures and genetic
architectures with true enrichment for two values of $p_{\mathrm{causal}}$ and a range of values of $N \cdot h^2$. For each set
of simulations, we computed the proportion of simulations rejected at P < 0.05, and the z-score of total
SNP-heritability.

In each of these simulations, stratified LD score regression gave unbiased estimates of heritability
and of the heritability of the CNS cell-type group (Figures A-2a and A-2b). While in theory the
ratio of these two unbiased estimators could be a biased estimator of the proportion of heritability,
especially when estimates of the denominator are noisy, in practice we did not see non-negligible
bias in our estimates of proportion of heritability (Figure A-2c). Results were similar for simulations
with no enrichment (Figure A-3).
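
Why a ratio of two unbiased estimators can be biased follows from a standard second-order Taylor (delta-method) approximation. The notation is introduced here for illustration only: $\hat A$ is the estimated heritability of the category and $\hat B$ is the estimated total heritability.

```latex
\[
\mathbb{E}\left[\frac{\hat A}{\hat B}\right] \approx \frac{\mu_A}{\mu_B}
  - \frac{\mathrm{Cov}(\hat A, \hat B)}{\mu_B^2}
  + \frac{\mu_A \, \mathrm{Var}(\hat B)}{\mu_B^3},
\qquad \mu_A = \mathbb{E}[\hat A], \quad \mu_B = \mathbb{E}[\hat B].
\]
```

Both correction terms shrink as the noise in the denominator becomes small relative to its mean, which is consistent with the negligible bias seen in these simulations.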

We also looked at power as a function of SNP-heritability ($h^2$), sample size ($N$), and proportion
of causal SNPs ($p_{\mathrm{causal}}$). At a fixed proportion of causal SNPs, power depends on $N$ and $h^2$ only
through $N \cdot h^2$ (Figure A-1), and increases as $N \cdot h^2$ increases (Figure 3-1a). Power also increases
as $p_{\mathrm{causal}}$ increases (Figure 3-1a). To reduce these two quantities, $N \cdot h^2$ and $p_{\mathrm{causal}}$, into a single
quantity, we looked at the z-score for total SNP-heritability in our analysis, which also increases as
$N \cdot h^2$ and $p_{\mathrm{causal}}$ increase (Figure 3-1b). We found that the relationship of heritability z-score to
power was the same for both values of $p_{\mathrm{causal}}$ (Figure 3-1c), indicating that the heritability z-score
is a good indicator of power at a variety of sample sizes, heritabilities, and values of $p_{\mathrm{causal}}$. These
simulations also demonstrated well-calibrated type 1 error at all parameter settings tested (Figure 3-1a).

Simulation results: model misspecification

In our second set of simulations, we compared stratified LD score regression to REML, a method
that also estimates partitioned heritability but requires genotype data. For computational ease
using REML, we decreased our sample size to the 2,680 controls in the WTCCC1 dataset, and we
correspondingly restricted ourselves to only SNPs on chromosome 1. For this set of simulations, a
dense set of SNPs was particularly important, so we used genotypes imputed using a 1000 Genomes
reference panel' 4 (Web Resources), giving us 360,106 SNPs after quality control (Methods). We
again simulated quantitative phenotypes using an additive model, with effect sizes of causal SNPs
drawn from a normal distribution with mean zero and variance (i.e., average per-SNP heritability)
determined by functional categories. Heritability was set to 0.5. We estimated the enrichment of the
DHS category, i.e., (Prop. $h^2$)/(Prop. SNPs), using three methods: (1) REML with two categories
(DHS/non-DHS), (2) stratified LD score regression with two categories (DHS/non-DHS), and (3)
stratified LD score regression with the full baseline model (53 categories, described above). Since
REML with 53 categories did not converge at this sample size and would be computationally
intractable at sample sizes in the tens of thousands, we did not include it in our comparison;
an advantage of stratified LD score regression is that it is possible to include a large number of
categories in the underlying model. We report means and standard errors of the mean over 100
independent simulations.

[Figure 3-2: two panels with true enrichment on the horizontal axis. (a) No model misspecification
(causal category is in the model). (b) Model is misspecified (causal category is not in the model),
with simulations labeled Causal = Coding, Causal = 200bp flanking, and Causal = FANTOM5 Enhancer.
Legend: true enrichment; REML using two categories (DHS and non-DHS); LD score regression using
two categories; LD score regression using the full baseline model (in (b), without the causal category).]

Figure 3-2: Simulation results for model misspecification. Enrichment is the proportion of heritability in
DHS regions divided by the proportion of SNPs in DHS regions. Bars show 95% confidence intervals around
the mean of 100 trials. (a) From left to right, the simulated genetic architectures are 1x DHS enrichment,
3x DHS enrichment, and 5.5x DHS enrichment (100% of heritability in DHS SNPs). (b) From left to
right, the simulated genetic architectures are 200bp flanking regions causal, coding regions causal, and
FANTOM5 CAGE-defined Enhancer regions causal. For simulations with coding or FANTOM5
CAGE-defined Enhancer as the causal category, we removed the causal category and the 500bp window around that
category from the full baseline model in order to simulate enrichment in an unknown functional category.

We first performed three sets of simulations where the causal pattern of enrichment was well
modeled by the two-category (DHS/non-DHS) model. In these simulations, all three methods
performed well, although stratified LD score regression with the full baseline model had larger
standard errors around the mean (Figure 3-2a). For example, the standard errors around the
mean in simulations with no DHS enrichment were 0.08 for REML, 0.11 for two-category stratified
LD score regression and 0.19 for stratified LD score regression with the full baseline model. For
the first of these three sets of simulations, all SNPs were causal and SNP effect sizes were drawn
independently from a normal distribution with fixed variance. For the second set of simulations,
all SNPs were causal and SNP effect sizes were drawn independently from a normal distribution,
but the variance of the normal distribution depended on whether the SNP was in a DHS region,
and the two variances (DHS and non-DHS) were chosen so that the proportion of heritability in
DHS regions would be 3x the proportion of SNPs in DHS regions. For the third set of simulations, only SNPs in
DHS regions were causal, resulting in a 5.5x enrichment, and effect sizes of DHS SNPs were drawn
independently from a normal distribution with fixed variance.
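
The two variances for a target enrichment follow from the definition of enrichment as (proportion of heritability)/(proportion of SNPs). The sketch below uses a hypothetical helper and illustrative SNP counts, not the counts used in the thesis. The 5.5x figure for the case where all heritability lies in DHS regions implies that DHS regions cover about 1/5.5, or roughly 18%, of SNPs, and the counts are chosen to match that.

```python
def two_category_variances(h2, m_dhs, m_other, enrichment):
    """Per-SNP effect-size variances (DHS, non-DHS) giving a target DHS enrichment.

    Assumes all SNPs are causal and effect sizes are independent.
    """
    m = m_dhs + m_other
    prop_h2_dhs = enrichment * m_dhs / m   # enrichment = prop_h2 / prop_snps
    if not 0.0 <= prop_h2_dhs <= 1.0:
        raise ValueError("enrichment is not achievable for this category size")
    var_dhs = h2 * prop_h2_dhs / m_dhs
    var_other = h2 * (1.0 - prop_h2_dhs) / m_other
    return var_dhs, var_other

# Illustrative counts: DHS covers 2 of every 11 SNPs, so 5.5x is the largest possible enrichment.
v_dhs, v_other = two_category_variances(h2=0.5, m_dhs=2000, m_other=9000, enrichment=3.0)
```

Setting `enrichment=1.0` gives equal variances (the first set of simulations), and setting it to the maximum gives zero variance outside DHS regions (the third set).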

Next, to explore the realistic scenario where the model used to estimate enrichment does not
match the (unknown) causal model, we performed three sets of simulations where all causal SNPs
were in a particular category, but the model used to estimate heritability did not include this causal
category. The three sets of simulations were (1) all causal SNPs in coding regions, yielding 1.6x
DHS enrichment due to coding/DHS overlap, (2) all causal SNPs in FANTOM5 enhancers, yielding
4.0x DHS enrichment due to FANTOM5 enhancer/DHS overlap, and (3) all causal SNPs in 200bp
DHS flanking regions, yielding 0x DHS enrichment. For the coding and FANTOM5 enhancer causal
simulations, we transformed the full baseline model into a misspecified model by removing the causal
category and the 500bp window around the causal category. Results from these simulations are displayed in
Figure 3-2b.

The two-category estimators were not robust to model misspecification and consistently over-
estimated DHS enrichment by a wide margin. Stratified LD score regression with the full baseline
model gave more accurate mean estimates of enrichment. Specifically, for the simulations with
coding and FANTOM5 enhancers causal, stratified LD score regression with the full baseline model
gave unbiased mean enrichment estimates of 1.8x (s.e. 0.22) and 4.2x (s.e. 0.22), respectively, while
the mean enrichment estimates of REML and two-category stratified LD score regression were
nearly double these. The full baseline model includes a 500bp window around DHS but not a 200bp
window, and gave a mean estimated DHS enrichment of 0.65x (s.e. 0.22) when the 200bp flanking
regions were causal, which is inflated relative to the true enrichment of 0x but much less inflated
than the > 3x mean enrichment estimates given by the two-category methods.

In summary, while these simulations include exaggerated patterns of enrichment (e.g., 100%
of heritability in DHS flanking regions), the results highlight the possibility that two-category
estimators of enrichment can yield incorrect conclusions. Although we cannot entirely rule out
model misspecification as a source of bias for stratified LD score regression with the full baseline
model, we have shown here that it is robust to a wide variety of patterns of enrichment, because
including many categories gives it the flexibility to adapt to the unknown causal model.

Simulation results: cell-type and cell-type-group analyses

We simulated realistic baseline enrichment plus enrichment in a cell-type group, performing 500
replicates for each cell-type group, and we performed our cell-type-group analysis on the resulting
summary statistics. First, we calibrated enrichment of the causal cell-type group so that at least
475 of the 500 simulations had at least one cell-type group that reached significance; we found that
this calibration gave us an average top chi-square statistic between 19 and 50 for the ten cell-type
groups, a realistic range for the data we analyzed (Methods). Of the simulations in which at least
one cell-type group reached significance, we found that the top cell-type group was the cell-type group
simulated to be causal 99.8% of the time (Figure 3-3). In simulations in which the enrichment
was weaker (i.e., in 35% of simulations there was no significant cell-type group), the cell-type group
simulated to be causal was the top cell-type group in 95% of simulations with at least one significant
cell-type group (Figure 3-3). Results for each of the ten cell-type groups are displayed in Figure A-5.

We next repeated these simulations with a cell-type-specific mark (H3K4me3 in fetal brain
cells) instead of a cell-type group as the simulated causal category. We again found that when the
level of enrichment was reasonably high, giving a realistic top chi-square statistic, the simulated
causal cell type was the most significant cell type in over 99% of simulations. In simulations with
weak enrichment, we saw a larger number of simulations in which either the cell type simulated
as causal was significant but not the most significant cell type (13% of total), or in which it was
not significant but another cell type was (9% of total). We attribute the difference between cell
type and cell-type group results to the high levels of correlation among the 220 cell-type-specific
annotations.

[Figure 3-3: plot with a vertical axis from 0.0 to 1.0 and four columns: cell-type group, realistic
enrichment; cell-type group, low enrichment; single cell type, realistic enrichment; single cell type,
low enrichment. Legend: top significant annotation is the causal annotation; top significant annotation
has 0.5 < r2 < 1 to the causal annotation; top significant annotation has r2 < 0.5 to the causal
annotation; no annotation is significant.]

Figure 3-3: Simulation results for ranking cell type groups and cell types. For each cell-type group, 500
simulations were performed with baseline enrichment and either high enrichment or low enrichment in that
cell type group. Results for the left two columns are aggregated over the ten cell type groups; results for
individual groups are displayed in Figure A-5. The right two columns represent 500 simulations each of
high or low enrichment of a single cell-type-specific annotation, H3K4me3 in fetal brain cells.

Analysis of 17 traits using the full baseline model

We analyzed 17 diseases and quantitative traits: height, BMI, age at menarche, LDL levels, HDL
levels, triglyceride levels, coronary artery disease, type 2 diabetes, fasting glucose levels, schizophre-
nia, bipolar disorder, anorexia, educational attainment, smoking behavior, rheumatoid arthritis,
Crohn's disease, and ulcerative colitis (Table A.2, Web Resources). This includes all traits with
publicly available summary statistics with sufficient sample size, SNP-heritability, and polygenicity
measured by the z-score of total SNP-heritability (Methods). We removed the MHC region from
all analyses, due to its unusual LD and genetic architecture.
We applied stratified LD score regression with the full baseline model to the 17 traits. Figure 3-
4 shows results for the 24 main functional annotations, averaged across nine independent traits
(Methods). Figure 3-5 shows trait-specific results for selected annotations and traits (Methods).
Table A.3 shows meta-analysis results for all traits and all 53 categories in the full baseline model.

[Figure 3-4: plot of enrichment for each of the 24 main annotations; a reference line marks "No enrichment".]

Figure 3-4: Enrichment estimates for the 24 main annotations, averaged over nine independent traits.
Annotations are ordered by size. Error bars represent jackknife standard errors around the estimates of
enrichment, and stars indicate significance at P < 0.05 after Bonferroni correction for 24 hypotheses tested.
Negative point estimates are discussed in Methods, as is significance testing.

We observed large and statistically significant enrichments for many functional categories. A
few categories stood out in particular. First, regions conserved in mammals34 showed the largest
enrichment of any category, with 2.6% of SNPs explaining an estimated 35% of SNP-heritability on
average across traits ($P < 10^{-5}$ for enrichment). This is a significantly higher average enrichment

[Figure 3-5: plot of enrichment for selected annotations, with separate estimates for height, BMI, age at
menarche, schizophrenia, and the immune meta-analysis. One estimate is labeled 39x (s.e. 7.4x); a
reference line marks "No enrichment".]

Figure 3-5: Enrichment estimates for selected annotations and traits. Error bars represent jackknife
standard errors around the estimates of enrichment.

than for coding regions, and provides evidence for the biological importance of conserved regions,
despite the fact that the biochemical function of many conserved regions remains uncharacterized.38
Second, FANTOM5 CAGE-defined Enhancers36 were extremely enriched in the three immunological
diseases, with 0.4% of SNPs explaining an estimated 15% of SNP-heritability on average across
these three diseases ($P < 10^{-5}$), but showed no evidence of enrichment for non-immunological
traits (Figure 3-5). The immune-specific enrichment could be because immune cells have better
coverage, altered degradation, and/or a higher number of enhancers. Third, repressed regions were
depleted: 46% of SNPs explain only 29% of heritability on average (P < 0.006 for depletion). While
this depletion is consistent with the hypothesis that these are regions of low activity relative to the
genome-wide average,33 the non-zero heritability in repressed regions suggests either that repressed
regions in the six cell types used to define this annotation do not match the repressed regions in the
cell types relevant for the 17 phenotypes, or that there is significant activity in repressed regions. We
did not see a large enrichment of super-enhancers vs. regular enhancers; the estimates for enrichment
were 1.8x (s.e. 0.2) for super-enhancers vs. 1.6x (s.e. 0.1) for regular enhancers from the same paper
(denoted "H3K27ac (Hnisz)" in Figure 3-4). We also did not see increased cell-type-specificity in
super-enhancers (Methods). This lack of enrichment supports the hypothesis that super-enhancers
may not play a much more important role in regulating transcription than regular enhancers.39
For many annotations, there was also enrichment in the 500bp flanking regions (Table A.3); this
could be because the boundaries are not well defined, because the boundaries of the regions are
different in different individuals, or because unknown regulatory elements often appear close to
known regulatory elements. Analyses stratified by minor allele frequency produced broadly similar
results for all of these enrichments (Table A.4; see Methods).
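
As a quick check of the figures quoted above, enrichment is the proportion of heritability divided by the proportion of SNPs. The helper below is illustrative. Because the percentages in the text are rounded, the resulting ratios are approximate and need not match the estimates reported in the figures and tables.

```python
def enrichment(prop_h2, prop_snps):
    """Enrichment = (proportion of heritability) / (proportion of SNPs)."""
    return prop_h2 / prop_snps

conserved = enrichment(0.35, 0.026)   # conserved regions, averaged across traits
fantom5 = enrichment(0.15, 0.004)     # FANTOM5 enhancers, immunological diseases
repressed = enrichment(0.29, 0.46)    # repressed regions (a depletion, below 1)
print(f"{conserved:.1f}x, {fantom5:.1f}x, {repressed:.2f}x")
# prints: 13.5x, 37.5x, 0.63x
```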

Cell-type-specific analysis of 17 traits

We performed two different cell-type-specific analyses: an analysis of 220 individual cell-type-specific
annotations, and an analysis of 10 cell-type groups. The 220 individual cell-type-specific annotations
are a combination of cell-type-specific annotations from four histone marks: 77 from H3K4me1,23
81 from H3K4me3,23 27 from H3K9ac,23 and 35 from H3K27ac (PGC2)31 (Table A.5, Methods).
When ranking these 220 cell-type-specific annotations, we wanted to control for overlap with the
functional categories in the full baseline model, but not for overlap with the 219 other
cell-type-specific annotations. Thus, we added the 220 cell-type-specific annotations individually, one at a
time, to the full baseline model, and ranked these 220 annotations by the P-value for the coefficient
corresponding to the annotation. This P-value tests whether the annotation contributes significantly
to per-SNP heritability after controlling for the effects of the annotations in the full baseline model.
We assessed statistical significance at the 0.05 level after Bonferroni correction for 220 x 17 = 3,740
hypotheses tested. (This is conservative, since the 220 annotations are not independent.) We also
report results with false discovery rate (FDR) < 0.05 (computed over 220 cell types for each trait).
For 15 of the 17 traits, the top cell type passed an FDR threshold of 0.05. The top cell type for
each trait is displayed in Table 3.1, with additional top cell types reported in Table A.6.
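
The two significance thresholds can be made concrete as follows. The Bonferroni cutoff follows from the number of hypotheses tested. For the false discovery rate, the sketch assumes the Benjamini-Hochberg step-up procedure, which the text does not name, and the function names are hypothetical.

```python
import numpy as np

def bonferroni_cutoff(alpha, n_tests):
    """-log10 of the Bonferroni-corrected P-value threshold."""
    return -np.log10(alpha / n_tests)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of P-values rejected at FDR < alpha (Benjamini-Hochberg)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.nonzero(passed)[0].max()   # largest rank meeting its threshold
        reject[order[:last + 1]] = True
    return reject

cutoff_cell_types = bonferroni_cutoff(0.05, 220 * 17)   # 3,740 hypotheses
cutoff_groups = bonferroni_cutoff(0.05, 10 * 17)        # 170 hypotheses (cell-type-group analysis)
```

For the individual cell types the cutoff is about 4.87 on the -log10(P) scale, and for the cell-type groups it is about 3.5.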

Phenotype                Cell type                     Tissue       Mark     -log10(P)

Height                   Chondrogenic dif**            Bone         H3K27ac  6.81
BMI                      Fetal brain*                  Fetal brain  H3K4me3  4.48
Age at menarche          Fetal brain**                 Fetal brain  H3K4me3  12.25
LDL                      Liver (BI)*                   Liver        H3K4me1  4.76
HDL                      Liver (BI)*                   Liver        H3K4me1  4.51
Triglycerides            Liver (BI)*                   Liver        H3K4me1  3.99
Coronary artery disease  Adipose nuclei*               Adipose      H3K4me1  4.21
Type 2 Diabetes          Pancreatic islets             Pancreas     H3K4me3  2.87
Fasting Glucose          Pancreatic islets*            Pancreas     H3K27ac  3.93
Schizophrenia            Fetal brain**                 Fetal brain  H3K4me3  18.51
Bipolar disorder         Mid frontal lobe*             Brain        H3K27ac  4.42
Anorexia                 Angular gyrus                 Brain        H3K9ac   2.61
Years of education       Angular gyrus**               Brain        H3K4me3  6.63
Ever smoked              Inferior temporal lobe*       Brain        H3K4me3  3.21
Rheumatoid arthritis     CD4+ CD25- IL17+ stim Th17**  Immune       H3K4me1  6.76
Crohn's disease          CD4+ CD25- IL17+ stim Th17**  Immune       H3K4me1  7.59
Ulcerative colitis       CD4+ CD25- IL17+ stim Th17**  Immune       H3K4me1  6.37

Table 3.1: Enrichment of individual cell types. We report the cell type with the lowest P-value for each
trait analyzed. * denotes FDR < 0.05. ** denotes significant at p < 0.05 after Bonferroni correction for
multiple hypotheses. Sample sizes are in Table A.2.

We combined information from related cell types by aggregating the 220 cell-type-specific anno-
tations into 10 groups (Figure 3-6 legend and Table A.5; see Methods). For each trait, we performed
the same analysis on the 10 group-specific annotations as with the 220 cell-type-specific annotations.
We assessed statistical significance at the 0.05 level after Bonferroni correction for 10 x 17 = 170
hypotheses tested, and we again also report results with false discovery rate (FDR) < 0.05 (now
computed over all cell-type groups and traits). For 16 of the 17 traits (all traits except anorexia),
the top cell-type group passed an FDR threshold of 0.05. Results for the 11 traits with the most


[Figure 3-6: one panel per trait (schizophrenia, bipolar disorder, height, rheumatoid arthritis, Crohn's
disease, fasting glucose, BMI, years of education, ever smoked, HDL, age at menarche), each showing
-log10(P) for the 10 cell-type groups. Legend: Adrenal/Pancreas, Central Nervous System, Cardiovascular,
Connective/Bone, Gastrointestinal, Immune/Hematopoietic, Kidney, Liver, Skeletal Muscle, Other.]

Figure 3-6: Enrichment of cell-type groups. We report significance of enrichment for each of 10 cell-type
groups, for each of 11 traits. The black dotted line at $-\log_{10}(P) = 3.5$ is the cutoff for Bonferroni
significance. The grey dotted line at $-\log_{10}(P) = 2.1$ is the cutoff for FDR < 0.05. For HDL, three of the
top individual cell types are adipose nuclei, which explains the enrichment of the "Other" category.

significant enrichments (after pruning closely related traits) are shown in Figure 3-6, with remaining
traits in Figure A-6.

These two analyses are generally concordant, and show highly trait-specific patterns of cell-type
enrichment. They also recapitulate several well-known findings. For example, the top cell type for
each of the three lipid traits is liver (FDR < 0.05 for all three traits). This is concordant with the
medical literature,40 and previous analysis of this GWAS data found a signal of liver enrichment.26,41
For both type 2 diabetes and fasting glucose, the top cell type is pancreatic islets (FDR < 0.05
for fasting glucose but not type 2 diabetes). This, too, is concordant with medical literature,42 and
genetic evidence for this association has been established using overlapping GWAS data and more
comprehensive functional data.41,44 For the three psychiatric traits, the top cell type is a brain cell
type (FDR < 0.05 for schizophrenia and bipolar disorder but not for anorexia) and the top cell-type
group is CNS (significant after multiple testing for schizophrenia and bipolar disorder but not for
anorexia). Previous analysis of the schizophrenia GWAS using a subset of the functional data found
enrichment in CNS cell types as well as immune cell types," and previous analysis of the bipolar
GWAS found several significant SNPs near CNS-related genes.45

There are also several new insights among these results. For example, the three immunological
disorders show patterns of enrichment that reflect biological differences among the three disorders.
Crohn's disease has 40 cell types with FDR < 0.05, of which 39 are immune cell types and one
(colonic mucosa) is a GI cell type. On the other hand, the 39 cell types with FDR < 0.05 for
ulcerative colitis include nine GI cell types in addition to 30 immune cell types, whereas all 39
cell types with FDR < 0.05 for rheumatoid arthritis are immune cell types. The top cell type for
all three traits is CD4+ CD25- IL17+ PMA Ionomycin stimulated Th17 primary. Th17 cells are
thought to act in opposition to Treg cells, which have been shown to suppress immune activity and
whose malfunction has been associated with immunological disorders.46

We also identified several non-psychiatric phenotypes with enrichments in brain cell types. For
both BMI and age at menarche, cell types in the central nervous system (CNS) ranked highest among
individual cell types, and the top cell-type group was CNS, all with FDR < 0.05. These enrichments
support previous human and animal studies that propose a strong neural basis for the regulation
of energy homeostasis." For educational attainment, the top cell-type group is CNS (FDR < 0.05)
and of the ten cell types that are significant after multiple testing, nine are CNS cell types. This
is consistent with our understanding that the genetic component of educational attainment, which
excludes environmental factors and population structure, is highly correlated with IQ. 4 ' Finally, for
smoking behavior, the CNS cell-type group is significant after multiple testing correction, and the
top cell type is again a brain cell type, likely reflecting CNS involvement in nicotine processing.

Discussion

We developed a new statistical method, stratified LD score regression, for identifying functional
enrichment from GWAS summary statistics that uses genome-wide information from all SNPs and
explicitly models LD. We applied this method to summary statistics from 17 traits with an aver-
age sample size of 73,599. Our method identified strong enrichment for conserved regions across
all traits, and immunological disease-specific enrichment for FANTOM5 CAGE-defined enhancers.
Our cell-type-specific enrichment results confirmed previously known enrichments, such as liver
enrichment for HDL levels and pancreatic islet enrichment for fasting glucose. In addition, we
identified enrichments that would have been challenging to detect using existing methods, such as
CNS enrichment for smoking behavior and educational attainment-traits with only one and three
genome-wide significant loci, respectively.48,49 Stratified LD score regression represents a significant
departure from previous methods that require raw genotypes,12 use only SNPs in genome-wide
significant loci,23-25 assume only one causal SNP per locus,26 or do not account for LD27 (see Methods
for a discussion of other methods and comparison on simulated data). Our method is also
computationally efficient, despite the 53 overlapping functional categories analyzed.
Although our polygenic approach has enabled a powerful analysis of genome-wide summary
statistics, it has several limitations. First, for the method to have reasonable power, the dataset
analyzed must have a very large sample size and/or large SNP-heritability, and the trait analyzed
must be polygenic. Second, the method requires an LD reference panel matched to the population
studied to give accurate results; all results in this paper are from European datasets and use 1000G
Europeans as a reference panel. Third, our method is currently not applicable to studies using
custom genotyping arrays (e.g., Metabochip; see Methods). Fourth, our method is based on an
additive model and does not consider the contribution of epistatic or other non-additive effects, nor
does it model causal contributions of SNPs not in the reference panel; in particular, it is possible
that patterns of enrichment at extremely rare variants may be different from those inferred using
this method. Fifth, the method is limited by available functional data: if a trait is enriched in a cell
type for which we have no data, we cannot detect the enrichment. Sixth, our method currently gives
large standard errors when applied to very small categories. Last, though we have shown our method
to be robust in a wide range of scenarios, we cannot rule out bias due to model misspecification
caused by enrichment in an unidentified functional category; however, our simulations show that,
unlike other methods, our method gives nearly unbiased results even under very extreme scenarios
of unmodeled functional categories.

In conclusion, the polygenic approach described here is a powerful and efficient way to learn
about functional enrichments from summary statistics. It will likely become increasingly useful as
functional data continues to grow and improve, and as GWAS studies of larger sample size are
conducted.

Web Resources

* ldsc software:
  github.com/bulik/ldsc

* Baseline and cell-type-group annotations:
  http://data.broadinstitute.org/alkesgroup/LDSCORE/

* 1000 Genomes:
  www.1000genomes.org

* Height50 and BMI51 summary statistics:
  www.broadinstitute.org/collaboration/giant/index.php/GIANT-consortium-data_files

* Menarche summary statistics:52
  www.reprogen.org

* LDL, HDL, and Triglycerides summary statistics:41
  www.broadinstitute.org/mpg/pubs/lipids2010/

* Coronary artery disease summary statistics:53
  www.cardiogramplusc4d.org

* Type 2 diabetes summary statistics:54
  www.diagram-consortium.org

* Fasting glucose summary statistics:55
  www.magicinvestigators.org/downloads/

* Schizophrenia," Bipolar Disorder,45 Anorexia,56 and Smoking behavior49 summary statistics:
  www.med.unc.edu/pgc/downloads

* Education attainment summary statistics:48
  www.ssgac.org

* Rheumatoid arthritis summary statistics:57
  http://plaza.umin.ac.jp/~yokada/datasource/software.htm

* Crohn's disease and ulcerative colitis summary statistics:58
  www.ibdgenetics.org/downloads.html

Acknowledgements
We thank Brad Bernstein, Mariel Finucane, Alistair Forrest, Eran Hodis, Dylan Kotliar, X. Shirley
Liu, Manolis Kellis, Michael O'Donovan, Bogdan Pasaniuc, Albin Sandelin, Abhishek Sarkar,
Patrick Sullivan, Bjarni Vilhjalmsson, and Adrian Veres for helpful discussions. This research
was funded by NIH grants R01 MH101244, R03 CA173785, and 1U01HG0070033. H.K.F. was also
supported by the Fannie and John Hertz Foundation. S.R. is supported by funding from the Arthritis
Foundation and by a Doris Duke Clinical Scientist Development Award. This study made use
of data generated by the Wellcome Trust Case Control Consortium (WTCCC) and the Wellcome
Trust Sanger Institute. A full list of the investigators who contributed to the generation of the
WTCCC data is available at www.wtccc.org.uk. Funding for the WTCCC project was provided by
the Wellcome Trust under award 076113. The members of the Schizophrenia Working Group of the
Psychiatric Genetics Consortium are listed in the Supplementary Information.

Methods

Derivation of Equation (3.1). We begin with a derivation of Equation (3.1) in a sample with
no population structure or other confounding. The derivation of the intercept term in the presence
of confounding is identical to the derivation in previous work.' 3 We do not assume here that the
categories are disjoint.

Let $y_i$ be a quantitative phenotype in individual $i$, standardized to mean 0 and variance 1 in the
population, and let $X_{ij}$ be the genotype of individual $i$ at the $j$-th SNP, standardized so that for
each SNP $j$, $X_{ij}$ has mean 0 and variance 1 in the population. We will assume a linear model:

\[ y_i = \sum_{j \in G} X_{ij} \beta_j(G) + \epsilon_i, \]

where $G$ is some fixed set of SNPs, $\beta_j(G)$ is the effect size of SNP $j$, and $\epsilon_i$ is mean-0 noise.
Letting $M = |G|$, we define $\beta(G) = (\beta_1(G), \ldots, \beta_M(G))$ as the hypothetical result of multiple
linear regression of $y$ on $X$ at infinite sample size. Thus, $\beta(G)$ depends on the set $G$; for example,
if $G$ is the set of genotyped SNPs then $\beta_j(G)$ includes the causal effects of non-typed SNPs that
are tagged by SNP $j$, whereas if $G$ contains all SNPs, then $\beta_j(G)$ will reflect only the true effect at
SNP $j$.
We will define the heritability of the set $G$ of SNPs to be

\[ h^2_G = \sum_{j \in G} \beta_j(G)^2, \]

and the heritability of a category $C \subset G$ to be

\[ h^2_G(C) = \sum_{j \in C} \beta_j(G)^2. \]

The definition of $h^2_G(C)$ depends on both $G$ and $C$; for example, if $C$ is the set of SNPs with minor
allele frequency (MAF) greater than 5%, $h^2_G(C)$ will be larger if $G = C$ than if $G$ contains SNPs with
lower MAF since in the first case $h^2_G(C)$ includes tagged effects of low-frequency SNPs, whereas in
the second case the low-frequency effects are included in $h^2_G(G \setminus C)$. From now on, we will omit the
dependence on $G$, considering it to be fixed.

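Suppose that we have a sample of $N$ individuals. Let $y = (y_1, \ldots, y_N)$, and let $X$ be the
$N \times M$ matrix of standardized genotypes. (We will assume that our sample is large enough that
standardizing each SNP within our sample is roughly equivalent to standardizing each SNP in the
population.) Let $\epsilon = (\epsilon_1, \ldots, \epsilon_N)$ be a vector of residuals. Then we can write

$$y = X\beta + \epsilon.$$

Let $\hat\beta_j$ be the estimate of the marginal effect of SNP $j$ in our sample, given by

$$\hat\beta_j := \frac{1}{N} X_j^T y,$$

where $X_j$ is the $j$-th column of $X$. Define $\chi^2$ statistics $\chi^2_j := N \hat\beta_j^2$.

Substituting $y = X\beta + \epsilon$ into the definition of $\hat\beta_j$, we get

$$\hat\beta_j = \frac{1}{N} X_j^T (X\beta + \epsilon)
             = \frac{1}{N} X_j^T X \beta + \frac{1}{N} X_j^T \epsilon
             = \sum_k \hat r_{jk} \beta_k + \epsilon'_j,$$

where $\hat r_{jk} := X_j^T X_k / N$ is the in-sample correlation between SNPs $j$ and $k$, and
$\epsilon'_j := X_j^T \epsilon / N$. Note that $\epsilon'_j$ has mean 0 and variance $\sigma_e^2/N$, where $\sigma_e^2$ denotes the
variance of $\epsilon_i$.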

We will model $\beta$ as a mean-0 random vector with independent entries. We allow the variance of
$\beta_j$ to depend on the functional categories that we include in our model; i.e., we have $C$ functional
categories $C_1, \ldots, C_C \subset \{1, \ldots, M\}$ and we model the variance of $\beta_j$ as

$$\operatorname{Var}(\beta_j) = \sum_{c:\, j \in C_c} \tau_c. \tag{3.2}$$

In the case that the $C_c$ are disjoint, we will have $\tau_c = h^2(C_c)/M(C_c)$, where $M(C_c)$ is the number
of SNPs in $C_c$. Each SNP must be in at least one category; in practice we either have a set of
categories that forms a disjoint partition of the genome, or we include the set of all SNPs as one of
the categories.

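Consider the expectation of $\chi^2_j = N \hat\beta_j^2$:

$$\begin{aligned}
\mathbb{E}[\chi^2_j]
  &= N\, \mathbb{E}\Big[\Big(\sum_k \hat r_{jk} \beta_k + \epsilon'_j\Big)^2\Big] \\
  &= N \sum_k \mathbb{E}[\hat r_{jk}^2]\, \mathbb{E}[\beta_k^2] + N\, \mathbb{E}[(\epsilon'_j)^2] \\
  &= N \sum_k \mathbb{E}[\hat r_{jk}^2] \Big(\sum_{c:\, k \in C_c} \tau_c\Big) + N (\sigma_e^2 / N) \\
  &= N \sum_c \tau_c \sum_{k \in C_c} \mathbb{E}[\hat r_{jk}^2] + \sigma_e^2,
\end{aligned}$$

where the second equality follows because the random variables are all independent with mean 0.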

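Let $r_{jk}$ denote the true correlation between SNPs $j$ and $k$ in the underlying population. In an
unstructured sample, $\mathbb{E}[\hat r_{jk}^2] \approx r_{jk}^2 + (1 - r_{jk}^2)/N$.

We now have

$$\begin{aligned}
\mathbb{E}[\chi^2_j]
  &= N \sum_c \tau_c \sum_{k \in C_c} \mathbb{E}[\hat r_{jk}^2] + \sigma_e^2 \\
  &\approx N \sum_c \tau_c \sum_{k \in C_c} \big(r_{jk}^2 (1 - 1/N) + 1/N\big) + \sigma_e^2 \\
  &= N \sum_c \tau_c \sum_{k \in C_c} r_{jk}^2 (1 - 1/N) + N \sum_c \tau_c \sum_{k \in C_c} (1/N) + \sigma_e^2 \\
  &\approx N \sum_c \tau_c \sum_{k \in C_c} r_{jk}^2 + \sum_c \sum_{k \in C_c} \tau_c + \sigma_e^2 \\
  &= N \sum_c \tau_c \sum_{k \in C_c} r_{jk}^2 + \sum_k \sum_{c:\, k \in C_c} \tau_c + \sigma_e^2 \\
  &= N \sum_c \tau_c\, \ell(j, c) + \sum_k \operatorname{Var}(\beta_k) + \sigma_e^2,
\end{aligned}$$

where $\ell(j, c) := \sum_{k \in C_c} r_{jk}^2$. The variance of $y_i$ is $\sum_k \operatorname{Var}(\beta_k) + \sigma_e^2$. Since our phenotype has variance
one, we can replace $\sum_k \operatorname{Var}(\beta_k) + \sigma_e^2$ with 1, giving us our main equation:

$$\mathbb{E}[\chi^2_j] = N \sum_c \tau_c\, \ell(j, c) + 1. \tag{3.3}$$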

An extension of this derivation to case-control traits is in Bulik-Sullivan et al.'
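
As a rough illustration of Equation (3.3), the sketch below computes stratified LD scores, the sum of squared correlations between SNP j and the SNPs in category c, from a correlation matrix and a binary annotation matrix, and then evaluates the expected chi-square statistic for each SNP. The toy correlation matrix, the annotation matrix, the coefficient values, and the function names are assumptions made for illustration; they are not taken from the analyses in this chapter.

```python
import numpy as np

def stratified_ld_scores(r, annot):
    """l(j, c) = sum over SNPs k in category c of r_jk^2.

    r     : (M, M) matrix of population correlations between SNPs
    annot : (M, C) 0/1 matrix, annot[k, c] = 1 if SNP k is in category c
    """
    return (r ** 2) @ annot

def expected_chisq(n, tau, ld_scores):
    """Right-hand side of Equation (3.3): N * sum_c tau_c * l(j, c) + 1."""
    return n * ld_scores @ tau + 1.0

# Toy example (hypothetical values): 4 SNPs, with SNPs 0 and 1 in LD (r = 0.5).
r = np.eye(4)
r[0, 1] = r[1, 0] = 0.5
# Category 0 contains all SNPs; category 1 contains SNPs 0 and 1 only.
annot = np.array([[1, 1], [1, 1], [1, 0], [1, 0]], dtype=float)
tau = np.array([1e-4, 2e-4])

ld = stratified_ld_scores(r, annot)
chisq = expected_chisq(n=1000, tau=tau, ld_scores=ld)
```

In this toy setting the two SNPs in LD have larger LD scores to both categories, and therefore larger expected chi-square statistics, than the two unlinked SNPs.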

Stratified LD score regression. Given a vector of $\chi^2$ statistics and LD information either from
the sample or from a reference panel, Equation (3.3) allows us to obtain estimates $\hat\tau_c$ of $\tau_c$ by
computing $\ell(j, c)$ and regressing $\chi^2_j$ on $\ell(j, c)$. Our goal, though, is to estimate
$h^2(C_c) := \sum_{j \in C_c} \beta_j^2$. Because the $\beta_j$ have mean zero, we can approximate this quantity with its
expectation, $\sum_{j \in C_c} \operatorname{Var}(\beta_j)$. When the categories are disjoint, $\operatorname{Var}(\beta_j) = \tau_c$ where SNP $j$ is in
category $C_c$, and so $h^2(C_c) = |C_c| \tau_c$.

When the categories overlap, we apply Equation (3.2), which gives us

$$h^2(C_c) = \sum_{j \in C_c} \operatorname{Var}(\beta_j)
           = \sum_{j \in C_c} \sum_{c':\, j \in C_{c'}} \tau_{c'}
           = \sum_{c'} \tau_{c'}\, |C_c \cap C_{c'}|.$$

For some analyses we will be interested in $\tau_c$ and for some analyses we will use $h^2(C_c)$ or $h^2(C_c)/h^2$
(see "Partitioned heritability vs. regression coefficients" below). The details of the regression are
in the following sections.
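
The conversion from coefficients to partitioned heritability with overlapping categories can be sketched as a matrix product. This is a minimal illustration, assuming a binary annotation matrix; the function name and the toy values are hypothetical.

```python
import numpy as np

def partitioned_heritability(annot, tau):
    """h2(C_c) = sum_{c'} tau_{c'} * |C_c intersect C_{c'}| for every category c.

    annot : (M, C) 0/1 matrix, annot[j, c] = 1 if SNP j is in category c
    tau   : (C,) regression coefficients
    """
    overlap = annot.T @ annot      # overlap[c, c'] = |C_c intersect C_{c'}|
    return overlap @ tau

# Toy example: category 0 holds all 4 SNPs, category 1 holds SNPs 0 and 1.
annot = np.array([[1, 1], [1, 1], [1, 0], [1, 0]], dtype=float)
tau = np.array([1e-4, 2e-4])
h2 = partitioned_heritability(annot, tau)
```

Here SNPs 0 and 1 each have variance 3e-4 and SNPs 2 and 3 each have variance 1e-4, so the all-SNP category sums to 8e-4 and the second category to 6e-4, which is what the matrix product returns.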

Significance testing. We estimate standard errors using a block jackknife over SNPs with 200
equally-sized blocks of adjacent SNPs.13 This gives us an empirical covariance matrix of coefficient
estimates. In the baseline analysis, to evaluate whether a category is enriched for heritability,
we want to test whether $h^2(C)/h^2 > |C|/M$. This is the same as testing whether the per-SNP
heritability is greater in the category than out of the category; i.e., whether

$$\frac{h^2(C)}{|C|} - \frac{h^2 - h^2(C)}{M - |C|} > 0.$$

Because our estimates of the regression coefficients are approximately normally distributed,
$h^2(C)/h^2$ is not normally distributed but $\frac{h^2(C)}{|C|} - \frac{h^2 - h^2(C)}{M - |C|}$ is, so we use the latter expression
to test for significance. Because this expression is linear in the coefficients, we can estimate its
standard error using the covariance matrix for the coefficient estimates, and then we compute a
Z-score to test for significance. This procedure is well-calibrated; see Figure 3-1a.
For the cell-type-specific analyses, we use the Z-score of the coefficient directly.
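
The test described above can be sketched as follows, assuming the delete-one-block coefficient estimates have already been computed by rerunning the regression with each block of SNPs left out. The function names, the toy inputs, and the use of the standard delete-one jackknife variance formula are assumptions for illustration.

```python
import numpy as np

def jackknife_cov(delete_one):
    """Covariance of an estimator from its delete-one-block estimates.

    delete_one : (B, C) array; row b holds the coefficient estimates
                 obtained with block b of SNPs left out.
    """
    b = delete_one.shape[0]
    centered = delete_one - delete_one.mean(axis=0)
    return (b - 1) / b * centered.T @ centered

def enrichment_contrast(annot, c):
    """Weights w such that w @ tau = h2(C)/|C| - (h2 - h2(C))/(M - |C|)."""
    m = annot.shape[0]
    sizes = annot.sum(axis=0)          # |C_c'| for every category c'
    overlap = annot[:, c] @ annot      # |C intersect C_c'| for every c'
    size_c = sizes[c]
    return overlap / size_c - (sizes - overlap) / (m - size_c)

def enrichment_z(tau_hat, delete_one, annot, c):
    """Z-score for enrichment of category c (linear in the coefficients)."""
    w = enrichment_contrast(annot, c)
    se = np.sqrt(w @ jackknife_cov(delete_one) @ w)
    return (w @ tau_hat) / se

# Toy inputs: 4 SNPs, 2 categories, and only 2 jackknife blocks
# (the analyses in the text use 200 blocks).
annot = np.array([[1, 1], [1, 1], [1, 0], [1, 0]], dtype=float)
tau_hat = np.array([1e-4, 2e-4])
delete_one = np.array([[0.0, 1e-4], [0.0, 3e-4]])
z = enrichment_z(tau_hat, delete_one, annot, c=1)
```

Because the contrast is a fixed linear combination of the coefficients, its variance is a quadratic form in the jackknife covariance matrix, which is why no delta-method approximation is needed.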

Full baseline model. The 53 functional categories, derived from 24 main annotations, were
obtained as follows:

* Coding, 3'-UTR, 5'-UTR, promoter, and intron annotations from the RefSeq gene model were
obtained from UCSC and post-processed by Gusev et al."

* Digital genomic footprint and transcription factor binding site annotations were obtained from
ENCODE 5 and post-processed by Gusev et al."

* The combined chromHMM/Segway annotations for six cell lines were obtained from Hoffman
et al.33 CTCF, promoter flanking, transcribed, transcription start site, strong enhancer, and
weak enhancer are a union over the six cell lines; repressed is an intersection over the six cell lines.

* DNase I hypersensitive sites (DHSs) are a combination of ENCODE and Roadmap data,
post-processed by Trynka et al.23 We combined the cell-type-specific annotations into two
annotations for inclusion in the full baseline model: a union of all cell types (DHS) and a union of
only fetal cell types (fetal DHS).

* Cell-type-specific H3K4me1, H3K4me3, and H3K9ac data were all obtained from Roadmap
and post-processed by Trynka et al.23 For each mark, we took a union over cell types for the
full baseline model, and used the individual cell types for our cell-type-specific analysis.

* Cell-type-specific H3K27ac was obtained from Roadmap and post-processed.31 A second
version of H3K27ac was obtained from the data of Hnisz et al.32 For each mark, we took a
union over cell types for the full baseline model, and used the individual cell types for our
cell-type-specific analysis.

* Super-enhancers were also obtained from Hnisz et al.,32 and comprise a subset of the H3K27ac
annotation from that paper. We took a union over cell types for the full baseline model.

* Regions conserved in mammals were obtained from Lindblad-Toh et al., post-processed by
Ward and Kellis.35

* FANTOM5 CAGE-defined enhancers were obtained from Andersson et al.3

* For each of these 24 categories, we added a 500bp window around the category as an additional
category to keep our heritability estimates from being inflated by heritability in flanking
regions.1

* For each of DHS, H3K4me1, H3K4me3, and H3K9ac, we added a 100bp window around the
ChIP-seq peak as an additional category.

* We added an additional category containing all SNPs.

When we report results in Tables A.3 and A.4, we do not report results from the category
containing all SNPs, as it has 100% of the heritability with standard error zero. (It might have a
coefficient $\tau_c$ that is non-trivial, but in these tables we report proportions of heritability.)
According to our simulations (Figure 3-2), including these 53 categories in our baseline model
allows us to obtain unbiased or nearly unbiased estimates of enrichment for a wide range of potential
new categories. Thus, as new annotations become available, we can create from each new annotation
a model with 54 functional categories to assess enrichment of the new (54th) annotation. For
example, for the cell-type-specific analysis, we add each cell-type-specific annotation to the baseline
model one at a time, and assess enrichment using the z-score of the cell-type-specific annotation.

Simulations: Figure 3-1. For these simulations, we used genotypes from the Wellcome Trust
Case Control Consortium.37 QC was performed as described in Gusev et al.:" we removed any SNPs
that were below a MAF of 0.01, were above 0.002 missingness, or deviated from Hardy-Weinberg
equilibrium at a P < 0.01. The resulting dataset had 14,526 individuals and 162,574 SNPs. We let
heritability vary between 0.1 and 0.9, with the proportion of causal SNPs equal to 0.05 and 0.005
(i.e., 8,129 and 813 causal SNPs on average, respectively), and we simulated quantitative phenotypes
from an additive model. For each simulation, effect sizes for causal SNPs were drawn from a
normal distribution with mean zero and variance (i.e., average per-SNP heritability) determined by
functional categories. To simulate realistic enrichment for the 53 categories in the baseline model
plus the CNS cell-type group, we fit the model to the schizophrenia summary statistics31 and took
the resulting coefficients, replacing negative coefficients with 0. We then scaled these coefficients
as needed to give the desired heritability at the desired level of polygenicity. For each simulation,
we used stratified LD score regression to estimate total heritability, the heritability of the CNS
cell-type group, and the proportion of heritability in the CNS cell-type group.
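
A simulation of this kind can be sketched as below. This is a simplified illustration rather than the procedure used for the figures: it rescales the realized genetic component to the target heritability instead of rescaling the coefficients, and the function name, arguments, and toy genotypes are hypothetical.

```python
import numpy as np

def simulate_phenotype(x, annot, tau, p_causal, h2, rng):
    """Simulate a quantitative phenotype from an additive model.

    x        : (N, M) standardized genotype matrix
    annot    : (M, C) 0/1 annotation matrix
    tau      : (C,) per-category coefficients (negative values are set to 0)
    p_causal : proportion of SNPs that are causal
    h2       : target heritability
    """
    n, m = x.shape
    # Per-SNP effect-size variance determined by category membership.
    per_snp_var = annot @ np.clip(tau, 0.0, None)
    causal = rng.random(m) < p_causal
    beta = np.where(causal, rng.normal(0.0, np.sqrt(per_snp_var)), 0.0)
    g = x @ beta
    # Simplification: rescale the genetic component so its sample variance is h2.
    # This fails if no SNP was drawn as causal (g is identically zero).
    g *= np.sqrt(h2 / g.var())
    e = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return g + e

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 50))
x = (x - x.mean(axis=0)) / x.std(axis=0)      # stand-in for real genotypes
annot = np.ones((50, 1))
y = simulate_phenotype(x, annot, np.array([0.01]), p_causal=0.5, h2=0.5, rng=rng)
```

Real genotypes carry LD structure that independent normal draws do not, so this toy input only exercises the mechanics of the additive model.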

Simulations: out of sample LD. In this paper, we use LD scores computed from an out-of-
sample reference panel. To evaluate this, we used the summary statistics simulated above, but ran
stratified LD score regression using a 1000G reference panel rather than in-sample LD. We found
that estimates of total $h^2$ and category-specific $h^2$ were biased downwards, but that estimates of
proportion of $h^2$ were approximately unbiased and type 1 error was well calibrated (Figure A-4).

Simulations: Figure 3-2. For computational ease using REML, we decreased our sample size
to the 2,680 samples in the NBS and 1958BC control cohorts of the WTCCC1 dataset, and we
correspondingly restricted ourselves to only SNPs on chromosome 1. For this set of simulations, a
dense set of SNPs was particularly important, so we used genotypes imputed to integrated phase 1
v3 1000 Genomes14 (URLs), giving us 360,106 SNPs after quality control. We again simulated
quantitative phenotypes using an additive model, with effect sizes of causal SNPs drawn from a
normal distribution with mean zero and variance determined by functional categories. Heritability
was set to 0.5, and all SNPs were causal unless in a category simulated to have zero variance.

Simulations: Figure 3-3. We began with the simulations of realistic enrichment in the baseline
categories and the CNS cell-type group as in Figure 3-1. Then for each other cell-type group, we
removed the CNS cell-type group and added the new cell-type group to the model, scaling the
coefficient $\tau_c$ of the new cell-type group to keep the total heritability constant. We then increased
the coefficients of the cell-type groups by a multiplicative constant so that the average top z-score
over 5,000 simulations (10 cell-type groups × 500 replicates each) was close to the mean top z-score
found in our analysis of 17 real traits. In a second set of simulations, we decreased the coefficients
so that the top cell-type group was significant 50% of the time.
We then repeated the process with the H3K4me3 fetal brain annotation (though with just one
annotation instead of 10 cell-type groups). First we fit a model with this annotation plus the
baseline model to the schizophrenia summary statistics.31 We then scaled the coefficient of the
cell-type-specific annotation until the mean z-score over 500 replicates matched the mean z-score in
real data. In a second set of simulations, we decreased the coefficient so that the top cell-type
group was significant in 50% of 500 replicates.

Meta-analysis across traits. We performed a random-effects meta-analysis of proportion of
heritability over the nine phenotypes listed above for each functional category. The results are in
Figure 3-4 and Table A.3. Results meta-analyzed over all 17 traits are in Figure A-8; however these
results have artificially deflated standard errors due to correlated traits such as HDL, LDL, and
Triglycerides being treated as independent.
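
The text does not state which random-effects estimator was used. As one common choice, the sketch below implements the DerSimonian-Laird estimator; treating it as the method here is an assumption, and the function and variable names are hypothetical.

```python
import numpy as np

def random_effects_meta(est, se):
    """DerSimonian-Laird random-effects meta-analysis.

    est, se : per-trait estimates and their standard errors.
    Returns the pooled estimate and its standard error.
    """
    est = np.asarray(est, dtype=float)
    se = np.asarray(se, dtype=float)
    w = 1.0 / se**2                                   # fixed-effect weights
    fixed = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - fixed) ** 2)                # Cochran's Q
    k = len(est)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    between_var = max(0.0, (q - (k - 1)) / c)         # between-trait variance
    w_re = 1.0 / (se**2 + between_var)                # random-effects weights
    pooled = np.sum(w_re * est) / np.sum(w_re)
    return pooled, np.sqrt(1.0 / np.sum(w_re))

pooled, pooled_se = random_effects_meta([0.2, 0.2, 0.2], [0.1, 0.1, 0.1])
```

Like any estimator that treats its inputs as independent, this one understates the standard error when the traits are correlated, which is the concern raised above for the 17-trait meta-analysis.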

Robustness to derived allele frequency. Stratified LD score regression is based on the as-
sumption that the per-normalized-genotype effect size of a SNP is drawn i.i.d. with mean zero,
conditioned on functional annotation. So if allele frequency bins are not included as annotations in
the model, then we are assuming that per-allele effect sizes have variance proportional to $1/(p(1 - p))$
for allele frequency $p$.
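
The step from the standardized scale to the per-allele scale can be written out as follows. This is a sketch under the assumption of Hardy-Weinberg equilibrium, so that an unstandardized genotype $g_j \in \{0, 1, 2\}$ has variance $2p(1-p)$; the symbols $g_j$ and $b_j$ are introduced here for illustration only.

```latex
X_j = \frac{g_j - 2p}{\sqrt{2p(1-p)}}
\quad\Longrightarrow\quad
X_j \beta_j = (g_j - 2p)\, b_j
\ \text{ with }\
b_j = \frac{\beta_j}{\sqrt{2p(1-p)}},
\qquad
\operatorname{Var}(b_j) = \frac{\operatorname{Var}(\beta_j)}{2p(1-p)} \propto \frac{1}{p(1-p)}.
```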
To check that our results are not affected by an allele-frequency-dependent genetic architecture,
we repeated the meta-analysis over traits with a model that contained all of the categories of the
full baseline model as well as seven derived allele frequency bins as extra annotations:
0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4, 0.4-0.6, 0.6-0.8, and 0.8-1. This allowed for effect size to depend on
derived allele frequency, independently of functional annotation. These results are very similar to
our results without derived allele frequency bins, and are displayed in Table A.4.
In this paper, we do not consider heritability of very rare SNPs. If stratified LD score regression
were to be used to analyze a large dataset with rare variants (for example, one obtained by
sequencing a very large number of people), then there would be several issues to consider that did
not come up in our analysis. For example, in the current analysis, we could use LD estimates from
a reference panel because the LD patterns in the reference panel matched the LD patterns in our
samples for the allele frequency range we were interested in; this might not hold for rare variants.59
Also, our analysis described above shows that allele-frequency-dependent architectures are not
causing bias in our current analyses, but this robustness result does not extend to potential future
analyses of datasets with rare variants.

Comparison to other methods. We are not aware of any other methods designed to esti-
mate genome-wide components of heritability from summary statistics. However, there are existing

methods that identify enriched functional categories and cell types from summary statistics. We
compared our method to four other methods, described below. For each of these methods, we
assessed the rejection rate over 100 simulations for true cell-type-specific enrichment, null baseline
enrichment (i.e., baseline enrichment with no cell-type-specific signal), and null simulations with no
enrichment in any category. We performed this analysis for both a cell type (fetal brain) and cell
type group (CNS) as well as for two proportions of causal SNPs, 0.05 and 0.005. All simulations
had a sample size of 14,000 and $h^2$ of 0.7. Results are displayed in Figure 3-7; below, we discuss the
results for each method individually.

A paper by Pickrell2" combines GWAS data with functional data to identify enriched and de-
pleted functional categories, and leverages the resulting model to increase GWAS power. While the
method, called fgwas, is effective at increasing power and identifies many interesting enrichments
in the published paper, it does not show very good power to detect enrichment in the particular
situations we simulate. Of the four scenarios, fgwas performs best when identifying enrichment
of the smaller category (fetal brain) in the more polygenic trait (Pcausal = 0.05), but stratified
LD score regression outperforms fgwas in all four situations. Fgwas could have an advantage for
annotations smaller than the ones tested in this manuscript, but we do not explore that issue here.

GoShifter is a recent method of Trynka et al.60 (see also their previously published work23).
GoShifter is conservative in its identification of enrichment, comparing to a null obtained by local
shifting rather than a genome-wide null, and it leverages only genome-wide significant SNPs.
As a result, it cannot be applied to traits with few significant loci. In the four situations we
simulated, stratified LD score regression outperformed GoShifter in the more polygenic scenarios,
and the two methods performed comparably in the less polygenic scenarios.

Maurano et al." use enrichment of SNPs passing P-value thresholds of increasing stringency to
identify important cell types. However, they are implicitly assuming that the functional annotation
at a GWAS SNP matches the functional annotation at the causal SNP. While this could be true
for functional annotations composed of very wide regions, it is not likely to be true for functional
annotations composed of smaller regions, such as conserved regions. Moreover, the method does
not account for total LD, and so could give biased results if used to compare functional annotations

with different average amounts of total LD.11 We implemented a "top SNPs" method analogous to
the method of Maurano et al. that tests for enrichment of the functional category among SNPs that
pass statistical significance. The method had very good power to identify enrichment, but also had
a high rejection rate for the null baseline simulations, detecting cell-type-specific signal where there
was none.

Similarly, a recent method from Farh et al." focuses on fine-mapping and considers only genome-
wide significant loci. This method performed similarly to the top SNPs method in our simulations,
with high power to detect enrichment, increasing power as the level of polygenicity was reduced,
but a high rejection rate in null simulations with baseline enrichment.

In addition to stratified LD score regression as used in this manuscript for cell-type-specific
analyses, we also compared to "unadjusted" stratified LD score regression; i.e., LD score regression
used to test for enrichment in total proportion of heritability. As expected, this unadjusted version
had a high rejection rate both for null baseline enrichment and true cell-type-specific signal.

Of the methods with low rejection rates in the null baseline-enrichment simulations, stratified LD score regression
using the z-score of the coefficient to test for enrichment was the most powerful for the polygenic
traits. For the less polygenic traits, stratified LD score regression had power similar to GoShifter
for the larger category, and none of the three methods had any power for the small category with
less polygenic genetic architecture.

In recent work, Kichaev et al." introduce a new method that leverages functional data for
improved fine-mapping. The method also outputs annotations associated with disease. While the
method is effective in increasing fine-mapping resolution, it is again unclear whether the method is
effective at ranking cell types; for example, cell types identified as contributing the most to HDL,
LDL, and Triglycerides (using data from Teslovich et al.") are muscle, kidney, and fetal small
intestine, respectively, whereas the top cell types for those three phenotypes identified using our
method (also using data from Teslovich et al.") are liver, liver, and liver. The lower effectiveness
of this method in ranking cell types may be because it considers only genome-wide significant loci.

[Figure: bar plots with panel titles "Proportion SNPs causal = 0.05" and "Proportion SNPs causal = 0.005". Legend: GoShifter, fgwas, top snps, PICS, LD score (unadj), LD score.]

Figure 3-7: Comparison to other methods for identifying enriched cell types. Each simulation has baseline
enrichment plus enrichment in either the CNS cell-type group or the fetal brain cell type, and the proportion
of 100 simulations in which the null is rejected is reported. In all simulations, N = 14000, $h^2$ = 0.7. LD
score (unadj) refers to enrichment, i.e., (Prop. $h^2$)/(Prop. SNPs); LD score refers to the coefficient $\tau_c$ of the
category, controlling for all other categories in the model. Top SNPs is a method testing for enrichment
among SNPs that pass statistical significance.

Outlier removal. To minimize standard error, we remove outlier SNPs by excluding SNPs $j$ with
$\chi^2_j > \max\{80, 0.001N\}$, where $N$ is the maximum sample size in the study. We also remove the
MHC region from all analyses, because of its unusual LD patterns and genetic architecture.
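
The chi-square threshold can be expressed as a one-line filter. The function name is hypothetical, and removal of the MHC region is not covered by this sketch.

```python
import numpy as np

def keep_mask(chisq, n_max):
    """True for SNPs that are retained, i.e. chi^2_j <= max(80, 0.001 * N)."""
    return np.asarray(chisq) <= max(80.0, 0.001 * n_max)

# With N = 50,000 the threshold is 80; with N = 200,000 it is 200.
mask_small = keep_mask([10.0, 79.9, 81.0, 150.0], n_max=50_000)
mask_large = keep_mask([10.0, 79.9, 81.0, 150.0], n_max=200_000)
```

The threshold scales with sample size because the expected chi-square statistic at a truly associated SNP grows linearly in N.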

Out-of-bounds estimates. Like other heritability estimation methods, stratified LD score regression
can produce heritability estimates that are not between 0 and 1. When unbiasedness is
important (for example, when we are averaging estimates over several simulation replicates), we
do not adjust these out-of-bounds estimates. However, when mean squared error is more important
than unbiasedness (for example, when reporting the results of a single analysis), we truncate these
estimates to be between 0 and 1. To get a confidence interval around the truncated estimate, we
intersect the original confidence interval with the interval [0, 1].
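
The truncation rule can be sketched as follows. The 95% level (z = 1.96) and the function name are assumptions made for illustration; the text does not specify a confidence level here.

```python
def truncate_estimate(est, se, z=1.96):
    """Truncate a heritability estimate to [0, 1] and intersect its
    confidence interval with [0, 1].

    Returns (point, interval), where interval is (lower, upper), or None
    if the original interval does not overlap [0, 1] at all.
    """
    point = min(1.0, max(0.0, est))
    lower = max(0.0, est - z * se)
    upper = min(1.0, est + z * se)
    interval = (lower, upper) if lower <= upper else None
    return point, interval

point, interval = truncate_estimate(-0.05, 0.1)
```

For an estimate of -0.05 with standard error 0.1, the point estimate becomes 0 and the interval runs from 0 to roughly 0.146.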

Choice of regression SNPs and reference SNPs. The derivation above does not incorporate
imperfect imputation. Ideally, we would prune our $\chi^2$ statistics to a set of "regression SNPs" with
imputation accuracy above 0.9, but since imputation accuracy is not always available, we instead
use HapMap Project Phase 3 (HapMap3 61) SNPs as a proxy for well-imputed SNPs. Thus, for the
purposes of this paper, regression SNPs are always the HapMap3 SNPs.
However, the choice of which SNPs to include in our regression is distinct from the choice of
which SNPs we model as causal. It is important that our model allow as many SNPs as possible
to contribute causally, since if we use a model with, for example, only HapMap3 SNPs causal then
we are assigning causality of any SNP that is tagged by HapMap3 (but not included in HapMap3)
to the HapMap3 SNPs that tag it. This is problematic for functional partitioning because the
functional categories containing the causal SNP may not be the same as the functional categories
of the HapMap3 SNPs that tag it.
Recall that $h^2_A(B)$ is the heritability of set $B$ defined using a model that allows any SNP in
set $A$ to be causal. Another way to restate our above point is that we are interested in $h^2_{1000G}(C)$
rather than $h^2_{HapMap3}(C)$ because a model that only allows HapMap3 SNPs to be causal is allowing
non-HapMap3 heritability to be tagged by HapMap3 SNPs and therefore potentially assigning
heritability to the wrong functional category. For this reason, our set of potentially causal SNPs
(i.e., the set of SNPs in our reference panel) is the set of 9,254,535 1000G SNPs14 (see Web Resources)
with minor allele count greater than five in the 379 European samples.

However, there is a problem introduced by having many reference panel SNPs that are not well-
tagged by regression SNPs: it may be inappropriate to extrapolate the enrichments at well-tagged
SNPs to the rare SNPs on our reference panel that are not well-tagged. To better understand this

issue, recall that stratified LD score regression for disjoint categories works in two steps. First, $\chi^2_j$ is
regressed on $N\ell(j, c)$, and the resulting regression coefficients are (if the categories are disjoint) per-SNP
heritabilities in each category. Second, the per-SNP heritability of each category is multiplied
by the number of SNPs in that category to obtain an estimate of the heritability in the category.
This multiplication involves an implicit assumption that the per-SNP heritability estimate extends
uniformly across the whole category. If the per-SNP heritabilities are estimated using HapMap3
SNPs, and the category contains many rare SNPs, this assumption may no longer hold.
SNPs with MAF > 0.05 (we denote this set of SNPs $\mathcal{G}$) are generally well-tagged by HapMap3
SNPs. So for any category $C$, we can estimate $h^2_{1000G}(C \cap \mathcal{G})$ without potentially inaccurate
extrapolation simply by multiplying the per-SNP heritability by the number of SNPs in $C \cap \mathcal{G}$ instead
of the number of SNPs in $C$. We do this by default, and the proportions of heritability that we
report throughout this manuscript are $h^2_{1000G}(C \cap \mathcal{G}) / h^2_{1000G}(\mathcal{G})$.

Regression weights. There are two considerations for how to weight the regression. First,
because of LD, the $\chi^2$ statistics used in the regression are not independent. To correct for this
non-independence, we down-weight each SNP in proportion to its LD to the other SNPs used in
the regression. The second problem is heteroskedasticity: the $\chi^2$ statistic of a SNP with a high LD
score has higher variance than the $\chi^2$ statistic of a SNP with a low LD score, and so we down-weight
SNPs with high LD scores.
For over-counting, we compute LD scores within HapMap3 SNPs; call these $\ell_{hm3}(j)$. For
heteroskedasticity, we compute $\ell_{1000G}(j, c)$ for all categories $c$ in our model. The variance of $\chi^2_j$ is
proportional to $(1 + N \sum_c \tau_c \ell_{1000G}(j, c))^2$, but we do not have $\tau_c$. We use a rough approximation of
$\tau_c$ obtained by taking the mean over regression SNPs of both sides of Equation (3.3) and assuming
that all the $\tau_c$ are equal. This gives us $\hat\tau = (\overline{\chi^2} - 1) / (N \bar\ell)$, where $\overline{\chi^2}$ is the mean of $\chi^2_j$ and $\bar\ell$
is the mean of $\sum_c \ell_{1000G}(j, c)$, both taken over regression SNPs $j$. We then weight SNP $j$ by the
inverse product of the over-counting weights and heteroskedasticity weights:

$$\frac{1}{\ell_{hm3}(j) \left(1 + N \hat\tau \sum_c \ell_{1000G}(j, c)\right)^2}.$$
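
The weighting scheme described above can be sketched in a few lines. The function and argument names are hypothetical, and clamping the rough coefficient estimate at zero is a safeguard added here for the case where the mean chi-square statistic falls below 1; it is not described in the text.

```python
import numpy as np

def regression_weights(chisq, n, ld_hm3, ld_1000g):
    """Per-SNP regression weights.

    chisq    : (J,) chi-square statistics of the regression SNPs
    n        : sample size
    ld_hm3   : (J,) LD scores computed within the regression SNPs
    ld_1000g : (J, C) stratified LD scores from the reference panel
    """
    total_ld = ld_1000g.sum(axis=1)                  # sum_c l_1000G(j, c)
    # Rough tau from the mean of Equation (3.3), assuming all tau_c are equal.
    tau_hat = (chisq.mean() - 1.0) / (n * total_ld.mean())
    tau_hat = max(tau_hat, 0.0)                      # safeguard (assumption)
    het = (1.0 + n * tau_hat * total_ld) ** 2        # heteroskedasticity term
    return 1.0 / (ld_hm3 * het)

w = regression_weights(
    chisq=np.array([1.5, 2.5]),
    n=1000,
    ld_hm3=np.array([1.0, 2.0]),
    ld_1000g=np.array([[1.0, 1.0], [2.0, 2.0]]),
)
```

In this toy input the second SNP has twice the LD of the first on both scales, so it is down-weighted twice: once for over-counting and once for its larger chi-square variance.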

Analyzing summary statistics with GC correction. Artificial deflation of $\chi^2$ statistics by
GC correction causes an equal deflation in the estimates of category-specific heritability and total
heritability. Because these two estimates are deflated by the same amount, though,13 the estimate
of proportion of heritability is unaffected. Moreover, statistical tests for heritability enrichment and
for $\tau_c > 0$ remain valid.

Many of the datasets analyzed in this paper have GC correction applied, and so we report
proportions of heritability, enrichment, and z-scores/p-values, but we do not report estimates of
total heritability or of category-specific heritability.
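
To see why the proportion is unaffected, consider a simplified case, assumed here for illustration, in which a single GC factor $\lambda_{GC}$ divides every $\chi^2$ statistic and the regression intercept is estimated rather than fixed. Dividing both sides of Equation (3.3) by $\lambda_{GC}$ gives:

```latex
\mathbb{E}\!\left[\frac{\chi^2_j}{\lambda_{GC}}\right]
  = N \sum_c \frac{\tau_c}{\lambda_{GC}}\, \ell(j, c) + \frac{1}{\lambda_{GC}},
\qquad
\frac{h^2(C) / \lambda_{GC}}{h^2 / \lambda_{GC}} = \frac{h^2(C)}{h^2}.
```

Every coefficient is scaled by the same factor $1/\lambda_{GC}$, so category-specific and total heritability are scaled by that factor as well, and it cancels in their ratio. Coefficient estimates and their standard errors scale together, which leaves z-scores unchanged.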

Partitioned heritability vs. regression coefficients. The parameters $\tau_c$ and $h^2(C_c)$ have
different interpretations. Because $h^2(C_c)$ is defined as the sum of squared effects of SNPs in $C_c$, it
should not depend on the categories chosen to be in the model, and is generally a robust quantity
to estimate. On the other hand, $\tau_c$ is the contribution of category $C_c$ after controlling for all other
categories in the model; it is defined by the equation $\operatorname{Var}(\beta_j) = \sum_{c:\, j \in C_c} \tau_c$ and depends explicitly
on the choice of categories to put in the model.

As a result, $h^2(C_c)$ is a more robust quantity, and we report this value as the outcome of our
baseline analysis. On the other hand, when comparing cell types, we believe that it is important
to control for the overlap of cell-type-specific annotations with other functional categories such as
coding. Thus, we rank cell types by the P-value of $\tau_c$, rather than the P-value of total enrichment,
(Prop. $h^2$) / (Prop. SNPs).

Custom genotyping arrays. Stratified LD score regression is not currently applicable to studies
using custom genotyping arrays. For these arrays, SNPs that are more likely to be in large-effect
loci also have better coverage, and this dependency is not modeled in the above derivation. In
a meta-analysis of several studies, some of which have custom genotyping arrays, coverage is still
dependent on effect size in a way that violates model assumptions.

Simulations: Figures A-2 and A-3. The data for these two figures are the same as for Figure 3-1,
but here we plot point estimates to assess bias, rather than plotting rejection probabilities to
assess power. In Figure A-2, we display results from the data simulated to have enrichment in the
CNS cell-type group, and in Figure A-3 we display results from the null simulations.

Simulations: Figure A-4. The summary statistics for this figure are the same as the summary
statistics in Figure 3-1, and we ran stratified LD score regression with the same model. However, the
LD scores we used in these analyses were computed from an out-of-sample reference panel instead
of being computed from the same individuals used to generate the data. In particular, we computed
LD scores using individuals from the GBR, FIN, IBS, CEU, and TSI populations of 1000 Genomes.

Heritability z-score method for deciding which phenotypes are amenable to stratified
LD score regression. We wanted to run our method on phenotypes for which we would get
meaningful results, but we did not want to filter results by what the standard errors were around
our estimates. A phenotype will give low standard errors if it is sufficiently polygenic and has high
enough sample size and heritability. It is hard to know ahead of time how polygenic a trait is,
but these three dimensions are conveniently captured in the z-score of total SNP-heritability, which
increases with $Nh^2$ and with proportion of causal SNPs (Figure 3-1b), and which corresponds
closely to power to detect enrichment (Figure 3-1c). We chose a cutoff of z-score > 7 for this paper.
The z-scores of the datasets analyzed in this paper appear in Table A.7.

$Nh^2$ method for deciding which phenotypes are amenable to stratified LD score regression.
An alternative method for choosing which phenotypes are amenable to our method is to use
a cutoff of $Nh^2$. The disadvantage is that this statistic does not take into account polygenicity,
which is an important determinant of power (Figure 3-1a). We did not use this as a criterion in
this work, but we recommend it to potential users of our method who would like to get a rough
idea of whether their dataset has sufficient sample size. In our simulations, a heritability z-score of
7 corresponds to $Nh^2$ of roughly 4,500 for very polygenic traits, and $Nh^2$ of roughly 12,500 for less
polygenic traits (Figure 3-1b).

Choice of 17 phenotypes to include in the main analysis. We applied our method to all
traits with available summary statistics, and removed all traits with a heritability z-score less than
7. (See "Heritability z-score method for deciding which phenotypes are amenable to stratified LD
score regression" above.) We then removed one of each pair of traits with a large genetic correlation
(> 0.95): we removed college attendance, which has a very high genetic correlation with years of
education, and total cholesterol, which has a very high genetic correlation with LDL.9
For Crohn's disease and ulcerative colitis, we used a dataset with 1000 Genomes imputation
which is newer than the dataset available at the link in Online Methods.

Choice of nine traits to include in the meta-analysis of traits. For our meta-analysis over
traits, we identified pairs of traits with substantial sample overlap and trait correlation by using the
intercept of cross-trait LD score regression.' Specifically, for each pair of traits, we computed the
genetic covariance intercept on the N1N2 scale, which for quantitative traits estimates phenotypic
correlation times sample overlap, and for case-control traits estimates a related quantity that is high,
for example, if two unrelated traits share controls (9). This intercept is downwardly biased in
the presence of GC correction, so we divided by the square root of the product of the heritability
estimates of the two traits to correct for this bias. We identified pairs of traits for which this
quantity was at least 15% of the sample size of either of the traits, and we excluded one of each
such pair. The remaining set of traits was: Height, BMI, menarche, LDL levels, coronary artery
disease, schizophrenia, educational attainment, smoking behavior, and rheumatoid arthritis.

Choice of traits to include in Figure 3-5 (Enrichment estimates for selected annotations
and traits). Height, BMI, age at menarche, and schizophrenia are the four traits with the highest
combination of SNP-heritability and sample size, which we quantify by the z-score of total heritabil-
ity in the full baseline analysis. We also included a meta-analysis of immunological diseases, since
they have a different pattern of enrichment from other traits; for example FANTOM5 enhancers
are very enriched for immunological diseases but not for other traits. This meta-analysis included
rheumatoid arthritis and an inflammatory bowel disease dataset that included both Crohn's disease
and ulcerative colitis as cases; we did not include Crohn's disease and ulcerative colitis separately
since the two studies share controls.

Cell-type-specificity of super-enhancers. We were interested in whether we would get increased
signal from a cell-type-specific analysis using super-enhancers compared to regular en-
hancers, and compared to the four histone marks we used in our primary analysis. To test this, we
repeated the cell-type-specific analysis described in the overview of methods for super-enhancers
and regular enhancers from Hnisz et al.,32 obtaining a z-score for each cell type in each of six marks
(four marks from the primary analysis, plus regular enhancers and super-enhancers from Hnisz et
al.). Then, for each of the six marks, we computed the average over traits of the highest z-score
achieved by any cell type for that trait, obtaining a score that should be higher for a mark with a
stronger cell-type-specific signal.
Super-enhancers did not receive a higher score than regular enhancers. In fact, the average top
z-score for super-enhancers was 3.0, while the average top z-score for regular enhancers was 4.7.
The average top z-scores for the four histone marks used in the primary analysis were 4.9, 3.7, 3.5,
and 4.5, respectively.
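
The per-mark score described above (the average over traits of the highest z-score achieved by any cell type) can be written in a few lines. This is a small sketch with a hypothetical function name and made-up numbers, assuming the z-scores for one mark are arranged as a traits-by-cell-types array.

```python
import numpy as np

def average_top_z(z_scores):
    """z_scores: array of shape (n_traits, n_cell_types) for a single mark.
    Takes the maximum over cell types for each trait, then averages over traits."""
    z = np.asarray(z_scores, dtype=float)
    return float(z.max(axis=1).mean())

# Toy example: two traits, three cell types (made-up values).
toy = np.array([[1.0, 3.0, 2.0],
                [4.0, 0.5, 5.0]])
score = average_top_z(toy)  # row maxima are 3.0 and 5.0
```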

Power as a function of annotation size. Stratified LD score regression has more power to
detect enrichment in large categories than in small categories. To quantify this, we performed
simulations in which the CNS cell-type group annotation was pruned to be 0.25x, 0.5x, 0.75x, or 1.0x
the original size by choosing to drop or keep each region with equal probability. We then simulated
phenotypes and summary statistics with the same baseline enrichment, and with enrichment in this
new annotation that matched the enrichment of the original CNS cell-type group annotation. We
then plotted probability of rejection; the results are displayed in Figure A-7 and show a strong
dependence of power on category size.
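
One way to picture the pruning step is the sketch below. It assumes each region is kept independently with probability equal to the target fraction; the text does not spell out the exact sampling scheme, so this is an interpretation, and the function name is hypothetical.

```python
import random

def prune_regions(regions, keep_fraction, seed=0):
    """Keep each region independently with probability `keep_fraction`.
    Assumption: independent per-region sampling, so the pruned annotation
    has the target size only in expectation."""
    rng = random.Random(seed)
    return [region for region in regions if rng.random() < keep_fraction]

# Toy regions as (chromosome, start, end) tuples; coordinates are made up.
toy_regions = [("chr1", i * 1000, i * 1000 + 500) for i in range(20)]
half = prune_regions(toy_regions, 0.5, seed=1)
```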

Chapter 4

Heritability enrichment of specifically expressed genes identifies disease-relevant
tissues and cell types

Genetics can provide a systematic approach to discovering the tissues and cell types relevant for
a complex disease or trait. Identifying these tissues and cell types is critical for following up on
non-coding allelic function, developing ex-vivo models, and identifying therapeutic targets. Here,
we analyze gene expression data from several sources, including the GTEx and PsychENCODE
consortia, together with genome-wide association study (GWAS) summary statistics for 48 diseases
and traits with an average sample size of 86,850, to identify disease-relevant tissues and cell types.
We develop and apply an approach that uses stratified LD score regression to test whether disease
heritability is enriched in regions surrounding genes with the highest specific expression in a given
tissue. We detect tissue-specific enrichments at FDR < 5% for 30 diseases and traits across a broad
range of tissues that recapitulate known biology. In our analysis of traits with observed central
nervous system enrichment, we detect an enrichment of neurons over other brain cell types for
several brain-related traits, enrichment of inhibitory neurons over excitatory neurons for bipolar
disorder, and enrichments in the cortex for schizophrenia and in the striatum for migraine. In our
analysis of traits with observed immunological enrichment, we identify enrichments of alpha beta
T cells for asthma and eczema, B cells for primary biliary cirrhosis, and myeloid cells for lupus
and Alzheimer's disease. Our results demonstrate that our polygenic approach is a powerful way
to leverage gene expression data for interpreting GWAS signal. 1

Introduction

There are many diseases whose causal tissues or cell types are uncertain or unknown. Identifying
these tissues and cell types is critical for developing systems to explore gene regulatory mechanisms
that contribute to disease. In recent years, researchers have been gaining an increasingly clear
picture of which parts of the genome are active in a range of tissues and cell types: for example, which
parts of the genome are accessible, which enhancers are active, and which genes are expressed. 6
Combining this type of information with GWAS data offers the potential to identify causal tissues
and cell types for disease.
Many different types of data characterizing tissue- and cell-type-specific activity have been
analyzed together with GWAS data to identify disease-relevant tissues and cell types: histone
marks, 7,23,24,2 DNase I hypersensitivity (DHS),"' 25 27
, 6, 3 eQTLs,64 '65 and gene expression data.66 69
Of these data types, gene expression data (without genotypes or eQTLs) has the advantage of being
available in the widest range of tissues and cell types. Therefore, methods for integrating gene
expression data with GWAS data have the potential not only to identify system-level differences
among traits (e.g., brain enrichment vs. immune enrichment) but also to obtain high resolution
within a system (e.g., differentiating among brain regions or among immune cell types).
Indeed, previous work has shown that gene expression can be a useful source of information
for identifying disease-relevant tissues and cell types from GWAS data. An initial application of
the SNPsea algorithm66,67 analyzed a data set with gene expression in 249 immune cell types from
mouse, together with genome-wide significant SNPs from GWAS of several immunological diseases,
and reported disease-specific patterns of enrichment.66 The DEPICT software68 includes a method

1 The material in this chapter was previously posted to bioRxiv as "Heritability enrichment of specifically expressed
genes identifies disease-relevant tissues and cell types" by Hilary Finucane et al.8 and as of this writing is under
review at Nature Genetics.

for joint analysis of GWAS summary statistics with a large gene expression data set,70 and has
been used to identify enriched tissues for height71 and BMI.72 In a recent study of migraine, 69
an analysis of genome-wide significant loci with expression data from the GTEx project identified
cardiovascular and digestive/smooth muscle enrichments. These studies show that gene expression
data are informative for disease-relevant tissues and cell types, and have led to biological insights
about the diseases and traits studied. However, the methods applied in these studies restrict their
analyses to subsets of SNPs that pass a significance threshold. To our knowledge, no previous study
has modeled genome-wide polygenic signals to identify disease-relevant tissues and cell types from
GWAS and gene expression data.
Here, we apply stratified LD score regression, 7 a method for partitioning heritability from GWAS
summary statistics, to sets of specifically expressed genes to identify disease-relevant tissues and
cell types across 48 diseases and traits with an average GWAS sample size of 86,850. We first
analyze two gene expression data sets64,68,70 containing a wide range of tissues to infer system-level
enrichments, recapitulating known biology. We also analyze chromatin data from the Roadmap
Epigenomics project 6 across the same set of diseases and traits, and conclude that gene expression
and chromatin provide complementary information. We then analyze gene expression data sets
that allow us to achieve higher resolution within a system,64,73-75 identifying enriched brain regions,
brain cell types, and immune cell types for several brain- and immune-related diseases and traits.
Our results underscore that a heritability-based framework applied to gene expression data allows
us to achieve high-resolution enrichments, even for very polygenic traits.

Results

Overview of methods

We analyzed the five gene expression data sets listed in Table 4.1, mapping mouse genes to orthol-
ogous human genes when necessary. To assess the enrichment of a focal tissue for a given trait, we
follow the procedure described in Figure 4-1. We begin with a matrix of normalized gene expression
values across genes, with samples from multiple tissues including the focal tissue. For each gene, we
compute a t-statistic for specific expression in the focal tissue (Methods). We rank all genes by their
t-statistic, and define the 10% of genes with the highest t-statistic to be the gene set corresponding
to the focal tissue; we call this the set of specifically expressed genes, but we note that this includes
not only genes that are strictly specifically expressed (i.e. only expressed in the focal tissue), but
also genes that are weakly specifically expressed (i.e. higher average expression in the focal tissue).
For a few of the datasets analyzed, we modified our approach to constructing the set of specifically
expressed genes to better take advantage of the data available (Methods). We add 100kb windows
on either side of the transcribed region of each gene in the set of specifically expressed genes to
construct a genome annotation corresponding to the focal tissue. (The choice of the parameters
10% and 100kb is discussed in Methods.) Finally, we apply stratified LD score regression7 to GWAS
summary statistics to evaluate the contribution of the focal genome annotation to trait heritability
(Methods). We jointly model the annotation corresponding to the focal tissue, a genome annota-
tion corresponding to all genes, and the 52 annotations in the "baseline model" 7 (including genic
regions, enhancer regions, and conserved regions; see Table S1). A positive regression coefficient for
the focal annotation in this regression represents a positive contribution of this annotation to trait
heritability, conditional on the other annotations. We report regression coefficients, normalized
by mean per-SNP heritability, together with a P-value to test whether the regression coefficient
is significantly positive. Stratified LD score regression requires GWAS summary statistics for the
trait of interest, together with an LD reference panel (e.g. 1000 Genomes"), and has been shown
to produce robust results with properly controlled type I error.7 We have released open source
software implementing our approach, and have also released all genome annotations derived from
the publicly available gene expression data that we analyzed (see URLs). We call our approach LD
score regression applied to specifically expressed genes (LDSC-SEG).
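
The annotation-building steps above (a t-statistic per gene, the top 10% of genes, then 100kb windows) can be sketched as follows. This is an illustrative sketch only: a Welch two-sample t-statistic stands in for the statistic defined in Methods, which is not reproduced in this section, and all function names and toy data are hypothetical. The final stratified LD score regression step is not shown.

```python
import numpy as np

def specific_expression_t(expr, is_focal):
    """expr: (n_genes, n_samples) normalized expression; is_focal: boolean
    mask over samples. Returns a Welch t-statistic per gene comparing focal
    samples to all other samples (a stand-in for the Methods statistic)."""
    focal = expr[:, is_focal]
    rest = expr[:, ~is_focal]
    var_focal = focal.var(axis=1, ddof=1) / focal.shape[1]
    var_rest = rest.var(axis=1, ddof=1) / rest.shape[1]
    return (focal.mean(axis=1) - rest.mean(axis=1)) / np.sqrt(var_focal + var_rest)

def top_genes(t_stats, fraction=0.10):
    """Indices of the `fraction` of genes with the highest t-statistic."""
    n_top = max(1, int(round(fraction * len(t_stats))))
    return np.argsort(-t_stats)[:n_top]

def add_windows(gene_coords, gene_idx, pad=100_000):
    """gene_coords: list of (chrom, start, end) for transcribed regions.
    Extends each selected gene by `pad` bases on either side."""
    return [(gene_coords[i][0],
             max(0, gene_coords[i][1] - pad),
             gene_coords[i][2] + pad) for i in gene_idx]
```

For example, with ten genes and three focal samples out of six, `top_genes` returns a single gene, and `add_windows` turns it into one interval that can be written out as part of a genome annotation.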

Analysis of 48 complex traits across multiple tissues

We first analyzed two gene expression data sets. The first data set, from the GTEx consortium
v6p3,6 4 consists of RNA-seq data for 53 tissues, with an average of 161 samples per tissue (Table S2,
Methods). The second data set, which we call the Franke lab data set, is an aggregation of publicly
available microarray gene expression data sets comprising 37,427 samples in human, mouse, and
rat.68,70 After removing redundant data, this data set contained 152 tissues, including much better
representation of immune tissues and cell types than the GTEx data set (Table S3, Methods).
The gene expression values in the Franke lab data set already quantify relative expression for a
tissue/cell type rather than absolute expression for a single sample, and so we used these values in
place of our t-statistics. For visualization purposes, we classified the 205 tissues and cell types in
these data sets into nine categories; the classification is described in Table S2 and Table S3. The
main goal of this multiple-tissue analysis was to identify system-level enrichments.

Name              Organism          Tissues/cell types        Technology
GTEx 64           Human             53 tissues/cell types     RNA-seq
Franke lab 68,70  Human/mouse/rat   152 tissues/cell types    Array
Cahoy 73          Mouse             3 brain cell types        Array
PsychENCODE 74    Human             2 neuronal cell types     RNA-seq
ImmGen 75         Mouse             292 immune cell types     Array

Table 4.1: List of gene expression data sets used in this study. We analyzed five gene expression data
sets: two (GTEx and Franke lab) containing a wide range of tissues and three (Cahoy, PsychENCODE,
ImmGen) with more detailed information about a particular tissue.

We analyzed GWAS summary statistics for 48 diseases and traits with an average sample size of
86,850 (Table S4), applying LDSC-SEG for each of the 205 specifically expressed gene annotations
in turn. The 48 traits included 12 traits from the UK Biobank,76 17 traits with publicly available
GWAS summary statistics,41,53-55,57,58,77-82 and 19 traits from the Brainstorm Consortium.31,69,83-90
We excluded the HLA region from all analyses, due to its unusual genetic architecture and pattern of
LD. For 30 of the 48 traits, at least one tissue was significant at FDR < 5% (Figure 4-2, Figure S1 and
Table S5). Averaging across the most significant tissue for each of these 30 traits, the specifically
expressed gene annotation spanned 17% of the genome and explained 38% of SNP-heritability
(Table S5). Several of our results recapitulate known biology: immunological traits exhibit immune
cell-type enrichments, psychiatric traits exhibit strong brain enrichment, LDL and triglycerides
exhibit liver-specific enrichments, BMI-adjusted waist-hip ratio exhibits adipose enrichment, and
height exhibits enrichments in a variety of tissues in a pattern similar to previous analyses of
this trait. 71 In addition, several of our results validate very recent findings from other genetic
analyses: in particular, smoking status, years of education, BMI, and age at menarche show robust
brain enrichments that recapitulate results from our previous analysis of genetic data together
with chromatin data. 7 We also observe a cardiovascular enrichment for intracerebral hemorrhage,
consistent with genetic evidence that this trait shares risk alleles with blood pressure levels,91 and
a brain enrichment for epilepsy, consistent with parallel unpublished work.92

[Figure 4-1 flowchart: gene expression matrix (genes by tissues, e.g. cortex, skin, blood) -> compute a
t-statistic for specific expression in cortex -> rank by t-statistic, take top 10% (about 2,000 genes for
cortex) -> add 100kb window around the genes -> genome annotation for cortex; this annotation, GWAS
summary statistics for schizophrenia, and the LDSC baseline model enter stratified LD score regression,
which outputs the increase in schizophrenia per-SNP h2 for cortex genes.]

Figure 4-1: Overview of the approach. For each tissue in our gene expression data set, we compute
t-statistics for differential expression for each gene. We then rank genes by t-statistic, take the top 10% of
genes, and add a 100kb window to get a genome annotation. We use stratified LD score regression 7 to
test whether this annotation is significantly enriched for per-SNP heritability, conditional on the baseline
model 7 and the set of all genes.
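
The FDR < 5% cutoffs used in the multiple-tissue analysis can be illustrated with a small sketch. It assumes the Benjamini-Hochberg procedure, which this section does not name explicitly, and the function name and toy P-values are hypothetical; which set of tests is pooled when computing the cutoff is also not shown here.

```python
import numpy as np

def bh_threshold(pvals, alpha=0.05):
    """Largest P-value rejected by the Benjamini-Hochberg procedure at level
    `alpha`, or None if nothing is rejected. Assumption: BH is the FDR method."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    passed = p <= alpha * np.arange(1, m + 1) / m
    if not passed.any():
        return None
    return float(p[np.nonzero(passed)[0].max()])

# Toy P-values (made up); the cutoff line in a plot would be -log10 of this value.
cutoff = bh_threshold([0.001, 0.02, 0.04, 0.5])
```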

In a data set with many tissues/cell types, related tissues will have highly overlapping gene sets.
Because of this, and because we fit each tissue without adjusting for the other tissues analyzed,
related tissues often appear enriched as a group. In this analysis, we are focused on identifying
system-level enrichments, and so these correlated results do not limit interpretability. The following
section similarly focuses on identifying system-level enrichments, while in later sections we focus on
differentiating among related tissues/cell types within a system.

[Figure 4-2: bar plots of -log10(P) per tissue/cell type. Legible panel titles: Schizophrenia*, Rheumatoid
arthritis, BMI, Lupus, Waist-hip ratio, Triglycerides, Epilepsy, Intracerebral hemorrhage, Migraine
without aura, LDL; one further panel title is illegible in this copy. Color legend: Adipose, Blood/Immune,
CNS, Cardiovascular, Digestive, Endocrine, Liver, Musculoskeletal/Connective, Other.]

Figure 4-2: Results of multiple-tissue analysis for selected traits. Results for the remaining traits are
displayed in Figure S1. Each bar represents a tissue/cell type from either the GTEx data set or the Franke
lab data set. The width of each bar is proportional to its height, for easier visualization. *: y-axis has
been rescaled to fit the data. The dashed line represents the FDR<5% cutoff, -log10(P)=2.86. Numerical
results are reported in Table S5.

Comparison of expression-based approach to chromatin-based approach

We compared our approach to analyses of the same 48 diseases and traits using stratified LD score
regression7 in conjunction with chromatin data from the Roadmap Epigenomics project6 (see URLs)
instead of gene expression data. We constructed 397 cell-type-/tissue-specific annotations from
narrow peaks in six chromatin marks (DNase hypersensitivity, H3K27ac, H3K4me3, H3K4me1,
H3K9ac, and H3K36me3), each in a subset of 88 primary cell types/tissues. This analysis differed
from our previous analysis of chromatin data 7 in that we used more recently available data on a
larger set of chromatin marks, we consistently used narrow peaks as processed by Roadmap for
all marks, and we controlled not only for the union of annotations for each mark, but also for the
average of annotations for each mark (Methods).

We analyzed GWAS summary statistics for the 48 traits, applying stratified LD score regression
to each of the 397 tissue-specific chromatin-based annotations in turn. For 43 of the 48 traits, at
least one tissue was significant at FDR<5% (Figure S2 and Table S6). Averaging across the most
significant annotation for each of these 43 traits, the tissue-specific chromatin annotation spanned
2.8% of the genome and explained 41% of the SNP-heritability (Table S6). Our results using
chromatin data were generally concordant with the results of our gene expression analysis (Figure 4-
3a). However, in many instances the analysis of chromatin data detected more enrichments and/or
enrichments at higher significance levels than the analysis of gene expression data. There are two
potential explanations for this. First, the set of tissues and cell types for which data is available
is different for the two analyses; while in general gene expression is available in a wider variety
of tissues and cell types (particularly for within-system analyses; see below), in some instances
the most significantly enriched tissue in the chromatin analysis was not available in the GTEx or
Franke lab data sets. For example, fetal lung was highly significantly enriched for lung capacity
(FEV1/FVC) in our analysis of chromatin data, but there was no data on fetal lung in the GTEx or
Franke lab data sets. Second, the enrichments were generally much larger for the chromatin-based
annotations than for the gene expression-based annotations that we analyzed. However, the gene
expression-based annotations were larger (i.e. spanned more of the genome) than the chromatin-based
annotations and were composed of larger regions, reducing the amount of LD between SNPs
in the annotation and SNPs not in the annotation; this explains why LDSC-SEG was well-powered
to identify much smaller enrichments.
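
As a rough worked illustration of the difference in enrichment size, one can divide the average share of SNP-heritability by the average share of the genome reported for each analysis. This assumes the usual definition of enrichment as proportion of heritability over proportion of SNPs, uses genome span as a stand-in for the proportion of SNPs, and takes a ratio of averages rather than an average of per-trait ratios, so the numbers are only indicative:

```latex
\[
\text{enrichment} \approx
\frac{\text{proportion of } h^2_{\mathrm{SNP}}}{\text{proportion of genome}},
\qquad
\frac{0.38}{0.17} \approx 2.2 \ \text{(gene expression)},
\qquad
\frac{0.41}{0.028} \approx 14.6 \ \text{(chromatin)}.
\]
```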

We observed notable differences between the enrichments identified by the two approaches for
migraine (Figure 4-3b). There is a long-standing scientific debate as to whether migraine has a
primarily neurological or vascular basis,93 and a previous analysis of the migraine GWAS data
(not restricted to any subtype) together with the GTEx gene expression data reported both cardiovascular
and digestive/smooth muscle enrichments.69 Our analysis of gene expression data did
not identify any significant enrichments for this migraine data set, and identified a cardiovascular
enrichment but no significant digestive/smooth muscle enrichment for migraine without aura. On
the other hand, our analysis with chromatin data identified a significant neurological enrichment
as well as quantitatively smaller and less significant cardiovascular and digestive/smooth muscle
enrichments for the migraine data set, but identified only a borderline significant enrichment in
fibroblasts for migraine without aura (Figure 4-3b). We hypothesize that this difference reflects a
difference in power and in the cell types available in the two sources of data. For example, the top
annotations for migraine in the chromatin analysis were neurospheres and fetal brain, neither of
which was present in the gene expression data analyzed. Our results are consistent with the hypoth-
esis that migraine without aura does indeed have a vascular component, and that another subtype
of migraine may have a neurological basis which is sufficiently cell-type specific that the relevant cell
types are not represented in either the GTEx or Franke lab data sets. Our results demonstrate that
for a multiple-tissue analysis, chromatin and gene expression data are complementary sources of
data, and that it is of interest to test both gene expression annotations and chromatin annotations
for enrichment, since there are diseases such as migraine and migraine without aura for which only
one of the two types of data yields a significant enrichment.

A major advantage of gene expression data is that it is available at finer tissue/cell-type resolu-
tion within several systems. In the within-system analyses that follow, we analyzed gene expression
data from tissues/cell types for which we did not have comparable chromatin data to investigate
more detailed patterns of tissue/cell-type specificity. Thus, these analyses could not have been
conducted using chromatin data.

[Figure 4-3: bar plots of -log10(P) per tissue/cell type. (A) Panels: Schizophrenia, Rheumatoid Arthritis,
Height, Waist-hip ratio, shown for chromatin (top) and gene expression (bottom). (B) Panels: Migraine,
Migraine without aura, shown for chromatin and gene expression. Color legend: Adipose, Blood/Immune,
CNS, Cardiovascular, Digestive, Endocrine, Liver, Musculoskeletal/Connective, Other.]

Figure 4-3: Comparison of chromatin and gene expression results. (A) Results using chromatin data from
Roadmap (top) and results from the multiple-tissue gene expression analysis (bottom), for selected traits.
Results using chromatin data from Roadmap for the remaining traits are displayed in Figure S2, with
numerical results in Table S6. For the chromatin results, each bar represents a track of narrow peaks for
H3K4me3, H3K4me1, H3K9ac, H3K27ac, H3K36me3, or DHS in a single cell type/tissue. The width of
each bar is proportional to its height, for easier visualization. The dashed line represents the FDR<5%
cutoff, -log10(P)=2.85 (chromatin) or -log10(P)=2.99 (gene expression). (B) Results using chromatin and
gene expression data for migraine and migraine without aura.

Analysis of 13 brain-related traits using fine-scale brain expression data

We identified 13 traits with CNS enrichment at FDR<5% in our gene expression and/or chromatin
analyses: schizophrenia, bipolar disorder, Tourette syndrome, epilepsy, generalized epilepsy, ADHD,
migraine, depressive symptoms, BMI, smoking status, years of education, neuroticism, and systolic
blood pressure. The nervous system has been implicated, either with genetic evidence or non-genetic
evidence, for each of these traits,7,28,31,40,78,83,92-94 including systolic blood pressure, which
is regulated in part via the autonomic nervous system. 40
We first investigated whether some brain regions are enriched over other brain regions for these
traits. While the multiple-tissue analysis included annotations for many different brain regions,
the gene sets for the different brain regions were often highly overlapping so that for many traits,
many brain regions were identified as enriched. For example, nearly every brain region in either the
GTEx or Franke lab data was found to be enriched at FDR<5% for schizophrenia (Figure 4-2). To
differentiate among brain regions, we restricted ourselves to gene expression data only from samples
from the brain in the GTEx data. We computed t-statistics within the brain-only data set; e.g. we
computed t-statistics for cortex vs. other brain regions instead of cortex vs. other tissues in GTEx,
and we used these new t-statistics to construct and test gene sets as in the multiple-tissue analysis.
Individual-level data was not available for the Franke lab data set, and thus we could not compute
within-brain t-statistics for this data set.
An alternative approach would be to undertake a joint analysis of the original 13 annotations
from the multiple-tissue analysis. However, joint analysis of 13 highly correlated annotations is
likely to be underpowered, while re-computing t-statistics within the brain allows us to construct
new annotations with lower correlations (Figure S3), increasing our power. Moreover, differential
expression within the brain may allow us to isolate signals from cell types or processes that are
unique to a single brain region, separately from the cell types or processes that are unique to the
brain but shared among brain regions. Thus, we use differential expression within the brain, rather
than joint analysis of the original annotations, to differentiate among brain regions.
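
The effect of recomputing t-statistics within the brain can be seen in a toy example. The sketch below uses a Welch t-statistic as a stand-in for the statistic defined in Methods, and the function name, tissue labels, and expression values are all hypothetical. A gene that is high throughout the brain looks cortex-specific against all tissues, but not against other brain regions.

```python
import numpy as np

def t_focal_vs_rest(values, labels, focal, restrict_to=None):
    """Welch t-statistic for one gene: focal-tissue samples vs. all other
    samples, optionally after restricting to a subset of tissues."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    if restrict_to is not None:
        keep = np.isin(labels, list(restrict_to))
        values, labels = values[keep], labels[keep]
    a = values[labels == focal]
    b = values[labels != focal]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return float((a.mean() - b.mean()) / se)

# One made-up gene measured in two samples from each of three tissues.
labels = ["cortex", "cortex", "cerebellum", "cerebellum", "liver", "liver"]
values = [5.0, 6.0, 5.0, 6.0, 0.0, 1.0]
t_all_tissues = t_focal_vs_rest(values, labels, "cortex")
t_within_brain = t_focal_vs_rest(values, labels, "cortex",
                                 restrict_to={"cortex", "cerebellum"})
```

Here `t_all_tissues` is positive because the liver samples pull down the comparison group, while `t_within_brain` is zero, so this gene would not enter a within-brain cortex gene set.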
The results of our analysis comparing brain regions are displayed in Figure 4-4a, Figure S4a
and Table S7a. We identified significant enrichments in the cortex relative to other brain regions at
FDR<5% for bipolar disorder, schizophrenia, depressive symptoms, and BMI, and in the striatum
for migraine. These enrichments are consistent with our understanding of the biology of these
traits,95-98 but to our knowledge have not previously been reported in any integrative analysis using
genetic data. We also identified enrichments in cerebellum for bipolar disorder, years of education,
smoking status, and BMI. However, we caution that differential gene expression in samples from
different brain regions can reflect the cell type composition of these brain regions as well as their
function. In particular, the cerebellum is known to have a very high concentration of neurons, 99 and
thus cerebellar enrichments could indicate either that the cerebellum is a region that is important
in disease etiology, or that neurons are an important cell type.

[Figure 4-4: bar plots of -log10(P). (A) Panels: Migraine, Schizophrenia*, Depressive symptoms, Bipolar
disorder*, Systolic blood pressure, Smoking status, BMI, Years of education*; legend: Cortex, Cerebellum,
Striatum, Other. (B) Panels: the same eight traits; legend: Astrocyte, Neuron, Oligoden. (C) Panels:
Schizophrenia, Bipolar, Systolic b.p., BMI, Years of Ed.; legend: GABA., Glu.]

Figure 4-4: Results of brain analysis for selected traits. Results for remaining traits are displayed in Figure
S4, with numerical results reported in Table S7. (A) Results from within-brain analysis of 13 brain regions
in GTEx, classified into four groups, for eight of 13 brain-related traits. *: y-axis has been rescaled to fit the
data. The dashed line represents the FDR<5% cutoff, -log10(P)=2.35. (B) Results from the data of Cahoy
et al. on three brain cell types for eight of 13 brain-related traits. The dashed line represents the FDR<5%
cutoff, -log10(P)=2.24. (C) Results from PsychENCODE data on two neuronal subtypes for five of seven
neuron-related traits. The dashed line represents the cutoff for Bonferroni significance, -log10(P)=2.72.

To address the question of the relative importance of brain cell types, as opposed to brain regions,
we analyzed the same set of traits using a publicly available data set of specifically expressed genes
identified from different brain cell types purified from mouse forebrain.73 The authors of this data
set made lists of specifically expressed genes for each of the three brain cell types available, and these
lists were all approximately the same size as the sets of specifically expressed genes in our previous
analyses. We created annotations from these lists in the same way that we created annotations
from the lists of top 10% expressed genes. The results of this analysis are displayed in Figure 4-4b,
Figure S4b and Table S7b. We identified neuronal enrichments at FDR<5% for seven traits: bipolar
disorder, schizophrenia, years of education, smoking status, BMI, neuroticism, and systolic blood
pressure. The other cell types did not exhibit significant enrichment for any of the 13 brain-related
traits. The enrichment of neurons for all four of the traits with enrichment in cerebellum in the
brain-region analysis supports the hypothesis that analyses of brain regions may be confounded
by cell-type composition. The enrichment for systolic blood pressure is consistent with the role of
autonomic regulation of this trait.

To more precisely characterize the neuronal enrichments, we analyzed the seven traits with
neuronal enrichment at FDR<5% using t-statistics computed by the PsychENCODE consortium74
on differential expression in glutamatergic (excitatory) vs. GABAergic (inhibitory) neurons. The
results are displayed in Figure 4-4c, Figure S4c and Table S7c; we used Bonferroni correction in
this analysis, as we were testing only 7x2=14 hypotheses. For bipolar disorder, genes that are
specifically expressed in GABAergic neurons exhibited heritability enrichment, while genes specific
to glutamatergic neurons did not. This result supports the theory that pathology in GABAergic
neurons can contribute causally to risk for bipolar disorder.100,101

Analysis of 22 immune-related traits using immune cell expression data

We identified 22 traits with immune enrichment at FDR<5% in our gene expression and/or chro-
matin analyses. This includes many immunological disorders: celiac disease, Crohn's disease, in-
flammatory bowel disease, lupus, primary biliary cirrhosis, rheumatoid arthritis, type 1 diabetes,
ulcerative colitis, asthma, eczema, and multiple sclerosis. It also includes Alzheimer's and Parkinson's
diseases, which are neurodegenerative diseases with an immune component previously identified
from genetics,102,103 as well as several brain-related traits (ADHD, anorexia nervosa, bipolar
disorder, schizophrenia, Tourette syndrome, and neuroticism) and HDL, LDL, and BMI. Several
of the brain-related traits have been previously suggested to have an immune component,31,104,105
HDL and LDL have been linked to immune activation,106-108 and obesity, in addition to contributing
to inflammation,109 can also be induced in mice through alterations of the immune system.110
We investigated cell-type-specific enrichments for these traits in 292 immune cell types using gene
expression data from the ImmGen project, which contains microarray data on these cell types
from mice. This data set contains data for many immune cell types that are not available in the
multiple-tissue analysis, and because we compute t-statistics within the data set (i.e., each immune
cell vs. all other immune cells), the gene sets are less overlapping than those of immune cell types
in the multiple-tissue analysis.
We identified enrichments at FDR<5% for 13 traits. Results are displayed in Figure 4-5, Figure S5,
and Table S8, and reveal highly trait-specific patterns of enrichment. For primary biliary
cirrhosis, we identified an enrichment in B cells, consistent with literature on the importance of B
cells for this trait.111,112 Lupus and Alzheimer's disease both exhibit enrichment in myeloid cells.
The Alzheimer's disease result is consistent with existing literature on the importance of the innate
immune system in Alzheimer's disease etiology.113 Asthma and eczema both exhibited enrichment
in alpha beta T cells. Several subclasses of alpha beta T cells have been shown to be important
in asthma;114 to our knowledge, this result has not previously been reported in analyses of genetic
data. Rheumatoid arthritis, Crohn's disease, inflammatory bowel disease, and multiple sclerosis all
exhibited enrichments in a variety of cell types, consistent with complex etiologies for these diseases
that involve many different immune cell types.115-117 Schizophrenia and bipolar disorder both
exhibited an enrichment in alpha beta T cells. Patients with bipolar disorder have been shown to have
a reduction in types of alpha beta T cells, but have equal levels of B cells, NK cells, and monocytes
compared to controls.118 T cell levels have been shown to vary between schizophrenia cases and
controls, but existing literature is not consistent in its description of the direction of effect.119 Note
that our analysis excludes the HLA region; a previous analysis of the HLA region for schizophrenia
implicated the complement system through its role in synaptic pruning, a signal that is distinct
from the signal we observe here.120 Finally, we identified an enrichment in gamma delta T cells for
BMI. While obesity is known to cause inflammation,109 and gamma delta T cells are known to be
involved in obesity-related inflammation,121 gamma delta T cells have not to our knowledge been
previously suggested to have a role in BMI etiology.

[Figure 4-5 image not reproduced. Recoverable panel titles: Primary biliary cirrhosis, Lupus, Asthma,
Rheumatoid arthritis, BMI, Eczema, Crohn's disease, Schizophrenia. Recoverable legend entries: innate
lymphocyte, myeloid, stem, stromal, alpha beta T, gamma delta T. Axis ticks run from 0 to 6.]

Figure 4-5: Results of ImmGen analysis for selected traits. Results for the remaining traits are displayed
in Figure S5. The width of each bar is proportional to its height, for easier visualization. The dashed line
represents the FDR<5% cutoff, -log10(P) = 3.08. Numerical results are reported in Table S8.

Discussion

We have shown that applying stratified LD score regression to sets of specifically expressed genes
identifies disease-relevant tissues and cell types. Our approach, LDSC-SEG, allows us to take
advantage of the large amount of gene expression data available (including fine-grained data for
which we do not currently have a comparable chromatin counterpart) to ask questions ranging in
resolution from whether a trait is brain-related to whether excitatory or inhibitory neurons are more
important for disease etiology. We identified many significant enrichments that confirm or extend
our current understanding of biology, including an enrichment of striatum for migraine, enrichment
of GABAergic neurons for bipolar disorder, and an enrichment of myeloid cells for Alzheimer's
disease. These results improve our understanding of these diseases, and highlight the power of
GWAS as a source of biological insight.

There are several key differences between LDSC-SEG, which relies on gene expression data
without genotypes or eQTLs, and approaches that require eQTL data.64,65 First, our approach
can be applied to expression data sets such as the Franke lab data set, the Cahoy data set, the
PsychENCODE data set, and the ImmGen data set that do not have genotypes or eQTLs available
(Table 4.1). Second, to our knowledge, no method based on eQTLs has been shown to consistently
identify system-level enrichments such as brain enrichments for psychiatric traits and immune
enrichment for immunological traits, as we do here.64,65 Third, methods based on eQTLs require gene
expression sample sizes that are large enough to detect eQTLs. In an analysis of data from the
GTEx project, we determined that we could identify strong enrichments such as brain enrichment
for schizophrenia with just one brain sample, though subtler enrichments had decreasing levels of
significance as the gene expression data were down-sampled (Figure S6, Methods). Results from
our analysis of ImmGen data, which has 2.8 samples per cell type on average, confirm that
LDSC-SEG can identify significant enrichments even when the gene expression data has a small number
of samples per tissue/cell type, in contrast to eQTL-based methods.


Our polygenic approach also differs from other gene expression-based approaches such as SNPsea66,67
and DEPICT,68 which restrict their analyses to subsets of SNPs that pass a significance threshold.
For comparison purposes, we repeated the multiple-tissue analysis using SNPsea and DEPICT. We
also repeated the multiple-tissue analysis by analyzing our annotations using MAGMA, a recently
developed gene set enrichment method,122 instead of stratified LD score regression.7 Results are
displayed in Figures S7-S11 (see Methods). Many broad patterns were consistent across all
approaches: immune enrichment for many immunological diseases, liver enrichment for lipid traits,
adipose enrichment for BMI-adjusted waist-hip ratio, and enrichment in several tissues for height
and heel T-score. However, there were also several discrepancies. First, SNPsea and DEPICT,
the two approaches based on top SNPs, did not identify many of the CNS enrichments for
brain-related traits identified by LDSC-SEG and by MAGMA. Second, DEPICT and MAGMA identified
more enrichments than LDSC-SEG overall, including some enrichments with unclear relationships
to known biology. We hypothesized that LDSC-SEG did not identify some of these enrichments
because we jointly model our gene expression-based annotations with the many potential genomic
confounders that are included in the baseline model (e.g. exons). We conducted simulations that
confirmed that LDSC-SEG is the only approach that is well-powered to identify true enrichments
for polygenic traits while avoiding genomic confounding (Figure S12; see Methods).

Our work has several limitations. First, our approach is fundamentally limited by the availability
of gene expression data; for example, if the tissue/cell type that is most relevant for a disease occurs
in a stage of development that has not been assayed, then we cannot identify enrichments in that
tissue/cell type. Second, when analyzing gene expression data from different tissues, cell type
composition can confound the analysis, as we demonstrated in our comparison of brain regions.
Third, tissues/cell types with similar gene expression profiles to a causal tissue/cell type will be
identified as relevant to disease, just as SNPs in LD with a causal SNP will be identified as associated
to disease in a GWAS; thus, significant tissues/cell types should be cautiously interpreted as the
"best proxy" for the truly causal tissue/cell type, which may be unobserved. Finally, because our
approach uses stratified LD score regression, it cannot be applied to custom array data, and it
requires a sequenced reference panel that matches the population studied in the GWAS.7

Our power to identify disease-relevant tissues and cell types will improve as GWAS sample sizes
continue to grow and gene expression data is generated in new tissues and cell types. This will
help advance our understanding of disease biology and lay the groundwork for future experiments
exploring specific variants and mechanisms.

Methods

Computing t-statistics. When computing the t-statistic of each gene for a focal tissue, we
excluded all samples from the same tissue category (see "Tissue categories and covariates" below).
For example, when computing the t-statistic of specific expression for each gene in cortex using
GTEx data, we compared expression in cortex samples to expression in all other samples, excluding
other brain regions. We chose to exclude other brain regions because we wanted to include genes
that are more highly expressed in brain tissues than in non-brain tissues, even if they are not specific
to cortex within the brain. This procedure results in a higher correlation among the t-statistics for
the different brain regions; in a separate analysis, we compute within-brain t-statistics to disentangle
this signal.

Thus, for a focal tissue (e.g., cortex) in a larger tissue category (e.g., brain), we computed the
t-statistic for gene g as follows. We first constructed a design matrix X where each row corresponds
to a sample either in cortex or outside of the brain. The first column of X has a 1 for every cortex
sample and a -1 for every non-brain sample. The remaining columns are an intercept and covariates
(see "Tissue categories and covariates" below). The outcome Y in our model is expression. We fit
this model via ordinary least squares, and compute a t-statistic for the first explanatory variable in
the standard way:

$$t = \frac{\left((X^T X)^{-1} X^T Y\right)[0]}{\sqrt{\mathrm{MSE} \cdot (X^T X)^{-1}[0,0]}}$$

where MSE is the mean squared error of the fitted model; i.e.,

$$\mathrm{MSE} = \frac{1}{N} \left(Y - X(X^T X)^{-1} X^T Y\right)^T \left(Y - X(X^T X)^{-1} X^T Y\right)$$

where N is the number of rows in X. This gives us a t-statistic for each gene for the focal tissue.
We then select the top 10% of genes, add a 100kb window around their transcribed regions, and
apply stratified LD score regression to the resulting genome annotations as described below.
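
The t-statistic computation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used for the analyses: the function names `specific_expression_t` and `top_fraction` are hypothetical, the MSE divides by the number of rows N as in the definition above, and expression is assumed to be supplied as one value per sample for a single gene.

```python
import numpy as np

def specific_expression_t(y, in_focal, covariates=None):
    """t-statistic for the first column of X (+1 focal tissue, -1 other samples).

    y          : expression of one gene, one value per sample
    in_focal   : boolean flag per sample (True = focal tissue)
    covariates : optional (samples x covariates) array, e.g. age and sex
    Samples from the same tissue category are assumed to be removed already.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    contrast = np.where(np.asarray(in_focal, dtype=bool), 1.0, -1.0)
    cols = [contrast, np.ones(n)]           # contrast column, then intercept
    if covariates is not None:
        cov = np.asarray(covariates, dtype=float).reshape(n, -1)
        cols.extend(cov.T)
    X = np.column_stack(cols)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                # ordinary least squares fit
    resid = y - X @ beta
    mse = resid @ resid / n                 # mean squared error, dividing by N
    return beta[0] / np.sqrt(mse * xtx_inv[0, 0])

def top_fraction(t_stats, fraction=0.10):
    """Indices of the genes with the largest t-statistics (top 10% by default)."""
    t_stats = np.asarray(t_stats, dtype=float)
    k = int(np.ceil(fraction * t_stats.size))
    return np.argsort(-t_stats)[:k]
```

In practice one would call `specific_expression_t` once per gene and pass the resulting vector to `top_fraction`; the 100kb windows would then be added around the selected genes in a separate step.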

Modifications of our approach. For some analyses, we modified our approach to constructing
sets of specifically expressed genes to better take advantage of the data available.

* Franke lab data set. The values in the publicly available matrix are not a quantification of
expression intensity, but rather a quantification of differential expression relative to other
tissues in this data set. Thus, it was not appropriate to compute t-statistics in this data
set. We used the original values in place of our t-statistics, then proceeded as described in
Figure 4-1.

* Cahoy data set. The data set of Cahoy et al. had available sets of specifically expressed genes
for the three cell types that each had between 1,700 and 2,100 genes. We took these to be
the gene sets for the three cell types, then proceeded as in the standard approach, adding a
100kb window and applying stratified LD score regression.

* PsychENCODE data set. The PsychENCODE data set had available t-statistics for GABAergic
neurons vs. glutamatergic neurons. We used these t-statistics, rather than computing
our own.

For the other data sets we analyzed (GTEx, GTEx brain regions, ImmGen), we used the approach
described in Figure 4-1. We view it as an advantage of our method that it can be flexibly adapted
to many different types of data.

Tissue categories and covariates.

* For the multiple-tissue GTEx analysis, we used the "SMTS" variable ("Tissue Type, area from
which the tissue sample was taken") to define the tissue categories (Table S2). We used age
and sex as covariates.

* For the analysis of GTEx brain regions, we set each tissue to be its own category, and we used
age and sex as covariates.

* For the ImmGen analysis, we defined tissue categories using the classification on the main page
of immgen.org of cell types into categories: B cells, gamma delta T cells, alpha beta T cells,
innate lymphocytes, myeloid cells, stromal cells, and stem cells (Table S8). The classification
at immgen.org also has a "T cell activation" category that we collapsed into the alpha beta T
cell category because it had data on alpha beta T cells at different stages of activation. We
did not have any covariates.

* For the Franke lab data set, Cahoy data set, and PsychENCODE data set, we did not compute
t-statistics and so we did not have tissue categories or covariates (see "Modifications of our
approach" above).

Choice of parameters. Our approach includes two parameters: the proportion of genes selected,
which we set to 10%, and the window size around each gene, which we set to 100kb. To choose
these two parameters, we ran the approach with six different parameter settings (2%, 5%, or 10% of
genes, crossed with 20kb or 100kb windows) on two diseases (schizophrenia and rheumatoid
arthritis) and two corresponding GTEx tissues (brain, comprising all brain regions, and blood,
comprising LCLs and whole blood), which are widely known to be disease-relevant tissues. We
determined that of the parameter settings we tested, 10% of genes and 100kb produced the most
significant P-values for identifying brain enrichment for schizophrenia and blood enrichment for
rheumatoid arthritis, so we used these parameters for the remaining analyses.
Application of stratified LD score regression. Stratified LD score regression7 is a method for
partitioning heritability. Given (potentially overlapping) genomic annotations $C_1, \ldots, C_K$, one of
which is the category of all SNPs, we model the causal effect $\beta_i$ of SNP i on phenotype Y as drawn
from a distribution with mean 0 and variance

$$\mathrm{Var}(\beta_i) = \sum_k \tau_k 1\{i \in C_k\} \qquad (4.1)$$

(If the genomic annotations are real-valued rather than subsets of SNPs, we can replace $1\{i \in C_k\}$
with any other function of the SNP indices.) We then model the phenotype Y as depending
linearly on genotype: $Y = X\beta + \epsilon$, where X is a vector of SNP values for an individual, and each
SNP has been standardized to mean 0 and variance 1 in the population. Because each SNP is
standardized, and because $\beta_i$ has mean zero, we can call $\mathrm{Var}(\beta_i)$ the per-SNP heritability of SNP
i. (Note that here, because we model $\beta$ as random, our definition of heritability is different from
definitions of heritability in which $\beta$ is fixed, and so we are estimating a fundamentally different
quantity than some other methods.124)

Under this model, the expected marginal chi-square association statistic for SNP i reflects the
causal contributions not only of SNP i but of SNPs in LD with SNP i. Specifically,

$$E[\chi_i^2] = 1 + Na + N \sum_k \tau_k \ell(i, k)$$

where N is the GWAS sample size, a is a constant that reflects population structure and other
sources of confounding,13 and $\ell(i, k)$ is the LD score of SNP i to category $C_k$, defined as
$\ell(i, k) = \sum_j r^2(i, j) 1\{j \in C_k\}$, where $r^2(i, j)$ is the squared correlation between SNPs i and j in the
population. To estimate the $\tau_k$, we first estimate $\ell(i, k)$ from a reference panel, and we then perform
weighted regression of $\chi_i^2$ on $N\ell(i, k)$, using a jackknife over blocks of SNPs to estimate standard
errors.
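
The regression and block jackknife described above can be illustrated with the following NumPy sketch. It is not the LDSC implementation: the function names `ldsc_fit` and `block_jackknife` are hypothetical, the regression weights are assumed to be supplied by the caller (the specific weighting scheme is not reproduced here), the intercept is fit as a free parameter standing in for 1 + Na, and the jackknife uses contiguous, roughly equal-sized blocks with the standard delete-one-block pseudovalue formula.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n, weights=None):
    """Weighted least squares of chi-square statistics on N * LD scores.

    chi2      : (M,) chi-square statistics
    ld_scores : (M, K) LD scores of each SNP to each category
    n         : GWAS sample size
    Returns an array [intercept, coef_1, ..., coef_K], where the intercept
    estimates 1 + N*a and the remaining entries estimate the per-category
    coefficients.
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld = np.asarray(ld_scores, dtype=float)
    w = np.ones_like(chi2) if weights is None else np.asarray(weights, dtype=float)
    design = np.column_stack([np.ones(chi2.size), n * ld])
    sw = np.sqrt(w)                          # scale rows to apply the weights
    coef, *_ = np.linalg.lstsq(design * sw[:, None], chi2 * sw, rcond=None)
    return coef

def block_jackknife(chi2, ld_scores, n, weights=None, n_blocks=200):
    """Delete-one-block jackknife estimates and standard errors."""
    chi2 = np.asarray(chi2, dtype=float)
    ld = np.asarray(ld_scores, dtype=float)
    w = np.ones_like(chi2) if weights is None else np.asarray(weights, dtype=float)
    full = ldsc_fit(chi2, ld, n, w)
    blocks = np.array_split(np.arange(chi2.size), n_blocks)  # contiguous SNP blocks
    loo = np.array([
        ldsc_fit(np.delete(chi2, b), np.delete(ld, b, axis=0), n, np.delete(w, b))
        for b in blocks
    ])
    pseudo = n_blocks * full - (n_blocks - 1) * loo
    est = pseudo.mean(axis=0)
    se = np.sqrt(pseudo.var(axis=0, ddof=1) / n_blocks)
    return est, se
```

A one-sided P-value for a positive coefficient could then be obtained from the ratio of an estimate to its jackknife standard error, treated as a z-score.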

The regression coefficient $\tau_k$ quantifies the importance of annotation $C_k$, correcting for all other
annotations in the model; $\tau_k$ will equal zero if $C_k$ is not enriched, will be negative if belonging to
$C_k$ decreases per-SNP heritability accounting for all other annotations included, and will be
positive if belonging to $C_k$ increases per-SNP heritability, accounting for all other factors. Thus, as
in our previous cell-type-specific analysis,7 we compute P-values that test whether $\tau_k$ is positive.
When reporting quantitative results, we normalize the coefficient $\tau_k$ by our estimate of the mean
per-SNP heritability $\sum_i \mathrm{Var}(\beta_i)/M$ to make it comparable across phenotypes. The normalized
coefficient can be interpreted as the proportion by which the per-SNP heritability of an average SNP
would increase if $\tau_k$ were added to it. In addition, it is possible to estimate the total heritability,
defined as $\sum_i \mathrm{Var}(\beta_i)$, as well as the heritability in category $C_k$, defined as $\sum_{i \in C_k} \mathrm{Var}(\beta_i)$,
by plugging estimates of $\tau_k$ into Equation (4.1), and to compare the proportion of heritability,
$\sum_{i \in C_k} \mathrm{Var}(\beta_i) / \sum_i \mathrm{Var}(\beta_i)$, to the proportion of SNPs, $|C_k|/M$, where M is the total number of
SNPs.7
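
As a worked illustration of these quantities, the sketch below plugs a vector of coefficients into Equation (4.1) and derives the total heritability, the proportion of heritability and of SNPs in each category, and the normalized coefficients. The function name `partition_heritability` is hypothetical, and the annotations are assumed to be binary (0/1) membership indicators with one column per category.

```python
import numpy as np

def partition_heritability(tau, annot):
    """Summaries implied by Equation (4.1) for binary annotations.

    tau   : (K,) regression coefficients, one per category
    annot : (M, K) 0/1 matrix; annot[i, k] = 1 if SNP i is in category k
    """
    tau = np.asarray(tau, dtype=float)
    annot = np.asarray(annot, dtype=float)
    m = annot.shape[0]
    per_snp = annot @ tau                 # per-SNP heritability of each SNP
    h2_total = per_snp.sum()              # total heritability
    h2_cat = annot.T @ per_snp            # heritability summed within each category
    return {
        "h2_total": h2_total,
        "prop_h2": h2_cat / h2_total,
        "prop_snps": annot.sum(axis=0) / m,
        "normalized_tau": tau / (h2_total / m),
    }
```

For example, with four SNPs, an all-SNPs category with coefficient 0.1 and a second category containing two of the SNPs with coefficient 0.2, the second category holds half of the SNPs and three quarters of the heritability.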
We analyzed autosomes only and excluded the HLA from all analyses. In each analysis, we
jointly fit the following annotations:

* The annotation created for our focal tissue by adding 100kb windows around the top 10% of
genes ranked by t-statistic.

* An identical annotation created for all genes included in the gene expression data set being
analyzed.

* The baseline model with 52 functional categories, described previously6 and listed in Table
S1.

Gene expression data: quality control and normalization.

* GTEx data set. We downloaded the RNA-seq read counts from GTEx v6p (see URLs),
removed genes for which fewer than 4 samples had at least one read count per million, removed
samples for which fewer than 100 genes had at least one read count per million, and applied
TPM normalization.121 We used the "SMTSD" variable ("Tissue Type, more specific detail of
tissue type") to define our tissues (Table S2).

* Franke lab data set. We downloaded the publicly available gene expression data from the
DEPICT website (see URLs). We determined that several pairs of tissues had values that
were correlated at r2 > 0.99, including several that had r2 = 1. We pruned our data so
that no two tissues had r2 > 0.99. Most of the closely correlated pairs were also biologically
closely related so that the interpretation did not depend on which tissue we chose to keep
(e.g., plasma and plasma cells, joint and joint capsule). For pairs of tissues where one tissue
was more specific than the second, we kept the more specific tissue (e.g., nose vs. nasal mucosa,
quadriceps muscle vs. skeletal muscle). There were two clusters of highly correlated tissues
for which we decided to remove the entire cluster, not keeping any of the tissues, because
these clusters had very strong but biologically implausible correlations. The first such cluster
was made up of eyelids, conjunctiva, anterior eye segment, tarsal bones, foot bones, and bones
of the lower extremity. The second such cluster was made up of connective tissue, bone and
bones, skeleton, and bone marrow. After pruning, this data set contained 152 tissues, listed
in Table S3.

* Cahoy et al. data set. We downloaded sets of specifically expressed genes for each of the three
cell types (see URLs). To obtain a list of all genes, we also downloaded a list of all genes that
passed quality control in their analysis (Table S3b of Cahoy et al.). We mapped from mouse
to human genes using orthologs from ENSEMBL (see URLs).

* PsychENCODE data set. We used the t-statistics released by the PsychENCODE consortium
for differential expression in GABAergic vs. glutamatergic neurons.74 These t-statistics were
computed using limma.126

* ImmGen data set. We downloaded publicly available gene expression data from the ImmGen
Consortium (see URLs). We used both Phase 1 (GSE15907) and Phase 2 (GSE37448) data.
The data on GEO were on an exponential scale, so we log transformed the data and mapped
to human genes using ENSEMBL orthologs. We tested each of the 297 cell types.
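
The correlation-based pruning described for the Franke lab data set can be sketched as a greedy filter. This is only an illustration under stated assumptions: the function name `prune_correlated` is hypothetical, and the sketch keeps the first tissue encountered in each highly correlated group, whereas the choice of which tissue to keep (or whether to drop a whole cluster) was made by hand in the analysis above.

```python
import numpy as np

def prune_correlated(values, names, r2_max=0.99):
    """Greedily keep tissues so that no two kept tissues have r2 > r2_max.

    values : (genes x tissues) matrix of expression-derived values
    names  : tissue names, in the order of the columns of `values`
    """
    values = np.asarray(values, dtype=float)
    r2 = np.corrcoef(values, rowvar=False) ** 2   # squared correlation between columns
    kept = []
    for j in range(values.shape[1]):
        # Keep tissue j only if it is not too correlated with any kept tissue.
        if all(r2[j, k] <= r2_max for k in kept):
            kept.append(j)
    return [names[j] for j in kept]
```

Ordering the columns so that the preferred (for example, more specific) tissue comes first would make the greedy rule mimic the manual choice.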

We modified the makegenes.sh script127 (see URLs) for some of our data processing.

Chromatin analysis. We downloaded narrow peaks from the Roadmap Epigenomics consortium
for DNase hypersensitivity and five activating histone marks: H3K27ac, H3K4me3, H3K4me1,
H3K9ac, and H3K36me3 (see URLs). Each of these six features was present in a subset of the 88
primary cell types/tissues, for a total of 397 cell-type-/tissue-specific annotations. For each of these
annotations, we tested for enrichment by adding the annotation to the baseline model (see Table
S1), together with the union of cell-type-specific annotations within each mark and the average of
cell-type-specific annotations within each mark. A positive regression coefficient for a
tissue-/cell-type-specific annotation represents a positive contribution of the annotation to per-SNP
heritability, conditional on the other annotations. We again computed a P-value to test whether the
regression coefficient was positive.

Our analysis of chromatin in this work differs from our previous analysis of chromatin data7
in three ways. First, we use a larger range of marks and tissues/cell types: every track available
from the Roadmap Epigenomics website (see URLs) for any of six activating marks, H3K27ac,
H3K4me1, H3K4me3, H3K9ac, H3K36me3, and DHS, in any of the 88 primary tissues and cell
types available, for a total of 397 annotations. Second, we used narrow peaks from Roadmap for all
of the marks. Previously, we analyzed H3K27ac data from one source and H3K4me1, H3K4me3,
and H3K9ac data from another source;23 now that there is a single standard source with uniformly
processed data for all marks of interest, we have switched to using this data. Finally, we controlled
more strictly for confounders by including the average across cell types of the cell-type-specific
annotations for a given mark as an annotation in the model, so that annotations that tend to fall
in areas that are more active overall are not falsely interpreted as cell-type-specific signal.

Number of gene expression samples needed. Because the GTEx consortium data set in-
cluded tens of samples for many of the tissues, we were able to assess how sensitive our results were
to the sample size of the gene expression data set used to construct the gene sets. To do this, we
repeatedly sub-sampled our data set to a variety of sample sizes, each time re-creating gene sets
using the smaller sub-sampled data set. We chose two results to re-analyze in this way. First, we
re-analyzed cortex enrichment for schizophrenia, in which cortex was compared to all non-brain
samples and was highly significant (Figure 4-2). This result was very robust: the enrichment was
highly significant in all of our downsampled data sets, even with only a single cortex sample (Figure
S6A). We then assessed enrichment for schizophrenia in the within-brain analysis, in which cortex
was compared to all other brain regions and was moderately significant (Figure 4-4A). In this anal-
ysis, sample size was more important, and while there was high variance in z-score among random
samples at a given sample size, there was a clear trend that increasing the sample size increases
the significance of the enrichment on average (Figure S6B). In conclusion, these analyses provide
evidence that sample size can be important when the enrichment being identified is near the border
of significance, but that our method is well-powered to detect strong signals even with a single
sample in the tissue of interest.

Comparison to existing methods: real phenotypes. To our knowledge, SNPsea66,67 is the
only existing method that takes as input GWAS summary statistics, together with a matrix of gene
expression values, and identifies enriched tissues and cell types. SNPsea leverages only genome-wide
significant SNPs, rather than all SNPs, a notable difference from our approach. We ran SNPsea on
the summary statistics and gene expression data analyzed in our multiple-tissue analysis; results
are displayed in Figure S7. We found that SNPsea identified biologically plausible enrichments at
high levels of significance for traits such as LDL for which a large proportion of SNP-heritability
lies in genome-wide significant loci, but that it was not well-powered for more polygenic traits; for
example, it found zero tissues with FDR < 5% for bipolar disorder, while our approach found many
brain regions to be enriched at P-values as low as 2e-12 (Figure S1). The lack of power of SNPsea
on more polygenic traits is unsurprising, as SNPsea leverages only genome-wide significant loci.

The DEPICT software68 includes a method for identifying disease-relevant tissues and cell types
from GWAS summary statistics and gene expression data. However, this method takes as input
only the GWAS summary statistics and not gene expression data; the method is designed to be
run only with the Franke lab data set15,17, which is built into the software. Thus, DEPICT could
not be used to obtain the results in our brain-specific and immune-specific analyses, for which we
analyzed data sets that allowed us to differentiate among tissues and cell types within each of these
systems. However, DEPICT does perform a multiple-tissue analysis analogous to the Franke lab
data set component of our multiple-tissue analysis, and so we ran DEPICT on the set of summary
statistics that we analyzed. Like SNPsea, DEPICT is run on a subset of SNPs, but unlike SNPsea,
DEPICT documentation recommends that it be run twice, once on SNPs that pass genome-wide
significance at 5e-8, and once on SNPs that pass a less stringent threshold of 1e-5; we followed this
recommendation, and our results are displayed in Figures S8 and S9. We determined that DEPICT
failed to identify some enrichments identified by our analysis of the Franke lab data set, such as
brain enrichment for several brain-related traits (epilepsy, Tourette syndrome, neuroticism, and
smoking status), but that it identified a large number of enrichments for other traits and tissues
that our approach did not find. In simulations described below, we found that DEPICT sometimes
reported significant results in the absence of true enrichment.
Our approach, described in Figure 4-1, has two main steps: constructing a genome annotation
from gene expression data, and testing this annotation for enrichment with GWAS summary statis-
tics using stratified LD score regression. We tested whether the success of our approach depended
on using stratified LD score regression in the second step by instead analyzing the specifically
expressed gene annotations from the first step using MAGMA,122 a gene set enrichment method
that allows inclusion of a window around each gene and leverages all SNPs in the gene set (Figure
S10). MAGMA and LDSC-SEG identified many of the same enrichments, but MAGMA identi-
fied several enrichments that LDSC-SEG did not. In simulations described below, we determined
that MAGMA can report significant results in the absence of true enrichment due to uncorrected
genomic confounding.
For comparison purposes, we report LDSC-SEG results for the multiple tissue analysis as a
heatmap in Figure S11, in addition to the bar charts in Figure 4-2 and Figure S1.

Comparison to existing methods: simulated phenotypes. We performed simulations using
genotypes from the Genetic Epidemiology Research on Aging (GERA) data set128-130 with 47,360
individuals and 6,507,309 SNPs with imputation R2 > 0.5. We simulated five genetic architectures,
where "null" refers to a heritable trait with no tissue-specific enrichment and "causal" refers to a
heritable trait with cortex enrichment:

1. (Polygenic null) All SNPs causal, causal SNP effects are drawn independently from a normal
distribution with mean zero and constant variance across the genome, with a total heritability
of 0.9.

2. (Sparse null) Same as (1), but each SNP has probability 0.001 of being causal.

3. (Exon-enriched null) A SNP is causal if and only if it is in an exon, causal SNP effects are
drawn independently from a normal distribution with mean zero and constant variance for all
exonic SNPs, with a total heritability of 0.9.

4. (Polygenic causal) We use the annotation corresponding to cortex genes from the multiple-
tissue analysis to simulate a true effect. All SNPs are causal, causal SNP effects are drawn in-
dependently from a normal distribution with a constant variance within the cortex annotation
and constant variance outside of the cortex annotation so that 50% of the total heritability is
assigned to the cortex annotation, 50% of the total heritability is distributed uniformly across
the genome, and the total heritability is 0.2. We chose a smaller value of heritability in the
causal simulations because we wanted to test power to identify true enrichment rather than
control of type I error.

5. (Sparse causal) Same as (4), but each SNP has a probability of 0.001 to be causal.
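
The causal architectures (4) and (5) can be illustrated by drawing per-SNP effect sizes directly. The sketch below is an assumption-laden illustration, not the simulation pipeline used here (which relied on PLINK): the function name `simulate_effects` is hypothetical, and for the sparse case the effect variance of causal SNPs is scaled by 1/p so that the expected total heritability is unchanged, a choice that the text above does not specify.

```python
import numpy as np

def simulate_effects(in_annot, h2=0.2, frac_annot=0.5, p_causal=1.0, seed=0):
    """Draw causal effect sizes with a share of heritability in one annotation.

    in_annot   : boolean flag per SNP (True = SNP lies in the annotation)
    h2         : total heritability
    frac_annot : share of h2 assigned to the annotation; the rest is spread
                 uniformly over all SNPs
    p_causal   : probability that a SNP is causal (1.0 = fully polygenic)
    """
    rng = np.random.default_rng(seed)
    in_annot = np.asarray(in_annot, dtype=bool)
    m = in_annot.size
    # Expected per-SNP variance: uniform background plus the annotation's share.
    var = np.full(m, (1 - frac_annot) * h2 / m)
    var[in_annot] += frac_annot * h2 / in_annot.sum()
    causal = rng.random(m) < p_causal
    # Assumption: rescale by 1/p_causal so the expected total stays at h2.
    beta = np.where(causal, rng.normal(0.0, np.sqrt(var / p_causal)), 0.0)
    return beta, var
```

A phenotype for standardized genotypes G could then be formed as G @ beta plus independent noise with variance 1 - h2.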

We simulated phenotypes and summary statistics using PLINK131 (see URLs), with 100 replicates
for each genetic architecture. We then ran the multiple-tissue analysis
as described above for every method on each of the simulated data sets, and for each method and
each simulated genetic architecture we performed FDR correction within the set of 100 simulated
phenotypes. Results are displayed in Figure S12.
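
The text above does not name the FDR procedure. As one common choice, the sketch below implements the Benjamini-Hochberg step-up rule; treating this as the procedure used here is an assumption, and the function name `benjamini_hochberg` is hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest P-value with fdr * i / m.
    passed = p[order] <= fdr * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.nonzero(passed)[0].max()   # largest rank that passes
        reject[order[:cutoff + 1]] = True      # reject everything up to that rank
    return reject
```

Applied to all tissue-by-replicate P-values for one method and one architecture, the mask marks the enrichments that would be reported as significant.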
Of the five methods tested (LDSC-SEG, SNPsea, DEPICT (1e-5), DEPICT (5e-8), and MAGMA),
only LDSC-SEG and SNPsea correctly reported no significant enrichments passing FDR<5% for
all 3 null simulations (scenarios 1-3). In particular, DEPICT with a threshold of 1e-5 reported
significant enrichments at FDR<5% for all three null simulations (scenarios 1-3), while DEPICT
with a threshold of 5e-8 reported significant enrichments at FDR<5% for the sparse null simulation
(scenario 2). MAGMA correctly reported no significant enrichment for the null simulations with no
enrichment (scenarios 1-2) but reported a large number of significant enrichments at FDR<5% for
the null simulation with enrichment in exons (scenario 3). This is consistent with the fact that
MAGMA does not control for exon content.
All five methods reported significant cortex enrichments at FDR<5% for the sparse causal
simulation (scenario 5), but only MAGMA and LDSC-SEG reported significant cortex enrichments
for the polygenic causal simulation (scenario 4). These simulations, together with the analysis of
real phenotypes described above, indicate that only LDSC-SEG and SNPsea control type I error,
and that of these two methods, LDSC-SEG is better powered for polygenic traits.

Acknowledgements

We are thankful to Tune Pers, Sam Riesenfeld, Rebecca Herbst, Adrian Veres, and Eran Hodis for
helpful conversations. This research has been conducted using the UK Biobank Resource (Application
Number: 16549). This research was funded by NIH grants R01 MH107649, R01 MH109978 and
U01 CA194393. HKF is supported by the Fannie and John Hertz Foundation. Data were gener-
ated as part of the PsychENCODE Consortium, supported by: U01MH103339, U01MH103365,
U01MH103392, U01MH103340, U01MH103346, R01MH105472, R01MH94714, R01MH105898,
R21MH102791, R21MH105881, R21MH103877, and P50MH106934 awarded to: Schahram Ak-
barian (Icahn School of Medicine at Mount Sinai), Gregory Crawford (Duke), Stella Dracheva
(Icahn School of Medicine at Mount Sinai), Peggy Farnham (USC), Mark Gerstein (Yale), Daniel
Geschwind (UCLA), Thomas M. Hyde (LIBD), Andrew Jaffe (LIBD), James A. Knowles (USC),
Chunyu Liu (UIC), Dalila Pinto (Icahn School of Medicine at Mount Sinai), Nenad Sestan (Yale),
Pamela Sklar (Icahn School of Medicine at Mount Sinai), Matthew State (UCSF), Patrick Sullivan
(UNC), Flora Vaccarino (Yale), Sherman Weissman (Yale), Kevin White (UChicago) and Peter
Zandi (JHU).

URLs

* LDSC software, including LDSC-SEG:
https://github.com/bulik/ldsc.

* Gene sets and LD scores from this paper:
https://data.broadinstitute.org/alkesgroup/LDSCORE/.

* GTEx:
http://www.gtexportal.org.

* Franke lab data:
https://data.broadinstitute.org/mpg/depict/depict-download/tissue-expression.

* Cahoy et al. data:
http://jneurosci.org/content/supp/2008/01/03/28.1.264.DC1, see Tables S4-S6.

* PsychENCODE:
https://www.synapse.org//#!Synapse:syn4921369/wiki/235539.

* ImmGen:
https://www.immgen.org/.

* Roadmap Epigenomics:
http://www.roadmapepigenomics.org.

* GERA data set (database of Genotypes and Phenotypes (dbGaP), phs000674.v1.p1):
http://www-ncbi-nlm-nih-gov.libproxy.mit.edu/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1.

* PLINK:
https://www.cog-genomics.org/plink2.

* makegenes.sh:
https://github.com/freeseek/gwaspipeline

Chapter 5

An atlas of genetic correlations across


human diseases and traits

Identifying genetic correlations between complex traits and diseases can provide useful etiological in-
sights and help prioritize likely causal relationships. The major challenges preventing estimation of
genetic correlation from genome-wide association study (GWAS) data with current methods are the
lack of availability of individual genotype data and widespread sample overlap among meta-analyses.
We circumvent these difficulties by introducing a technique - cross-trait LD Score regression - for
estimating genetic correlation that requires only GWAS summary statistics and is not biased by
sample overlap. We use our method to estimate 300 genetic correlations among 25 traits, totaling
more than 1.5 million unique phenotype measurements. Our results include genetic correlations
between anorexia nervosa and schizophrenia, and between anorexia nervosa and obesity, as well as
associations between educational attainment and several diseases. These results highlight the power
of genome-wide analyses, since there currently are no genome-wide significant SNPs for anorexia
nervosa and only three for educational attainment.1

1 The material in this chapter previously appeared in the September 2015 edition of Nature Genetics as "An atlas
of genetic correlations across human diseases and traits" by Brendan Bulik-Sullivan*, Hilary Finucane* et al.9 (*
co-first).

Introduction

Understanding the complex relationships among human traits and diseases is a fundamental goal
of epidemiology. Randomized controlled trials and longitudinal studies are time-consuming and
expensive, so many potential risk factors are studied using cross-sectional correlation studies at a
single time point. Obtaining causal inferences from such studies can be challenging, due to issues
such as confounding and reverse causation, which can lead to spurious associations and mask the
effects of real risk factors.132,133 Genetics can help elucidate cause and effect, since inherited genetic
risks cannot be subject to reverse causation and are correlated with a smaller list of confounders.
The first methods for testing for genetic overlap were family studies.134-138 In order to estimate
genetic overlaps among many pairs of phenotypes, family designs require measuring multiple traits
on the same individuals. Consequently, it is challenging to scale family designs to a large number of
traits, especially traits that are difficult or costly to measure (e.g., low-prevalence diseases).
Genome-wide association studies (GWAS) produce effect-size estimates for specific genetic variants, so it
is possible to test for shared genetics by looking for correlations in effect-sizes across traits, which
does not require measuring multiple traits per individual.
A widely-used technique for testing for relationships between pairs of phenotypes using GWAS
data is Mendelian randomization (MR).132,133 In MR, SNPs are used as instrumental variables to
establish a relationship between a risk factor and disease. A set of SNPs associated with the risk
factor is tested for association with disease. Under the strong assumption that these SNPs affect
disease status only through the risk factor, an association allows us to conclude that the risk factor
has a causal effect on disease status. MR is effective for traits where significant associations account
for a substantial fraction of heritability.139,140 For many complex traits, heritability is distributed
over thousands of variants with small effects, and the proportion of heritability accounted for by
significantly associated variants at current sample sizes is small.2 In such situations, MR suffers
from both lower power and bias.141,142

A complementary approach is to estimate genetic correlation, a quantity that includes the
effects of all SNPs, including those that do not reach genome-wide significance (Methods). Genetic
correlation is meaningful both for quantitative traits and diseases. For pairs of diseases, genetic
correlation can be interpreted as the genetic analogue of comorbidity. The two main existing
techniques for estimating genetic correlation from GWAS data are restricted maximum likelihood
(REML)4,12,21,85,143,144 and polygenic scores.145,146 These methods have only been applied to a few
traits, because they require individual genotype data, which are difficult to obtain due to informed
consent limitations.
In order to overcome these limitations, we have developed a technique for estimating genetic
correlation using only GWAS summary statistics that is not biased by sample overlap. Our method,
cross-trait LD Score regression, is a simple extension of single-trait LD Score regression13 and is
computationally very fast. We apply this method to data from 25 GWAS and report genetic
correlations for 300 pairs of phenotypes, demonstrating shared genetic bases for many complex
diseases and traits.

Results

Overview of Methods

The method presented here for estimating genetic correlation from summary statistics relies on the
fact that the GWAS effect-size estimate for a given SNP incorporates the effects of all SNPs in
linkage disequilibrium (LD) with that SNP.13,29 For a polygenic trait, SNPs with high LD will have
higher $\chi^2$ statistics on average than SNPs with low LD.13 A similar relationship holds if we replace
$\chi^2$ statistics for a single study with the product of z-scores from two studies of traits with non-zero
genetic correlation.
More precisely, under a polygenic model,4,21 the expected value of $z_{1j} z_{2j}$ is

$$E[z_{1j} z_{2j}] = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}} \qquad (5.1)$$

where $N_i$ is the sample size for study $i$, $\rho_g$ is genetic covariance (defined in Methods), $\ell_j$ is LD Score,13
$N_s$ is the number of individuals included in both studies, and $\rho$ is the phenotypic correlation among
the $N_s$ overlapping samples. We derive this equation in the Supplementary Note. If study 1 and

study 2 are the same study, then Equation 5.1 reduces to the single-trait result from ref. 13, because
genetic covariance between a trait and itself is heritability, and $\chi^2 = z^2$. As a consequence of
Equation 5.1, we can estimate genetic covariance using the slope from the regression of $z_{1j} z_{2j}$ on LD
Score, which is computationally very fast (Methods).
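
As a short check of the single-study reduction, written out in LaTeX: when the two studies coincide, every individual overlaps and the phenotype is perfectly correlated with itself. The substitutions follow from the definitions in the text; the per-SNP subscript on $\chi_j^2$ is notation added here for clarity.

```latex
\begin{align*}
E[z_{1j} z_{2j}] &= \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}} \\
\intertext{Setting $N_1 = N_2 = N_s = N$, $\rho_g = h_g^2$ and $\rho = 1$:}
E[\chi_j^2] = E[z_j^2] &= \frac{N h_g^2}{M}\,\ell_j + 1 .
\end{align*}
```

The intercept of 1 is the expected $\chi^2$ statistic of a SNP with LD Score zero in the absence of confounding.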
Sample overlap creates spurious correlation between $z_{1j}$ and $z_{2j}$ and inflates $z_{1j} z_{2j}$, but the
expected magnitude of this inflation is uniform across all markers, and in particular does not depend
on LD Score. As a result, sample overlap only affects the intercept from this regression (the term
$\rho N_s/\sqrt{N_1 N_2}$) and not the slope, so the estimates of genetic correlation will not be biased by sample
overlap. Similarly, shared population stratification will alter the intercept but have minimal impact
on the slope, because the correlation between LD Score and the rate of genetic drift is minimal.13
If we are willing to assume no shared population stratification, and we know the amount of sample
overlap and phenotypic correlation in advance (i.e., the true value of $\rho N_s/\sqrt{N_1 N_2}$), we can constrain
the intercept to this value. We refer to this approach as constrained intercept LD Score regression.
Constrained intercept LD Score regression has lower standard error - often by as much as 30% - than
LD Score regression with unconstrained intercept, but will yield biased and misleading estimates
if the intercept is misspecified, e.g., if we miscount the overlapping samples or do not control for
population stratification.
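
The free-intercept and constrained-intercept regressions described above can be sketched in a few lines of Python. This is a minimal illustration, not the LDSC software: it assumes unweighted ordinary least squares, one sample size per study, and noiseless toy inputs set to the expectation in Equation 5.1. The function names `ols` and `genetic_covariance` and all numeric values are hypothetical.

```python
import math


def ols(x, y):
    """Unweighted least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def genetic_covariance(zprod, ld_scores, n1, n2, m, intercept=None):
    """Estimate rho_g from the slope of z1*z2 on LD Score.

    zprod holds the per-SNP products z_1j * z_2j. If `intercept` is given it
    is held fixed (constrained intercept) and the slope is fit through the
    origin after subtracting it. Returns (rho_g estimate, intercept).
    """
    if intercept is None:
        intercept, slope = ols(ld_scores, zprod)
    else:
        num = sum(l * (p - intercept) for l, p in zip(ld_scores, zprod))
        den = sum(l * l for l in ld_scores)
        slope = num / den
    # Equation 5.1: slope = sqrt(N1 * N2) * rho_g / M
    return slope * m / math.sqrt(n1 * n2), intercept


# Toy inputs (assumed values, for illustration only).
n1, n2, n_s, m = 50_000, 40_000, 10_000, 1_000_000
rho_g_true, rho = 0.2, 0.3
ld = [1.0 + 0.5 * j for j in range(200)]
true_intercept = rho * n_s / math.sqrt(n1 * n2)
zprod = [math.sqrt(n1 * n2) * rho_g_true * l / m + true_intercept for l in ld]

est_free, est_intercept = genetic_covariance(zprod, ld, n1, n2, m)
est_fixed, _ = genetic_covariance(zprod, ld, n1, n2, m, intercept=true_intercept)
```

With noiseless inputs both fits recover the assumed covariance. Changing `n_s` or `rho` moves only the intercept, which is why the slope is unaffected by sample overlap; passing a wrong `intercept` to the constrained fit biases the estimate.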
Normalizing genetic covariance by the SNP-heritabilities yields genetic correlation: $r_g := \rho_g/\sqrt{h_1^2 h_2^2}$,
where $h_i^2$ denotes the SNP-heritability21 from study $i$. Genetic correlation ranges between -1 and
1. Results similar to Equation 5.1 hold if one or both studies are case/control studies, in which case
genetic covariance is on the observed scale (Supplementary Note). There is no distinction between
observed and liability scale genetic correlation for case/control traits, so we can talk about genetic
correlation between a case/control trait and a quantitative trait and genetic correlation between
pairs of case/control traits without the need to specify a scale (Supplementary Note).

Simulations

We performed a series of simulations to evaluate the robustness of the model to potential confounders
such as sample overlap and model misspecification, and to verify the accuracy of the standard error

estimates (Methods).

Table 5.1 shows cross-trait LD Score regression estimates and standard errors from 1,000 simulations
of quantitative traits. For each simulation replicate, we generated two phenotypes for each
of 2,062 individuals in our sample by drawing effect sizes for approximately 600,000 SNPs on
chromosome 2 from a bivariate normal distribution. We then computed summary statistics for both
phenotypes and estimated heritability and genetic correlation with cross-trait LD Score regression.
The summary statistics were generated from completely overlapping samples. These simulations
confirm that cross-trait LD Score regression yields accurate estimates
of the true genetic correlation and that the standard errors match the standard deviation across
simulations. Thus, cross-trait LD Score regression is not biased by sample overlap, in contrast to
estimation of genetic correlation via polygenic risk scores, which is biased in the presence of sample
overlap.146 We also evaluated simulations with one quantitative trait and one case/control study
and show that cross-trait LD Score regression can be applied to binary traits and is not biased by
oversampling of cases (Table C.1).

Parameter    Truth   Estimate   SD      SE
$h^2$        0.58    0.58       0.072   0.075
$\rho_g$     0.29    0.29       0.057   0.058
$r_g$        0.50    0.49       0.079   0.073
Table 5.1: Simulations with complete sample overlap. Truth shows the true parameter values. Estimate
shows the average cross-trait LD Score regression estimate across 1000 simulations. SD shows the standard
deviation of the estimates across 1000 simulations, and SE shows the mean cross-trait LD Score regression
SE across 1000 simulations. Further details of the simulation setup are given in the Methods.
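
As a quick arithmetic check on the Truth column of Table 5.1, the normalization $r_g = \rho_g/\sqrt{h_1^2 h_2^2}$ can be evaluated directly. The sketch below assumes both simulated traits share the single heritability listed in the table (the table gives one $h^2$ row, so this is an assumption); the function name is hypothetical.

```python
import math


def genetic_correlation(rho_g, h2_1, h2_2):
    """Normalize genetic covariance by the two SNP-heritabilities."""
    return rho_g / math.sqrt(h2_1 * h2_2)


# Truth values from Table 5.1, assuming the same h2 for both traits.
r_g = genetic_correlation(rho_g=0.29, h2_1=0.58, h2_2=0.58)
```

Under that assumption the result agrees with the tabulated true $r_g$ of 0.50.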

Estimates of heritability and genetic covariance can be biased if the underlying model of genetic
architecture is misspecified, e.g., if variance explained is correlated with LD Score or MAF.47
Because genetic correlation is estimated as a ratio, it is more robust; biases that affect the numer-
ator and the denominator in the same direction tend to cancel. We obtain approximately correct
estimates of genetic correlation even in simulations with models of genetic architecture where our
estimates of heritability and genetic covariance are biased (Table C.2).

Replication of Psychiatric Cross-Disorder Results

As technical validation, we replicated the estimates of genetic correlations among psychiatric disorders
obtained with individual genotypes and REML in ref. 85, by applying cross-trait LD Score regression
to summary statistics from the same data.148 These summary statistics were generated from
non-overlapping samples, so we applied cross-trait LD Score regression using both unconstrained and
constrained intercepts (Methods). Results from these analyses are shown in Figure 5-1. The results
from cross-trait LD Score regression were similar to the results from REML. Cross-trait LD Score
regression with constrained intercept gave standard errors that were only slightly larger than those
from REML, while the standard errors from cross-trait LD Score regression with unconstrained
intercept were substantially larger, especially for traits with small sample sizes (e.g., ADHD, ASD).

Application to Summary Statistics From 25 Phenotypes

We used cross-trait LD Score regression to estimate genetic correlations among 25 phenotypes
(URLs, Methods). Genetic correlation estimates for all 300 pairwise combinations of the 25 traits
are shown in Figure 5-2. For clarity of presentation, the 25 phenotypes were restricted to contain
only one phenotype from each cluster of closely related phenotypes (Methods). Genetic correla-
tions among the educational, anthropometric, smoking, and insulin-related phenotypes that were
excluded from Figure 5-2 are shown in Table C.4 and Figures C-1, C-2 and C-3, respectively.
References and sample sizes are shown in Table C.3.
For the majority of pairs of traits in Figure 5-2, no GWAS-based genetic correlation estimate has
been reported; however, many associations have been described informally based on the observation
of overlap among genome-wide significant loci. Examples of genetic correlations that are consistent
with overlap among top loci include the correlations between plasma lipids and cardiovascular
disease;140 age at onset of menarche and obesity;2 type 2 diabetes, obesity, fasting glucose, plasma
lipids and cardiovascular disease;54 birth weight, adult height and type 2 diabetes;149,150 birth length,
adult height and infant head circumference;151,152 and childhood obesity and adult obesity.1 5 ' For
many of these pairs of traits, we can reject the null hypothesis of zero genetic correlation with
overwhelming statistical significance (e.g., $p < 10^{-20}$ for age at onset of menarche and obesity).

[Figure 5-1 plot; legend: REML, LDSC, LDSC no intercept]

Figure 5-1: Replication of Psychiatric Cross-Disorder Results. This plot compares cross-trait LD Score
regression estimates of genetic correlation using the summary statistics from ref. 148 to estimates obtained from
REML with the same data.85 The horizontal axis indicates pairs of phenotypes, and the vertical axis
indicates genetic correlation. Error bars are standard errors. Green is REML; orange is LD Score regression
with unconstrained intercept and white is LD Score regression with constrained intercept. The estimates of
genetic correlation among psychiatric phenotypes in Figure 5-2 use larger sample sizes; this analysis is
intended as a technical validation.
Abbreviations: ADHD = attention deficit hyperactivity disorder; ASD = autism spectrum disorder; BPD = bipolar
disorder; MDD = major depressive disorder; SCZ = schizophrenia.

The first section of Table 5.2 lists genetic correlation results that are consistent with epidemiological
associations, but, as far as we are aware, have not previously been reported using genetic
data. The estimates of the genetic correlation between age at onset of menarche and adult height, 15
triglycerides154 and type 2 diabetes5155 are consistent with the epidemiological associations. The
estimate of a negative genetic correlation between anorexia nervosa and obesity suggests that the
same genetic factors influence normal variation in BMI as well as dysregulated BMI in psychiatric
illness. This result is consistent with the observation that BMI GWAS findings implicate neuronal,
rather than metabolic, cell-types and epigenetic marks s4t The negative genetic correlation between

[Figure 5-2 plot: matrix of pairwise genetic correlation estimates among the 25 traits]

Figure 5-2: Genetic Correlations among 25 GWAS. Blue represents positive genetic correlations; red
represents negative. Larger squares correspond to more significant p-values. Genetic correlations that are
different from zero at 1% FDR are shown as full-sized squares. Genetic correlations that are significantly
different from zero after Bonferroni correction for the 300 tests in this figure have an asterisk. We show
results that do not pass multiple testing correction as smaller squares in order to avoid whiting out positive
controls where the estimate points in the expected direction, but does not achieve statistical significance due
to small sample size. This multiple testing correction is conservative, since the tests are not independent.
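
For orientation, the number of tests and a Bonferroni cutoff can be computed directly, and an FDR rule can be sketched alongside. The Python below assumes a family-wise level of 0.05 and the Benjamini-Hochberg procedure for FDR control; the caption names neither, so both are assumptions, and the example p-values are hypothetical rather than taken from the figure.

```python
from math import comb

n_traits = 25
n_tests = comb(n_traits, 2)            # unordered pairs of traits
alpha = 0.05                           # assumed family-wise error level
bonferroni_threshold = alpha / n_tests


def benjamini_hochberg(pvalues, fdr):
    """Return a list of booleans, True where the test is rejected at `fdr`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= fdr * k / m sets the rejection cutoff.
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_pvalues = [1e-6, 0.003, 0.03, 0.2, 0.9]
flags = benjamini_hochberg(example_pvalues, fdr=0.01)
```

Under these assumptions, 25 traits give 300 pairwise tests and a Bonferroni cutoff of 0.05/300, roughly 1.7 x 10^-4.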

adult height and coronary artery disease agrees with a replicated epidemiological association.156-158

We observe several significant associations with the educational attainment phenotypes from
Rietveld et al.:48 we estimate a statistically significant negative genetic correlation between college and
Alzheimer's disease, which agrees with epidemiological results.159,160 The positive genetic correlation
between college and bipolar disorder is consistent with previous epidemiological reports.161,162
The estimate of a negative genetic correlation between smoking and college is consistent with the
observed differences in smoking rates as a function of educational attainment.163

The second section of Table 5.2 lists three results that are, to the best of our knowledge, new both

Section          Phenotype 1               Phenotype 2           $r_g$ (se)      p-value
Epidemiological  Age at Menarche           Height (Adult)        0.11 (0.03)     6 x 10^-5 **
                 Age at Menarche           Type 2 Diabetes       -0.13 (0.04)    3 x 10^-3
                 Age at Menarche           Triglycerides         -0.15 (0.04)    1 x 10^-3
                 Coronary Artery Disease   Age at Menarche       -0.11 (0.05)    4 x 10^-2
                 Coronary Artery Disease   College (Yes/No)      -0.278 (0.07)   1 x 10^-4 **
                 Coronary Artery Disease   Height (Adult)        -0.17 (0.05)    2 x 10^-4 *
                 Alzheimer's               College (Yes/No)      -0.30 (0.08)    1 x 10^-4 **
                 Bipolar Disorder          College (Yes/No)      0.26 (0.064)    6 x 10^-5 **
                 Obesity (Adult)           College (Yes/No)      -0.23 (0.04)    2 x 10^-8 **
                 Triglycerides             College (Yes/No)      -0.30 (0.04)    5 x 10^-12 **
                 Anorexia Nervosa          Obesity (Adult)       -0.20 (0.04)    4 x 10^-6 **
                 Ever/Never Smoker         College (Yes/No)      -0.39 (0.07)    1 x 10^-9 **
                 Ever/Never Smoker         Obesity (Adult)       0.22 (0.05)     7 x 10^-5 **
New/Nonzero      Autism Spectrum Disorder  College (Yes/No)      0.28 (0.08)     5 x 10^-4 *
                 Ulcerative Colitis        Childhood Obesity     -0.33 (0.08)    3.9 x 10^-5 **
                 Anorexia Nervosa          Schizophrenia         0.19 (0.04)     1.5 x 10- **
New/Low          Schizophrenia             Alzheimer's           0.05 (0.05)     0.58
                 Schizophrenia             Ever/Never Smoker     0.03 (0.06)     0.26
                 Schizophrenia             Triglycerides         -0.05 (0.04)    0.21
                 Schizophrenia             LDL Cholesterol       -0.02 (0.03)    0.64
                 Schizophrenia             HDL Cholesterol       0.03 (0.04)     0.50
                 Schizophrenia             Rheumatoid Arthritis  -0.05 (0.05)    0.38
                 Crohn's Disease           Rheumatoid Arthritis  -0.02 (0.09)    0.83
                 Ulcerative Colitis        Rheumatoid Arthritis  -0.09 (0.09)    0.33

Table 5.2: Genetic correlation estimates, standard errors and p-values for selected pairs of traits. Results
are grouped into genetic correlations that are new genetic results, but are consistent with established
epidemiological associations ("Epidemiological"), genetic correlations that are new both to genetics and
epidemiology ("New/Nonzero") and interesting null results ("New/Low"). The p-values are uncorrected p-
values. Results that pass multiple testing correction for the 300 tests in Figure 5-2 at 1% FDR have a single
asterisk; results that pass Bonferroni correction have two asterisks. We present some genetic correlations
that agree with epidemiological associations but that do not pass multiple testing correction in these data.

to genetics and epidemiology. One, we find a positive genetic correlation between anorexia nervosa
and schizophrenia. Comorbidity between eating and psychotic disorders has not been thoroughly
investigated in the psychiatric literature,164,165 and this result raises the possibility of similarity
between these classes of disease. Two, we estimate a negative genetic correlation between ulcerative
colitis (UC) and childhood obesity. The relationship between premorbid BMI and ulcerative colitis
is not well-understood; exploring this relationship may be a fruitful direction for further investigation.
Three, we estimate a positive genetic correlation between autism spectrum disorder (ASD)
and educational attainment (which has very high genetic correlation with IQ48,166,167). The ASD
summary statistics were generated using a case-pseudocontrol study design, so this result cannot
be explained by oversampling of ASD cases from the more highly educated parents, which is
observed epidemiologically.168 The distribution of IQ among individuals with ASD has lower mean
than the general population, but with heavy tails 169 (i.e., an excess of individuals with low and
high IQ). There is also emerging evidence that the genetic architecture of ASD varies across the IQ
distribution. 170

The third section of Table 5.2 lists interesting examples where the genetic correlation is close
to zero with small standard error. The low genetic correlation between schizophrenia and rheumatoid
arthritis is interesting because schizophrenia has been observed to be protective for rheumatoid
arthritis,171 though the epidemiological effect is weak, so it is possible that there is a real genetic
correlation that is too small for us to detect. The low genetic correlation between schizophrenia and
smoking is notable because of the increased tobacco use (both prevalence and number of cigarettes
per day) among individuals with schizophrenia.172 The low genetic correlation between schizophrenia
and plasma lipid levels contrasts with a previous report of pleiotropy between schizophrenia and
triglycerides.173 Pleiotropy (unsigned) is different from genetic correlation (signed; see Methods);
however, the pleiotropy reported by Andreassen et al.173 could be explained by the sensitivity
of the method used to the properties of a small number of regions with strong LD, rather than
trait biology (Figure C-5). We estimate near-zero genetic correlation between Alzheimer's disease
and schizophrenia. The genetic correlations between Alzheimer's disease and the other psychiatric
traits (anorexia nervosa, bipolar disorder, major depression, ASD) are also close to zero, but with larger
standard errors, due to smaller sample sizes. This suggests that the genetic basis of Alzheimer's
disease is distinct from psychiatric conditions. Last, we estimate near-zero genetic correlation between
rheumatoid arthritis (RA) and both Crohn's disease (CD) and UC. Although these diseases
share many associated loci,24,174 there appears to be no directional trend: some RA risk alleles
are also risk alleles for UC and CD, but many RA risk alleles are protective for UC and CD,174
yielding near-zero genetic correlation. This example highlights the distinction between pleiotropy
and genetic correlation (Methods).
Finally, the estimates of genetic correlations among metabolic traits are consistent with the estimates
obtained using REML in Vattikuti et al.143 (Supplementary Table C-4), and are directionally
consistent with the recent Mendelian randomization results from Wuertz et al.175 The estimate of
0.57 (0.074) for the genetic correlation between CD and UC is consistent with the estimate of 0.62
(0.042) from Chen et al.144

Discussion

We have described a new method for estimating genetic correlation from GWAS summary statistics,
which we applied to a dataset of GWAS summary statistics consisting of 25 traits and more than
1.5 million unique phenotype measurements. We reported several new findings that would have
been difficult to obtain with existing methods, including a positive genetic correlation between
anorexia nervosa and schizophrenia. Our method replicated many previously-reported GWAS-
based genetic correlations, and confirmed observations of overlap among genome-wide significant
SNPs, MR results and epidemiological associations.
This method is an advance for several reasons: it does not require individual genotypes, genome-
wide significant SNPs or LD-pruning (which loses information if causal SNPs are in LD). Our method
is not biased by sample overlap and is computationally fast. Furthermore, our approach does not
require measuring multiple traits on the same individuals, so it scales easily to studies of thousands
of pairs of traits. These advantages allow us to estimate genetic correlation for many more pairs of
phenotypes than was possible with existing methods.
The challenges in interpreting genetic correlation are similar to the challenges in MR. We high-
light two difficulties. First, genetic correlation is immune to environmental confounding, but is
subject to genetic confounding, analogous to confounding by pleiotropy in MR. For example,
the genetic correlation between HDL and CAD in Figure 5-2 could result from a causal effect
HDL → CAD, but could also be mediated by triglycerides (TG),140,176 represented graphically177
as HDL ← G → TG → CAD, where G is the set of genetic variants with effects on both HDL and
TG. Extending genetic correlation to multiple genetically correlated phenotypes is an important
direction for future work.178 Second, although genetic correlation estimates are not biased by
oversampling of cases, they are affected by other forms of biased sampling, such as misclassification85
and matching (e.g., a BMI-matched study of T2D).
We note several limitations of cross-trait LD Score regression as an estimator of genetic corre-
lation. First, cross-trait LD Score regression requires larger sample sizes than methods that use
individual genotypes in order to achieve equivalent standard error. Second, cross-trait LD Score
regression is not currently applicable to samples from recently-admixed populations. Third, we
have not investigated the potential impact of assortative mating on estimates of genetic correla-
tion, which remains as a future direction. Fourth, methods built from polygenic models, such as
cross-trait LD Score regression and REML, are most effective when applied to traits with polygenic
genetic architectures. For traits where significant SNPs account for a sizable proportion of heri-
tability, analyzing only these SNPs can be more powerful. Developing methods that make optimal
use of both large-effect SNPs and diffuse polygenic signal is a direction for future research.
Despite these limitations, we believe that the cross-trait LD Score regression estimator of genetic
correlation will be a useful addition to the epidemiological toolbox, because it allows for rapid
screening for correlations among a diverse set of traits, without the need for measuring multiple
traits on the same individuals or genome-wide significant SNPs.

Methods

Definition of Genetic Covariance and Correlation

All definitions refer to narrow-sense heritabilities and genetic covariances. Let $S$ denote a set of
$M$ SNPs, let $X$ denote a vector of additively (0-1-2) coded genotypes for the SNPs in $S$, and
let $y_1$ and $y_2$ denote phenotypes. Define $\beta := \mathrm{argmax}_{\alpha \in \mathbb{R}^M} \mathrm{Cor}[y_1, X\alpha]$, where the maximization
is performed in the population (i.e., in the infinite data limit). Let $\gamma$ denote the corresponding
vector for $y_2$. This is a projection, so $\beta$ is unique modulo SNPs in perfect LD. Define $h_S^2$, the
heritability explained by SNPs in $S$, as $h_S^2(y_1) := \sum_{j \in S} \beta_j^2$ and $\rho_S(y_1, y_2)$, the genetic covariance among
SNPs in $S$, as $\rho_S(y_1, y_2) := \sum_{j \in S} \beta_j \gamma_j$. The genetic correlation among SNPs in $S$ is
$r_S(y_1, y_2) := \rho_S(y_1, y_2)/\sqrt{h_S^2(y_1) h_S^2(y_2)}$, which lies in $[-1, 1]$. Following ref. 21, we use subscript $g$ (as in $h_g^2$, $\rho_g$, $r_g$) when
the set of SNPs is genotyped and imputed SNPs in GWAS.

SNP genetic correlation ($r_g$) is different from family study genetic correlation. In a family
study, the relationship matrix captures information about all genetic variation, not just common
SNPs. As a result, family studies estimate the total genetic correlation ($S$ equals all variants).
Unlike the relationship between SNP-heritability21 and total heritability, for which $h_g^2 \le h^2$, no
similar relationship holds between SNP genetic correlation and total genetic correlation. If $\beta$ and
$\gamma$ are more strongly correlated among common variants than rare variants, then the total genetic
correlation will be less than the SNP genetic correlation.

Genetic correlation is (asymptotically) proportional to Mendelian randomization estimates. If we
use a genetic instrument $g_i := \sum_{j \in S} X_{ij} \beta_j$ to estimate the effect $b_{12}$ of $y_1$ on $y_2$, the 2SLS estimate is
$\hat{b}_{2SLS} := g^T y_2 / g^T y_1$.141 The expectations of the numerator and denominator are $E[g^T y_2] = \rho_S(y_1, y_2)$
and $E[g^T y_1] = h_S^2(y_1)$. Thus, $\mathrm{plim}_{N \to \infty} \hat{b}_{2SLS} = \rho_S(y_1, y_2)/h_S^2(y_1) = r_S(y_1, y_2) \sqrt{h_S^2(y_2)/h_S^2(y_1)}$. If we use the same set $S$
of SNPs to estimate $b_{12}$ and $b_{21}$ (e.g., if $S$ is the set of all common SNPs, as in the genetic correlation
analyses in this paper), then this procedure is symmetric in $y_1$ and $y_2$.

Genetic correlation is different from pleiotropy. Two traits have a pleiotropic relationship if
many variants affect both. Genetic correlation is a stronger condition than pleiotropy: to exhibit
genetic correlation, the directions of effect must also be consistently aligned.
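
The distinction can be illustrated with a small simulation. The Python sketch below applies the sums-of-products definitions from this section to hypothetical per-SNP effect vectors, treating SNPs as independent; the effect sizes, the random seed and the function name are assumptions made only for illustration.

```python
import math
import random


def genetic_correlation(beta, gamma):
    """r = sum(beta*gamma) / sqrt(sum(beta^2) * sum(gamma^2))."""
    h2_1 = sum(b * b for b in beta)
    h2_2 = sum(g * g for g in gamma)
    rho = sum(b * g for b, g in zip(beta, gamma))
    return rho / math.sqrt(h2_1 * h2_2)


rng = random.Random(0)
m = 10_000
beta = [rng.gauss(0.0, 1.0) for _ in range(m)]

# Pleiotropy without a directional trend: every SNP affects both traits,
# but the sign of the second effect is random.
gamma_unsigned = [b * rng.choice((-1.0, 1.0)) for b in beta]

# Pleiotropy with consistently aligned directions of effect.
gamma_aligned = [0.5 * b for b in beta]

r_unsigned = genetic_correlation(beta, gamma_unsigned)
r_aligned = genetic_correlation(beta, gamma_aligned)
```

Every SNP is pleiotropic in both scenarios, yet only the aligned case produces a genetic correlation far from zero.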

Reverse Causation

Consider a scenario where a risk factor $E_1$ causes a disease $D$, but incidence of disease $D$ changes
post-morbid levels of $E_1$ (this could occur if, e.g., incidence of disease persuades affected individuals
to change their behavior in ways that lower $E_1$). If $D$ is sufficiently common in our GWAS sample,
then the genetic correlation may be affected by reverse causation. LD Score regression (or any genetic
correlation estimator) will yield a consistent estimate of the cross-sectional genetic correlation
between $E_1$ and $D$ at the given timepoint; however, the cross-sectional genetic correlation between
$E_1$ and $D$ will be attenuated relative to the genetic correlation between disease and pre-morbid
levels of $E_1$. The genetic correlation between disease and pre-morbid levels of the risk factor will
typically be the more interesting quantity to estimate, because it is more closely related to the
causal effect of $E_1$ on $D$. We can estimate this quantity by excluding all post-morbid measurements
of the risk factor from the risk factor GWAS. This allows us to circumvent reverse causation, at
the cost of a small decrease in sample size. If $D$ is uncommon, then modification of behavior after
onset of $D$ will account for only a small fraction of the population variance in $E_1$, so the effect of
reverse causation on the genetic correlation will be small. Thus, reverse causation is primarily a
concern for high-prevalence diseases.

Genetic Correlation vs Comorbidity

It is possible for two diseases to display excess comorbidity without genetic correlation and vice
versa.
Examples of genetic correlation without comorbidity tend to arise in situations where two diseases
are mutually exclusive for technical or diagnostic reasons. For example, consider the pairs
{schizophrenia, bipolar} and {Crohn's disease, ulcerative colitis}. Both pairs have $r_g > 60\%$, but
are mutually exclusive diagnostic categories and technically cannot co-occur (though in practice,
some fraction of patients initially diagnosed with one are reclassified as having the other).
For an example of large excess comorbidity with small genetic correlation, consider two diseases,
denoted $D_1$ and $D_2$. Suppose that $D_1$ has prevalence 0.1%, $D_2$ has prevalence 1% and that $D_1$ is a
cause of $D_2$ (e.g., perhaps $D_1$ is Crohn's disease and $D_2$ is colorectal cancer). If 10% of individuals
with $D_1$ develop $D_2$, then this means that $D_1$ increases risk for $D_2$ 10-fold (a very large effect).
Nevertheless, 99% of cases of $D_2$ will be unrelated to $D_1$, so the genetic correlation between $D_1$ and
$D_2$ induced by the causal path $D_1 \to D_2$ will be small.
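
The arithmetic behind this example can be checked in a few lines of Python. The prevalences and the 10% figure come from the text; the variable names are illustrative, and the risk ratio is computed relative to individuals without $D_1$, which is one reasonable reading of "10-fold".

```python
prev_d1 = 0.001        # prevalence of D1 (0.1%)
prev_d2 = 0.01         # prevalence of D2 (1%)
p_d2_given_d1 = 0.10   # 10% of individuals with D1 develop D2

# Joint probability of having both diseases.
p_both = prev_d1 * p_d2_given_d1

# Risk of D2 among individuals without D1, by the law of total probability.
p_d2_given_not_d1 = (prev_d2 - p_both) / (1.0 - prev_d1)

# Relative risk of D2 for individuals with D1 versus without.
risk_ratio = p_d2_given_d1 / p_d2_given_not_d1

# Share of D2 cases that also have D1.
frac_d2_cases_with_d1 = p_both / prev_d2
```

The risk ratio comes out at roughly 10, while only about 1% of $D_2$ cases also have $D_1$, matching the 99% figure in the text.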

Biased Sampling

We show in the Supplementary Note that LD Score regression is robust to oversampling of cases
in case/control studies, modulo transformation between observed and liability scale heritability and
genetic covariance. Oversampling of cases is the most common form of biased sampling, but it is far from
the only form of biased sampling. For example, we know that high BMI is a major risk factor
for T2D. If we wish to discover genetic variants that influence risk for T2D via mechanisms other
than BMI, we may wish to perform a case/control study for T2D where we compare BMI-matched
cases and controls. If we were to use such a T2D study and a random population study of BMI to
compute the genetic correlation between BMI and T2D, the result would be substantially attenuated
relative to the population genetic correlation between T2D and BMI. (Note that this example holds
irrespective of whether there is sample overlap and applies to all genetic correlation estimators, not
just LD Score regression.)

More generally, let $s_i = 1$ denote the event that individual $i$ is selected into our study, and
let $C_i$ denote a vector of covariates describing individual $i$. Then we can represent an arbitrary
biased sampling scheme by specifying the probability function $f(C_i) := P[s_i = 1 \mid C_i]$. Suppose that
phenotypes are generated following the model from section 1.1 of the Supplementary Note, but that
our sample is selected following the biased sampling scheme $f$. Let $a_{ij}$ denote the additive genetic
component for phenotype $j$ in individual $i$. If there is no direct ascertainment on genotype (i.e.,
if $C_i$ does not include genotypes), then the proof of proposition 1 in the Supplementary Note goes
through, except that $\rho$ is replaced with $E[y_{i1} y_{i2} \mid s_i = 1]$ and $\rho_g$ is replaced with $E[a_{i1} a_{i2} \mid s_i = 1]$.

This has two practical implications: first, in studies with biased sampling schemes and sample
overlap, if one wishes to constrain the intercept, one should use the sample correlation between
phenotypes $\hat{\rho}$ rather than the population correlation $\rho$. Under biased sampling,
$\operatorname{plim}_{N \to \infty} \hat{\rho} = E[y_{i1} y_{i2} \mid s_i = 1]$, which is typically not equal to $\rho$. Second, even if there is no sample overlap,
biased sampling can affect the genetic correlation estimate. If the biased sampling mechanism (i.e.,
the function $f(C_i) := P[s_i = 1 \mid C_i]$) is known, then it may be possible to explicitly model the
biased sampling and derive a function for converting genetic correlation estimates from the biased
sample to population genetic correlations (similar to the derivations in sections 1.3 and 1.4 of the
Supplementary Note). If the biased sampling mechanism can only be described qualitatively, then
it should at least be possible to guess the magnitude and direction of the bias by reasoning about
$E[y_{i1} y_{i2} \mid s_i = 1]$ and $E[a_{i1} a_{i2} \mid s_i = 1]$.
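
To illustrate how selection can move the in-sample phenotypic correlation away from its population value, the following toy simulation may help. It is a sketch under our own assumptions: the selection rule (keep an individual only if the first phenotype is positive), the population correlation of 0.5, and all function names are hypothetical and are not taken from the Supplementary Note.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(n=20000, rho=0.5, seed=0):
    """Return (correlation in everyone, correlation among selected individuals)."""
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        y1.append(a)
        y2.append(rho * a + math.sqrt(1 - rho ** 2) * b)
    # Hypothetical biased sampling scheme: s_i = 1 only if y1 > 0.
    keep = [i for i in range(n) if y1[i] > 0]
    sel1 = [y1[i] for i in keep]
    sel2 = [y2[i] for i in keep]
    return pearson(y1, y2), pearson(sel1, sel2)

rho_pop, rho_sel = simulate()
print(f"all individuals: {rho_pop:.2f}; selected individuals: {rho_sel:.2f}")
```

Under this particular rule the correlation among selected individuals is attenuated, so constraining the intercept with the population value would use the wrong quantity.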

Cross-Trait LD Score Regression

We estimate genetic covariance by regressing $z_{1j} z_{2j}$ against $\ell_j \sqrt{N_{1j} N_{2j}}$ (where $N_{ij}$ is the sample
size for SNP $j$ in study $i$), then multiplying the resulting slope by $M$, the number of SNPs in the
reference panel with MAF between 5% and 50% (technically, this is an estimate of $\rho_{g,5\text{-}50\%}$; see the
Supplementary Note).
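
The regression just described can be sketched on simulated data. The block below is a toy illustration rather than the ldsc implementation: it uses unweighted least squares, assumes no sample overlap, and draws made-up LD Scores uniformly between 1 and 200. All parameter values and function names are our own assumptions.

```python
import math
import random

def ols(x, y):
    """Ordinary least squares with an intercept; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate_z_products(m, n1, n2, h2_1, h2_2, rho_g, rng):
    """Draw toy LD Scores and z-score products for m SNPs (no sample overlap)."""
    x, y = [], []
    for _ in range(m):
        ell = rng.uniform(1.0, 200.0)                 # toy LD Score
        v1 = n1 * h2_1 * ell / m + 1.0                # Var[z_1j]
        v2 = n2 * h2_2 * ell / m + 1.0                # Var[z_2j]
        c = math.sqrt(n1 * n2) * rho_g * ell / m      # Cov[z_1j, z_2j]
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = math.sqrt(v1) * a
        z2 = (c / math.sqrt(v1)) * a + math.sqrt(v2 - c * c / v1) * b
        x.append(ell * math.sqrt(n1 * n2))            # regressor: l_j * sqrt(N_1 N_2)
        y.append(z1 * z2)                             # response: z_1j * z_2j
    return x, y

M, N1, N2 = 20000, 50000, 50000
x, y = simulate_z_products(M, N1, N2, 0.5, 0.5, 0.3, random.Random(1))
intercept, slope = ols(x, y)
rho_g_hat = slope * M   # genetic covariance estimate: slope times M
print(f"estimated genetic covariance: {rho_g_hat:.3f} (simulated value 0.3)")
```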
If we know the amount of sample overlap ahead of time, we can reduce the standard error by
constraining the intercept with the --constrain-intercept flag in ldsc. This works even if there is
nonzero sample overlap, in which case the intercept should be constrained to $N_s \rho / \sqrt{N_1 N_2}$ (for pairs
of binary traits, we give a corresponding expression in terms of the number of overlapping cases and
controls in the Supplementary Note). We recommend using the in-sample estimate of $\rho$ (denoted
$\hat{\rho}$), rather than the population value of $\rho$. Under unbiased sampling, $\hat{\rho}$ is consistent for $\rho$ with
$O(1/N)$ variance, so in this case the distinction between $\rho$ and $\hat{\rho}$ is not of great importance. Under
biased sampling (as discussed in the previous section), the expected LD Score regression intercept
depends on the expected sample correlation $E[y_{i1} y_{i2} \mid s_i = 1]$ (which is estimated consistently by
$\hat{\rho}$), not the population $\rho$. Thus, we advise using $\hat{\rho}$ rather than $\rho$ when constraining the intercept.

Regression Weights

For heritability estimation, we use the regression weights from ref. 13. If effect sizes for both phenotypes
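are drawn from a bivariate normal distribution, then the optimal regression weights for genetic
covariance estimation are the reciprocals of the conditional variances

$$\operatorname{Var}[z_{1j} z_{2j} \mid \ell_j] = \left(\frac{N_1 h_1^2 \ell_j}{M} + 1\right)\left(\frac{N_2 h_2^2 \ell_j}{M} + 1\right) + \left(\frac{\sqrt{N_1 N_2}\,\rho_g \ell_j}{M} + \frac{\rho N_s}{\sqrt{N_1 N_2}}\right)^2 \tag{5.2}$$

(Supplementary Note). This quantity depends on several parameters ($h_1^2$, $h_2^2$, $\rho_g$, $\rho$, $N_s$) which are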
not known a priori, so it is necessary to estimate them from the data. We compute the weights in
two steps:

1. The first regression is weighted using heritabilities from the single-trait LD Score regressions,
$\rho N_s = 0$, and $\rho_g$ estimated as $\hat{\rho}_g := (\bar{\ell} \sqrt{N_1 N_2})^{-1} \sum_j z_{1j} z_{2j}$.

2. The second regression is weighted using the estimates of $\rho N_s$ and $\rho_g$ from step 1. The genetic
covariance estimate that we report is the estimate from the second regression.
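
The variance expression in Equation (5.2) translates directly into a weight function. The sketch below is our own illustration, not code from ldsc: the function and argument names are hypothetical, and flooring the LD Score at 1 is an assumption made here to avoid very large weights.

```python
import math

def var_z1z2(ell, n1, n2, h2_1, h2_2, rho_g, rho, n_s, m):
    """Conditional variance of z_1j * z_2j given the LD Score (Equation 5.2)."""
    a = n1 * h2_1 * ell / m + 1.0
    b = n2 * h2_2 * ell / m + 1.0
    c = math.sqrt(n1 * n2) * rho_g * ell / m + rho * n_s / math.sqrt(n1 * n2)
    return a * b + c * c

def regression_weight(ell_ref, ell_reg, **params):
    """Inverse-variance weight, divided by the LD Score over regression SNPs."""
    # Flooring at 1 is an assumption of this sketch, not something stated in the text.
    return 1.0 / (var_z1z2(ell_ref, **params) * max(ell_reg, 1.0))

# Step 1 uses rho * N_s = 0 together with first-pass heritabilities and rho_g;
# step 2 would call the same function again with the step-1 estimates plugged in.
step1 = dict(n1=50000, n2=40000, h2_1=0.4, h2_2=0.3, rho_g=0.1, rho=0.0, n_s=0, m=1_000_000)
print(regression_weight(ell_ref=80.0, ell_reg=60.0, **step1))
```

As a sanity check on the formula, with complete sample overlap, $\rho = 1$, and no heritability, the expression reduces to 2, the variance of a $\chi^2_1$ statistic.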

Linear regression with weights estimated from the data is called feasible generalized least squares
(FGLS). FGLS has the same limiting distribution as WLS with optimal weights, so WLS p-values
are valid for FGLS. 1 We multiply the heteroskedasticity weights by $1/\ell_j$ (where $\ell_j$ is the LD Score with
sum over regression SNPs) in order to downweight SNPs that are overcounted. This is a heuristic:
the optimal approach is to rotate the data so that it is de-correlated, but this rotation matrix is
difficult to compute.

Two-Step Estimator

As noted in ref. 13, SNPs with very large effect sizes can result in large LD Score regression standard
errors for single-trait LD Score regression with unconstrained intercept; cross-trait LD Score regression
with unconstrained intercept behaves similarly. This is due to the well-known fact that linear
regression deals poorly with outliers in the response variable (LD Score regression with constrained
intercept is not nearly as adversely affected by large-effect SNPs). The solution proposed in ref. 13 was
to remove SNPs with $\chi^2 > 80$ from the LD Score regression. This is a satisfactory solution when the
goal is to estimate the LD Score regression intercept. If the goal is to distinguish polygenicity from
population stratification, and we are willing to assume that the population stratification is subtle,
such that SNPs with $\chi^2 > 80$ are much more likely to be real causal SNPs rather than artifacts,
then we can make the task much easier by removing those SNPs. However, this is unsatisfactory
if the goal is to estimate $h^2$: ignoring large-effect SNPs with $\chi^2 > 80$ would yield estimates of $h^2$
or $\rho_g$ biased towards zero. Therefore, for estimating $h^2$ or $\rho_g$, we take a two-step approach. The
first step is to estimate the LD Score regression intercept with all SNPs with $\chi^2 > 30$ removed
(i.e., all genome-wide significant SNPs; the threshold can be adjusted with the --two-step flag in
ldsc). The second step is to estimate $h^2$ or $\rho_g$ using all SNPs and constrained intercept LD Score
regression with the intercept constrained to the value from the first step (note that we account for
uncertainty in the intercept when computing a standard error; see the next section).
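
The two-step logic can be summarized in a short sketch. This is a simplified illustration and not the ldsc code: it uses unweighted least squares, and the function names and the toy input are our own assumptions.

```python
def ols(x, y):
    """Ordinary least squares with an intercept; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def slope_given_intercept(x, y, intercept):
    """Least-squares slope when the intercept is held fixed."""
    num = sum(xi * (yi - intercept) for xi, yi in zip(x, y))
    den = sum(xi * xi for xi in x)
    return num / den

def two_step(x, y, chisq, threshold=30.0):
    """Step 1: intercept from SNPs with chi^2 <= threshold. Step 2: slope from all SNPs."""
    keep = [i for i, c in enumerate(chisq) if c <= threshold]
    intercept, _ = ols([x[i] for i in keep], [y[i] for i in keep])
    slope = slope_given_intercept(x, y, intercept)
    return intercept, slope

# Noise-free toy input: the response is exactly 1 + 0.02 * x.
x = [float(i) for i in range(5000)]
y = [1.0 + 0.02 * xi for xi in x]
intercept, slope = two_step(x, y, chisq=y)
print(intercept, slope)
```

The point of the design is that large-effect SNPs are excluded only when estimating the intercept, and are retained when estimating the slope.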

Assessment of Statistical Significance via Block Jackknife

Summary statistics for SNPs in LD are correlated, so the OLS standard error will be biased downwards.
We estimate a heteroskedasticity-and-correlation-robust standard error with a block jackknife
over blocks of adjacent SNPs. This is the same procedure used in ref. 13, and gives accurate
standard errors in simulations (Table 5.1). We obtain a standard error for the genetic correlation
by using a ratio block jackknife over SNPs. The default setting in ldsc is 200 blocks per genome,
which can be adjusted with the --num-blocks flag.
For the two-step estimator, if we were to estimate the intercept in the first step, then obtain
a jackknife standard error for the second step treating the intercept as fixed, the standard error
would be biased downwards, because it would not take into account the uncertainty in the intercept.
Instead, we jackknife both steps of the procedure, which appropriately accounts for uncertainty in
the intercept and yields a valid standard error.
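
A generic delete-one-block jackknife can be sketched as follows. This is a minimal illustration under our own assumptions (roughly equal-sized blocks of adjacent records, pseudo-value form of the variance) and is not the ldsc implementation; `ratio_of_sums` is a hypothetical stand-in for any ratio estimator such as a genetic correlation. For the two-step estimator, the function passed as `estimator` would rerun both steps on each reduced data set.

```python
import math

def block_jackknife(data, estimator, n_blocks=200):
    """Delete-one-block jackknife; returns (estimate, standard error).

    `data` is a list of per-SNP records in genomic order, and `estimator`
    maps a list of records to a single number.
    """
    n = len(data)
    bounds = [round(b * n / n_blocks) for b in range(n_blocks + 1)]
    theta_all = estimator(data)
    pseudo = []
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        theta_minus_b = estimator(data[:lo] + data[hi:])   # leave block b out
        pseudo.append(n_blocks * theta_all - (n_blocks - 1) * theta_minus_b)
    est = sum(pseudo) / n_blocks
    var = sum((p - est) ** 2 for p in pseudo) / (n_blocks - 1) / n_blocks
    return est, math.sqrt(var)

def ratio_of_sums(records):
    """Ratio estimator over (numerator, denominator) records."""
    return sum(r[0] for r in records) / sum(r[1] for r in records)

records = [(float(i % 7), 2.0 + (i % 3)) for i in range(1000)]
est, se = block_jackknife(records, ratio_of_sums, n_blocks=20)
print(est, se)
```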

Computational Complexity

Let $N$ denote sample size and $M$ the number of SNPs. The computational complexities of the steps
involved in LD Score regression are as follows:

1. Computing summary statistics takes $O(MN)$ time.

2. Computing LD Scores takes $O(MN)$ time, though the $N$ for computing LD Scores need not
be large. We use the $N = 378$ Europeans from 1000 Genomes.

3. LD Score regression takes $O(M)$ time and space.

For a user who has already computed summary statistics and downloads LD Scores from our website
(URLs), the computational cost of LD Score regression is $O(M)$ time and space. For comparison,
REML takes $O(MN^2)$ time for computing the GRM and $O(N^3)$ time for maximizing the likelihood.
Practically, estimating LD Scores takes roughly an hour parallelized over chromosomes, and LD
Score regression takes about 15 seconds per pair of phenotypes on a 2014 MacBook Air with 1.7
GHz Intel Core i7 processor.

Simulations

We simulated quantitative traits under an infinitesimal model in 2062 controls from a Swedish
study. To simulate the standard scenario where many causal SNPs are not genotyped, we simu-
lated phenotypes by drawing causal SNPs from 622,146 best-guess imputed 1000 Genomes SNPs
on chromosome 2, then retained only the 90,980 HM3 SNPs with MAF above 5% for LD Score
regression.
We note that the simulations in ref. 13 show that single-trait LD Score regression is only minimally
biased by uncorrected population stratification and moderate ancestry mismatch between the reference
panel used for estimating LD Scores and the population sampled in GWAS. In particular,
LD Scores estimated from the 1000 Genomes reference panel are suitable for use with European-ancestry
meta-analyses. Put another way, LD Score is only minimally correlated with $F_{ST}$, and the
differences in LD Score among European populations are not so large as to bias LD Score regression.
Since we use the same LD Scores for cross-trait LD Score regression as for single-trait LD Score
regression, these results extend to cross-trait LD Score regression.

Summary Statistic Datasets

We selected traits for inclusion in the main text via the following procedure:

1. Begin with all publicly available non-sex-stratified European-only summary statistics.

2. Remove studies that do not provide signed summary statistics.

3. Remove studies not imputed to at least HapMap 2.

4. Remove studies that adjust for heritable covariates.179

5. Remove all traits with heritability z-score below 4. Genetic correlation estimates for traits
with heritability z-score below 4 are generally too noisy to interpret.

6. Prune clusters of correlated phenotypes (e.g., obesity classes 1-3) by picking the trait from
each cluster with the highest heritability z-score.

We then applied the following filters (implemented in the script munge_sumstats.py included
with ldsc):

1. For studies that provide a measure of imputation quality, filter to INFO above 0.9.

2. For studies that provide sample MAF, filter to sample MAF above 1%.

3. In order to restrict to well-imputed SNPs in studies that do not provide a measure of imputa-
tion quality, filter to HapMap3 61 SNPs with 1000 Genomes EUR MAF above 5%, which tend
to be well-imputed in most studies. This step should be skipped if INFO scores are available
for all studies.

4. If sample size varies from SNP to SNP, remove SNPs with effective sample size less than 0.67
times the 90th percentile of sample size.

5. Remove indels and structural variants.

6. Remove strand-ambiguous SNPs.

7. Remove SNPs whose alleles do not match the alleles in 1000 Genomes.
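
Several of the filters above are simple per-SNP predicates. The sketch below illustrates filters 1, 2, 4, 5, and 6 on a list of dictionaries. It is not the code of the munging script: the field names (`SNP`, `A1`, `A2`, `INFO`, `MAF`, `N`) and the percentile convention are assumptions made for illustration.

```python
import statistics

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_snp(a1, a2):
    """True for a biallelic single-nucleotide variant (rejects indels)."""
    return a1 in COMPLEMENT and a2 in COMPLEMENT and a1 != a2

def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose alleles are swapped by a strand flip."""
    return COMPLEMENT.get(a1) == a2

def apply_filters(records, info_min=0.9, maf_min=0.01, n_frac=0.67):
    """Keep records passing the INFO, MAF, sample-size, and allele filters."""
    n_values = [r["N"] for r in records]
    # 90th percentile of N; the 'inclusive' convention is an assumption.
    n_cut = n_frac * statistics.quantiles(n_values, n=10, method="inclusive")[8]
    kept = []
    for r in records:
        if not is_snp(r["A1"], r["A2"]):
            continue
        if is_strand_ambiguous(r["A1"], r["A2"]):
            continue
        if r.get("INFO") is not None and r["INFO"] <= info_min:
            continue
        if r.get("MAF") is not None and r["MAF"] <= maf_min:
            continue
        if r["N"] < n_cut:
            continue
        kept.append(r)
    return kept

records = [
    {"SNP": "rs1", "A1": "A", "A2": "G", "INFO": 0.95, "MAF": 0.20, "N": 1000},
    {"SNP": "rs2", "A1": "A", "A2": "T", "INFO": 0.95, "MAF": 0.20, "N": 1000},
    {"SNP": "rs3", "A1": "AT", "A2": "A", "INFO": 0.95, "MAF": 0.20, "N": 1000},
    {"SNP": "rs4", "A1": "C", "A2": "T", "INFO": 0.50, "MAF": 0.20, "N": 1000},
    {"SNP": "rs5", "A1": "C", "A2": "T", "INFO": 0.95, "MAF": 0.005, "N": 1000},
    {"SNP": "rs6", "A1": "G", "A2": "T", "INFO": 0.99, "MAF": 0.30, "N": 500},
]
kept = apply_filters(records)
print([r["SNP"] for r in kept])
```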

Genomic control (GC) correction at any stage biases the heritability and genetic covariance
estimates downwards (see the Supplementary Note of ref. 13). The biases in the numerator and denominator
of genetic correlation cancel exactly, so genetic correlation is not biased by GC correction.
A majority of the studies analyzed in this paper used GC correction, so we do not report genetic
covariance and heritability.
Data on Alzheimer's disease were obtained from the following source:

International Genomics of Alzheimer's Project (IGAP) is a large two-stage study based
upon genome-wide association studies (GWAS) on individuals of European ancestry. In stage 1,
IGAP used genotyped and imputed data on 7,055,881 single nucleotide polymorphisms (SNPs)
to meta-analyze four previously-published GWAS datasets consisting of 17,008 Alzheimer's
disease cases and 37,154 controls (The European Alzheimer's Disease Initiative, EADI; the
Alzheimer Disease Genetics Consortium, ADGC; The Cohorts for Heart and Aging Research
in Genomic Epidemiology consortium, CHARGE; The Genetic and Environmental Risk in AD
consortium, GERAD). In stage 2, 11,632 SNPs were genotyped and tested for association in an
independent set of 8,572 Alzheimer's disease cases and 11,312 controls. Finally, a meta-analysis
was performed combining results from stages 1 and 2.

We only used stage 1 data for LD Score regression.

Chapter 6

A statistical framework for gauging when disease subtypes can be detected from
principal components analysis of genotype data

Identifying latent disease subtypes is important both for elucidating the causes of disease, and for
effective clinical treatment. Genotype data, which has led to new insights about causal variants,
genes, pathways, and cell types for disease, has the potential to be informative also about disease
subtypes. A natural first approach to using genotype data to identify disease subtypes is to perform
Principal Components Analysis (PCA) on a genotype matrix of cases and to examine the first
eigenvector as potentially informative about the disease subtypes. Here, we use a result from random
matrix theory that has been applied before in the context of identifying geographic population
structure from genotype data to quantify the genetic difference between two diseases, analogous
to $F_{ST}$ for geographic populations. This allows us to lower bound the sample size that will be
needed for PCA to reflect the presence of disease subtypes as a function of the genetic architecture
of the disease subtypes. We use our result to lower bound the sample size needed to distinguish
schizophrenia and bipolar disorder, if they were considered subtypes of a single disease, determining
that 180,000 combined cases would be needed.¹

Introduction

Many diseases, including many psychiatric diseases, are suspected to have subtypes, but these
subtypes have not been fully defined and characterized. Different disease subtypes can have different
causal mechanisms, and so identifying disease subtypes is important both for elucidating the causes
of disease, and for effective clinical treatment.
Approaches to identifying disease subtypes have mostly relied on clinical data; for example, there
has been work using latent class analysis on a set of depression symptoms to identify depression
subtypes.180 Recent work181 uses genetic data to determine whether disease subtypes have different
genetic architectures, once candidate disease subtypes have been identified based on phenotypic
data. A controversial study182 claimed to use genetic data to identify subtypes of schizophrenia,
but was shown to be affected by artifacts such as linkage disequilibrium and population structure.183
We are not currently aware of any other work that has used genetic data alone to identify disease
subtypes.
Here, we investigate the utility of genetic data for identifying disease subtypes. Specifically, we
consider the case where we have full genotype data for a large set of cases and controls for a given
disease, and we wish to identify structure among the cases using principal components analysis
(PCA). We give a lower bound on the sample size that will be required to solve this problem by
considering an easier problem: using PCA to distinguish subtypes in a theoretical setting in which
our cases are a mixture of individuals with two distinct diseases, with no additional structure arising,
for example, from ancestry. We find that very large sample sizes will be needed to solve even this
easier problem formulation.
This problem is related to the question of when PCA on a genotype matrix of individuals from
two distinct ancestral populations will be informative about ancestry. In now-classical work,20
Patterson et al. used a result from random matrix theory to show that PCA is informative about
ancestry when the $F_{ST}$ between the two populations exceeds $1/\sqrt{NM}$, where $N$ is the total sample
size and $M$ is the number of SNPs. Here, we use the same result from random matrix theory
to derive a similar condition for differentiating between disease subtypes, deriving the equivalent
of $F_{ST}$ for a pair of disease subtypes. Because of the sample ascertainment procedure inherent
in case-control data, though, we obtain a lower bound on the sample size rather than a precise
threshold.

¹This work was done in collaboration with Nicolo Fusi, Yakir Reshef, Alex Bloemendal, Luke O'Connor, Alkes
Price, and Jennifer Listgarten. A large part of this work was done while I was a summer intern at Microsoft Research,
mentored by Jennifer Listgarten.
To translate our theoretical result into a practical bound, we consider the problem of differen-
tiating between two known diseases, schizophrenia and bipolar disorder, which we treat as latent
subtypes of a hypothetical disease. We use genome-wide association summary statistics for these
two diseases to estimate the parameters of a model of shared genetic architecture, and we use these
parameter values to estimate that roughly 90,000 cases for each of the two diseases would be required
before PCA will distinguish between them. We conclude that even in the idealized scenario of no
structure other than disease subtypes, current sample sizes will likely not suffice to identify disease
subtypes by applying PCA to cases.

Results

The BBP threshold for genetic data

Disease subtypes are one type of structure that could potentially be discovered from genetic data. A
more well-studied type of structure in genetic data is geographic population structure. Patterson et
al.20 show that under a model of neutral drift, the first principal component of an $N \times M$ genotype
matrix with individuals from two populations in equal proportions has non-negligible correlation
with the assignment of individuals to populations if and only if $F_{ST}$, a measure of the distance
between the two populations, is greater than $1/\sqrt{NM}$. In this section, we review the model under
which this result holds, and we then present a generalization of the model and use it to generalize
the result of Patterson et al. Below, we apply our general result to disease data.
First, we introduce some notation. Let $\Phi$ denote the standard normal cumulative distribution
function and $\phi$ denote the standard normal probability density function. For a matrix $A$, let $A_{i,\cdot}$
denote the $i$-th row, and $A_{\cdot,j}$ denote the $j$-th column. We will use $\mathrm{Bin}(k; n, p)$ to denote the binomial
probability mass function evaluated at $k$ with parameters $n$ and $p$.

In a classical population genetics model used by Patterson et al. (see also184-186), each SNP $m$
has an allele frequency $q_m$ in an ancestral population. The process of neutral drift results in random
allele frequencies $q_m^A$ and $q_m^B$ in populations $A$ and $B$, respectively, such that

$$E[q_m^A \mid q_m] = E[q_m^B \mid q_m] = q_m$$

and

$$\operatorname{Var}[q_m^A \mid q_m] = \operatorname{Var}[q_m^B \mid q_m] = F_{ST}\, q_m (1 - q_m).$$

For an individual $n$ in population $S_n \in \{A, B\}$, the individual's genotype $G_{nm}$ at SNP $m$ is drawn
from the binomial distribution $\mathrm{Bin}(2, q_m^{S_n})$. Assuming independence among individuals and among
SNPs, this determines the distribution of an $N \times M$ matrix $G$ of genotypes, where each column
corresponds to a SNP and each row corresponds to an individual from either population $A$ or $B$.
It is sometimes convenient to model the ancestral allele frequencies $q_m$ as being drawn i.i.d. from
some distribution so that the columns of $G$ are i.i.d. as well.
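
A genotype matrix of this kind can be simulated in a few lines. The sketch below makes two choices that the text leaves open and that should be read as assumptions: the drifted frequencies are drawn from a Beta distribution with the required mean and variance (the Balding-Nichols parameterization), and ancestral frequencies are drawn uniformly from [0.05, 0.95]. Function names are ours.

```python
import random

def drifted_frequency(q, fst, rng):
    """Draw an allele frequency with mean q and variance fst * q * (1 - q)."""
    a = q * (1 - fst) / fst
    b = (1 - q) * (1 - fst) / fst
    return rng.betavariate(a, b)

def simulate_genotypes(n_a, n_b, m, fst, rng):
    """Return (labels, G): population labels and an N x M matrix of 0/1/2 genotypes."""
    labels = ["A"] * n_a + ["B"] * n_b
    G = [[0] * m for _ in labels]
    for j in range(m):
        q = rng.uniform(0.05, 0.95)   # ancestral frequency (assumed distribution)
        freq = {"A": drifted_frequency(q, fst, rng),
                "B": drifted_frequency(q, fst, rng)}
        for i, s in enumerate(labels):
            # Bin(2, q_m^{S_n}) as the sum of two Bernoulli draws.
            G[i][j] = int(rng.random() < freq[s]) + int(rng.random() < freq[s])
    return labels, G

labels, G = simulate_genotypes(n_a=5, n_b=7, m=20, fst=0.05, rng=random.Random(0))
print(len(G), len(G[0]))
```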

More generally, let $G$ be an $N \times M$ matrix whose entries are generated by the following random
process. Each row $n$ of $G$ is labeled by $S_n \in \{A, B\}$. Random variables $\Theta_m$ are drawn independently
for each column, with $\Theta_m \sim \mathcal{D}_m$. In the above example, $\Theta_m$ is the pair $(q_m^A, q_m^B)$ of allele frequencies
in the two populations, drawn from a distribution centered at the ancestral allele frequency $q_m$. Each
entry $G_{nm}$ of $G$ is drawn independently from a distribution $D(S_n, \Theta_m)$ with mean $\mu(S_n, \Theta_m)$ and
variance $V(S_n, \Theta_m)$; in the above example, $D(S_n, (q_m^A, q_m^B)) = \mathrm{Bin}(2, q_m^{S_n})$, with
$\mu(S_n, (q_m^A, q_m^B)) = 2 q_m^{S_n}$ and $V(S_n, (q_m^A, q_m^B)) = 2 q_m^{S_n} (1 - q_m^{S_n})$.

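We require the distribution $\mathcal{D}_m$ to have the following three properties:

$$E[\mu(A, \Theta_m)] = E[\mu(B, \Theta_m)], \tag{6.1}$$

$$E[V(A, \Theta_m)] = E[V(B, \Theta_m)], \tag{6.2}$$

and

$$\operatorname{Var}[\mu(A, \Theta_m) - \mu(B, \Theta_m)] = \eta\, E[V(A, \Theta_m)] \tag{6.3}$$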

for some $\eta$ that does not depend on $m$. In our example, these can each be shown to hold, with

$$E[\mu(A, (q_m^A, q_m^B))] = E[\mu(B, (q_m^A, q_m^B))] = 2 q_m,$$

$$E[V(A, (q_m^A, q_m^B))] = E[V(B, (q_m^A, q_m^B))] = 2 q_m (1 - q_m)(1 - F_{ST}),$$

and

$$\eta = \frac{4 \operatorname{Var}(q_m^A - q_m^B)}{2 q_m (1 - q_m)(1 - F_{ST})} = \frac{4 F_{ST}}{1 - F_{ST}} \approx 4 F_{ST},$$

where the approximation holds when $F_{ST}$ is small. Finally, we model $\mathcal{D}_m$ as being drawn i.i.d.
from some distribution of distributions (for example, now the ancestral allele frequencies are also
random), and we suppose that $PN$ rows $n$ have $S(n) = A$ and $(1 - P)N$ rows have $S(n) = B$.
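
For the population structure example, the expected variance and the value of $\eta$ can be derived in a few steps. The following worked derivation is a sketch that additionally assumes $q_m^A$ and $q_m^B$ are independent given $q_m$, which the text does not state explicitly.

```latex
\begin{aligned}
E[V(A, \Theta_m)]
  &= E\!\left[2 q_m^A (1 - q_m^A) \mid q_m\right]
   = 2\left(q_m - \operatorname{Var}[q_m^A \mid q_m] - q_m^2\right) \\
  &= 2 q_m (1 - q_m) - 2 F_{ST}\, q_m (1 - q_m)
   = 2 q_m (1 - q_m)(1 - F_{ST}), \\[4pt]
\operatorname{Var}[\mu(A, \Theta_m) - \mu(B, \Theta_m)]
  &= 4 \operatorname{Var}(q_m^A - q_m^B)
   = 4 \cdot 2 F_{ST}\, q_m (1 - q_m), \\[4pt]
\eta
  &= \frac{8 F_{ST}\, q_m (1 - q_m)}{2 q_m (1 - q_m)(1 - F_{ST})}
   = \frac{4 F_{ST}}{1 - F_{ST}}.
\end{aligned}
```

The factor $q_m (1 - q_m)$ cancels, which is why $\eta$ does not depend on $m$.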

Given a random matrix $G$ drawn from such a process, we would like to normalize it in some
natural way to obtain a matrix $X$, and then characterize when the top eigenvector of $XX^T$ is
correlated with the assignment of individuals to classes $A$ and $B$. Let

$$\mu_m = P \mu(A, \Theta_m) + (1 - P) \mu(B, \Theta_m),$$

and let

$$s_m = E[V(A, \Theta_m)]^{1/2}.$$

Let

$$X_{nm} = \frac{1}{s_m} (G_{nm} - \mu_m).$$

The columns of $X$ are i.i.d. random vectors with mean zero. Using the law of total covariance, the
covariance matrix can be shown to be

$$\operatorname{Cov}(X_{\cdot,m}) \approx I + \eta v v^T,$$

where $v$ is an $N \times 1$ vector with $v_n = (1 - P)$ if $S_n = A$ and $-P$ if $S_n = B$. The eigenvalues of this
matrix are $(1 + N P (1 - P) \eta, 1, \ldots, 1)$, and the top eigenvector is $v$.

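Let $u$ be the top eigenvector of $XX^T$. To detect structure, we would like $u$ and $v$ to have a high
squared correlation. Applying results from random matrix theory,15-19 we see that as $N, M \to \infty$
with $N/M$ fixed,

$$\operatorname{corr}(u, v)^2 \to \begin{cases} 0 & \text{if } \sqrt{NM}\, P(1 - P)\eta < 1, \\[6pt] \dfrac{1 - \frac{1}{(P(1-P)\eta)^2 NM}}{1 + \frac{1}{P(1-P)\eta M}} & \text{if } \sqrt{NM}\, P(1 - P)\eta > 1. \end{cases} \tag{6.4}$$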

In the example of population structure, we do not have $s_m$ and $\mu_m$ and so we cannot exactly
compute $X$, but we can approximate them from the data. Equation (6.4) is shown to be applicable
to this case in Patterson et al.20 In particular, if $P = 0.5$, then we are powered to detect the
presence of structure if and only if $F_{ST} > 1/\sqrt{NM}$.
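
The threshold and the limiting squared correlation in Equation (6.4) are easy to evaluate numerically. The helper functions below are an illustrative sketch; their names are ours, and `eta_from_fst` uses the population structure value $\eta = 4F_{ST}/(1 - F_{ST})$.

```python
import math

def tau(n, m, p_a, eta):
    """Detection statistic sqrt(N M) * P * (1 - P) * eta."""
    return p_a * (1 - p_a) * math.sqrt(n * m) * eta

def predicted_sq_corr(n, m, p_a, eta):
    """Limiting squared correlation from Equation (6.4)."""
    if tau(n, m, p_a, eta) <= 1:
        return 0.0
    k = p_a * (1 - p_a) * eta
    return (1 - 1 / (k * k * n * m)) / (1 + 1 / (k * m))

def eta_from_fst(fst):
    """eta in the two-population drift example."""
    return 4 * fst / (1 - fst)

# Example: two equally sized populations with F_ST = 0.001.
eta = eta_from_fst(0.001)
for n in (500, 2000, 10000):
    print(n, round(tau(n, 10000, 0.5, eta), 2), round(predicted_sq_corr(n, 10000, 0.5, eta), 3))
```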

The liability threshold model

To discover disease subtypes, we will suppose that we are given a genotype matrix G for which each
row is the genotype vector for an individual who is either a disease-A case or a disease-B case. We
would like to apply this theory to determine under what circumstances the first principal component
of the genotype matrix of cases will be positively correlated with the true assignment of individuals to
disease subtypes. We will start by reviewing a standard model of case-control traits.187 We assume
an underlying population with allele frequencies $q_1, \ldots, q_M$ and Hardy-Weinberg equilibrium, so
that the genotype at SNP $m$ of a random individual in the population is distributed Binomial(2,
$q_m$). We will also assume linkage equilibrium, so that in the population, the genotype values of an
individual at two SNPs are independent.

We will model a binary trait $Y$ using a linear model with a probit link function, also known as
a liability threshold model.187 In particular, we will model $Y = 1\{\ell > T\}$ for some $T$, where

$$\ell = \mu + \sum_{m=1}^{M} b_m g_m + \epsilon,$$

with $\epsilon \sim N(0, 1 - h^2)$. Without loss of generality, we will constrain $\ell$ to have mean zero and
variance one in the underlying population, and let $h^2 = \operatorname{Var}(\sum_{m=1}^{M} b_m g_m)$. The constraint on
the mean implies that $\mu = -2 \sum_{m=1}^{M} b_m q_m$. It will also be convenient to write things in terms of
normalized genotypes,

$$x_m = \frac{g_m - 2 q_m}{\sqrt{2 q_m (1 - q_m)}}.$$

Let $\beta_m = b_m \sqrt{2 q_m (1 - q_m)}$ be the corresponding normalized effect sizes. Then

$$\ell = \sum_{m=1}^{M} \beta_m x_m + \epsilon. \tag{6.5}$$

If no individual $\beta_m$ is too large, then $\ell$ is approximately normally distributed in the population. We
let $K = 1 - \Phi(T) = \Pr(Y = 1)$ denote the population disease prevalence.

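In the Methods section, we prove the following two results about the mean and variance of the
genotype distribution in cases:

$$E[g_m \mid Y = 1, \beta] \approx 2 q_m + \gamma_m \sqrt{2 q_m (1 - q_m)}, \tag{6.6}$$

where $\gamma_m = \beta_m \phi(T) / K$, and

$$\operatorname{Var}(g_m \mid Y = 1, \beta) \approx 2 q_m (1 - q_m)(1 - \gamma_m^2) + \gamma_m \sqrt{2 q_m (1 - q_m)}\,(1 - 2 q_m). \tag{6.7}$$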
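
Equations (6.6) and (6.7) can be evaluated with the Python standard library, which may help build intuition for how small the shift in cases is. This is an illustrative sketch; the function name and the example inputs are our own assumptions.

```python
import math
from statistics import NormalDist

def case_genotype_moments(q, beta, prevalence):
    """Approximate mean and variance of a genotype in cases (Equations 6.6 and 6.7)."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - prevalence)        # T, with K = 1 - Phi(T)
    gamma = beta * std_normal.pdf(threshold) / prevalence  # gamma_m = beta_m * phi(T) / K
    scale = math.sqrt(2 * q * (1 - q))
    mean = 2 * q + gamma * scale
    var = 2 * q * (1 - q) * (1 - gamma ** 2) + gamma * scale * (1 - 2 * q)
    return mean, var

# A SNP with allele frequency 0.3 and normalized effect 0.01, for a 1% prevalence disease.
print(case_genotype_moments(q=0.3, beta=0.01, prevalence=0.01))
```

With $\beta_m = 0$ the function returns the population values $2q_m$ and $2q_m(1 - q_m)$, as expected.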

The BBP threshold for disease subtypes

Let $G$ be an $N \times M$ matrix of genotypes of cases for a disease that has two subtypes, $A$ and $B$. Of
the $N$ rows, $PN$ are cases for disease $A$ and have $S(n) = A$, and $(1 - P)N$ are cases for disease $B$
and have $S(n) = B$. We would like to model this matrix as being generated according to a process
that fits the framework described above, and apply Equation (6.4). However, while we can model
SNPs as independent in the underlying population, they may not be independent conditioned on
an individual being a case for a given disease. Thus Equation (6.4) will not apply directly, because
we cannot assume that the columns of G are independent. For now we will continue, assuming
the dependence is negligible, and in a subsequent section we will perform simulations to assess the
magnitude of the effect of non-independence of SNPs.

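We will follow standard practice in genetics, and model the pair of effects of SNP $m$ on two
traits, $(\beta_m^{(A)}, \beta_m^{(B)})$, as being drawn i.i.d. from a two-dimensional distribution with mean zero. We
will model each SNP as being causal or not with probability $p$, with a bivariate normal pair of effect
sizes for causal SNPs; i.e.,

$$(\beta_m^{(A)}, \beta_m^{(B)}) \sim \begin{cases} (0, 0) & \text{with prob. } 1 - p, \\[4pt] N\!\left((0, 0),\ \dfrac{1}{pM} \begin{bmatrix} h_A^2 & r_g h_A h_B \\ r_g h_A h_B & h_B^2 \end{bmatrix}\right) & \text{with prob. } p. \end{cases}$$

We then have, as in Equation (6.5),

$$\ell^{(S)} = \sum_{m=1}^{M} \beta_m^{(S)} x_m + \epsilon^{(S)} \tag{6.8}$$

for $S \in \{A, B\}$, where $(\epsilon^{(A)}, \epsilon^{(B)})$ are bivariate normal with arbitrary correlation. We let $K^{(A)}$ and
$K^{(B)}$ denote the respective disease prevalences.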

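Let $\Theta_m = (\beta_m^{(A)}, \beta_m^{(B)})$, and let $D(S_n, (\beta_m^{(A)}, \beta_m^{(B)}))$ be the distribution of genotypes for cases of
disease $S_n$, conditioned on the effect size $\beta_m^{(S_n)}$. In the previous section, we showed that

$$\mu(S_n, (\beta_m^{(A)}, \beta_m^{(B)})) = 2 q_m + \gamma_m^{(S_n)} \sqrt{2 q_m (1 - q_m)}$$

and

$$V(S_n, (\beta_m^{(A)}, \beta_m^{(B)})) = 2 q_m (1 - q_m)\left(1 - (\gamma_m^{(S_n)})^2\right) + \gamma_m^{(S_n)} \sqrt{2 q_m (1 - q_m)}\,(1 - 2 q_m).$$
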
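We now want to check that Equations (6.1), (6.2), and (6.3) are satisfied.
Because $\beta_m^{(A)}$ and $\beta_m^{(B)}$ have mean zero, so do $\gamma_m^{(A)}$ and $\gamma_m^{(B)}$. Thus,

$$E[\mu(A, \Theta_m)] = E[\mu(B, \Theta_m)] = 2 q_m,$$

satisfying Equation (6.1). We also have

$$E[V(A, \Theta_m)] = 2 q_m (1 - q_m)\left(1 - E[(\gamma_m^{(A)})^2]\right) \approx 2 q_m (1 - q_m),$$

where the approximation holds for $K > 10^{-4}$ because $\phi(T)/K < 4$ for $K > 10^{-4}$ and because
$\operatorname{Var}(\beta_m^{(S_n)}) < 1/M$, and so $E[(\gamma_m^{(A)})^2]$ is negligible. This also holds for $E[V(B, \Theta_m)]$, and so
Equation (6.2) is satisfied. To check Equation (6.3), we have

$$\begin{aligned} \operatorname{Var}[\mu(A, \Theta_m) - \mu(B, \Theta_m)] &= \operatorname{Var}\!\left[\gamma_m^{(A)} \sqrt{2 q_m (1 - q_m)} - \gamma_m^{(B)} \sqrt{2 q_m (1 - q_m)}\right] \\ &= 2 q_m (1 - q_m) \operatorname{Var}[\gamma_m^{(A)} - \gamma_m^{(B)}] \\ &\approx \eta\, E[V(A, \Theta_m)], \end{aligned}$$

where $\eta = \operatorname{Var}[\gamma_m^{(A)} - \gamma_m^{(B)}]$.

Thus, Equation (6.4) suggests that PCA on $X$ will be informative about disease subtypes if and
only if $\tau > 1$, where

$$\tau = P(1 - P) \sqrt{NM}\, \eta.$$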

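Specifically, that result suggests that the squared correlation between the top eigenvector and
the true assignment of individuals to disease subtypes will be approximately $\alpha$, where

$$\alpha = \begin{cases} 0 & \text{if } \tau < 1, \\[6pt] \dfrac{1 - \frac{1}{(P(1-P)\eta)^2 NM}}{1 + \frac{1}{P(1-P)\eta M}} & \text{if } \tau > 1. \end{cases} \tag{6.9}$$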

Because ascertainment induces correlations among SNPs, simulations are required to assess
whether either of these two phenomena holds. In the following section, we show that while $\alpha$ does
not accurately predict squared correlation and there is sometimes negligible correlation between the
top eigenvector and the true assignment when $\tau > 1$, the correlation is negligible when $\tau < 1$. This
allows us to find a lower bound on the necessary sample size, given an estimate of $\eta$, by determining
the $N$ at which $\tau$ passes one.
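
To make the lower bound concrete, the sketch below computes $\eta$ from model parameters and solves $\tau = 1$ for $N$. It rests on an assumption of ours about the effect size model, namely that each SNP has marginal variance $h_S^2/M$ for disease $S$ and covariance $r_g h_A h_B / M$ between diseases, and it uses all $M$ SNPs with no thresholding. The function names and example parameter values are hypothetical, and the output is not intended to reproduce any number reported in this chapter.

```python
import math
from statistics import NormalDist

def case_scale(prevalence):
    """phi(T) / K for a disease with prevalence K = 1 - Phi(T)."""
    nd = NormalDist()
    return nd.pdf(nd.inv_cdf(1 - prevalence)) / prevalence

def eta_subtypes(h2_a, h2_b, r_g, k_a, k_b, m):
    """Var[gamma_A - gamma_B], assuming Var[beta_S] = h2_S / M per SNP."""
    c_a, c_b = case_scale(k_a), case_scale(k_b)
    cov = r_g * math.sqrt(h2_a * h2_b)
    return (c_a ** 2 * h2_a + c_b ** 2 * h2_b - 2 * c_a * c_b * cov) / m

def tau(n, m, p_a, eta):
    """tau = P (1 - P) sqrt(N M) eta."""
    return p_a * (1 - p_a) * math.sqrt(n * m) * eta

def min_cases_for_tau_one(m, p_a, eta):
    """The N at which tau equals one, from solving tau(N) = 1."""
    return 1.0 / (m * (p_a * (1 - p_a) * eta) ** 2)

# Hypothetical parameter values chosen only for illustration.
eta = eta_subtypes(h2_a=0.5, h2_b=0.5, r_g=0.0, k_a=0.01, k_b=0.01, m=10000)
n_min = min_cases_for_tau_one(m=10000, p_a=0.5, eta=eta)
print(f"eta = {eta:.2e}, N at tau = 1: {n_min:,.0f}")
```

Under this assumption the probability $p$ of being causal cancels out of $\eta$ when all SNPs are used; it matters only once SNPs are selected by thresholding.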

Simulations to evaluate the BBP threshold for disease subtypes

We simulated genotypes and phenotypes to assess (i) whether the squared correlation between
the top eigenvector of the matrix of genotypes and the true assignment of individuals to disease
subtypes was negligible for $\tau < 1$, (ii) whether this squared correlation was non-negligible
for $\tau > 1$, and (iii) whether the squared correlation was approximately equal to $\alpha$. We
conducted 1,000 simulations. For each simulation, we sampled parameter settings uniformly from
the ranges listed in Table 6.1, retaining parameter settings for which there were at least 30 causal
SNPs in expectation, and for which the size of the simulation required was not prohibitively large.
For each simulation, we recorded $\tau$, $\alpha$, and the squared correlation of the top eigenvector with
the true assignment of individuals to diseases.

Table 6.1: Parameter ranges for simulations in Figure 6-1.

Parameter              Range
$h_A^2$, $h_B^2$       [0, 1]
$r_g$                  [-1, 1]
$P$                    [0, 1]
$K^{(A)}$, $K^{(B)}$   [0, 0.5]
$N$                    [100, 2000]
$M$                    [100, 2000]

We first examined the relationship of the squared correlation to $\tau$ (Figure 6-1a). We found that
while squared correlation was indeed negligible for $\tau < 1$, it remained low for some simulations in
which $\tau > 1$. We then examined the relationship of the squared correlation to $\alpha$ (Figure 6-1b).
Here, we again found that the predictions for independent SNPs were sometimes optimistic for the
liability threshold model. This leads us to a one-sided conclusion: if $\tau < 1$ then the first principal
component of $X$ will most likely be uninformative about disease subtypes. If $\tau > 1$, though,
we cannot accurately predict the squared correlation. Thus, we can lower bound the sample size
needed for non-negligible correlation by the sample size needed to achieve $\tau > 1$, but this bound is
one-sided: we cannot upper bound the sample size needed to achieve any particular correlation.

[Figure 6-1: two scatter plots with x-axes (a) tau and (b) alpha. Panel titles: (a) Liability threshold model. (b) Liability threshold model.]

Figure 6-1: Results using simulated genotypes and phenotypes. (a) When $\tau < 1$, squared correlation
is negligible. (b) The predicted squared correlation $\alpha$ is often larger than the true squared
correlation for $\tau > 1$. Together, these results allow us to lower bound the sample size needed for
non-negligible squared correlation, but do not allow us to upper bound the sample size needed to
obtain a particular squared correlation.

Choosing informative SNPs

When we have genotypes not just for cases for the disease subtypes, but also for controls, we may
be able to choose a set of SNPs to include in $X$ so that the distribution of $\Theta$ conditioned on having
been chosen by our process has a larger $\eta$ than had we included all SNPs.

For example, if we could observe the true $(\beta_m^{(A)}, \beta_m^{(B)})$ for each SNP $m$, we could compute
$(\gamma_m^{(A)}, \gamma_m^{(B)})$ and choose to include a SNP only if $(\gamma_m^{(A)} - \gamma_m^{(B)})^2 > t$ for some threshold $t$. This would
lead to a new distribution of $\Theta$ and a corresponding new $\eta$, but would also lead to a new $M$. For
such a threshold $t$, let $\eta_t$, $M_t$, and $\tau_t$ denote the resulting values of $\eta$, $M$, and $\tau$. Recall that $\tau$
is proportional to $\eta \sqrt{M}$. So if $\tau < 1$, there might be a threshold $t$ such that $\tau_t > 1$. (At a given
threshold, $M_t$ is random; we neglect this.)
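
The trade-off between a larger $\eta_t$ and a smaller $M_t$ can be explored with a toy simulation of the oracle rule described above. Everything in this sketch is an assumption made for illustration: the per-SNP standard deviations, the correlation between effects, the fraction of causal SNPs, and the function names. Because $N$ and $P$ are unchanged by thresholding, the ratio $\tau_t / \tau$ equals $\eta_t \sqrt{M_t} / (\eta \sqrt{M})$.

```python
import math
import random

def simulate_gamma_diffs(m, p, sd_a, sd_b, r, rng):
    """Draw gamma_A - gamma_B for m SNPs under the causal / non-causal mixture."""
    diffs = []
    for _ in range(m):
        if rng.random() < p:
            a, b = rng.gauss(0, 1), rng.gauss(0, 1)
            g_a = sd_a * a
            g_b = sd_b * (r * a + math.sqrt(1 - r * r) * b)
            diffs.append(g_a - g_b)
        else:
            diffs.append(0.0)   # non-causal SNP: both effects are zero
    return diffs

def tau_ratio(diffs, t):
    """tau_t / tau for the oracle rule 'keep SNP m if (gamma_A - gamma_B)^2 > t'."""
    kept = [d for d in diffs if d * d > t]
    eta = sum(d * d for d in diffs) / len(diffs)   # effects have mean zero
    eta_t = sum(d * d for d in kept) / len(kept)
    return (eta_t * math.sqrt(len(kept))) / (eta * math.sqrt(len(diffs)))

diffs = simulate_gamma_diffs(m=20000, p=0.1, sd_a=0.05, sd_b=0.05, r=0.5, rng=random.Random(0))
for t in (0.0, 0.001, 0.005):
    print(t, round(tau_ratio(diffs, t), 2))
```

With $t = 0$ the rule simply drops non-causal SNPs, and the ratio is close to $1/\sqrt{p}$; whether larger thresholds help further depends on the effect size distribution.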

In real data, we cannot accept or reject a SNP based on $(\beta_m^{(A)}, \beta_m^{(B)})$ because we do not observe
these values. Instead, we can impose a threshold $t$ on the overall association between controls and
the combined set of cases. Here, we will not focus on how to choose $t$, but instead we will consider
the best-case performance given oracle $t$. (See Methods.) This is consistent with our goal being to
lower bound the required sample size for good performance.
How much we gain from this procedure depends on the distribution of $(\beta_m^{(A)}, \beta_m^{(B)})$. For example,
if $P = 1/2$ and $K^{(A)} = K^{(B)}$, then the observed case-control association that we are thresholding on
is proportional to $\beta_m^{(A)} + \beta_m^{(B)}$ while $\gamma_m^{(A)} - \gamma_m^{(B)}$ is proportional to $\beta_m^{(A)} - \beta_m^{(B)}$. If $\beta_m^{(A)}$ and $\beta_m^{(B)}$
are distributed bivariate normal with equal variances then $\beta_m^{(A)} + \beta_m^{(B)}$ and $\beta_m^{(A)} - \beta_m^{(B)}$ are
independent random variables and thresholding will not increase $\tau$. On the other hand, if the
fraction of causal SNPs is less than one, then there is correlation between $(\beta_m^{(A)} + \beta_m^{(B)})^2$ and
$(\beta_m^{(A)} - \beta_m^{(B)})^2$. Sample size is also important, as it determines the level of noise in the
observed case-control association.

Parameter               Range
$h_g^2(A)$, $h_g^2(B)$  [0.2, 0.8]
$r_g$                   [-1, 1]
$p$                     [0, 1]
$K^{(A)}$, $K^{(B)}$    [0, 0.2]
$M$                     60,000

Table 6.2: Parameter ranges for simulations in Figure 6-2.

We compared the sample size needed to reach $\tau > 1$ under the original method in which all SNPs
are used, and under the thresholding procedure described above. Because our goal is to get a sense
for how much this might matter in practice, we generated random parameters within more realistic
ranges, given in Table 6.2. Note that our theory allows us to predict the behavior of PCA at
arbitrarily large sample sizes without performing simulations.
Our results are given in Figure 6-2. We find that thresholding can greatly decrease the sample
size necessary to reach $\tau > 1$ for some parameter settings, and has less of an effect for other
parameter settings.
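
The independence argument above can be checked with a small simulation. The sketch below is illustrative only: it ignores sampling noise, gives causal effects unit variance, and the function name, the retained fraction, and the value of the correlation are assumptions made for this example rather than settings from our analysis. It reports the mean squared difference of effects among SNPs selected on the squared sum, relative to the mean over all SNPs.

```python
import random

def mean_sq_diff_ratio(p_causal, r_g, n_snps=200_000, keep_frac=0.05, seed=0):
    """Ratio of mean (bA - bB)^2 among SNPs with the largest (bA + bB)^2
    to the mean (bA - bB)^2 over all SNPs (noise-free, illustrative)."""
    rng = random.Random(seed)
    sums_sq, diffs_sq = [], []
    for _ in range(n_snps):
        if rng.random() < p_causal:
            # Bivariate normal effects with unit variances and correlation r_g.
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            b_a = z1
            b_b = r_g * z1 + (1 - r_g ** 2) ** 0.5 * z2
        else:
            b_a = b_b = 0.0
        sums_sq.append((b_a + b_b) ** 2)
        diffs_sq.append((b_a - b_b) ** 2)
    # Keep the SNPs whose squared sum is in the top keep_frac.
    order = sorted(range(n_snps), key=lambda i: sums_sq[i], reverse=True)
    kept = order[: int(keep_frac * n_snps)]
    mean_kept = sum(diffs_sq[i] for i in kept) / len(kept)
    mean_all = sum(diffs_sq) / n_snps
    return mean_kept / mean_all

dense = mean_sq_diff_ratio(p_causal=1.0, r_g=0.57)   # every SNP causal
sparse = mean_sq_diff_ratio(p_causal=0.1, r_g=0.57)  # 10% of SNPs causal
print(f"all SNPs causal: ratio = {dense:.2f}")
print(f"10% of SNPs causal: ratio = {sparse:.2f}")
```

When every SNP is causal the ratio stays near one, so selecting on the sum does not enrich for large differences. When only a fraction of SNPs is causal, selection concentrates on causal SNPs and the ratio is well above one.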

Predicted sample size needed to distinguish schizophrenia and bipolar disorder

We used summary statistics for schizophrenia and bipolar disorder, computed in disjoint samples, to
estimate the parameters of our model, using a moment-based method (see Methods). We estimated
heritabilities of 0.28 and 0.23, respectively, with a genetic correlation of 0.57 and a probability $p$ of
being causal of 0.25.

[Plot omitted; horizontal axis: required n using all SNPs (x 100,000).]

Figure 6-2: Sample size required to achieve $\tau > 1$ when thresholding the observed association
statistic for the combined set of cases with controls vs. sample size required when all SNPs are
used.

Using these estimated parameter values, we found values of $\tau$ at a range of sample sizes, assuming
an equal split between schizophrenia cases and bipolar disorder cases. Our results are in Figure 6-3.
We find that $\tau$ exceeds one at a sample size of roughly 180,000 individuals, showing that we will
need at least approximately 90K cases for schizophrenia and 90K cases for bipolar disorder before
there is any non-trivial correlation between the top PC of the genotype matrix of these cases and
the assignment of individuals to disease.

[Plot omitted; horizontal axis: sample size (x 100,000).]

Figure 6-3: Tau as a function of sample size for the parameter values estimated from data on
schizophrenia and bipolar disorder. At $n = 180,000$, the threshold $\tau = 1$ is passed, indicating the
potential for non-trivial correlation between the top eigenvector and true assignment of individuals
to subtypes.

We next used our theory to determine the sensitivity of $\tau$ to each of the parameters that we
estimated. We first fixed the genetic correlation and probability of being causal at 0.57 and 0.25,
respectively, and we multiplied the two heritabilities by a single factor ranging from 0.5 to 1.5,
resulting in a range of heritabilities from (0.14, 0.115) to (0.42, 0.345). This had a large effect on
$\tau$; at the highest heritability the sample size required for $\tau > 1$ decreased from 180,000 to 100,000
(Figure 6-4a). We then repeated the process for genetic correlation, resulting in genetic correlation
varying from 0.285 to 0.855, with heritabilities fixed at 0.28 and 0.23, and probability of being
causal fixed at 0.25. At the lowest genetic correlation, this decreased the required sample size to
90,000 (Figure 6-4b). Finally, we varied the probability of being causal from 0.125 to 0.375. This
reduced the required sample size to 110,000 (Figure 6-4c).
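
The parameter grids used in this sensitivity analysis can be written down directly from the estimates. In the sketch below the number of grid points (eleven evenly spaced factors) and all names are assumptions for illustration; only the estimates and the endpoints 0.5 and 1.5 come from the text.

```python
# Estimated parameter values for schizophrenia (A) and bipolar disorder (B).
estimates = {"h2_A": 0.28, "h2_B": 0.23, "r_g": 0.57, "p": 0.25}

# Multiplicative factors 0.5, 0.6, ..., 1.5 (the spacing is an assumption).
factors = [0.5 + 0.1 * k for k in range(11)]

def scaled(names, factor):
    """Return a copy of the estimates with the named parameters scaled."""
    return {k: (v * factor if k in names else v) for k, v in estimates.items()}

heritability_grid = [scaled({"h2_A", "h2_B"}, f) for f in factors]
rg_grid = [scaled({"r_g"}, f) for f in factors]
p_grid = [scaled({"p"}, f) for f in factors]

print(heritability_grid[0], heritability_grid[-1])
```

Each grid varies one parameter (or the pair of heritabilities) while holding the others at their estimated values, matching the three panels of Figure 6-4.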

Discussion
Here, we used a combination of theory, simulations, and real data to estimate that using PCA
to differentiate between disease subtypes will require sample sizes in the hundreds of thousands, if
those subtypes have a joint genetic architecture similar to that of schizophrenia and bipolar disorder.
While our result does not apply to other methods that use only genotype data, it does highlight
the importance of using external phenotypic data, or perhaps functional genomic data, to identify
disease subtypes.
Our result would be difficult to obtain without the theory we use. Simulations under a liability
threshold model of low-prevalence diseases at large sample sizes are computationally intensive (since
the number of samples that must be simulated is much larger than the desired number of disease
cases), and so it would be difficult to simulate two diseases with a combined sample size of 180,000,
each with a prevalence of 1% and parameter values estimated from real data. However, our theory
gives us a way to understand what would happen at such large sample sizes.

[Plots omitted; three panels: (a) varying $h_g^2(A)$ and $h_g^2(B)$, (b) varying $r_g$, (c) varying $p$; horizontal axis in each panel: sample size (x 100,000).]

Figure 6-4: The effect of varying the parameters in the model. (a) $h_g^2(A)$ and $h_g^2(B)$ are multiplied
by a factor ranging from 0.5 (blue) to 1.5 (red), with $r_g$ and $p$ fixed. (b) $r_g$ is multiplied by a
factor ranging from 0.5 (blue) to 1.5 (red), with $h_g^2(A)$, $h_g^2(B)$, and $p$ fixed. (c) $p$ is multiplied by
a factor ranging from 0.5 (blue) to 1.5 (red), with $h_g^2(A)$, $h_g^2(B)$, and $r_g$ fixed.
Our lower bound is also potentially loose; it could be that many more than 180,000
cases would be required. First, as Figure 6-1 shows, the theory gives a lower bound that in some
cases is loose. Second, we are assuming that an oracle gives us the best threshold for
choosing SNPs; choosing the threshold from the available data may decrease performance. Third,
schizophrenia and bipolar disorder are clearly clinically distinct diseases. We may suspect that
disease subtypes that have not yet been defined will have a higher genetic correlation than the
genetic correlation between bipolar disorder and schizophrenia, since they are likely to have more
shared phenotypic features. Finally, we do not consider the issue of population structure; it could be
that identifying subtypes in a dataset with some population structure will require a larger sample
size than identifying subtypes in a dataset known to have no population structure. Thus, it is
likely that identifying subtypes of a disease such as schizophrenia using PCA on genotype data will
require a sample size well above 180,000.
Methods

Moments of the distribution of genotypes in cases: derivations

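Recall that
\[
\ell = \sum_{m=1}^{M} \beta_m x_m + \epsilon,
\]
with $\mathrm{Var}(\ell) = 1$ and $\mathrm{Var}\left(\sum_{m=1}^{M} \beta_m x_m\right) = h_g^2$. Because the $x_m$ are independent with unit variance,
we have $\mathrm{Var}(\ell \mid \beta, x_m) = 1 - (\beta_m x_m)^2$. By the CLT, $\ell \mid \beta, x_m$ is approximately normal. Thus
\[
\ell \mid x_m, \beta \sim N\left(\beta_m x_m,\ 1 - (\beta_m x_m)^2\right).
\]
Now,
\[
\begin{aligned}
\Pr(Y = 1 \mid x_m, \beta) &= \Pr(\ell > T \mid x_m, \beta) \\
&\approx 1 - \Phi\left(\frac{T - \beta_m x_m}{\sqrt{1 - (\beta_m x_m)^2}}\right) \\
&\approx 1 - \Phi(T) + \beta_m x_m \phi(T) \\
&= K + \beta_m \frac{g_m - 2q_m}{\sqrt{2q_m(1 - q_m)}} \phi(T),
\end{aligned}
\]
where the second approximation follows from a Taylor expansion around $\beta_m = 0$. Now we can
apply Bayes' rule to get genotype frequencies among cases. For $i = 0, 1, 2$,
\[
\begin{aligned}
\Pr(g_m = i \mid Y = 1, \beta) &= \frac{\Pr(Y = 1 \mid g_m = i, \beta)\Pr(g_m = i)}{\Pr(Y = 1)} \\
&\approx \frac{\left(K + \beta_m \frac{i - 2q_m}{\sqrt{2q_m(1 - q_m)}} \phi(T)\right)\mathrm{Bin}(i; 2, q_m)}{K} \\
&= \mathrm{Bin}(i; 2, q_m) + \beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}} (i - 2q_m)\mathrm{Bin}(i; 2, q_m).
\end{aligned}
\tag{6.10}
\]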
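Next, we can compute the expected genotype among cases:
\[
\begin{aligned}
E[g_m \mid Y = 1, \beta] &= \Pr(g_m = 1 \mid Y = 1, \beta) + 2\Pr(g_m = 2 \mid Y = 1, \beta) \\
&\approx \left(\mathrm{Bin}(1; 2, q_m) + \beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}} (1 - 2q_m)\mathrm{Bin}(1; 2, q_m)\right) \\
&\quad + 2\left(\mathrm{Bin}(2; 2, q_m) + \beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}} (2 - 2q_m)\mathrm{Bin}(2; 2, q_m)\right) \\
&= 2q_m(1 - q_m) + \beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}} (1 - 2q_m)2q_m(1 - q_m) \\
&\quad + 2q_m^2 + 2\beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}} (2 - 2q_m)q_m^2.
\end{aligned}
\]
Because $2q_m(1 - q_m) + 2q_m^2 = 2q_m$ and $(1 - 2q_m)2q_m(1 - q_m) + 2(2 - 2q_m)q_m^2 = 2q_m(1 - q_m)$, we can
conclude
\[
E[g_m \mid Y = 1, \beta] = 2q_m + \beta_m \frac{\phi(T)}{K} \sqrt{2q_m(1 - q_m)}. \tag{6.11}
\]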
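
As a numerical sanity check on Equation (6.11), the following sketch compares the first-order expression with the case genotype mean obtained by applying Bayes' rule to the normal liability model without the Taylor expansion. The function name and the values of $\beta_m$, $q_m$, and $K$ are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def case_genotype_mean(beta, q, K):
    """Return (exact, approx) mean genotype among cases at one SNP.

    exact applies Bayes' rule with Pr(Y = 1 | g) taken from the normal
    liability model; approx is the first-order expression (6.11).
    """
    T = STD_NORMAL.inv_cdf(1 - K)   # liability threshold for prevalence K
    sd_g = sqrt(2 * q * (1 - q))    # genotype standard deviation
    weights = []
    for i in range(3):
        x = (i - 2 * q) / sd_g                              # standardized genotype
        prior = comb(2, i) * q ** i * (1 - q) ** (2 - i)    # Bin(i; 2, q)
        p_case = 1 - STD_NORMAL.cdf((T - beta * x) / sqrt(1 - (beta * x) ** 2))
        weights.append(prior * p_case)
    exact = sum(i * w for i, w in enumerate(weights)) / sum(weights)
    approx = 2 * q + beta * STD_NORMAL.pdf(T) / K * sd_g
    return exact, approx

exact, approx = case_genotype_mean(beta=0.01, q=0.3, K=0.01)
print(f"exact = {exact:.6f}, first-order = {approx:.6f}")
```

For small effect sizes the two values agree closely, while both differ visibly from the population mean $2q_m$.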

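Next, we consider $\mathrm{Var}(g_m \mid Y = 1, \beta)$. First, we use Equations (6.10) and (6.11) to write
\[
\begin{aligned}
\mathrm{Var}(g_m \mid Y = 1, \beta) &= \sum_{i=0}^{2} (i - E[g_m \mid Y = 1, \beta])^2 \Pr(g_m = i \mid Y = 1, \beta) \\
&= \sum_{i=0}^{2} (i - (2q_m + C))^2 \left(\mathrm{Bin}(i; 2, q_m) + C'(i - 2q_m)\mathrm{Bin}(i; 2, q_m)\right) \\
&= \sum_{i=0}^{2} \left((i - 2q_m)^2 - 2(i - 2q_m)C + C^2\right) \left(\mathrm{Bin}(i; 2, q_m) + C'(i - 2q_m)\mathrm{Bin}(i; 2, q_m)\right),
\end{aligned}
\]
where $C = \beta_m \frac{\phi(T)}{K} \sqrt{2q_m(1 - q_m)}$ and $C' = \beta_m \frac{\phi(T)}{K\sqrt{2q_m(1 - q_m)}}$. Now we have
\[
\begin{aligned}
\mathrm{Var}(g_m \mid Y = 1, \beta) &= \sum_{i=0}^{2} (i - 2q_m)^2 \mathrm{Bin}(i; 2, q_m) + \sum_{i=0}^{2} (i - 2q_m)^2 C'(i - 2q_m)\mathrm{Bin}(i; 2, q_m) \\
&\quad - 2\sum_{i=0}^{2} (i - 2q_m)C\,\mathrm{Bin}(i; 2, q_m) - 2\sum_{i=0}^{2} (i - 2q_m)CC'(i - 2q_m)\mathrm{Bin}(i; 2, q_m) \\
&\quad + \sum_{i=0}^{2} C^2 \mathrm{Bin}(i; 2, q_m) + \sum_{i=0}^{2} C^2 C'(i - 2q_m)\mathrm{Bin}(i; 2, q_m).
\end{aligned}
\]
Using the moments of the binomial distribution, and then applying $C' \cdot 2q_m(1 - q_m) = C$, we can
simplify to
\[
\begin{aligned}
\mathrm{Var}(g_m \mid Y = 1, \beta) &= 2q_m(1 - q_m) + C' \cdot 2q_m(1 - q_m)(1 - 2q_m) \\
&\quad + 0 - 2CC'(2q_m(1 - q_m)) \\
&\quad + C^2 + 0 \\
&= 2q_m(1 - q_m) + C(1 - 2q_m) - C^2.
\end{aligned}
\]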
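
The simplification above is an algebraic identity for the first-order case distribution (6.10), so it can be confirmed by computing the variance directly from the three genotype probabilities. In this sketch $C'$ is treated as a free small number; the function name and the chosen values are assumptions for illustration.

```python
from math import comb, isclose

def case_genotype_variance(q, c_prime):
    """Variance of g under the first-order case distribution (6.10).

    c_prime plays the role of C' in the text.
    Returns (direct, closed_form).
    """
    binom = [comb(2, i) * q ** i * (1 - q) ** (2 - i) for i in range(3)]
    # Case genotype probabilities: Bin(i) + C' (i - 2q) Bin(i).
    pmf = [b + c_prime * (i - 2 * q) * b for i, b in enumerate(binom)]
    mean = sum(i * p for i, p in enumerate(pmf))
    direct = sum((i - mean) ** 2 * p for i, p in enumerate(pmf))
    C = c_prime * 2 * q * (1 - q)   # uses C' * 2q(1 - q) = C
    closed_form = 2 * q * (1 - q) + C * (1 - 2 * q) - C ** 2
    return direct, closed_form

direct, closed_form = case_genotype_variance(q=0.3, c_prime=0.05)
print(direct, closed_form)
```

The two numbers agree to floating-point precision, and the same check passes for other allele frequencies and values of $C'$.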

Thresholding SNPs

We will assume we have $N$ cases and $N$ controls, and that $PN$ of the cases are for subtype A while
$(1 - P)N$ are cases for subtype B. Without knowing the subtypes, we can compute the in-sample
difference in allele frequencies between cases and controls at SNP $m$. Let $D_m$ denote this allele
frequency difference divided by $\sqrt{2q_m(1 - q_m)}$; this is the value we will threshold. We can split
our controls randomly into $PN$ controls for subtype A and $(1 - P)N$ controls for subtype B and
compute allele frequency differences for disease A and disease B separately. Let $D_m^A$ and $D_m^B$ denote
these two allele frequency differences divided by $\sqrt{2q_m(1 - q_m)}$. Then $D_m = PD_m^A + (1 - P)D_m^B$.

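Neglecting the fact that the controls were ascertained to be shared controls for the two diseases
rather than just for one or the other, we can derive, approximately:
\[
D_m^A \mid \beta^A \sim N\left(\beta_m^A \frac{\phi(T_A)}{K_A(1 - K_A)},\ \frac{2}{PN}\right),
\]
\[
D_m^B \mid \beta^B \sim N\left(\beta_m^B \frac{\phi(T_B)}{K_B(1 - K_B)},\ \frac{2}{(1 - P)N}\right).
\]
Thus
\[
D_m \mid \beta \sim N\left(P\beta_m^A \frac{\phi(T_A)}{K_A(1 - K_A)} + (1 - P)\beta_m^B \frac{\phi(T_B)}{K_B(1 - K_B)},\ \frac{2}{N}\right).
\]
Let $c_m$ be an indicator for whether SNP $m$ is causal. Recall that in our notation, $p = \Pr(c_m = 1)$.
Marginalizing out $\beta$, we see that
\[
D_m \mid c_m = 0 \sim N(0, 2/N),
\]
\[
D_m \mid c_m = 1 \sim N(0, 2/N + v),
\]
where
\[
\begin{aligned}
v &= \left(\frac{P\phi(T_A)}{K_A(1 - K_A)}\right)^2 \frac{h_g^2(A)}{M} + \left(\frac{(1 - P)\phi(T_B)}{K_B(1 - K_B)}\right)^2 \frac{h_g^2(B)}{M} \\
&\quad + 2P(1 - P)\frac{\phi(T_A)\phi(T_B)}{K_A(1 - K_A)K_B(1 - K_B)} \cdot \frac{r_g\sqrt{h_g^2(A)h_g^2(B)}}{M}.
\end{aligned}
\]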

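Thus,
\[
\Pr(D_m^2 > t) = 2(1 - p)\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N}}\right) + 2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)
\]
and
\[
\Pr(c_m = 1 \mid D_m^2 > t) = \frac{2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)}{2(1 - p)\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N}}\right) + 2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)}.
\]
What we would like to compute is
\[
E\left[(\gamma_m^{(A)} - \gamma_m^{(B)})^2 \mid D_m^2 > t\right] = \Pr(c_m = 1 \mid D_m^2 > t)\, E\left[(\gamma_m^{(A)} - \gamma_m^{(B)})^2 \mid D_m^2 > t, c_m = 1\right].
\]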
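
The two selection probabilities follow directly from the normal mixture for $D_m$ and are easy to evaluate. The sketch below is a minimal illustration; the function name and the values of $N$, $v$, and $p$ are assumptions chosen for the example, not values from our analysis.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf

def selection_probabilities(t, n, v, p):
    """Return (Pr(D^2 > t), Pr(causal | D^2 > t)) when D ~ N(0, 2/n) for a
    non-causal SNP and D ~ N(0, 2/n + v) for a causal SNP."""
    tail_null = 2 * PHI(-sqrt(t) / sqrt(2 / n))
    tail_causal = 2 * PHI(-sqrt(t) / sqrt(2 / n + v))
    pr_selected = (1 - p) * tail_null + p * tail_causal
    pr_causal_given_selected = p * tail_causal / pr_selected
    return pr_selected, pr_causal_given_selected

for t in (0.0, 5e-4, 1e-3):
    pr_sel, pr_causal = selection_probabilities(t, n=10_000, v=1e-4, p=0.25)
    print(f"t = {t:g}: Pr(selected) = {pr_sel:.4f}, Pr(causal | selected) = {pr_causal:.4f}")
```

At $t = 0$ every SNP is retained and the causal fraction equals $p$; as $t$ grows, fewer SNPs are retained and a larger share of them is causal, which is the trade-off between $M_t$ and $\eta_t$.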
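The distribution of $(\gamma_m^{(A)} - \gamma_m^{(B)}, D_m) \mid c_m = 1$ is bivariate normal. Moreover, if the heritabilities
and prevalences of A and B are equal, then $\gamma_m^{(A)} - \gamma_m^{(B)}$ and $D_m$ are independent, conditioned on
$c_m = 1$, and if the heritabilities and prevalences are similar, they are close to independent. Thus,
we will approximate $E\left[(\gamma_m^{(A)} - \gamma_m^{(B)})^2 \mid D_m^2 > t, c_m = 1\right]$ by
\[
E\left[(\gamma_m^{(A)} - \gamma_m^{(B)})^2 \mid c_m = 1\right] = \frac{1}{M}\left(\frac{\phi(T_A)^2 h_g^2(A)}{(K_A)^2} + \frac{\phi(T_B)^2 h_g^2(B)}{(K_B)^2} - 2r_g \frac{\phi(T_A)\phi(T_B)}{K_A K_B}\sqrt{h_g^2(A)h_g^2(B)}\right).
\]
This gives us
\[
\frac{1}{M}\left(\frac{\phi(T_A)^2 h_g^2(A)}{(K_A)^2} + \frac{\phi(T_B)^2 h_g^2(B)}{(K_B)^2} - 2r_g \frac{\phi(T_A)\phi(T_B)}{K_A K_B}\sqrt{h_g^2(A)h_g^2(B)}\right) \frac{2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)}{2(1 - p)\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N}}\right) + 2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)}.
\]
We approximate $M_t$ with its expectation, $M\left(2(1 - p)\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N}}\right) + 2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)\right)$. This gives
us
\[
\frac{2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)}{\sqrt{M\left(2(1 - p)\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N}}\right) + 2p\Phi\left(-\frac{\sqrt{t}}{\sqrt{2/N + v}}\right)\right)}} \left(\frac{\phi(T_A)^2 h_g^2(A)}{(K_A)^2} + \frac{\phi(T_B)^2 h_g^2(B)}{(K_B)^2} - 2r_g \frac{\phi(T_A)\phi(T_B)}{K_A K_B}\sqrt{h_g^2(A)h_g^2(B)}\right).
\]
We optimized this value numerically in Python.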
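
A numerical optimization of this kind can be sketched as a grid search over $t$. The code below is not the code used for our analysis: it assumes that the quantity to maximize is the retained-SNP expectation of $(\gamma_m^{(A)} - \gamma_m^{(B)})^2$ multiplied by $\sqrt{M_t}$, that per-SNP effect variances are $h_g^2/M$, and that $T_S$ is the normal quantile for prevalence $K_S$. The function names, the grid, and the sample size are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()

def objective(t, n, M, p, P, h2, K, r_g):
    """Approximate eta_t * sqrt(M_t) at threshold t (a sketch).

    h2 and K are (A, B) pairs; P is the fraction of cases with subtype A.
    """
    T = [ND.inv_cdf(1 - k) for k in K]
    phi = [ND.pdf(x) for x in T]
    rho = r_g * sqrt(h2[0] * h2[1])
    # Variance v of D_m among causal SNPs, in excess of sampling noise 2/n.
    a = [P * phi[0] / (K[0] * (1 - K[0])),
         (1 - P) * phi[1] / (K[1] * (1 - K[1]))]
    v = (a[0] ** 2 * h2[0] + a[1] ** 2 * h2[1] + 2 * a[0] * a[1] * rho) / M
    # E[(gamma_A - gamma_B)^2 | causal], with gamma_S = beta_S * phi(T_S) / K_S.
    g = [phi[0] / K[0], phi[1] / K[1]]
    sq_diff = (g[0] ** 2 * h2[0] + g[1] ** 2 * h2[1] - 2 * g[0] * g[1] * rho) / M
    tail_null = 2 * ND.cdf(-sqrt(t) / sqrt(2 / n))
    tail_causal = 2 * ND.cdf(-sqrt(t) / sqrt(2 / n + v))
    pr_selected = (1 - p) * tail_null + p * tail_causal
    eta_t = sq_diff * p * tail_causal / pr_selected
    M_t = M * pr_selected
    return eta_t * sqrt(M_t)

def best_threshold(grid, **params):
    """Return the grid value of t with the largest objective."""
    return max(grid, key=lambda t: objective(t, **params))

params = dict(n=1_000_000, M=60_000, p=0.25, P=0.5,
              h2=(0.28, 0.23), K=(0.01, 0.01), r_g=0.57)
grid = [0.5 * k * (2 / params["n"]) for k in range(41)]  # t in units of 2/n
t_best = best_threshold(grid, **params)
print(t_best, objective(t_best, **params) / objective(0.0, **params))
```

A grid search is used here only because it is simple to verify; any one-dimensional optimizer could be substituted. With these illustrative values the best threshold is strictly positive, so discarding weakly associated SNPs increases the objective relative to using all SNPs.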

Moment-based estimation of parameter values for schizophrenia and bipolar disorder

Datasets. We used summary statistics from the PGC Cross-Disorder Group 148 for schizophrenia
and bipolar disorder. The schizophrenia dataset had 9,032 cases and 7,980 controls, while the
bipolar disorder dataset had 6,664 cases and 5,258 controls. The control sets do not overlap. We
LD pruned to $r^2 < 0.01$ in plink,1 retaining SNPs with the lowest p-value for p-values that passed
genome-wide significance, and randomly for SNPs below genome-wide significance, retaining 19,195
SNPs. We used Z-scores for each of these SNPs, oriented to the same reference alleles.

Parameter estimation. Let $Z^{(A)}$ and $Z^{(B)}$ denote the association z-scores for schizophrenia and
bipolar disorder, respectively. Let $P_A$ and $P_B$ denote the proportion of cases in the schizophrenia
and bipolar disorder datasets, respectively. For $S \in \{A, B\}$, let $c_S = P_S(1 - P_S)\phi(T_S)^2 / (K_S(1 - K_S))^2$. The LD score
regression equations for case-control data tell us that, in the absence of LD,
\[
E\left[(Z^{(S)})^2\right] = c_S h_g^2(S) N_S / M + 1.
\]
Similarly,
\[
E\left[Z^{(A)} Z^{(B)}\right] = \sqrt{c_A c_B N_A N_B}\, \rho / M,
\]
where $\rho = r_g \sqrt{h_g^2(A) h_g^2(B)}$. We used these equations to estimate $h_g^2(A)$, $h_g^2(B)$ and $\rho$, using the
empirical mean across SNPs of $(Z^{(A)})^2$, $(Z^{(B)})^2$, and $Z^{(A)} Z^{(B)}$. Similar arguments show that
\[
E\left[(Z^{(A)})^2 (Z^{(B)})^2\right] = 1 + \frac{N_A c_A h_g^2(A)}{M} + \frac{N_B c_B h_g^2(B)}{M} + \frac{N_A N_B c_A c_B \left(h_g^2(A) h_g^2(B) + 2\rho^2\right)}{pM^2}.
\]
Again using the empirical mean of $(Z^{(A)})^2 (Z^{(B)})^2$, we can plug in our estimates of $h_g^2(A)$, $h_g^2(B)$,
and $\rho$ and solve for $p$.
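
The moment equations can be inverted in closed form. The sketch below is a round-trip check rather than a re-analysis: it computes the expected moments from assumed parameter values and recovers those values again. The fourth-moment equation is taken in the form $1 + N_A c_A h_g^2(A)/M + N_B c_B h_g^2(B)/M + N_A N_B c_A c_B (h_g^2(A) h_g^2(B) + 2\rho^2)/(pM^2)$, which is an assumption of this example, as are the function names and the placeholder values of $c_A$ and $c_B$.

```python
from math import sqrt

def moments_from_params(h2_a, h2_b, rho, p, n_a, n_b, c_a, c_b, M):
    """Expected z-score moments implied by the model parameters."""
    m_aa = c_a * h2_a * n_a / M + 1
    m_bb = c_b * h2_b * n_b / M + 1
    m_ab = sqrt(c_a * c_b * n_a * n_b) * rho / M
    m_aabb = (1 + n_a * c_a * h2_a / M + n_b * c_b * h2_b / M
              + n_a * n_b * c_a * c_b * (h2_a * h2_b + 2 * rho ** 2) / (p * M ** 2))
    return m_aa, m_bb, m_ab, m_aabb

def params_from_moments(m_aa, m_bb, m_ab, m_aabb, n_a, n_b, c_a, c_b, M):
    """Invert the moment equations; returns (h2_a, h2_b, rho, p)."""
    h2_a = (m_aa - 1) * M / (c_a * n_a)
    h2_b = (m_bb - 1) * M / (c_b * n_b)
    rho = m_ab * M / sqrt(c_a * c_b * n_a * n_b)
    excess = m_aabb - 1 - n_a * c_a * h2_a / M - n_b * c_b * h2_b / M
    p = n_a * n_b * c_a * c_b * (h2_a * h2_b + 2 * rho ** 2) / (excess * M ** 2)
    return h2_a, h2_b, rho, p

# Sample sizes and SNP count from the Datasets paragraph; c_a and c_b are placeholders.
study = dict(n_a=9032 + 7980, n_b=6664 + 5258, c_a=0.5, c_b=0.5, M=19195)
truth = dict(h2_a=0.28, h2_b=0.23, rho=0.57 * sqrt(0.28 * 0.23), p=0.25)

moments = moments_from_params(**truth, **study)
recovered = params_from_moments(*moments, **study)
print(recovered)
```

In practice the four moments would be replaced by empirical means across SNPs, and the recovered values would then be estimates subject to sampling error.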

Number of independent SNPs. These parameters were estimated from the 19,195 independent
SNPs in our dataset. It is likely that the more realistic number of SNPs is higher, both because
there might be more truly independent SNPs if the starting point for LD pruning is a denser set
of SNPs than we started from, and because it is possible that including more SNPs that are not
completely independent could improve performance, even though the BBP theory no longer holds
in this case. In other contexts, 188 a quantity called the "effective number of independent SNPs" has
been estimated for common GWAS SNPs to be 60,000; to be conservative, here we suppose that
the analyst in fact has 60,000 independent SNPs instead of 19,195. To do this, we multiply both
heritabilities by 60,000/19,195 and use these as our true heritabilities instead.

Appendix A

Supplementary information for Chapter 3

Supplementary Information for "Partitioning heritability by functional category using GWAS summary statistics"

Members of the ReproGen consortium. The members of the ReproGen consortium are John
RB Perry, Felix Day, Cathy E Elks, Patrick Sulem, Deborah J Thompson, Teresa Ferreira, Chunyan
He, Daniel I Chasman, Tõnu Esko, Gudmar Thorleifsson, Eva Albrecht, Wei Q Ang, Tanguy
Corre, Diana L Cousminer, Bjarke Feenstra, Nora Franceschini, Andrea Ganna, Andrew D Johnson,
Sanela Kjellqvist, Kathryn L Lunetta, George McMahon, Ilja M Nolte, Lavinia Paternoster,
Eleonora Porcu, Albert V Smith, Lisette Stolk, Alexander Teumer, Natalia Ternikova, Emmi Tikkanen,
Sheila Ulivi, Erin K Wagner, Najaf Amin, Laura J Bierut, Enda M Byrne, JoukeJan Hottenga,
Daniel L Koller, Massimo Mangino, Tune H Pers, Laura M YergesArmstrong, Jing Hua Zhao, Irene
L Andrulis, Hoda AntonCulver, Femke Atsma, Stefania Bandinelli, Matthias W Beckmann, Javier
Benitez, Carl Blomqvist, Stig E Bojesen, Manjeet K Bolla, Bernardo Bonanni, Hiltrud Brauch,
Hermann Brenner, Julie E Buring, Jenny ChangClaude, Stephen Chanock, Jinhui Chen, Georgia
ChenevixTrench, J. Margriet Collee, Fergus J Couch, David Couper, Andrea D Coveillo, Angela
Cox, Kamila Czene, Adamo Pio D'adamo, George Davey Smith, Immaculata De Vivo, Ellen W
Demerath, Joe Dennis, Peter Devilee, Aida K Dieffenbach, Alison M Dunning, Gudny Eiriksdottir,
Johan G Eriksson, Peter A Fasching, Luigi Ferrucci, Dieter FleschJanys, Henrik Flyger, Tatiana
Foroud, Lude Franke, Melissa E Garcia, Montserrat GarciaClosas, Frank Geller, Eco EJ de Geus,
Graham G Giles, Daniel F Gudbjartsson, Vilmundur Gudnason, Pascal Guenel, Suiqun Guo, Per
Hall, Ute Hamann, Robin Haring, Catharina A Hartman, Andrew C Heath, Albert Hofman, Maartje
J Hooning, John L Hopper, Frank B Hu, David J Hunter, David Karasik, Douglas P Kiel, Julia
A Knight, VeliMatti Kosma, Zoltan Kutalik, Sandra Lai, Diether Lambrechts, Annika Lindblom,
Reedik Mägi, Patrik K Magnusson, Arto Mannermaa, Nicholas G Martin, Gisli Masson, Patrick
F McArdle, Wendy L McArdle, Mads Melbye, Kyriaki Michailidou, Evelin Mihailov, Lili Milani,
Roger L Milne, Heli Nevanlinna, Patrick Neven, Ellen A Nohr, Albertine J Oldehinkel, Ben A
Oostra, Aarno Palotie, Munro Peacock, Nancy L Pedersen, Paolo Peterlongo, Julian Peto, Paul
DP Pharoah, Dirkje S Postma, Anneli Pouta, Katri Pylkäs, Paolo Radice, Susan Ring, Fernando
Rivadeneira, Antonietta Robino, Lynda M Rose, Anja Rudolph, Veikko Salomaa, Serena Sanna,
David Schlessinger, Marjanka K Schmidt, Mellissa C Southey, Ulla Sovio, Meir J Stampfer, Doris
Stöckl, Anna M Storniolo, Nicholas J Timpson, Jonathan Tyrer, Jenny A Visser, Peter Vollenweider,
Henry Völzke, Gerard Waeber, Melanie Waldenberger, Henri Wallaschofski, Qin Wang, Gonneke
Willemsen, Robert Winqvist, Bruce HR Wolffenbuttel, Margaret J Wright, Australian Ovarian
Cancer Study, The GENICA Network, kConFab, The LifeLines Cohort Study, The InterAct Consortium,
Early Growth Genetics (EGG) Consortium, Dorret I Boomsma, Michael J Econs, KayTee
Khaw, Ruth JF Loos, Mark I McCarthy, Grant W Montgomery, John P Rice, Elizabeth A Streeten,
Unnur Thorsteinsdottir, Cornelia M van Duijn, Behrooz Z Alizadeh, Sven Bergmann, Eric Boer-
winkle, Heather A Boyd, Laura Crisponi, Paolo Gasparini, Christian Gieger, Tamara B Harris, Erik
Ingelsson, MarjoRiitta Järvelin, Peter Kraft, Debbie Lawlor, Andres Metspalu, Craig E Pennell,
Paul M Ridker, Harold Snieder, Thorkild IA Sorensen, Tim D Spector, David P Strachan, André
G Uitterlinden, Nicholas J Wareham, Elisabeth Widen, Marek Zygmunt, Anna Murray, Douglas F
Easton, Kari Stefansson, Joanne M Murabito, Ken K Ong.

Members of the Schizophrenia Working Group of the Psychiatric Genomics Consortium.
The members of the Schizophrenia Working Group of the Psychiatric Genomics Consortium
are Stephan Ripke, Benjamin M. Neale, Aiden Corvin, James T. R. Walters, Kai-How Farh,
Peter A. Holmans, Phil Lee, Brendan Bulik-Sullivan, David A. Collier, Hailiang Huang, Tune
H. Pers, Ingrid Agartz, Esben Agerbo, Margot Albus, Madeline Alexander, Farooq Amin, Silviu
A. Bacanu, Martin Begemann, Richard A. Belliveau Jr, Judit Bene, Sarah E. Bergen, Elizabeth
Bevilacqua, Tim B. Bigdeli, Donald W. Black, Anders D. Borglum, Richard Bruggeman, Nancy
G. Buccola, Randy L. Buckner, William Byerley, Wiepke Cahn, Guiqing Cai, Dominique Campion,
Rita M. Cantor, Vaughan J. Carr, Noa Carrera, Stanley V. Catts, Kimberly D. Chambert,
Raymond C. K. Chan, Ronald Y. L. Chen, Eric Y. H. Chen, Wei Cheng, Eric F. C. Cheung,
Siow Ann Chong, C. Robert Cloninger, David Cohen, Nadine Cohen, Paul Cormican, Nick Craddock,
James J. Crowley, David Curtis, Michael Davidson, Kenneth L. Davis, Franziska Degenhardt,
Jurgen Del Favero, Lynn E. DeLisi, Ditte Demontis, Dimitris Dikeos, Timothy Dinan, Srdjan
Djurovic, Gary Donohoe, Elodie Drapeau, Jubao Duan, Frank Dudbridge, Naser Durmishi, Peter
Eichhammer, Johan Eriksson, Valentina Escott-Price, Laurent Essioux, Ayman H. Fanous, Martilias
S. Farrell, Josef Frank, Lude Franke, Robert Freedman, Nelson B. Freimer, Marion Friedl,
Joseph I. Friedman, Menachem Fromer, Giulio Genovese, Lyudmila Georgieva, Elliot S. Gershon,
Ina Giegling, Paola Giusti-Rodriguez, Stephanie Godard, Jacqueline I. Goldstein, Vera Golimbet,
Srihari Gopal, Jacob Gratten, Jakob Grove, Lieuwe de Haan, Christian Hammer, Marian L.
Hamshere, Mark Hansen, Thomas Hansen, Vahram Haroutunian, Annette M. Hartmann, Frans A.
Henskens, Stefan Herms, Joel N. Hirschhorn, Per Hoffmann, Andrea Hofman, Mads V. Hollegaard,
David M. Hougaard, Masashi Ikeda, Inge Joa, Antonio Julia, Rene S. Kahn, Luba Kalaydjieva,
Sena Karachanak-Yankova, Juha Karjalainen, David Kavanagh, Matthew C. Keller, Brian J. Kelly,
James L. Kennedy, Andrey Khrunin, Yunjung Kim, Janis Klovins, James A. Knowles, Bettina
Konte, Vaidutis Kucinskas, Zita Ausrele Kucinskiene, Hana Kuzelova-Ptackova, Anna K. Kähler,
Claudine Laurent, Jimmy Lee Chee Keong, S. Hong Lee, Sophie E. Legge, Bernard Lerer, Miaoxin
Li, Tao Li, Kung-Yee Liang, Jeffrey Lieberman, Svetlana Limborska, Carmel M. Loughland, Jan
Lubinski, Jouko Lönnqvist, Milan Macek Jr, Patrik K. E. Magnusson, Brion S. Maher, Wolfgang
Maier, Jacques Mallet, Sara Marsal, Manuel Mattheisen, Morten Mattingsdal, Robert W. McCarley,
Colm McDonald, Andrew M. McIntosh, Sandra Meier, Carin J. Meijer, Bela Melegh, Ingrid Melle,
Raquelle I. Mesholam-Gately, Andres Metspalu, Patricia T. Michie, Lili Milani, Vihra Milanova,
Younes Mokrab, Derek W. Morris, Ole Mors, Preben B. Mortensen, Kieran C. Murphy, Robin
M. Murray, Inez Myin-Germeys, Bertram Müller-Myhsok, Mari Nelis, Igor Nenadic, Deborah A.
Nertney, Gerald Nestadt, Kristin K. Nicodemus, Liene Nikitina-Zake, Laura Nisenbaum, Annelie
Nordin, Eadbhard O'Callaghan, Colm O'Dushlaine, F. Anthony O'Neill, Sang-Yun Oh, Ann Olincy,
Line Olsen, Jim Van Os, Psychosis Endophenotypes International Consortium, Christos Pantelis,
George N. Papadimitriou, Sergi Papiol, Elena Parkhomenko, Michele T. Pato, Tiina Paunio, Milica
Pejovic-Milovancevic, Diana O. Perkins, Olli Pietiläinen, Jonathan Pimm, Andrew J. Pocklington,
John Powell, Alkes Price, Ann E. Pulver, Shaun M. Purcell, Digby Quested, Henrik B. Rasmussen,
Abraham Reichenberg, Mark A. Reimers, Alexander L. Richards, Joshua L. Roffman, Panos Roussos,
Douglas M. Ruderfer, Veikko Salomaa, Alan R. Sanders, Ulrich Schall, Christian R. Schubert,
Thomas G. Schulze, Sibylle G. Schwab, Edward M. Scolnick, Rodney J. Scott, Larry J. Seidman,
Jianxin Shi, Engilbert Sigurdsson, Teimuraz Silagadze, Jeremy M. Silverman, Kang Sim, Petr
Slominsky, Jordan W. Smoller, Hon-Cheong So, Chris C. A. Spencer, Eli A. Stahl, Hreinn Stefansson,
Stacy Steinberg, Elisabeth Stogmann, Richard E. Straub, Eric Strengman, Jana Strohmaier,
T. Scott Stroup, Mythily Subramaniam, Jaana Suvisaari, Dragan M. Svrakic, Jin P. Szatkiewicz,
Erik Söderman, Srinivas Thirumalai, Draga Toncheva, Paul A. Tooney, Sarah Tosato, Juha Veijola,
John Waddington, Dermot Walsh, Dai Wang, Qiang Wang, Bradley T. Webb, Mark Weiser, Dieter
B. Wildenauer, Nigel M. Williams, Stephanie Williams, Stephanie H. Witt, Aaron R. Wolen, Emily
H. M. Wong, Brandon K. Wormley, Jing Qin Wu, Hualin Simon Xi, Clement C. Zai, Xuebin Zheng,
Fritz Zimprich, Naomi R. Wray, Kari Stefansson, Peter M. Visscher, Wellcome Trust Case-Control
Consortium, Rolf Adolfsson, Ole A. Andreassen, Douglas H. R. Blackwood, Elvira Bramon, Joseph
D. Buxbaum, Anders D. Borglum, Sven Cichon, Ariel Darvasi, Enrico Domenici, Hannelore Ehrenreich,
Tõnu Esko, Pablo V. Gejman, Michael Gill, Hugh Gurling, Christina M. Hultman, Nakao
Iwata, Assen V. Jablensky, Erik G. Jönsson, Kenneth S. Kendler, George Kirov, Jo Knight, Todd
Lencz, Douglas F. Levinson, Qingqin S. Li, Jianjun Liu, Anil K. Malhotra, Steven A. McCarroll,
Andrew McQuillin, Jennifer L. Moran, Preben B. Mortensen, Bryan J. Mowry, Markus M. Nöthen,
Roel A. Ophoff, Michael J. Owen, Aarno Palotie, Carlos N. Pato, Tracey L. Petryshen, Danielle
Posthuma, Marcella Rietschel, Brien P. Riley, Dan Rujescu, Pak C. Sham, Pamela Sklar, David St
Clair, Daniel R. Weinberger, Jens R. Wendland, Thomas Werge, Mark J. Daly, Patrick F. Sullivan,
and Michael C. O'Donovan.

Members of the RACI consortium. The members of the RACI consortium who contributed
to the data used here are Yukinori Okada, Robert Graham, Arun Manoharan, Ward Ortmann,
Tushar Bhangale, Joshua Denny, Robert Carroll, Anne Eyler, Jeffrey Greenberg, Joel Kremer,
Dimitrios Pappas, Gang Xie, Ed Keystone, Eli Stahl, Dorothee Diogo, Jing Cui, Katherine Liao,
Marieke Coenen, Piet van Riel, Mart van de Laar, Henk-Jan Guchelaar, Tom Huizinga, Philippe
Dieudé, Xavier Mariette, S. Louis Bridges Jr., Alexandra Zhernakova, Rene Toes, Paul Tak, Corinne
Miceli-Richard, Javier Martin, Miguel Gonzalez-Gay, Luis Rodriguez-Rodriguez, Solbritt Rantapää-Dahlqvist,
Lisbeth Arlestig, Steve Eyre, John Bowes, Anne Barton, Niek de Vries, Larry Moreland,
Lindsey Criswell, Elizabeth Karlson, Jane Worthington, Leonid Padyukov, Lars Klareskog, Peter
Gregersen, Soumya Raychaudhuri, Timothy Behrens, Katherine Siminovitch, Robert Plenge.
Annotation Prop. SNPs Mean segment length (bp)
Coding 0.015 315
Conserved 0.026 34
CTCF 0.024 490
DGF 0.138 208
DHS 0.168 358
FANTOM5 Enhancer 0.004 289
Enhancer 0.063 678
Fetal DHS 0.085 339
H3K27ac 32 0.391 12411
H3K27ac 31 0.269 1051
H3K4me1 0.427 1676
H3K4me3 0.133 941
H3K9ac 0.126 964
Intron 0.387 6537
PromoterFlanking 0.008 266
Promoter 0.031 4192
Repressed 0.461 572
Super-enhancer 0.168 54744
TFBS 0.132 509
Transcribed 0.345 484
TSS 0.018 813
3-prime UTR 0.011 844
5-prime UTR 0.005 197
Weak Enhancer 0.021 249
Table A.1: Annotations used. For DHS, H3K4me1, H3K4me3, and H3K9ac, we include peaks and regions
as two annotations. For the annotations from the Hoffman segmentation, 33 we union over six cell lines for
each category except Repressed, where we take an intersection instead. We also include a 500bp window
around each annotation as a separate annotation in the model.

Phenotype Reference/consortium N
Height Lango Allen et al., 2010 Nature 133,858
BMI Speliotes et al., 2010 Nat Genet 123,912
Age at menarche Perry et al., 2014 Nature 132,989
LDL Teslovich et al., 2010 Nature 95,454
HDL Teslovich et al., 2010 Nature 99,900
Triglycerides Teslovich et al., 2010 Nature 96,598
Coronary Artery Disease Schunkert et al., Nat Genet 2011 86,995
Type-2 Diabetes Morris et al., 2012 Nat Genet 69,033
Fasting glucose Manning et al., Nat Genet 2012 46,186
Schizophrenia SCZ Working Group of the PGC, 2014 Nature 70,100
Bipolar Disorder Bip Working Group of the PGC, 2011 Nat Genet 16,731
Anorexia Boraska et al., 2014 Mol Psych 17,767
Educational attainment Rietveld et al., Science 2013 101,069
Ever smoked TAG Consortium, 2010 Nat Genet 74,035
Rheumatoid Arthritis Okada et al., 2014 Nature 38,242
Crohn's Disease Jostins et al., 2012 Nature 20,883
Ulcerative Colitis Jostins et al., 2012 Nature 27,432

Table A.2: Phenotypes used in the main analyses. The average sample size is 73,599.
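
The average sample size quoted in the caption can be recomputed from the N column of the table. The snippet below is a small arithmetic check; the dictionary keys are shortened phenotype labels chosen for this example.

```python
# Sample sizes from the N column of Table A.2.
sample_sizes = {
    "Height": 133_858, "BMI": 123_912, "Age at menarche": 132_989,
    "LDL": 95_454, "HDL": 99_900, "Triglycerides": 96_598,
    "Coronary artery disease": 86_995, "Type-2 diabetes": 69_033,
    "Fasting glucose": 46_186, "Schizophrenia": 70_100,
    "Bipolar disorder": 16_731, "Anorexia": 17_767,
    "Educational attainment": 101_069, "Ever smoked": 74_035,
    "Rheumatoid arthritis": 38_242, "Crohn's disease": 20_883,
    "Ulcerative colitis": 27_432,
}

mean_n = sum(sample_sizes.values()) / len(sample_sizes)
print(len(sample_sizes), round(mean_n))
```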

Annotation Prop. SNPs Prop. h^2 Enrichment Enrichment P-value
Coding 0.015 0.104 (0.012) 7.124 (0.842) 0.000
Coding + 500bp 0.065 0.190 (0.030) 2.937 (0.467) 0.000
Conserved 0.026 0.347 (0.039) 13.318 (1.503) 0.000
Conserved + 500bp 0.333 0.654 (0.026) 1.967 (0.078) 0.000
CTCF 0.024 -0.004 (0.019) -0.165 (0.792) 0.404
CTCF + 500bp 0.071 0.059 (0.019) 0.824 (0.273) 0.456
DGF 0.138 0.358 (0.094) 2.602 (0.686) 0.020
DGF + 500bp 0.542 0.761 (0.069) 1.406 (0.128) 0.003
DHS peaks 0.112 0.224 (0.063) 2.004 (0.566) 0.056
DHS 0.168 0.285 (0.069) 1.698 (0.410) 0.076
DHS + 500bp 0.499 0.787 (0.041) 1.579 (0.081) 0.000
FANTOM5 Enhancer 0.004 -0.003 (0.009) -0.727 (2.156) 0.536
FANTOM5 Enhancer + 500bp 0.019 0.017 (0.017) 0.880 (0.896) 0.743
Enhancer 0.063 0.239 (0.050) 3.767 (0.783) 0.000
Enhancer + 500bp 0.154 0.359 (0.042) 2.334 (0.272) 0.000
Fetal DHS 0.085 0.238 (0.044) 2.809 (0.513) 0.001
Fetal DHS + 500bp 0.285 0.590 (0.058) 2.071 (0.204) 0.000
H3K27ac 32 0.391 0.630 (0.054) 1.612 (0.138) 0.000
H3K27ac 32 + 500bp 0.423 0.664 (0.059) 1.572 (0.140) 0.000
H3K27ac 31 0.269 0.490 (0.054) 1.818 (0.200) 0.000
H3K27ac 31 + 500bp 0.336 0.611 (0.040) 1.818 (0.118) 0.000
H3K4me1 peaks 0.171 0.447 (0.040) 2.611 (0.236) 0.000
H3K4me1 0.427 0.792 (0.065) 1.857 (0.152) 0.000
H3K4me1 + 500bp 0.609 0.910 (0.039) 1.493 (0.064) 0.000
H3K4me3 peaks 0.042 0.158 (0.025) 3.775 (0.602) 0.000
H3K4me3 0.133 0.344 (0.045) 2.583 (0.336) 0.000
H3K4me3 + 500bp 0.255 0.487 (0.056) 1.905 (0.220) 0.000
H3K9ac peaks 0.039 0.248 (0.024) 6.396 (0.618) 0.000
H3K9ac 0.126 0.408 (0.056) 3.233 (0.441) 0.000
H3K9ac + 500bp 0.231 0.503 (0.040) 2.183 (0.172) 0.000
Intron 0.387 0.462 (0.014) 1.192 (0.035) 0.005
Intron + 500bp 0.397 0.521 (0.015) 1.313 (0.037) 0.000
PromoterFlanking 0.008 0.004 (0.011) 0.448 (1.315) 0.617
PromoterFlanking + 500bp 0.033 0.081 (0.018) 2.433 (0.532) 0.021
Promoter 0.031 0.087 (0.016) 2.807 (0.513) 0.008
Promoter + 500bp 0.039 0.080 (0.016) 2.061 (0.424) 0.018
Repressed 0.461 0.285 (0.063) 0.618 (0.137) 0.006
Repressed + 500bp 0.719 0.446 (0.049) 0.620 (0.068) 0.000
Super Enhancer 0.168 0.304 (0.035) 1.803 (0.209) 0.000
Super Enhancer + 500bp 0.172 0.319 (0.037) 1.856 (0.216) 0.000
TFBS 0.132 0.445 (0.063) 3.356 (0.478) 0.000
TFBS + 500bp 0.343 0.503 (0.052) 1.465 (0.152) 0.005
Transcribed 0.345 0.407 (0.038) 1.178 (0.111) 0.251
Transcribed + 500bp 0.763 0.721 (0.028) 0.945 (0.036) 0.104
TSS 0.018 0.104 (0.023) 5.685 (1.281) 0.001
TSS + 500bp 0.035 0.172 (0.029) 4.948 (0.830) 0.000
3-prime UTR 0.011 0.054 (0.009) 4.891 (0.857) 0.005
3-prime UTR + 500bp 0.027 0.074 (0.011) 2.731 (0.411) 0.003
5-prime UTR 0.005 0.028 (0.008) 5.240 (1.387) 0.008
5-prime UTR + 500bp 0.028 0.065 (0.010) 2.341 (0.375) 0.001
Weak Enhancer 0.021 0.070 (0.023) 3.325 (1.090) 0.001
Weak Enhancer + 500bp 0.089 0.199 (0.030) 2.238 (0.335) 0.000
Table A.3: Proportion of SNP-heritability and enrichment for different functional categories. We display
results meta-analyzed across nine traits for each of the 53 annotations, including two distinct H3K27ac
annotations (Methods). Although true SNP-heritability is non-negative, we report here unbiased estimates,
which can be negative (Methods).
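
As a reading aid for the table, enrichment is the proportion of SNP-heritability in a category divided by the proportion of SNPs in that category. The sketch below recomputes this ratio from the rounded table entries for a few rows; because the table reports meta-analyzed values rounded to three decimals, the recomputed ratios match the reported enrichments only approximately. The variable names are assumptions for this example.

```python
# (Prop. SNPs, Prop. h^2, reported enrichment) for selected rows of Table A.3.
rows = {
    "Coding": (0.015, 0.104, 7.124),
    "Conserved": (0.026, 0.347, 13.318),
    "Repressed": (0.461, 0.285, 0.618),
    "TSS": (0.018, 0.104, 5.685),
}

recomputed = {name: prop_h2 / prop_snps
              for name, (prop_snps, prop_h2, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {reported:.3f}")
```

Values above one indicate a category that explains more heritability than its share of SNPs, and values below one (for example, Repressed) indicate depletion.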
Annotation Prop. SNPs Prop. h^2 Enrichment Enrichment P-value
Coding 0.015 0.108 (0.013) 7.359 (0.861) 0.000
Coding + 500bp 0.065 0.192 (0.030) 2.978 (0.469) 0.000
Conserved 0.026 0.359 (0.042) 13.758 (1.606) 0.000
Conserved + 500bp 0.333 0.656 (0.026) 1.972 (0.079) 0.000
CTCF 0.024 -0.013 (0.019) -0.566 (0.810) 0.256
CTCF + 500bp 0.071 0.043 (0.020) 0.599 (0.279) 0.148
DGF 0.138 0.367 (0.097) 2.667 (0.702) 0.018
DGF + 500bp 0.542 0.748 (0.071) 1.381 (0.132) 0.005
DHS peaks 0.112 0.221 (0.065) 1.976 (0.584) 0.073
DHS 0.168 0.279 (0.070) 1.665 (0.420) 0.098
DHS + 500bp 0.499 0.756 (0.040) 1.515 (0.081) 0.000
FANTOM5 Enhancer 0.004 -0.004 (0.009) -0.853 (2.158) 0.507
FANTOM5 Enhancer + 500bp 0.019 0.016 (0.018) 0.817 (0.920) 0.686
Enhancer 0.063 0.242 (0.051) 3.828 (0.811) 0.000
Enhancer + 500bp 0.154 0.349 (0.043) 2.266 (0.280) 0.000
Fetal DHS 0.085 0.237 (0.044) 2.791 (0.518) 0.001
Fetal DHS + 500bp 0.285 0.558 (0.059) 1.958 (0.207) 0.000
H3K27ac 32 0.391 0.627 (0.055) 1.604 (0.141) 0.000
H3K27ac 32 + 500bp 0.423 0.664 (0.060) 1.571 (0.142) 0.000
H3K27ac 31 0.269 0.485 (0.056) 1.799 (0.207) 0.000
H3K27ac 31 + 500bp 0.336 0.607 (0.042) 1.807 (0.124) 0.000
H3K4me1 peaks 0.171 0.464 (0.041) 2.710 (0.242) 0.000
H3K4me1 0.427 0.807 (0.066) 1.891 (0.156) 0.000
H3K4me1 + 500bp 0.609 0.899 (0.041) 1.476 (0.067) 0.000
H3K4me3 peaks 0.042 0.166 (0.026) 3.961 (0.619) 0.000
H3K4me3 0.133 0.345 (0.045) 2.590 (0.335) 0.000
H3K4me3 + 500bp 0.255 0.488 (0.058) 1.909 (0.227) 0.000
H3K9ac peaks 0.039 0.259 (0.025) 6.671 (0.636) 0.000
H3K9ac 0.126 0.409 (0.057) 3.241 (0.452) 0.000
H3K9ac + 500bp 0.231 0.502 (0.041) 2.178 (0.178) 0.000
Intron 0.387 0.467 (0.014) 1.206 (0.035) 0.004
Intron + 500bp 0.397 0.528 (0.015) 1.329 (0.037) 0.000
PromoterFlanking 0.008 0.010 (0.011) 1.136 (1.347) 0.974
PromoterFlanking + 500bp 0.033 0.083 (0.018) 2.488 (0.544) 0.020
Promoter 0.031 0.091 (0.016) 2.923 (0.522) 0.006
Promoter + 500bp 0.039 0.079 (0.017) 2.043 (0.441) 0.023
Repressed 0.461 0.268 (0.065) 0.580 (0.141) 0.004
Repressed + 500bp 0.719 0.443 (0.049) 0.616 (0.069) 0.000
Super Enhancer 0.168 0.303 (0.036) 1.796 (0.214) 0.000
Super Enhancer + 500bp 0.172 0.316 (0.038) 1.843 (0.221) 0.000
TFBS 0.132 0.440 (0.065) 3.320 (0.491) 0.000
TFBS + 500bp 0.343 0.481 (0.054) 1.400 (0.156) 0.016
Transcribed 0.345 0.430 (0.039) 1.244 (0.112) 0.108
Transcribed + 500bp 0.763 0.726 (0.028) 0.952 (0.037) 0.166
TSS 0.018 0.105 (0.024) 5.770 (1.306) 0.001
TSS + 500bp 0.035 0.171 (0.029) 4.915 (0.842) 0.000
3-prime UTR 0.011 0.053 (0.010) 4.781 (0.873) 0.007
3-prime UTR + 500bp 0.027 0.075 (0.011) 2.793 (0.422) 0.003
5-prime UTR 0.005 0.029 (0.008) 5.305 (1.404) 0.007
5-prime UTR + 500bp 0.028 0.067 (0.011) 2.395 (0.385) 0.001
Weak Enhancer 0.021 0.070 (0.024) 3.328 (1.141) 0.002
Weak Enhancer + 500bp 0.089 0.190 (0.030) 2.138 (0.343) 0.000
0.0 < DAF < 0.1 0.175 0.071 (0.009) 0.408 (0.054) 0.000
0.1 < DAF < 0.2 0.211 0.215 (0.013) 1.021 (0.063) 0.811
0.2 < DAF < 0.3 0.144 0.143 (0.019) 0.994 (0.132) 0.836
0.3 < DAF < 0.4 0.111 0.142 (0.023) 1.284 (0.211) 0.192
0.4 < DAF < 0.6 0.161 0.223 (0.012) 1.387 (0.073) 0.000
0.6 < DAF < 0.8 0.116 0.175 (0.027) 1.514 (0.235) 0.019
Cell type Mark Cell-type group
Cell~~~ tyeMrieltp ru
Fetal adrenal H3K4me1 Adrenal/Pancreas
Fetal adrenal H3K4me3 Adrenal/Pancreas
Pancreas H3K4me1 Adrenal/Pancreas
Pancreas H3K4me3 Adrenal/Pancreas
Pancreatic islets H3K27ac Adrenal/Pancreas
Pancreatic islets H3K4me1 Adrenal/Pancreas
Pancreatic islets H3K4me1 Adrenal/Pancreas
Pancreatic islets H3K4me3 Adrenal/Pancreas
Pancreatic islets H3K4me3 Adrenal/Pancreas
Pancreatic islets H3K9ac Adrenal/Pancreas
Angular gyrus H3K27ac CNS
Angular gyrus H3K4me1 CNS
Angular gyrus H3K4me3 CNS
Angular gyrus H3K9ac CNS
Anterior caudate H3K27ac CNS
Anterior caudate H3K4me1 CNS
Anterior caudate H3K4me3 CNS
Anterior caudate H3K9ac CNS
Cingulate gyrus H3K27ac CNS
Cingulate gyrus H3K4me1 CNS
Cingulate gyrus H3K4me3 CNS
Cingulate gyrus H3K9ac CNS
Fetal brain H3K4me1 CNS
Fetal brain H3K4me3 CNS
Fetal brain H3K4me3 CNS
Fetal brain H3K9ac CNS
Germinal matrix H3K4me3 CNS
Hippocampus middle H3K27ac CNS
Hippocampus middle H3K4me1 CNS
Hippocampus middle H3K4me3 CNS
Hippocampus middle H3K9ac CNS
Inferior temporal lobe H3K27ac CNS
Inferior temporal lobe H3K4me1 CNS
Inferior temporal lobe H3K4me3 CNS
Inferior temporal lobe H3K9ac CNS
Mid frontal lobe H3K27ac CNS
Mid frontal lobe H3K4me1 CNS
Mid frontal lobe H3K4me3 CNS
Mid frontal lobe H3K9ac CNS
Neurosphere H3K27ac CNS
Substantia nigra H3K27ac CNS
Substantia nigra H3K4me1 CNS
Substantia nigra H3K4me3 CNS
Substantia nigra H3K9ac CNS
Aorta H3K4me3 Cardiovascular
Fetal heart H3K4me1 Cardiovascular
Fetal heart H3K4me3 Cardiovascular
Fetal heart H3K9ac Cardiovascular
Fetal lung H3K4me1 Cardiovascular
Fetal lung H3K4me3 Cardiovascular
Fetal lung H3K9ac Cardiovascular
Left Ventricle H3K4me1 Cardiovascular
Left Ventricle H3K4me3 Cardiovascular
Lung H3K4me1 Cardiovascular
Lung H3K4me3 Cardiovascular
Right atrium H3K4me1 Cardiovascular
Right atrium H3K4me3 Cardiovascular
Right ventricle H3K4me1 Cardiovascular
Right ventricle H3K4me3 Cardiovascular
Breast fibroblast primary H3K4me1 Connective/Bone
Breast fibroblast primary H3K4me3 Connective/Bone
Chondrogenic dif H3K27ac Connective/Bone
Osteoblast H3K27ac Connective/Bone
Penis foreskin fibroblast primary H3K4me1 Connective/Bone
Penis foreskin fibroblast primary H3K4me3 Connective/Bone
Colon smooth muscle H3K27ac Gastrointestinal
Colon smooth muscle H3K4me1 Gastrointestinal
Colon smooth muscle H3K4me3 Gastrointestinal
Colon smooth muscle H3K9ac Gastrointestinal
Colonic mucosa H3K27ac Gastrointestinal
Colonic mucosa H3K4me1 Gastrointestinal
Colonic mucosa H3K4me3 Gastrointestinal
Colonic mucosa H3K9ac Gastrointestinal
Duodenum Mucosa H3K4me1 Gastrointestinal
Duodenum Mucosa H3K4me3 Gastrointestinal
Duodenum Mucosa H3K9ac Gastrointestinal
Duodenum mucosa H3K27ac Gastrointestinal
Duodenum smooth muscle H3K27ac Gastrointestinal
Duodenum smooth muscle H3K4me1 Gastrointestinal
Duodenum smooth muscle H3K4me3 Gastrointestinal
Esophagus H3K4me1 Gastrointestinal
Esophagus H3K4me3 Gastrointestinal
Fetal large intestine H3K4me1 Gastrointestinal
Fetal large intestine H3K4me3 Gastrointestinal
Fetal small intestine H3K4me1 Gastrointestinal
Fetal small intestine H3K4me3 Gastrointestinal
Fetal stomach H3K4me1 Gastrointestinal
Fetal stomach H3K4me3 Gastrointestinal
Gastric H3K4me1 Gastrointestinal
Gastric H3K4me3 Gastrointestinal
Rectal mucosa H3K27ac Gastrointestinal
Rectal mucosa H3K4me1 Gastrointestinal
Rectal mucosa H3K4me3 Gastrointestinal
Rectal mucosa H3K9ac Gastrointestinal
Rectal smooth muscle H3K27ac Gastrointestinal
Rectal smooth muscle H3K4me1 Gastrointestinal
Rectal smooth muscle H3K4me3 Gastrointestinal
Rectal smooth muscle H3K9ac Gastrointestinal
Sigmoid colon H3K4me1 Gastrointestinal
Sigmoid colon H3K4me3 Gastrointestinal
Small intestine H3K4me1 Gastrointestinal
Small intestine H3K4me3 Gastrointestinal
Stomach mucosa H3K4me1 Gastrointestinal
Stomach mucosa H3K4me3 Gastrointestinal
Stomach mucosa H3K9ac Gastrointestinal
Stomach smooth muscle H3K27ac Gastrointestinal
Stomach smooth muscle H3K4me1 Gastrointestinal
Stomach smooth muscle H3K4me3 Gastrointestinal
Stomach smooth muscle H3K9ac Gastrointestinal
CD14 H3K27ac Immune
CD14 primary H3K4me1 Immune
CD14 primary H3K4me3 Immune
CD15 primary H3K4me1 Immune
CD15 primary H3K4me3 Immune
CD19 H3K27ac Immune
CD19 primary (BI) H3K4me1 Immune
CD19 primary (BI) H3K4me3 Immune
CD19 primary (UW) H3K4me1 Immune
CD19 primary (UW) H3K4me3 Immune
CD20 H3K27ac Immune
CD25+ CD127- Treg H3K27ac Immune
CD25- CD45RA+ naive H3K27ac Immune
CD25- IL17+ Th17 stim H3K27ac Immune
CD25- IL17- Th stim MACS H3K27ac Immune
CD25int CD127+ Tmem H3K27ac Immune
CD3 primary H3K27ac Immune
CD3 primary (BI) H3K4me1 Immune
CD3 primary (BI) H3K4me3 Immune
CD3 primary (UW) H3K4me1 Immune
CD3 primary (UW) H3K4me3 Immune
CD34 primary H3K4me1 Immune
CD34 primary H3K4me3 Immune
CD4 memory primary H3K4me1 Immune
CD4 memory primary H3K4me3 Immune
CD4 naive primary H3K4me1 Immune
CD4 naive primary H3K4me3 Immune
CD4 primary H3K4me3 Immune
CD4+ CD25+ CD127- Treg primary H3K4me1 Immune
CD4+ CD25+ CD127- Treg primary H3K4me3 Immune
CD4+ CD25- CD45RO+ memory primary H3K4me1 Immune
CD4+ CD25- CD45RO+ memory primary H3K4me3 Immune
CD4+ CD25- CD45RA+ naive primary H3K4me1 Immune
CD4+ CD25- CD45RA+ naive primary H3K4me3 Immune
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary H3K4me1 Immune
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary H3K4me3 Immune
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary H3K4me1 Immune
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary H3K4me3 Immune
CD4+ CD25- Th primary H3K4me1 Immune
CD4+ CD25- Th primary H3K4me3 Immune
CD4+ CD25int CD127+ Tmem primary H3K4me1 Immune
CD4+ CD25int CD127+ Tmem primary H3K4me3 Immune
CD56 primary H3K4me1 Immune
CD56 primary H3K4me3 Immune
CD8 memory primary H3K4me1 Immune
CD8 memory primary H3K4me3 Immune
CD8 naive primary (BI) H3K4me1 Immune
CD8 naive primary (BI) H3K4me3 Immune
CD8 naive primary (UCSF-UBC) H3K4me1 Immune
CD8 naive primary (UCSF-UBC) H3K4me3 Immune
CD8 naive primary (UCSF-UBC) H3K9ac Immune
CD8 primary H3K4me3 Immune
Fetal thymus H3K4me1 Immune
Fetal thymus H3K4me3 Immune
Mobilized CD34 H3K27ac Immune
Mobilized CD34 primary H3K4me1 Immune
Mobilized CD34 primary H3K4me3 Immune
Peripheral blood mononuclear primary H3K4me1 Immune
Peripheral blood mononuclear primary H3K4me3 Immune
Peripheral blood mononuclear primary H3K9ac Immune
Spleen H3K4me1 Immune
Spleen H3K4me3 Immune
Th0 H3K27ac Immune
Th1 H3K27ac Immune
Th2 H3K27ac Immune
Thymus H3K4me1 Immune
Treg primary H3K4me3 Immune
Fetal kidney H3K9ac Kidney
Kidney H3K27ac Kidney
Kidney H3K4me1 Kidney
Kidney H3K4me3 Kidney
Kidney H3K9ac Kidney
Liver H3K27ac Liver
Liver (BI) H3K4me1 Liver
Liver (BI) H3K4me3 Liver
Liver (BI) H3K9ac Liver
Liver (UCSD) H3K4me1 Liver
Liver (UCSD) H3K4me3 Liver
Adipose nuclei H3K27ac Other
Adipose nuclei H3K4me1 Other
Adipose nuclei H3K4me3 Other
Adipose nuclei H3K9ac Other
Breast luminal epithelial H3K4me1 Other
Breast myoepithelial H3K4me1 Other
Breast myoepithelial H3K4me3 Other
Breast myoepithelial H3K9ac Other
Breast vHMEC H3K4me1 Other
Breast vHMEC H3K4me3 Other
Fetal placenta H3K4me1 Other
Fetal placenta H3K4me3 Other
Ovary H3K4me1 Other
Ovary H3K4me3 Other
Penis foreskin keratinocyte primary H3K4me1 Other
Penis foreskin keratinocyte primary H3K4me3 Other
Penis foreskin keratinocyte primary H3K9ac Other
Penis foreskin melanocyte primary H3K4me1 Other
Penis foreskin melanocyte primary H3K4me3 Other
Placenta amnion H3K4me1 Other
Placenta amnion H3K4me3 Other
Placenta chorion H3K4me1 Other
Placenta chorion H3K4me3 Other
Fetal leg muscle H3K4me1 Skeletal muscle
Fetal leg muscle H3K4me3 Skeletal muscle
Fetal trunk muscle H3K4me1 Skeletal muscle
Fetal trunk muscle H3K4me3 Skeletal muscle
Psoas muscle H3K4me1 Skeletal muscle
Psoas muscle H3K4me3 Skeletal muscle
Skeletal muscle H3K27ac Skeletal muscle
Skeletal muscle H3K4me1 Skeletal muscle
Skeletal muscle H3K4me3 Skeletal muscle
Skeletal muscle H3K9ac Skeletal muscle

Table A.5: Cell types used in the cell-type-specific analysis. When data for the same cell type and the same
histone mark from more than one institution were used, the institution is given in parentheses.

Cell type Cell-type group Mark -log10(p)
Chondrogenic dif** Connective/Bone H3K27ac 6.81
Penis foreskin fibroblast primary** Connective/Bone H3K4me1 6.43
Fetal lung** Cardiovascular H3K4me1 6.34
Fetal stomach** GI H3K4me1 5.48
Colon smooth muscle* GI H3K4me1 4.64
Aorta* Cardiovascular H3K4me3 4.64
Fetal lung* Cardiovascular H3K9ac 4.31
Stomach smooth muscle* GI H3K4me3 4.26
Osteoblast* Connective/Bone H3K27ac 4.04
Penis foreskin fibroblast primary* Connective/Bone H3K4me3 3.96
Stomach smooth muscle* GI H3K4me1 3.94
Fetal leg muscle* Skeletal Muscle H3K4me3 3.91
Fetal trunk muscle* Skeletal Muscle H3K4me3 3.72
Rectal smooth muscle* GI H3K4me3 3.57
Fetal lung* Cardiovascular H3K4me3 3.37
Rectal smooth muscle* GI H3K4me1 3.32
Fetal placenta* Other H3K4me3 3.26
Adipose nuclei* Other H3K4me1 3.1
Ovary* Other H3K4me1 3.06
Fetal large intestine* GI H3K4me3 3.05
Placenta chorion* Other H3K4me3 2.98
CD34 primary* Immune H3K4me1 2.96
Penis foreskin melanocyte primary* Other H3K4me1 2.95
Skeletal muscle* Skeletal Muscle H3K9ac 2.93
Mobilized CD34 primary* Immune H3K4me1 2.88
Fetal stomach* GI H3K4me3 2.87
Mobilized CD34 primary* Immune H3K4me3 2.87
Fetal adrenal* Adrenal/Pancreas H3K4me3 2.86
Breast fibroblast primary* Connective/Bone H3K4me3 2.85
Duodenum smooth muscle* GI H3K4me1 2.81
Colon smooth muscle* GI H3K4me3 2.76
Ovary* Other H3K4me3 2.7
Fetal brain* CNS H3K4me3 2.62
Skeletal muscle* Skeletal Muscle H3K4me1 2.6
Fetal small intestine* GI H3K4me3 2.6
Colon smooth muscle* GI H3K27ac 2.57
Lung* Cardiovascular H3K4me3 2.56
Liver (UCSD)* Liver H3K4me3 2.53
Esophagus* GI H3K4me3 2.53
Placenta amnion* Other H3K4me3 2.48
Right ventricle* Cardiovascular H3K4me3 2.47
Sigmoid colon* GI H3K4me3 2.44
Fetal leg muscle* Skeletal Muscle H3K4me1 2.44
Colonic mucosa* GI H3K4me3 2.18
Right atrium* Cardiovascular H3K4me3 2.14
CD34 primary* Immune H3K4me3 2.13
Gastric* GI H3K4me3 2.08
Skeletal muscle* Skeletal Muscle H3K4me3 2.07
Lung* Cardiovascular H3K4me1 2.06
Pancreatic islets* Adrenal/Pancreas H3K4me3 2.05
Adipose nuclei* Other H3K9ac 2.01
Right atrium* Cardiovascular H3K4me1 1.98
Stomach smooth muscle* GI H3K27ac 1.96
Rectal smooth muscle* GI H3K27ac 1.95
Breast fibroblast primary* Connective/Bone H3K4me1 1.93
Germinal matrix* CNS H3K4me3 1.92
Small intestine* GI H3K4me3 1.91
Fetal placenta* Other H3K4me1 1.91

(a) Height

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.
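
The two significance markers used throughout Table A.6 can be illustrated with a short Python sketch that starts from -log10(p) values. This is a sketch under stated assumptions rather than the analysis code: it assumes that the FDR threshold is applied with the Benjamini-Hochberg step-up procedure and that "correcting for multiple hypotheses" means a Bonferroni correction. The function names and the example p-values are hypothetical.

```python
def neg_log10_to_p(neg_log10_p):
    """Convert a tabulated -log10(p) value back to a p-value."""
    return 10.0 ** (-neg_log10_p)


def bonferroni_flags(pvals, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests (assumed '**')."""
    cutoff = alpha / len(pvals)
    return [p < cutoff for p in pvals]


def benjamini_hochberg_flags(pvals, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg procedure (assumed '*')."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose sorted p-value is at most (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            largest_k = rank
    flags = [False] * m
    # Every hypothesis ranked at or below largest_k is rejected.
    for rank, idx in enumerate(order, start=1):
        flags[idx] = rank <= largest_k
    return flags


# Hypothetical p-values, for illustration only (not taken from Table A.6).
example_pvals = [0.001, 0.012, 0.025, 0.2, 0.9]
fdr_flags = benjamini_hochberg_flags(example_pvals)
bonf_flags = bonferroni_flags(example_pvals)
```

In this toy example three tests pass the FDR criterion but only one passes the stricter Bonferroni criterion, which mirrors how rows marked ** form a subset of rows marked *.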

Cell type Cell-type group Mark -log10(p)
Fetal brain* CNS H3K4me3 4.48
Penis foreskin fibroblast primary* Connective/Bone H3K4me3 4.43
Inferior temporal lobe* CNS H3K4me1 4.3
Mid frontal lobe* CNS H3K9ac 4.25
Anterior caudate* CNS H3K4me3 4.25
Mid frontal lobe* CNS H3K27ac 3.96
Anterior caudate* CNS H3K9ac 3.91
Cingulate gyrus* CNS H3K4me1 3.73
Inferior temporal lobe* CNS H3K4me3 3.73
Penis foreskin keratinocyte primary* Other H3K9ac 3.72
Mid frontal lobe* CNS H3K4me3 3.71
Hippocampus middle* CNS H3K4me1 3.66
Inferior temporal lobe* CNS H3K9ac 3.59
Fetal brain* CNS H3K9ac 3.57
Hippocampus middle* CNS H3K9ac 3.47
Cingulate gyrus* CNS H3K9ac 3.46
Hippocampus middle* CNS H3K4me3 3.4
Germinal matrix* CNS H3K4me3 3.4
Cingulate gyrus* CNS H3K4me3 3.4
Anterior caudate* CNS H3K4me1 3.31
Substantia nigra* CNS H3K4me3 3.24
Angular gyrus* CNS H3K27ac 3.05
Penis foreskin melanocyte primary* Other H3K4me3 2.83
Angular gyrus* CNS H3K4me3 2.76
Substantia nigra* CNS H3K4me1 2.75
Pancreatic islets* Adrenal/Pancreas H3K4me3 2.6
Cingulate gyrus* CNS H3K27ac 2.57
Fetal adrenal* Adrenal/Pancreas H3K4me3 2.57
Angular gyrus* CNS H3K9ac 2.51
Inferior temporal lobe* CNS H3K27ac 2.39
Breast myoepithelial* Other H3K4me3 2.35
Substantia nigra* CNS H3K9ac 2.26
Substantia nigra* CNS H3K27ac 2.22
Hippocampus middle* CNS H3K27ac 2.07

(b) BMI

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Fetal brain** CNS H3K4me3 12.25
Pancreatic islets** Adrenal/Pancreas H3K4me3 11.73
Angular gyrus** CNS H3K4me3 11.22
Germinal matrix** CNS H3K4me3 11.18
Fetal adrenal** Adrenal/Pancreas H3K4me3 11.12
Mid frontal lobe** CNS H3K4me3 11.11
Inferior temporal lobe** CNS H3K4me3 10.22
Cingulate gyrus** CNS H3K4me3 9.94
Anterior caudate** CNS H3K4me3 8.91
Psoas muscle** Skeletal Muscle H3K4me3 8.66
Right ventricle** Cardiovascular H3K4me3 8.58
Pancreatic islets** Adrenal/Pancreas H3K9ac 7.74
Fetal leg muscle** Skeletal Muscle H3K4me3 7.71
Pancreas** Adrenal/Pancreas H3K4me3 7.26
Hippocampus middle** CNS H3K4me3 7.19
Breast myoepithelial** Other H3K4me3 6.93
Fetal trunk muscle** Skeletal Muscle H3K4me3 6.87
Peripheral blood mononuclear primary** Immune H3K4me3 6.66
Penis foreskin melanocyte primary** Other H3K4me3 6.53
Fetal stomach** GI H3K4me3 6.26
Gastric** GI H3K4me3 6.24
Right atrium** Cardiovascular H3K4me3 6.24
CD4+ CD25- CD45RA+ naive primary** Immune H3K4me3 6.16
CD4+ CD25int CD127+ Tmem primary** Immune H3K4me3 5.96
Ovary** Other H3K4me3 5.64
Penis foreskin fibroblast primary** Connective/Bone H3K4me3 5.57
Substantia nigra** CNS H3K4me3 5.41
Esophagus** GI H3K4me3 5.35
Colonic mucosa** GI H3K4me3 5.3
Fetal large intestine** GI H3K4me3 5.14
Fetal placenta** Other H3K4me3 5.07
Fetal brain** CNS H3K9ac 5.05
Aorta* Cardiovascular H3K4me3 4.74
CD8 naive primary (BI)* Immune H3K4me3 4.49
CD14 primary* Immune H3K4me3 4.49
Fetal small intestine* GI H3K4me3 4.43
Breast vHMEC* Other H3K4me3 4.39
CD4+ CD25- Th primary* Immune H3K4me3 4.38
CD34 primary* Immune H3K4me3 4.37
Placenta amnion* Other H3K4me3 4.34
Angular gyrus* CNS H3K9ac 4.33
Penis foreskin keratinocyte primary* Other H3K4me3 4.3
Pancreatic islets* Adrenal/Pancreas H3K4me3 4.26
Mid frontal lobe* CNS H3K9ac 4.23
CD4+ CD25- CD45RO+ memory primary* Immune H3K4me3 4.14
Rectal smooth muscle* GI H3K4me3 4.12
Left Ventricle* Cardiovascular H3K4me3 4.11
CD8 memory primary* Immune H3K4me3 4.06
CD4+ CD25+ CD127- Treg primary* Immune H3K4me3 4.05
Placenta chorion* Other H3K4me3 4.05
CD8 naive primary (UCSF-UBC)* Immune H3K4me3 3.77
Anterior caudate* CNS H3K9ac 3.73
Cingulate gyrus* CNS H3K9ac 3.69
CD19 primary (UW)* Immune H3K4me3 3.63
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary* Immune H3K4me3 3.58
CD4 naive primary* Immune H3K4me3 3.53
Fetal brain* CNS H3K4me3 3.53
Lung* Cardiovascular H3K4me3 3.5
Mid frontal lobe* CNS H3K27ac 3.43
Breast fibroblast primary* Connective/Bone H3K4me3 3.41

(c) Age at menarche

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Liver (BI)* Liver H3K4me1 4.76
Fetal adrenal* Adrenal/Pancreas H3K4me1 3.41
CD14 primary* Immune H3K4me1 3.33
Liver* Liver H3K27ac 2.97
Adipose nuclei Other H3K9ac 2.71

(d) LDL

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Liver (BI)* Liver H3K4me1 4.51
Adipose nuclei* Other H3K4me1 4.26
Liver* Liver H3K27ac 3.61
Adipose nuclei* Other H3K9ac 3.34
Adipose nuclei* Other H3K4me3 3.08
CD14 primary* Immune H3K4me1 2.86
Adipose nuclei* Other H3K27ac 2.84
Liver (BI)* Liver H3K9ac 2.74
Liver (BI)* Liver H3K4me3 2.66

(e) HDL

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Liver (BI)* Liver H3K4me1 3.99
Liver* Liver H3K27ac 3.66
Liver (BI)* Liver H3K9ac 3.02
Duodenum Mucosa GI H3K4me3 2.71
Liver (UCSD) Liver H3K4me3 2.68

(f) Triglycerides

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Adipose nuclei* Other H3K4me1 4.21
Duodenum Mucosa* GI H3K4me1 3.43
Colonic mucosa* GI H3K9ac 3.01
Duodenum Mucosa GI H3K9ac 2.78
Rectal mucosa GI H3K9ac 2.68

(g) Coronary artery disease

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Pancreatic islets Adrenal/Pancreas H3K4me3 2.87
Pancreatic islets Adrenal/Pancreas H3K27ac 2.73
Fetal large intestine GI H3K4me1 2.49
Fetal small intestine GI H3K4me1 2.31
Adipose nuclei Other H3K9ac 2.27

(h) Type 2 Diabetes

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Pancreatic islets* Adrenal/Pancreas H3K27ac 3.93
Pancreatic islets* Adrenal/Pancreas H3K4me1 3.1
Pancreatic islets Adrenal/Pancreas H3K4me3 2.93
Pancreatic islets Adrenal/Pancreas H3K4me3 2.25
Fetal small intestine GI H3K4me1 2.18

(i) Fasting Glucose

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Fetal brain** CNS H3K4me3 18.51
Mid frontal lobe** CNS H3K4me3 14.44
Germinal matrix** CNS H3K4me3 12.68
Mid frontal lobe** CNS H3K9ac 11.27
Angular gyrus** CNS H3K4me3 10.89
Inferior temporal lobe** CNS H3K4me3 10.77
Cingulate gyrus** CNS H3K9ac 10.27
Fetal brain** CNS H3K9ac 10.24
Anterior caudate** CNS H3K4me3 9.66
Cingulate gyrus** CNS H3K4me3 9.34
Pancreatic islets** Adrenal/Pancreas H3K4me3 8.65
Anterior caudate** CNS H3K9ac 8.5
Angular gyrus** CNS H3K9ac 8.33
Mid frontal lobe** CNS H3K27ac 8.1
Anterior caudate** CNS H3K4me1 7.92
Inferior temporal lobe** CNS H3K4me1 7.43
Psoas muscle** Skeletal Muscle H3K4me3 7.38
Fetal brain** CNS H3K4me1 7.21
Inferior temporal lobe** CNS H3K9ac 7.03
Hippocampus middle** CNS H3K9ac 6.03
Pancreatic islets** Adrenal/Pancreas H3K9ac 5.79
Penis foreskin melanocyte primary** Other H3K4me3 5.68
Angular gyrus** CNS H3K27ac 5.63
Cingulate gyrus** CNS H3K4me1 5.55
Hippocampus middle** CNS H3K4me3 5.55
CD34 primary** Immune H3K4me3 5.33
Sigmoid colon** GI H3K4me3 5.3
Fetal adrenal** Adrenal/Pancreas H3K4me3 5.2
Inferior temporal lobe** CNS H3K27ac 5.08
Peripheral blood mononuclear primary** Immune H3K4me3 5.03
Gastric** GI H3K4me3 4.93
Substantia nigra* CNS H3K4me3 4.71
Fetal brain* CNS H3K4me3 4.58
Hippocampus middle* CNS H3K4me1 4.48
Ovary* Other H3K4me3 4.19
CD19 primary (UW)* Immune H3K4me3 4.15
Small intestine* GI H3K4me3 4.07
Lung* Cardiovascular H3K4me3 3.93
Fetal stomach* GI H3K4me3 3.89
Fetal leg muscle* Skeletal Muscle H3K4me3 3.82
Spleen* Immune H3K4me3 3.77
Breast fibroblast primary* Connective/Bone H3K4me3 3.69
Right ventricle* Cardiovascular H3K4me3 3.67
CD4+ CD25- Th primary* Immune H3K4me3 3.66
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary* Immune H3K4me1 3.66
CD8 naive primary (UCSF-UBC)* Immune H3K4me3 3.65
Pancreas* Adrenal/Pancreas H3K4me3 3.63
CD4+ CD25- Th primary* Immune H3K4me1 3.56
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me1 3.56
Colonic mucosa* GI H3K4me3 3.49
Right atrium* Cardiovascular H3K4me3 3.48
Fetal trunk muscle* Skeletal Muscle H3K4me3 3.47
CD4+ CD25int CD127+ Tmem primary* Immune H3K4me3 3.46
Substantia nigra* CNS H3K9ac 3.44
Placenta amnion* Other H3K4me3 3.38
Breast myoepithelial* Other H3K9ac 3.26
CD8 naive primary (BI)* Immune H3K4me1 3.24
Substantia nigra* CNS H3K4me1 3.18
Cingulate gyrus* CNS H3K27ac 3.1
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me3 3.06

(j) Schizophrenia

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Mid frontal lobe* CNS H3K27ac 4.42
Penis foreskin keratinocyte primary Other H3K9ac 3.05
Fetal brain CNS H3K9ac 2.92
Fetal brain CNS H3K4me3 2.9
Mid frontal lobe CNS H3K4me3 2.78

(k) Bipolar disorder

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Angular gyrus CNS H3K9ac 2.61
Mid frontal lobe CNS H3K9ac 2.38
Mid frontal lobe CNS H3K4me1 2.36
Anterior caudate CNS H3K9ac 2.28
Cingulate gyrus CNS H3K9ac 2.22

(l) Anorexia

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Angular gyrus** CNS H3K4me3 6.63
Fetal brain** CNS H3K4me3 6.05
Mid frontal lobe** CNS H3K4me3 5.99
Anterior caudate** CNS H3K4me3 5.73
Inferior temporal lobe** CNS H3K4me3 5.63
CD56 primary** Immune H3K4me3 5.32
Germinal matrix** CNS H3K4me3 5.29
Mid frontal lobe** CNS H3K9ac 5.26
Cingulate gyrus** CNS H3K9ac 4.98
Cingulate gyrus** CNS H3K4me3 4.94
CD8 naive primary (UCSF-UBC)** Immune H3K4me3 4.88
Penis foreskin melanocyte primary* Other H3K4me3 4.73
Mid frontal lobe* CNS H3K27ac 4.43
Peripheral blood mononuclear primary* Immune H3K4me3 4.39
CD34 primary* Immune H3K4me3 4.12
Fetal brain* CNS H3K9ac 4.03
CD14 primary* Immune H3K4me3 4.01
Inferior temporal lobe* CNS H3K9ac 3.98
Angular gyrus* CNS H3K9ac 3.96
Sigmoid colon* GI H3K4me3 3.83
Pancreatic islets* Adrenal/Pancreas H3K9ac 3.83
CD4+ CD25int CD127+ Tmem primary* Immune H3K4me3 3.72
Anterior caudate* CNS H3K9ac 3.65
Hippocampus middle* CNS H3K4me3 3.65
CD19 primary (UW)* Immune H3K4me3 3.63
Small intestine* GI H3K4me3 3.52
CD4+ CD25- Th primary* Immune H3K4me3 3.45
Lung* Cardiovascular H3K4me3 3.19
CD4+ CD25- CD45RO+ memory primary* Immune H3K4me3 3.19
Hippocampus middle* CNS H3K9ac 3.17
Liver (UCSD)* Liver H3K4me3 3.08
Fetal placenta* Other H3K4me3 3.08
Fetal adrenal* Adrenal/Pancreas H3K4me3 3.04
Right atrium* Cardiovascular H3K4me3 2.99
Pancreatic islets* Adrenal/Pancreas H3K4me3 2.96
CD8 naive primary (UCSF-UBC)* Immune H3K9ac 2.95
CD3 primary (BI)* Immune H3K4me3 2.94
Angular gyrus* CNS H3K27ac 2.93
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me3 2.93
Gastric* GI H3K4me3 2.92
CD4 naive primary* Immune H3K4me3 2.89
CD8 memory primary* Immune H3K4me3 2.78
CD3 primary (UW)* Immune H3K4me3 2.76
Rectal smooth muscle* GI H3K4me3 2.73
Fetal brain* CNS H3K4me3 2.67
Esophagus* GI H3K4me3 2.66
CD8 naive primary (BI)* Immune H3K4me3 2.58
Left Ventricle* Cardiovascular H3K4me3 2.56
CD19 primary (BI)* Immune H3K4me3 2.56
Fetal thymus* Immune H3K4me3 2.52
Breast vHMEC* Other H3K4me3 2.51
CD8 primary* Immune H3K4me3 2.51
Psoas muscle* Skeletal Muscle H3K4me3 2.51
Peripheral blood mononuclear primary* Immune H3K9ac 2.5
Ovary* Other H3K4me3 2.47
Pancreas* Adrenal/Pancreas H3K4me3 2.46
Breast fibroblast primary* Connective/Bone H3K4me3 2.45
CD4+ CD25+ CD127- Treg primary* Immune H3K4me3 2.36
Placenta amnion* Other H3K4me3 2.34
Right ventricle* Cardiovascular H3K4me3 2.33

(m) Years of education

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
Inferior temporal lobe* CNS H3K4me3 3.21
Cingulate gyrus* CNS H3K27ac 3.2
Substantia nigra* CNS H3K27ac 3.16
Hippocampus middle* CNS H3K27ac 3.13
Breast myoepithelial* Other H3K9ac 3.06
Inferior temporal lobe* CNS H3K4me1 2.93
Anterior caudate* CNS H3K27ac 2.81
Inferior temporal lobe* CNS H3K27ac 2.81
Angular gyrus* CNS H3K27ac 2.77
Pancreatic islets* Adrenal/Pancreas H3K4me1 2.55

(n) Ever smoked

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary** Immune H3K4me1 6.76
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary** Immune H3K4me1 6.11
CD4+ CD25- CD45RO+ memory primary** Immune H3K4me1 5.92
CD4 memory primary** Immune H3K4me1 5.88
CD4+ CD25+ CD127- Treg primary** Immune H3K4me1 5.83
CD25- IL17- Th stim MACS** Immune H3K27ac 5.7
Th2** Immune H3K27ac 5.5
CD8 memory primary** Immune H3K4me1 5.38
CD4 naive primary** Immune H3K4me1 5.26
CD4+ CD25- Th primary** Immune H3K4me1 5.25
CD19 primary (UW)** Immune H3K4me1 5.25
CD4+ CD25int CD127+ Tmem primary** Immune H3K4me1 4.88
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me1 4.75
CD3 primary (BI)* Immune H3K4me1 4.64
CD3 primary (UW)* Immune H3K4me1 4.63
CD25- IL17+ Th17 stim* Immune H3K27ac 4.55
CD8 naive primary (UCSF-UBC)* Immune H3K4me1 4.49
CD8 naive primary (BI)* Immune H3K4me1 4.45
Th0* Immune H3K27ac 4.09
CD25+ CD127- Treg* Immune H3K27ac 4.09
Th1* Immune H3K27ac 3.96
CD19 primary (BI)* Immune H3K4me1 3.91
CD56 primary* Immune H3K4me1 3.77
Treg primary* Immune H3K4me3 3.63
CD3 primary* Immune H3K27ac 3.62
CD20* Immune H3K27ac 3.45
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary* Immune H3K4me3 3.45
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary* Immune H3K4me3 3.17
CD4+ CD25+ CD127- Treg primary* Immune H3K4me3 3.1
CD4+ CD25int CD127+ Tmem primary* Immune H3K4me3 2.76
Peripheral blood mononuclear primary* Immune H3K9ac 2.58
CD25int CD127+ Tmem* Immune H3K27ac 2.27
CD4+ CD25- CD45RO+ memory primary* Immune H3K4me3 2.24
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me3 2.2
CD8 memory primary* Immune H3K4me3 2.17
CD19* Immune H3K27ac 2.13
CD4 memory primary* Immune H3K4me3 2.12
Peripheral blood mononuclear primary* Immune H3K4me1 2.12
CD4+ CD25- Th primary* Immune H3K4me3 1.98

(o) Rheumatoid arthritis

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary** Immune H3K4me1 7.59
Th1** Immune H3K27ac 6.54
CD25- IL17+ Th17 stim** Immune H3K27ac 6.5
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary** Immune H3K4me1 6.24
CD4 memory primary** Immune H3K4me1 5.88
Th2** Immune H3K27ac 5.87
CD4+ CD25- Th primary** Immune H3K4me1 5.59
CD8 memory primary** Immune H3K4me1 5.13
CD14 primary** Immune H3K4me1 5.03
CD3 primary (UW)** Immune H3K4me1 4.96
Th0* Immune H3K27ac 4.8
CD56 primary* Immune H3K4me1 4.8
CD25- IL17- Th stim MACS* Immune H3K27ac 4.72
CD4+ CD25- CD45RO+ memory primary* Immune H3K4me1 4.7
CD4 naive primary* Immune H3K4me1 4.51
CD4+ CD25int CD127+ Tmem primary* Immune H3K4me1 4.44
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me1 4.36
CD8 naive primary (BI)* Immune H3K4me1 4.31
CD19 primary (UW)* Immune H3K4me1 4.26
CD8 naive primary (UCSF-UBC)* Immune H3K4me1 4.2
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary* Immune H3K4me3 4.18
CD19 primary (BI)* Immune H3K4me1 4.17
CD3 primary (BI)* Immune H3K4me1 3.73
CD4+ CD25+ CD127- Treg primary* Immune H3K4me1 3.62
CD3 primary* Immune H3K27ac 3.25
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary* Immune H3K4me3 3.16
CD34 primary* Immune H3K4me1 2.87
Peripheral blood mononuclear primary* Immune H3K9ac 2.87
CD15 primary* Immune H3K4me1 2.85
Spleen* Immune H3K4me1 2.7
CD4 primary* Immune H3K4me3 2.49
Peripheral blood mononuclear primary* Immune H3K4me1 2.46
CD8 primary* Immune H3K4me3 2.44
CD14* Immune H3K27ac 2.18
CD4 memory primary* Immune H3K4me3 2.12
Colonic mucosa* GI H3K4me1 2.11
CD19 primary (BI)* Immune H3K4me3 2.1
CD4 naive primary* Immune H3K4me3 2.09
Mobilized CD34 primary* Immune H3K4me1 2.08
CD25int CD127+ Tmem* Immune H3K27ac 2.04

(p) Crohn's disease

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Cell type Cell-type group Mark -log10(p)
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary** Immune H3K4me1 6.37
CD25- IL17+ Th17 stim** Immune H3K27ac 5.53
CD4+ CD25- CD45RO+ memory primary** Immune H3K4me1 5.48
CD4 memory primary** Immune H3K4me1 5.19
CD4+ CD25+ CD127- Treg primary** Immune H3K4me1 4.88
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary* Immune H3K4me1 4.73
CD3 primary* Immune H3K27ac 4.72
Th2* Immune H3K27ac 4.5
CD25+ CD127- Treg* Immune H3K27ac 4.42
Colonic mucosa* GI H3K4me1 4.17
Spleen* Immune H3K4me1 4.06
CD4+ CD25- IL17+ PMA Ionomycin stim Th17 primary* Immune H3K4me3 4.04
CD4+ CD25- Th primary* Immune H3K4me1 4.03
Colonic mucosa* GI H3K27ac 4.0
Th1* Immune H3K27ac 3.96
CD4 naive primary* Immune H3K4me1 3.91
CD4+ CD25int CD127+ Tmem primary* Immune H3K4me1 3.87
CD8 memory primary* Immune H3K4me1 3.84
Rectal mucosa* GI H3K4me1 3.74
CD19 primary (UW)* Immune H3K4me1 3.72
Colonic mucosa* GI H3K9ac 3.63
Rectal mucosa* GI H3K9ac 3.57
CD25- IL17- Th stim MACS* Immune H3K27ac 3.54
CD25int CD127+ Tmem* Immune H3K27ac 3.45
Th0* Immune H3K27ac 3.44
CD8 naive primary (UCSF-UBC)* Immune H3K4me1 3.43
CD56 primary* Immune H3K4me1 3.21
Rectal mucosa* GI H3K27ac 3.16
CD19 primary (BI)* Immune H3K4me1 2.95
Treg primary* Immune H3K4me3 2.93
CD8 naive primary (BI)* Immune H3K4me1 2.91
CD3 primary (UW)* Immune H3K4me1 2.83
CD4+ CD25- CD45RA+ naive primary* Immune H3K4me1 2.7
CD3 primary (BI)* Immune H3K4me1 2.44
Rectal mucosa* GI H3K4me3 2.29
CD4+ CD25- IL17- PMA Ionomycin stim MACS Th primary* Immune H3K4me3 2.24
Duodenum smooth muscle* GI H3K27ac 2.17
Duodenum Mucosa* GI H3K4me1 2.15
CD34 primary* Immune H3K4me1 2.12

(q) Ulcerative colitis

Table A.6: Enrichment of top cell types for 17 traits. * = significant at FDR < 0.05. ** = significant at
p < 0.05 after correcting for multiple hypotheses.

Phenotype Heritability z-score
Type 2 Diabetes 8.12
Ever smoked 8.42
Coronary artery disease 8.71
Ulcerative colitis 8.81
Bipolar disorder 8.81
Triglycerides 9.45
LDL 9.49
Crohn's disease 10.12
Anorexia 10.44
HDL 11.11
Years of education 11.59
BMI 16.79
Age at menarche 16.84
Height 18.89
Schizophrenia 21.29
Rheumatoid arthritis 9.05
Fasting Glucose 7.68

Table A.7: Heritability z-scores for the 17 traits analyzed in the manuscript

[Figure A-1 plot: power on the vertical axis against h2g (0.2 to 0.9) on the horizontal axis, with one curve each for Nh2g = 3000, 4000, 5000, 7000, and 10000.]

Figure A-1: $N$ and $h_g^2$ affect power only through $N \cdot h_g^2$. Sample size is varied to keep $N \cdot h_g^2$ constant as
$h_g^2$ varies. Power stays constant for a given value of $N \cdot h_g^2$.
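
The simulation design described in the Figure A-1 caption can be sketched in a few lines of Python: for each target value of the product of sample size and heritability, the sample size is chosen so that the product stays fixed as the heritability changes. This is an illustrative sketch only. The grid of values mirrors the figure legend and axis, while the function name and the rounding to the nearest integer are assumptions.

```python
def sample_size_for_constant_product(n_times_h2g, h2g):
    """Return N such that N * h2g equals the target product (rounded to an integer)."""
    if not 0.0 < h2g <= 1.0:
        raise ValueError("h2g must be in (0, 1]")
    return round(n_times_h2g / h2g)


# Values read off the Figure A-1 legend and horizontal axis.
products = [3000, 4000, 5000, 7000, 10000]
h2g_grid = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# For each target product, the sample size required at every h2g on the grid.
design = {
    c: [sample_size_for_constant_product(c, h) for h in h2g_grid]
    for c in products
}
```

For example, holding the product at 3000 requires N = 6000 when h2g = 0.5.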

[Figure A-2 plots: (a) Total $h^2$. (b) Category-specific $h^2$. (c) Proportion of $h^2$.]

Figure A-2: Lack of bias in the presence of enrichment. In these simulations, total $h^2$ and the proportion of
causal SNPs are varied while keeping the proportion of heritability in the tested category constant.

[Figure A-3 plots: (a) Total $h^2$. (b) Category-specific $h^2$. (c) Proportion of $h^2$.]

Figure A-3: Lack of bias in null simulations. In these simulations, total $h^2$ and the proportion of causal SNPs
are varied. There is no enrichment in these simulations.

[Figure A-4 plots: panels with horizontal axes True total h2, True h2(CNS), Heritability z-score, and N*h2g; curves for Enriched and Null simulations, each with proportion of SNPs causal = 0.05 and 0.005.]

Figure A-4: Bias and power with an out-of-sample reference panel. In these simulations, WTCCC1
genotypes were used to generate the phenotypes, and stratified LD score regression was run using a 1000G
reference panel of European samples, including GBR, FIN, IBS, CEU, and TSI populations.

[Figure A-5 plots: (a) High enrichment. (b) Low enrichment.]

Figure A-5: Results of Figure 3-3, broken down by cell-type group.

[Figure A-6 plots: panels for Triglycerides, Type 2 Diabetes, Ulcerative colitis, LDL, Anorexia, and Coronary artery disease; horizontal axis -log10(p); one bar per cell-type group: Adrenal/Pancreas, Central Nervous System, Cardiovascular, Connective/Bone, Gastrointestinal, Immune/Hematopoietic, Kidney, Liver, Skeletal Muscle, and Other.]

Figure A-6: Enrichment of cell-type groups for traits not included in Figure 3-6. The black dotted line at
$-\log_{10}(P) = 3.5$ is the cutoff for Bonferroni significance. The grey dotted line at $-\log_{10}(P) = 2.1$ is the
cutoff for FDR < 0.05.

[Figure plot omitted. Horizontal axis: size of the category tested (proportion of SNPs).]

Figure A-7: Power as a function of category size. Each point represents a rejection probability over 500
simulations. Baseline enrichment and enrichment in the tested category remain constant as the size of the
category changes. All simulations have $h^2 = 0.7$, $N = 14000$, $p_{causal} = 0.05$.

[Figure plot omitted. A reference line is labeled "No enrichment".]

Figure A-8: Enrichment of baseline categories, meta-analyzed over all 17 traits. The standard errors in
this analysis are artificially low due to correlated traits such as HDL/LDL/Triglycerides being treated as
independent. Results meta-analyzed over nine independent traits are reported in Figure 3-4.

Appendix B

Supplementary information for Chapter 4

Supplemental tables can be found at http://biorxiv.org/highwire/filestream/29366/field_highwire-adjunct-files/1/103069-2.xlsx.

Supplemental figures follow below.

[Heatmap image omitted; rows are labeled by trait.]

Figure S1. A heatmap of results from the multi-tissue analysis that pass FDR 5%.
-log_10(P) is displayed, truncated at 5, for results that pass FDR 5%. Numerical results
are reported in Table S9.
[Heatmap image omitted; rows are labeled by trait.]

Figure S2. A heatmap of results from the analysis using chromatin data from the
Roadmap Epigenomics project. -log_10(P) is displayed, truncated at 5, for results that
pass FDR 5%. Numerical results are reported in Table S14.
[Heatmap images omitted. Panels (A) and (B) are each indexed by the 13 GTEx brain regions: Cerebellar Hemisphere, Cerebellum, Anterior cingulate cortex (BA24), Cortex, Frontal Cortex (BA9), Hypothalamus, Nucleus accumbens (basal ganglia), Caudate (basal ganglia), Putamen (basal ganglia), Amygdala, Hippocampus, Spinal cord (cervical c-1), Substantia nigra.]

Figure S3. (A) A heatmap of Jaccard indices among gene sets for the 13 brain regions
in GTEx, in the multi-tissue analysis. (B) A heatmap of Jaccard indices among gene sets
for the 13 brain regions in GTEx, in the analysis of GTEx brain regions only. Numerical
results are reported in Table S15.
[Figure plots omitted. Panels (A), (B) and (C) show traits including Years of education, Neuroticism, Epilepsy - all, Epilepsy - general, Tourette and ADHD. Legend entries: Cortex, Cerebellum, Striatum, Other; Astrocyte, Neuron, Oligoden.; GABA., Glu. Horizontal axis in (B) and (C): Cell type.]

Figure S4. Results of brain analyses for traits not depicted in Figure 4. (A) Results from within-brain
analysis of 13 brain regions in GTEx, classified into four groups. (B) Results from the data
of Cahoy et al. on three brain cell types. (C) Results from PsychENCODE data on two neuronal
subtypes. Numerical results are reported in Table S16.
[Figure plots omitted. Panels: IBD, Bipolar, Multiple sclerosis, Celiac, HDL, LDL, Neuroticism, Type 1 Diabetes, Ulcerative Colitis, ADHD, Anorexia, Parkinson, Tourette. Legend entries: B, Innate lymphocyte, Myeloid, Stem, Stromal, alpha beta T, gamma delta T.]

Figure S5. Results of ImmGen analysis for traits not depicted in Figure 5. The width of each bar is
proportional to its height, for easier visualization. Numerical results are reported in Table S17.
[Heatmap image omitted; rows are labeled by trait.]
Figure S6. A heatmap of results from applying the SNPsea method to the gene
expression data from the multi-tissue analysis. -log_10(P) is displayed, truncated at 5,
for results that pass FDR 5%. Numerical results are reported in Table S10.
[Heatmap image omitted; rows are labeled by trait.]

Figure S7. A heatmap of results from applying the DEPICT method with a P-value
threshold of 5e-8. -log_10(P) is displayed, truncated at 5, for results that pass FDR
5%. Numerical results are reported in Table S11.
[Heatmap image omitted; rows are labeled by trait.]

Figure S8. A heatmap of results from applying the DEPICT method with a P-value
threshold of 1e-5. -log_10(P) is displayed, truncated at 5, for results that pass FDR
5%. Numerical results are reported in Table S12.
[Heatmap image omitted; rows are labeled by trait.]

Figure S9. A heatmap of results from applying the MAGMA method to the gene sets
created in the multi-tissue analysis. -log_10(P) is displayed, truncated at 5, for results
that pass FDR 5%. Numerical results are reported in Table S13.
[Figure plots omitted. Panels: A. Cortex v. non-brain. B. Cortex v. other brain. Horizontal axis in both panels: number of cortex samples.]

Figure S10. We repeatedly sub-sampled our dataset to a variety of sample sizes and ran our
approach on the sub-sampled dataset. (A) We assessed cortex enrichment for schizophrenia in
the multi-tissue analysis, in which cortex was compared to all non-brain samples. We kept the
ratio of cortex samples to non-brain samples constant as we downsampled. (B) We assessed
cortex enrichment for schizophrenia in the analysis of GTEx brain regions, in which cortex was
compared to all other brain samples. We kept the ratio of cortex samples to other brain
samples constant as we downsampled.
Appendix C

Supplementary information for Chapter 5

Supplementary Note

Quantitative Traits

Suppose we sample two cohorts with sample sizes $N_1$ and $N_2$. We measure phenotype 1 in cohort
1 and phenotype 2 in cohort 2. We model phenotype vectors for each cohort as $y_1 = Y\beta + \delta$ and
$y_2 = Z\gamma + \epsilon$, where $Y$ and $Z$ are matrices of genotypes with columns standardized to mean zero
and variance one$^1$, with dimensions $N_1 \times M$ and $N_2 \times M$, respectively; $\beta$ and $\gamma$ are vectors of
per-standardized-genotype effect sizes; and $\delta$ and $\epsilon$ are vectors of residuals, representing environmental
effects and non-additive genetic effects. In this model, $Y$ and $Z$ are unobserved matrices of all
SNPs, including SNPs that are not genotyped.

We treat all of $Y$, $Z$, $\beta$, $\gamma$, $\delta$ and $\epsilon$ as random. We model all of these as independent, except for
the pairs $(\beta, \gamma)$ and $(\delta, \epsilon)$. Suppose that $(\beta, \gamma)$ has mean zero and covariance matrix$^2$

\[
\mathrm{Var}[(\beta, \gamma)] = \frac{1}{M}
\begin{pmatrix} h_1^2 I & \rho_g I \\ \rho_g I & h_2^2 I \end{pmatrix}
\]

and $(\delta, \epsilon)$ has mean zero and covariance matrix

\[
\mathrm{Var}[(\delta, \epsilon)] =
\begin{pmatrix} (1 - h_1^2) I & \rho_e I \\ \rho_e I & (1 - h_2^2) I \end{pmatrix}.
\]

$^1$ We ignore the distinction between normalizing and centering in the population and in the sample, since this
introduces only $O(1/N)$ error.

Let $\rho := \rho_g + \rho_e$. Vectors of genotypes for each individual are drawn i.i.d. from a distribution with
covariance matrix $r$ (i.e., $r$ is an LD matrix with $r_{jk} = \mathbb{E}[Y_{ij}Y_{ik}]$). There are $N_s$ individuals who
are included in both studies.

Lemma 1. Under this model, the expected genetic covariance (as defined in methods) between
phenotypes is $\rho_g$, justifying our use of the notation $\rho_g$.

Proof. Let $X$ denote a $1 \times M$ vector of standardized genotypes for an arbitrary individual. Under
the model, the additive genetic component of phenotype 1 for this individual is $\sum_j X_j\beta_j$, and
the additive genetic component of phenotype 2 for this individual is $\sum_j X_j\gamma_j$. Thus, the genetic
covariance between phenotype 1 and phenotype 2 is

\begin{align*}
\mathrm{Cov}\Big[\sum_j X_j\beta_j, \sum_j X_j\gamma_j\Big]
&= \mathbb{E}\Big[\Big(\sum_j X_j\beta_j\Big)\Big(\sum_j X_j\gamma_j\Big)\Big] \\
&= \sum_j \sum_k \mathbb{E}[X_j X_k \beta_j \gamma_k] \\
&= \sum_j \mathbb{E}[X_j^2 \beta_j \gamma_j] \\
&= \rho_g.
\end{align*}

$^2$ The assumption that $\beta$ is drawn with equal variance for all SNPs hides an implicit assumption that rare SNPs
have larger per-allele effect sizes than common SNPs. As discussed in the simulations section of the main text and
in our earlier work [13], LD Score regression is robust to moderate violations of this assumption, though it may break
down in extreme cases, e.g., if all causal variants are rare. In situations where a different model for $\mathrm{Var}[\beta]$ is more
appropriate, all proofs in this note go through with LD Score replaced by weighted LD Scores, $\ell_j = \sum_k \mathrm{Var}[\beta_k]\, r_{jk}^2$.
We compute linear regression z-scores $z_{1j} := Y_j^T y_1/\sqrt{N_1}$ and $z_{2j} := Z_j^T y_2/\sqrt{N_2}$ for genotyped
SNPs $j$ (where $Y_j$ and $Z_j$ denote the $j$th columns of $Y$ and $Z$).

Definition 1. The LD Score of a variant $j$ is $\ell_j := \sum_k r_{jk}^2$, where the sum is taken over all other
variants $k$.

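Proposition 1. Let $j$ denote a genotyped SNP. Under the model described above,

\[
\mathbb{E}[z_{1j} z_{2j}] = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}}. \tag{C.1}
\]

Proof. By the law of total expectation,

\[
\mathbb{E}[z_{1j} z_{2j}] = \mathbb{E}\big[\mathbb{E}[z_{1j} z_{2j} \mid Y, Z]\big]. \tag{C.2}
\]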

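First we compute the inner expectation from Equation C.2, with $Z$ and $Y$ fixed.

\begin{align*}
\mathbb{E}[z_{1j} z_{2j} \mid Y, Z]
&= \frac{1}{\sqrt{N_1 N_2}}\, \mathbb{E}[Y_j^T y_1 y_2^T Z_j] \\
&= \frac{1}{\sqrt{N_1 N_2}}\, Y_j^T\, \mathbb{E}[(Y\beta + \delta)(Z\gamma + \epsilon)^T]\, Z_j \\
&= \frac{1}{\sqrt{N_1 N_2}}\, Y_j^T \big(Y\,\mathbb{E}[\beta\gamma^T] Z^T + Y\,\mathbb{E}[\beta\epsilon^T] + \mathbb{E}[\delta\gamma^T] Z^T + \mathbb{E}[\delta\epsilon^T]\big) Z_j \\
&= \frac{1}{\sqrt{N_1 N_2}}\, Y_j^T \big(Y\,\mathbb{E}[\beta\gamma^T] Z^T + \mathbb{E}[\delta\epsilon^T]\big) Z_j \\
&= \frac{1}{\sqrt{N_1 N_2}} \Big(\frac{\rho_g}{M}\, Y_j^T Y Z^T Z_j + \rho_e\, Y_j^T Z_j\Big). \tag{C.3}
\end{align*}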

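Next, we remove the conditioning on $Y$ and $Z$:

\[
\frac{1}{\sqrt{N_1 N_2}}\, \mathbb{E}[Y_j^T Z_j] = \frac{N_s}{\sqrt{N_1 N_2}} \tag{C.4}
\]

and

\[
\frac{1}{\sqrt{N_1 N_2}}\, \mathbb{E}[Y_j^T Y Z^T Z_j] = \sqrt{N_1 N_2}\,\ell_j + \frac{M N_s}{\sqrt{N_1 N_2}}. \tag{C.5}
\]

Substituting equations C.4 and C.5 into Equation C.3,

\begin{align*}
\mathbb{E}[z_{1j} z_{2j}]
&= \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{N_s(\rho_g + \rho_e)}{\sqrt{N_1 N_2}} \\
&= \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}}. \tag{C.6}
\end{align*}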

If study 1 and study 2 are the same study, then $N_1 = N_2 = N_s$, $\rho_g = h^2$ and $\rho = 1$, so Equation
C.6 reduces to the LD Score regression equation for a single trait from [13].
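
As a small numerical sketch of Equation C.1 and its single-trait special case, the following Python evaluates the expected product of z-scores for one SNP. The function and parameter names are illustrative assumptions and are not taken from the ldsc software.

```python
import math


def expected_z_product(n1, n2, n_s, rho_g, rho, m, ld_score):
    """Right-hand side of Equation C.1 for a single SNP.

    n1, n2   : sample sizes of the two studies
    n_s      : number of overlapping samples
    rho_g    : genetic covariance
    rho      : total phenotypic covariance (rho_g + rho_e)
    m        : number of SNPs M
    ld_score : LD Score of the SNP
    """
    root = math.sqrt(n1 * n2)
    slope = root * rho_g / m          # coefficient of the LD Score
    intercept = rho * n_s / root      # contribution of sample overlap
    return slope * ld_score + intercept


def expected_chi2(n, h2, m, ld_score):
    """Single-trait case: N1 = N2 = Ns = n, rho_g = h2, rho = 1."""
    return expected_z_product(n, n, n, h2, 1.0, m, ld_score)
```

With no sample overlap (`n_s = 0`) the intercept vanishes, and in the single-trait case the expression becomes $N h^2 \ell_j / M + 1$.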

Regression Weights

We can improve the efficiency of LD Score regression by weighting by the reciprocal of the conditional
variance function (CVF), $\mathrm{Var}[z_{1j} z_{2j} \mid \ell_j]$. The CVF is not uniquely determined by the assumptions
about the first and second moments of $\beta$ and $\gamma$ used to derive Proposition 1. Therefore we derive
the CVF for the case where $z_{1j}$ and $z_{2j}$ are jointly distributed as bivariate normal$^3$. From a standard
formula for double second moments of the bivariate normal, the CVF is

\begin{align*}
\mathrm{Var}[z_{1j} z_{2j} \mid \ell_j]
&= \mathrm{Var}[z_{1j}]\,\mathrm{Var}[z_{2j}] + \mathbb{E}[z_{1j} z_{2j}]^2 \\
&= \Big(\frac{N_1 h_1^2 \ell_j}{M} + 1\Big)\Big(\frac{N_2 h_2^2 \ell_j}{M} + 1\Big)
 + \Big(\frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j + \frac{\rho N_s}{\sqrt{N_1 N_2}}\Big)^2. \tag{C.7}
\end{align*}

The terms on the left follow from the fact that $\mathrm{Var}[z_j] = \mathbb{E}[\chi_j^2]$ and $\mathbb{E}[\chi_j^2] = N h^2 \ell_j/M + 1$. The term
on the right follows from Proposition 1. Note that if $z_1 = z_2$, this reduces to the expression for the
CVF of $\chi^2$ statistics from [13] (though there is an error in Equation 3.2 of the supplementary note
of [13]: the right side is missing a factor of 2. We thank Peter Visscher for pointing this out).
In cases where the normality assumption does not hold, LD Score regression will remain unbiased,
but may be inefficient, because the regression weights will be suboptimal. We also apply a heuristic
weighting scheme to avoid overcounting SNPs in high-LD regions, described in the methods.
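
The conditional variance function in Equation C.7 can be sketched numerically as follows. This is only an illustration under the bivariate normal assumption stated above; the function names are hypothetical, the heuristic correction for high-LD regions described in the methods is not included, and in practice the heritabilities and genetic covariance would have to be replaced by estimates.

```python
import math


def conditional_variance(n1, n2, n_s, h2_1, h2_2, rho_g, rho, m, ld_score):
    """Var[z_1j z_2j | LD Score] from Equation C.7 (bivariate normal case)."""
    var_z1 = n1 * h2_1 * ld_score / m + 1.0
    var_z2 = n2 * h2_2 * ld_score / m + 1.0
    root = math.sqrt(n1 * n2)
    # Expected product of z-scores, as in Proposition 1.
    mean_product = root * rho_g * ld_score / m + rho * n_s / root
    return var_z1 * var_z2 + mean_product ** 2


def regression_weight(*args):
    """Reciprocal of the conditional variance, used as a regression weight."""
    return 1.0 / conditional_variance(*args)
```

When the two studies coincide ($z_1 = z_2$), the result is twice the squared expected $\chi^2$ statistic, which matches the factor of 2 mentioned above.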

Liability Threshold Model

In the liability threshold (probit) model [189], binary traits are determined by an unobserved continuous
liability $\psi$. The observed trait is $y := 1[\psi > \tau]$, where $\tau$ is the liability threshold. If $\psi$ is
normally distributed, then setting $\tau := \Phi^{-1}(1 - K)$ (where $\Phi$ is the standard normal cdf) yields a
population prevalence of $K$.
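
For concreteness, the threshold for a given prevalence can be computed with the Python standard library. This is a minimal sketch that assumes a standard normal liability; the function names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # liability assumed standard normal


def liability_threshold(prevalence):
    """tau = Phi^{-1}(1 - K) for population prevalence K."""
    return _STD_NORMAL.inv_cdf(1.0 - prevalence)


def prevalence_from_threshold(tau):
    """K = P[liability > tau] = 1 - Phi(tau)."""
    return 1.0 - _STD_NORMAL.cdf(tau)
```

For example, a prevalence of 1% corresponds to a threshold a little above 2.3 standard deviations.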
For phenotypes generated according to the liability threshold model, we can estimate not only
the heritability and genetic covariance of the observed phenotype, but also the heritability and
genetic covariance of the unobserved liability.
In the next lemma, we derive population case and control allele frequencies in terms of the
heritability of liability when liability is generated following the model for quantitative traits above.
Since we are only modeling additive effects and are willing to assume Hardy-Weinberg equilibrium,
we lose no generality and simplify notation considerably by stating the proofs in terms of haploid
genotypes.

$^3$ For instance, it is sufficient but not necessary to assume that $\beta$, $\gamma$, $\delta$ and $\epsilon$ are multivariate normal. More
generally, the z-scores will be approximately normal if $\beta$ and $\gamma$ are reasonably polygenic. If the distribution of effect
sizes is heavy-tailed, e.g., if there are few causal SNPs, then the CVF may be larger.

We state this lemma in terms of marginal per-allele effect sizes, instead of the per-standardized-genotype
effect sizes considered above. Here marginal means that these are the effect sizes obtained
by univariate regression of phenotype against genotype in the infinite data limit. Haploid standardized
genotypes are defined $X_{ij} := (G_{ij} - p_j)/\sqrt{p_j(1 - p_j)}$, where $G_{ij}$ is the 0-1 coded genotype. If
$\beta_j$ is the marginal per-standardized-genotype effect and $\zeta_j$ is the marginal per-allele effect, we have
$X_{ij}\beta_j = G_{ij}\zeta_j$. Thus, setting $G_{ij} = 1$ yields $\zeta_j = \beta_j\sqrt{(1 - p_j)/p_j}$.
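
The conversion between the two effect-size scales can be written out directly. The sketch below uses hypothetical function names and follows the haploid definitions in the preceding paragraph.

```python
import math


def standardized_genotype(g, p):
    """Haploid standardized genotype X = (G - p) / sqrt(p (1 - p))."""
    return (g - p) / math.sqrt(p * (1.0 - p))


def per_allele_effect(beta_std, p):
    """Per-allele effect zeta = beta * sqrt((1 - p) / p), obtained by setting G = 1."""
    return beta_std * math.sqrt((1.0 - p) / p)
```

For an allele with frequency 0.2, a per-standardized-genotype effect is doubled on the per-allele scale, since $\sqrt{0.8/0.2} = 2$.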

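Lemma 2. Suppose unobserved liabilities $\psi$, $\varphi$ for traits $y_1$, $y_2$ with thresholds $\tau_1$, $\tau_2$ corresponding
to prevalences $K_1$, $K_2$ are generated according to the model for quantitative traits above, i.e.,
$\psi_i = \sum_j X_{ij}\beta_j + \delta_i$, $\varphi_i = \sum_j X_{ij}\gamma_j + \epsilon_i$, with

\[
\mathrm{Var}[(\beta, \gamma)] = \frac{1}{M}
\begin{pmatrix} h_1^2 I & \rho_g I \\ \rho_g I & h_2^2 I \end{pmatrix}
\]

and

\[
\mathrm{Var}[(\delta, \epsilon)] =
\begin{pmatrix} (1 - h_1^2) I & \rho_e I \\ \rho_e I & (1 - h_2^2) I \end{pmatrix}.
\]

Let $\zeta_j$ and $\xi_j$ denote the marginal per-allele effect sizes of SNP $j$ on $\psi$ and $\varphi$. Let

\[
p_{cas,kj} := \mathbb{P}[G_{ij} = 1 \mid y_{ik} = 1], \qquad
p_{con,kj} := \mathbb{P}[G_{ij} = 1 \mid y_{ik} = 0]
\]

denote the allele frequencies of SNP $j$ in cases and controls for phenotype $k$, where $y_{ik}$ denotes the
value of phenotype $k$ for individual $i$ and $k = 1, 2$. Then

\begin{align*}
\mathbb{E}[p_{cas,1j} - p_{con,1j}] &= 0, \\
\mathbb{E}[p_{cas,2j} - p_{con,2j}] &= 0, \\
\mathrm{Var}[p_{cas,1j} - p_{con,1j}] &= \frac{p_j(1 - p_j)\,\phi(\tau_1)^2 h_1^2}{M K_1^2 (1 - K_1)^2}\,\ell_j, \\
\mathrm{Var}[p_{cas,2j} - p_{con,2j}] &= \frac{p_j(1 - p_j)\,\phi(\tau_2)^2 h_2^2}{M K_2^2 (1 - K_2)^2}\,\ell_j, \\
\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2j} - p_{con,2j}] &= \frac{p_j(1 - p_j)\,\phi(\tau_1)\phi(\tau_2)\,\rho_g}{M K_1(1 - K_1) K_2(1 - K_2)}\,\ell_j,
\end{align*}

where the expectations are taken over the marginal effects $\zeta_j$ and $\xi_j$, and where $\phi$ is the standard normal density. These results apply
to population allele frequencies, not allele frequencies in a finite sample. We deal with ascertained
finite samples in the next section.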

Proof. This proof is accomplished in two steps. First, we compute allele frequencies conditional
on the marginal effects on liability. To do this, we reverse the conditional probability using Bayes'
theorem, which reduces the problem to a series of [Taylor approximations to] Gaussian integrals.
Second, we remove the conditioning on the marginal effects on liability in order to express the allele
frequencies in terms of $h_1^2$, $h_2^2$, $\rho_g$ and $\ell_j$. Since liability is just a quantitative trait, we need only
apply the LD Score regression equation for quantitative traits.

By Bayes' rule,

\begin{align*}
\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1, \zeta_j]
&= \frac{\mathbb{P}[y_{i1} = 1 \mid G_{ij} = 1, \zeta_j]\,\mathbb{P}[G_{ij} = 1]}{\mathbb{P}[y_{i1} = 1]} \\
&= \frac{p_j}{K_1}\,\mathbb{P}[y_{i1} = 1 \mid G_{ij} = 1, \zeta_j] \\
&= \frac{p_j}{K_1}\,\mathbb{P}[\psi_i > \tau_1 \mid G_{ij} = 1, \zeta_j]. \tag{C.8}
\end{align*}

The distribution of $\psi_i$ given $G_{ij}$ and $\zeta_j$ is $\psi_i \mid (G_{ij} = 1, \zeta_j) \approx N(\zeta_j, 1)$ (where the
approximation that the variance equals one holds when the marginal heritability explained by $j$ is
small, which is the typical case in GWAS). Thus $\mathbb{P}[\psi_i > \tau_1 \mid G_{ij} = 1, \zeta_j]$ is simply a Gaussian integral.
We approximate this probability with a first-order Taylor expansion around $\tau_1$:

\[
\mathbb{P}[\psi_i > \tau_1 \mid G_{ij} = 1, \zeta_j] = 1 - \Phi(\tau_1 - \zeta_j) \approx K_1 + \phi(\tau_1)\zeta_j. \tag{C.9}
\]

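Substituting Equation C.9 into Equation C.8,

\[
\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1, \zeta_j] = \frac{p_j}{K_1}\big(K_1 + \phi(\tau_1)\zeta_j\big). \tag{C.10}
\]

A similar argument shows that

\[
\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 0, \zeta_j] = \frac{p_j}{1 - K_1}\big(1 - K_1 - \phi(\tau_1)\zeta_j\big). \tag{C.11}
\]

Subtracting Equation C.11 from Equation C.10,

\[
\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1, \zeta_j] - \mathbb{P}[G_{ij} = 1 \mid y_{i1} = 0, \zeta_j] = \frac{p_j\,\phi(\tau_1)\,\zeta_j}{K_1(1 - K_1)}. \tag{C.12}
\]

Similar results hold for trait 2, replacing $\zeta$ with $\xi$ and subscript 1 with subscript 2.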

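We have written the probabilities in question in terms of constants and marginal effects on
liability. Since liability is simply a quantitative trait, the means, variances, and covariances of
the marginal effects on liability are described by the LD Score regression equation for quantitative
traits from Proposition 1. Precisely, $\mathbb{E}[\zeta_j] = \mathbb{E}[\xi_j] = 0$, $\mathrm{Var}[\zeta_j] = (1 - p_j)h_1^2\ell_j/(p_j M)$,
$\mathrm{Var}[\xi_j] = (1 - p_j)h_2^2\ell_j/(p_j M)$ and $\mathrm{Cov}[\zeta_j, \xi_j] = (1 - p_j)\rho_g\ell_j/(p_j M)$. If we combine these results with Equation
C.12, we find that

\[
\mathbb{E}[p_{cas,1j} - p_{con,1j}] = 0; \tag{C.13}
\]

\begin{align*}
\mathrm{Var}[p_{cas,1j} - p_{con,1j}]
&= \mathrm{Var}\Big[\frac{p_j\,\phi(\tau_1)\,\zeta_j}{K_1(1 - K_1)}\Big] \\
&= \frac{p_j(1 - p_j)\,\phi(\tau_1)^2 h_1^2}{M K_1^2(1 - K_1)^2}\,\ell_j \tag{C.14}
\end{align*}

(similarly for trait two), and

\begin{align*}
\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2j} - p_{con,2j}]
&= \mathrm{Cov}\Big[\frac{p_j\,\phi(\tau_1)\,\zeta_j}{K_1(1 - K_1)},\, \frac{p_j\,\phi(\tau_2)\,\xi_j}{K_2(1 - K_2)}\Big] \\
&= \frac{p_j(1 - p_j)\,\phi(\tau_1)\phi(\tau_2)\,\rho_g}{M K_1(1 - K_1) K_2(1 - K_2)}\,\ell_j. \tag{C.15}
\end{align*}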
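
The quality of the first-order Taylor step in Equation C.9 can be checked numerically. The sketch below compares the linearized case/control allele frequency difference of Equation C.12 with the value obtained by evaluating the Gaussian tail $1 - \Phi(\tau - \zeta_j)$ directly. Both versions assume the liability given the allele has variance one, as in the proof; the function names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_control_difference(p, prevalence, zeta):
    """Linearized p_cas - p_con from Equation C.12."""
    tau = _STD_NORMAL.inv_cdf(1.0 - prevalence)
    return p * _STD_NORMAL.pdf(tau) * zeta / (prevalence * (1.0 - prevalence))


def case_control_difference_gaussian(p, prevalence, zeta):
    """Same difference without the Taylor step.

    Uses P[liability > tau | G = 1] = 1 - Phi(tau - zeta) in Bayes' rule,
    keeping the unit-variance approximation from the proof.
    """
    tau = _STD_NORMAL.inv_cdf(1.0 - prevalence)
    tail = 1.0 - _STD_NORMAL.cdf(tau - zeta)
    p_cas = p * tail / prevalence
    p_con = p * (1.0 - tail) / (1.0 - prevalence)
    return p_cas - p_con
```

For small per-allele effects, which is the regime assumed in the proof, the two functions agree closely, and both vanish when the effect is zero.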

Ascertained Studies of Liability Threshold Traits

In the next proposition, we derive an LD Score regression equation for ascertained case/control
studies.
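Let $P_i$ denote the sample prevalence of $y_i$ in study $i$ for $i = 1, 2$. We compute z-scores

\[
z_{ij} := \frac{\sqrt{N_i P_i(1 - P_i)}\,(\hat{p}_{cas,ij} - \hat{p}_{con,ij})}{\sqrt{\hat{p}_{ij}(1 - \hat{p}_{ij})}},
\]

where $\hat{p}_{ij}$ denotes allele frequency in the entire sample$^4$, $\hat{p}_{cas}$ denotes sample case allele frequency
and $\hat{p}_{con}$ denotes sample control allele frequency.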
We emphasize one subtlety before stating the main proposition. The results in this section allow
for study $k$ to select samples based on phenotype $l$ only if $k = l$. If study 1 ascertains on phenotype
2 (for example, if all cases $i$ in study 1 have $y_{i1} = y_{i2} = 1$), then $\hat{p}_{cas,1j}$ will not be an unbiased
estimate of $p_{cas,1j}$. Indeed, in this example, $\mathbb{E}[\hat{p}_{cas,1j}] = \mathbb{P}[G_{ij} = 1 \mid y_{i1} = y_{i2} = 1]$, which will not
equal $p_{cas,1j} = \mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1]$ unless $\rho = 1$ or $\rho = 0$. This follows from the fact that the
conditionals and marginals of a bivariate normal are equal iff $\rho = 0$ or $\rho = 1$. We do not derive
formulae describing the bias, except to note that the most common scenario, the "healthy controls"
model (cases are sampled independently but all controls are controls for both traits), is probably
nothing to worry about, so long as cases for both traits are uncommon. In this scenario,
$\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 0] \approx \mathbb{P}[G_{ij} = 1 \mid y_{i1} = y_{i2} = 0]$. Conditioning on $y_{i2} = 0$ hardly changes the distribution,
because $y_{i2} = 0$ most of the time, anyway. In addition, excluding double cases from the analysis (as
a conservative defense against spurious comorbidity) is also likely to be safe for pairs of uncommon
traits with small excess comorbidity. In this case, $\mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1] \approx \mathbb{P}[G_{ij} = 1 \mid y_{i1} = 1, y_{i2} = 0]$,
so long as $y_2$ is uncommon and not too highly correlated with $y_1$.

$^4$ Conditional on the marginal effect of $j$, the expected value of $\hat{p}_j$ is not equal to $p_j$ unless $P = K$ or the marginal
effect of $j$ is zero.

Proposition 2. Under the liability threshold model from Lemma 2,

\[
\mathbb{E}[z_{1j} z_{2j}] \approx \frac{\sqrt{N_1 N_2}\,\rho_{g,obs}}{M}\,\ell_j
+ \sqrt{N_1 N_2 P_1(1 - P_1) P_2(1 - P_2)} \sum_{a,b \in \{cas,con\}} \frac{N_{a,b}\,(-1)^{1 + 1[a = b]}}{N_{a,1} N_{b,2}}, \tag{C.16}
\]

where

\[
\rho_{g,obs} := \rho_g\,\frac{\phi(\tau_1)\phi(\tau_2)\sqrt{P_1(1 - P_1) P_2(1 - P_2)}}{K_1(1 - K_1) K_2(1 - K_2)}
\]

denotes observed scale genetic covariance, $N_{a,b}$ denotes the number of individuals with phenotype $a$
in study 1 and $b$ in study 2 for $a, b \in \{cas, con\}$ (e.g., $N_{cas,con}$ is the number of individuals who
are a case in study 1 but a control in study 2), $N_i$ denotes total sample size in study $i$, and $N_{a,i}$ for
$a \in \{cas, con\}$ and $i = 1, 2$ denotes the number of individuals with phenotype $a$ in study $i$.

Observe that $\rho_{g,obs}/\sqrt{h_{1,obs}^2 h_{2,obs}^2} = \rho_g/\sqrt{h_1^2 h_2^2} = r_g$. Put another way, the natural definition for
"observed scale genetic correlation" turns out to be the same as regular genetic correlation, because
the scale transformation factors in the numerator and denominator cancel. This is convenient: we
can compute genetic correlations for binary traits on a sensible scale without having to worry about
sample and population prevalences.
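
The cancellation of the scale factors can be illustrated numerically. The sketch below applies the observed-scale transformations from Proposition 2 and Corollary 2 and checks that the genetic correlation is unchanged. Function names are hypothetical, and the liability is assumed standard normal.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def scale_factor(prevalence, sample_prevalence):
    """phi(tau) * sqrt(P (1 - P)) / (K (1 - K)) for one trait."""
    tau = _STD_NORMAL.inv_cdf(1.0 - prevalence)
    return (_STD_NORMAL.pdf(tau)
            * math.sqrt(sample_prevalence * (1.0 - sample_prevalence))
            / (prevalence * (1.0 - prevalence)))


def observed_scale_covariance(rho_g, k1, p1, k2, p2):
    """Observed-scale genetic covariance: one scale factor per trait."""
    return rho_g * scale_factor(k1, p1) * scale_factor(k2, p2)


def observed_scale_heritability(h2, k, p):
    """Observed-scale heritability: the squared scale factor of the trait."""
    return h2 * scale_factor(k, p) ** 2


def genetic_correlation(rho_g, h2_1, h2_2):
    return rho_g / math.sqrt(h2_1 * h2_2)
```

Computing `genetic_correlation` from liability-scale inputs or from their observed-scale counterparts gives the same value, whatever the population and sample prevalences are.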

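Proof. The full form of $z_{1j} z_{2j}$ is

\[
z_{1j} z_{2j} = \frac{\sqrt{N_1 N_2 c}\,(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j})}{\sqrt{\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j})}},
\]

where $c := P_1(1 - P_1) P_2(1 - P_2)$. Our strategy for obtaining the expectation is

\begin{align*}
\mathbb{E}[z_{1j} z_{2j}]
&\approx \sqrt{c N_1 N_2}\;\frac{\mathbb{E}[(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j})]}{\mathbb{E}\big[\sqrt{\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j})}\big]} \tag{C.17} \\
&\approx \sqrt{c N_1 N_2}\;\frac{\mathbb{E}[(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j})]}{\sqrt{\mathbb{E}[\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j})]}} \tag{C.18} \\
&= \sqrt{c N_1 N_2}\;\frac{\mathbb{E}\big[\mathbb{E}[(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j}) \mid \zeta_j, \xi_j]\big]}{\sqrt{\mathbb{E}\big[\mathbb{E}[\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j}) \mid \zeta_j, \xi_j]\big]}}, \tag{C.19}
\end{align*}

where $\zeta_j$ and $\xi_j$ denote the marginal per-allele effects of $j$. Approximation C.17 hides $O(1/N)$
error from moving from the expectation of a ratio to a ratio of expectations. Approximation
C.18 hides $O(1/N)$ error from moving from the expectation of a square root to a square root of
expectations. Equality C.19 follows from applying the law of total expectation to the numerator
and denominator.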

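First, we compute the numerator. By linearity of expectation,

\begin{align*}
\mathbb{E}[(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j}) \mid \zeta_j, \xi_j]
&= \mathbb{E}[\hat{p}_{cas,1j}\hat{p}_{cas,2j} \mid \zeta_j, \xi_j] - \mathbb{E}[\hat{p}_{cas,1j}\hat{p}_{con,2j} \mid \zeta_j, \xi_j] \\
&\quad - \mathbb{E}[\hat{p}_{con,1j}\hat{p}_{cas,2j} \mid \zeta_j, \xi_j] + \mathbb{E}[\hat{p}_{con,1j}\hat{p}_{con,2j} \mid \zeta_j, \xi_j]. \tag{C.20}
\end{align*}

After conditioning on the marginal effects $\zeta_j$ and $\xi_j$, the only source of variance in the sample allele
frequencies $\hat{p}_{cas,1j}$, $\hat{p}_{con,1j}$, $\hat{p}_{cas,2j}$, $\hat{p}_{con,2j}$ is sampling error. Write $\hat{p}_{cas,1j}\hat{p}_{cas,2j} = (p_{cas,1j} + \eta)(p_{cas,2j} + \nu)$,
where $\eta$ and $\nu$ denote sampling error. If study 1 and study 2 share samples, $\nu$ and $\eta$ will be
correlated:

\begin{align*}
\mathbb{E}[\hat{p}_{cas,1j}\hat{p}_{cas,2j} \mid \zeta_j, \xi_j]
&= p_{cas,1j}\,p_{cas,2j} + \mathbb{E}[\eta\nu] \\
&\approx p_{cas,1j}\,p_{cas,2j} + \frac{N_{cas,cas}\sqrt{p_{cas,1j}(1 - p_{cas,1j})\,p_{cas,2j}(1 - p_{cas,2j})}}{N_{cas,1} N_{cas,2}} \tag{C.21} \\
&\approx p_{cas,1j}\,p_{cas,2j} + \frac{N_{cas,cas}\,p_j(1 - p_j)}{N_{cas,1} N_{cas,2}}, \tag{C.22}
\end{align*}

where approximation C.21 is the (bivariate) central limit theorem, and approximation C.22 comes
from ignoring the difference between $\sqrt{p_{cas,1j}(1 - p_{cas,1j})\,p_{cas,2j}(1 - p_{cas,2j})}$ and $p_j(1 - p_j)$. This step
is justified in the derivation of the denominator. Similar relationships hold for the other terms in
Equation C.20.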

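If we combine equations C.22 and C.15, we obtain

\[
\mathbb{E}[(\hat{p}_{cas,1j} - \hat{p}_{con,1j})(\hat{p}_{cas,2j} - \hat{p}_{con,2j})] \approx
p_j(1 - p_j)\Big(\frac{\phi(\tau_1)\phi(\tau_2)\,\rho_g}{c' M}\,\ell_j + \sum_{a,b \in \{cas,con\}} \frac{N_{a,b}\,(-1)^{1 + 1[a = b]}}{N_{a,1} N_{b,2}}\Big), \tag{C.23}
\]

where $c' := K_1(1 - K_1) K_2(1 - K_2)$.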

Next, we derive the expectation of the denominator. Conditional on $\zeta_j$ and $\xi_j$, $\hat{p}_{1j}$ equals
$P_1 p_{cas,1j} + (1 - P_1) p_{con,1j}$ up to sampling error, so $\hat{p}_{1j}(1 - \hat{p}_{1j})$ is determined by this quantity
plus $O(1/N)$ sampling variance. If studies 1 and 2 share samples, the
$O(1/N)$ sampling variance in $\hat{p}_{1j}(1 - \hat{p}_{1j})$ and $\hat{p}_{2j}(1 - \hat{p}_{2j})$ will be correlated, but this still only
amounts to $O(N_s/N_1 N_2)$ error. If we remove the conditioning on $\zeta_j$ and $\xi_j$, then $\hat{p}_{1j}(1 - \hat{p}_{1j})$
is equal to $p_j(1 - p_j)$ plus $O(h_{obs}^2 \ell_j/M)$ error from uncertainty in $\zeta_j$. The covariance
between uncertainty in $\zeta_j$ and uncertainty in $\xi_j$ is driven by $\rho_{g,obs}$, so the expectation of the denominator
is $\mathbb{E}\big[\sqrt{\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j})}\big] = p_j(1 - p_j)\big(1 + O(N_s/N_1 N_2) + O(\rho_{g,obs}\ell_j/M)\big)$. We make
the approximation$^5$ that

\[
\mathbb{E}\big[\sqrt{\hat{p}_{1j}(1 - \hat{p}_{1j})\,\hat{p}_{2j}(1 - \hat{p}_{2j})}\big] \approx p_j(1 - p_j). \tag{C.24}
\]

We obtain the desired result by dividing $\sqrt{c N_1 N_2}$ times Equation C.23 by Equation C.24. $\square$

Corollary 1. If study 1 is an ascertained study of a binary trait, and study 2 is a non-ascertained
quantitative study, then Proposition 2 holds, except with genetic covariance on the half-observed scale

\[
\rho_{g,obs} := \rho_g\,\frac{\phi(\tau_1)\sqrt{P_1(1 - P_1)}}{K_1(1 - K_1)}.
\]

$^5$ For $\ell_j = 100$ (roughly the median 1kG LD Score), $M = 10^7$ and $\rho_{g,obs} = 1$, we get $\rho_{g,obs}\ell_j/M = 10^{-5}$. A
worst-case value for $N_s/N_1 N_2$ might be $N_s = N_1 = N_2 = 10^3$, in which case $N_s/N_1 N_2 = 10^{-3}$. Thus, $\rho_{g,obs}\ell_j/M$
and $N_s/N_1 N_2$ will generally be at least 3 orders of magnitude smaller than 1.
Corollary 2. For a single binary trait,

\[
\mathbb{E}[\chi^2] = \frac{N h_{obs}^2}{M}\,\ell_j + 1, \tag{C.25}
\]

where $h_{obs}^2 = h^2 \phi(\tau)^2 P(1 - P)/\big(K^2(1 - K)^2\big)$.

Proof. This follows from Proposition 2 if we set study 1 equal to study 2 and note that the observed
scale genetic covariance between a trait and itself is observed scale heritability. To show that the
intercept is one, observe that if study 1 and study 2 are the same, then

\begin{align*}
\sqrt{c N_1 N_2} \sum_{a,b \in \{cas,con\}} \frac{N_{a,b}\,(-1)^{1 + 1[a = b]}}{N_{a,1} N_{b,2}}
&= N P(1 - P)\Big(\frac{1}{N_{cas}} + \frac{1}{N_{con}}\Big) \\
&= \frac{N P(1 - P)(N_{cas} + N_{con})}{N_{cas} N_{con}} \\
&= \frac{N^2 P(1 - P)}{N_{cas} N_{con}}. \tag{C.26}
\end{align*}

But $NP = N_{cas}$ and $N(1 - P) = N_{con}$, so Equation C.26 simplifies to 1. $\square$
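
The intercept term of Equation C.16 is easy to evaluate for concrete sample counts. The sketch below uses hypothetical names and dictionary-based inputs chosen for illustration; it computes the term and can be used to confirm that it equals one when the two studies are the same.

```python
import math

GROUPS = ("cas", "con")


def intercept_term(n1, n2, p1, p2, n_overlap, n_group_1, n_group_2):
    """Second term of Equation C.16.

    n_overlap[(a, b)] : individuals with phenotype a in study 1 and b in study 2
    n_group_1[a]      : individuals with phenotype a in study 1
    n_group_2[b]      : individuals with phenotype b in study 2
    p1, p2            : sample prevalences in studies 1 and 2
    """
    c = p1 * (1.0 - p1) * p2 * (1.0 - p2)
    total = 0.0
    for a in GROUPS:
        for b in GROUPS:
            sign = 1.0 if a == b else -1.0  # (-1)^(1 + 1[a = b])
            total += sign * n_overlap[(a, b)] / (n_group_1[a] * n_group_2[b])
    return math.sqrt(c * n1 * n2) * total
```

For a single study with 300 cases and 700 controls, the overlap counts are 300 case/case and 700 control/control individuals, and the term evaluates to 1, in agreement with Equation C.26.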

Flavors of Heritability and Genetic Correlation

The heritability parameter estimated by ldsc is subtly different from the heritability parameter $h_g^2$
estimated by GCTA. If $g$ denotes the set of all genotyped SNPs in some GWAS, define
$\beta_{GCTA} := \operatorname{argmax}_{\alpha \in \mathbb{R}^{|g|}} \mathrm{Cor}[y_i, X_g\alpha]$, where $X_g$ is a random vector of standardized genotypes for SNPs in $g$.
Then the heritability parameter estimated by GCTA is defined

\[
h_g^2 := \sum_{j \in g} \beta_{GCTA,j}^2.
\]

Let $S$ denote the set of SNPs used to compute LD Scores (i.e., $\ell_j = \sum_{k \in S} r_{jk}^2$), and let
$\beta_S := \operatorname{argmax}_{\alpha \in \mathbb{R}^{|S|}} \mathrm{Cor}[y_i, X_S\alpha]$. Generally $\beta_{S,j} \neq \beta_{GCTA,j}$ unless all SNPs in $S \setminus g$ are not in LD with
SNPs in $g$. Define

\[
h_S^2 := \sum_{j \in S} \beta_{S,j}^2.
\]

Let $S'$ denote the set of SNPs in $S$ with MAF above 5%. Define

\[
h_{5-50\%}^2 := \sum_{j \in S'} \beta_{S,j}^2. \tag{C.27}
\]

The default setting in ldsc is to report $h_{5-50\%}^2$, estimated as the slope from LD Score regression
times $M_{5-50\%}$, the number of SNPs with MAF above 5%.
The reason for this is the following: suppose that $h^2$ per SNP is not constant as a function of
MAF. Then the slope of LD Score regression will represent some sort of weighted average of the
values of $h^2$ per SNP, with more weight given to classes of SNPs that are well-represented among
the regression SNPs. In a typical GWAS setting, the regression SNPs are mostly common SNPs,
so multiplying the slope from LD Score regression by $M$ (which includes rare SNPs) amounts to
extrapolating that $h^2$ per SNP among common variants is the same as $h^2$ per SNP among rare
variants. This extrapolation is particularly risky, because there are many more rare SNPs than
common SNPs.
It is probably reasonable to treat $h^2$ per SNP as a constant function of MAF for SNPs with
MAF above 5%, but we have very little information about $h^2$ per SNP for SNPs with MAF below
5%. Therefore we report $h_{5-50\%}^2$ instead of $h_S^2$ to avoid excessive extrapolation error. This lower
bound can be pushed lower with larger sample sizes and better rare variant coverage, either from
sequencing or imputation.
There are two main distinctions between $h_{5-50\%}^2$ and $h_g^2$. First, $h_g^2$ does not include the effects of
common SNPs that are not tagged by the set of genotyped SNPs $g$. Second, the effects of causal
SNPs with MAF below 5% (e.g., 4% SNPs) are not counted towards $h_{5-50\%}^2$. In practice, neither of these distinctions makes a large
difference, since most GWAS arrays focus on common variation and manage to assay or tag almost
all common variants, which is why we do not emphasize this distinction in the main text.
The relationship between the genetic covariance parameter estimated by LD Score regression
and the genetic covariance parameter estimated by GCTA is similar to the relationship between
$h_{5-50\%}^2$ and $h_g^2$. Choice of $M$ is not important for genetic correlation, because the factors of $M$ in
the numerator and denominator cancel.
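
The cancellation of $M$ in the genetic correlation can be seen in a short sketch. Here the inputs are assumed to be per-SNP heritabilities and a per-SNP genetic covariance (regression slopes with the sample-size factors already divided out). The function names are hypothetical and do not correspond to ldsc options.

```python
import math


def heritability_from_slope(per_snp_h2, m):
    """Scale a per-SNP heritability up to a set of m SNPs."""
    return per_snp_h2 * m


def genetic_correlation(per_snp_cov, per_snp_h2_1, per_snp_h2_2, m):
    """r_g from per-SNP quantities; m multiplies numerator and denominator."""
    rho_g = per_snp_cov * m
    h2_1 = heritability_from_slope(per_snp_h2_1, m)
    h2_2 = heritability_from_slope(per_snp_h2_2, m)
    return rho_g / math.sqrt(h2_1 * h2_2)
```

Changing `m` (for example, from the number of SNPs with MAF above 5% to the number of all SNPs) rescales the heritability estimate but leaves the genetic correlation unchanged.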

Supplementary Tables

Prevalence   h^2           h^2_liab      r_g
0.01         0.72 (0.1)    0.59 (0.04)   0.51 (0.4)
0.05         0.72 (0.12)   0.59 (0.07)   0.45 (0.17)
0.2          0.72 (0.11)   0.6 (0.08)    0.46 (0.14)
0.5          0.73 (0.11)   0.59 (0.08)   0.42 (0.17)

Table C.1: Simulations with one binary trait and one quantitative trait. The prevalence column describes
the population prevalence of the binary trait. We ran 100 simulations for each prevalence. The $h^2$ column
shows the mean heritability estimate for the quantitative trait. The $h^2_{liab}$ column shows the mean liability-scale
heritability estimate for the binary trait. The $r_g$ column shows the mean genetic correlation estimate.
Standard deviations across 100 simulations are in parentheses. The true parameter values were $r_g = 0.46$,
$h^2 = 0.7$ for the quantitative trait and $h^2_{liab} = 0.6$ for the binary trait. For all simulations, the quantitative
trait sample size was 1000, the binary trait sample size was 1000 cases and 1000 controls, and there were
500 overlapping samples. There were 1000 effective independent SNPs. The environmental covariance was
0.2. We simulated case/control ascertainment using simulated LD block genotypes and a rejection sampling
model of ascertainment. This is the same strategy used to simulate case/control ascertainment in [13].

LD Score   h^2 (5-50%)   rho_g (5-50%)   r_g (5-50%)
Truth      0.83          0.42            0.5
HM3        0.53 (0.08)   0.28 (0.07)     0.52 (0.1)
PSG        0.36 (0.08)   0.18 (0.06)     0.5 (0.13)
30 Bins    0.81 (0.12)   0.41 (0.08)     0.51 (0.09)
60 Bins    0.81 (0.12)   0.41 (0.09)     0.51 (0.09)

Table C.2: Simulations with MAF- and LD-dependent genetic architecture. Effect sizes were drawn from
normal distributions such that the variance of per-allele effect sizes was uncorrelated with MAF, and variants
with LD Score below 100 were fourfold enriched for heritability. Sample size was 2062 with complete overlap
between studies; causal SNPs were about 600,000 best-guess imputed 1kG SNPs on chr 2, and the SNPs
retained for the LD Score regression were the subset of about 100,000 of these SNPs that were included
in HM3. True parameter values are shown in the top line of the table. Estimates are averages across
100 simulations. Standard deviations (in parentheses) are standard deviations across 100 simulations. LD
Scores were estimated using in-sample LD and a 1cM window. HM3 means LD Score with sum taken over
SNPs in HM3. PSG (per-standardized-genotype) means LD Score with the sum taken over all SNPs in 1kG
as in [13]. 30 bins means per-allele LD Score binned on a MAF by LD Score grid with MAF breaks at 0.05,
0.1, 0.2, 0.3 and 0.4 and LD Score breaks at 35, 75, 150 and 400. 60 bins means per-allele LD Score binned
on a MAF by LD Score grid with MAF breaks at 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4 and 0.45 and
LD Score breaks at 30, 60, 120, 200 and 300. These simulations demonstrate that naive (HM3, PSG) LD
Score regression gives correct genetic correlation estimates even when heritability and genetic covariance
estimates are biased, so long as genetic correlation does not depend on LD.
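
The MAF by LD Score grids described in the caption can be sketched as follows. This is an illustrative sketch, not the code used for the simulations: the function names are hypothetical, and it assumes that SNPs falling outside the outermost breaks form their own open-ended cells. Under that assumption the break points quoted above give 6 x 5 = 30 and 10 x 6 = 60 cells, which matches the names of the two settings; that the bins were counted this way is an inference from the arithmetic.

```python
import numpy as np

# Break points quoted in the Table C.2 caption.
MAF_BREAKS_30 = [0.05, 0.1, 0.2, 0.3, 0.4]
LD_BREAKS_30 = [35, 75, 150, 400]
MAF_BREAKS_60 = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45]
LD_BREAKS_60 = [30, 60, 120, 200, 300]


def n_bins(maf_breaks, ld_breaks):
    """Number of grid cells, counting the open-ended cells at both ends."""
    return (len(maf_breaks) + 1) * (len(ld_breaks) + 1)


def assign_bins(maf, ld_score, maf_breaks, ld_breaks):
    """Return one integer bin label per SNP on the MAF-by-LD-Score grid."""
    maf_idx = np.digitize(maf, maf_breaks)      # values in 0 .. len(maf_breaks)
    ld_idx = np.digitize(ld_score, ld_breaks)   # values in 0 .. len(ld_breaks)
    return maf_idx * (len(ld_breaks) + 1) + ld_idx


# Three made-up SNPs, for illustration only.
labels = assign_bins(
    np.array([0.03, 0.07, 0.45]),
    np.array([20.0, 100.0, 500.0]),
    MAF_BREAKS_30,
    LD_BREAKS_30,
)
```

Each bin label would then index a separate LD Score column, so that the regression can fit a different per-SNP heritability in each cell of the grid.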

Trait                      Reference                                           Sample Size
Schizophrenia              PGC Schizophrenia Working Group, Nature, 2014 [31]  70,100
Bipolar disorder           PGC Bipolar Working Group, Nat Genet, 2011 [4]      16,731
Major depression           PGC MDD Working Group, Mol Psych, 2013 [190]        18,759
Anorexia Nervosa           Boraska, et al., Mol Psych, 2014 [56]               17,767
Autism Spectrum Disorder   PGC Cross-Disorder Group, Lancet, 2013 [148]        10,263
Ever/Never Smoked          TAG Consortium, Nat Genet, 2010 [49]                74,035
Alzheimer's                Lambert, et al., Nat Genet, 2013 [14]               54,162
College                    Rietveld, et al., Science, 2013 [48]                101,069
Height                     Lango Allen, et al., Nature, 2010 [50]              133,858
Obesity Class 1            Berndt, et al., Nat Genet, 2013 [191]               98,000
Extreme Waist-Hip Ratio    Berndt, et al., Nat Genet, 2013 [191]               10,000
Coronary Artery Disease    Schunkert, et al., Nat Genet, 2011 [3]              86,995
Triglycerides              Teslovich, et al., Nature, 2010 [41]                96,598
LDL Cholesterol            Teslovich, et al., Nature, 2010 [41]                95,454
HDL Cholesterol            Teslovich, et al., Nature, 2010 [41]                99,900
Type-2 Diabetes            Morris, et al., Nat Genet, 2012 [54]                69,033
Fasting Glucose            Manning, et al., Nat Genet, 2012 [55]               46,186
Childhood Obesity          EGG Consortium, Nat Genet, 2012 [151]               13,848
Birth Length               van der Valk, et al., HMG, 2014 [192]               22,263
Birth Weight               Horikoshi, et al., Nat Genet, 2013 [149]            26,836
Infant Head Circumference  Taal, et al., Nat Genet, 2012 [152]                 10,767
Age at Menarche            Perry, et al., Nature, 2014 [52]                    132,989
Crohn's Disease            Jostins, et al., Nature, 2012 [58]                  20,883
Ulcerative Colitis         Jostins, et al., Nature, 2012 [58]                  27,432
Rheumatoid Arthritis       Stahl, et al., Nat Genet, 2010 [193]                25,708

Table C.3: Sample sizes and references for traits analyzed in the main text.

Phenotype 1        Phenotype 2          $r_g$   se
College (Yes/No)   Years of Education   1.00    0.014

Table C.4: Genetic correlation between the two educational attainment phenotypes from Rietveld, et al. [48]

Supplementary Figures


Figure C-1: Genetic correlations among anthropometric traits from studies by the GIANT and EGG
consortia. The structure of the figure is the same as Figure 5-2 in the main text: blue corresponds to
positive genetic correlations; red corresponds to negative genetic correlation. Larger squares correspond
to more significant p-values. Genetic correlations that are different from zero at 1% FDR are shown as
full-sized squares. Genetic correlations that are significantly different from zero at significance level 0.05
after Bonferroni correction are given an asterisk.
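
The two significance criteria used in these figures can be sketched as below. The caption does not say which FDR procedure was used; this sketch assumes the Benjamini-Hochberg step-up procedure, and the function name `significance_flags` is hypothetical. It is meant only to show how the 1% FDR and the Bonferroni-corrected 0.05 criteria differ when applied to the same set of p-values.

```python
import numpy as np


def significance_flags(p_values, fdr=0.01, alpha=0.05):
    """Return (fdr_hits, bonferroni_hits) as boolean arrays.

    Assumption: FDR control via Benjamini-Hochberg, which the source
    does not confirm.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    # Bonferroni: compare every p-value with alpha / m.
    bonferroni_hits = p < alpha / m
    # Benjamini-Hochberg: find the largest rank k with p_(k) <= k * fdr / m
    # and reject the k smallest p-values.
    order = np.argsort(p)
    passed = p[order] <= fdr * np.arange(1, m + 1) / m
    fdr_hits = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()
        fdr_hits[order[: k + 1]] = True
    return fdr_hits, bonferroni_hits


# Made-up p-values, for illustration only.
fdr_hits, bonferroni_hits = significance_flags([0.011, 0.012, 0.5, 0.9])
```

In the figures, a pair flagged by the first array would be drawn as a full-sized square and a pair flagged by the second would carry an asterisk.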

Figure C-2: Genetic correlations among smoking traits from the Tobacco and Genetics (TAG) consortium.
The structure of the figure is the same as Figure 5-2 in the main text: blue corresponds to positive genetic
correlations; red corresponds to negative genetic correlation. Larger squares correspond to more significant
p-values. Genetic correlations that are different from zero at 1% FDR are shown as full-sized squares.
Genetic correlations that are significantly different from zero at significance level 0.05 after Bonferroni
correction are given an asterisk.


Figure C-3: Genetic correlations among insulin-related traits from studies by the MAGIC consortium.
The structure of the figure is the same as Figure 5-2 in the main text: blue corresponds to positive genetic
correlations; red corresponds to negative genetic correlation. Larger squares correspond to more significant
p-values. Genetic correlations that are different from zero at 1% FDR are shown as full-sized squares.
Genetic correlations that are significantly different from zero at significance level 0.05 after Bonferroni
correction are given an asterisk.


Figure C-4: This figure compares estimates of genetic correlations among metabolic traits from Table 3 of
Vattikuti et al. [143] to estimates from LD Score regression. The LD Score regression estimates used much
larger sample sizes. Error bars are standard errors.

[Figure: three conditional QQ plot panels, titled SCZ | TG, SCZ | TG (no MHC), and SCZ | TG (LD Score < 200)]

Figure C-5: At left, we reproduced the conditional QQ plot comparing schizophrenia (SCZ) and triglycerides
(TG) from Andreassen et al. [173] using the same data (PGC1 schizophrenia [194] and TG from Teslovich,
et al. [41]). Conditional QQ plots show the distribution of p-values for SCZ conditional on the $-\log_{10}(p)$ for
TG exceeding different thresholds. The thresholds are indicated by color, as described in the legends. Dark
blue corresponds to no threshold, green corresponds to $-\log_{10}(p) > 1$, red corresponds to $-\log_{10}(p) > 2$
and light blue corresponds to $-\log_{10}(p) > 3$. The major histocompatibility complex (MHC, chr6, 25-35
MB) is a genomic region containing SNPs with exceptionally long-range LD and the strongest GWAS
association for schizophrenia [31], as well as an association to TG [41]. If we remove the MHC, the signal of
enrichment in the conditional QQ plot is substantially attenuated (middle); in particular, the red line falls
below the green and blue lines (which correspond to less stringent thresholds for TG). If in addition we
remove SNPs with very high LD Scores ($\ell > 200$, roughly the top 15% of SNPs), the signal of enrichment
is further attenuated. The most likely explanation for the attenuation is that conditional QQ plots will
report pleiotropy if causal SNPs are in LD (even if the causal SNPs for trait 1 are different from the causal
SNPs for trait 2), which is more likely to occur in regions with long-range LD.
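
The conditioning step behind such a plot can be sketched as follows. This is a minimal sketch, not the procedure of Andreassen et al.: the function name `conditional_qq` is hypothetical, the expected quantiles use a simple uniform plotting position, and `None` is used here to stand for the no-threshold curve. No plotting, MHC removal, or LD Score filtering is included.

```python
import numpy as np


def conditional_qq(p_primary, p_conditioning, thresholds=(None, 1, 2, 3)):
    """Return {threshold: (expected, observed)} arrays of -log10(p).

    For each threshold, only SNPs whose conditioning-trait -log10(p)
    exceeds the threshold are kept; None keeps every SNP.
    """
    p_primary = np.asarray(p_primary, dtype=float)
    neglog_cond = -np.log10(np.asarray(p_conditioning, dtype=float))
    curves = {}
    for t in thresholds:
        if t is None:
            keep = np.ones(p_primary.shape, dtype=bool)
        else:
            keep = neglog_cond > t
        n = int(keep.sum())
        if n == 0:
            continue  # no SNPs pass this threshold
        # Observed values sorted from most to least significant.
        observed = np.sort(-np.log10(p_primary[keep]))[::-1]
        # Expected values under a uniform null, in the same order.
        expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
        curves[t] = (expected, observed)
    return curves


# Made-up p-values for four SNPs, for illustration only.
curves = conditional_qq(
    p_primary=[0.5, 0.01, 0.2, 0.001],
    p_conditioning=[0.5, 0.05, 0.005, 0.0005],
)
```

Because the subsets are defined only by the conditioning trait's p-values, any SNP in LD with a strong conditioning-trait signal enters the stricter subsets regardless of whether it is causal for either trait, which is the mechanism the caption describes.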

Appendix D

Bibliography

1 T. J. C. Polderman, B. Benyamin, C. A. de Leeuw, P. F. Sullivan, A. van Bochoven, P. M.
Visscher, and D. Posthuma, "Meta-analysis of the heritability of human traits based on fifty
years of twin studies," Nature Genetics, vol. 47, pp. 702-709, July 2015.
2 P. Visscher, M. Brown, M. McCarthy, and J. Yang, "Five Years of GWAS Discovery," The
American Journal of Human Genetics, vol. 90, pp. 7-24, Jan. 2012.
3 J. Yang, B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt, P. A. Madden,
A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and P. M. Visscher, "Common
SNPs explain a large proportion of the heritability for human height," Nature Genetics, vol. 42,
pp. 565-569, July 2010.
4 S. H. Lee, T. R. DeCandia, S. Ripke, J. Yang, P. F. Sullivan, M. E. Goddard, M. C. Keller,
P. M. Visscher, and N. R. Wray, "Estimating the proportion of variation in susceptibility to
schizophrenia captured by common SNPs," Nature Genetics, vol. 44, pp. 247-250, Feb. 2012.

5 The ENCODE Project Consortium, "An integrated encyclopedia of DNA elements in the human
genome," Nature, vol. 489, pp. 57-74, Sept. 2012.
6 A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour,
Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar,
G. Quon, R. S. Sandstrom, M. L. Eaton, Y.-C. Wu, A. Pfenning, X. Wang, M. Claussnitzer, Yaping
Liu, C. Coarfa, R. Alan Harris, N. Shoresh, C. B. Epstein, E. Gjoneska, D. Leung, W. Xie,
R. David Hawkins, R. Lister, C. Hong, P. Gascard, A. J. Mungall, R. Moore, E. Chuah, A. Tam,
T. K. Canfield, R. Scott Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, J. R. Dixon,
K.-H. Farh, S. Feizi, R. Karlic, A.-R. Kim, A. Kulkarni, D. Li, R. Lowdon, G. Elliott, T. R.
Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal, P. Ray, R. C. Sallari, K. T. Siebenthall,
N. A. Sinnott-Armstrong, M. Stevens, R. E. Thurman, J. Wu, B. Zhang, X. Zhou, N. Abdennur,
M. Adli, M. Akerman, L. Barrera, J. Antosiewicz-Bourget, T. Ballinger, M. J. Barnes, D. Bates,
R. J. A. Bell, D. A. Bennett, K. Bianco, C. Bock, P. Boyle, J. Brinchmann, P. Caballero-
Campo, R. Camahort, M. J. Carrasco-Alfonso, T. Charnecki, H. Chen, Z. Chen, J. B. Cheng,
S. Cho, A. Chu, W.-Y. Chung, C. Cowan, Q. Athena Deng, V. Deshpande, M. Diegel, B. Ding,
T. Durham, L. Echipare, L. Edsall, D. Flowers, O. Genbacev-Krtolica, C. Gifford, S. Gillespie,
E. Giste, I. A. Glass, A. Gnirke, M. Gormley, H. Gu, J. Gu, D. A. Hafler, M. J. Hangauer,
M. Hariharan, M. Hatan, E. Haugen, Y. He, S. Heimfeld, S. Herlofsen, Z. Hou, R. Humbert,
R. Issner, A. R. Jackson, H. Jia, P. Jiang, A. K. Johnson, T. Kadlecek, B. Kamoh, M. Kapidzic,
J. Kent, A. Kim, M. Kleinewietfeld, S. Klugman, J. Krishnan, S. Kuan, T. Kutyavin, A.-Y. Lee,
K. Lee, J. Li, N. Li, Y. Li, K. L. Ligon, S. Lin, Y. Lin, J. Liu, Y. Liu, C. J. Luckey, Y. P. Ma,
C. Maire, A. Marson, J. S. Mattick, M. Mayo, M. McMaster, H. Metsky, T. Mikkelsen, D. Miller,
M. Miri, E. Mukame, R. P. Nagarajan, F. Neri, J. Nery, T. Nguyen, H. O'Geen, S. Paithankar,
T. Papayannopoulou, M. Pelizzola, P. Plettner, N. E. Propson, S. Raghuraman, B. J. Raney,
A. Raubitschek, A. P. Reynolds, H. Richards, K. Riehle, P. Rinaudo, J. F. Robinson, N. B. Rock-
weiler, E. Rosen, E. Rynes, J. Schein, R. Sears, T. Sejnowski, A. Shafer, L. Shen, R. Shoemaker,
M. Sigaroudinia, I. Slukvin, S. Stehling-Sun, R. Stewart, S. L. Subramanian, K. Suknuntha,
S. Swanson, S. Tian, H. Tilden, L. Tsai, M. Urich, I. Vaughn, J. Vierstra, S. Vong, U. Wagner,
H. Wang, T. Wang, Y. Wang, A. Weiss, H. Whitton, A. Wildberg, H. Witt, K.-J. Won, M. Xie,
X. Xing, I. Xu, Z. Xuan, Z. Ye, C.-a. Yen, P. Yu, X. Zhang, X. Zhang, J. Zhao, Y. Zhou, J. Zhu,
Y. Zhu, S. Ziegler, A. E. Beaudet, L. A. Boyer, P. L. De Jager, P. J. Farnham, S. J. Fisher,
D. Haussler, S. J. M. Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. Thom-
son, T. D. Tlsty, L.-H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick, B. E.
Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren, J. A.
Stamatoyannopoulos, T. Wang, M. Kellis, A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky,
A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W.
Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar, G. Quon, R. S. Sandstrom, M. L. Eaton, Y.-C.
Wu, A. R. Pfenning, X. Wang, M. Claussnitzer, Y. Liu, C. Coarfa, R. A. Harris, N. Shoresh,
C. B. Epstein, E. Gjoneska, D. Leung, W. Xie, R. D. Hawkins, R. Lister, C. Hong, P. Gascard,
A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield, R. S. Hansen, R. Kaul, P. J. Sabo,
M. S. Bansal, A. Carles, J. R. Dixon, K.-H. Farh, S. Feizi, R. Karlic, A.-R. Kim, A. Kulkarni,
D. Li, R. Lowdon, G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal,
P. Ray, R. C. Sallari, K. T. Siebenthall, N. A. Sinnott-Armstrong, M. Stevens, R. E. Thurman,
J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A. Boyer, P. L. De Jager, P. J. Farnham, S. J.
Fisher, D. Haussler, S. J. M. Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A.
Thomson, T. D. Tlsty, L.-H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick,
B. E. Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren,
J. A. Stamatoyannopoulos, T. Wang, and M. Kellis, "Integrative analysis of 111 reference human
epigenomes," Nature, vol. 518, pp. 317-330, Feb. 2015.
7 H. K. Finucane, B. Bulik-Sullivan, A. Gusev, G. Trynka, Y. Reshef, P.-R. Loh, V. Anttila,
H. Xu, C. Zang, K. Farh, S. Ripke, F. R. Day, S. Purcell, E. Stahl, S. Lindstrom, J. R. B. Perry,
Y. Okada, S. Raychaudhuri, M. J. Daly, N. Patterson, B. M. Neale, and A. L. Price, "Partitioning
heritability by functional annotation using genome-wide association summary statistics," Nature
Genetics, vol. 47, pp. 1228-1235, Sept. 2015.
8 H. Finucane, Y. Reshef, V. Anttila, K. Slowikowski, A. Gusev, A. Byrnes, S. Gazal, P.-R.
Loh, G. Genovese, A. Saunders, E. Macosko, S. Pollack, T. B. Consortium, J. R. B. Perry,
S. Raychaudhuri, S. McCarroll, B. Neale, and A. Price, "Heritability enrichment of specifically
expressed genes identifies disease-relevant tissues and cell types," bioRxiv, p. 103069, Jan. 2017.
9 B. Bulik-Sullivan, H. K. Finucane, V. Anttila, A. Gusev, F. R. Day, P.-R. Loh, ReproGen
Consortium, Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa of
the Wellcome Trust Case Control Consortium 3, L. Duncan, J. R. B. Perry, N. Patterson, E. B.
Robinson, M. J. Daly, A. L. Price, and B. M. Neale, "An atlas of genetic correlations across
human diseases and traits," Nature Genetics, vol. 47, pp. 1236-1241, Nov. 2015.
10 J. Yang, T. A. Manolio, L. R. Pasquale, E. Boerwinkle, N. Caporaso, J. M. Cunningham,
M. de Andrade, B. Feenstra, E. Feingold, M. G. Hayes, W. G. Hill, M. T. Landi, A. Alonso,
G. Lettre, P. Lin, H. Ling, W. Lowe, R. A. Mathias, M. Melbye, E. Pugh, M. C. Cornelis, B. S.
Weir, M. E. Goddard, and P. M. Visscher, "Genome partitioning of genetic variation for complex
traits using common SNPs," Nature Genetics, vol. 43, pp. 519-525, June 2011.
11 A. Gusev, S. H. Lee, G. Trynka, H. Finucane, B. J. Vilhjálmsson, H. Xu, C. Zang, S. Ripke,
B. Bulik-Sullivan, E. Stahl, Schizophrenia Working Group of the Psychiatric Genomics Consortium,
SWE-SCZ Consortium, A. K. Kähler, C. M. Hultman, S. M. Purcell, S. A. McCarroll,
M. Daly, B. Pasaniuc, P. F. Sullivan, B. M. Neale, N. R. Wray, S. Raychaudhuri, A. L. Price,
Schizophrenia Working Group of the Psychiatric Genomics Consortium, and SWE-SCZ Consor-
tium, "Partitioning heritability of regulatory and cell-type-specific variants across 11 common
diseases," American Journal of Human Genetics, vol. 95, pp. 535-552, Nov. 2014.
12 J. Yang, S. H. Lee, M. E. Goddard, and P. M. Visscher, "GCTA: A Tool for Genome-wide
Complex Trait Analysis," The American Journal of Human Genetics, vol. 88, pp. 76-82, Jan.
2011.

13 B. K. Bulik-Sullivan, P.-R. Loh, H. K. Finucane, S. Ripke, J. Yang, Schizophrenia Working
Group of the Psychiatric Genomics Consortium, N. Patterson, M. J. Daly, A. L. Price, and
B. M. Neale, "LD Score regression distinguishes confounding from polygenicity in genome-wide
association studies," Nature Genetics, vol. 47, pp. 291-295, Mar. 2015.
14 D. M. Altshuler (Co-Chair), R. M. Durbin (Co-Chair), G. R. Abecasis, D. R. Bentley,
A. Chakravarti, A. G. Clark, P. Donnelly, E. E. Eichler, P. Flicek, S. B. Gabriel, R. A. Gibbs,
E. D. Green, M. E. Hurles, B. M. Knoppers, J. O. Korbel, E. S. Lander, C. Lee, H. Lehrach, E. R.
Mardis, G. T. Marth, G. A. McVean, D. A. Nickerson, J. P. Schmidt, S. T. Sherry, J. Wang, R. K.
Wilson, R. A. Gibbs (Principal Investigator), H. Dinh, C. Kovar, S. Lee, L. Lewis, D. Muzny,
J. Reid, M. Wang, J. Wang (Principal Investigator), X. Fang, X. Guo, M. Jian, H. Jiang,
X. Jin, G. Li, J. Li, Y. Li, Z. Li, X. Liu, Y. Lu, X. Ma, Z. Su, S. Tai, M. Tang, B. Wang,
G. Wang, H. Wu, R. Wu, Y. Yin, W. Zhang, J. Zhao, M. Zhao, X. Zheng, Y. Zhou, E. S.
Lander (Principal Investigator), D. M. Altshuler, S. B. Gabriel (Co-Chair), N. Gupta, P. Flicek
(Principal Investigator), L. Clarke, R. Leinonen, R. E. Smith, X. Zheng-Bradley, D. R. Bentley
(Principal Investigator), R. Grocock, S. Humphray, T. James, Z. Kingsbury, H. Lehrach (Prin-
cipal Investigator), R. Sudbrak (Project Leader), M. W. Albrecht, V. S. Amstislavskiy, T. A.
Borodina, M. Lienhard, F. Mertes, M. Sultan, B. Timmermann, M.-L. Yaspo, S. T. Sherry
(Principal Investigator), G. A. McVean (Principal Investigator), E. R. Mardis (Co-Principal In-
vestigator) (Co-Chair), R. K. Wilson (Co-Principal Investigator), L. Fulton, R. Fulton, G. M.
Weinstock, R. M. Durbin (Principal Investigator), S. Balasubramaniam, J. Burton, P. Danecek,
T. M. Keane, A. Kolb-Kokocinski, S. McCarthy, J. Stalker, M. Quail, J. P. Schmidt (Princi-
pal Investigator), C. J. Davies, J. Gollub, T. Webster, B. Wong, Y. Zhan, A. Auton (Princi-
pal Investigator), R. A. Gibbs (Principal Investigator), F. Yu (Project Leader), M. Bainbridge,
D. Challis, U. S. Evani, J. Lu, D. Muzny, U. Nagaswamy, J. Reid, A. Sabo, Y. Wang, J. Yu,
J. Wang (Principal Investigator), L. J. M. Coin, L. Fang, X. Guo, X. Jin, G. Li, Q. Li, Y. Li,
Z. Li, H. Lin, B. Liu, R. Luo, N. Qin, H. Shao, B. Wang, Y. Xie, C. Ye, C. Yu, F. Zhang,
H. Zheng, H. Zhu, G. T. Marth (Principal Investigator), E. P. Garrison, D. Kural, W.-P. Lee,
W. Fung Leong, A. N. Ward, J. Wu, M. Zhang, C. Lee (Principal Investigator), L. Griffin, C.-H.
Hsieh, R. E. Mills, X. Shi, M. von Grotthuss, C. Zhang, M. J. Daly (Principal Investigator),
M. A. DePristo (Project Leader), D. M. Altshuler, E. Banks, G. Bhatia, M. O. Carneiro, G. del
Angel, S. B. Gabriel, G. Genovese, N. Gupta, R. E. Handsaker, C. Hartl, E. S. Lander, S. A.
McCarroll, J. C. Nemesh, R. E. Poplin, S. F. Schaffner, K. Shakir, S. C. Yoon (Principal Investigator),
J. Lihm, V. Makarov, H. Jin (Principal Investigator), W. Kim, K. Cheol Kim, J. O.
Korbel (Principal Investigator), T. Rausch, P. Flicek (Principal Investigator), K. Beal, L. Clarke,
F. Cunningham, J. Herrero, W. M. McLaren, G. R. S. Ritchie, R. E. Smith, X. Zheng-Bradley,
A. G. Clark (Principal Investigator), S. Gottipati, A. Keinan, J. L. Rodriguez-Flores, P. C.
Sabeti (Principal Investigator), S. R. Grossman, S. Tabrizi, R. Tariyal, D. N. Cooper (Princi-
pal Investigator), E. V. Ball, P. D. Stenson, D. R. Bentley (Principal Investigator), B. Barnes,
M. Bauer, R. Keira Cheetham, T. Cox, M. Eberle, S. Humphray, S. Kahn, L. Murray, J. Peden,
R. Shaw, K. Ye (Principal Investigator), M. A. Batzer (Principal Investigator), M. K. Konkel,
J. A. Walker, D. G. MacArthur (Principal Investigator), M. Lek, Sudbrak (Project Leader),
V. S. Amstislavskiy, R. Herwig, M. D. Shriver (Principal Investigator), C. D. Bustamante (Prin-
cipal Investigator), J. K. Byrnes, F. M. De La Vega, S. Gravel, E. E. Kenny, J. M. Kidd,
P. Lacroute, B. K. Maples, A. Moreno-Estrada, F. Zakharia, E. Halperin (Principal Investiga-
tor), Y. Baran, D. W. Craig (Principal Investigator), A. Christoforides, N. Homer, T. Izatt,
A. A. Kurdoglu, S. A. Sinari, K. Squire, S. T. Sherry (Principal Investigator), C. Xiao, J. Se-
bat (Principal Investigator), V. Bafna, K. Ye, E. G. Burchard (Principal Investigator), R. D.
Hernandez (Principal Investigator), C. R. Gignoux, D. Haussler (Principal Investigator), S. J.
Katzman, W. James Kent, B. Howie, A. Ruiz-Linares (Principal Investigator), E. T. Dermitzakis
(Principal Investigator), T. Lappalainen, S. E. Devine (Principal Investigator), X. Liu, A. Ma-
roo, L. J. Tallon, J. A. Rosenfeld (Principal Investigator), L. P. Michelson, G. R. Abecasis
(Principal Investigator) (Co-Chair), H. Min Kang (Project Leader), P. Anderson, A. Angius,
A. Bigham, T. Blackwell, F. Busonero, F. Cucca, C. Fuchsberger, C. Jones, G. Jun, Y. Li,
R. Lyons, A. Maschio, E. Porcu, F. Reinier, S. Sanna, D. Schlessinger, C. Sidore, A. Tan,
M. Kate Trost, P. Awadalla (Principal Investigator), A. Hodgkinson, G. Lunter (Principal Investigator),
G. A. McVean (Principal Investigator) (Co-Chair), J. L. Marchini (Principal Investigator),
S. Myers (Principal Investigator), C. Churchhouse, O. Delaneau, A. Gupta-Hinch, Z. Iqbal,
I. Mathieson, A. Rimmer, D. K. Xifara, T. K. Oleksyk (Principal Investigator), Y. Fu (Princi-
pal Investigator), X. Liu, M. Xiong, L. Jorde (Principal Investigator), D. Witherspoon, J. Xing,
E. E. Eichler (Principal Investigator), B. L. Browning (Principal Investigator), C. Alkan, I. Ha-
jirasouliha, F. Hormozdiari, A. Ko, P. H. Sudmant, E. R. Mardis (Co-Principal Investigator),
K. Chen, A. Chinwalla, L. Ding, D. Dooling, D. C. Koboldt, M. D. McLellan, J. W. Wallis,
M. C. Wendl, Q. Zhang, R. M. Durbin (Principal Investigator), M. E. Hurles (Principal In-
vestigator), C. A. Albers, Q. Ayub, S. Balasubramaniam, Y. Chen, A. J. Coffey, V. Colonna,
P. Danecek, N. Huang, L. Jostins, T. M. Keane, H. Li, S. McCarthy, A. Scally, J. Stalker, K. Wal-
ter, Y. Xue, Y. Zhang, M. B. Gerstein (Principal Investigator), A. Abyzov, S. Balasubramanian,
J. Chen, D. Clarke, Y. Fu, L. Habegger, A. O. Harmanci, M. Jin, E. Khurana, X. Jasmine Mu,
C. Sisu, Y. Li, R. Luo, H. Zhu, C. Lee (Principal Investigator) (Co-Chair), L. Griffin, C.-H.
Hsieh, R. E. Mills, X. Shi, M. von Grotthuss, C. Zhang, G. T. Marth (Principal Investiga-
tor), E. P. Garrison, D. Kural, W.-P. Lee, A. N. Ward, J. Wu, M. Zhang, S. A. McCarroll
(Project Leader), D. M. Altshuler, E. Banks, G. del Angel, G. Genovese, R. E. Handsaker,
C. Hartl, J. C. Nemesh, K. Shakir, S. C. Yoon (Principal Investigator), J. Lihm, V. Makarov,
J. Degenhardt, P. Flicek (Principal Investigator), L. Clarke, R. E. Smith, X. Zheng-Bradley,
J. O. Korbel (Principal Investigator) (Co-Chair), T. Rausch, A. M. Stütz, D. R. Bentley (Principal
Investigator), B. Barnes, R. Keira Cheetham, M. Eberle, S. Humphray, S. Kahn, L. Murray,
R. Shaw, K. Ye (Principal Investigator), M. A. Batzer (Principal Investigator), M. K. Konkel,
J. A. Walker, P. Lacroute, D. W. Craig (Principal Investigator), N. Homer, D. Church, C. Xiao,
J. Sebat (Principal Investigator), V. Bafna, J. J. Michaelson, K. Ye, S. E. Devine (Principal In-
vestigator), X. Liu, A. Maroo, L. J. Tallon, G. Lunter (Principal Investigator), G. A. McVean
(Principal Investigator), Z. Iqbal, D. Witherspoon, J. Xing, E. E. Eichler (Principal Investiga-
tor) (Co-Chair), C. Alkan, I. Hajirasouliha, F. Hormozdiari, A. Ko, P. H. Sudmant, K. Chen,
A. Chinwalla, L. Ding, M. D. McLellan, J. W. Wallis, M. E. Hurles (Principal Investigator)
(Co-Chair), B. Blackburne, H. Li, S. J. Lindsay, Z. Ning, A. Scally, K. Walter, Y. Zhang, M. B.
Gerstein (Principal Investigator), A. Abyzov, J. Chen, D. Clarke, E. Khurana, X. Jasmine Mu,
C. Sisu, R. A. Gibbs (Principal Investigator) (Co-Chair), F. Yu (Project Leader), M. Bainbridge,
D. Challis, U. S. Evani, C. Kovar, L. Lewis, J. Lu, D. Muzny, U. Nagaswamy, J. Reid, A. Sabo,
J. Yu, X. Guo, Y. Li, R. Wu, G. T. Marth (Principal Investigator) (Co-Chair), E. P. Garrison,
W. Fung Leong, A. N. Ward, G. del Angel, M. A. DePristo, S. B. Gabriel, N. Gupta, C. Hartl,
R. E. Poplin, A. G. Clark (Principal Investigator), J. L. Rodriguez-Flores, P. Flicek (Principal In-
vestigator), L. Clarke, R. E. Smith, X. Zheng-Bradley, D. G. MacArthur (Principal Investiga-
tor), C. D. Bustamante (Principal Investigator), S. Gravel, D. W. Craig (Principal Investigator),
A. Christoforides, N. Homer, T. Izatt, S. T. Sherry (Principal Investigator), C. Xiao, E. T. Der-
mitzakis (Principal Investigator), G. R. Abecasis (Principal Investigator), H. Min Kang, G. A.
McVean (Principal Investigator), E. R. Mardis (Principal Investigator), D. Dooling, L. Fulton,
R. Fulton, D. C. Koboldt, R. M. Durbin (Principal Investigator), S. Balasubramaniam, T. M.
Keane, S. McCarthy, J. Stalker, M. B. Gerstein (Principal Investigator), S. Balasubramanian,
L. Habegger, E. P. Garrison, R. A. Gibbs (Principal Investigator), M. Bainbridge, D. Muzny,
F. Yu, J. Yu, G. del Angel, R. E. Handsaker, V. Makarov, J. L. Rodriguez-Flores, H. Jin (Principal
Investigator), W. Kim, K. Cheol Kim, P. Flicek (Principal Investigator), K. Beal, L. Clarke,
F. Cunningham, J. Herrero, W. M. McLaren, G. R. S. Ritchie, X. Zheng-Bradley, S. Tabrizi,
D. G. MacArthur (Principal Investigator), M. Lek, C. D. Bustamante (Principal Investigator),
F. M. De La Vega, D. W. Craig (Principal Investigator), A. A. Kurdoglu, T. Lappalainen,
J. A. Rosenfeld (Principal Investigator), L. P. Michelson, P. Awadalla (Principal Investigator),
A. Hodgkinson, G. A. McVean (Principal Investigator), K. Chen, Y. Chen, V. Colonna, A. Frankish,
J. Harrow, Y. Xue, M. B. Gerstein (Principal Investigator) (Co-Chair), A. Abyzov, S. Balasubramanian,
J. Chen, D. Clarke, Y. Fu, A. O. Harmanci, M. Jin, E. Khurana, X. Jasmine Mu,
C. Sisu, R. A. Gibbs (Principal Investigator), G. Fowler, W. Hale, D. Kalra, C. Kovar, D. Muzny,
J. Reid, J. Wang (Principal Investigator), X. Guo, G. Li, Y. Li, X. Zheng, D. M. Altshuler,
P. Flicek (Principal Investigator) (Co-Chair), L. Clarke (Project Leader), J. Barker, G. Kelman,
E. Kulesha, R. Leinonen, W. M. McLaren, R. Radhakrishnan, A. Roa, D. Smirnov, R. E. Smith,
I. Streeter, I. Toneva, B. Vaughan, X. Zheng-Bradley, D. R. Bentley (Principal Investigator),
T. Cox, S. Humphray, S. Kahn, R. Sudbrak (Project Leader), M. W. Albrecht, M. Lienhard,
D. W. Craig (Principal Investigator), T. Izatt, A. A. Kurdoglu, S. T. Sherry (Principal Investigator)
(Co-Chair), V. Ananiev, Z. Belaia, D. Beloslyudtsev, N. Bouk, C. Chen, D. Church, R. Cohen,
C. Cook, J. Garner, T. Hefferon, M. Kimelman, C. Liu, J. Lopez, P. Meric, C. O'Sullivan,
Y. Ostapchuk, L. Phan, S. Ponomarov, V. Schneider, E. Shekhtman, K. Sirotkin, D. Slotta,
C. Xiao, H. Zhang, D. Haussler (Principal Investigator), G. R. Abecasis (Principal Investigator),
G. A. McVean (Principal Investigator), C. Alkan, A. Ko, D. Dooling, R. M. Durbin (Princi-
pal Investigator), S. Balasubramaniam, T. M. Keane, S. McCarthy, J. Stalker, A. Chakravarti
(Co-Chair), B. M. Knoppers (Co-Chair), G. R. Abecasis, K. C. Barnes, C. Beiswanger, E. G.
Burchard, C. D. Bustamante, H. Cai, H. Cao, R. M. Durbin, N. Gharani, R. A. Gibbs, C. R.
Gignoux, S. Gravel, B. Henn, D. Jones, L. Jorde, J. S. Kaye, A. Keinan, A. Kent, A. Kerasidou,
Y. Li, R. Mathias, G. A. McVean, A. Moreno-Estrada, P. N. Ossorio, M. Parker, D. Reich, C. N.
Rotimi, C. D. Royal, K. Sandoval, Y. Su, R. Sudbrak, Z. Tian, B. Timmermann, S. Tishkoff,
L. H. Toji, C. Tyler Smith, M. Via, Y. Wang, H. Yang, L. Yang, J. Zhu, W. Bodmer, G. Bedoya,
A. Ruiz-Linares, C. Zhi Ming, G. Yang, C. Jia You, L. Peltonen, A. Garcia-Montero, A. Orfao,
J. Dutil, J. C. Martinez-Cruzado, T. K. Oleksyk, L. D. Brooks, A. L. Felsenfeld, J. E. McEwen,
N. C. Clemm, A. Duncanson, M. Dunn, E. D. Green, M. S. Guyer, J. L. Peterson, G. R. Abeca-
sis, A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. Min Kang,
G. T. Marth, and G. A. McVean, "An integrated map of genetic variation from 1,092 human
genomes," Nature, vol. 491, pp. 56-65, Oct. 2012.

15 J. Baik, G. B. Arous, and S. Péché, "Phase transition of the largest eigenvalue for nonnull
complex sample covariance matrices," The Annals of Probability, vol. 33, pp. 1643-1697, Sept.
2005.

16 D. Paul, "Asymptotics of sample eigenstructure for a large dimensional spiked covariance
model," Statistica Sinica, vol. 17, no. 4, pp. 1617-1642,
2007.

17 X. Mestre, "On the Asymptotic Behavior of the Sample Estimates of Eigenvalues and Eigenvectors
of Covariance Matrices," IEEE Transactions on Signal Processing, vol. 56, pp. 5353-5368,
Nov. 2008.

18 F. Benaych-Georges and R. R. Nadakuditi, "The eigenvalues and eigenvectors of finite, low rank
perturbations of large random matrices," Advances in Mathematics, vol. 227, pp. 494-521, May
2011.
19 A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin, "On the principal components of sample
covariance matrices," Probability Theory and Related Fields, vol. 164, pp. 459-552, Feb. 2016.
20 N. Patterson, A. L. Price, and D. Reich, "Population Structure and Eigenanalysis," PLoS Ge-
netics, vol. 2, no. 12, p. e190, 2006.
21 J. Yang, A. Bakshi, Z. Zhu, G. Hemani, A. A. E. Vinkhuyzen, S. H. Lee, M. R. Robinson,
J. R. B. Perry, I. M. Nolte, J. V. van Vliet-Ostaptchouk, H. Snieder, The LifeLines Cohort
Study, T. Esko, L. Milani, R. Mägi, A. Metspalu, A. Hamsten, P. K. E. Magnusson, N. L.
Pedersen, E. Ingelsson, N. Soranzo, M. C. Keller, N. R. Wray, M. E. Goddard, and P. M.
Visscher, "Genetic variance estimation with imputed variants finds negligible missing heritability
for human height and body mass index," Nature Genetics, vol. 47, pp. 1114-1120, Oct. 2015.
22 E. A. Stahl, D. Wegmann, G. Trynka, J. Gutierrez-Achury, R. Do, B. F. Voight, P. Kraft,
R. Chen, H. J. Kallberg, F. A. S. Kurreeman, D. G. R. a. M.-a. Consortium, M. I. G. Consortium,
S. Kathiresan, C. Wijmenga, P. K. Gregersen, L. Alfredsson, K. A. Siminovitch, J. Worthington,
P. I. W. d. Bakker, S. Raychaudhuri, and R. M. Plenge, "Bayesian inference analyses of the
polygenic architecture of rheumatoid arthritis," Nature Genetics, vol. 44, pp. 483-489, May
2012.
23 G. Trynka, C. Sandor, B. Han, H. Xu, B. E. Stranger, X. S. Liu, and S. Raychaudhuri, "Chromatin
marks identify critical cell types for fine mapping complex trait variants," Nature Genetics,
vol. 45, pp. 124-130, Feb. 2013.
24 K. K.-H. Farh, A. Marson, J. Zhu, M. Kleinewietfeld, W. J. Housley, S. Beik, N. Shoresh,
H. Whitton, R. J. H. Ryan, A. A. Shishkin, M. Hatan, M. J. Carrasco-Alfonso, D. Mayer, C. J.
Luckey, N. A. Patsopoulos, P. L. De Jager, V. K. Kuchroo, C. B. Epstein, M. J. Daly, D. A.
Hafler, and B. E. Bernstein, "Genetic and epigenetic fine mapping of causal autoimmune disease
variants," Nature, vol. 518, pp. 337-343, Oct. 2014.
25 G. Kichaev, W.-Y. Yang, S. Lindstrom, F. Hormozdiari, E. Eskin, A. L. Price, P. Kraft, and
B. Pasaniuc, "Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-
Mapping Studies," PLoS Genetics, vol. 10, p. e1004722, Oct. 2014.
26 J. Pickrell, "Joint Analysis of Functional Genomic Data and Genome-wide Association Studies of
18 Human Traits," The American Journal of Human Genetics, vol. 94, pp. 559-573, Apr. 2014.

27 M. T. Maurano, R. Humbert, E. Rynes, R. E. Thurman, E. Haugen, H. Wang, A. P. Reynolds,
R. Sandstrom, H. Qu, J. Brody, A. Shafer, F. Neri, K. Lee, T. Kutyavin, S. Stehling-Sun,
A. K. Johnson, T. K. Canfield, E. Giste, M. Diegel, D. Bates, R. S. Hansen, S. Neph, P. J.
Sabo, S. Heimfeld, A. Raubitschek, S. Ziegler, C. Cotsapas, N. Sotoodehnia, I. Glass, S. R.
Sunyaev, R. Kaul, and J. A. Stamatoyannopoulos, "Systematic Localization of Common Disease-
Associated Variation in Regulatory DNA," Science, vol. 337, pp. 1190-1195, Sept. 2012.
28 L. K. Davis, D. Yu, C. L. Keenan, E. R. Gamazon, A. I. Konkashbaev, E. M. Derks, B. M.
Neale, J. Yang, S. H. Lee, P. Evans, C. L. Barr, L. Bellodi, F. Benarroch, G. B. Berrio, O. J.
Bienvenu, M. H. Bloch, R. M. Blom, R. D. Bruun, C. L. Budman, B. Camarena, D. Camp-
bell, C. Cappi, J. C. C. Silgado, D. C. Cath, M. C. Cavallini, D. A. Chavira, S. Chouinard,
D. V. Conti, E. H. Cook, V. Coric, B. A. Cullen, D. Deforce, R. Delorme, Y. Dion, C. K.
Edlund, K. Egberts, P. Falkai, T. V. Fernandez, P. J. Gallagher, H. Garrido, D. Geller, S. L.
Girard, H. J. Grabe, M. A. Grados, B. D. Greenberg, V. Gross-Tsur, S. Haddad, G. A. Heiman,
S. M. J. Hemmings, A. G. Hounie, C. Illmann, J. Jankovic, M. A. Jenike, J. L. Kennedy, R. A.
King, B. Kremeyer, R. Kurlan, N. Lanzagorta, M. Leboyer, J. F. Leckman, L. Lennertz, C. Liu,
C. Lochner, T. L. Lowe, F. Macciardi, J. T. McCracken, L. M. McGrath, S. C. M. Restrepo,
R. Moessner, J. Morgan, H. Muller, D. L. Murphy, A. L. Naarden, W. C. Ochoa, R. A. Ophoff,
L. Osiecki, A. J. Pakstis, M. T. Pato, C. N. Pato, J. Piacentini, C. Pittenger, Y. Pollak, S. L.
Rauch, T. J. Renner, V. I. Reus, M. A. Richter, M. A. Riddle, M. M. Robertson, R. Romero,
M. C. Rosário, D. Rosenberg, G. A. Rouleau, S. Ruhrmann, A. Ruiz-Linares, A. S. Sampaio,
J. Samuels, P. Sandor, B. Sheppard, H. S. Singer, J. H. Smit, D. J. Stein, E. Strengman, J. A. Tis-
chfield, A. V. V. Duarte, H. Vallada, F. V. Nieuwerburgh, J. Veenstra-VanderWeele, S. Walitza,
Y. Wang, J. R. Wendland, H. G. M. Westenberg, Y. Y. Shugart, E. C. Miguel, W. McMahon,
M. Wagner, H. Nicolini, D. Posthuma, G. L. Hanna, P. Heutink, D. Denys, P. D. Arnold, B. A.
Oostra, G. Nestadt, N. B. Freimer, D. L. Pauls, N. R. Wray, S. E. Stewart, C. A. Mathews,
J. A. Knowles, N. J. Cox, and J. M. Scharf, "Partitioning the Heritability of Tourette Syndrome
and Obsessive Compulsive Disorder Reveals Differences in Genetic Architecture," PLOS Genet,
vol. 9, p. e1003864, Oct. 2013.
29 J. Yang, M. N. Weedon, S. Purcell, G. Lettre, K. Estrada, C. J. Willer, A. V. Smith, E. Ingelsson,
J. R. O'Connell, M. Mangino, R. Mägi, P. A. Madden, A. C. Heath, D. R. Nyholt, N. G. Martin,
G. W. Montgomery, T. M. Frayling, J. N. Hirschhorn, M. I. McCarthy, M. E. Goddard, and P. M.
Visscher, "Genomic inflation factors under polygenic inheritance," European Journal of Human
Genetics, vol. 19, pp. 807-812, July 2011.
30
"UCSC Genome Browser Home."
31
S. Ripke, B. M. Neale, A. Corvin, J. T. R. Walters, K.-H. Farh, P. A. Holmans, P. Lee, B. Bulik-
Sullivan, D. A. Collier, H. Huang, T. H. Pers, I. Agartz, E. Agerbo, M. Albus, M. Alexander,
F. Amin, S. A. Bacanu, M. Begemann, R. A. Belliveau Jr, J. Bene, S. E. Bergen, E. Bevilac-
qua, T. B. Bigdeli, D. W. Black, R. Bruggeman, N. G. Buccola, R. L. Buckner, W. Byerley,
W. Cahn, G. Cai, D. Campion, R. M. Cantor, V. J. Carr, N. Carrera, S. V. Catts, K. D.
Chambert, R. C. K. Chan, R. Y. L. Chen, E. Y. H. Chen, W. Cheng, E. F. C. Cheung,

S. Ann Chong, C. Robert Cloninger, D. Cohen, N. Cohen, P. Cormican, N. Craddock, J. J.
Crowley, D. Curtis, M. Davidson, K. L. Davis, F. Degenhardt, J. Del Favero, D. Demontis,
D. Dikeos, T. Dinan, S. Djurovic, G. Donohoe, E. Drapeau, J. Duan, F. Dudbridge, N. Dur-
mishi, P. Eichhammer, J. Eriksson, V. Escott-Price, L. Essioux, A. H. Fanous, M. S. Farrell,
J. Frank, L. Franke, R. Freedman, N. B. Freimer, M. Friedl, J. I. Friedman, M. Fromer, G. Genovese,
L. Georgieva, I. Giegling, P. Giusti-Rodríguez, S. Godard, J. I. Goldstein, V. Golimbet,
S. Gopal, J. Gratten, L. de Haan, C. Hammer, M. L. Hamshere, M. Hansen, T. Hansen,
V. Haroutunian, A. M. Hartmann, F. A. Henskens, S. Herms, J. N. Hirschhorn, P. Hoffmann,
A. Hofman, M. V. Hollegaard, D. M. Hougaard, M. Ikeda, I. Joa, A. Julià, R. S. Kahn, L. Kalaydjieva,
S. Karachanak-Yankova, J. Karjalainen, D. Kavanagh, M. C. Keller, J. L. Kennedy,
A. Khrunin, Y. Kim, J. Klovins, J. A. Knowles, B. Konte, V. Kucinskas, Z. Ausrele Kucinskiene,
H. Kuzelova-Ptackova, A. K. Kähler, C. Laurent, J. Lee Chee Keong, S. Hong Lee,
S. E. Legge, B. Lerer, M. Li, T. Li, K.-Y. Liang, J. Lieberman, S. Limborska, C. M. Loughland,
J. Lubinski, J. Lönnqvist, M. Macek Jr, P. K. E. Magnusson, B. S. Maher, W. Maier, J. Mallet,
S. Marsal, M. Mattheisen, M. Mattingsdal, R. W. McCarley, C. McDonald, A. M. McIntosh,
S. Meier, C. J. Meijer, B. Melegh, I. Melle, R. I. Mesholam-Gately, A. Metspalu, P. T. Michie,
L. Milani, V. Milanova, Y. Mokrab, D. W. Morris, O. Mors, K. C. Murphy, R. M. Murray,
I. Myin-Germeys, B. Müller-Myhsok, M. Nelis, I. Nenadic, D. A. Nertney, G. Nestadt, K. K.
Nicodemus, L. Nikitina-Zake, L. Nisenbaum, A. Nordin, E. O'Callaghan, C. O'Dushlaine,
F. A. O'Neill, S.-Y. Oh, A. Olincy, L. Olsen, J. Van Os, Psychosis Endophenotypes International
Consortium, C. Pantelis, G. N. Papadimitriou, S. Papiol, E. Parkhomenko, M. T. Pato,
T. Paunio, M. Pejovic-Milovancevic, D. O. Perkins, O. Pietiläinen, J. Pimm, A. J. Pocklington,
J. Powell, A. Price, A. E. Pulver, S. M. Purcell, D. Quested, H. B. Rasmussen, A. Reichenberg,
M. A. Reimers, A. L. Richards, J. L. Roffman, P. Roussos, D. M. Ruderfer, V. Salomaa, A. R.
Sanders, U. Schall, C. R. Schubert, T. G. Schulze, S. G. Schwab, E. M. Scolnick, R. J. Scott,
L. J. Seidman, J. Shi, E. Sigurdsson, T. Silagadze, J. M. Silverman, K. Sim, P. Slominsky, J. W.
Smoller, H.-C. So, C. A. Spencer, E. A. Stahl, H. Stefansson, S. Steinberg, E. Stogmann, R. E.
Straub, E. Strengman, J. Strohmaier, T. Scott Stroup, M. Subramaniam, J. Suvisaari, D. M.
Svrakic, J. P. Szatkiewicz, E. Söderman, S. Thirumalai, D. Toncheva, S. Tosato, J. Veijola,
J. Waddington, D. Walsh, D. Wang, Q. Wang, B. T. Webb, M. Weiser, D. B. Wildenauer, N. M.
Williams, S. Williams, S. H. Witt, A. R. Wolen, E. H. M. Wong, B. K. Wormley, H. Simon Xi,
C. C. Zai, X. Zheng, F. Zimprich, N. R. Wray, K. Stefansson, P. M. Visscher, Wellcome Trust
Case-Control Consortium, R. Adolfsson, O. A. Andreassen, D. H. R. Blackwood, E. Bramon,
J. D. Buxbaum, A. D. Børglum, S. Cichon, A. Darvasi, E. Domenici, H. Ehrenreich, T. Esko,
P. V. Gejman, M. Gill, H. Gurling, C. M. Hultman, N. Iwata, A. V. Jablensky, E. G. Jönsson,
K. S. Kendler, G. Kirov, J. Knight, T. Lencz, D. F. Levinson, Q. S. Li, J. Liu, A. K. Malhotra,
S. A. McCarroll, A. McQuillin, J. L. Moran, P. B. Mortensen, B. J. Mowry, M. M. Nöthen,
R. A. Ophoff, M. J. Owen, A. Palotie, C. N. Pato, T. L. Petryshen, D. Posthuma, M. Rietschel,
B. P. Riley, D. Rujescu, P. C. Sham, P. Sklar, D. St Clair, D. R. Weinberger, J. R. Wendland,
T. Werge, M. J. Daly, P. F. Sullivan, and M. C. O'Donovan, "Biological insights from 108
schizophrenia-associated genetic loci," Nature, vol. 511, pp. 421-427, July 2014.

32
D. Hnisz, B. Abraham, T. Lee, A. Lau, V. Saint-André, A. Sigova, H. Hoke, and R. Young,
"Super-Enhancers in the Control of Cell Identity and Disease," Cell, vol. 155, pp. 934-947, Nov.
2013.
33
M. M. Hoffman, J. Ernst, S. P. Wilder, A. Kundaje, R. S. Harris, M. Libbrecht, B. Giardine, P. M.
Ellenbogen, J. A. Bilmes, E. Birney, R. C. Hardison, I. Dunham, M. Kellis, and W. S. Noble,
"Integrative annotation of chromatin elements from ENCODE data," Nucleic Acids Research,
vol. 41, pp. 827-841, Jan. 2013.
34
K. Lindblad-Toh, M. Garber, O. Zuk, M. F. Lin, B. J. Parker, S. Washietl, P. Kheradpour,
J. Ernst, G. Jordan, E. Mauceli, L. D. Ward, C. B. Lowe, A. K. Holloway, M. Clamp, S. Gnerre,
J. Alföldi, K. Beal, J. Chang, H. Clawson, J. Cuff, F. Di Palma, S. Fitzgerald, P. Flicek,
M. Guttman, M. J. Hubisz, D. B. Jaffe, I. Jungreis, W. J. Kent, D. Kostka, M. Lara, A. L.
Martins, T. Massingham, I. Moltke, B. J. Raney, M. D. Rasmussen, J. Robinson, A. Stark, A. J.
Vilella, J. Wen, X. Xie, M. C. Zody, J. Baldwin, T. Bloom, C. Whye Chin, D. Heiman, R. Nicol,
C. Nusbaum, S. Young, J. Wilkinson, K. C. Worley, C. L. Kovar, D. M. Muzny, R. A. Gibbs,
A. Cree, H. H. Dihn, G. Fowler, S. Jhangiani, V. Joshi, S. Lee, L. R. Lewis, L. V. Nazareth,
G. Okwuonu, J. Santibanez, W. C. Warren, E. R. Mardis, G. M. Weinstock, R. K. Wilson,
K. Delehaunty, D. Dooling, C. Fronik, L. Fulton, B. Fulton, T. Graves, P. Minx, E. Sodergren,
E. Birney, E. H. Margulies, J. Herrero, E. D. Green, D. Haussler, A. Siepel, N. Goldman,
K. S. Pollard, J. S. Pedersen, E. S. Lander, and M. Kellis, "A high-resolution map of human
evolutionary constraint using 29 mammals," Nature, vol. 478, pp. 476-482, Oct. 2011.
35 L. D. Ward and M. Kellis, "Evidence of Abundant Purifying Selection in Humans for Recently
Acquired Regulatory Functions," Science, vol. 337, pp. 1675-1678, Sept. 2012.
36 R. Andersson, C. Gebhard, I. Miguel-Escalada, I. Hoof, J. Bornholdt, M. Boyd, Y. Chen, X. Zhao,
C. Schmidl, T. Suzuki, E. Ntini, E. Arner, E. Valen, K. Li, L. Schwarzfischer, D. Glatz, J. Raithel,
B. Lilje, N. Rapin, F. O. Bagger, M. Jørgensen, P. R. Andersen, N. Bertin, O. Rackham, A. M.
Burroughs, J. K. Baillie, Y. Ishizu, Y. Shimizu, E. Furuhata, S. Maeda, Y. Negishi, C. J. Mungall,
T. F. Meehan, T. Lassmann, M. Itoh, H. Kawaji, N. Kondo, J. Kawai, A. Lennartsson, C. O.
Daub, P. Heutink, D. A. Hume, T. H. Jensen, H. Suzuki, Y. Hayashizaki, F. Müller, The
FANTOM Consortium, A. R. R. Forrest, P. Carninci, M. Rehli, and A. Sandelin, "An atlas of
active enhancers across human cell types and tissues," Nature, vol. 507, pp. 455-461, Mar. 2014.
37
P. R. Burton, D. G. Clayton, L. R. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. P.
Kwiatkowski, M. I. McCarthy, W. H. Ouwehand, N. J. Samani, J. A. Todd, P. Donnelly, J. C.
Barrett, D. Davison, D. Easton, D. Evans, H.-T. Leung, J. L. Marchini, A. P. Morris, C. C. A.
Spencer, M. D. Tobin, A. P. Attwood, J. P. Boorman, B. Cant, U. Everson, J. M. Hussey, J. D.
Jolley, A. S. Knight, K. Koch, E. Meech, S. Nutland, C. V. Prowse, H. E. Stevens, N. C. Taylor,
G. R. Walters, N. M. Walker, N. A. Watkins, T. Winzer, R. W. Jones, W. L. McArdle, S. M.
Ring, D. P. Strachan, M. Pembrey, G. Breen, D. S. Clair, S. Caesar, K. Gordon-Smith, L. Jones,
C. Fraser, E. K. Green, D. Grozeva, M. L. Hamshere, P. A. Holmans, I. R. Jones, G. Kirov,
V. Moskvina, I. Nikolov, M. C. O'Donovan, M. J. Owen, D. A. Collier, A. Elkin, A. Farmer,

R. Williamson, P. McGuffin, A. H. Young, I. N. Ferrier, S. G. Ball, A. J. Balmforth, J. H. Barrett,
D. T. Bishop, M. M. Iles, A. Maqbool, N. Yuldasheva, A. S. Hall, P. S. Braund, R. J. Dixon,
M. Mangino, S. Stevens, J. R. Thompson, F. Bredin, M. Tremelling, M. Parkes, H. Drummond,
C. W. Lees, E. R. Nimmo, J. Satsangi, S. A. Fisher, A. Forbes, C. M. Lewis, C. M. Onnie, N. J.
Prescott, J. Sanderson, C. G. Mathew, J. Barbour, M. K. Mohiuddin, C. E. Todhunter, J. C.
Mansfield, T. Ahmad, F. R. Cummings, D. P. Jewell, J. Webster, M. J. Brown, G. M. Lathrop,
J. Connell, A. Dominiczak, C. A. B. Marcano, B. Burke, R. Dobson, J. Gungadoo, K. L. Lee, P. B.
Munroe, S. J. Newhouse, A. Onipinla, C. Wallace, M. Xue, M. Caulfield, M. Farrall, A. Barton,
T. B. i. R. G. a. G. (braggs), I. N. Bruce, H. Donovan, S. Eyre, P. D. Gilbert, S. L. Hider, A. M.
Hinks, S. L. John, C. Potter, A. J. Silman, D. P. M. Symmons, W. Thomson, J. Worthington,
D. B. Dunger, B. Widmer, T. M. Frayling, R. M. Freathy, H. Lango, J. R. B. Perry, B. M.
Shields, M. N. Weedon, A. T. Hattersley, G. A. Hitman, M. Walker, K. S. Elliott, C. J. Groves,
C. M. Lindgren, N. W. Rayner, N. J. Timpson, E. Zeggini, M. Newport, G. Sirugo, E. Lyons,
F. Vannberg, A. V. S. Hill, L. A. Bradbury, C. Farrar, J. J. Pointon, P. Wordsworth, M. A. Brown,
J. A. Franklyn, J. M. Heward, M. J. Simmonds, S. C. L. Gough, S. Seal, B. C. S. C. (uk), M. R.
Stratton, N. Rahman, M. Ban, A. Goris, S. J. Sawcer, A. Compston, D. Conway, M. Jallow,
K. A. Rockett, S. J. Bumpstead, A. Chaney, K. Downes, M. J. R. Ghori, R. Gwilliam, S. E.
Hunt, M. Inouye, A. Keniry, E. King, R. McGinnis, S. Potter, R. Ravindrarajah, P. Whittaker,
C. Widden, D. Withers, N. J. Cardin, T. Ferreira, J. Pereira-Gale, I. B. Hallgrimsdóttir, B. N.
Howie, Z. Su, Y. Y. Teo, D. Vukcevic, D. Bentley, and A. Compston, "Genome-wide association
study of 14,000 cases of seven common diseases and 3,000 shared controls," Nature, vol. 447,
pp. 661-678, June 2007.
38 J. A. Stamatoyannopoulos, "What does our genome encode?," Genome Research, vol. 22,
pp. 1602-1611, Sept. 2012.
39 S. Pott and J. D. Lieb, "What are super-enhancers?," Nature Genetics, vol. 47, pp. 8-12, Dec.
2014.
40 L. S. Lilly, Pathophysiology of Heart Disease: A Collaborative Project of Medical Students and
Faculty. Lippincott Williams & Wilkins, 2011. Google-Books-ID: bIF7PckmFMoC.
41 T. M. Teslovich, K. Musunuru, A. V. Smith, A. C. Edmondson, I. M. Stylianou, M. Koseki, J. P.
Pirruccello, S. Ripatti, D. I. Chasman, C. J. Willer, C. T. Johansen, S. W. Fouchier, A. Isaacs,
G. M. Peloso, M. Barbalic, S. L. Ricketts, J. C. Bis, Y. S. Aulchenko, G. Thorleifsson, M. F.
Feitosa, J. Chambers, M. Orho-Melander, O. Melander, T. Johnson, X. Li, X. Guo, M. Li,
Y. Shin Cho, M. Jin Go, Y. Jin Kim, J.-Y. Lee, T. Park, K. Kim, X. Sim, R. Twee-Hee Ong,
D. C. Croteau-Chonka, L. A. Lange, J. D. Smith, K. Song, J. Hua Zhao, X. Yuan, J. Luan,
C. Lamina, A. Ziegler, W. Zhang, R. Y. L. Zee, A. F. Wright, J. C. M. Witteman, J. F. Wilson,
G. Willemsen, H.-E. Wichmann, J. B. Whitfield, D. M. Waterworth, N. J. Wareham, G. Wae-
ber, P. Vollenweider, B. F. Voight, V. Vitart, A. G. Uitterlinden, M. Uda, J. Tuomilehto, J. R.
Thompson, T. Tanaka, I. Surakka, H. M. Stringham, T. D. Spector, N. Soranzo, J. H. Smit,
J. Sinisalo, K. Silander, E. J. G. Sijbrands, A. Scuteri, J. Scott, D. Schlessinger, S. Sanna, V. Sa-
lomaa, J. Saharinen, C. Sabatti, A. Ruokonen, I. Rudan, L. M. Rose, R. Roberts, M. Rieder,

B. M. Psaty, P. P. Pramstaller, I. Pichler, M. Perola, B. W. J. H. Penninx, N. L. Pedersen,
C. Pattaro, A. N. Parker, G. Pare, B. A. Oostra, C. J. O'Donnell, M. S. Nieminen, D. A.
Nickerson, G. W. Montgomery, T. Meitinger, R. McPherson, M. I. McCarthy, W. McArdle,
D. Masson, N. G. Martin, F. Marroni, M. Mangino, P. K. E. Magnusson, G. Lucas, R. Luben,
R. J. F. Loos, M.-L. Lokki, G. Lettre, C. Langenberg, L. J. Launer, E. G. Lakatta, R. Laaksonen,
K. O. Kyvik, F. Kronenberg, I. R. König, K.-T. Khaw, J. Kaprio, L. M. Kaplan, A. Johansson,
M.-R. Jarvelin, A. Cecile J. W. Janssens, E. Ingelsson, W. Igl, G. Kees Hovingh, J.-J. Hottenga,
A. Hofman, A. A. Hicks, C. Hengstenberg, I. M. Heid, C. Hayward, A. S. Havulinna, N. D. Hastie,
T. B. Harris, T. Haritunians, A. S. Hall, U. Gyllensten, C. Guiducci, L. C. Groop, E. Gonzalez,
C. Gieger, N. B. Freimer, L. Ferrucci, J. Erdmann, P. Elliott, K. G. Ejebe, A. Döring, A. F.
Dominiczak, S. Demissie, P. Deloukas, E. J. C. de Geus, U. de Faire, G. Crawford, F. S. Collins,
Y.-d. I. Chen, M. J. Caulfield, H. Campbell, N. P. Burtt, L. L. Bonnycastle, D. I. Boomsma,
S. M. Boekholdt, R. N. Bergman, I. Barroso, S. Bandinelli, C. M. Ballantyne, T. L. Assimes,
T. Quertermous, D. Altshuler, M. Seielstad, T. Y. Wong, E.-S. Tai, A. B. Feranil, C. W. Kuzawa,
L. S. Adair, H. A. Taylor Jr, I. B. Borecki, S. B. Gabriel, J. G. Wilson, H. Holm, U. Thorsteins-
dottir, V. Gudnason, R. M. Krauss, K. L. Mohlke, J. M. Ordovas, P. B. Munroe, J. S. Kooner,
A. R. Tall, R. A. Hegele, J. J. Kastelein, E. E. Schadt, J. I. Rotter, E. Boerwinkle, D. P. Stra-
chan, V. Mooser, K. Stefansson, M. P. Reilly, N. J. Samani, H. Schunkert, L. A. Cupples, M. S.
Sandhu, P. M. Ridker, D. J. Rader, C. M. van Duijn, L. Peltonen, G. R. Abecasis, M. Boehnke,
and S. Kathiresan, "Biological, clinical and population relevance of 95 loci for blood lipids,"
Nature, vol. 466, pp. 707-713, Aug. 2010.
42 W. M. Kettyle and R. A. Arky, Endocrine Pathophysiology. Philadelphia: Lippincott Williams
and Wilkins, Sept. 1998.

43 S. C. J. Parker, M. L. Stitzel, D. L. Taylor, J. M. Orozco, M. R. Erdos, J. A. Akiyama, K. L.
van Bueren, P. S. Chines, N. Narisu, NISC Comparative Sequencing Program, B. L. Black,
A. Visel, L. A. Pennacchio, F. S. Collins, National Institutes of Health Intramural Sequenc-
ing Center Comparative Sequencing Program Authors, NISC Comparative Sequencing Program
Authors., J. Becker, B. Benjamin, R. Blakesley, G. Bouffard, S. Brooks, H. Coleman, M. Dekht-
yar, M. Gregory, X. Guan, J. Gupta, J. Han, A. Hargrove, S.-l. Ho, T. Johnson, R. Legaspi,
S. Lovett, Q. Maduro, C. Masiello, B. Maskeri, J. McDowell, C. Montemayor, J. Mullikin,
M. Park, N. Riebow, K. Schandler, B. Schmidt, C. Sison, M. Stantripop, J. Thomas, P. Thomas,
M. Vemulapalli, and A. Young, "Chromatin stretch enhancer states drive cell-specific gene regu-
lation and harbor human disease risk variants," Proceedings of the National Academy of Sciences,
vol. 110, pp. 17921-17926, Oct. 2013.
44
L. Pasquali, K. J. Gaulton, S. A. Rodríguez-Seguí, L. Mularoni, I. Miguel-Escalada, A. Akerman,
J. J. Tena, I. Morán, C. Gómez-Marín, M. van de Bunt, J. Ponsa-Cobas, N. Castro,
T. Nammo, I. Cebola, J. García-Hurtado, M. A. Maestro, F. Pattou, L. Piemonti, T. Berney,
A. L. Gloyn, P. Ravassard, J. L. G. Skarmeta, F. Müller, M. I. McCarthy, and J. Ferrer,
"Pancreatic islet enhancer clusters enriched in type 2 diabetes risk-associated variants," Nature
Genetics, vol. 46, pp. 136-143, Jan. 2014.

45 P. Sklar, S. Ripke, L. J. Scott, O. A. Andreassen, S. Cichon, N. Craddock, H. J. Edenberg,
J. I. Nurnberger, M. Rietschel, D. Blackwood, A. Corvin, M. Flickinger, W. Guan, M. Mattings-
dal, A. McQuillin, P. Kwan, T. F. Wienker, M. Daly, F. Dudbridge, P. A. Holmans, D. Lin,
M. Burmeister, T. A. Greenwood, M. L. Hamshere, P. Muglia, E. N. Smith, P. P. Zandi, C. M.
Nievergelt, R. McKinney, P. D. Shilling, N. J. Schork, C. S. Bloss, T. Foroud, D. L. Koller, E. S.
Gershon, C. Liu, J. A. Badner, W. A. Scheftner, W. B. Lawson, E. A. Nwulia, M. Hipolito,
W. Coryell, J. Rice, W. Byerley, F. J. McMahon, T. G. Schulze, W. Berrettini, F. W. Lohoff,
J. B. Potash, P. B. Mahon, M. G. McInnis, S. Zöllner, P. Zhang, D. W. Craig, S. Szelinger,
T. B. Barrett, R. Breuer, S. Meier, J. Strohmaier, S. H. Witt, F. Tozzi, A. Farmer, P. McGuf-
fin, J. Strauss, W. Xu, J. L. Kennedy, J. B. Vincent, K. Matthews, R. Day, M. A. Ferreira,
C. O'Dushlaine, R. Perlis, S. Raychaudhuri, D. Ruderfer, P. L. Hyoun, J. W. Smoller, J. Li,
D. Absher, R. C. Thompson, F. G. Meng, A. F. Schatzberg, W. E. Bunney, J. D. Barchas, E. G.
Jones, S. J. Watson, R. M. Myers, H. Akil, M. Boehnke, K. Chambert, J. Moran, E. Scolnick,
S. Djurovic, I. Melle, G. Morken, M. Gill, D. Morris, E. Quinn, T. W. Mühleisen, F. A. Degenhardt,
M. Mattheisen, J. Schumacher, W. Maier, M. Steffens, P. Propping, M. M. Nöthen,
A. Anjorin, N. Bass, H. Gurling, R. Kandaswamy, J. Lawrence, K. McGhee, A. McIntosh,
A. W. McLean, W. J. Muir, B. S. Pickard, G. Breen, D. St. Clair, S. Caesar, K. Gordon-
Smith, L. Jones, C. Fraser, E. K. Green, D. Grozeva, I. R. Jones, G. Kirov, V. Moskvina,
I. Nikolov, M. C. O'Donovan, M. J. Owen, D. A. Collier, A. Elkin, R. Williamson, A. H. Young,
I. N. Ferrier, K. Stefansson, H. Stefansson, A. Adorgeirsson, S. Steinberg, A. Gustafsson, S. E.
Bergen, V. Nimgaonkar, C. Hultman, M. Landén, P. Lichtenstein, P. Sullivan, M. Schalling,
U. Osby, L. Backlund, L. Frisén, N. Langstrom, S. Jamain, M. Leboyer, B. Etain, F. Bellivier,
H. Petursson, E. Sigurðsson, B. Müller-Myhsok, S. Lucae, M. Schwarz, P. R. Schofield,
N. Martin, G. W. Montgomery, M. Lathrop, H. Óskarsson, M. Bauer, A. Wright, P. B. Mitchell,
M. Hautzinger, A. Reif, J. R. Kelsoe, and S. M. Purcell, "Large-scale genome-wide association
analysis of bipolar disorder identifies a new susceptibility locus near ODZ4," Nature Genetics,
vol. 43, pp. 977-983, Sept. 2011.
46
W. Wang, S. Shao, Z. Jiao, M. Guo, H. Xu, and S. Wang, "The Th17/Treg imbalance and
cytokine environment in peripheral blood of patients with rheumatoid arthritis," Rheumatology
International, vol. 32, pp. 887-893, Apr. 2012.
47
I. S. Farooqi, "Defining the neural basis of appetite and obesity: from genes to behaviour,"
Clinical Medicine (London, England), vol. 14, pp. 286-289, June 2014.
48 C. A. Rietveld, S. E. Medland, J. Derringer, J. Yang, T. Esko, N. W. Martin, H.-J. Westra,
K. Shakhbazov, A. Abdellaoui, A. Agrawal, E. Albrecht, B. Z. Alizadeh, N. Amin, J. Barnard,
S. E. Baumeister, K. S. Benke, L. F. Bielak, J. A. Boatman, P. A. Boyle, G. Davies, C. de Leeuw,
N. Eklund, D. S. Evans, R. Ferhmann, K. Fischer, C. Gieger, H. K. Gjessing, S. Hagg, J. R.
Harris, C. Hayward, C. Holzapfel, C. A. Ibrahim-Verbaas, E. Ingelsson, B. Jacobsson, P. K. Joshi,
A. Jugessur, M. Kaakinen, S. Kanoni, J. Karjalainen, I. Kolcic, K. Kristiansson, Z. Kutalik,
J. Lahti, S. H. Lee, P. Lin, P. A. Lind, Y. Liu, K. Lohman, M. Loitfelder, G. McMahon, P. M.
Vidal, O. Meirelles, L. Milani, R. Myhre, M.-L. Nuotio, C. J. Oldmeadow, K. E. Petrovic, W. J.

Peyrot, O. Polasek, L. Quaye, E. Reinmaa, J. P. Rice, T. S. Rizzi, H. Schmidt, R. Schmidt, A. V.
Smith, J. A. Smith, T. Tanaka, A. Terracciano, M. J. H. M. van der Loos, V. Vitart, H. Volzke,
J. Wellmann, L. Yu, W. Zhao, J. Allik, J. R. Attia, S. Bandinelli, F. Bastardot, J. Beauchamp,
D. A. Bennett, K. Berger, L. J. Bierut, D. I. Boomsma, U. Bultmann, H. Campbell, C. F. Chabris,
L. Cherkas, M. K. Chung, F. Cucca, M. de Andrade, P. L. De Jager, J.-E. De Neve, I. J. Deary,
G. V. Dedoussis, P. Deloukas, M. Dimitriou, G. Eiriksdottir, M. F. Elderson, J. G. Eriksson,
D. M. Evans, J. D. Faul, L. Ferrucci, M. E. Garcia, H. Gronberg, V. Guthnason, P. Hall, J. M.
Harris, T. B. Harris, N. D. Hastie, A. C. Heath, D. G. Hernandez, W. Hoffmann, A. Hofman,
R. Holle, E. G. Holliday, J.-J. Hottenga, W. G. Iacono, T. Illig, M.-R. Jarvelin, M. Kahonen,
J. Kaprio, R. M. Kirkpatrick, M. Kowgier, A. Latvala, L. J. Launer, D. A. Lawlor, T. Lehtimaki,
J. Li, P. Lichtenstein, P. Lichtner, D. C. Liewald, P. A. Madden, P. K. E. Magnusson, T. E.
Makinen, M. Masala, M. McGue, A. Metspalu, A. Mielck, M. B. Miller, G. W. Montgomery,
S. Mukherjee, D. R. Nyholt, B. A. Oostra, L. J. Palmer, A. Palotie, B. W. J. H. Penninx,
M. Perola, P. A. Peyser, M. Preisig, K. Raikkonen, O. T. Raitakari, A. Realo, S. M. Ring,
S. Ripatti, F. Rivadeneira, I. Rudan, A. Rustichini, V. Salomaa, A.-P. Sarin, D. Schlessinger,
R. J. Scott, H. Snieder, B. St Pourcain, J. M. Starr, J. H. Sul, I. Surakka, R. Svento, A. Teumer,
The LifeLines Cohort Study, H. Tiemeier, F. J. A. van Rooij, D. R. Van Wagoner, E. Vartiainen,
J. Viikari, P. Vollenweider, J. M. Vonk, G. Waeber, D. R. Weir, H.-E. Wichmann, E. Widen,
G. Willemsen, J. F. Wilson, A. F. Wright, D. Conley, G. Davey-Smith, L. Franke, P. J. F.
Groenen, A. Hofman, M. Johannesson, S. L. R. Kardia, R. F. Krueger, D. Laibson, N. G. Martin,
M. N. Meyer, D. Posthuma, A. R. Thurik, N. J. Timpson, A. G. Uitterlinden, C. M. van Duijn,
P. M. Visscher, D. J. Benjamin, D. Cesarini, and P. D. Koellinger, "GWAS of 126,559 Individuals
Identifies Genetic Variants Associated with Educational Attainment," Science, vol. 340, pp. 1467-
1471, June 2013.

49 H. Furberg, Y. Kim, J. Dackor, E. Boerwinkle, N. Franceschini, D. Ardissino, L. Bernardinelli,
P. M. Mannucci, F. Mauri, P. A. Merlini, D. Absher, T. L. Assimes, S. P. Fortmann, C. Iribarren,
J. W. Knowles, T. Quertermous, L. Ferrucci, T. Tanaka, J. C. Bis, C. D. Furberg, T. Haritunians,
B. McKnight, B. M. Psaty, K. D. Taylor, E. L. Thacker, P. Almgren, L. Groop, C. Ladenvall,
M. Boehnke, A. U. Jackson, K. L. Mohlke, H. M. Stringham, J. Tuomilehto, E. J. Benjamin,
S.-J. Hwang, D. Levy, S. R. Preis, R. S. Vasan, J. Duan, P. V. Gejman, D. F. Levinson, A. R.
Sanders, J. Shi, E. H. Lips, J. D. McKay, A. Agudo, L. Barzan, V. Bencko, S. Benhamou,
X. Castellsagué, C. Canova, D. I. Conway, E. Fabianova, L. Foretova, V. Janout, C. M. Healy,
I. Holcátová, K. Kjaerheim, P. Lagiou, J. Lissowska, R. Lowry, T. V. Macfarlane, D. Mates,
L. Richiardi, P. Rudnai, N. Szeszenia-Dabrowska, D. Zaridze, A. Znaor, M. Lathrop, P. Brennan,
S. Bandinelli, T. M. Frayling, J. M. Guralnik, Y. Milaneschi, J. R. B. Perry, D. Altshuler,
R. Elosua, S. Kathiresan, G. Lucas, O. Melander, C. J. O'Donnell, V. Salomaa, S. M. Schwartz,
B. F. Voight, B. W. Penninx, J. H. Smit, N. Vogelzangs, D. I. Boomsma, E. J. C. de Geus,
J. M. Vink, G. Willemsen, S. J. Chanock, F. Gu, S. E. Hankinson, D. J. Hunter, A. Hofman,
H. Tiemeier, A. G. Uitterlinden, C. M. van Duijn, S. Walter, D. I. Chasman, B. M. Everett,
G. Paré, P. M. Ridker, M. D. Li, H. H. Maes, J. Audrain-McGovern, D. Posthuma, L. M.
Thornton, C. Lerman, J. Kaprio, J. E. Rose, J. P. A. Ioannidis, P. Kraft, D.-Y. Lin, and P. F.

Sullivan, "Genome-wide meta-analyses identify multiple loci associated with smoking behavior,"
Nature Genetics, vol. 42, pp. 441-447, May 2010.

50 H. Lango Allen, K. Estrada, G. Lettre, S. I. Berndt, M. N. Weedon, F. Rivadeneira, C. J.
Willer, A. U. Jackson, S. Vedantam, S. Raychaudhuri, T. Ferreira, A. R. Wood, R. J. Weyant,
A. V. Segrè, E. K. Speliotes, E. Wheeler, N. Soranzo, J.-H. Park, J. Yang, D. Gudbjartsson,
N. L. Heard-Costa, J. C. Randall, L. Qi, A. Vernon Smith, R. Mägi, T. Pastinen, L. Liang,
I. M. Heid, J. Luan, G. Thorleifsson, T. W. Winkler, M. E. Goddard, K. Sin Lo, C. Palmer,
T. Workalemahu, Y. S. Aulchenko, A. Johansson, M. Carola Zillikens, M. F. Feitosa, T. Esko,
T. Johnson, S. Ketkar, P. Kraft, M. Mangino, I. Prokopenko, D. Absher, E. Albrecht, F. Ernst,
N. L. Glazer, C. Hayward, J.-J. Hottenga, K. B. Jacobs, J. W. Knowles, Z. Kutalik, K. L. Monda,
O. Polasek, M. Preuss, N. W. Rayner, N. R. Robertson, V. Steinthorsdottir, J. P. Tyrer, B. F.
Voight, F. Wiklund, J. Xu, J. Hua Zhao, D. R. Nyholt, N. Pellikka, M. Perola, J. R. B. Perry,
I. Surakka, M.-L. Tammesoo, E. L. Altmaier, N. Amin, T. Aspelund, T. Bhangale, G. Boucher,
D. I. Chasman, C. Chen, L. Coin, M. N. Cooper, A. L. Dixon, Q. Gibson, E. Grundberg, K. Hao,
M. Juhani Junttila, L. M. Kaplan, J. Kettunen, I. R. König, T. Kwan, R. W. Lawrence, D. F.
Levinson, M. Lorentzon, B. McKnight, A. P. Morris, M. Müller, J. Suh Ngwa, S. Purcell,
S. Rafelt, R. M. Salem, E. Salvi, S. Sanna, J. Shi, U. Sovio, J. R. Thompson, M. C. Turchin,
L. Vandenput, D. J. Verlaan, V. Vitart, C. C. White, A. Ziegler, P. Almgren, A. J. Balm-
forth, H. Campbell, L. Citterio, A. De Grandi, A. Dominiczak, J. Duan, P. Elliott, R. Elosua,
J. G. Eriksson, N. B. Freimer, E. J. C. Geus, N. Glorioso, S. Haiqing, A.-L. Hartikainen, A. S.
Havulinna, A. A. Hicks, J. Hui, W. Igl, T. Illig, A. Jula, E. Kajantie, T. O. Kilpeläinen,
M. Koiranen, I. Kolcic, S. Koskinen, P. Kovacs, J. Laitinen, J. Liu, M.-L. Lokki, A. Marusic,
A. Maschio, T. Meitinger, A. Mulas, G. Paré, A. N. Parker, J. F. Peden, A. Petersmann, I. Pichler,
K. H. Pietiläinen, A. Pouta, M. Ridderstråle, J. I. Rotter, J. G. Sambrook, A. R. Sanders,
C. Oliver Schmidt, J. Sinisalo, J. H. Smit, H. M. Stringham, G. Bragi Walters, E. Widen, S. H.
Wild, G. Willemsen, L. Zagato, L. Zgaga, P. Zitting, H. Alavere, M. Farrall, W. L. McArdle,
M. Nelis, M. J. Peters, S. Ripatti, J. B. J. van Meurs, K. K. Aben, K. G. Ardlie, J. S. Beckmann,
J. P. Beilby, R. N. Bergman, S. Bergmann, F. S. Collins, D. Cusi, M. den Heijer, G. Eiriksdottir,
P. V. Gejman, A. S. Hall, A. Hamsten, H. V. Huikuri, C. Iribarren, M. Kähönen, J. Kaprio,
S. Kathiresan, L. Kiemeney, T. Kocher, L. J. Launer, T. Lehtimäki, O. Melander, T. H.
Mosley Jr, A. W. Musk, M. S. Nieminen, C. J. O'Donnell, C. Ohlsson, B. Oostra, L. J.
Palmer, O. Raitakari, P. M. Ridker, J. D. Rioux, A. Rissanen, C. Rivolta, H. Schunkert, A. R.
Shuldiner, D. S. Siscovick, M. Stumvoll, A. Tönjes, J. Tuomilehto, G.-J. van Ommen, J. Viikari,
A. C. Heath, N. G. Martin, G. W. Montgomery, M. A. Province, M. Kayser, A. M. Arnold,
L. D. Atwood, E. Boerwinkle, S. J. Chanock, P. Deloukas, C. Gieger, H. Grönberg, P. Hall,
A. T. Hattersley, C. Hengstenberg, W. Hoffman, G. Mark Lathrop, V. Salomaa, S. Schreiber,
M. Uda, D. Waterworth, A. F. Wright, T. L. Assimes, I. Barroso, A. Hofman, K. L. Mohlke,
D. I. Boomsma, M. J. Caulfield, L. Adrienne Cupples, J. Erdmann, C. S. Fox, V. Gudnason,
U. Gyllensten, T. B. Harris, R. B. Hayes, M.-R. Jarvelin, V. Mooser, P. B. Munroe, W. H. Ouwe-
hand, B. W. Penninx, P. P. Pramstaller, T. Quertermous, I. Rudan, N. J. Samani, T. D. Spector,
H. Völzke, H. Watkins, J. F. Wilson, L. C. Groop, T. Haritunians, F. B. Hu, R. C. Kaplan,

A. Metspalu, K. E. North, D. Schlessinger, N. J. Wareham, D. J. Hunter, J. R. O'Connell,
D. P. Strachan, H.-E. Wichmann, I. B. Borecki, C. M. van Duijn, E. E. Schadt, U. Thorsteins-
dottir, L. Peltonen, A. G. Uitterlinden, P. M. Visscher, N. Chatterjee, R. J. F. Loos, M. Boehnke,
M. I. McCarthy, E. Ingelsson, C. M. Lindgren, G. R. Abecasis, K. Stefansson, T. M. Frayling,
and J. N. Hirschhorn, "Hundreds of variants clustered in genomic loci and biological pathways
affect human height," Nature, vol. 467, pp. 832-838, Oct. 2010.

51 E. K. Speliotes, C. J. Willer, S. I. Berndt, K. L. Monda, G. Thorleifsson, A. U. Jackson, H. L.
Allen, C. M. Lindgren, J. Luan, R. Mägi, J. C. Randall, S. Vedantam, T. W. Winkler, L. Qi,
T. Workalemahu, I. M. Heid, V. Steinthorsdottir, H. M. Stringham, M. N. Weedon, E. Wheeler,
A. R. Wood, T. Ferreira, R. J. Weyant, A. V. Segrè, K. Estrada, L. Liang, J. Nemesh, J.-H.
Park, S. Gustafsson, T. O. Kilpeläinen, J. Yang, N. Bouatia-Naji, T. Esko, M. F. Feitosa,
Z. Kutalik, M. Mangino, S. Raychaudhuri, A. Scherag, A. V. Smith, R. Welch, J. H. Zhao,
K. K. Aben, D. M. Absher, N. Amin, A. L. Dixon, E. Fisher, N. L. Glazer, M. E. Goddard,
N. L. Heard-Costa, V. Hoesel, J.-J. Hottenga, A. Johansson, T. Johnson, S. Ketkar, C. Lamina,
S. Li, M. F. Moffatt, R. H. Myers, N. Narisu, J. R. B. Perry, M. J. Peters, M. Preuss, S. Ri-
patti, F. Rivadeneira, C. Sandholt, L. J. Scott, N. J. Timpson, J. P. Tyrer, S. van Wingerden,
R. M. Watanabe, C. C. White, F. Wiklund, C. Barlassina, D. I. Chasman, M. N. Cooper, J.-O.
Jansson, R. W. Lawrence, N. Pellikka, I. Prokopenko, J. Shi, E. Thiering, H. Alavere, M. T. S.
Alibrandi, P. Almgren, A. M. Arnold, T. Aspelund, L. D. Atwood, B. Balkau, A. J. Balm-
forth, A. J. Bennett, Y. Ben-Shlomo, R. N. Bergman, S. Bergmann, H. Biebermann, A. I. F.
Blakemore, T. Boes, L. L. Bonnycastle, S. R. Bornstein, M. J. Brown, T. A. Buchanan, F. Busonero,
H. Campbell, F. P. Cappuccio, C. Cavalcanti-Proença, Y.-D. I. Chen, C.-M. Chen,
P. S. Chines, R. Clarke, L. Coin, J. Connell, I. N. M. Day, M. d. Heijer, J. Duan, S. Ebrahim,
P. Elliott, R. Elosua, G. Eiriksdottir, M. R. Erdos, J. G. Eriksson, M. F. Facheris, S. B. Felix,
P. Fischer-Posovszky, A. R. Folsom, N. Friedrich, N. B. Freimer, M. Fu, S. Gaget, P. V. Gejman,
E. J. C. Geus, C. Gieger, A. P. Gjesing, A. Goel, P. Goyette, H. Grallert, J. Gräßler,
D. M. Greenawalt, C. J. Groves, V. Gudnason, C. Guiducci, A.-L. Hartikainen, N. Hassanali,
A. S. Hall, A. S. Havulinna, C. Hayward, A. C. Heath, C. Hengstenberg, A. A. Hicks, A. Hin-
ney, A. Hofman, G. Homuth, J. Hui, W. Igl, C. Iribarren, B. Isomaa, K. B. Jacobs, I. Jarick,
E. Jewell, U. John, T. Jørgensen, P. Jousilahti, A. Jula, M. Kaakinen, E. Kajantie, L. M.
Kaplan, S. Kathiresan, J. Kettunen, L. Kinnunen, J. W. Knowles, I. Kolcic, I. R. König,
S. Koskinen, P. Kovacs, J. Kuusisto, P. Kraft, K. Kvaløy, J. Laitinen, O. Lantieri, C. Lanzani,
L. J. Launer, C. Lecoeur, T. Lehtimäki, G. Lettre, J. Liu, M.-L. Lokki, M. Lorentzon, R. N.
Luben, B. Ludwig, P. Manunta, D. Marek, M. Marre, N. G. Martin, W. L. McArdle, A. McCarthy,
B. McKnight, T. Meitinger, O. Melander, D. Meyre, K. Midthjell, G. W. Montgomery,
M. A. Morken, A. P. Morris, R. Mulic, J. S. Ngwa, M. Nelis, M. J. Neville, D. R. Nyholt, C. J.
O'Donnell, S. O'Rahilly, K. K. Ong, B. Oostra, G. Paré, A. N. Parker, M. Perola, I. Pichler,
K. H. Pietiläinen, C. G. P. Platou, O. Polasek, A. Pouta, S. Rafelt, O. Raitakari, N. W.
Rayner, M. Ridderstråle, W. Rief, A. Ruokonen, N. R. Robertson, P. Rzehak, V. Salomaa,
A. R. Sanders, M. S. Sandhu, S. Sanna, J. Saramies, M. J. Savolainen, S. Scherag, S. Schipf,
S. Schreiber, H. Schunkert, K. Silander, J. Sinisalo, D. S. Siscovick, J. H. Smit, N. Soranzo,

U. Sovio, J. Stephens, I. Surakka, A. J. Swift, M.-L. Tammesoo, J.-C. Tardif, M. Teder-Laving,
T. M. Teslovich, J. R. Thompson, B. Thomson, A. Tönjes, T. Tuomi, J. B. J. van Meurs,
G.-J. van Ommen, V. Vatin, J. Viikari, S. Visvikis-Siest, V. Vitart, C. I. G. Vogel, B. F. Voight,
L. L. Waite, H. Wallaschofski, G. B. Walters, E. Widen, S. Wiegand, S. H. Wild, G. Willemsen,
D. R. Witte, J. C. Witteman, J. Xu, Q. Zhang, L. Zgaga, A. Ziegler, P. Zitting, J. P. Beilby,
I. S. Farooqi, J. Hebebrand, H. V. Huikuri, A. L. James, M. Kähönen, D. F. Levinson,
F. Macciardi, M. S. Nieminen, C. Ohlsson, L. J. Palmer, P. M. Ridker, M. Stumvoll, J. S. Beck-
mann, H. Boeing, E. Boerwinkle, D. I. Boomsma, M. J. Caulfield, S. J. Chanock, F. S. Collins,
L. A. Cupples, G. D. Smith, J. Erdmann, P. Froguel, H. Grönberg, U. Gyllensten, P. Hall,
T. Hansen, T. B. Harris, A. T. Hattersley, R. B. Hayes, J. Heinrich, F. B. Hu, K. Hveem, T. Il-
lig, M.-R. Jarvelin, J. Kaprio, F. Karpe, K.-T. Khaw, L. A. Kiemeney, H. Krude, M. Laakso,
D. A. Lawlor, A. Metspalu, P. B. Munroe, W. H. Ouwehand, O. Pedersen, B. W. Penninx,
A. Peters, P. P. Pramstaller, T. Quertermous, T. Reinehr, A. Rissanen, I. Rudan, N. J. Samani,
P. E. H. Schwarz, A. R. Shuldiner, T. D. Spector, J. Tuomilehto, M. Uda, A. Uitterlinden, T. T.
Valle, M. Wabitsch, G. Waeber, N. J. Wareham, H. Watkins, J. F. Wilson, A. F. Wright, M. C.
Zillikens, N. Chatterjee, S. A. McCarroll, S. Purcell, E. E. Schadt, P. M. Visscher, T. L. Assimes,
I. B. Borecki, P. Deloukas, C. S. Fox, L. C. Groop, T. Haritunians, D. J. Hunter, R. C. Kaplan,
K. L. Mohlke, J. R. O'Connell, L. Peltonen, D. Schlessinger, D. P. Strachan, C. M. van Duijn,
H.-E. Wichmann, T. M. Frayling, U. Thorsteinsdottir, G. R. Abecasis, I. Barroso, M. Boehnke,
K. Stefansson, K. E. North, M. I. McCarthy, J. N. Hirschhorn, E. Ingelsson, and R. J. F. Loos,
"Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index,"
Nature Genetics, vol. 42, pp. 937-948, Nov. 2010.

52 J. R. B. Perry, F. Day, C. E. Elks, P. Sulem, D. J. Thompson, T. Ferreira, C. He, D. I. Chasman,
T. Esko, G. Thorleifsson, E. Albrecht, W. Q. Ang, T. Corre, D. L. Cousminer, B. Feenstra,
N. Franceschini, A. Ganna, A. D. Johnson, S. Kjellqvist, K. L. Lunetta, G. McMahon, I. M.
Nolte, L. Paternoster, E. Porcu, A. V. Smith, L. Stolk, A. Teumer, N. Tšernikova, E. Tikkanen,
S. Ulivi, E. K. Wagner, N. Amin, L. J. Bierut, E. M. Byrne, J.-J. Hottenga, D. L. Koller,
M. Mangino, T. H. Pers, L. M. Yerges-Armstrong, J. Hua Zhao, I. L. Andrulis, H. Anton-
Culver, F. Atsma, S. Bandinelli, M. W. Beckmann, J. Benitez, C. Blomqvist, S. E. Bojesen,
M. K. Bolla, B. Bonanni, H. Brauch, H. Brenner, J. E. Buring, J. Chang-Claude, S. Chanock,
J. Chen, G. Chenevix-Trench, J. M. Collée, F. J. Couch, D. Couper, A. D. Coviello, A. Cox,
K. Czene, A. P. D'adamo, G. Davey Smith, I. De Vivo, E. W. Demerath, J. Dennis, P. Devilee,
A. K. Dieffenbach, A. M. Dunning, G. Eiriksdottir, J. G. Eriksson, P. A. Fasching, L. Ferrucci,
D. Flesch-Janys, H. Flyger, T. Foroud, L. Franke, M. E. Garcia, M. García-Closas, F. Geller,
E. E. J. de Geus, G. G. Giles, D. F. Gudbjartsson, V. Gudnason, P. Guénel, S. Guo, P. Hall,
U. Hamann, R. Haring, C. A. Hartman, A. C. Heath, A. Hofman, M. J. Hooning, J. L. Hop-
per, F. B. Hu, D. J. Hunter, D. Karasik, D. P. Kiel, J. A. Knight, V.-M. Kosma, Z. Kutalik,
S. Lai, D. Lambrechts, A. Lindblom, R. Mägi, P. K. Magnusson, A. Mannermaa, N. G. Martin,
G. Masson, P. F. McArdle, W. L. McArdle, M. Melbye, K. Michailidou, E. Mihailov, L. Milani,
R. L. Milne, H. Nevanlinna, P. Neven, E. A. Nohr, A. J. Oldehinkel, B. A. Oostra, A. Palotie,
M. Peacock, N. L. Pedersen, P. Peterlongo, J. Peto, P. D. P. Pharoah, D. S. Postma, A. Pouta,
K. Pylkäs, P. Radice, S. Ring, F. Rivadeneira, A. Robino, L. M. Rose, A. Rudolph, V. Salomaa,
S. Sanna, D. Schlessinger, M. K. Schmidt, M. C. Southey, U. Sovio, M. J. Stampfer,
D. Stöckl, A. M. Storniolo, N. J. Timpson, J. Tyrer, J. A. Visser, P. Vollenweider, H. Völzke,
G. Waeber, M. Waldenberger, H. Wallaschofski, Q. Wang, G. Willemsen, R. Winqvist, B. H. R.
Wolffenbuttel, M. J. Wright, D. I. Boomsma, M. J. Econs, K.-T. Khaw, R. J. F. Loos, M. I.
McCarthy, G. W. Montgomery, J. P. Rice, E. A. Streeten, U. Thorsteinsdottir, C. M. van Duijn,
B. Z. Alizadeh, S. Bergmann, E. Boerwinkle, H. A. Boyd, L. Crisponi, P. Gasparini, C. Gieger,
T. B. Harris, E. Ingelsson, M.-R. Järvelin, P. Kraft, D. Lawlor, A. Metspalu, C. E. Pennell,
P. M. Ridker, H. Snieder, T. I. A. Sørensen, T. D. Spector, D. P. Strachan, A. G. Uitterlinden,
N. J. Wareham, E. Widen, M. Zygmunt, A. Murray, D. F. Easton, K. Stefansson, J. M.
Murabito, and K. K. Ong, "Parent-of-origin-specific allelic associations among 106 genomic loci
for age at menarche," Nature, vol. 514, pp. 92-97, July 2014.
53 H. Schunkert, I. R. König, S. Kathiresan, M. P. Reilly, T. L. Assimes, H. Holm, M. Preuss,
A. F. R. Stewart, M. Barbalic, C. Gieger, D. Absher, Z. Aherrahrou, H. Allayee, D. Altshuler,
S. S. Anand, K. Andersen, J. L. Anderson, D. Ardissino, S. G. Ball, A. J. Balmforth, T. A.
Barnes, D. M. Becker, L. C. Becker, K. Berger, J. C. Bis, S. M. Boekholdt, E. Boerwinkle,
P. S. Braund, M. J. Brown, M. S. Burnett, I. Buysschaert, J. F. Carlquist, L. Chen, S. Cichon,
V. Codd, R. W. Davies, G. Dedoussis, A. Dehghan, S. Demissie, J. M. Devaney, P. Diemert,
R. Do, A. Doering, S. Eifert, N. E. E. Mokhtari, S. G. Ellis, R. Elosua, J. C. Engert, S. E.
Epstein, U. de Faire, M. Fischer, A. R. Folsom, J. Freyer, B. Gigante, D. Girelli, S. Gretarsdottir,
V. Gudnason, J. R. Gulcher, E. Halperin, N. Hammond, S. L. Hazen, A. Hofman, B. D. Horne,
T. Illig, C. Iribarren, G. T. Jones, J. W. Jukema, M. A. Kaiser, L. M. Kaplan, J. J. P. Kastelein,
K.-T. Khaw, J. W. Knowles, G. Kolovou, A. Kong, R. Laaksonen, D. Lambrechts, K. Leander,
G. Lettre, M. Li, W. Lieb, C. Loley, A. J. Lotery, P. M. Mannucci, S. Maouche, N. Martinelli,
P. P. McKeown, C. Meisinger, T. Meitinger, O. Melander, P. A. Merlini, V. Mooser, T. Morgan,
T. W. Mühleisen, J. B. Muhlestein, T. Münzel, K. Musunuru, J. Nahrstaedt, C. P. Nelson,
M. M. Nöthen, O. Olivieri, R. S. Patel, C. C. Patterson, A. Peters, F. Peyvandi, L. Qu, A. A.
Quyyumi, D. J. Rader, L. S. Rallidis, C. Rice, F. R. Rosendaal, D. Rubin, V. Salomaa, M. L.
Sampietro, M. S. Sandhu, E. Schadt, A. Schäfer, A. Schillert, S. Schreiber, J. Schrezenmeir,
S. M. Schwartz, D. S. Siscovick, M. Sivananthan, S. Sivapalaratnam, A. Smith, T. B. Smith, J. D.
Snoep, N. Soranzo, J. A. Spertus, K. Stark, K. Stirrups, M. Stoll, W. H. W. Tang, S. Tennstedt,
G. Thorgeirsson, G. Thorleifsson, M. Tomaszewski, A. G. Uitterlinden, A. M. van Rij, B. F.
Voight, N. J. Wareham, G. A. Wells, H.-E. Wichmann, P. S. Wild, C. Willenborg, J. C. M.
Witteman, B. J. Wright, S. Ye, T. Zeller, A. Ziegler, F. Cambien, A. H. Goodall, L. A. Cupples,
T. Quertermous, W. März, C. Hengstenberg, S. Blankenberg, W. H. Ouwehand, A. S. Hall,
P. Deloukas, J. R. Thompson, K. Stefansson, R. Roberts, U. Thorsteinsdottir, C. J. O'Donnell,
R. McPherson, J. Erdmann, and N. J. Samani, "Large-scale association analysis identifies 13 new
susceptibility loci for coronary artery disease," Nature Genetics, vol. 43, pp. 333-338, Mar. 2011.
54 A. P. Morris, B. F. Voight, T. M. Teslovich, T. Ferreira, A. V. Segrè, V. Steinthorsdottir,
R. J. Strawbridge, H. Khan, H. Grallert, A. Mahajan, I. Prokopenko, H. M. Kang, C. Dina,
T. Esko, R. M. Fraser, S. Kanoni, A. Kumar, V. Lagou, C. Langenberg, J. Luan, C. M. Lindgren,
M. Müller-Nurasyid, S. Pechlivanis, N. W. Rayner, L. J. Scott, S. Wiltshire, L. Yengo,
L. Kinnunen, E. J. Rossin, S. Raychaudhuri, A. D. Johnson, A. S. Dimas, R. J. F. Loos,
S. Vedantam, H. Chen, J. C. Florez, C. Fox, C.-T. Liu, D. Rybin, D. J. Couper, W. H. L.
Kao, M. Li, M. C. Cornelis, P. Kraft, Q. Sun, R. M. van Dam, H. M. Stringham, P. S. Chines,
K. Fischer, P. Fontanillas, O. L. Holmen, S. E. Hunt, A. U. Jackson, A. Kong, R. Lawrence,
J. Meyer, J. R. B. Perry, C. G. P. Platou, S. Potter, E. Rehnberg, N. Robertson, S. Sivapalaratnam,
A. Stančáková, K. Stirrups, G. Thorleifsson, E. Tikkanen, A. R. Wood, P. Almgren,
M. Atalay, R. Benediktsson, L. L. Bonnycastle, N. Burtt, J. Carey, G. Charpentier, A. T. Cren-
shaw, A. S. F. Doney, M. Dorkhan, S. Edkins, V. Emilsson, E. Eury, T. Forsen, K. Gertow,
B. Gigante, G. B. Grant, C. J. Groves, C. Guiducci, C. Herder, A. B. Hreidarsson, J. Hui,
A. James, A. Jonsson, W. Rathmann, N. Klopp, J. Kravic, K. Krjutškov, C. Langford, K. Leander,
E. Lindholm, S. Lobbens, S. Männistö, G. Mirza, T. W. Mühleisen, B. Musk,
M. Parkin, L. Rallidis, J. Saramies, B. Sennblad, S. Shah, G. Sigurðsson, A. Silveira, G. Steinbach,
B. Thorand, J. Trakalo, F. Veglia, R. Wennauer, W. Winckler, D. Zabaneh, H. Campbell,
C. van Duijn, A. G. Uitterlinden, A. Hofman, E. Sijbrands, G. R. Abecasis, K. R. Owen, E. Zeggini,
M. D. Trip, N. G. Forouhi, A.-C. Syvänen, J. G. Eriksson, L. Peltonen, M. M. Nöthen,
B. Balkau, C. N. A. Palmer, V. Lyssenko, T. Tuomi, B. Isomaa, D. J. Hunter, L. Qi, A. R.
Shuldiner, M. Roden, I. Barroso, T. Wilsgaard, J. Beilby, K. Hovingh, J. F. Price, J. F. Wilson,
R. Rauramaa, T. A. Lakka, L. Lind, G. Dedoussis, I. Njølstad, N. L. Pedersen, K.-T. Khaw,
N. J. Wareham, S. M. Keinanen-Kiukaanniemi, T. E. Saaristo, E. Korpi-Hyövälti, J. Saltevo,
M. Laakso, J. Kuusisto, A. Metspalu, F. S. Collins, K. L. Mohlke, R. N. Bergman, J. Tuomilehto,
B. O. Boehm, C. Gieger, K. Hveem, S. Cauchi, P. Froguel, D. Baldassarre, E. Tremoli,
S. E. Humphries, D. Saleheen, J. Danesh, E. Ingelsson, S. Ripatti, V. Salomaa, R. Erbel, K.-H.
Jöckel, S. Moebus, A. Peters, T. Illig, U. de Faire, A. Hamsten, A. D. Morris, P. J. Donnelly,
T. M. Frayling, A. T. Hattersley, E. Boerwinkle, O. Melander, S. Kathiresan, P. M. Nilsson,
P. Deloukas, U. Thorsteinsdottir, L. C. Groop, K. Stefansson, F. Hu, J. S. Pankow, J. Dupuis,
J. B. Meigs, D. Altshuler, M. Boehnke, and M. I. McCarthy, "Large-scale association analysis
provides insights into the genetic architecture and pathophysiology of type 2 diabetes," Nature
Genetics, vol. 44, pp. 981-990, Aug. 2012.

55 A. K. Manning, M.-F. Hivert, R. A. Scott, J. L. Grimsby, N. Bouatia-Naji, H. Chen, D. Rybin,
C.-T. Liu, L. F. Bielak, I. Prokopenko, N. Amin, D. Barnes, G. Cadby, J.-J. Hottenga, E. Ingels-
son, A. U. Jackson, T. Johnson, S. Kanoni, C. Ladenvall, V. Lagou, J. Lahti, C. Lecoeur, Y. Liu,
M. T. Martinez-Larrad, M. E. Montasser, P. Navarro, J. R. B. Perry, L. J. Rasmussen-Torvik,
P. Salo, N. Sattar, D. Shungin, R. J. Strawbridge, T. Tanaka, C. M. van Duijn, P. An, M. de An-
drade, J. S. Andrews, T. Aspelund, M. Atalay, Y. Aulchenko, B. Balkau, S. Bandinelli, J. S.
Beckmann, J. P. Beilby, C. Bellis, R. N. Bergman, J. Blangero, M. Boban, M. Boehnke, E. Boerwinkle,
L. L. Bonnycastle, D. I. Boomsma, I. B. Borecki, Y. Böttcher, C. Bouchard, E. Brunner,
D. Budimir, H. Campbell, O. Carlson, P. S. Chines, R. Clarke, F. S. Collins, A. Corbatón-Anchuelo,
D. Couper, U. de Faire, G. V. Dedoussis, P. Deloukas, M. Dimitriou, J. M. Egan,
G. Eiriksdottir, M. R. Erdos, J. G. Eriksson, E. Eury, L. Ferrucci, I. Ford, N. G. Forouhi,
C. S. Fox, M. G. Franzosi, P. W. Franks, T. M. Frayling, P. Froguel, P. Galan, E. de Geus,
B. Gigante, N. L. Glazer, A. Goel, L. Groop, V. Gudnason, G. Hallmans, A. Hamsten, O. Hansson,
T. B. Harris, C. Hayward, S. Heath, S. Hercberg, A. A. Hicks, A. Hingorani, A. Hofman,
J. Hui, J. Hung, M.-R. Jarvelin, M. A. Jhun, P. C. D. Johnson, J. W. Jukema, A. Jula, W. H.
Kao, J. Kaprio, S. L. R. Kardia, S. Keinanen-Kiukaanniemi, M. Kivimaki, I. Kolcic, P. Kovacs,
M. Kumari, J. Kuusisto, K. O. Kyvik, M. Laakso, T. Lakka, L. Lannfelt, G. M. Lathrop,
L. J. Launer, K. Leander, G. Li, L. Lind, J. Lindstrom, S. Lobbens, R. J. F. Loos, J. Luan,
V. Lyssenko, R. Mägi, P. K. E. Magnusson, M. Marmot, P. Meneton, K. L. Mohlke, V. Mooser,
M. A. Morken, I. Miljkovic, N. Narisu, J. O'Connell, K. K. Ong, B. A. Oostra, L. J. Palmer,
A. Palotie, J. S. Pankow, J. F. Peden, N. L. Pedersen, M. Pehlic, L. Peltonen, B. Penninx,
M. Pericic, M. Perola, L. Perusse, P. A. Peyser, O. Polasek, P. P. Pramstaller, M. A. Province,
K. Räikkönen, R. Rauramaa, E. Rehnberg, K. Rice, J. I. Rotter, I. Rudan, A. Ruokonen,
T. Saaristo, M. Sabater-Lleal, V. Salomaa, D. B. Savage, R. Saxena, P. Schwarz, U. Seedorf,
B. Sennblad, M. Serrano-Rios, A. R. Shuldiner, E. J. G. Sijbrands, D. S. Siscovick, J. H. Smit,
K. S. Small, N. L. Smith, A. V. Smith, A. Stančáková, K. Stirrups, M. Stumvoll, Y. V.
Sun, A. J. Swift, A. Tönjes, J. Tuomilehto, S. Trompet, A. G. Uitterlinden, M. Uusitupa,
M. Vikström, V. Vitart, M.-C. Vohl, B. F. Voight, P. Vollenweider, G. Waeber, D. M. Waterworth,
H. Watkins, E. Wheeler, E. Widen, S. H. Wild, S. M. Willems, G. Willemsen, J. F.
Wilson, J. C. M. Witteman, A. F. Wright, H. Yaghootkar, D. Zelenika, T. Zemunik, L. Zgaga,
N. J. Wareham, M. I. McCarthy, I. Barroso, R. M. Watanabe, J. C. Florez, J. Dupuis, J. B.
Meigs, and C. Langenberg, "A genome-wide approach accounting for body mass index identi-
fies genetic variants influencing fasting glycemic traits and insulin resistance," Nature Genetics,
vol. 44, pp. 659-669, May 2012.

56 V. Boraska, C. S. Franklin, J. A. B. Floyd, L. M. Thornton, L. M. Huckins, L. Southam, N. W.
Rayner, I. Tachmazidou, K. L. Klump, J. Treasure, C. M. Lewis, U. Schmidt, F. Tozzi, K. Kiezebrink,
J. Hebebrand, P. Gorwood, R. A. H. Adan, M. J. H. Kas, A. Favaro, P. Santonastaso,
F. Fernández-Aranda, M. Gratacos, F. Rybakowski, M. Dmitrzak-Weglarz, J. Kaprio, A. Keski-Rahkonen,
A. Raevuori, E. F. Van Furth, M. C. T. Slof-Op 't Landt, J. I. Hudson, T. Reichborn-Kjennerud,
G. P. S. Knudsen, P. Monteleone, A. S. Kaplan, A. Karwautz, H. Hakonarson, W. H.
Berrettini, Y. Guo, D. Li, N. J. Schork, G. Komaki, T. Ando, H. Inoko, T. Esko, K. Fischer,
K. Männik, A. Metspalu, J. H. Baker, R. D. Cone, J. Dackor, J. E. DeSocio, C. E. Hilliard, J. K.
O'Toole, J. Pantel, J. P. Szatkiewicz, C. Taico, S. Zerwas, S. E. Trace, O. S. P. Davis, S. Helder,
K. Bühren, R. Burghardt, M. de Zwaan, K. Egberts, S. Ehrlich, B. Herpertz-Dahlmann,
W. Herzog, H. Imgart, A. Scherag, S. Scherag, S. Zipfel, C. Boni, N. Ramoz, A. Versini,
M. K. Brandys, U. N. Danner, C. de Kovel, J. Hendriks, B. P. C. Koeleman, R. A. Ophoff,
E. Strengman, A. A. van Elburg, A. Bruson, M. Clementi, D. Degortes, M. Forzan, E. Tenconi,
E. Docampo, G. Escaramís, S. Jiménez-Murcia, J. Lissowska, A. Rajewski, N. Szeszenia-Dabrowska,
A. Slopien, J. Hauser, L. Karhunen, I. Meulenbelt, P. E. Slagboom, A. Tortorella,
M. Maj, G. Dedoussis, D. Dikeos, F. Gonidakis, K. Tziouvas, A. Tsitsika, H. Papezova, L. Slach-
tova, D. Martaskova, J. L. Kennedy, R. D. Levitan, Z. Yilmaz, J. Huemer, D. Koubek, E. Merl,
G. Wagner, P. Lichtenstein, G. Breen, S. Cohen-Woods, A. Farmer, P. McGuffin, S. Cichon,
I. Giegling, S. Herms, D. Rujescu, S. Schreiber, H.-E. Wichmann, C. Dina, R. Sladek, G. Gambaro,
N. Soranzo, A. Julia, S. Marsal, R. Rabionet, V. Gaborieau, D. M. Dick, A. Palotie, S. Ripatti,
E. Widén, O. A. Andreassen, T. Espeseth, A. Lundervold, I. Reinvang, V. M. Steen,
S. Le Hellard, M. Mattingsdal, I. Ntalla, V. Bencko, L. Foretova, V. Janout, M. Navratilova,
S. Gallinger, D. Pinto, S. W. Scherer, H. Aschauer, L. Carlberg, A. Schosser, L. Alfredsson,
B. Ding, L. Klareskog, L. Padyukov, P. Courtet, S. Guillaume, I. Jaussent, C. Finan, G. Kalsi,
M. Roberts, D. W. Logan, L. Peltonen, G. R. S. Ritchie, J. C. Barrett, Wellcome Trust Case
Control Consortium 3, X. Estivill, A. Hinney, P. F. Sullivan, D. A. Collier, E. Zeggini, and C. M.
Bulik, "A genome-wide association study of anorexia nervosa," Molecular Psychiatry, vol. 19,
pp. 1085-1094, Oct. 2014.
57 Y. Okada, D. Wu, G. Trynka, T. Raj, C. Terao, K. Ikari, Y. Kochi, K. Ohmura, A. Suzuki,
S. Yoshida, R. R. Graham, A. Manoharan, W. Ortmann, T. Bhangale, J. C. Denny, R. J. Carroll,
A. E. Eyler, J. D. Greenberg, J. M. Kremer, D. A. Pappas, L. Jiang, J. Yin, L. Ye, D.-F. Su,
J. Yang, G. Xie, E. Keystone, H.-J. Westra, T. Esko, A. Metspalu, X. Zhou, N. Gupta, D. Mirel,
E. A. Stahl, D. Diogo, J. Cui, K. Liao, M. H. Guo, K. Myouzen, T. Kawaguchi, M. J. H. Coenen,
P. L. C. M. van Riel, M. A. F. J. van de Laar, H.-J. Guchelaar, T. W. J. Huizinga, P. Dieudé,
X. Mariette, S. Louis Bridges Jr, A. Zhernakova, R. E. M. Toes, P. P. Tak, C. Miceli-Richard,
S.-Y. Bang, H.-S. Lee, J. Martin, M. A. Gonzalez-Gay, L. Rodriguez-Rodriguez, S. Rantapää-Dahlqvist,
L. Ärlestig, H. K. Choi, Y. Kamatani, P. Galan, M. Lathrop, S. Eyre, J. Bowes,
A. Barton, N. de Vries, L. W. Moreland, L. A. Criswell, E. W. Karlson, A. Taniguchi, R. Yamada,
M. Kubo, J. S. Liu, S.-C. Bae, J. Worthington, L. Padyukov, L. Klareskog, P. K. Gregersen,
S. Raychaudhuri, B. E. Stranger, P. L. De Jager, L. Franke, P. M. Visscher, M. A. Brown,
H. Yamanaka, T. Mimori, A. Takahashi, H. Xu, T. W. Behrens, K. A. Siminovitch, S. Momohara,
F. Matsuda, K. Yamamoto, and R. M. Plenge, "Genetics of rheumatoid arthritis contributes to
biology and drug discovery," Nature, vol. 506, pp. 376-381, Dec. 2013.
58 L. Jostins, S. Ripke, R. K. Weersma, R. H. Duerr, D. P. McGovern, K. Y. Hui, J. C. Lee,
L. Philip Schumm, Y. Sharma, C. A. Anderson, J. Essers, M. Mitrovic, K. Ning, I. Cley-
nen, E. Theatre, S. L. Spain, S. Raychaudhuri, P. Goyette, Z. Wei, C. Abraham, J.-P. Achkar,
T. Ahmad, L. Amininejad, A. N. Ananthakrishnan, V. Andersen, J. M. Andrews, L. Baidoo,
T. Balschun, P. A. Bampton, A. Bitton, G. Boucher, S. Brand, C. Büning, A. Cohain, S. Cichon,
M. D'Amato, D. De Jong, K. L. Devaney, M. Dubinsky, C. Edwards, D. Ellinghaus,
L. R. Ferguson, D. Franchimont, K. Fransen, R. Gearry, M. Georges, C. Gieger, J. Glas, T. Har-
itunians, A. Hart, C. Hawkey, M. Hedl, X. Hu, T. H. Karlsen, L. Kupcinskas, S. Kugathasan,
A. Latiano, D. Laukens, I. C. Lawrance, C. W. Lees, E. Louis, G. Mahy, J. Mansfield, A. R.
Morgan, C. Mowat, W. Newman, O. Palmieri, C. Y. Ponsioen, U. Potocnik, N. J. Prescott,
M. Regueiro, J. I. Rotter, R. K. Russell, J. D. Sanderson, M. Sans, J. Satsangi, S. Schreiber,
L. A. Simms, J. Sventoraityte, S. R. Targan, K. D. Taylor, M. Tremelling, H. W. Verspaget,
M. De Vos, C. Wijmenga, D. C. Wilson, J. Winkelmann, R. J. Xavier, S. Zeissig, B. Zhang,
C. K. Zhang, H. Zhao, M. S. Silverberg, V. Annese, H. Hakonarson, S. R. Brant, G. Radford-
Smith, C. G. Mathew, J. D. Rioux, E. E. Schadt, M. J. Daly, A. Franke, M. Parkes, S. Vermeire,
J. C. Barrett, and J. H. Cho, "Host–microbe interactions have shaped the genetic architecture
of inflammatory bowel disease," Nature, vol. 491, pp. 119-124, Oct. 2012.

59 D. J. Liu, G. M. Peloso, X. Zhan, O. L. Holmen, M. Zawistowski, S. Feng, M. Nikpay, P. L. Auer,
A. Goel, H. Zhang, U. Peters, M. Farrall, M. Orho-Melander, C. Kooperberg, R. McPherson,
H. Watkins, C. J. Willer, K. Hveem, O. Melander, S. Kathiresan, and G. R. Abecasis,
"Meta-analysis of gene-level tests for rare variant association," Nature Genetics, vol. 46, no. 2,
pp. 200-204, 2014.
60 G. Trynka, H.-J. Westra, K. Slowikowski, X. Hu, H. Xu, B. E. Stranger, R. J. Klein, B. Han, and
S. Raychaudhuri, "Disentangling the Effects of Colocalizing Genomic Annotations to Functionally
Prioritize Non-coding Variants within Complex-Trait Loci," The American Journal of Human
Genetics, vol. 97, pp. 139-152, July 2015.
61 D. M. Altshuler, R. A. Gibbs, L. Peltonen, D. M. Altshuler, R. A. Gibbs, L. Peltonen, E. Dermitzakis,
S. F. Schaffner, F. Yu, L. Peltonen, E. Dermitzakis, P. E. Bonnen, D. M. Altshuler,
R. A. Gibbs, P. I. W. de Bakker, P. Deloukas, S. B. Gabriel, R. Gwilliam, S. Hunt, M. Inouye,
X. Jia, A. Palotie, M. Parkin, P. Whittaker, F. Yu, K. Chang, A. Hawes, L. R. Lewis, Y. Ren,
D. Wheeler, R. A. Gibbs, D. Marie Muzny, C. Barnes, K. Darvishi, M. Hurles, J. M. Korn,
K. Kristiansson, C. Lee, S. A. McCarroll, J. Nemesh, E. Dermitzakis, A. Keinan, S. B. Mont-
gomery, S. Pollack, A. L. Price, N. Soranzo, P. E. Bonnen, R. A. Gibbs, C. Gonzaga-Jauregui,
A. Keinan, A. L. Price, F. Yu, V. Anttila, W. Brodeur, M. J. Daly, S. Leslie, G. McVean,
L. Moutsianas, H. Nguyen, S. F. Schaffner, Q. Zhang, M. J. R. Ghori, R. McGinnis, W. McLaren,
S. Pollack, A. L. Price, S. F. Schaffner, F. Takeuchi, S. R. Grossman, I. Shlyakhter, E. B. Hostet-
ter, P. C. Sabeti, C. A. Adebamowo, M. W. Foster, D. R. Gordon, J. Licinio, M. Cristina Manca,
P. A. Marshall, I. Matsuda, D. Ngare, V. Ota Wang, D. Reddy, C. N. Rotimi, C. D. Royal, R. R.
Sharp, C. Zeng, L. D. Brooks, and J. E. McEwen, "Integrating common and rare genetic variation
in diverse human populations," Nature, vol. 467, pp. 52-58, Sept. 2010.
62 Y. Li and M. Kellis, "Joint Bayesian inference of risk variants and tissue-specific epigenomic
enrichments across multiple complex human diseases," Nucleic Acids Research, July 2016.

63 J. Pickrell, "Joint Analysis of Functional Genomic Data and Genome-wide Association Studies
of 18 Human Traits," The American Journal of Human Genetics, vol. 95, p. 126, July 2014.
64 The GTEx Consortium, K. G. Ardlie, D. S. Deluca, A. V. Segre, T. J. Sullivan, T. R. Young,
E. T. Gelfand, C. A. Trowbridge, J. B. Maller, T. Tukiainen, M. Lek, L. D. Ward, P. Kheradpour,
B. Iriarte, Y. Meng, C. D. Palmer, T. Esko, W. Winckler, J. N. Hirschhorn, M. Kellis, D. G.
MacArthur, G. Getz, A. A. Shabalin, G. Li, Y.-H. Zhou, A. B. Nobel, I. Rusyn, F. A. Wright,
T. Lappalainen, P. G. Ferreira, H. Ongen, M. A. Rivas, A. Battle, S. Mostafavi, J. Monlong,
M. Sammeth, M. Mele, F. Reverter, J. M. Goldmann, D. Koller, R. Guigo, M. I. McCarthy,
E. T. Dermitzakis, E. R. Gamazon, H. K. Im, A. Konkashbaev, D. L. Nicolae, N. J. Cox,
T. Flutre, X. Wen, M. Stephens, J. K. Pritchard, Z. Tu, B. Zhang, T. Huang, Q. Long, L. Lin,
J. Yang, J. Zhu, J. Liu, A. Brown, B. Mestichelli, D. Tidwell, E. Lo, M. Salvatore, S. Shad, J. A.
Thomas, J. T. Lonsdale, M. T. Moser, B. M. Gillard, E. Karasik, K. Ramsey, C. Choi, B. A.
Foster, J. Syron, J. Fleming, H. Magazine, R. Hasz, G. D. Walters, J. P. Bridge, M. Miklos,
S. Sullivan, L. K. Barker, H. M. Traino, M. Mosavel, L. A. Siminoff, D. R. Valley, D. C. Rohrer,
S. D. Jewell, P. A. Branton, L. H. Sobin, M. Barcus, L. Qi, J. McLean, P. Hariharan, K. S.
UM, S. Wu, D. Tabor, C. Shive, A. M. Smith, S. A. Buia, A. H. Undale, K. L. Robinson,
N. Roche, K. M. Valentino, A. Britton, R. Burges, D. Bradbury, K. W. Hambright, J. Seleski,
G. E. Korzeniewski, K. Erickson, Y. Marcus, J. Tejada, M. Taherian, C. Lu, M. Basile, D. C.
Mash, S. Volpi, J. P. Struewing, G. F. Temple, J. Boyer, D. Colantuoni, R. Little, S. Koester,
L. J. Carithers, H. M. Moore, P. Guan, C. Compton, S. J. Sawyer, J. P. Demchok, J. B. Vaught,
C. A. Rabiner, N. C. Lockhart, K. G. Ardlie, G. Getz, F. A. Wright, M. Kellis, S. Volpi, and
E. T. Dermitzakis, "The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene
regulation in humans," Science, vol. 348, pp. 648-660, May 2015.
65 H. Ongen, A. A. Brown, O. Delaneau, N. Panousis, A. C. Nica, GTEx Consortium, and E. T.
Dermitzakis, "Estimating the causal tissues for complex traits and diseases," bioRxiv, Sept. 2016.
66 X. Hu, H. Kim, E. Stahl, R. Plenge, M. Daly, and S. Raychaudhuri, "Integrating Autoimmune
Risk Loci with Gene-Expression Data Identifies Specific Pathogenic Immune Cell Subsets," The
American Journal of Human Genetics, vol. 89, pp. 496-506, Oct. 2011.
67 K. Slowikowski, X. Hu, and S. Raychaudhuri, "SNPsea: an algorithm to identify cell types,
tissues and pathways affected by risk loci," Bioinformatics, vol. 30, pp. 2496-2497, Sept. 2014.
68 T. H. Pers, J. M. Karjalainen, Y. Chan, H.-J. Westra, A. R. Wood, J. Yang, J. C. Lui, S. Vedantam,
S. Gustafsson, T. Esko, T. Frayling, E. K. Speliotes, M. Boehnke, S. Raychaudhuri, R. S. N.
Fehrmann, J. N. Hirschhorn, and L. Franke, "Biological interpretation of genome-wide association
studies using predicted gene functions," Nature Communications, vol. 6, p. 5890, Jan. 2015.
69 P. Gormley, V. Anttila, B. S. Winsvold, P. Palta, T. Esko, T. H. Pers, K.-H. Farh, E. Cuenca-Leon,
M. Muona, N. A. Furlotte, T. Kurth, A. Ingason, G. McMahon, L. Ligthart, G. M. Terwindt,
M. Kallela, T. M. Freilinger, C. Ran, S. G. Gordon, A. H. Stam, S. Steinberg, G. Borck,
M. Koiranen, L. Quaye, H. H. H. Adams, T. Lehtimäki, A.-P. Sarin, J. Wedenoja, D. A. Hinds,
J. E. Buring, M. Schürks, P. M. Ridker, M. G. Hrafnsdottir, H. Stefansson, S. M. Ring, J.-J.
Hottenga, B. W. J. H. Penninx, M. Färkkilä, V. Artto, M. Kaunisto, S. Vepsäläinen,
R. Malik, A. C. Heath, P. A. F. Madden, N. G. Martin, G. W. Montgomery, M. I. Kurki,
M. Kals, R. Mägi, K. Pärn, E. Hämäläinen, H. Huang, A. E. Byrnes, L. Franke,
J. Huang, E. Stergiakouli, P. H. Lee, C. Sandor, C. Webber, Z. Cader, B. Muller-Myhsok,
S. Schreiber, T. Meitinger, J. G. Eriksson, V. Salomaa, K. Heikkilä, E. Loehrer, A. G. Uitterlinden,
A. Hofman, C. M. van Duijn, L. Cherkas, L. M. Pedersen, A. Stubhaug, C. S. Nielsen,
M. Männikkö, E. Mihailov, L. Milani, H. Göbel, A.-L. Esserlind, A. F. Christensen, T. F.
Hansen, T. Werge, V. Anttila, V. Artto, A. C. Belin, D. I. Boomsma, S. Børte, D. I. Chasman,
L. Cherkas, A. F. Christensen, B. Cormand, E. Cuenca-Leon, G. D. Smith, M. Dichgans, C. van
Duijn, E. Eising, T. Esko, A.-L. Esserlind, M. Ferrari, R. R. Frants, T. M. Freilinger, N. A.
Furlotte, P. Gormley, L. Griffiths, E. Hamalainen, T. F. Hansen, M. Hiekkala, M. A. Ikram,
A. Ingason, M.-R. Järvelin, R. Kajanne, M. Kallela, J. Kaprio, M. Kaunisto, C. Kubisch,
M. Kurki, T. Kurth, L. Launer, T. Lehtimaki, D. Lessel, L. Ligthart, N. Litterman, A. M. J. M.
van den Maagdenberg, A. Macaya, R. Malik, M. Mangino, G. McMahon, B. Muller-Myhsok,
B. M. Neale, C. Northover, D. R. Nyholt, J. Olesen, A. Palotie, P. Palta, L. M. Pedersen,
N. Pedersen, D. Posthuma, P. Pozo-Rosich, A. Pressman, L. Quaye, O. Raitakari, M. Schürks,
C. Sintas, K. Stefansson, H. Stefansson, S. Steinberg, D. Strachan, G. M. Terwindt, M. Vila-Pueyo,
M. Wessman, B. S. Winsvold, W. Wrenthal, H. Zhao, J.-A. Zwart, J. Kaprio, A. J.
Aromaa, O. Raitakari, M. A. Ikram, T. Spector, M.-R. Järvelin, A. Metspalu, C. Kubisch,
D. P. Strachan, M. D. Ferrari, A. C. Belin, M. Dichgans, M. Wessman, A. M. J. M. van den
Maagdenberg, J.-A. Zwart, D. I. Boomsma, G. D. Smith, K. Stefansson, N. Eriksson, M. J. Daly,
B. M. Neale, J. Olesen, D. I. Chasman, D. R. Nyholt, and A. Palotie, "Meta-analysis of 375,000
individuals identifies 38 susceptibility loci for migraine," Nature Genetics, vol. 48, pp. 856-866,
June 2016.
70 R. S. N. Fehrmann, J. M. Karjalainen, M. Krajewska, H.-J. Westra, D. Maloney, A. Simeonov,
T. H. Pers, J. N. Hirschhorn, R. C. Jansen, E. A. Schultes, H. H. H. B. M. van Haagen, E. G. E.
de Vries, G. J. te Meerman, C. Wijmenga, M. A. T. M. van Vugt, and L. Franke, "Gene expression
analysis identifies global gene dosage sensitivity in cancer," Nature Genetics, vol. 47, pp. 115-125,
Feb. 2015.
71 A. R. Wood, T. Esko, J. Yang, S. Vedantam, T. H. Pers, S. Gustafsson, A. Y. Chu, K. Estrada,
J. Luan, Z. Kutalik, N. Amin, M. L. Buchkovich, D. C. Croteau-Chonka, F. R. Day, Y. Duan,
T. Fall, R. Fehrmann, T. Ferreira, A. U. Jackson, J. Karjalainen, K. S. Lo, A. E. Locke, R. Mägi,
E. Mihailov, E. Porcu, J. C. Randall, A. Scherag, A. A. E. Vinkhuyzen, H.-J. Westra, T. W.
Winkler, T. Workalemahu, J. H. Zhao, D. Absher, E. Albrecht, D. Anderson, J. Baron, M. Beek-
man, A. Demirkan, G. B. Ehret, B. Feenstra, M. F. Feitosa, K. Fischer, R. M. Fraser, A. Goel,
J. Gong, A. E. Justice, S. Kanoni, M. E. Kleber, K. Kristiansson, U. Lim, V. Lotay, J. C.
Lui, M. Mangino, I. M. Leach, C. Medina-Gomez, M. A. Nalls, D. R. Nyholt, C. D. Palmer,
D. Pasko, S. Pechlivanis, I. Prokopenko, J. S. Ried, S. Ripke, D. Shungin, A. Stančáková,
R. J. Strawbridge, Y. J. Sung, T. Tanaka, A. Teumer, S. Trompet, S. W. van der Laan, J. van
Setten, J. V. Van Vliet-Ostaptchouk, Z. Wang, L. Yengo, W. Zhang, U. Afzal, J. Ärnlöv,
G. M. Arscott, S. Bandinelli, A. Barrett, C. Bellis, A. J. Bennett, C. Berne, M. Blüher, J. L.
Bolton, Y. Böttcher, H. A. Boyd, M. Bruinenberg, B. M. Buckley, S. Buyske, I. H. Caspersen,
P. S. Chines, R. Clarke, S. Claudi-Boehm, M. Cooper, E. W. Daw, P. A. De Jong, J. Deelen,
G. Delgado, J. C. Denny, R. Dhonukshe-Rutten, M. Dimitriou, A. S. F. Doney, M. Dörr, N. Eklund,
E. Eury, L. Folkersen, M. E. Garcia, F. Geller, V. Giedraitis, A. S. Go, H. Grallert, T. B.
Grammer, J. Gräßler, H. Grönberg, L. C. P. G. M. de Groot, C. J. Groves, J. Haessler,
P. Hall, T. Haller, G. Hallmans, A. Hannemann, C. A. Hartman, M. Hassinen, C. Hayward, N. L.
Heard-Costa, Q. Helmer, G. Hemani, A. K. Henders, H. L. Hillege, M. A. Hlatky, W. Hoffmann,
P. Hoffmann, O. Holmen, J. J. Houwing-Duistermaat, T. Illig, A. Isaacs, A. L. James, J. Jeff,
B. Johansen, A. Johansson, J. Jolley, T. Juliusdottir, J. Junttila, A. N. Kho, L. Kinnunen,
N. Klopp, T. Kocher, W. Kratzer, P. Lichtner, L. Lind, J. Lindström, S. Lobbens, M. Lorentzon,
Y. Lu, V. Lyssenko, P. K. E. Magnusson, A. Mahajan, M. Maillard, W. L. McArdle, C. A.
McKenzie, S. McLachlan, P. J. McLaren, C. Menni, S. Merger, L. Milani, A. Moayyeri, K. L.
Monda, M. A. Morken, G. Müller, M. Müller-Nurasyid, A. W. Musk, N. Narisu, M. Nauck,
I. M. Nolte, M. M. Nöthen, L. Oozageer, S. Pilz, N. W. Rayner, F. Renstrom, N. R. Robertson,
L. M. Rose, R. Roussel, S. Sanna, H. Scharnagl, S. Scholtens, F. R. Schumacher, H. Schunkert,
R. A. Scott, J. Sehmi, T. Seufferlein, J. Shi, K. Silventoinen, J. H. Smit, A. V. Smith, J. Smolonska,
A. V. Stanton, K. Stirrups, D. J. Stott, H. M. Stringham, J. Sundström, M. A. Swertz,
A.-C. Syvänen, B. O. Tayo, G. Thorleifsson, J. P. Tyrer, S. van Dijk, N. M. van Schoor,
N. van der Velde, D. van Heemst, F. V. A. van Oort, S. H. Vermeulen, N. Verweij, J. M. Vonk,
L. L. Waite, M. Waldenberger, R. Wennauer, L. R. Wilkens, C. Willenborg, T. Wilsgaard, M. K.
Wojczynski, A. Wong, A. F. Wright, Q. Zhang, D. Arveiler, S. J. L. Bakker, J. Beilby, R. N.
Bergman, S. Bergmann, R. Biffar, J. Blangero, D. I. Boomsma, S. R. Bornstein, P. Bovet,
P. Brambilla, M. J. Brown, H. Campbell, M. J. Caulfield, A. Chakravarti, R. Collins, F. S.
Collins, D. C. Crawford, L. A. Cupples, J. Danesh, U. de Faire, H. M. den Ruijter, R. Erbel,
J. Erdmann, J. G. Eriksson, M. Farrall, E. Ferrannini, J. Ferrières, I. Ford, N. G. Forouhi,
T. Forrester, R. T. Gansevoort, P. V. Gejman, C. Gieger, A. Golay, O. Gottesman, V. Gudnason,
U. Gyllensten, D. W. Haas, A. S. Hall, T. B. Harris, A. T. Hattersley, A. C. Heath,
C. Hengstenberg, A. A. Hicks, L. A. Hindorff, A. D. Hingorani, A. Hofman, G. K. Hovingh, S. E.
Humphries, S. C. Hunt, E. Hypponen, K. B. Jacobs, M.-R. Jarvelin, P. Jousilahti, A. M. Jula,
J. Kaprio, J. J. P. Kastelein, M. Kayser, F. Kee, S. M. Keinanen-Kiukaanniemi, L. A. Kiemeney,
J. S. Kooner, C. Kooperberg, S. Koskinen, P. Kovacs, A. T. Kraja, M. Kumari, J. Kuusisto,
T. A. Lakka, C. Langenberg, L. Le Marchand, T. Lehtimäki, S. Lupoli, P. A. F. Madden,
S. Männistö, P. Manunta, A. Marette, T. C. Matise, B. McKnight, T. Meitinger, F. L. Moll,
G. W. Montgomery, A. D. Morris, A. P. Morris, J. C. Murray, M. Nelis, C. Ohlsson, A. J.
Oldehinkel, K. K. Ong, W. H. Ouwehand, G. Pasterkamp, A. Peters, P. P. Pramstaller, J. F.
Price, L. Qi, O. T. Raitakari, T. Rankinen, D. C. Rao, T. K. Rice, M. Ritchie, I. Rudan, V. Salomaa,
N. J. Samani, J. Saramies, M. A. Sarzynski, P. E. H. Schwarz, S. Sebert, P. Sever, A. R.
Shuldiner, J. Sinisalo, V. Steinthorsdottir, R. P. Stolk, J.-C. Tardif, A. Tönjes, A. Tremblay,
E. Tremoli, J. Virtamo, M.-C. Vohl, P. Amouyel, F. W. Asselbergs, T. L. Assimes, M. Bochud,
B. O. Boehm, E. Boerwinkle, E. P. Bottinger, C. Bouchard, S. Cauchi, J. C. Chambers, S. J.
Chanock, R. S. Cooper, P. I. W. de Bakker, G. Dedoussis, L. Ferrucci, P. W. Franks, P. Froguel,
L. C. Groop, C. A. Haiman, A. Hamsten, M. G. Hayes, J. Hui, D. J. Hunter, K. Hveem, J. W.
Jukema, R. C. Kaplan, M. Kivimaki, D. Kuh, M. Laakso, Y. Liu, N. G. Martin, W. März,
M. Melbye, S. Moebus, P. B. Munroe, I. Njølstad, B. A. Oostra, C. N. A. Palmer, N. L. Pedersen,
M. Perola, L. Pérusse, U. Peters, J. E. Powell, C. Power, T. Quertermous, R. Rauramaa,
E. Reinmaa, P. M. Ridker, F. Rivadeneira, J. I. Rotter, T. E. Saaristo, D. Saleheen, D. Schlessinger,
P. E. Slagboom, H. Snieder, T. D. Spector, K. Strauch, M. Stumvoll, J. Tuomilehto,
M. Uusitupa, P. van der Harst, H. Völzke, M. Walker, N. J. Wareham, H. Watkins, H.-E.
Wichmann, J. F. Wilson, P. Zanen, P. Deloukas, I. M. Heid, C. M. Lindgren, K. L. Mohlke,
E. K. Speliotes, U. Thorsteinsdottir, I. Barroso, C. S. Fox, K. E. North, D. P. Strachan, J. S.
Beckmann, S. I. Berndt, M. Boehnke, I. B. Borecki, M. I. McCarthy, A. Metspalu, K. Stefansson,
A. G. Uitterlinden, C. M. van Duijn, L. Franke, C. J. Willer, A. L. Price, G. Lettre, R. J. F.
Loos, M. N. Weedon, E. Ingelsson, J. R. O'Connell, G. R. Abecasis, D. I. Chasman, M. E.
Goddard, P. M. Visscher, J. N. Hirschhorn, and T. M. Frayling, "Defining the role of common
variation in the genomic and biological architecture of adult human height," Nature Genetics,
vol. 46, pp. 1173-1186, Oct. 2014.

72 A. E. Locke, B. Kahali, S. I. Berndt, A. E. Justice, T. H. Pers, F. R. Day, C. Powell, S. Vedantam,
M. L. Buchkovich, J. Yang, D. C. Croteau-Chonka, T. Esko, T. Fall, T. Ferreira, S. Gustafsson,
Z. Kutalik, J. Luan, R. Mägi, J. C. Randall, T. W. Winkler, A. R. Wood, T. Workalemahu,
J. D. Faul, J. A. Smith, J. Hua Zhao, W. Zhao, J. Chen, R. Fehrmann, A. K. Hedman, J. Kar-
jalainen, E. M. Schmidt, D. Absher, N. Amin, D. Anderson, M. Beekman, J. L. Bolton, J. L.
Bragg-Gresham, S. Buyske, A. Demirkan, G. Deng, G. B. Ehret, B. Feenstra, M. F. Feitosa,
K. Fischer, A. Goel, J. Gong, A. U. Jackson, S. Kanoni, M. E. Kleber, K. Kristiansson, U. Lim,
V. Lotay, M. Mangino, I. Mateo Leach, C. Medina-Gomez, S. E. Medland, M. A. Nalls, C. D.
Palmer, D. Pasko, S. Pechlivanis, M. J. Peters, I. Prokopenko, D. Shungin, A. Stančáková,
R. J. Strawbridge, Y. Ju Sung, T. Tanaka, A. Teumer, S. Trompet, S. W. van der Laan, J. van
Setten, J. V. Van Vliet-Ostaptchouk, Z. Wang, L. Yengo, W. Zhang, A. Isaacs, E. Albrecht,
J. Ärnlöv, G. M. Arscott, A. P. Attwood, S. Bandinelli, A. Barrett, I. N. Bas, C. Bellis, A. J.
Bennett, C. Berne, R. Blagieva, M. Blüher, S. Böhringer, L. L. Bonnycastle, Y. Böttcher,
H. A. Boyd, M. Bruinenberg, I. H. Caspersen, Y.-D. Ida Chen, R. Clarke, E. Warwick Daw,
A. J. M. de Craen, G. Delgado, M. Dimitriou, A. S. F. Doney, N. Eklund, K. Estrada, E. Eury,
L. Folkersen, R. M. Fraser, M. E. Garcia, F. Geller, V. Giedraitis, B. Gigante, A. S. Go, A. Golay,
A. H. Goodall, S. D. Gordon, M. Gorski, H.-J. Grabe, H. Grallert, T. B. Grammer, J. Gräßler,
H. Grönberg, C. J. Groves, G. Gusto, J. Haessler, P. Hall, T. Haller, G. Hallmans, C. A. Hartman,
M. Hassinen, C. Hayward, N. L. Heard-Costa, Q. Helmer, C. Hengstenberg, O. Holmen,
J.-J. Hottenga, A. L. James, J. M. Jeff, A. Johansson, J. Jolley, T. Juliusdottir, L. Kinnunen,
W. Koenig, M. Koskenvuo, W. Kratzer, J. Laitinen, C. Lamina, K. Leander, N. R. Lee, P. Lichtner,
L. Lind, J. Lindström, K. Sin Lo, S. Lobbens, R. Lorbeer, Y. Lu, F. Mach, P. K. E. Magnusson,
A. Mahajan, W. L. McArdle, S. McLachlan, C. Menni, S. Merger, E. Mihailov, L. Milani,
A. Moayyeri, K. L. Monda, M. A. Morken, A. Mulas, G. Müller, M. Müller-Nurasyid, A. W.
Musk, R. Nagaraja, M. M. Nöthen, I. M. Nolte, S. Pilz, N. W. Rayner, F. Renstrom, R. Rettig,
J. S. Ried, S. Ripke, N. R. Robertson, L. M. Rose, S. Sanna, H. Scharnagl, S. Scholtens, F. R.
Schumacher, W. R. Scott, T. Seufferlein, J. Shi, A. Vernon Smith, J. Smolonska, A. V. Stanton,
V. Steinthorsdottir, K. Stirrups, H. M. Stringham, J. Sundström, M. A. Swertz, A. J. Swift,
A.-C. Syvänen, S.-T. Tan, B. O. Tayo, B. Thorand, G. Thorleifsson, J. P. Tyrer, H.-W. Uh,
L. Vandenput, F. C. Verhulst, S. H. Vermeulen, N. Verweij, J. M. Vonk, L. L. Waite, H. R.
Warren, D. Waterworth, M. N. Weedon, L. R. Wilkens, C. Willenborg, T. Wilsgaard, M. K.
Wojczynski, A. Wong, A. F. Wright, Q. Zhang, The LifeLines Cohort Study, E. P. Brennan,
M. Choi, Z. Dastani, A. W. Drong, P. Eriksson, A. Franco-Cereceda, J. R. Gådin, A. G.
Gharavi, M. E. Goddard, R. E. Handsaker, J. Huang, F. Karpe, S. Kathiresan, S. Keildson,
K. Kiryluk, M. Kubo, J.-Y. Lee, L. Liang, R. P. Lifton, B. Ma, S. A. McCarroll, A. J. McK-
night, J. L. Min, M. F. Moffatt, G. W. Montgomery, J. M. Murabito, G. Nicholson, D. R.
Nyholt, Y. Okada, J. R. B. Perry, R. Dorajoo, E. Reinmaa, R. M. Salem, N. Sandholm, R. A.
Scott, L. Stolk, A. Takahashi, T. Tanaka, F. M. v. Hooft, A. A. E. Vinkhuyzen, H.-J. Westra,
W. Zheng, K. T. Zondervan, The ADIPOGen Consortium, The AGEN-BMI Working Group,
The CARDIOGRAMplusC4D Consortium, The CKDGen Consortium, The GLGC, The ICBP,
The MAGIC Investigators, The MuTHER Consortium, The MIGen Consortium, The PAGE
Consortium, The ReproGen Consortium, The GENIE Consortium, The International Endogene
Consortium, A. C. Heath, D. Arveiler, S. J. L. Bakker, J. Beilby, R. N. Bergman, J. Blangero,
P. Bovet, H. Campbell, M. J. Caulfield, G. Cesana, A. Chakravarti, D. I. Chasman, P. S. Chines,
F. S. Collins, D. C. Crawford, L. Adrienne Cupples, D. Cusi, J. Danesh, U. de Faire, H. M.
den Ruijter, A. F. Dominiczak, R. Erbel, J. Erdmann, J. G. Eriksson, M. Farrall, S. B. Felix,
E. Ferrannini, J. Ferrières, I. Ford, N. G. Forouhi, T. Forrester, O. H. Franco, R. T. Gansevoort,
P. V. Gejman, C. Gieger, O. Gottesman, V. Gudnason, U. Gyllensten, A. S. Hall, T. B.
Harris, A. T. Hattersley, A. A. Hicks, L. A. Hindorff, A. D. Hingorani, A. Hofman, G. Homuth,
G. Kees Hovingh, S. E. Humphries, S. C. Hunt, E. Hyppönen, T. Illig, K. B. Jacobs, M.-R.
Jarvelin, K.-H. Jöckel, B. Johansen, P. Jousilahti, J. Wouter Jukema, A. M. Jula, J. Kaprio,
J. J. P. Kastelein, S. M. Keinanen-Kiukaanniemi, L. A. Kiemeney, P. Knekt, J. S. Kooner,
C. Kooperberg, P. Kovacs, A. T. Kraja, M. Kumari, J. Kuusisto, T. A. Lakka, C. Langenberg,
L. Le Marchand, T. Lehtimäki, V. Lyssenko, S. Männistö, A. Marette, T. C. Matise, C. A.
McKenzie, B. McKnight, F. L. Moll, A. D. Morris, A. P. Morris, J. C. Murray, M. Nelis, C. Ohls-
son, A. J. Oldehinkel, K. K. Ong, P. A. F. Madden, G. Pasterkamp, J. F. Peden, A. Peters, D. S.
Postma, P. P. Pramstaller, J. F. Price, L. Qi, O. T. Raitakari, T. Rankinen, D. C. Rao, T. K.
Rice, P. M. Ridker, J. D. Rioux, M. D. Ritchie, I. Rudan, V. Salomaa, N. J. Samani, J. Saramies,
M. A. Sarzynski, H. Schunkert, P. E. H. Schwarz, P. Sever, A. R. Shuldiner, J. Sinisalo, R. P.
Stolk, K. Strauch, A. Tönjes, D.-A. Trégouët, A. Tremblay, E. Tremoli, J. Virtamo, M.-C. Vohl,
U. Völker, G. Waeber, G. Willemsen, J. C. Witteman, M. Carola Zillikens, L. S.
Adair, P. Amouyel, F. W. Asselbergs, T. L. Assimes, M. Bochud, B. O. Boehm, E. Boerwinkle,
S. R. Bornstein, E. P. Bottinger, C. Bouchard, S. Cauchi, J. C. Chambers, S. J. Chanock,
R. S. Cooper, P. I. W. de Bakker, G. Dedoussis, L. Ferrucci, P. W. Franks, P. Froguel, L. C.
Groop, C. A. Haiman, A. Hamsten, J. Hui, D. J. Hunter, K. Hveem, R. C. Kaplan, M. Kivimaki,
D. Kuh, M. Laakso, Y. Liu, N. G. Martin, W. März, M. Melbye, A. Metspalu, S. Moebus, P. B.
Munroe, I. Njølstad, B. A. Oostra, C. N. A. Palmer, N. L. Pedersen, M. Perola, L. Pérusse,
U. Peters, C. Power, T. Quertermous, R. Rauramaa, F. Rivadeneira, T. E. Saaristo, D. Sale-
heen, N. Sattar, E. E. Schadt, D. Schlessinger, P. Eline Slagboom, H. Snieder, T. D. Spector,
U. Thorsteinsdottir, M. Stumvoll, J. Tuomilehto, A. G. Uitterlinden, M. Uusitupa, P. van der
Harst, M. Walker, H. Wallaschofski, N. J. Wareham, H. Watkins, D. R. Weir, H.-E. Wichmann,
J. F. Wilson, P. Zanen, I. B. Borecki, P. Deloukas, C. S. Fox, I. M. Heid, J. R. O'Connell,
D. P. Strachan, K. Stefansson, C. M. van Duijn, G. R. Abecasis, L. Franke, T. M. Frayling, M. I.
McCarthy, P. M. Visscher, A. Scherag, C. J. Willer, M. Boehnke, K. L. Mohlke, C. M. Lindgren,
J. S. Beckmann, I. Barroso, K. E. North, E. Ingelsson, J. N. Hirschhorn, R. J. F. Loos, and E. K.
Speliotes, "Genetic studies of body mass index yield new insights for obesity biology," Nature,
vol. 518, pp. 197-206, Feb. 2015.
73 J. D. Cahoy, B. Emery, A. Kaushal, L. C. Foo, J. L. Zamanian, K. S. Christopherson, Y. Xing,
J. L. Lubischer, P. A. Krieg, S. A. Krupenko, W. J. Thompson, and B. A. Barres, "A Tran-
scriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Under-
standing Brain Development and Function," The Journal of Neuroscience, vol. 28, pp. 264-278,
Jan. 2008.

74 S. Akbarian, C. Liu, J. A. Knowles, F. M. Vaccarino, P. J. Farnham, G. E. Crawford, A. E.
Jaffe, D. Pinto, S. Dracheva, D. H. Geschwind, J. Mill, A. C. Nairn, A. Abyzov, S. Pochareddy,
S. Prabhakar, S. Weissman, P. F. Sullivan, M. W. State, Z. Weng, M. A. Peters, K. P. White,
M. B. Gerstein, A. Amiri, C. Armoskus, A. E. Ashley-Koch, T. Bae, A. Beckel-Mitchener, B. P.
Berman, G. A. Coetzee, G. Coppola, N. Francoeur, M. Fromer, R. Gao, K. Grennan, J. Herstein,
D. H. Kavanagh, N. A. Ivanov, Y. Jiang, R. R. Kitchen, A. Kozlenkov, M. Kundakovic, M. Li,
Z. Li, S. Liu, L. M. Mangravite, E. Mattei, E. Markenscoff-Papadimitriou, F. C. P. Navarro,
N. North, L. Omberg, D. Panchision, N. Parikshak, J. Poschmann, A. J. Price, M. Purcaro,
T. E. Reddy, P. Roussos, S. Schreiner, S. Scuderi, R. Sebra, M. Shibata, A. W. Shieh, M. Skarica,
W. Sun, V. Swarup, A. Thomas, J. Tsuji, H. van Bakel, D. Wang, Y. Wang, K. Wang, D. M.
Werling, A. J. Willsey, H. Witt, H. Won, C. C. Y. Wong, G. A. Wray, E. Y. Wu, X. Xu,
L. Yao, G. Senthil, T. Lehner, P. Sklar, and N. Sestan, "The PsychENCODE project," Nature
Neuroscience, vol. 18, pp. 1707-1712, Dec. 2015.
75 T. S. P. Heng, M. W. Painter, and Immunological Genome Project Consortium, "The Immuno-
logical Genome Project: networks of gene expression in immune cells," Nature Immunology,
vol. 9, pp. 1091-1094, Oct. 2008.
76 C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green,
M. Landray, B. Liu, P. Matthews, G. Ong, J. Pell, A. Silman, A. Young, T. Sprosen, T. Peakman,
and R. Collins, "UK Biobank: An Open Access Resource for Identifying the Causes of a Wide
Range of Complex Diseases of Middle and Old Age," PLOS Med, vol. 12, p. e1001779, Mar. 2015.
77 A. Okbay, J. P. Beauchamp, M. A. Fontana, J. J. Lee, T. H. Pers, C. A. Rietveld, P. Turley,
G.-B. Chen, V. Emilsson, S. F. W. Meddens, S. Oskarsson, J. K. Pickrell, K. Thom, P. Timshel,
R. de Vlaming, A. Abdellaoui, T. S. Ahluwalia, J. Bacelis, C. Baumbach, G. Bjornsdottir, J. H.
Brandsma, M. Pina Concas, J. Derringer, N. A. Furlotte, T. E. Galesloot, G. Girotto, R. Gupta,
L. M. Hall, S. E. Harris, E. Hofer, M. Horikoshi, J. E. Huffman, K. Kaasik, I. P. Kalafati,
R. Karlsson, A. Kong, J. Lahti, S. J. v. d. Lee, C. deLeeuw, P. A. Lind, K.-O. Lindgren, T. Liu,
M. Mangino, J. Marten, E. Mihailov, M. B. Miller, P. J. van der Most, C. Oldmeadow, A. Payton,
N. Pervjakova, W. J. Peyrot, Y. Qian, O. Raitakari, R. Rueedi, E. Salvi, B. Schmidt, K. E.
Schraut, J. Shi, A. V. Smith, R. A. Poot, B. St Pourcain, A. Teumer, G. Thorleifsson, N. Verweij,
D. Vuckovic, J. Wellmann, H.-J. Westra, J. Yang, W. Zhao, Z. Zhu, B. Z. Alizadeh, N. Amin,
A. Bakshi, S. E. Baumeister, G. Biino, K. Bønnelykke, P. A. Boyle, H. Campbell, F. P. Cappuccio,
G. Davies, J.-E. De Neve, P. Deloukas, I. Demuth, J. Ding, P. Eibich, L. Eisele, N. Eklund,
D. M. Evans, J. D. Faul, M. F. Feitosa, A. J. Forstner, I. Gandin, B. Gunnarsson, B. V. Halldórsson,
T. B. Harris, A. C. Heath, L. J. Hocking, E. G. Holliday, G. Homuth, M. A. Horan, J.-J.
Hottenga, P. L. de Jager, P. K. Joshi, A. Jugessur, M. A. Kaakinen, M. Kähönen, S. Kanoni,
L. Keltikangas-Järvinen, L. A. L. M. Kiemeney, I. Kolcic, S. Koskinen, A. T. Kraja, M. Kroh,
Z. Kutalik, A. Latvala, L. J. Launer, M. P. Lebreton, D. F. Levinson, P. Lichtenstein, P. Lichtner,
D. C. M. Liewald, LifeLines Cohort Study, A. Loukola, P. A. Madden, R. Mägi, T. Mäki-Opas, R. E.
Marioni, P. Marques-Vidal, G. A. Meddens, G. McMahon, C. Meisinger, T. Meitinger, Y. Mi-
laneschi, L. Milani, G. W. Montgomery, R. Myhre, C. P. Nelson, D. R. Nyholt, W. E. R. Ollier,
A. Palotie, L. Paternoster, N. L. Pedersen, K. E. Petrovic, D. J. Porteous, K. Räikkönen,
S. M. Ring, A. Robino, O. Rostapshova, I. Rudan, A. Rustichini, V. Salomaa, A. R. Sanders,
A.-P. Sarin, H. Schmidt, R. J. Scott, B. H. Smith, J. A. Smith, J. A. Staessen, E. Steinhagen-
Thiessen, K. Strauch, A. Terracciano, M. D. Tobin, S. Ulivi, S. Vaccargiu, L. Quaye, F. J. A.
van Rooij, C. Venturini, A. A. E. Vinkhuyzen, U. Völker, H. Völzke, J. M. Vonk, D. Vozzi,
J. Waage, E. B. Ware, G. Willemsen, J. R. Attia, D. A. Bennett, K. Berger, L. Bertram, H. Bisgaard,
D. I. Boomsma, I. B. Borecki, U. Bültmann, C. F. Chabris, F. Cucca, D. Cusi, I. J.
Deary, G. V. Dedoussis, C. M. van Duijn, J. G. Eriksson, B. Franke, L. Franke, P. Gasparini,
P. V. Gejman, C. Gieger, H.-J. Grabe, J. Gratten, P. J. F. Groenen, V. Gudnason, P. van der
Harst, C. Hayward, D. A. Hinds, W. Hoffmann, E. Hyppönen, W. G. Iacono, B. Jacobsson,
M.-R. Järvelin, K.-H. Jöckel, J. Kaprio, S. L. R. Kardia, T. Lehtimäki, S. F. Lehrer,
P. K. E. Magnusson, N. G. Martin, M. McGue, A. Metspalu, N. Pendleton, B. W. J. H. Penninx,
M. Perola, N. Pirastu, M. Pirastu, O. Polasek, D. Posthuma, C. Power, M. A. Province,
N. J. Samani, D. Schlessinger, R. Schmidt, T. I. A. Sørensen, T. D. Spector, K. Stefansson,
U. Thorsteinsdottir, A. R. Thurik, N. J. Timpson, H. Tiemeier, J. Y. Tung, A. G. Uitterlinden,
V. Vitart, P. Vollenweider, D. R. Weir, J. F. Wilson, A. F. Wright, D. C. Conley, R. F. Krueger,
G. Davey Smith, A. Hofman, D. I. Laibson, S. E. Medland, M. N. Meyer, J. Yang, M. Johannes-
son, P. M. Visscher, T. Esko, P. D. Koellinger, D. Cesarini, and D. J. Benjamin, "Genome-wide
association study identifies 74 loci associated with educational attainment," Nature, vol. 533,
pp. 539-542, May 2016.

78 A. Okbay, B. M. L. Baselmans, J.-E. De Neve, P. Turley, M. G. Nivard, M. A. Fontana,
S. F. W. Meddens, R. K. Linnér, C. A. Rietveld, J. Derringer, J. Gratten, J. J. Lee, J. Z.
Liu, R. de Vlaming, T. S. Ahluwalia, J. Buchwald, A. Cavadino, A. C. Frazier-Wood, N. A.
Furlotte, V. Garfield, M. H. Geisel, J. R. Gonzalez, S. Haitjema, R. Karlsson, S. W. van der
Laan, K.-H. Ladwig, J. Lahti, S. J. van der Lee, P. A. Lind, T. Liu, L. Matteson, E. Mihailov,
M. B. Miller, C. C. Minica, I. M. Nolte, D. Mook-Kanamori, P. J. van der Most, C. Oldmeadow,
Y. Qian, O. Raitakari, R. Rawal, A. Realo, R. Rueedi, B. Schmidt, A. V. Smith, E. Stergiakouli,
T. Tanaka, K. Taylor, G. Thorleifsson, J. Wedenoja, J. Wellmann, H.-J. Westra, S. M.
Willems, W. Zhao, LifeLines Cohort Study, N. Amin, A. Bakshi, S. Bergmann, G. Bjornsdottir,
P. A. Boyle, S. Cherney, S. R. Cox, G. Davies, O. S. P. Davis, J. Ding, N. Direk, P. Eibich,
R. T. Emeny, G. Fatemifar, J. D. Faul, L. Ferrucci, A. J. Forstner, C. Gieger, R. Gupta, T. B.
Harris, J. M. Harris, E. G. Holliday, J.-J. Hottenga, P. L. De Jager, M. A. Kaakinen, E. Ka-
jantie, V. Karhunen, I. Kolcic, M. Kumari, L. J. Launer, L. Franke, R. Li-Gao, D. C. Liewald,
M. Koini, A. Loukola, P. Marques-Vidal, G. W. Montgomery, M. A. Mosing, L. Paternoster,
A. Pattie, K. E. Petrovic, L. Pulkki-Råback, L. Quaye, K. Räikkönen, I. Rudan, R. J.
Scott, J. A. Smith, A. R. Sutin, M. Trzaskowski, A. E. Vinkhuyzen, L. Yu, D. Zabaneh, J. R.
Attia, D. A. Bennett, K. Berger, L. Bertram, D. I. Boomsma, H. Snieder, S.-C. Chang, F. Cucca,
I. J. Deary, C. M. van Duijn, J. G. Eriksson, U. Bültmann, E. J. C. de Geus, P. J. F. Groenen,
V. Gudnason, T. Hansen, C. A. Hartman, C. M. A. Haworth, C. Hayward, A. C. Heath,
D. A. Hinds, E. Hyppönen, W. G. Iacono, M.-R. Järvelin, K.-H. Jöckel, J. Kaprio, S. L. R.
Kardia, L. Keltikangas-Järvinen, P. Kraft, L. D. Kubzansky, T. Lehtimäki, P. K. E. Magnusson,
N. G. Martin, M. McGue, A. Metspalu, M. Mills, R. de Mutsert, A. J. Oldehinkel,
G. Pasterkamp, N. L. Pedersen, R. Plomin, O. Polasek, C. Power, S. S. Rich, F. R. Rosendaal,
H. M. den Ruijter, D. Schlessinger, H. Schmidt, R. Svento, R. Schmidt, B. Z. Alizadeh, T. I. A.
Sørensen, T. D. Spector, J. M. Starr, K. Stefansson, A. Steptoe, A. Terracciano, U. Thorsteinsdottir,
A. R. Thurik, N. J. Timpson, H. Tiemeier, A. G. Uitterlinden, P. Vollenweider, G. G.
Wagner, D. R. Weir, J. Yang, D. C. Conley, G. D. Smith, A. Hofman, M. Johannesson, D. I.
Laibson, S. E. Medland, M. N. Meyer, J. K. Pickrell, T. Esko, R. F. Krueger, J. P. Beauchamp,
P. D. Koellinger, D. J. Benjamin, M. Bartels, and D. Cesarini, "Genetic variants associated with
subjective well-being, depressive symptoms, and neuroticism identified through genome-wide
analyses," Nature Genetics, vol. 48, pp. 624-633, June 2016.
79 J. P. Bradfield, H.-Q. Qu, K. Wang, H. Zhang, P. M. Sleiman, C. E. Kim, F. D. Mentch, H. Qiu,
J. T. Glessner, K. A. Thomas, E. C. Frackelton, R. M. Chiavacci, M. Imielinski, D. S. Monos,
R. Pandey, M. Bakay, S. F. A. Grant, C. Polychronakos, and H. Hakonarson, "A Genome-Wide
Meta-Analysis of Six Type 1 Diabetes Cohorts Identifies Multiple Associated Loci," PLOS Genet,
vol. 7, p. e1002293, Sept. 2011.
80 P. C. A. Dubois, G. Trynka, L. Franke, K. A. Hunt, J. Romanos, A. Curtotti, A. Zhernakova,
G. A. R. Heap, R. Ádány, A. Aromaa, M. T. Bardella, L. H. van den Berg, N. A. Bockett, E. G.
de la Concha, B. Dema, R. S. N. Fehrmann, M. Fernández-Arquero, S. Fiatal, E. Grandone,
P. M. Green, H. J. M. Groen, R. Gwilliam, R. H. J. Houwen, S. E. Hunt, K. Kaukinen, D. Kelleher,
I. Korponay-Szabo, K. Kurppa, P. MacMathuna, M. Mäki, M. C. Mazzilli, O. T. McCann,
M. L. Mearin, C. A. Mein, M. M. Mirza, V. Mistry, B. Mora, K. I. Morley, C. J. Mulder, J. A.
Murray, C. Núñez, E. Oosterom, R. A. Ophoff, I. Polanco, L. Peltonen, M. Platteel, A. Rybak,
V. Salomaa, J. J. Schweizer, M. P. Sperandeo, G. J. Tack, G. Turner, J. H. Veldink, W. H. M.
Verbeek, R. K. Weersma, V. M. Wolters, E. Urcelay, B. Cukrowska, L. Greco, S. L. Neuhausen,
R. McManus, D. Barisani, P. Deloukas, J. C. Barrett, P. Saavalainen, C. Wijmenga, and D. A.
van Heel, "Multiple common variants for celiac disease influencing immune gene expression,"
Nature Genetics, vol. 42, pp. 295-302, Apr. 2010.
81 J. Bentham, D. L. Morris, D. S. Cunninghame Graham, C. L. Pinder, P. Tombleson, T. W.
Behrens, J. Martín, B. P. Fairfax, J. C. Knight, L. Chen, J. Replogle, A.-C. Syvänen,
L. Rönnblom, R. R. Graham, J. E. Wither, J. D. Rioux, M. E. Alarcón-Riquelme, and
T. J. Vyse, "Genetic association analyses implicate aberrant regulation of innate and adaptive
immunity genes in the pathogenesis of systemic lupus erythematosus," Nature Genetics, vol. 47,
pp. 1457-1464, Dec. 2015.
82 H. J. Cordell, Y. Han, G. F. Mells, Y. Li, G. M. Hirschfield, C. S. Greene, G. Xie, B. D. Juran,
D. Zhu, D. C. Qian, J. A. B. Floyd, K. I. Morley, D. Prati, A. Lleo, D. Cusi, E. M. Schlicht,
C. Lammert, E. J. Atkinson, L. L. Chan, M. de Andrade, T. Balschun, A. L. Mason, R. P. Myers,
J. Zhang, P. Milkiewicz, J. Qu, J. A. Odin, V. A. Luketic, B. R. Bacon, H. C. Bodenheimer Jr,
V. Liakina, C. Vincent, C. Levy, P. K. Gregersen, P. L. Almasio, D. Alvaro, P. Andreone, A. An-
driulli, C. Barlassina, P. M. Battezzati, A. Benedetti, F. Bernuzzi, I. Bianchi, M. C. Bragazzi,
M. Brunetto, S. Bruno, G. Casella, B. Coco, A. Colli, M. Colombo, S. Colombo, C. Cursaro, L. S.
Crocè, A. Crosignani, M. F. Donato, G. Elia, L. Fabris, C. Ferrari, A. Floreani, B. Foglieni,
R. Fontana, A. Galli, R. Lazzari, F. Macaluso, F. Malinverno, F. Marra, M. Marzioni, A. Mat-
talia, R. Montanari, L. Morini, F. Morisco, M. Hani S, L. Muratori, P. Muratori, G. A. Niro,
V. O. Palmieri, A. Picciotto, M. Podda, P. Portincasa, V. Ronca, F. Rosina, S. Rossi, I. Sogno,
G. Spinzi, M. Spreafico, M. Strazzabosco, S. Tarallo, M. Tarocchi, C. Tiribelli, P. Toniutto,
M. Vinci, M. Zuin, C. L. Ch'ng, M. Rahman, T. Yapp, R. Sturgess, C. Healey, M. Czajkowski,
A. Gunasekera, P. Gyawali, P. Premchand, K. Kapur, R. Marley, G. Foster, A. Watson, A. Dias,
J. Subhani, R. Harvey, R. McCorry, D. Ramanaden, J. Gasem, R. Evans, T. Mathialahan,
C. Shorrock, G. Lipscomb, P. Southern, J. Tibble, D. Gorard, A. Palegwala, S. Jones, M. Car-
bone, M. Dawwas, G. Alexander, S. Dolwani, M. Prince, M. Foxton, D. Elphick, H. Mitchison,
I. Gooding, M. Karmo, S. Saksena, M. Mendall, M. Patel, R. Ede, A. Austin, J. Sayer, L. Han-
key, C. Hovell, N. Fisher, M. Carter, K. Koss, A. Piotrowicz, C. Grimley, D. Neal, G. Lim,
S. Levi, A. Ala, A. Broad, A. Saeed, G. Wood, J. Brown, M. Wilkinson, H. Gordon, J. Ram-
age, J. Ridpath, T. Ngatchu, B. Grover, S. Shaukat, R. Shidrawi, G. Abouda, F. Ali, I. Rees,
I. Salam, M. Narain, A. Brown, S. Taylor-Robinson, S. Williams, L. Grellier, P. Banim, D. Das,
A. Chilton, M. Heneghan, H. Curtis, M. Gess, I. Drake, M. Aldersley, M. Davies, R. Jones,
A. McNair, R. Srirajaskanthan, M. Pitcher, S. Sen, G. Bird, A. Barnardo, P. Kitchen, K. Yoong,
O. Chirag, N. Sivaramakrishnan, G. MacFaul, D. Jones, A. Shah, C. Evans, S. Saha, K. Pollock,
P. Bramley, A. Mukhopadhya, A. Fraser, P. Mills, C. Shallcross, S. Campbell, A. Bathgate,
A. Shepherd, J. Dillon, S. Rushbrook, R. Przemioslo, C. Macdonald, J. Metcalf, U. Shmueli,
A. Davis, A. Naqvi, T. Lee, S. D. Ryder, J. Collier, H. Klass, M. Ninkovic, M. Cramp, N. Sharer,
R. Aspinall, P. Goggin, D. Ghosh, A. Douds, B. Hoeroldt, J. Booth, E. Williams, H. Hussaini,
W. Stableforth, R. Ayres, D. Thorburn, E. Marshall, A. Burroughs, S. Mann, M. Lombard,
P. Richardson, I. Patanwala, J. Maltby, M. Brookes, R. Mathew, S. Vyas, S. Singhal, D. Glee-
son, S. Misra, J. Butterworth, K. George, T. Harding, A. Douglass, S. Panter, J. Shearman,
G. Bray, G. Butcher, D. Forton, J. Mclindon, M. Cowan, G. Whatley, A. Mandal, H. Gupta,
P. Sanghi, S. Jain, S. Pereira, G. Prasad, G. Watts, M. Wright, J. Neuberger, F. Gordon, E. Unitt,
A. Grant, T. Delahooke, A. Higham, A. Brind, M. Cox, S. Ramakrishnan, A. King, C. Collins,
S. Whalley, A. Li, J. Fraser, A. Bell, V. S. Wong, A. Singhal, I. Gee, Y. Ang, R. Ransford,
J. Gotto, C. Millson, J. Bowles, C. Thomas, M. Harrison, R. Galaska, J. Kendall, J. Whiteman,
C. Lawlor, C. Gray, K. Elliott, C. Mulvaney-Jones, L. Hobson, G. Van Duyvenvoorde, A. Lof-
tus, K. Seward, R. Penn, J. Maiden, R. Damant, J. Hails, R. Cloudsdale, V. Silvestre, S. Glenn,
E. Dungca, N. Wheatley, H. Doyle, M. Kent, C. Hamilton, D. Braim, H. Wooldridge, R. Abra-
hams, A. Paton, N. Lancaster, A. Gibbins, K. Hogben, P. Desousa, F. Muscariu, J. Musselwhite,
A. McKay, L. Tan, C. Foale, J. Brighton, K. Flahive, E. Nambela, P. Townshend, C. Ford,
S. Holder, C. Palmer, J. Featherstone, M. Nasseri, J. Sadeghian, B. Williams, C. Thomas, S.-
A. Rolls, A. Hynes, C. Duggan, S. Jones, M. Crossey, G. Stansfield, C. MacNicol, J. Wilkins,
E. Wilhelmsen, P. Raymode, H.-J. Lee, E. Durant, R. Bishop, N. Ncube, S. Tripoli, R. Casey,
C. Cowley, R. Miller, K. Houghton, S. Ducker, F. Wright, B. Bird, G. Baxter, J. Keggans,
M. Hughes, E. Grieve, K. Young, D. Williams, K. Ocker, F. Hines, K. Martin, C. Innes,
T. Valliani, H. Fairlamb, S. Thornthwaite, A. Eastick, E. Tanqueray, J. Morrison, B. Holbrook,
J. Browning, K. Walker, S. Congreave, J. Verheyden, S. Slininger, L. Stafford, D. O'Donnell,
M. Ainsworth, S. Lord, L. Kent, L. March, C. Dickson, D. Simpson, B. Longhurst, M. Hayes,
E. Shpuza, N. White, S. Besley, S. Pearson, A. Wright, L. Jones, E. Gunter, H. Dewhurst,
A. Fouracres, L. Farrington, L. Graves, S. Marriott, M. Leoni, D. Tyrer, K. Martin, L. Dali-
kemmery, V. Lambourne, M. Green, D. Sirdefield, K. Amor, J. Colley, B. Shinder, J. Jones,
M. Mills, M. Carnahan, N. Taylor, K. Boulton, J. Tregonning, C. Brown, G. Clifford, E. Archer,
M. Hamilton, J. Curtis, T. Shewan, S. Walsh, K. Warner, K. Netherton, M. Mupudzi, B. Gunson,
J. Gitahi, D. Gocher, S. Batham, H. Pateman, S. Desmennu, J. Conder, D. Clement, S. Gal-
lagher, J. Orpe, P. Chan, L. Currie, L. O'Donohoe, M. Oblak, L. Morgan, M. Quinn, I. Amey,
Y. Baird, D. Cotterill, L. Cumlat, L. Winter, S. Greer, K. Spurdle, J. Allison, S. Dyer, H. Sweet-
ing, J. Kordula, M. E. Gershwin, C. A. Anderson, K. N. Lazaridis, P. Invernizzi, M. F. Seldin,
R. N. Sandford, C. I. Amos, and K. A. Siminovitch, "International genome-wide meta-analysis
identifies new primary biliary cirrhosis risk loci and targetable pathogenic pathways," Nature
Communications, vol. 6, p. 8019, Sept. 2015.

83 V. Anttila, B. Bulik-Sullivan, H. K. Finucane, J. Bras, L. Duncan, V. Escott-Price, G. Falcone,
P. Gormley, R. Malik, N. Patsopoulos, S. Ripke, R. Walters, Z. Wei, D. Yu, P. Lee, I. Consortium,
I. Consortium, I. C. o. C. Epilepsies, I. Consortium, I. Consortium, M. a. I. S. o. t. Isgc, A. W.
G. o. t. Pgc, A. N. W. G. o. t. Pgc, A. W. G. o. t. Pgc, B. D. W. G. o. t. Pgc, M. D. D. W. G.
o. t. Pgc, O. a. T. W. G. o. t. Pgc, S. W. G. o. t. Pgc, G. Breen, C. Bulik, M. Daly, M. Dichgans,
S. Faraone, R. Guerreiro, P. Holmans, K. Kendler, B. Koeleman, C. Mathews, J. Scharf, P. Sklar,
J. Williams, N. Wood, C. Cotsapas, A. Palotie, J. Smoller, P. Sullivan, J. Rosand, A. Corvin, and
B. Neale, "Analysis of shared heritability in common disorders of the brain," bioRxiv, p. 048991,
Apr. 2016.
84 J.-C. Lambert, C. A. Ibrahim-Verbaas, D. Harold, A. C. Naj, R. Sims, C. Bellenguez, G. Jun,
A. L. DeStefano, J. C. Bis, G. W. Beecham, B. Grenier-Boley, G. Russo, T. A. Thornton-Wells,
N. Jones, A. V. Smith, V. Chouraki, C. Thomas, M. A. Ikram, D. Zelenika, B. N. Vardarajan,
Y. Kamatani, C.-F. Lin, A. Gerrish, H. Schmidt, B. Kunkle, M. L. Dunstan, A. Ruiz, M.-T.
Bihoreau, S.-H. Choi, C. Reitz, F. Pasquier, P. Hollingworth, A. Ramirez, O. Hanon, A. L.
Fitzpatrick, J. D. Buxbaum, D. Campion, P. K. Crane, C. Baldwin, T. Becker, V. Gudnason,
C. Cruchaga, D. Craig, N. Amin, C. Berr, O. L. Lopez, P. L. De Jager, V. Deramecourt, J. A.
Johnston, D. Evans, S. Lovestone, L. Letenneur, F. J. Morón, D. C. Rubinsztein, G. Eiriksdottir,
K. Sleegers, A. M. Goate, N. Fiévet, M. J. Huentelman, M. Gill, K. Brown, M. I.
Kamboh, L. Keller, P. Barberger-Gateau, B. McGuinness, E. B. Larson, R. Green, A. J. Myers,
C. Dufouil, S. Todd, D. Wallon, S. Love, E. Rogaeva, J. Gallacher, P. St George-Hyslop, J. Clarimon,
A. Lleo, A. Bayer, D. W. Tsuang, L. Yu, M. Tsolaki, P. Bossù, G. Spalletta, P. Proitsi,
J. Collinge, S. Sorbi, F. Sanchez-Garcia, N. C. Fox, J. Hardy, M. C. D. Naranjo, P. Bosco,
R. Clarke, C. Brayne, D. Galimberti, M. Mancuso, F. Matthews, European Alzheimer's Disease
Initiative (EADI), Genetic and Environmental Risk in Alzheimer's Disease (GERAD), Alzheimer's
Disease Genetic Consortium (ADGC), Cohorts for Heart and Aging Research in Genomic Epidemiology
(CHARGE), S. Moebus, P. Mecocci, M. Del Zompo, W. Maier, H. Hampel, A. Pilotto, M. Bullido,
F. Panza, P. Caffarra, B. Nacmias, J. R. Gilbert, M. Mayhaus, L. Lannfelt, H. Hakonarson,
S. Pichler, M. M. Carrasquillo, M. Ingelsson, D. Beekly, V. Alvarez, F. Zou, O. Valladares, S. G.
Younkin, E. Coto, K. L. Hamilton-Nelson, W. Gu, C. Razquin, P. Pastor, I. Mateo, M. J. Owen,
K. M. Faber, P. V. Jonsson, O. Combarros, M. C. O'Donovan, L. B. Cantwell, H. Soininen,
D. Blacker, S. Mead, T. H. Mosley Jr, D. A. Bennett, T. B. Harris, L. Fratiglioni, C. Holmes,
R. F. A. G. de Bruijn, P. Passmore, T. J. Montine, K. Bettens, J. I. Rotter, A. Brice, K. Mor-
gan, T. M. Foroud, W. A. Kukull, D. Hannequin, J. F. Powell, M. A. Nalls, K. Ritchie, K. L.
Lunetta, J. S. K. Kauwe, E. Boerwinkle, M. Riemenschneider, M. Boada, M. Hiltunen, E. R.
Martin, R. Schmidt, D. Rujescu, L.-S. Wang, J.-F. Dartigues, R. Mayeux, C. Tzourio, A. Hofman,
M. M. Nöthen, C. Graff, B. M. Psaty, L. Jones, J. L. Haines, P. A. Holmans, M. Lathrop, M. A.
Pericak-Vance, L. J. Launer, L. A. Farrer, C. M. van Duijn, C. Van Broeckhoven, V. Moskvina,
S. Seshadri, J. Williams, G. D. Schellenberg, and P. Amouyel, "Meta-analysis of 74,046 indi-
viduals identifies 11 new susceptibility loci for Alzheimer's disease," Nature Genetics, vol. 45,
pp. 1452-1458, Dec. 2013.
85 Cross-Disorder Group of the Psychiatric Genomics Consortium, "Genetic relationship between
five psychiatric disorders estimated from genome-wide SNPs," Nature Genetics, vol. 45, pp. 984-
994, Sept. 2013.
86 International League Against Epilepsy Consortium on Complex Epilepsies, "Genetic determi-
nants of common epilepsies: a meta-analysis of genome-wide association studies," The Lancet
Neurology, vol. 13, pp. 893-903, Sept. 2014.
87 D. Woo, G. J. Falcone, W. J. Devan, W. M. Brown, A. Biffi, T. D. Howard, C. D. Anderson,
H. B. Brouwers, V. Valant, T. W. K. Battey, F. Radmanesh, M. R. Raffeld, S. Baedorf-Kassis,
R. Deka, J. G. Woo, L. J. Martin, M. Haverbusch, C. J. Moomaw, G. Sun, J. P. Broderick,
M. L. Flaherty, S. R. Martini, D. O. Kleindorfer, B. Kissela, M. E. Comeau, J. M. Jagiella,
H. Schmidt, P. Freudenberger, A. Pichler, C. Enzinger, B. M. Hansen, B. Norrving, J. Jimenez-
Conde, E. Giralt-Steinhauer, R. Elosua, E. Cuadrado-Godia, C. Soriano, J. Roquer, P. Kraft,
A. M. Ayres, K. Schwab, J. L. McCauley, J. Pera, A. Urbanik, N. S. Rost, J. N. Goldstein,
A. Viswanathan, E.-M. Stögerer, D. L. Tirschwell, M. Selim, D. L. Brown, S. L. Silliman, B. B.
Worrall, J. F. Meschia, C. S. Kidwell, J. Montaner, I. Fernandez-Cadenas, P. Delgado, R. Malik,
M. Dichgans, S. M. Greenberg, P. M. Rothwell, A. Lindgren, A. Slowik, R. Schmidt, C. D.
Langefeld, J. Rosand, and International Stroke Genetics Consortium, "Meta-analysis of genome-
wide association studies identifies 1q22 as a susceptibility locus for intracerebral hemorrhage,"
American Journal of Human Genetics, vol. 94, pp. 511-521, Apr. 2014.
88 M. Traylor, M. Farrall, E. G. Holliday, C. Sudlow, J. C. Hopewell, Y.-C. Cheng, M. Fornage,
M. A. Ikram, R. Malik, S. Bevan, U. Thorsteinsdottir, M. A. Nalls, W. Longstreth, K. L. Wiggins,
S. Yadav, E. A. Parati, A. L. Destefano, B. B. Worrall, S. J. Kittner, M. S. Khan, A. P. Reiner,
A. Helgadottir, S. Achterberg, I. Fernandez-Cadenas, S. Abboud, R. Schmidt, M. Walters, W.-
M. Chen, E. B. Ringelstein, M. O'Donnell, W. K. Ho, J. Pera, R. Lemmens, B. Norrving,
P. Higgins, M. Benn, M. Sale, G. Kuhlenbäumer, A. S. F. Doney, A. M. Vicente, H. Delavaran,
A. Algra, G. Davies, S. A. Oliveira, C. N. A. Palmer, I. Deary, H. Schmidt, M. Pandolfo,
J. Montaner, C. Carty, P. I. W. de Bakker, K. Kostulas, J. M. Ferro, N. R. van Zuydam,
E. Valdimarsson, B. G. Nordestgaard, A. Lindgren, V. Thijs, A. Slowik, D. Saleheen, G. Paré,
K. Berger, G. Thorleifsson, Australian Stroke Genetics Collaborative, Wellcome Trust Case
Control Consortium 2 (WTCCC2), A. Hofman, T. H. Mosley, B. D. Mitchell, K. Furie, R. Clarke,
C. Levi, S. Seshadri, A. Gschwendtner, G. B. Boncoraglio, P. Sharma, J. C. Bis, S. Gretarsdottir,
B. M. Psaty, P. M. Rothwell, J. Rosand, J. F. Meschia, K. Stefansson, M. Dichgans, H. S. Markus,
and International Stroke Genetics Consortium, "Genetic risk factors for ischaemic stroke and
its subtypes (the METASTROKE collaboration): a meta-analysis of genome-wide association
studies," The Lancet Neurology, vol. 11, pp. 951-962, Nov. 2012.

89 N. A. Patsopoulos, Bayer Pharma MS Genetics Working Group, Steering Committees of Studies
Evaluating IFNβ-1b and a CCR1-Antagonist, ANZgene Consortium, GeneMSA, International
Multiple Sclerosis Genetics Consortium, F. Esposito, J. Reischl, S. Lehr, D. Bauer, J. Heubach,
R. Sandbrink, C. Pohl, G. Edan, L. Kappos, D. Miller, J. Montalbán, C. H. Polman, M. S.
Freedman, H.-P. Hartung, B. G. W. Arnason, G. Comi, S. Cook, M. Filippi, D. S. Goodin, D. Jef-
fery, P. O'Connor, G. C. Ebers, D. Langdon, A. T. Reder, A. Traboulsee, F. Zipp, S. Schimrigk,
J. Hillert, M. Bahlo, D. R. Booth, S. Broadley, M. A. Brown, B. L. Browning, S. R. Browning,
H. Butzkueven, W. M. Carroll, C. Chapman, S. J. Foote, L. Griffiths, A. G. Kermode, T. J. Kil-
patrick, J. Lechner-Scott, M. Marriott, D. Mason, P. Moscato, R. N. Heard, M. P. Pender, V. M.
Perreau, D. Perera, J. P. Rubio, R. J. Scott, M. Slee, J. Stankovich, G. J. Stewart, B. V. Taylor,
N. Tubridy, E. Willoughby, J. Wiley, P. Matthews, F. M. Boneschi, A. Compston, J. Haines,
S. L. Hauser, J. McCauley, A. Ivinson, J. R. Oksenberg, M. Pericak-Vance, S. J. Sawcer, P. L.
De Jager, D. A. Hafler, and P. I. W. de Bakker, "Genome-wide meta-analysis identifies novel
multiple sclerosis susceptibility loci," Annals of Neurology, vol. 70, pp. 897-912, Dec. 2011.
90 M. A. Nalls, N. Pankratz, C. M. Lill, C. B. Do, D. G. Hernandez, M. Saad, A. L. DeStefano,
E. Kara, J. Bras, M. Sharma, C. Schulte, M. F. Keller, S. Arepalli, C. Letson, C. Edsall,
H. Stefansson, X. Liu, H. Pliner, J. H. Lee, R. Cheng, International Parkinson's Disease
Genomics Consortium (IPDGC), Parkinson's Study Group (PSG) Parkinson's Research: The
Organized GENetics Initiative (PROGENI), 23andMe, GenePD, NeuroGenetics Research Con-
sortium (NGRC), Hussman Institute of Human Genomics (HIHG), Ashkenazi Jewish Dataset
Investigator, Cohorts for Health and Aging Research in Genetic Epidemiology (CHARGE), North
American Brain Expression Consortium (NABEC), United Kingdom Brain Expression Consor-
tium (UKBEC), Greek Parkinson's Disease Consortium, Alzheimer Genetic Analysis Group,
M. A. Ikram, J. P. A. Ioannidis, G. M. Hadjigeorgiou, J. C. Bis, M. Martinez, J. S. Perlmut-
ter, A. Goate, K. Marder, B. Fiske, M. Sutherland, G. Xiromerisiou, R. H. Myers, L. N. Clark,
K. Stefansson, J. A. Hardy, P. Heutink, H. Chen, N. W. Wood, H. Houlden, H. Payami, A. Brice,
W. K. Scott, T. Gasser, L. Bertram, N. Eriksson, T. Foroud, and A. B. Singleton, "Large-scale
meta-analysis of genome-wide association data identifies six new risk loci for Parkinson's disease,"
Nature Genetics, vol. 46, pp. 989-993, Sept. 2014.

91 G. J. Falcone, A. Biffi, W. J. Devan, J. M. Jagiella, H. Schmidt, B. Kissela, B. M. Hansen,
J. Jimenez-Conde, E. Giralt-Steinhauer, R. Elosua, E. Cuadrado-Godia, C. Soriano, A. M. Ayres,
K. Schwab, J. Pera, A. Urbanik, N. S. Rost, J. N. Goldstein, A. Viswanathan, A. Pichler, C. En-
zinger, B. Norrving, D. L. Tirschwell, M. Selim, D. L. Brown, S. L. Silliman, B. B. Worrall, J. F.
Meschia, C. S. Kidwell, J. Montaner, I. Fernandez-Cadenas, P. Delgado, J. P. Broderick, S. M.
Greenberg, J. Roquer, A. Lindgren, A. Slowik, R. Schmidt, M. L. Flaherty, D. O. Kleindorfer,
C. D. Langefeld, D. Woo, J. Rosand, and International Stroke Genetics Consortium, "Burden
of risk alleles for hypertension increases risk of intracerebral hemorrhage," Stroke; a Journal of
Cerebral Circulation, vol. 43, pp. 2877-2883, Nov. 2012.
92 D. Backenroth, K. Kiryluk, B. Xu, L. Pethukova, B. Vardarajan, E. Khurana, A. Christiano,
J. Buxbaum, and I. Ionita-Laza, "Tissue-specific functional effect prediction of genetic variation
and applications to complex trait genetics," bioRxiv, Aug. 2016.
93 P. C. Tfelt-Hansen and P. J. Koehler, "One hundred years of migraine research: major clinical
and scientific observations from 1910 to 2010," Headache, vol. 51, pp. 752-778, May 2011.
94 T. E. Wilens, J. Biederman, and T. J. Spencer, "Attention Deficit/Hyperactivity Disorder Across
the Lifespan," Annual Review of Medicine, vol. 53, no. 1, pp. 113-131, 2002.
95 L. C. Hanford, A. Nazarov, G. B. Hall, and R. B. Sassi, "Cortical thickness in bipolar disorder:
a systematic review," Bipolar Disorders, vol. 18, pp. 4-18, Feb. 2016.
96 J. H. Callicott, A. Bertolino, V. S. Mattay, F. J. P. Langheim, J. Duyn, R. Coppola, T. E. Goldberg,
and D. R. Weinberger, "Physiological Dysfunction of the Dorsolateral Prefrontal Cortex in
Schizophrenia Revisited," Cerebral Cortex, vol. 10, pp. 1078-1092, Nov. 2000.
97 N. Medic, H. Ziauddeen, K. D. Ersche, I. S. Farooqi, E. T. Bullmore, P. J. Nathan, L. Ronan,
and P. C. Fletcher, "Increased body mass index is associated with specific regional alterations in
brain structure," International Journal of Obesity, vol. 40, pp. 1177-1182, July 2016.
98 N. Maleki, L. Becerra, L. Nutile, G. Pendse, J. Brawn, M. Bigal, R. Burstein, and D. Borsook,
"Migraine attacks the Basal Ganglia," Molecular Pain, vol. 7, p. 71, Sept. 2011.
99 S. Herculano-Houzel and R. Lent, "Isotropic Fractionator: A Simple, Rapid Method for the
Quantification of Total Cell and Neuron Numbers in the Brain," Journal of Neuroscience, vol. 25,
pp. 2518-2521, Mar. 2005.
100 T. Sakai, A. Oshima, Y. Nozaki, I. Ida, C. Haga, H. Akiyama, Y. Nakazato, and M. Mikuni,
"Changes in density of calcium-binding-protein-immunoreactive GABAergic neurons in prefrontal
cortex in schizophrenia and bipolar disorder," Neuropathology, vol. 28, pp. 143-150, Apr. 2008.
101 F. M. Benes and S. Berretta, "GABAergic Interneurons: Implications for Understanding
Schizophrenia and Bipolar Disorder," Neuropsychopharmacology, vol. 25, pp. 1-27, July 2001.
102 E. Gjoneska, A. R. Pfenning, H. Mathys, G. Quon, A. Kundaje, L.-H. Tsai, and M. Kellis,
"Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease,"
Nature, vol. 518, pp. 365-369, Feb. 2015.
103 S. A. Gagliano, J. G. Pouget, J. Hardy, J. Knight, M. R. Barnes, M. Ryten, and M. E. Weale,
"Genomics implicates adaptive and innate immunity in Alzheimer's and Parkinson's," bioRxiv
doi: 10.1101/059519, June 2016.
104 S. Rege and S. J. Hodgkinson, "Immune dysregulation and autoimmunity in bipolar disorder:
Synthesis of the evidence and its clinical application," The Australian and New Zealand Journal
of Psychiatry, vol. 47, pp. 1136-1151, Dec. 2013.
105 I. Elamin, M. J. Edwards, and D. Martino, "Immune dysfunction in Tourette syndrome," Be-
havioural Neurology, vol. 27, no. 1, pp. 23-32, 2013.
106 W. Jin, J. S. Millar, U. Broedl, J. M. Glick, and D. J. Rader, "Inhibition of endothelial lipase
causes increased HDL cholesterol levels in vivo," The Journal of Clinical Investigation, vol. 111,
pp. 357-362, Feb. 2003.
107 U. C. Broedl, C. Maugeais, J. S. Millar, W. Jin, R. E. Moore, I. V. Fuki, D. Marchadier,
J. M. Glick, and D. J. Rader, "Endothelial lipase promotes the catabolism of ApoB-containing
lipoproteins," Circulation Research, vol. 94, pp. 1554-1561, June 2004.

108 K. R. Feingold and C. Grunfeld, "The role of HDL in innate immunity," Journal of Lipid Research,
vol. 52, pp. 1-3, Jan. 2011.

109 G. S. Hotamisligil, "Inflammation and metabolic disorders," Nature, vol. 444, pp. 860-867, Dec.
2006.

110 Y. Zlotnikov-Klionsky, B. Nathansohn-Levi, E. Shezen, C. Rosen, S. Kagan, L. Bar-On, S. Jung,
E. Shifrut, S. Reich-Zeliger, N. Friedman, R. Aharoni, R. Arnon, O. Yifa, A. Aronovich, and
Y. Reisner, "Perforin-Positive Dendritic Cells Exhibit an Immuno-regulatory Role in Metabolic
Syndrome and Autoimmunity," Immunity, vol. 43, pp. 776-787, Oct. 2015.

111 A. Dhirapong, A. Lleo, G.-X. Yang, K. Tsuneyama, R. Dunn, M. Kehry, T. A. Packard, J. C.
Cambier, F.-T. Liu, K. Lindor, R. L. Coppel, A. A. Ansari, and M. E. Gershwin, "B cell depletion
therapy exacerbates murine primary biliary cirrhosis," Hepatology (Baltimore, Md.), vol. 53,
pp. 527-535, Feb. 2011.
112 J. Zhang, W. Zhang, P. S. C. Leung, C. L. Bowlus, S. Dhaliwal, R. L. Coppel, A. A. Ansari, G.-X.
Yang, J. Wang, T. P. Kenny, X.-S. He, I. R. Mackay, and M. E. Gershwin, "Ongoing activation
of autoantigen-specific B cells in primary biliary cirrhosis," Hepatology (Baltimore, Md.), vol. 60,
pp. 1708-1716, Nov. 2014.
113 M. T. Heneka, D. T. Golenbock, and E. Latz, "Innate immunity in Alzheimer's disease," Nature
Immunology, vol. 16, pp. 229-236, Mar. 2015.

114 C. M. Lloyd and E. M. Hessel, "Functions of T cells in asthma: more than just TH2 cells," Nature
reviews. Immunology, vol. 10, Dec. 2010.

115 U. Müller-Ladner, T. Pap, R. E. Gay, M. Neidhart, and S. Gay, "Mechanisms of disease: the
molecular and cellular basis of joint destruction in rheumatoid arthritis," Nature clinical practice
Rheumatology, vol. 1, no. 2, pp. 102-110, 2005.
116 R. J. Xavier and D. K. Podolsky, "Unravelling the pathogenesis of inflammatory bowel disease,"
Nature, vol. 448, pp. 427-434, July 2007.
117 M. Sospedra and R. Martin, "Immunology of Multiple Sclerosis," Annual Review of Immunology,
vol. 23, no. 1, pp. 683-747, 2005.

118 I. G. Barbosa, R. Machado-Vieira, J. C. Soares, and A. L. Teixeira, "The immunology of bipolar
disorder," Neuroimmunomodulation, vol. 21, no. 0, pp. 117-122, 2014.

119 J. Steiner, R. Jacobs, B. Panteli, M. Brauner, K. Schiltz, S. Bahn, M. Herberth, S. Westphal,
T. Gos, M. Walter, H.-G. Bernstein, A. M. Myint, and B. Bogerts, "Acute schizophrenia is
accompanied by reduced T cell and increased B cell immunity," European Archives of Psychiatry
and Clinical Neuroscience, vol. 260, pp. 509-518, Oct. 2010.
120 A. Sekar, A. R. Bialas, H. de Rivera, A. Davis, T. R. Hammond, N. Kamitaki, K. Tooley,
J. Presumey, M. Baum, V. Van Doren, G. Genovese, S. A. Rose, R. E. Handsaker, Schizophrenia
Working Group of the Psychiatric Genomics Consortium, M. J. Daly, M. C. Carroll, B. Stevens,
and S. A. McCarroll, "Schizophrenia risk from complex variation of complement component 4,"
Nature, vol. 530, pp. 177-183, Feb. 2016.
121 P. Mehta, A. M. Nuotio-Antar, and C. W. Smith, "γδ T cells promote inflammation and insulin
resistance during high fat diet-induced obesity in mice," Journal of Leukocyte Biology, vol. 97,
pp. 121-134, Jan. 2015.
122 C. A. d. Leeuw, J. M. Mooij, T. Heskes, and D. Posthuma, "MAGMA: Generalized Gene-Set
Analysis of GWAS Data," PLOS Comput Biol, vol. 11, p. e1004219, Apr. 2015.
123 S. Gazal, H. Finucane, N. A. Furlotte, P.-R. Loh, P. F. Palamara, X. Liu, A. Schoech, B. Bulik-
Sullivan, B. M. Neale, A. Gusev, and A. L. Price, "Linkage disequilibrium dependent architecture
of human complex traits reveals action of negative selection," bioRxiv, p. 082024, Oct. 2016.
124 H. Shi, G. Kichaev, and B. Pasaniuc, "Contrasting the Genetic Architecture of 30 Complex
Traits from Summary Association Data," The American Journal of Human Genetics, vol. 99,
pp. 139-153, July 2016.
125 G. P. Wagner, K. Kin, and V. J. Lynch, "Measurement of mRNA abundance using RNA-seq
data: RPKM measure is inconsistent among samples," Theory in Biosciences = Theorie in Den
Biowissenschaften, vol. 131, pp. 281-285, Dec. 2012.
126 C. W. Law, Y. Chen, W. Shi, and G. K. Smyth, "voom: precision weights unlock linear model
analysis tools for RNA-seq read counts," Genome Biology, vol. 15, p. R29, 2014.
127 G. Genovese, M. Fromer, E. A. Stahl, D. M. Ruderfer, K. Chambert, M. Landén, J. L. Moran,
S. M. Purcell, P. Sklar, P. F. Sullivan, C. M. Hultman, and S. A. McCarroll, "Increased bur-
den of ultra-rare protein-altering variants among 4,877 individuals with schizophrenia," Nature
Neuroscience, vol. 19, pp. 1433-1441, Oct. 2016.
128 Y. Banda, M. N. Kvale, T. J. Hoffmann, S. E. Hesselson, D. Ranatunga, H. Tang, C. Sabatti,
L. A. Croen, B. P. Dispensa, M. Henderson, C. Iribarren, E. Jorgenson, L. H. Kushi, D. Lud-
wig, D. Olberg, C. P. Quesenberry, S. Rowell, M. Sadler, L. C. Sakoda, S. Sciortino, L. Shen,
D. Smethurst, C. P. Somkin, S. K. V. D. Eeden, L. Walter, R. A. Whitmer, P.-Y. Kwok, C. Schae-
fer, and N. Risch, "Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects
in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort," Genetics,
vol. 200, pp. 1285-1295, Aug. 2015.
129 P.-R. Loh, G. Bhatia, A. Gusev, H. K. Finucane, B. K. Bulik-Sullivan, S. J. Pollack, Schizophrenia
Working Group of Psychiatric Genomics Consortium, T. R. de Candia, S. H. Lee, N. R. Wray,
K. S. Kendler, M. C. O'Donovan, B. M. Neale, N. Patterson, and A. L. Price, "Contrasting ge-
netic architectures of schizophrenia and other complex diseases using fast variance-components
analysis," Nature Genetics, vol. 47, pp. 1385-1392, Dec. 2015.
130 K. J. Galinsky, G. Bhatia, P.-R. Loh, S. Georgiev, S. Mukherjee, N. J. Patterson, and A. L.
Price, "Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe
and East Asia," The American Journal of Human Genetics, vol. 98, pp. 456-472, Mar. 2016.
131 C. C. Chang, C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell, and J. J. Lee, "Second-
generation PLINK: rising to the challenge of larger and richer datasets," GigaScience, vol. 4,
p. 7, 2015.
132 G. D. Smith and S. Ebrahim, "Mendelian randomization: can genetic epidemiology contribute to
understanding environmental determinants of disease?," International Journal of Epidemiology,
vol. 32, pp. 1-22, Feb. 2003.
133 G. Davey Smith and G. Hemani, "Mendelian randomization: genetic anchors for causal inference
in epidemiological studies," Human Molecular Genetics, vol. 23, pp. R89-98, Sept. 2014.
134 S. G. Vandenberg, Methods and Goals in Human Behavior Genetics. Academic Press, Sept. 2013.
Google-Books-ID: 6_BFBQAAQBAJ.
135 O. Kempthorne and R. H. Osborne, "The interpretation of twin data," American journal of
human genetics, vol. 13, no. 3, p. 320, 1961.
136 J. C. Loehlin and S. G. Vandenberg, Genetic and environmental components in the covariation
of cognitive abilities: An additive model. Louisville Twin Study, University of Louisville, 1966.
137 M. Neale and L. Cardon, Methodology for genetic studies of twins and families, vol. 67. Springer
Science & Business Media, 2013.
138 P. Lichtenstein, B. H. Yip, C. Björk, Y. Pawitan, T. D. Cannon, P. F. Sullivan, and C. M. Hultman,
"Common genetic determinants of schizophrenia and bipolar disorder in Swedish families:
a population-based study," The Lancet, vol. 373, no. 9659, pp. 234-239, 2009.
139 B. F. Voight, G. M. Peloso, M. Orho-Melander, R. Frikke-Schmidt, M. Barbalic, M. K. Jensen,
G. Hindy, H. Hólm, E. L. Ding, T. Johnson, H. Schunkert, N. J. Samani, R. Clarke, J. C.
Hopewell, J. F. Thompson, M. Li, G. Thorleifsson, C. Newton-Cheh, K. Musunuru, J. P. Pirruc-
cello, D. Saleheen, L. Chen, A. F. Stewart, A. Schillert, U. Thorsteinsdottir, G. Thorgeirsson,
S. Anand, J. C. Engert, T. Morgan, J. Spertus, M. Stoll, K. Berger, N. Martinelli, D. Girelli,
P. P. McKeown, C. C. Patterson, S. E. Epstein, J. Devaney, M.-S. Burnett, V. Mooser, S. Ripatti,
I. Surakka, M. S. Nieminen, J. Sinisalo, M.-L. Lokki, M. Perola, A. Havulinna, U. de Faire, B. Gigante,
E. Ingelsson, T. Zeller, P. Wild, P. I. W. de Bakker, O. H. Klungel, A.-H. Maitland-van der
Zee, B. J. M. Peters, A. de Boer, D. E. Grobbee, P. W. Kamphuisen, V. H. M. Deneer, C. C.
Elbers, N. C. Onland-Moret, M. H. Hofker, C. Wijmenga, W. M. Verschuren, J. M. Boer, Y. T.
van der Schouw, A. Rasheed, P. Frossard, S. Demissie, C. Willer, R. Do, J. M. Ordovas, G. R.
Abecasis, M. Boehnke, K. L. Mohlke, M. J. Daly, C. Guiducci, N. P. Burtt, A. Surti, E. Gonza-
lez, S. Purcell, S. Gabriel, J. Marrugat, J. Peden, J. Erdmann, P. Diemert, C. Willenborg, I. R.
König, M. Fischer, C. Hengstenberg, A. Ziegler, I. Buysschaert, D. Lambrechts, F. Van de
Werf, K. A. Fox, N. E. El Mokhtari, D. Rubin, J. Schrezenmeir, S. Schreiber, A. Schäfer,
J. Danesh, S. Blankenberg, R. Roberts, R. McPherson, H. Watkins, A. S. Hall, K. Overvad,
E. Rimm, E. Boerwinkle, A. Tybjaerg-Hansen, L. A. Cupples, M. P. Reilly, O. Melander, P. M.
Mannucci, D. Ardissino, D. Siscovick, R. Elosua, K. Stefansson, C. J. O'Donnell, V. Salomaa,
D. J. Rader, L. Peltonen, S. M. Schwartz, D. Altshuler, and S. Kathiresan, "Plasma HDL choles-
terol and risk of myocardial infarction: a mendelian randomisation study," The Lancet, vol. 380,
pp. 572-580, Aug. 2012.

140 R. Do, C. J. Willer, E. M. Schmidt, S. Sengupta, C. Gao, G. M. Peloso, S. Gustafsson, S. Kanoni,
A. Ganna, J. Chen, M. L. Buchkovich, S. Mora, J. S. Beckmann, J. L. Bragg-Gresham, H.-Y.
Chang, A. Demirkan, H. M. Den Hertog, L. A. Donnelly, G. B. Ehret, T. Esko, M. F. Feitosa,
T. Ferreira, K. Fischer, P. Fontanillas, R. M. Fraser, D. F. Freitag, D. Gurdasani, K. Heikkilä,
E. Hyppönen, A. Isaacs, A. U. Jackson, A. Johansson, T. Johnson, M. Kaakinen, J. Kettunen,
M. E. Kleber, X. Li, J. Luan, L.-P. Lyytikäinen, P. K. E. Magnusson, M. Mangino, E. Mihailov,
M. E. Montasser, M. Müller-Nurasyid, I. M. Nolte, J. R. O'Connell, C. D. Palmer, M. Perola,
A.-K. Petersen, S. Sanna, R. Saxena, S. K. Service, S. Shah, D. Shungin, C. Sidore, C. Song,
R. J. Strawbridge, I. Surakka, T. Tanaka, T. M. Teslovich, G. Thorleifsson, E. G. Van den
Herik, B. F. Voight, K. A. Volcik, L. L. Waite, A. Wong, Y. Wu, W. Zhang, D. Absher, G. Asiki,
I. Barroso, L. F. Been, J. L. Bolton, L. L. Bonnycastle, P. Brambilla, M. S. Burnett, G. Cesana,
M. Dimitriou, A. S. F. Doney, A. Döring, P. Elliott, S. E. Epstein, G. I. Eyjolfsson, B. Gigante,
M. O. Goodarzi, H. Grallert, M. L. Gravito, C. J. Groves, G. Hallmans, A.-L. Hartikainen,
C. Hayward, D. Hernandez, A. A. Hicks, H. Holm, Y.-J. Hung, T. Illig, M. R. Jones, P. Kaleebu,
J. J. P. Kastelein, K.-T. Khaw, E. Kim, N. Klopp, P. Komulainen, M. Kumari, C. Langenberg,
T. Lehtimäki, S.-Y. Lin, J. Lindström, R. J. F. Loos, F. Mach, W. L. McArdle, C. Meisinger,
B. D. Mitchell, G. Müller, R. Nagaraja, N. Narisu, T. V. M. Nieminen, R. N. Nsubuga, I. Olafsson,
K. K. Ong, A. Palotie, T. Papamarkou, C. Pomilla, A. Pouta, D. J. Rader, M. P. Reilly,
P. M. Ridker, F. Rivadeneira, I. Rudan, A. Ruokonen, N. Samani, H. Scharnagl, J. Seeley,
K. Silander, A. Stančáková, K. Stirrups, A. J. Swift, L. Tiret, A. G. Uitterlinden, L. J.
van Pelt, S. Vedantam, N. Wainwright, C. Wijmenga, S. H. Wild, G. Willemsen, T. Wilsgaard,
J. F. Wilson, E. H. Young, J. H. Zhao, L. S. Adair, D. Arveiler, T. L. Assimes, S. Bandinelli,
F. Bennett, M. Bochud, B. O. Boehm, D. I. Boomsma, I. B. Borecki, S. R. Bornstein, P. Bovet,
M. Burnier, H. Campbell, A. Chakravarti, J. C. Chambers, Y.-D. I. Chen, F. S. Collins, R. S.
Cooper, J. Danesh, G. Dedoussis, U. de Faire, A. B. Feranil, J. Ferrières, L. Ferrucci, N. B.
Freimer, C. Gieger, L. C. Groop, V. Gudnason, U. Gyllensten, A. Hamsten, T. B. Harris, A. Hin-
gorani, J. N. Hirschhorn, A. Hofman, G. K. Hovingh, C. A. Hsiung, S. E. Humphries, S. C. Hunt,
K. Hveem, C. Iribarren, M.-R. Järvelin, A. Jula, M. Kähönen, J. Kaprio, A. Kesäniemi,
M. Kivimaki, J. S. Kooner, P. J. Koudstaal, R. M. Krauss, D. Kuh, J. Kuusisto, K. O. Kyvik,
M. Laakso, T. A. Lakka, L. Lind, C. M. Lindgren, N. G. Martin, W. März, M. I. McCarthy,
C. A. McKenzie, P. Meneton, A. Metspalu, L. Moilanen, A. D. Morris, P. B. Munroe, I. Njølstad,
N. L. Pedersen, C. Power, P. P. Pramstaller, J. F. Price, B. M. Psaty, T. Quertermous,
R. Rauramaa, D. Saleheen, V. Salomaa, D. K. Sanghera, J. Saramies, P. E. H. Schwarz, W. H.-H.
Sheu, A. R. Shuldiner, A. Siegbahn, T. D. Spector, K. Stefansson, D. P. Strachan, B. O. Tayo,
E. Tremoli, J. Tuomilehto, M. Uusitupa, C. M. van Duijn, P. Vollenweider, L. Wallentin, N. J.
Wareham, J. B. Whitfield, B. H. R. Wolffenbuttel, D. Altshuler, J. M. Ordovas, E. Boerwinkle,
C. N. A. Palmer, U. Thorsteinsdottir, D. I. Chasman, J. I. Rotter, P. W. Franks, S. Ripatti, L. A.
Cupples, M. S. Sandhu, S. S. Rich, M. Boehnke, P. Deloukas, K. L. Mohlke, E. Ingelsson, G. R.
Abecasis, M. J. Daly, B. M. Neale, and S. Kathiresan, "Common variants associated with plasma
triglycerides and risk for coronary artery disease," Nature Genetics, vol. 45, pp. 1345-1352, Nov.
2013.
141 J. D. Angrist and J.-S. Pischke, Mostly Harmless Econometrics: An Empiricist's Companion.
Princeton: Princeton University Press, 1 edition ed., Jan. 2009.
142 S. Burgess, S. G. Thompson, and CRP CHD Genetics Collaboration, "Avoiding bias from weak
instruments in Mendelian randomization studies," International Journal of Epidemiology, vol. 40,
pp. 755-764, June 2011.
143 S. Vattikuti, J. Guo, and C. C. Chow, "Heritability and Genetic Correlations Explained by
Common SNPs for Metabolic Syndrome Traits," PLOS Genet, vol. 8, p. e1002637, Mar. 2012.
144 G.-B. Chen, S. H. Lee, M.-J. A. Brion, G. W. Montgomery, N. R. Wray, G. L. Radford-Smith,
P. M. Visscher, and t. I. I. G. Consortium, "Estimation and partitioning of (co)heritability of
inflammatory bowel disease from GWAS and immunochip data," Human Molecular Genetics,
p. ddu174, Apr. 2014.
145 S. M. Purcell, N. R. Wray, J. L. Stone, P. M. Visscher, M. C. O'Donovan, P. F. Sullivan, P. Sklar,
S. M. Purcell (Leader), J. L. Stone, P. F. Sullivan, D. M. Ruderfer, A. McQuillin, D. W. Morris,
C. T. O'Dushlaine, A. Corvin, P. A. Holmans, M. C. O'Donovan, P. Sklar, N. R. Wray,
S. Macgregor, P. Sklar, P. F. Sullivan, M. C. O'Donovan, P. M. Visscher, H. Gurling, D. H. R.
Blackwood, A. Corvin, N. J. Craddock, M. Gill, C. M. Hultman, G. K. Kirov, P. Lichtenstein,
A. McQuillin, W. J. Muir, M. C. O'Donovan, M. J. Owen, C. N. Pato, S. M. Purcell, E. M.
Scolnick, D. St Clair, J. L. Stone, P. F. Sullivan, P. Sklar (Leader), M. C. O'Donovan, G. K.
Kirov, N. J. Craddock, P. A. Holmans, N. M. Williams, L. Georgieva, I. Nikolov, N. Norton,
H. Williams, D. Toncheva, V. Milanova, M. J. Owen, C. M. Hultman, P. Lichtenstein, E. F.
Thelander, P. Sullivan, D. W. Morris, C. T. O'Dushlaine, E. Kenny, E. M. Quinn, M. Gill,
A. Corvin, A. McQuillin, K. Choudhury, S. Datta, J. Pimm, S. Thirumalai, V. Puri, R. Krasucki,
J. Lawrence, D. Quested, N. Bass, H. Gurling, C. Crombie, G. Fraser, S. Leh Kuan, N. Walker,
D. St Clair, D. H. R. Blackwood, W. J. Muir, K. A. McGhee, B. Pickard, P. Malloy, A. W.
Maclean, M. Van Beck, N. R. Wray, S. Macgregor, P. M. Visscher, M. T. Pato, H. Medeiros,
F. Middleton, C. Carvalho, C. Morley, A. Fanous, D. Conti, J. A. Knowles, C. Paz Ferreira,
A. Macedo, M. Helena Azevedo, C. N. Pato, J. L. Stone, D. M. Ruderfer, A. N. Kirby, M. A. R.
Ferreira, M. J. Daly, S. M. Purcell, P. Sklar, S. M. Purcell, J. L. Stone, K. Chambert, D. M.
Ruderfer, F. Kuruvilla, S. B. Gabriel, K. Ardlie, J. L. Moran, M. J. Daly, E. M. Scolnick, and
P. Sklar, "Common polygenic variation contributes to risk of schizophrenia and bipolar disorder,"
Nature, July 2009.

146 F. Dudbridge, "Power and Predictive Accuracy of Polygenic Risk Scores," PLoS Genetics, vol. 9,
p. e1003348, Mar. 2013.

147 D. Speed, G. Hemani, M. R. Johnson, and D. J. Balding, "Improved heritability estimation from
genome-wide SNPs," The American Journal of Human Genetics, vol. 91, no. 6, pp. 1011-1021,
2012.
148 Cross-Disorder Group of the Psychiatric Genomics Consortium and others, "Identification of risk
loci with shared effects on five major psychiatric disorders: a genome-wide analysis," The Lancet,
vol. 381, no. 9875, pp. 1371-1379, 2013.
149 M. Horikoshi, H. Yaghootkar, D. O. Mook-Kanamori, U. Sovio, H. R. Taal, B. J. Hennig, J. P.
Bradfield, B. St Pourcain, D. M. Evans, P. Charoen, and others, "New loci associated with birth
weight identify genetic links between intrauterine growth and adult height and metabolism,"
Nature genetics, vol. 45, no. 1, pp. 76-82, 2013.
150 R. M. Freathy, A. J. Bennett, S. M. Ring, B. Shields, C. J. Groves, N. J. Timpson, M. N. Weedon,
E. Zeggini, C. M. Lindgren, H. Lango, and others, "Type 2 diabetes risk alleles are associated
with reduced size at birth," Diabetes, vol. 58, no. 6, pp. 1428-1433, 2009.

151 Early Growth Genetics (EGG) Consortium and others, "A genome-wide association meta-analysis
identifies new childhood obesity loci," Nature genetics, vol. 44, no. 5, pp. 526-531, 2012.
152 H. R. Taal, B. St Pourcain, E. Thiering, S. Das, D. O. Mook-Kanamori, N. M. Warrington,
M. Kaakinen, E. Kreiner-Møller, J. P. Bradfield, R. M. Freathy, and others, "Common variants
at 12q15 and 12q24 are associated with infant head circumference," Nature genetics, vol. 44,
no. 5, pp. 532-538, 2012.
153 N. C. Onland-Moret, P. H. M. Peeters, C. H. Van Gils, F. Clavel-Chapelon, T. Key, A. Tjønneland,
A. Trichopoulou, R. Kaaks, J. Manjer, S. Panico, and others, "Age at menarche in
relation to adult height: the EPIC study," American journal of epidemiology, vol. 162, no. 7,
pp. 623-632, 2005.
154 F. R. Day, C. E. Elks, A. Murray, K. K. Ong, and J. R. Perry, "Puberty timing associated with
diabetes, cardiovascular disease and also diverse health outcomes in men and women: the UK
Biobank study," Scientific reports, vol. 5, p. 11208, 2015.
155 C. E. Elks, K. K. Ong, R. A. Scott, Y. T. v. d. Schouw, J. S. Brand, P. A. Wark, P. Amiano,
B. Balkau, A. Barricarte, H. Boeing, A. Fonseca-Nunes, P. W. Franks, S. Grioni, J. Halkjaer,
R. Kaaks, T. J. Key, K. T. Khaw, A. Mattiello, P. M. Nilsson, K. Overvad, D. Palli, J. R. Quirós,
S. Rinaldi, O. Rolandsson, I. Romieu, C. Sacerdote, M.-J. Sánchez, A. M. W. Spijkerman,
A. Tjonneland, M.-J. Tormo, R. Tumino, D. L. v. d. A, N. G. Forouhi, S. J. Sharp, C. Langenberg,
E. Riboli, N. J. Wareham, and T. I. Consortium, "Age at Menarche and Type 2 Diabetes Risk,"
Diabetes Care, vol. 36, pp. 3526-3534, Nov. 2013.
156 N. Wang, X. Zhang, Y.-B. Xiang, G. Yang, H.-L. Li, J. Gao, H. Cai, Y.-T. Gao, W. Zheng, and
X.-O. Shu, "Associations of adult height and its components with mortality: a report from cohort
studies of 135 000 Chinese women and men," International journal of epidemiology, vol. 40, no. 6,
pp. 1715-1726, 2011.
157 P. R. Hebert, J. W. Rich-Edwards, J. E. Manson, P. M. Ridker, N. R. Cook, G. T. O'Connor,
J. E. Buring, and C. H. Hennekens, "Height and incidence of cardiovascular disease in male
physicians.," Circulation, vol. 88, no. 4, pp. 1437-1443, 1993.
158 J. W. Rich-Edwards, J. E. Manson, M. J. Stampfer, G. A. Colditz, W. C. Willett, B. Rosner,
F. E. Speizer, and C. H. Hennekens, "Height and the risk of cardiovascular disease in women,"
American journal of epidemiology, vol. 142, no. 9, pp. 909-917, 1995.
159 D. E. Barnes and K. Yaffe, "The projected effect of risk factor reduction on Alzheimer's disease
prevalence," The Lancet Neurology, vol. 10, no. 9, pp. 819-828, 2011.

160 S. Norton, F. E. Matthews, D. E. Barnes, K. Yaffe, and C. Brayne, "Potential for primary
prevention of Alzheimer's disease: an analysis of population-based data," The Lancet Neurology,
vol. 13, no. 8, pp. 788-794, 2014.
161 J. H. MacCabe, M. P. Lambe, S. Cnattingius, P. C. Sham, A. S. David, A. Reichenberg, R. M.
Murray, and C. M. Hultman, "Excellent school performance at age 16 and risk of adult bipolar
disorder: national cohort study," The British Journal of Psychiatry, vol. 196, no. 2, pp. 109-115,
2010.
162 J. Tiihonen, J. Haukka, M. Henriksson, M. Cannon, T. Kieseppä, I. Laaksonen, J. Sinivuo,
and J. Lönnqvist, "Premorbid intellectual functioning in bipolar disorder and schizophrenia:
results from a cohort study of male conscripts," American Journal of Psychiatry, vol. 162, no. 10,
pp. 1904-1910, 2005.
163 J. P. Pierce, M. C. Fiore, T. E. Novotny, E. J. Hatziandreu, and R. M. Davis, "Trends in
cigarette smoking in the United States: educational differences are increasing," Jama, vol. 261,
no. 1, pp. 56-60, 1989.
164 R. H. Striegel-Moore, V. Garvin, F.-A. Dohm, and R. A. Rosenheck, "Psychiatric comorbidity
of eating disorders in men: a national study of hospitalized veterans," International Journal of
Eating Disorders, 1999.
165 B. J. Blinder, E. J. Cumella, and V. A. Sanathara, "Psychiatric comorbidities of female inpatients
with eating disorders," Psychosomatic Medicine, vol. 68, no. 3, pp. 454-462, 2006.
166 I. J. Deary, S. Strand, P. Smith, and C. Fernandes, "Intelligence and educational achievement,"
Intelligence, vol. 35, no. 1, pp. 13-21, 2007.
167 C. M. Calvin, C. Fernandes, P. Smith, P. M. Visscher, and I. J. Deary, "Sex, intelligence and edu-
cational achievement in a national cohort of over 175,000 11-year-old schoolchildren in England,"
Intelligence, vol. 38, no. 4, pp. 424-432, 2010.
168 M. S. Durkin, M. J. Maenner, F. J. Meaney, S. E. Levy, C. DiGuiseppi, J. S. Nicholas, R. S.
Kirby, J. A. Pinto-Martin, and L. A. Schieve, "Socioeconomic inequality in the prevalence of
autism spectrum disorder: evidence from a US cross-sectional study," PLoS One, vol. 5, no. 7,
p. e11551, 2010.
169 E. B. Robinson, K. E. Samocha, J. A. Kosmicki, L. McGrath, B. M. Neale, R. H. Perlis, and M. J.
Daly, "Autism spectrum disorder severity reflects the average contribution of de novo and familial
influences," Proceedings of the National Academy of Sciences, vol. 111, no. 42, pp. 15161-15165,
2014.
170 K. E. Samocha, E. B. Robinson, S. J. Sanders, C. Stevens, A. Sabo, L. M. McGrath, J. A.
Kosmicki, K. Rehnström, S. Mallick, A. Kirby, D. P. Wall, D. G. MacArthur, S. B. Gabriel,
M. DePristo, S. M. Purcell, A. Palotie, E. Boerwinkle, J. D. Buxbaum, E. H. Cook, R. A.
Gibbs, G. D. Schellenberg, J. S. Sutcliffe, B. Devlin, K. Roeder, B. M. Neale, and M. J. Daly,
"A framework for the interpretation of de novo mutation in human disease," Nature Genetics,
vol. 46, pp. 944-950, Aug. 2014.
171 A. J. Silman and J. E. Pearson, "Epidemiology and genetics of rheumatoid arthritis," Arthritis
research & therapy, vol. 4, no. 3, p. S265, 2002.
172 J. de Leon and F. J. Diaz, "A meta-analysis of worldwide studies demonstrates an association
between schizophrenia and tobacco smoking behaviors," Schizophrenia research, vol. 76, no. 2,
pp. 135-157, 2005.
173 O. A. Andreassen, S. Djurovic, W. K. Thompson, A. J. Schork, K. S. Kendler, M. C. O'Donovan,
D. Rujescu, T. Werge, M. van de Bunt, A. P. Morris, and others, "Improved detection
of common variants associated with schizophrenia by leveraging pleiotropy with cardiovascular-
disease risk factors," The American Journal of Human Genetics, vol. 92, no. 2, pp. 197-209,
2013.

174 C. Cotsapas, B. F. Voight, E. Rossin, K. Lage, B. M. Neale, C. Wallace, G. R. Abecasis, J. C. Barrett,
T. Behrens, J. Cho, P. L. D. Jager, J. T. Elder, R. R. Graham, P. Gregersen, L. Klareskog,
K. A. Siminovitch, D. A. v. Heel, C. Wijmenga, J. Worthington, J. A. Todd, D. A. Hafler, S. S.
Rich, M. J. Daly, and o. b. o. t. F. N. o. Consortia, "Pervasive Sharing of Genetic Effects in
Autoimmune Disease," PLOS Genet, vol. 7, p. e1002254, Aug. 2011.

175 P. Würtz, Q. Wang, A. J. Kangas, R. C. Richmond, J. Skarp, M. Tiainen, T. Tynkkynen,
P. Soininen, A. S. Havulinna, M. Kaakinen, and others, "Metabolic signatures of adiposity
in young adults: Mendelian randomization analysis and effects of weight change," PLoS Med,
vol. 11, no. 12, p. e1001765, 2014.
176 S. Burgess, D. F. Freitag, H. Khan, D. N. Gorman, and S. G. Thompson, "Using multivariable
Mendelian randomization to disentangle the causal effects of lipid fractions," PloS one, vol. 9,
no. 10, p. e108891, 2014.

177 S. Greenland, J. Pearl, and J. M. Robins, "Causal diagrams for epidemiologic research,"
Epidemiology, pp. 37-48, 1999.
178 A. Dahl, V. Hore, V. Iotchkova, and J. Marchini, "Network inference in matrix-variate Gaussian
models with non-independent noise," arXiv preprint arXiv:1312.1622, 2013.
179 H. Aschard, B. J. Vilhjálmsson, A. D. Joshi, A. L. Price, and P. Kraft, "Adjusting for heritable
covariates can bias effect estimates in genome-wide association studies," The American Journal
of Human Genetics, vol. 96, no. 2, pp. 329-339, 2015.
180 N. Carragher, G. Adamson, B. Bunting, and S. McCann, "Subtypes of depression in a nationally
representative sample," Journal of Affective Disorders, vol. 113, pp. 88-99, Feb. 2009.
181 J. Liley, J. A. Todd, and C. Wallace, "A method for identifying genetic heterogeneity within
phenotypically defined disease subgroups," Nature Genetics, vol. 49, pp. 310-316, Dec. 2016.

182 J. Arnedo, D. M. Svrakic, C. Del Val, R. Romero-Zaliz, H. Hernández-Cuervo, Molecular
Genetics of Schizophrenia Consortium, A. H. Fanous, M. T. Pato, C. N. Pato, G. A. de Erausquin,
C. R. Cloninger, and I. Zwir, "Uncovering the hidden risk architecture of the schizophrenias:
confirmation in three independent genome-wide association studies," The American Journal of
Psychiatry, vol. 172, pp. 139-153, Feb. 2015.
183 G. Breen, B. Bulik-Sullivan, M. Daly, S. Medland, B. Neale, M. O'Donovan, S. Ripke, P. Sullivan,
P. M. Visscher, and N. R. Wray, "Eight types of schizophrenia? Not so fast... - Genomes
Unzipped," Sept. 2014.
184 D. J. Balding and R. A. Nichols, "A method for quantifying differentiation between populations at
multi-allelic loci and its implications for investigating identity and paternity," Genetica, vol. 96,
no. 1-2, pp. 3-12, 1995.
185 G. Nicholson, A. V. Smith, F. Jónsson, Ó. Gústafsson, K. Stefánsson, and P. Donnelly,
"Assessing population differentiation and isolation from single-nucleotide polymorphism data,"
Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 64, pp. 695-715,
Oct. 2002.

186 J. K. Pritchard, M. Stephens, and P. Donnelly, "Inference of population structure using multilocus
genotype data," Genetics, vol. 155, pp. 945-959, June 2000.
187 D. S. Falconer, "The inheritance of liability to diseases with variable age of onset, with particular
reference to diabetes mellitus," Annals of human genetics, vol. 31, no. 1, pp. 1-20, 1967.
188 J. Yang, N. A. Zaitlen, M. E. Goddard, P. M. Visscher, and A. L. Price, "Advantages and pitfalls
in the application of mixed-model association methods," Nature Genetics, vol. 46, pp. 100-106,
Jan. 2014.

189 K. Pearson and A. Lee, "On the inheritance of characters not capable of exact quantitative
measurement," Philosophical Transactions of the Royal Society of London, A (195) pp. 79-150,
1901.

190 S. Ripke, N. R. Wray, C. M. Lewis, S. P. Hamilton, M. M. Weissman, G. Breen, E. M. Byrne,
D. H. Blackwood, D. I. Boomsma, S. Cichon, and others, "A mega-analysis of genome-wide
association studies for major depressive disorder," Molecular psychiatry, vol. 18, no. 4, pp. 497-
511, 2013.
191 I. Berndt, S. Gustafsson, R. Mägi, A. Ganna, E. Wheeler, M. F. Feitosa, A. E. Justice, K. L.
Monda, D. C. Croteau-Chonka, F. R. Day, and others, "Genome-wide meta-analysis identifies
11 new loci for anthropometric traits and provides insights into genetic architecture," Nature
genetics, vol. 45, no. 5, pp. 501-512, 2013.

192 R. J. van der Valk, E. Kreiner-Møller, M. N. Kooijman, M. Guxens, E. Stergiakouli, A. Sääf,
J. P. Bradfield, F. Geller, M. G. Hayes, D. L. Cousminer, and others, "A novel common variant
in DCST2 is associated with length in early life and height in adulthood," Human molecular
genetics, vol. 24, no. 4, pp. 1155-1168, 2015.
193 E. A. Stahl, S. Raychaudhuri, E. F. Remmers, G. Xie, S. Eyre, B. P. Thomson, Y. Li, F. A.
Kurreeman, A. Zhernakova, A. Hinks, and others, "Genome-wide association study meta-analysis
identifies seven new rheumatoid arthritis risk loci," Nature genetics, vol. 42, no. 6, pp. 508-514,
2010.
194 S. P. G.-W. A. S. G. Consortium and others, "Genome-wide association study identifies five new
schizophrenia loci," Nature genetics, vol. 43, no. 10, pp. 969-976, 2011.