You are on page 1of 7

REPORT

GCTA: A Tool for Genome-wide Complex Trait Analysis


Jian Yang,1,* S. Hong Lee,1 Michael E. Goddard,2,3 and Peter M. Visscher1

For most human complex diseases and traits, SNPs identified by genome-wide association studies (GWAS) explain only a small fraction
of the heritability. Here we report a user-friendly software tool called genome-wide complex trait analysis (GCTA), which was developed
based on a method we recently developed to address the ‘‘missing heritability’’ problem. GCTA estimates the variance explained by all
the SNPs on a chromosome or on the whole genome for a complex trait rather than testing the association of any particular SNP to the
trait. We introduce GCTA’s five main functions: data management, estimation of the genetic relationships from SNPs, mixed linear
model analysis of variance explained by the SNPs, estimation of the linkage disequilibrium structure, and GWAS simulation. We focus
on the function of estimating the variance explained by all the SNPs on the X chromosome and testing the hypotheses of dosage
compensation. The GCTA software is a versatile tool to estimate and partition complex trait variation with large GWAS data sets.

Despite the great success of genome-wide association y ¼ Xb þ g þ 3 with V ¼ As2g þ Is23 ; (Equation 2)
studies (GWAS), which have identified hundreds of SNPs
conferring the genetic variation of human complex where g is an n 3 1 vector of the total genetic effects of the
diseases and traits,1 the genetic architecture of human individuals with g  Nð0; As2g Þ, and A is interpreted as the
complex traits still remains largely unexplained. For most genetic relationship matrix (GRM) between individuals.
traits, the associated SNPs from GWAS only explain a small We can therefore estimate s2g by the restricted maximum
fraction of the heritability.2,3 There has not been any likelihood (REML) approach,10 relying on the GRM esti-
consensus on the explanation of the ‘‘missing heritability.’’ mated from all the SNPs. Here we report a versatile tool
Possible explanations include a large number of common called genome-wide complex trait analysis (GCTA), which
variants with small effects, rare variants with large effects, implements the method of estimating variance explained
and DNA structural variation.2,4 We recently proposed a by all SNPs, and extend the method to partition the genetic
method of estimating the total amount of phenotypic variance onto each of the chromosomes and also to esti-
variance captured by all SNPs on the current generation mate the variance explained by the X chromosome and
of commercial genotyping arrays and estimated that test for dosage compensation in females. We developed
~45% of the phenotypic variance for human height can GCTA in five function domains: data management, estima-
be explained by all common SNPs.5 Thus, most of the tion of the GRM from a set of SNPs, estimation of the vari-
ance explained by all the SNPs on a single chromosome or
heritability for height is hiding rather than missing
the whole genome, estimation of linkage disequilibrium
because of many SNPs with small effects.5,6 In contrast to
(LD) structure, and simulation.
single-SNP association analysis, the basic concept behind
our method is to fit the effects of all the SNPs as random
effects by a mixed linear model (MLM), Estimation of the Genetic Relationship
from Genome-wide SNPs
y ¼ Xb þ Wu þ 3 with varðyÞ ¼ V ¼ WW0 s2u þ Is23 ; One of the core functions of GCTA is to estimate the
(Equation 1) genetic relationships between individuals from the SNPs.
From the definition above, the genetic relationship
where y is an n 3 1 vector of phenotypes with n being the
between individuals j and k can be estimated by the
sample size, b is a vector of fixed effects such as sex, age,
following equation:
and/or one or more eigenvectors from principal compo-
  
nent analysis (PCA), u is a vector of SNP effects with 1X N
xij  2pi xik  2pi
u  Nð0; Is2u Þ, I is an n 3 n identity matrix, and 3 is a vector Ajk ¼   : (Equation 3)
N i¼1 2pi 1  pi
of residual effects with 3  Nð0; Is23 Þ. W is a standardized
genotype matrix
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ffi with the ijth element wij ¼ ðxij  2pi Þ= We provide a function to iteratively exclude one indi-
2pi ð1  pi Þ, where xij is the number of copies of the refer- vidual of a pair whose relationship is greater than a speci-
ence allele for the ith SNP of the jth individual and pi is the fied cutoff value, e.g., 0.025, while retaining the maximum
frequency of the reference allele. If we define A ¼ WW0 =N number of individuals in the data. For data collected from
and define s2g as the variance explained by all the SNPs, family or twin studies, we recommend that users estimate
i.e., s2g ¼ Ns2u , with N being the number of SNPs, then the genetic relationships with all of the autosomal SNPs
Equation 1 will be equivalent to:7–9 and then use this option to exclude close relatives. The
1
Queensland Statistical Genetics Laboratory, Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia;
2
Department of Food and Agricultural Systems, University of Melbourne, Parkville, Victoria 3010, Australia; 3Biosciences Research Division, Department
of Primary Industries, Bundoora, Victoria 3086, Australia
*Correspondence: jian.yang@qimr.edu.au
DOI 10.1016/j.ajhg.2010.11.011. Ó2011 by The American Society of Human Genetics. All rights reserved.

76 The American Journal of Human Genetics 88, 76–82, January 7, 2011


reason for exclusion is that the objective of the analysis is y ¼ Xb þ g þ ge þ 3 with V ¼ Ag s2g þ Age s2ge þ Is23 , where
to estimate genetic variation captured by all the SNPs, just ge is a vector of genotype-environment interaction effects
as GWAS does for single SNPs. Including close relatives, for all of the individuals with Age ¼ Ag for the pairs of indi-
such as parent-offspring pairs and siblings, would result viduals in the same environment and with Age ¼ 0 for the
in the estimate of genetic variance being driven by the pairs of individuals in different environments.
phenotypic correlations for these pairs (just as in pedigree (3) To partition genetic variance onto each of the
analysis), and this estimate could be a biased estimate of 22 autosomes, we can specify the model as y ¼ Xbþ
total genetic variance, for example because of common P22 P22
i¼1 g i þ 3 with V ¼ i¼1 Ai si þ Is3 , where gi is a vector
2 2

environmental effects. Even if the estimate is not biased, of genetic effects attributed to the ith chromosome and
its interpretation is different from the estimate from ‘‘unre- Ai is the GRM estimated from the SNPs on the ith chromo-
lated’’ individuals: a pedigree-based estimator captures the some.
contribution from all causal variants (across the entire GCTA implements the REML method via the average
allele frequency spectrum), whereas our method captures information (AI) algorithm.13 In the REML iteration pro-
the contribution from causal variants that are in LD with cess, the estimates of variance components from the tth
the genotyped SNPs. iteration are updated by qðtþ1Þ ¼ qðtÞ þ ðAIðtÞ Þ1 vL=vqjqðtÞ ,
As a by-product, we provide a function in GCTA to where q is a vector of variance components (s21 , ., s2r
calculate the eigenvectors of the GRM, which is asymptot- and s23 ); L is the log likelihood function of the MLM
ically equivalent to those from the PCA implemented in (ignoring the constant), L ¼ 1=2ðlogjVj þ logjX0 V1 Xjþ
EIGENSTRAT11 because the GRM (Ajk) defined in GCTA is y0 PyÞ with P ¼ V1  V1 XðX0 V1 XÞ1 X0 V1 ; AI is the
approximately half of the covariance matrix (Jjk) used in average of the observed and expected information
EIGENSTRAT. The only purpose of developing this func- matrices,
tion is to calculate eigenvectors and then include them 2 0 3
y PA1 PA1 Py / y0 PA1 PAr Py y0 PA1 PPy
in the model as covariates to capture variance due to 6 « « « « 7
population structure. More sophisticated analyses of the AI ¼ 1=26 4 y0 PAr PA1 Py / y0 PAr PAr Py y0 PAr PPy 5;
7
0 0 0
population structure can be found in programs such as y PPA1 Py / y PPAr Py y PPPy
EIGENSTRAT11 and STRUCTURE.12 and vL=vq is a vector of first derivatives of the log likeli-
hood function with respect to each variance component,
Estimation of the Variance Explained by Genome- 2 3
wide SNPs by REML trðPA1 Þ  y0 PA1 Py
6 « 7 13
The GRM estimated from the SNPs can be fitted subse- vL=vq ¼ 1=26 7
4 trðPAr Þ  y0 PAr Py 5. At the beginning
quently in an MLM to estimate the variance explained trðPÞ  y0 PPy
by these SNPs via the REML method.10 Previously, we
of the iteration process, all of the components are initialized
included only one genetic factor in the model. Here we 2ð0Þ
by an arbitrary value, i.e., si ¼ s2P =ðr þ 1Þ, which is sub-
extend the model in a general form as
sequently updated by the expectation maximization (EM)
X
r
2ð1Þ 4ð0Þ 2ð0Þ 4ð0Þ
y ¼ Xb þ gi þ 3; algorithm, si ¼ ½si y0 PAi Py þ trðsi I  si PAi Þ=n.
i¼1
The EM algorithm is used as an initial step to determine
where gi is a vector of random genetic effects, which could the direction of the iteration updates because it is robust
be the total genetic effects for the whole genome or for to poor starting values. After one EM iteration, GCTA
a single chromosome. In this model, the phenotypic vari- switches to the AI algorithm for the remaining iterations
ance (s2P ) is partitioned into the variance explained by until the iteration converges with the criteria of L(t þ 1) –
each of the genetic factors and the residual variance, L(t) < 104, where L(t) is the log likelihood of the tth iteration.
X
r In the iteration process, any component that escapes from
V¼ Ai s2i þ Is23 ; the parameter space (i.e., its estimate is negative) will be
i¼1 set to 106 3 s2P . If a component keeps escaping from the
where s2i is the variance of the ith genetic factor with its parameter space, it will be constrained at 106 3 s2P .
corresponding GRM, Ai. From the REML analysis, GCTA has an option to provide
In GCTA, we provide flexible options to specify different the best linear unbiased prediction (BLUP) of the total
genetic models. For example: genetic effect for all individuals. BLUP is widely used by
(1) To estimate the variance explained by all autosomal plant and animal breeders to quantify the breeding value
SNPs, we can specify the model as y ¼ Xb þ g þ 3 with of individuals in artificial selection programs14 and also
V ¼ Ag s2g þ Is23 , where g is an n 3 1 vector of the aggregate by evolutionary geneticists.15 Consider Equations 1 and
effects of all the autosomal SNPs for all of the individuals 2, i.e., y ¼ Xb þ Wu þ 3 and y ¼ Xb þ g þ 3. Because these
and Ag is the GRM estimated from these SNPs. This model two models are mathematically equivalent,7–9 the BLUP of
is the same as Equation 2. g can be transformed to the BLUP of u by u b ¼ W0 A1 g=N.b
(2) To estimate the variance of genotype-environment Here the estimate of ui corresponds to the coefficient
interaction effects (s2ge ), we can specify the model as wij, which is then rescaled for the original xij by

The American Journal of Human Genetics 88, 76–82, January 7, 2011 77


 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ui ¼ b
b u i = 2pi ð1  pi Þ. We could obtain the BLUP of SNP chromosome genes. There are two alleles per locus in
effects in a discovery set by GCTA and predict genetic females, but only one in males. If we assume that each
values of the individuals in a validation set ( g b
new ¼ allele has a similar effect on the trait (i.e., no dosage com-
b For example, GCTA could be used to predict
Wnew u). pensation), the genetic variance on the X chromosome for
SNP effects in a discovery set, and the SNP effects could females is twice that for males: i.e., s2X ¼ s2XðFÞ ¼ 2s2XðMÞ .
be used in PLINK to predict whole-genome profiles via Thus,
the scoring approach in a validation set. If the predictions    
are unbiased, then the regression slope of the observed covX yjM ; ykM ¼ 1=2E AM jk sX for a male-male pair;
2

phenotypes on the predicted genetic values is 1.14 In that


case, the genetic value calculated based on the BLUP of    
SNP effects is an unbiased predictor of the true genetic covX yjF ; ykF ¼ E AFjk s2X for a female-female pair; and
value in the validation set (gnew), in the sense that
b b   pffiffiffi  
Eðgnew j g new Þ ¼ g new .
16,17
Prediction analyses of human covX yjM ; ykF ¼ 1= 2E AMF s2X for a male-female pair:
jk
complex traits have demonstrated that many SNPs that
do not pass the genome-wide significance level have This can be implemented by redefining GRM for the X
substantial contribution to the prediction.18,19 This option chromosome as AND X ¼ 1=2AX for male-male p pairs,
ffiffiffi
is therefore useful for the whole-genome prediction anal- AND ND
X ¼ AX for female-female pairs, and AX ¼ 1= 2AX
ysis with all of the SNPs, irrespective of their association for male-female pairs. If we assume that each allele in
p values. females has only half the effect of an allele in males (i.e.,
full dosage compensation), the X-linked genetic variance
for females is half that for males: i.e., s2X ¼ s2XðFÞ ¼
Estimation of the Variance Explained by the SNPs
1=2s2XðMÞ . Thus,
on the X Chromosome
The method of estimating the genetic relationship from    
covX yjM ; ykM ¼ 2E AM
jk sX for a male-male pair;
2
the X chromosome is different to that for the autosomal
SNPs, because males have only one X chromosome. We    
modified Equation 3 for the X chromosome as: covX yjF ; ykF ¼ E AFjk s2X for a female-female pair; and
  
X ij  pi
xM ik  pi
xM   pffiffiffi  
N
Ajk ¼
M
  for a male-male pair; covX yjM ; ykF ¼ 2E AMF s2X for a male-female pair:
i¼1
p i 1  pi jk

  Therefore, the raw AX matrix should be parameterized as



XN xFij  2pi xFik  2pi AFD
X ¼ 2AX for male-male ppairs, ffiffiffi AFD
X ¼ AX for female-
AFjk ¼   for a female-female pair; and ND
female pairs, and AX ¼ 2AX for male-female pairs.
i¼1
2pi 1  pi
The third possibility is to assume equal genetic variance
on the X chromosome for males and females, i.e., s2X ¼
  
XN xM  p xFik  2pi s2XðFÞ ¼ s2XðMÞ , in which case the AX matrix is not redefined
ij i
AMF
jk ¼ pffiffiffi   for a male-female pair; at all.
i¼1 2 p i 1  pi We can estimate s2X by fitting the model
where xM F y ¼ Xb þ gX þ g þ 3, where gX is a vector of genetic
ij and xij are the number of copies of the reference
allele for an X chromosome SNP for a male and a female, effects attributable to the X chromosome, with
respectively. varðgX Þ ¼ AND X sX assuming no dosage compensation,
2

Assuming the male-female genetic correlation to be 1,


FD 2
varðgX Þ ¼ AX sX assuming full dosage compensation,
the X-linked phenotypic covariance between a pair of indi- and varðgX Þ ¼ AX s2X assuming equal X-linked genetic vari-
viduals is:20 ance for males and females. Test of dosage compensation
can be achieved by comparing the likelihoods of model
   
covX yjM ; ykM ¼ E AM fitting under the three assumptions.
jk sXðMÞ for a male-male pair;
2

   
Estimation of the Variance Explained
covX yjF ; ykF ¼ E AFjk s2XðFÞ for a female-female pair; and
by Genome-wide SNPs for a Case-Control Study
    The methodology described above is also applicable for
covX yjM ; ykF ¼ E AMF
jk sXðMÞ sXðFÞ for a male-female pair; case-control data, for which the estimate of variance ex-
plained by the SNPs corresponds to variation on the
where s2XðMÞ and s2XðFÞ are the genetic variance attributed to observed 0–1 scale. Under the assumption of a threshold-
the X chromosome for males and females, respectively. liability model for a disease, i.e., disease liability on the
The relative values of s2XðMÞ and s2XðFÞ depend on the underlying scale follows standard normal distribution,21
assumption made regarding dosage compensation for X the estimate of variance explained by the SNPs on the

78 The American Journal of Human Genetics 88, 76–82, January 7, 2011


I
observed 0–1 scale can be transformed to that on the unob- variances are dependent on allele frequency, i.e., varð F bÞ¼
i
served continuous liability scale by a linear transforma- b II Þ ¼ (1 – hi) / hi if F ¼ 0. In addition, the covariance
varð F i
tion.22 The relationship between additive genetic variance between the two estimators is (3hi – 1) / hi þ (1 – 2hi)F /
on the observed 0–1 and unobserved liability scales was hi – F2, so that the sampling covariance between the esti-
proposed more than a half century ago,23,24 and we mators is (3hi – 1) / hi and the sampling correlation is
recently extended this transformation to account for ascer- (3hi – 1) / (1 – hi) when F ¼ 0. We proposed an estimator
tainment bias in a case-control study, i.e., a much higher based upon the correlation between uniting gametes:5
     III 
proportion of cases in the sample than in the general pop- b III ¼ x2  1 þ 2pi xi þ 2p2 hi and var F b jF
F i i i i
ulation (unpublished data). We provide options in GCTA
to analyze a binary trait and to transform the estimate on ¼ 1 þ 2ð1  2hi ÞF=hi  F 2 :
the 0–1 scale to that on the liability scale with an adjust- b III is also an unbiased estimator of F in the sense that
F i
ment for ascertainment bias. There is an important caveat III III
b jFÞ ¼ F. If F ¼ 0, varð F
E ðF b Þ ¼ 1 regardless of allele
in applying the methods described herein to case-control i i
data. Any batch, plate, or other technical artifact that frequency, which is smaller than the sampling variance
causes allele frequencies between case and control on b I and F
of F b II , i.e., 1 % (1 – hi) / hi. When 0 < F < 1/3, F
b III
i i i
average to be more different than that under the null
b I and F
also has a smaller variance than F b II . In GCTA, we
hypothesis stating that the samples come from the same i i

population will contribute to the estimation of spurious use 1 þ F b III rather than 1 þ F
b I to calculate the diagonal
i i
genetic variation, because cases will appear to be more of the GRM. For multiple SNPs, we average the estimates
b ¼ 1=N PN F
over all of the SNPs, i.e., F b
related to other cases than to controls. Therefore, stringent i¼1 i .
quality control is essential when applying GCTA to case-
control data. Quantitative traits are less likely to suffer Estimating LD Structure
from technical genotyping artifacts because they will In a standard GWAS, particularly with a large sample size,
generally not lead to spurious association between contin- the mean (lmean) or median (lmedian) of the test statistics
uous phenotypes and genotypes. for single-SNP associations often deviates from its expected
value under the null hypothesis of no association between
any SNP and the phenotype, which is usually interpreted
Estimation of the Inbreeding Coefficient as the effect due to population stratification and/or cryptic
from Genome-wide SNPs relatedness.11,26,27 An alternative explanation is that poly-
Apart from estimating the genetic relatedness between genic variation causes the observed inflated test statistic.18
individuals, GCTA also has a function to estimate the To predict the genomic inflation factors, lmean and lmedian,
inbreeding coefficient (F) from SNP data, i.e., the relation- from polygenic parameters such as the total amount of
ship between haplotypes within an individual. Two esti- variance that is explained by all SNPs, we need to quantify
mates have been used: one based on the variance of addi- the LD structure between SNPs and putative causal variants
tive genetic values (diagonal of the SNP-derived GRM) (unpublished data). GCTA provides a function to search for
and the other based on SNP homozygosity (implemented all the SNPs in LD with the ‘‘causal variants’’ (mimicked by
in PLINK).25 Let (1 – pi)2 þ pi(1 – pi)F, 2pi(1 – pi)(1 – F), a set of SNPs chosen by the user). Given a causal variant, we
and pi2 þ pi(1 – pi)F be the frequencies of the three geno- use simple regression to test for SNPs in LD with the causal
types of a SNP i and let hi ¼ 2pi(1 – pi). The estimate based variant within d Mb distance in either direction. PLINK has
on the variance of additive genotype values is an option (‘‘show targets’’) to select SNPs in LD with a set of
   I  target SNPs with LD r2 larger than a user-specified cutoff
b I ¼ ½xi  Eðxi Þ2 =hi  1 ¼ xi  2pi 2 =hi  1 and var F
F b jF
i i value. This function is very useful to distinguish indepen-
¼ ð1  hi Þ=hi þ 7ð1  2hi ÞF=hi  F 2 ; dent association signals but less suited to predict lmean and
lmedian, because the test statistics of the SNPs in modest LD
where xi is the number of copies of the reference allele for with causal variants (SNPs at Mb distance with low r2) will
the ith SNP. This is a special case of Equation 3 for a single also be inflated to a certain extent, and these test statistics
SNP when j ¼ k. The estimate based upon excess homozy- will contribute to the genomic inflation factors.
gosity is
GWAS Simulation
b II ¼ ½Oð#homÞ  Eð#homÞ=½1  Eð#homÞ
F i
 II  We provided a function to simulate GWAS data based on the
b j F ¼ ð1  hi Þ=hi
¼ 1  xi ð2  xi Þ=hi and var F observed genotype data. For a quantitative trait, the pheno-
i

 ð1  2hi ÞF=hi  F 2 ; types are simulated by the simple additive genetic model
y ¼ Wu þ 3, where the notation is the same as above. Given
where O(# hom) and E(# hom) are the observed and ex- a set of SNPs assigned as causal variants, the effects of the
pected number of homozygous genotypes in the sample, causal variants are generated from a standard normal distri-
respectively. Both estimators are unbiased estimates of F bution, and the residual effects are generated from a normal
b I jFÞ ¼ Eð F
in the sense that Eð F b II jFÞ ¼ F, but their sampling distribution with mean of 0 and variance of s2g ð1=h2  1Þ,
i i

The American Journal of Human Genetics 88, 76–82, January 7, 2011 79


where s2g is the empirical variance of Wu and h2 is the user In some of the mixed model analysis packages, such as
specified heritability. For a case-control study, assuming ASREML,35 to avoid the inversion of the n 3 n V matrix,
a threshold-liability model, disease liabilities are simulated people usually use Gaussian elimination of the mixed
in the same way as that for the phenotypes of a quantitative model equations (MME) to obtain the AI matrix based
trait. Any individual with disease liability exceeding on sparse matrix techniques. The SNP-derived GRM
a certain threshold T is assigned to be a case and a control matrix, however, is typically dense, so the sparse matrix
otherwise, where T is the threshold of normal distribution technique will bring an extra cost of memory and CPU
truncating the proportion of K (disease prevalence). The time. Moreover, the dimension of MME depends on the
only purpose of this function is to do a simple simulation number of random effects in the model, whereas the V
based on the observed genotype data. More complicated matrix does not. For example, when fitting the 22 chromo-
simulation can be performed with programs such as ms,28 somes simultaneously in the model, the dimension of
GENOME,29 FREGENE,30 and HAPGEN.31 MME is 22n 3 22n (ignoring the fixed effects), whereas
the dimension of V matrix is still n 3 n. We compared
Data Management the computational efficiency of GCTA and ASREML.
We chose the PLINK25 compact binary file format (*.bed, When the sample size is small, e.g., n < 3000, both
*.bim, and *.fam) as the input data format for GCTA because GCTA and ASREML take a few minutes to run. When the
of its popularity in the genetics community and its effi- sample size is large, e.g., n > 10,000, especially when fitting
ciency of data storage. For the imputed dosage data, we multiple GRMs, it takes days for ASREML to finish the anal-
use the output files of the imputation program MACH32 ysis, whereas GCTA needs only a few hours.
(*.mldose.gz and *.mlinfo.gz) as the inputs for GCTA. For
the convenience of analysis, we provide options to extract System Requirements
a subset of individuals and/or SNPs and to filter SNPs based We have released executable versions of GCTA for the
on certain criteria, such as chromosome position, minor three major operating systems: MS Windows, Linux/
allele frequency (MAF), and imputation R2 (for the imputed Unix, and Mac OS. We have also released the source codes
data). However, we do not provide functions for a thorough so that users can compile them for some specific platforms.
quality control (QC) of the data, such as Hardy-Weinberg GCTA requires a large amount of memory when calcu-
equilibrium test and missingness, because these functions lating the GRM or performing an REML analysis with
have been well developed in many other genetic analysis multiple genetic components. For example, it requires
packages, e.g., PLINK, GenABEL,33 and SNPTEST.34 We ~4.8 GB memory to calculate the GRM for a data set with
assume that the data have been cleaned by a standard QC 3925 individuals genotyped by 294,831 SNPs, and it takes
process before entering into GCTA. ~4 CPU hours (AMD Opteron 2.8 GHz) to finish the
computation. We therefore recommend using the 64-bit
version of GCTA for large memory support.
Estimating Total Heritability
The method implemented in GCTA is to estimate the vari-
Nonadditive Genetic Variance
ance explained by chromosome- or genome-wide SNPs
The analysis approach we have adapted is a logical exten-
rather than the trait heritability. Estimating the heritability
sion of estimation methods based on pedigrees. It allows
(i.e., variance explained by all the causal variants), however,
estimation of additive genetic variation that is captured
relies on the genetic relationship at causal variants that is
by SNP arrays and is therefore informative with respect
predicted with error by the genetic relationship derived
to the genetic architecture of complex traits. The estimate
from the SNPs as a result of imperfect tagging. We have
of variance captured by all of the SNPs obtained in GCTA is
previously established that the prediction error is c þ 1 / N,
directly comparable to the heritability estimated from
with c depending on the distribution of the MAF of causal
pedigree analysis in family and twin studies, as well as
variants. We therefore developed a method based on simple
the variance explained by GWAS hits, so that missing
regression to correct for the prediction error by
and hiding heritability can be quantified.5 Other sources
8  
< 1 þ b Ajj  1 ; j ¼ k of genetic variations such as dominance, gene-gene inter-
Ajk ¼ action, and gene-environment interaction are also impor-
:
bAjk ; jsk; tant for complex trait variation but are less relevant to
the ‘‘missing heritability’’ problem if the total heritability
where b ¼ 1  ðc þ 1=NÞ=varðAjk Þ. The estimate of variance refers to the narrow-sense heritability, i.e., the proportion
explained by all of the SNPs after such adjustment is an of phenotypic variance due to additive genetic variance.
unbiased estimate of heritability only if the assumption The current version of GCTA only provides functions to
about the MAF distribution of causal variants is correct. estimate and partition the variances of additive and
additive-environment interaction effects. It is technically
Efficiency of GCTA Computing Algorithm feasible to extend the analysis to include dominance
GCTA implements the REML method based on the vari- and/or gene-gene interaction effects in the future.
ance-covariance matrix V and the projection matrix P. However, the power to detect the high-order genetic

80 The American Journal of Human Genetics 88, 76–82, January 7, 2011


variation will be limited, i.e., the sampling variance of esti- gies for finding the underlying causes of complex disease.
mated variance components will be very large. Future Nat. Rev. Genet. 11, 446–450.
developments will also include options to do multivariate 5. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders,
analyses, to read genotype or imputed probability data in A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G.,
Montgomery, G.W., et al. (2010). Common SNPs explain
different formats, and to implement other applications of
a large proportion of the heritability for human height. Nat.
whole-genome or chromosome segment approaches.
Genet. 42, 565–569.
In summary, we have developed a versatile tool to esti- 6. Gibson, G. (2010). Hints of hidden heritability in GWAS. Nat.
mate genetic relationships from genome-wide SNPs that Genet. 42, 558–560.
can subsequently be used to estimate variance explained 7. Hayes, B.J., Visscher, P.M., and Goddard, M.E. (2009).
by SNPs via a mixed model approach. We provide flexible Increased accuracy of artificial selection by using the realized
options to specify different genetic models to partition relationship matrix. Genet. Res. 91, 47–60.
genetic variance onto each of the chromosomes. We devel- 8. Strandén, I., and Garrick, D.J. (2009). Technical note:
oped methods to estimate genetic relationships from the Derivation of equivalent computing algorithms for genomic
SNPs on the X chromosome and to test the hypotheses predictions and reliabilities of animal merit. J. Dairy Sci. 92,
of dosage compensation. GCTA is not limited to the anal- 2971–2975.
9. VanRaden, P.M. (2008). Efficient methods to compute
ysis of data on human complex traits, but in this report we
genomic predictions. J. Dairy Sci. 91, 4414–4423.
only use examples and specifications (e.g., the number of
10. Patterson, H.D., and Thompson, R. (1971). Recovery of inter-
autosomes) for humans.
block information when block sizes are unequal. Biometrika
58, 545–554.
Acknowledgments 11. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,
Shadick, N.A., and Reich, D. (2006). Principal components
We thank Bruce Weir for discussions on the sampling variance of analysis corrects for stratification in genome-wide association
estimators of inbreeding coefficients. We thank Allan McRae and studies. Nat. Genet. 38, 904–909.
David Duffy for discussions and Anna Vinkhuyzen for software 12. Falush, D., Stephens, M., and Pritchard, J.K. (2003). Inference
testing. We acknowledge funding from the Australian National of population structure using multilocus genotype data:
Health and Medical Research Council (grants 389892 and Linked loci and correlated allele frequencies. Genetics 164,
613672) and the Australian Research Council (grants DP0770096 1567–1587.
and DP1093900). 13. Gilmour, A.R., Thompson, R., and Cullis, B.R. (1995). Average
information REML: An efficient algorithm for variance
Received: August 30, 2010 parameters estimation in linear mixed models. Biometrics
Revised: November 23, 2010 51, 1440–1450.
Accepted: November 29, 2010 14. Henderson, C.R. (1975). Best linear unbiased estimation and
Published online: December 16, 2010 prediction under a selection model. Biometrics 31, 423–447.
15. Kruuk, L.E. (2004). Estimating genetic parameters in natural
populations using the ‘‘animal model’’. Philos. Trans. R. Soc.
Web Resources
Lond. B Biol. Sci. 359, 873–890.
The URLs for data presented herein are as follows: 16. Goddard, M.E., Wray, N.R., Verbyla, K., and Visscher, P.M.
(2009). Estimating effects and making predictions from
Genome-wide Complex Trait Analysis (GCTA), http://gump.qimr. genome-wide marker data. Stat. Sci. 24, 517–529.
edu.au/gcta 17. de Los Campos, G., Gianola, D., and Allison, D.B. (2010).
MACH 1.0: A Markov Chain-based haplotyper, http://www.sph.
Predicting genetic predisposition in humans: The promise of
umich.edu/csg/yli/mach
whole-genome markers. Nat. Rev. Genet. 11, 880–886.
PLINK, http://pngu.mgh.harvard.edu/~purcell/plink
18. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M.,
O’Donovan, M.C., Sullivan, P.F., and Sklar, P.; International
Schizophrenia Consortium. (2009). Common polygenic
variation contributes to risk of schizophrenia and bipolar
References
disorder. Nature 460, 748–752.
1. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., 19. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon,
Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam,
etiologic and functional implications of genome-wide associa- S., Raychaudhuri, S., et al. (2010). Hundreds of variants clus-
tion loci for human diseases and traits. Proc. Natl. Acad. Sci. tered in genomic loci and biological pathways affect human
USA 106, 9362–9367. height. Nature 467, 832–838.
2. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., 20. Kent, J.W., Jr., Dyer, T.D., and Blangero, J. (2005). Estimating
Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., the additive genetic effect of the X chromosome. Genet.
Cardon, L.R., Chakravarti, A., et al. (2009). Finding the Epidemiol. 29, 377–388.
missing heritability of complex diseases. Nature 461, 747–753. 21. Lynch, M., and Walsh, B. (1998). Genetics and Analysis of
3. Maher, B. (2008). Personal genomes: The case of the missing Quantitative Traits (Sunderland, MA: Sinauer Associates).
heritability. Nature 456, 18–21. 22. Falconer, D.S. (1965). The inheritance of liability to certain
4. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M., Moore, diseases, estimated from the incidence among relatives.
J.H., and Nadeau, J.H. (2010). Missing heritability and strate- Ann. Hum. Genet. 29, 51–76.

The American Journal of Human Genetics 88, 76–82, January 7, 2011 81


23. Dempster, E.R., and Lerner, I.M. (1950). Heritability of 30. Hoggart, C.J., Chadeau-Hyam, M., Clark, T.G., Lampariello, R.,
threshold characters. Genetics 35, 212–236. Whittaker, J.C., De Iorio, M., and Balding, D.J. (2007).
24. Robertson, A., and Lerner, I.M. (1949). The heritability of Sequence-level population simulations over large genomic
all-or-none traits; viability of poultry. Genetics 34, 395–411. regions. Genetics 177, 1725–1731.
25. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, 31. Spencer, C.C., Su, Z., Donnelly, P., and Marchini, J. (2009).
M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, Designing genome-wide association studies: Sample size,
M.J., and Sham, P.C. (2007). PLINK: A tool set for whole- power, imputation, and the choice of genotyping chip. PLoS
genome association and population-based linkage analyses. Genet. 5, e1000477.
Am. J. Hum. Genet. 81, 559–575. 32. Li, Y., and Abecasis, G.R. (2006). Mach 1.0: Rapid Haplotype
26. Campbell, C.D., Ogburn, E.L., Lunetta, K.L., Lyon, H.N., Reconstruction and Missing Genotype Inference. Am. J.
Freedman, M.L., Groop, L.C., Altshuler, D., Ardlie, K.G., and Hum. Genet. S79, 2290.
Hirschhorn, J.N. (2005). Demonstrating stratification in 33. Aulchenko, Y.S., Ripke, S., Isaacs, A., and van Duijn, C.M.
a European American population. Nat. Genet. 37, 868–872. (2007). GenABEL: An R library for genome-wide association
27. Cardon, L.R., and Palmer, L.J. (2003). Population stratification analysis. Bioinformatics 23, 1294–1296.
and spurious allelic association. Lancet 361, 598–604. 34. Wellcome Trust Case Control Consortium. (2007). Genome-
28. Hudson, R.R. (1990). Gene genealogies and the coalescent wide association study of 14,000 cases of seven common
process. Oxford Surveys in Evolutionary Biology 7, 1–44. diseases and 3,000 shared controls. Nature 447, 661–678.
29. Liang, L., Zöllner, S., and Abecasis, G.R. (2007). GENOME: 35. Gilmour, A.R., Gogel, B.J., Cullis, B.R., and Thompson, R.
A rapid coalescent-based whole genome simulator. (2006). ASReml User Guide Release 2.0 (Hemel Hempstead,
Bioinformatics 23, 1565–1567. UK: VSN International).

82 The American Journal of Human Genetics 88, 76–82, January 7, 2011

You might also like