Update

TRENDS in Genetics Vol.21 No.7 July 2005


Genetical genomics in humans and model organisms
Dirk-Jan de Koning and Chris S. Haley
The Roslin Institute, Roslin, Midlothian, UK, EH25 9PS

Genetical genomics has been proposed to map loci controlling gene-expression differences (eQTLs) that might underlie functional trait variation. We briefly review the studies in model species and conclude that, although they successfully demonstrate the utility of genetical genomics, they are too limited to unlock the full potential of this approach and some results should be interpreted with caution. We subsequently elaborate on two recent studies that use this approach in humans. The many differences between these studies complicate meaningful comparisons between them. A joint analysis of the two experiments offers some scope for more powerful genetical genomics.

Glossary
Bonferroni correction: a statistical adjustment for multiple comparisons. The Bonferroni correction is simple: if a number (n) of outcomes are being tested instead of a single outcome, the desired threshold level (P) is divided by n.
False discovery rate (FDR): the proportion of false-positive test results among all significant tests (note that the FDR is conceptually different to the significance level).
Haploid line: a line that is derived by crossing two strains and subsequently manipulating the F1 gametes to develop into fully homozygous individuals.
Heritability: a statistic that estimates the proportion of variation in a trait that is attributable to genetic factors.
Phenotypic standard deviations: a statistic that describes the dispersion of data about the mean.
Quantitative trait locus (QTL): a genetic locus or chromosomal region that contributes to variability in a complex quantitative trait, as identified by statistical analysis. Quantitative traits are typically affected by several genes and by the environment.
Recombinant inbred lines: strains that are formed by crossing two strains, followed by 20 or more consecutive generations of brother–sister mating or selfing. The resulting lines are homozygous (and therefore fixed) at each locus, enabling repeated replicates of genetically homogeneous lines to be assayed.
Statistical power: a statistic that describes how effective a given experiment is at detecting a certain effect. Statistical power is expressed as the proportion of tests that are expected to be significant given a certain experiment and a certain effect.

Introduction
Genetical genomics describes the combined study of gene expression and marker genotypes in a segregating population [1,2]. It aims to detect the genomic loci that control gene-expression differences; these loci are referred to as expression quantitative trait loci (eQTLs; see Glossary). To date, most of these studies have used model species such as mice [3–5], maize [3], rats [6] and yeast [7,8]. The experimental designs include recombinant inbred lines (RI; in rodents) [4–6], F2 or F3 crosses (in mice and maize) [3] and haploid lines (in yeast) [7–9]. The common feature of these designs is that, compared with ‘traditional’ phenotype-based QTL experiments, the sizes of the experiments are modest to small. We have compared the statistical power to detect different QTL effects among the different eQTL studies to date and comment on potential shortcomings (Box 1). The limited size of experiments can be attributed to the expense of gene-expression analyses. However, this should encourage collaborative efforts to perform more powerful eQTL studies rather than multiple studies that each lack sufficient power.

Cis and trans eQTL
eQTL can be classified as cis or trans acting based on the location of the transcript compared with that of the eQTL influencing the expression of that transcript. There is variation between studies in exactly how cis and trans are defined, but generally the genome is divided into segments (bins; to allow for inherent inaccuracy in the mapping of eQTL) based on physical or mapping distance (e.g. 20 kb in yeast [7], 5 MB [4,5] or 2 cM (~3.6 MB) in mice [3] and 20 MB in rats [6]). A QTL is cis acting if it is located in the same bin as the transcript it influences; otherwise it is termed trans acting.

Differences in microarray platform and their effect on eQTL studies
Differences in performance between microarray platforms have been discussed in detail elsewhere [10]. Because genetical genomics combines sequence polymorphisms with variation in expression levels, it is important to establish how robust the RNA measurement is against sequence variation [e.g. single nucleotide polymorphisms (SNPs)] in the transcript. The robustness of Affymetrix chips (http://www.affymetrix.com) against spurious cis effects resulting from SNPs in the transcripts has been evaluated by re-sequencing some of the genes with cis effects in rats [6] and by using available SNP data in mice [5]. Both studies concluded that the effect of SNP variation on the detection of cis-acting eQTLs was limited. An alternative approach for Affymetrix chips would be to study probe–eQTL interactions for cis-acting eQTL because Affymetrix chips use multiple probes to interrogate each transcript (Ritsert Jansen, personal communication). Agilent 60-mer oligonucleotide arrays were shown to be robust against four or fewer SNPs in the probe region [11].

Major hubs of gene regulation: fact or artefact?
A common feature of eQTL studies is the detection of ‘hotspots’ or hubs of trans-acting eQTL: chromosomal regions that affect the expression of a much larger number
Corresponding author: de Koning, D.-J. (DJ.deKoning@BBSRC.ac.uk). Available online 23 May 2005
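The bin-based definition of cis and trans eQTL given in the main text can be illustrated with a minimal sketch. The class and function names below (`Locus`, `bin_index`, `classify_eqtl`) are hypothetical and are not taken from any of the cited studies; the default bin size of 5 MB is simply one of the values quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A genomic location (hypothetical helper type for this sketch)."""
    chromosome: str
    position_mb: float  # physical position in megabases


def bin_index(locus: Locus, bin_size_mb: float) -> tuple[str, int]:
    """Return the (chromosome, bin number) that a locus falls into."""
    return (locus.chromosome, int(locus.position_mb // bin_size_mb))


def classify_eqtl(transcript: Locus, eqtl: Locus, bin_size_mb: float = 5.0) -> str:
    """Label an eQTL 'cis' if it shares a genome bin with its transcript, else 'trans'."""
    same_bin = bin_index(transcript, bin_size_mb) == bin_index(eqtl, bin_size_mb)
    return "cis" if same_bin else "trans"
```

With fixed bins, two loci that lie close together but on either side of a bin boundary are labelled trans, which is one reason why the choice of bin size can change the reported proportion of cis-acting eQTL.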

Box 1. The power of eQTL studies to date
Table I summarizes the statistical power to detect QTL for some eQTL studies to date and compares these with hypothetical F2 designs that are commonly encountered in QTL detection. For example, an eQTL with a heritability of 0.03 (i.e. the eQTL explains 3% of the variation in RNA abundance among the F2 mice) would be detected in 7% of the experiments performed with 111 F2 mice [3] and 16% of the experiments with 86 haploid yeast lines [8]. Although the experiment using 112 haploid yeast lines [9] is the most powerful of all the studies, most studies have limited power to detect any QTL with an effect <0.5 phenotypic standard deviations (SD; equivalent to a QTL heritability of 0.13). As a result, the studies fail to detect many loci with moderate effects on gene regulation and are also expected to miss some loci with major effects.
The statistical threshold that we have used for the power calculations is reasonably stringent for a single trait, but fairly liberal overall, considering that eQTL studies commonly examine the expression levels of thousands of genes. This is a major issue in genetical genomics because it uses multiple testing in two dimensions: hundreds of markers are tested for their putative effect on >10 000 gene transcripts. Traditional approaches, such as the Bonferroni correction, that limit the discovery of spurious effects by increasing the stringency of the statistical significance threshold are demanding, because the thresholds become prohibitive for the detection of all but the most extreme effects. Alternatives such as the false discovery rate have been proposed for genome scans and gene-expression studies [15], and an overview of multiple testing issues and alternatives in genetics was recently presented by Manly et al. [16].
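The contrast between a Bonferroni threshold and a false-discovery-rate criterion can be made concrete with a short sketch. The FDR procedure shown here is the Benjamini–Hochberg step-up rule, chosen for brevity; it is not the q-value method of reference [15], and the function names are illustrative only.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Return, for each test, whether it is declared significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= fdr * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            max_rank = rank
    # Every test ranked at or below max_rank is rejected.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

In a marker-by-transcript scan, `n_tests` would be the number of markers multiplied by the number of transcripts, which is why the Bonferroni threshold shrinks so quickly in genetical genomics.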

Table I. A comparison of statistical power to detect QTL in eQTL studies
The five right-hand columns give the statistical power^b for different QTL effects. Each column header shows the QTL effect in phenotypic SD^c, followed in parentheses by the QTL heritability in an F2 (variance explained)^d.

| Refs | Population | N^a | 0.25 (0.03) | 0.40 (0.08) | 0.5 (0.13) | 0.6 (0.18) | 0.75 (0.28) |
|---|---|---|---|---|---|---|---|
| Brem et al. [7] | Haploid yeast | 40 | 0.05 | 0.2 | 0.51 | 0.73 | 0.99 |
| Yvert et al. [8] | Haploid yeast | 86 | 0.16 | 0.67 | 0.94 | 0.99 | 0.99 |
| Brem and Kruglyak [9] | Haploid yeast | 112 | 0.25 | 0.84 | 0.99 | 0.99 | 0.99 |
| Schadt et al. [3] | F2 mice | 111 | 0.07 | 0.37 | 0.68 | 0.90 | 0.99 |
| Schadt et al. [3] | F3 maize | 76 | 0.04 | 0.19 | 0.41 | 0.67 | 0.94 |
| Chesler et al. (mice); Bystrykh et al. (mice); Hubner et al. (rats) [4–6] | Recombinant inbred lines^e | 33 | 0.05 | 0.29 | 0.62 | 0.91 | 0.99 |
| Hypothetical F2 | | 200 | 0.21 | 0.77 | 0.96 | 0.99 | 0.99 |
| Hypothetical F2 | | 400 | 0.60 | 0.99 | 0.99 | 0.99 | 0.99 |

a Number of individuals with expression data.
b The probability of detecting as significant a QTL using a point-wise significance threshold of P<0.001, which corresponds to a LOD score of 3.0 for an F2 design (slightly more stringent than the proposed threshold for suggestive linkage but much less stringent than the threshold for significant linkage [17]). The power calculations account for different experimental designs but not for different genome length between species (the greater number of independent tests performed in a larger genome requires a more stringent significance threshold).
c Additive effect of the QTL (half of the difference between homozygotes) expressed in units of the phenotypic standard deviation.
d The proportion of the total variation in the population explained by the QTL, assuming an F2 population where the QTL allele frequencies are both 0.5. In an RI or haploid system, the heritability of the QTL is twice the magnitude in an F2.
e Assuming a repeatability of 0.50 for gene transcripts and three replicates for every recombinant inbred (RI) line.
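To give a feel for how sample size and QTL heritability drive the power values in Table I, the following sketch approximates the power of a single-marker test. It is not the calculation used for the table: it assumes a one-degree-of-freedom test of the additive effect only, a marker located at the QTL, and a non-centrality parameter of N·h²/(1−h²). Under these assumptions it gives values in the same range as the hypothetical F2 rows, but it is not expected to reproduce the table entries exactly.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def f2_heritability(additive_effect_sd: float) -> float:
    """QTL heritability in an F2 (allele frequencies 0.5) for an additive
    effect given in phenotypic SD: 2pq * a^2 = 0.5 * a^2."""
    return 0.5 * additive_effect_sd ** 2


def qtl_power(n: int, heritability: float, alpha: float = 0.001) -> float:
    """Approximate power of a 1-df test at point-wise significance level alpha.

    Assumption of this sketch: the test statistic is a non-central
    chi-square (1 df) with non-centrality n * h2 / (1 - h2).
    """
    if not 0.0 <= heritability < 1.0:
        raise ValueError("heritability must be in [0, 1)")
    ncp = n * heritability / (1.0 - heritability)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(ncp)
    # A 1-df non-central chi-square is the square of a shifted standard normal,
    # so its upper tail is the sum of two normal tail areas.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)
```

For an RI or haploid design, footnote d implies that the F2 heritability should be doubled before it is passed to `qtl_power`.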

of genes than expected by chance. These major hubs of gene regulation are most prominent in yeast (eight) [7,8], followed by mice (approximately seven) [3–5]. Clustering of eQTL was not reported for maize [3]. The locations of the trans-acting eQTL show limited overlap among the three mouse eQTL studies [3–5], which could be due to tissue-specific trans regulation. Although the most significant eQTL are cis-acting, the detection of trans-acting regulatory hubs is plausible if cis-regulation provides more direct (i.e. less variable) genetic control than trans regulation, ensuring that cis-acting effects are larger and more consistent. Alternatively, it could be that the proportion of false positive eQTL is greater among trans-acting effects. The strong clustering in ‘hubs’ of eQTLs reflects the highly correlated expression levels of many gene transcripts. This is illustrated by a recent simulation study using real expression data from human pedigrees with a simulated SNP map that was independent of the expression levels [12]. As a result, all eQTLs detected were by default false positives. The eQTL analyses showed strong clustering of (trans) eQTLs and the five most populated bins contained 20% of the significant, but spurious, eQTLs [12]. Thus, although both the high correlation of expression levels among gene transcripts and the detection of eQTL hotspots in experimental

studies can be interpreted to support the hypothesis of coordinated trans-regulation of multiple genes, a major concern is whether the correlation could be due to some technical or environmental factors that are currently unaccounted for. For example, the clustering of eQTL for multiple traits could simply represent the clustering of spurious QTL for highly correlated traits (i.e. with so many traits we expect to see many false-positive QTL effects, and if traits are highly correlated, for whatever reason, these false-positive QTLs will often locate to the same region). Because of the limited understanding of genetic and physiological control of gene expression and the limited experimental sizes so far, any conclusions with regard to hotspots for gene regulation should be interpreted with caution.

eQTL studies in human cell lines
Although the genetic complexity of most eQTL studies is limited because of the use of inbred resources, two recent studies report eQTL in analyses of cell lines derived from human pedigrees [13,14]. These initial studies both used lymphoblastoid cell lines from the CEPH pedigrees (http://www.cephb.fr/cephdb/) but otherwise have differences at almost every level of execution (Table 1). Many of the differences between the two studies are not unique to genetical genomics: discrepancies


Table 1. A comparison between two eQTL analyses on human CEPH data^a

| | Morley et al. [14] | Monks et al. [13] |
|---|---|---|
| CEPH families used | 14 (eight in common) | 15 (eight in common) |
| Gene expression platform | Affymetrix genome focus 25-mer oligonucleotide arrays | Agilent 60-mer oligonucleotide array |
| Genes on array | ~8500 | 23 499 |
| Design and replicates | Direct measurement with two array replicates per individual | Reference design with at least two arrays per individual |
| Criterion for selecting genes for eQTL analysis | Greater variation between individuals than within | Differentially expressed in at least half of the offspring |
| Genes taken forward to eQTL analysis | 3554 | 2430 |
| Marker genotypes | 2756 autosomal SNP markers from the SNP consortium database | 346 autosomal markers, selected from CEPH genotype database |
| Data availability | Genotypes available at http://www.ceph/fr/cephdb; expression data at http://www.ncbi.nlm.nih.gov/geo/ (GEO accession GSE1485) | Genotypes available at http://www.ceph/fr/cephdb; expression data at http://www.ncbi.nlm.nih.gov/geo/ (GEO accession GSE1726) |
| eQTL analyses | (i) Sib-pair analyses using S.A.G.E. for whole genome analysis; (ii) QTDT and association study for 17 genes with cis-acting eQTL | Variance component analyses using SOLAR for both heritability of transcript level and eQTL |
| Test for hubs of gene regulation | 5 MB genome bins, testing for deviation from Poisson distribution | At 4 cM (~3.2 MB) intervals, comparing number of hits with those obtained by simulation |
| eQTL results | 142 genes with at least one eQTL (P < 4.3 × 10^-7); 984 genes with at least one eQTL (P < 3.7 × 10^-5) | 33 genes with at least one eQTL (P < 5.0 × 10^-6); 50 genes with at least one eQTL (P < 5.0 × 10^-5); 135 genes with at least one eQTL (P < 5.0 × 10^-4) |
| Hubs of gene regulation | Two hotspots on chromosomes 14 and 20 affecting seven and six genes, respectively (using P < 4.3 × 10^-7) or 31 and 35 genes, respectively (using P = 3.7 × 10^-5)^b | Six locations with five or six linkage hits on chromosome 6; according to the authors, these are attributable to allelic diversity and non-specificity of gene probes and were therefore dismissed |
| Other analyses | Hierarchical clustering of genes within 5 MB window on chromosome 14; RT–PCR for one gene with a large cis effect | Test for enrichment of certain annotations among differentially expressed genes; 574 genes with non-zero heritability, which were subsequently clustered using genetic or phenotypic correlations |

a Abbreviations: GEO, gene expression omnibus.
b A different number of genes are affected by the eQTL, depending on the P value used.
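The ‘deviation from Poisson distribution’ test for hubs listed in Table 1 can be sketched as follows. This is an illustrative reading of that row, not the procedure used by Morley et al.: it assumes that eQTL hits fall independently and uniformly across equal-sized bins, and it applies a Bonferroni correction over bins. The function names are hypothetical.

```python
from math import exp


def poisson_upper_tail(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    term = exp(-mean)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mean / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def hotspot_bins(counts: list[int], alpha: float = 0.05) -> list[int]:
    """Indices of bins with more eQTL hits than expected under a Poisson null."""
    n_bins = len(counts)
    mean = sum(counts) / n_bins
    threshold = alpha / n_bins  # Bonferroni correction across bins
    return [i for i, c in enumerate(counts) if poisson_upper_tail(c, mean) < threshold]
```

As the main text cautions, expression traits are often highly correlated, so hits are not independent and a Poisson null of this kind is likely to be too liberal; simulation-based nulls, as used by Monks et al., avoid that particular assumption.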

between gene-expression platforms, different statistical methods and protocols are common obstacles when comparing different microarray studies. Although the studies overlap for about half (eight) of the CEPH families studied, they use different genetic marker sets and different methods for expression analysis and eQTL analysis. Furthermore, they use different criteria for including genes in their eQTL analysis and apply different thresholds for QTL detection (Table 1). The results between the two studies are also remarkably different: Morley et al. take ~42% of the genes (n = 3554) on their arrays forward to eQTL analysis, whereas Monks et al. use only ~10% (n = 2430; Table 1). At comparable significance levels (3.7 × 10^-5 and 5.0 × 10^-5, respectively), Morley et al. report eQTL for ~28% of the genes that were taken forward to eQTL analysis compared with ~2% for Monks et al. (Table 1). Figure 1 shows the theoretical power for detection of QTL for the two studies using the two methods of QTL analysis. The QTL methods are briefly explained in Box 2. For the sib-pair analyses, both studies had similar power. The power calculations confirm that variance component methods such as sequential oligogenic linkage analysis routines (SOLAR)

are theoretically slightly more powerful than sib-pair analyses, because they use all of the genetic relationships within the pedigree. However, the power difference does not explain the marked difference in numbers of QTL detected by the two studies. The greater number of eQTLs for the Morley et al. study could be due to several factors including: (i) less technical noise in gene-expression measurements, resulting in a larger proportion of the variance attributable to the QTL effect; (ii) environmental conditions that promote greater genetically controlled variation in expression; or (iii) less robust gene-expression measurements or analyses, making the results more prone to bias and false positive results. Given the low power of both studies to detect eQTLs under the stringent thresholds that they apply, the results of Monks et al. are more consistent with prior expectation, unless eQTL effects are much stronger than those of phenotypic QTL. Although the low theoretical power does not explain why Morley et al. detect more QTL than Monks et al., it would explain differences in genes for which eQTL are detected, in addition to discrepancies in finding eQTL in different locations for a particular transcript. When both studies have limited power to detect a given QTL, they will each


[Figure 1 plot, ‘Power of eQTL studies in human pedigrees’: power to detect eQTL (y-axis, 0 to 1.0) against QTL heritability (x-axis, 0 to 0.5), with curves for VCA (Morley et al.), VCA (Monks et al.) and sib-pair analysis.]

Figure 1. The statistical power to detect the eQTL of given heritability for the two studies using either a sib-pair analysis or a variance component analysis (VCA). Using sib-pair analyses (red), both studies had similar power; therefore, only a single line is shown. The statistical power is defined as the proportion of analyses in which a QTL with a given effect will be detected under a defined P value (in this case P<0.0001, which is still less stringent than the proposed genome-wide threshold [17]). The power for the sib-pair analyses was assessed using the genetic power calculator [20] (http://statgen.iop.kcl.ac.uk/gpc/). The power for the VCA (pink and blue) was assessed using routines that were kindly provided by Xijiang Yu (University of Edinburgh) based on Williams and Blangero [21], using the CEPH pedigrees. For all power calculations, the background heritability was assumed to be 0.30. To restrict the pedigree from the original 210 members to the 167 that were used by Monks et al., 43 individuals were randomly deleted from the power calculations. For a brief explanation of QTL methods, see Box 2.

only detect a small proportion of actual eQTL and are hence unlikely to detect the same effects. Both studies agree that the most significant QTL appear to be cis-acting, whereas the proportion of cis-acting eQTL is smaller in Morley et al. (~22%) than in Monks et al. (~40%) for the most stringent significance levels. However, although Morley et al. claim support for two trans-acting hubs of regulation on chromosomes 14 and 20, Monks et al. claim ‘lack of evidence for linkage hotspots’, although their permutations show that eQTL are significantly ‘unevenly distributed’. However, Monks et al. make their statement based on the eQTL with P<0.000005, whereas Morley et al. use P<0.000037 (7.4 times larger) to claim the larger hubs. Therefore, the difference in threshold, and the difference in genes that were analysed, could explain this discrepancy. An interesting aspect of the Morley et al. article is the follow-up analyses on cis-acting QTL: they perform a within-family association test [quantitative transmission disequilibrium test (QTDT); Box 2] with additional SNP markers for 17 transcripts. Furthermore, they re-estimate the magnitude of these QTL effects by a regression
Box 2. QTL methods used in the eQTL analyses of human data

Sib-pair analysis
Morley et al. [14] applied a sib-pair analysis using the SIBPAL procedure from S.A.G.E. (http://darwin.cwru.edu/sage/index.php). A sib-pair analysis determines evidence for linkage between a marker and a quantitative trait by regressing the phenotypic difference between sibs on the proportion of alleles that are shared identical by descent (IBD) between the sibs.
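The regression at the heart of a sib-pair analysis can be sketched in a few lines. This sketch follows the classic Haseman–Elston formulation, in which the squared trait difference of each sib pair is regressed on the IBD proportion at a marker; whether SIBPAL was run in exactly this form is not stated in the text, so this is an assumption, and the function names are hypothetical. A clearly negative slope (sibs that share more alleles IBD are more alike) is taken as evidence for linkage.

```python
def ols_slope(x: list[float], y: list[float]) -> float:
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("IBD proportions show no variation")
    return sxy / sxx


def sib_pair_slope(trait_pairs: list[tuple[float, float]],
                   ibd_proportions: list[float]) -> float:
    """Slope of squared sib trait differences on IBD sharing at one marker."""
    squared_diffs = [(a - b) ** 2 for a, b in trait_pairs]
    return ols_slope(ibd_proportions, squared_diffs)
```

A full analysis would also test whether the slope is significantly below zero and would estimate the IBD proportions from marker genotypes, both of which are omitted here.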

analyses on the grandparent data, giving a more realistic estimate of the actual QTL effect. This provides a solution to the problem that when QTLs are initially detected in a study with low power, the effects of those that are detected can be grossly overestimated. This overestimation of QTL effects is apparent in the article by Monks et al., who report genes with two or three eQTLs, and even one gene with 15 eQTLs. Subsequently, they claim that ‘all detectable QTL accounted for at least 50% of the trait variance with 75% of the QTL having heritabilities >0.76’. This illustrates the level at which QTL effects are overestimated: it is impossible to have 15 eQTL, each explaining 50% of the trait variance. This phenomenon is not unique to eQTL, but it illustrates the issue particularly well. Morley et al. confirm one of their cis-acting eQTL by quantitative PCR, which would seem to allay concerns about SNP variation in the probe. However, because only a single gene was confirmed, no general conclusion can be drawn from this result. Monks et al. discuss the potential problem of SNP variation within the probe sequence and subsequently question their own results for the human leukocyte antigen (HLA) area, which harbours substantial sequence variation.
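The overestimation of effects that pass a stringent threshold in a low-powered study can be demonstrated with a small simulation. The sketch below is purely illustrative and uses hypothetical parameter values: each replicate draws an effect estimate from a normal distribution around the true effect, and only estimates that pass a one-sided significance threshold are kept.

```python
import random
from statistics import NormalDist, mean


def mean_significant_estimate(true_effect: float,
                              standard_error: float,
                              alpha: float = 0.001,
                              n_replicates: int = 20000,
                              seed: int = 1) -> float:
    """Average effect estimate among replicates that reach significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided threshold
    significant = []
    for _ in range(n_replicates):
        estimate = rng.gauss(true_effect, standard_error)
        if estimate / standard_error > z_crit:
            significant.append(estimate)
    if not significant:
        raise ValueError("no replicate reached significance")
    return mean(significant)
```

When the true effect is small relative to its standard error, every estimate that reaches significance must exceed the threshold, so the average of the significant estimates is well above the true value; when power is high, the average stays close to the truth. Re-estimating the effect in independent data, as Morley et al. did with the grandparent data, avoids this selection bias.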

Variance component QTL analysis
Monks et al. [13] applied a variance component QTL analysis using SOLAR (http://www.sfbr.org/solar/) [18]. In a variance component QTL analysis, the proportion of phenotypic variation attributable to a QTL is estimated across a population using the IBD proportions between all related individuals for a putative QTL location.

Quantitative transmission disequilibrium test (QTDT)
Morley et al. [14] used a family-based association test to confirm some of the cis-acting eQTL. Transmission disequilibrium tests (TDT) were initially proposed for studying mendelian disorders and provide a combined test for linkage and association by comparing the transmitted and non-transmitted marker alleles from the parents with those of the affected offspring. The quantitative TDT (QTDT), used by Morley et al., extended this methodology to complex traits where direct classification of offspring is not possible [19].


Concluding remarks
Both articles present an interesting set of results, but the only feature that they appear to share is a limited theoretical power to detect eQTL of small to moderate size. A first step to compare both studies would be to analyse the experiment in each study with the methods that were applied in the other (i.e. re-analyse the data from Morley et al. with SOLAR and the data from Monks et al. with a sib-pair analysis). Given that the pedigree details, genotype and gene-expression data for both studies are available online (Table 1), ongoing exploration of these data sets is expected to shed further light on the differences and similarities between the two studies. eQTL studies have been successfully linked to variation in disease phenotype in mice [3] and rats [6]. Although the current examples of eQTL mapping in humans lack this important aspect (and motivation) of eQTL mapping, these authors might have paved the way for future eQTL studies that will address the complex nature of human disease.

Acknowledgements
We acknowledge financial support from the BBSRC. We are grateful to the two referees, and to John Gibson, Ritsert Jansen and Rob Williams for constructive comments on an earlier draft of this article. We also thank Ritsert Jansen and Rob Williams for sharing their manuscripts on BXD data.

References
1 Jansen, R.C. and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388–391
2 Jansen, R.C. (2003) Studying complex biological systems using multifactorial perturbation. Nat. Rev. Genet. 4, 145–151
3 Schadt, E.E. et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302
4 Bystrykh, L. et al. (2005) Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat. Genet. 37, 225–232
5 Chesler, E.J. et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet. 37, 233–242
6 Hubner, N. et al. (2005) Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet. 37, 243–253
7 Brem, R.B. et al. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755
8 Yvert, G. et al. (2003) Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet. 35, 57–64
9 Brem, R.B. and Kruglyak, L. (2005) The landscape of genetic complexity across 5700 gene expression traits in yeast. Proc. Natl. Acad. Sci. U. S. A. 102, 1572–1577
10 Tan, P.K. et al. (2003) Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 31, 5676–5684
11 Hughes, T.R. et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol. 19, 342–347
12 Perez-Enciso, M. (2004) In silico study of transcriptome genetic variation in outbred populations. Genetics 166, 547–554
13 Monks, S.A. et al. (2004) Genetic inheritance of gene expression in human cell lines. Am. J. Hum. Genet. 75, 1094–1105
14 Morley, M. et al. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747
15 Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445
16 Manly, K.F. et al. (2004) Genomics, prior probability, and statistical tests of multiple hypotheses. Genome Res. 14, 997–1001
17 Lander, E. and Kruglyak, L. (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241–247
18 Almasy, L. and Blangero, J. (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62, 1198–1211
19 Abecasis, G.R. et al. (2000) A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. 66, 279–292
20 Purcell, S. et al. (2003) Genetic power calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19, 149–150
21 Williams, J.T. and Blangero, J. (1999) Power of variance component linkage analysis to detect quantitative trait loci. Ann. Hum. Genet. 63, 545–563

0168-9525/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2005.05.004

Genome Analysis

A highly unexpected strong correlation between fixation probability of nonsynonymous mutations and mutation rate
Gerald J. Wyckoff1,4,*, Christine M. Malcom1,2,*, Eric J. Vallender1,3,* and Bruce T. Lahn1
1 Howard Hughes Medical Institute, Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
2 Department of Anthropology, University of Chicago, Chicago, IL 60637, USA
3 Committee on Genetics, University of Chicago, Chicago, IL 60637, USA
4 Department of Molecular Biology and Biochemistry, University of Missouri-Kansas City, Kansas City, MO 64108, USA

Under prevailing theories, the nonsynonymous-to-synonymous substitution ratio (i.e. Ka/Ks), which measures the fixation probability of nonsynonymous
Corresponding author: Lahn, B.T. (blahn@bsd.uchicago.edu). * These authors contributed equally to this work.

mutations, is correlated with the strength of selection. In this article, we report that Ka/Ks is also strongly correlated with the mutation rate as measured by Ks, and that this correlation appears to have a similar magnitude as the correlation between Ka/Ks and selective strength. This finding cannot be reconciled