
QUANTO

Version 1.2
January 2007

Developed by:

Jim Gauderman, Ph.D. and John Morrison, M.S.


Department of Preventive Medicine
University of Southern California
e-mail: jimg@usc.edu, jmorr@usc.edu
Updates and Information: http://hydra.usc.edu/gxe

Acknowledgments:

Supported in part by NIEHS grants ES10421 and 5P30-ES07048 and NCI grant CA52862. We
thank Dr. Robert Elston for permission to utilize the program MAXFUN, and Dr. Joshua
Millstein for a program to perform numerical integration.

References:

1. Gauderman WJ, Morrison JM. QUANTO 1.1: A computer program for power and sample
   size calculations for genetic-epidemiology studies, http://hydra.usc.edu/gxe, 2006.

2. Gauderman WJ. Sample size requirements for matched case-control studies of
   gene-environment interaction. Stat Med 21:35-50, 2002.

3. Gauderman WJ. Sample size requirements for association studies of gene-gene
   interaction. Am J Epidemiol 155:478-84, 2002.

4. Gauderman WJ. Candidate gene association studies for a quantitative trait, using
   parent-offspring trios. Genet Epidemiol 25:327-338, 2003.

Table of Contents
Summary
Changes since previous release
Disclaimer
Referencing
1. Introduction
2. Methods
    2.1 Notation and Definitions
    2.2 Theoretical Models
        2.2.1 Disease Trait
        2.2.2 Quantitative Trait
    2.3 Sampling Designs and Likelihood Formation
        2.3.1 Disease Trait
        2.3.2 Quantitative Trait
    2.4 Calculation of Power and Sample Size
        2.4.1 Basic Approach
        2.4.2 Calculation of the expected log-likelihood: Disease Trait
        2.4.3 Calculation of the expected log-likelihood: Quantitative Trait
3. Using QUANTO
    3.1 Main Menu
    3.2 Parameters
        3.2.1 Outcome/Design
        3.2.3 Gene G
        3.2.4 Environment
        3.2.5 Gene H
        3.2.6 Outcome Model
            3.2.6.1 Disease Risk Parameters
            3.2.6.2 Continuous Trait Parameters
        3.2.7 Power
    3.3 Wizards
    3.4 View
    3.5 Help
4. Examples
    4.1 Testing the main effect of a gene (G) for a disease trait
    4.2 Specifying a range of effect sizes
    4.3 Testing the main effect of a binary environmental factor (E)
    4.4 Testing G×E interaction for a disease trait
    4.5 Testing a continuous environmental factor (Z)
    4.6 Testing correlation between Z and a quantitative outcome (Y)
    4.7 Testing G×E interaction for a quantitative outcome

Summary
QUANTO is a program for computing either power or required sample size for association
studies of genes, environmental factors, gene-environment (G×E) interaction, or gene-gene
(G×G) interaction. The program is written in C++ and is designed to run under Windows 95,
98, 2000, XP, and NT. Two types of outcomes are considered, a disease (binary) outcome
and a quantitative (continuous) outcome. For a disease outcome, five types of study designs
are available: 1) standard “unmatched case-control” study, 2) standard “matched case-
control” in which each case is matched to an unrelated control selected from the source
population of the case, 3) “case-sibling” in which each case is matched to an unaffected
sibling, 4) “case-parent” in which genotypes are collected from the case and their two
parents, while any environmental data are only collected from the case, and 5) “case-only” in
which genetic and/or environmental data are collected only on a series of cases. For a
quantitative outcome, two study designs are available: 1) “independent individuals”
consisting of a random sample of unrelated subjects, and 2) “parent-offspring” in which the
outcome and environmental factor are measured on the offspring and genotypes are obtained
from the offspring and their two parents.

Changes since previous release


Since version 1.1, we have added:

• Sample size or power calculations for a 2 degree of freedom (df) joint test of a genetic
main effect (G) and gene-environment interaction (G×E). This test was investigated in a
recently published paper (Kraft, Yen, Stram, Morrison, Gauderman. “Exploiting
gene-environment interaction to detect genetic associations”, Human Heredity, 2007). The
2 df test was shown to provide increased power in some situations over a 1 df test of G or
a 1 df test of G×E interaction. An option to perform this 2 df test has been added to the
‘Power’ dialog (see Section 3.2.7 in this documentation).

Disclaimer
Although we have tested QUANTO extensively, no warranty, expressed or implied, is made
as to the accuracy and functioning of the program. No responsibility is assumed by the
authors. If any errors are encountered, please contact the authors at the e-mail address listed
on the title page. Please check the website periodically for notices and program updates.

Referencing
If you use QUANTO, please cite reference #1 on the title page of this documentation and one
or more of the additional references as appropriate. For your convenience, we have included
an endnote library with the four references on the title page as part of the installation
package. This library is located in the same folder as your QUANTO executable file.

We would appreciate hearing your impressions of the program, and we welcome suggestions
for improvements in future versions.

1. Introduction
With increasing frequency, epidemiologic studies are addressing hypotheses related to genes,
gene-environment (G×E), and gene-gene (G×G) interaction. These studies are facilitated by
the increasing availability of candidate genes and genetic markers, along with technologies
that make the utilization of genetic data in large-scale epidemiologic studies affordable.
Case-control studies are widely used for studying associations between disease and potential
environmental and/or genetic risk factors. To avoid possible biases due to population
stratification, one may want to match cases to controls on ethnic background. Cases and
controls may also be matched on other factors known to influence disease risk (e.g. sex, age,
etc.). Rather than a disease outcome, interest may focus on a quantitative trait (e.g. blood
pressure, lung function, cholesterol).

For the study of a disease trait, five basic study designs are considered.

1. “Unmatched Case-control”: A sample of diseased (cases) and non-diseased (controls)
individuals is obtained from some source population. All individuals are assumed to be
independent of one another.

2. “Matched Case-control”: Each case is matched to one unaffected control selected from
the source population of the case.

3. “Case-sibling”: Each case is matched to one unaffected sibling. Compared to the case-
control design, this design has the advantage that cases and controls are perfectly
matched on ethnic background.

4. “Case-parent”: Genotypes are collected from the case and his/her two parents, while any
environmental data are collected only from the case. As with the case-sib design, this
design provides perfect control for ethnic confounding. This design is sometimes
referred to as the ‘TDT’ design, in that the Transmission Disequilibrium Test can be used
to analyze the data. The main effect of environmental factors cannot be assessed in the
case-parent design, but analysis of genetic main effects and G×E and G×G interactions
can be conducted.

5. “Case only”: Only affected individuals are sampled. This design can be used to test for
G×E and/or G×G interaction, but the main effects of genetic and environmental factors
cannot be estimated or tested.

These designs are described in more detail in references 1 and 2 listed on the title page of this
document.

For the study of a quantitative trait, two designs are considered:

1. “Independent Individuals”: A sample of independent subjects is obtained from some
population. Random sampling is assumed (i.e. subjects are not assumed to be sampled
conditional on their quantitative trait value). The trait and genotype/environmental factor
are measured on each individual.

2. “Parent-offspring”: Similar to the case-parent design for a disease outcome, except that
a quantitative outcome is measured on each offspring. Offspring are assumed to be
sampled randomly from the population. In this respect, this design is similar to the
independent-individuals design, with the addition of parental genotypes for each subject.

QUANTO provides sample size and power calculations for testing genetic and environmental
main effects and interactions in the context of the designs listed above. Section 2 of this
documentation describes the notation, models, likelihood formation, and the approach to
sample size calculation used in QUANTO. Section 3 describes the operation of the program,
and Section 4 provides worked examples of sample size calculations.

2. Methods
The following sections describe notation (Section 2.1), theoretical models (Section 2.2),
likelihood formulation (Section 2.3), and methods for computing sample size or power
(Section 2.4).

2.1 Notation and Definitions

The following tables list the notation used in QUANTO and in this documentation.

Table 1: Notation and definitions for data variables


Data   Definition
D      Indicator of disease status (1=diseased, 0=not diseased)
Y      Quantitative trait, with population mean µY and variance σY²
g      Genotype (AA, Aa, or aa) at a candidate locus
h      Genotype (BB, Bb, or bb) at a second candidate locus
G      Covariate based on genotype g and the inheritance model, as follows:
         Dominant:     G=1 for g=AA, Aa; G=0 for g=aa
         Recessive:    G=1 for g=AA; G=0 for g=Aa, aa
         Log-additive: G=2 for g=AA; G=1 for g=Aa; G=0 for g=aa
H      Covariate based on genotype h and the inheritance model (analogous to G)
E      Binary exposure factor (1=exposed, 0=unexposed)
Z      Continuous exposure factor, assumed to be normally distributed
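
The genotype coding in Table 1 can be illustrated with a short Python sketch. The function name `code_genotype` and the model labels are assumptions made for this example only; they are not part of QUANTO.

```python
def code_genotype(g, model):
    """Map a genotype string ('AA', 'Aa', or 'aa') to the covariate G of Table 1."""
    n_A = g.count("A")  # number of copies of the 'A' allele (0, 1, or 2)
    if model == "dominant":
        return 1 if n_A >= 1 else 0
    if model == "recessive":
        return 1 if n_A == 2 else 0
    if model == "log-additive":
        return n_A
    raise ValueError(f"unknown inheritance model: {model}")


for model in ("dominant", "recessive", "log-additive"):
    print(model, [code_genotype(g, model) for g in ("AA", "Aa", "aa")])
```

The covariate H would be coded in the same way from genotype h, counting copies of the 'B' allele instead.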

Table 2: Notation and definitions for genotype model parameters


Parameter  Definition
qA         Frequency (prevalence) of the ‘A’ allele in the population
qB         Frequency (prevalence) of the ‘B’ allele in the population

Table 3: Notation and definitions for environment model parameters


Parameter  Definition
pE         Prevalence of exposure (E=1) in the population
φ          Sibling exposure odds ratio: odds of E=1 in the sibling of an exposed
           subject relative to the odds of E=1 in the sibling of an unexposed subject
σz         Standard deviation of Z
ρ          Sibling correlation in Z (bounded between –1.0 and 1.0)

Table 4: Notation and definitions for disease trait model parameters
Parameter  Definition
Kp         Overall disease risk in the general population (see Section 2.2.1)
P0         Baseline disease risk: disease risk in unexposed (E=0 or Z=0) genetically
           normal (G=0, H=0) subjects
Re         Environmental relative risk (or odds ratio): Risk relative to P0 when E=1 or
           Z=1, in genetically normal (G=0) subjects
Rg         Genetic relative risk (or odds ratio): Risk relative to P0 when G=1, in
           environmentally unexposed (E=0 or Z=0) subjects
Rge        Interaction effect (Relative-risk ratio, or Odds-ratio ratio): Risk relative
           to P0 when G=1 and E=1 (or G=1 and Z=1), divided by the product Re×Rg
R̄e         Marginal environmental relative risk (or odds ratio): Risk when E=1 or Z=1
           relative to risk when E=0 or Z=0, irrespective of genetic status (see
           Section 2.2.1)
R̄g         Marginal genetic relative risk (or odds ratio): Risk when G=1 relative to
           risk when G=0, irrespective of environmental exposure status (see
           Section 2.2.1)

Table 5: Notation and definitions for quantitative trait model parameters


Parameter        Definition
µY               Mean of Y in the population
σY²              Variance of Y in the population
α                Baseline mean of Y: Mean of Y in unexposed (E=0 or Z=0) genetically
                 normal (G=0, H=0) subjects
βe               Environmental effect: Change in mean Y per unit increase in E or Z, in
                 genetically normal (G=0) subjects
βg               Genetic effect: Change in mean Y per unit increase in G, in
                 environmentally unexposed (E=0 or Z=0) subjects
βge              Interaction effect: Change in mean Y per unit increase in the product
                 G×E or G×Z, beyond the change in mean Y due to βe and βg
β̄e, β̄g           Marginal environmental and genetic effects, respectively (see
                 Section 2.2.2)
R²e, R²g, R²ge   Marginal proportion of variance in Y explained by environmental,
                 genetic, and interaction effects, respectively (see Section 2.2.2)

2.2 Theoretical Models

2.2.1 Disease Trait

In this section, we develop theoretical models for how genetic and environmental factors
relate to the risk of disease. For reasons that will be made clear in Section 2.3, we consider
both the logistic and the log-linear models. For simplicity, we consider models with a
genetic covariate G, binary environmental factor E, and their interaction. The factor E can be
replaced below by Z if the environmental factor is continuous, or by H to consider a second
gene. If only the main effect of G or E is of interest, the models below would be reduced to
include only the corresponding covariate and parameter.

The logistic model is given by:

\[
\Pr(D = 1 \mid G, E) = \frac{e^{\alpha + \gamma_g G + \gamma_e E + \gamma_{ge} GE}}{1 + e^{\alpha + \gamma_g G + \gamma_e E + \gamma_{ge} GE}} \tag{1}
\]

where the baseline probability of disease P0 is given by e^α/(1 + e^α), and the quantities
Rg = exp(γg), Re = exp(γe), and Rge = exp(γge) are the genetic, environmental, and
interaction odds ratios, respectively. Note that Rg is the disease odds ratio for carriers
(G=1) compared to noncarriers (G=0) in unexposed (E=0) persons, while Re is the disease
odds ratio for exposed (E=1) versus unexposed (E=0) in genetically normal (G=0) persons.
In the log-additive model, Rg is also the odds ratio for G=1 compared to G=0; the odds ratio
for G=2 (g=AA) compared to G=0 (g=aa) is given by Rg².
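
As a numerical check on these interpretations, the following Python sketch evaluates the logistic model in Equation 1 from P0 and the three odds ratios. The function names and the parameter values are illustrative assumptions, not QUANTO defaults.

```python
import math


def logistic_risk(G, E, P0, Rg, Re, Rge):
    """Pr(D=1 | G, E) under the logistic model (Equation 1)."""
    alpha = math.log(P0 / (1.0 - P0))  # baseline log-odds, so P0 = e^alpha/(1+e^alpha)
    eta = (alpha + math.log(Rg) * G + math.log(Re) * E
           + math.log(Rge) * G * E)
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    return p / (1.0 - p)


# Illustrative values only.
P0, Rg, Re, Rge = 0.01, 1.5, 2.0, 3.0
p00 = logistic_risk(0, 0, P0, Rg, Re, Rge)
p10 = logistic_risk(1, 0, P0, Rg, Re, Rge)
p20 = logistic_risk(2, 0, P0, Rg, Re, Rge)  # log-additive coding, g=AA

print(p00)                    # baseline risk, equal to P0
print(odds(p10) / odds(p00))  # odds ratio for G=1 vs G=0, equal to Rg
print(odds(p20) / odds(p00))  # odds ratio for G=2 vs G=0, equal to Rg squared
```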

The log-linear model is given by:

\[
\Pr(D = 1 \mid G, E) = e^{\alpha + \gamma_g G + \gamma_e E + \gamma_{ge} GE} \tag{2}
\]

where now the baseline probability of disease is simply P0 = e^α, and Rg = exp(γg),
Re = exp(γe), and Rge = exp(γge) are the genetic, environmental, and interaction relative
risks, respectively.
The interpretation of these relative risks is analogous to the interpretations of the odds ratios
described above.

From the quantities described above, one can compute three summary measures, whose
values may be known at the stage of planning a new study. The first measure is the
population disease prevalence, defined as

\[
K_p = \Pr(D = 1) = \sum_G \sum_E \Pr(D = 1 \mid G, E) \, \Pr(G \mid q_A) \, \Pr(E \mid p_E) \tag{3}
\]
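
A minimal Python sketch of Equation 3 under the log-linear model follows. It assumes Hardy-Weinberg genotype proportions for Pr(G | qA) and a binary exposure E that is independent of G; the function names and numerical values are hypothetical and chosen only for illustration.

```python
def genotype_probs(qA, model="dominant"):
    """Pr(G | qA) for the coded covariate G, assuming Hardy-Weinberg proportions."""
    pAA, pAa, paa = qA ** 2, 2 * qA * (1 - qA), (1 - qA) ** 2
    if model == "dominant":
        return {1: pAA + pAa, 0: paa}
    if model == "recessive":
        return {1: pAA, 0: pAa + paa}
    if model == "log-additive":
        return {2: pAA, 1: pAa, 0: paa}
    raise ValueError(f"unknown inheritance model: {model}")


def loglinear_risk(G, E, P0, Rg, Re, Rge):
    """Pr(D=1 | G, E) under the log-linear model (Equation 2)."""
    return P0 * Rg ** G * Re ** E * Rge ** (G * E)


def prevalence(P0, Rg, Re, Rge, qA, pE, model="dominant"):
    """Population prevalence Kp (Equation 3), with G and E independent."""
    pG = genotype_probs(qA, model)
    pEx = {1: pE, 0: 1 - pE}
    return sum(loglinear_risk(G, E, P0, Rg, Re, Rge) * pG[G] * pEx[E]
               for G in pG for E in pEx)


# Illustrative values: dominant gene with qA = 0.5 and Rg = 2, no exposure effect.
print(prevalence(P0=0.01, Rg=2.0, Re=1.0, Rge=1.0, qA=0.5, pE=0.4))
```

When all three relative risks equal 1, the function returns P0, as expected.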

The second measure is the population-average exposure relative risk, a quantity that would
be estimated in an epidemiological study of the exposure factor alone, and is defined as

\[
\bar{R}_e = \frac{\Pr(D = 1 \mid E = 1)}{\Pr(D = 1 \mid E = 0)} = \frac{\sum_G \Pr(D = 1 \mid G, E = 1) \, \Pr(G \mid q_A)}{\sum_G \Pr(D = 1 \mid G, E = 0) \, \Pr(G \mid q_A)} \tag{4}
\]

The third measure is the population-average genetic relative risk, which would be estimated
in an epidemiological study of the genetic factor alone. It is given by

\[
\bar{R}_g = \frac{\Pr(D = 1 \mid G = 1)}{\Pr(D = 1 \mid G = 0)} = \frac{\sum_E \Pr(D = 1 \mid G = 1, E) \, \Pr(E \mid p_E)}{\sum_E \Pr(D = 1 \mid G = 0, E) \, \Pr(E \mid p_E)} \tag{5}
\]

Equations (4) and (5) can be modified accordingly to express the population-average
exposure and genetic odds ratios, respectively. The summary quantities in Equations 3-5 can
be used to solve for the underlying model parameters (e.g. Re and Rg).
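
The population-average relative risks in Equations 4 and 5 can be sketched in Python as below, for a binary G and binary E under the log-linear model. The helper names, the carrier probability `pG`, and the example values are assumptions made for illustration.

```python
def risk(G, E, P0, Rg, Re, Rge):
    """Pr(D=1 | G, E) under the log-linear model (Equation 2)."""
    return P0 * Rg ** G * Re ** E * Rge ** (G * E)


def marginal_relative_risks(P0, Rg, Re, Rge, pG, pE):
    """Return (Re_bar, Rg_bar) from Equations 4 and 5.

    pG is Pr(G=1) and pE is Pr(E=1); G and E are assumed independent.
    """
    wG = {1: pG, 0: 1 - pG}
    wE = {1: pE, 0: 1 - pE}

    def risk_given_E(E):  # Pr(D=1 | E), averaging over G
        return sum(risk(G, E, P0, Rg, Re, Rge) * wG[G] for G in wG)

    def risk_given_G(G):  # Pr(D=1 | G), averaging over E
        return sum(risk(G, E, P0, Rg, Re, Rge) * wE[E] for E in wE)

    return (risk_given_E(1) / risk_given_E(0),
            risk_given_G(1) / risk_given_G(0))


# With no interaction (Rge = 1) the marginal and main-effect relative risks agree.
print(marginal_relative_risks(0.01, Rg=1.5, Re=2.0, Rge=1.0, pG=0.3, pE=0.4))
# With a pure interaction the marginal effects are nonzero although Rg = Re = 1.
print(marginal_relative_risks(0.01, Rg=1.0, Re=1.0, Rge=2.0, pG=0.5, pE=0.5))
```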

In the context of gene-gene interaction, the population-average effect of the second locus H
can be computed by replacing E in Equation 4 with H. For gene-environment interaction
with a continuous environmental factor Z, E is replaced by Z in Equation 4, and the
summations in Equation 5 are replaced by integrals over the distribution of Z.

2.2.2 Quantitative Trait

We assume a linear model relating the phenotype to the genotype and environmental factor,
of the form

Y = α + βgG + βeE + βgeGE + e. (6)

The residual e is assumed to be normally distributed, with mean zero and variance σR². The
parameters βg, βe, and βge have interpretations as the change in the mean of Y per unit
increase of the corresponding variable. We denote the parameters βg and βe as the ‘main’
effects of the genotype and environmental factor, respectively, i.e. as the change in mean Y
when the other factor is zero (see Table 5).

We also consider an alternative parameterization of this linear model, with form

Y = α* + β̄g(G – µG) + β̄e(E – µE) + βge(G – µG)(E – µE) + e. (7)

Here, µG and µE are the theoretical means of the covariates G and E, respectively. For
example, if a dominant model is assumed, G is a binary random variable with mean
µG = qA² + 2qA(1 – qA). The model in Equation 7 is based on a common convention in
regression analysis of centering covariates on their respective means. This will result in
different theoretical expectations for the intercept (α in Equation 6, α* in Equation 7), genetic
effect (βg and β̄g), and environmental effect (βe and β̄e), but not for the interaction effect
βge. In expectation, the value of β̄g in Equation 7 is equivalent to the value of β̄g one
would get from the simpler model

Y = α* + β̄g(G – µG) + e (8)

i.e. from the model in Equation 7 without the environmental and interaction effects. Since the
model in Equation 8 clearly provides an estimate of the marginal effect of G, we denote β̄g
as the ‘marginal’ genetic effect, and similarly for β̄e.
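
Expanding the centered products in Equation 7 and matching coefficients with Equation 6 gives β̄g = βg + βge·µE, β̄e = βe + βge·µG, and α* = α + βg·µG + βe·µE + βge·µG·µE. The Python sketch below checks this algebra numerically; the function names and parameter values are illustrative assumptions, not QUANTO output.

```python
def main_to_marginal(alpha, bg, be, bge, muG, muE):
    """Convert Equation 6 parameters to the centered form of Equation 7."""
    bg_bar = bg + bge * muE
    be_bar = be + bge * muG
    alpha_star = alpha + bg * muG + be * muE + bge * muG * muE
    return alpha_star, bg_bar, be_bar


def mean_main(G, E, alpha, bg, be, bge):
    """Mean of Y under Equation 6."""
    return alpha + bg * G + be * E + bge * G * E


def mean_centered(G, E, alpha_star, bg_bar, be_bar, bge, muG, muE):
    """Mean of Y under Equation 7."""
    return (alpha_star + bg_bar * (G - muG) + be_bar * (E - muE)
            + bge * (G - muG) * (E - muE))


# Illustrative values: dominant gene with qA = 0.3, binary E with pE = 0.4.
qA, pE = 0.3, 0.4
muG = qA ** 2 + 2 * qA * (1 - qA)
muE = pE
alpha, bg, be, bge = 10.0, 1.0, 0.5, 0.25
alpha_star, bg_bar, be_bar = main_to_marginal(alpha, bg, be, bge, muG, muE)

# Both parameterizations give the same mean for every (G, E) combination.
for G in (0, 1):
    for E in (0, 1):
        print(G, E, mean_main(G, E, alpha, bg, be, bge),
              mean_centered(G, E, alpha_star, bg_bar, be_bar, bge, muG, muE))
```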

QUANTO allows you to enter β’s based on either the main effects model (Equation 6) or the
marginal effects model (Equation 7). As a third alternative, you can instead enter the
proportion of variation in Y explained by the marginal effect of each variable, based on the
model in Equation 7. These ‘R-squared’ terms are obtained from the expression

Var(Y) = (β̄g)² Var(G – µG) + (β̄e)² Var(E – µE) + (βge)² Var[(G – µG)(E – µE)] + σR² (9)

Note that the covariance terms one would normally get by taking the variance of a sum drop
out due to the centering of each variable on its theoretical mean. The marginal proportion
of variance explained by the genetic effect (R²g) is then computed as

R²g = (β̄g)² Var(G – µG) / Var(Y) (10)

with analogous expressions for R²e and R²ge. Note that Var(G – µG) = Var(G), which is
computed based on the assumed genetic model and allele frequency. For example, if a
dominant model is assumed, Var(G) = [qA² + 2qA(1 – qA)] × (1 – qA)².
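
The calculation in Equations 9 and 10 can be sketched in Python for a dominant gene and a binary exposure. The sketch assumes that G and E are independent, so that Var[(G – µG)(E – µE)] = Var(G)·Var(E); the function name and the input values are hypothetical.

```python
def r_squared(bg_bar, be_bar, bge, qA, pE, sigma2_R):
    """Marginal R-squared terms (Equations 9 and 10), dominant G and binary E."""
    pG = qA ** 2 + 2 * qA * (1 - qA)  # Pr(G=1) under a dominant model
    var_G = pG * (1 - pG)             # equals [qA^2 + 2qA(1-qA)] * (1-qA)^2
    var_E = pE * (1 - pE)
    var_GE = var_G * var_E            # assumes G and E are independent
    var_Y = (bg_bar ** 2 * var_G + be_bar ** 2 * var_E
             + bge ** 2 * var_GE + sigma2_R)
    return {
        "R2_g": bg_bar ** 2 * var_G / var_Y,
        "R2_e": be_bar ** 2 * var_E / var_Y,
        "R2_ge": bge ** 2 * var_GE / var_Y,
        "var_Y": var_Y,
    }


# Illustrative values only.
print(r_squared(bg_bar=1.0, be_bar=0.5, bge=0.25, qA=0.3, pE=0.4, sigma2_R=4.0))
```

The three R-squared terms plus the residual share σR²/Var(Y) sum to 1.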

2.3 Sampling Designs and Likelihood Formation

2.3.1 Disease Trait

We assume that genotypic and/or exposure data will be collected from all cases. Genotypic
and/or environmental data are collected from unaffected and unrelated controls in the two
case-control designs, and from unaffected siblings in the case-sib design. In the case-parent
design, only genotypic data are collected from parents; phenotypic and environmental data
on parents are neither required nor used. In the case-only design, no controls are selected.
For the matched case-control, case-sib, and case-parent designs, we assume that conditional
logistic regression will be utilized in the analysis of the data. The conditional logistic
likelihood for analysis of G×E interaction in a sample of N matched sets has the form:

\[
L(\beta_g, \beta_e, \beta_{ge}) = \prod_{i=1}^{N} \frac{e^{\beta_g G_{i1} + \beta_e E_{i1} + \beta_{ge} G_{i1} E_{i1}}}{\sum_{j \in M(i)} e^{\beta_g G_{ij} + \beta_e E_{ij} + \beta_{ge} G_{ij} E_{ij}}} \tag{11}
\]

The index “1” refers to the case and the set M(i) includes all subjects in matched set i. For
the matched case-control and case-sib designs, the terms in the denominator of Equation 11
include a contribution from the case and each of his/her matched controls. For the case-
parent design, the denominator includes a contribution from the case and from the three
“pseudo-siblings” of the case, the latter formed as the three possible genotypes the case could
have inherited from the parents but did not. For a continuous covariate, E is replaced by Z
in Equation 11. For analysis of gene-gene interaction, E is replaced by H and in the case-
parent design, the summation in the denominator includes a contribution from the case and
from 15 “pseudo-siblings” considering the possible joint-genotype transmissions at both loci.
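
To make the structure of Equation 11 concrete, the following Python sketch evaluates the log of the conditional likelihood for a list of matched sets. The data layout (a list of (G, E) pairs per set, with the case listed first) and the function names are assumptions made for this illustration.

```python
import math


def matched_set_loglik(beta, members):
    """Log of one factor of Equation 11.

    beta is (beta_g, beta_e, beta_ge); members is a list of (G, E) pairs
    for one matched set, with the case in position 0.
    """
    bg, be, bge = beta
    scores = [bg * G + be * E + bge * G * E for G, E in members]
    return scores[0] - math.log(sum(math.exp(s) for s in scores))


def conditional_loglik(beta, matched_sets):
    """Log of Equation 11 summed over all matched sets."""
    return sum(matched_set_loglik(beta, members) for members in matched_sets)


# Made-up data: two case-control pairs (case first, then the matched control).
pairs = [[(1, 1), (0, 0)], [(1, 0), (1, 1)]]
print(conditional_loglik((0.4, 0.2, 0.1), pairs))
```

For the case-parent design, each set would instead hold the case followed by the three pseudo-siblings, so each denominator would have four terms.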

Maximum likelihood estimates (MLE’s) derived from Equation 11 are consistent estimates
of the corresponding log-odds-ratios from the logistic model (Equation 1) in the matched
case-control or case-sib designs, while they are consistent for the log-relative risks from the
log-linear model (Equation 2) in the case-parent design. In other words, the MLE β̂ of each
parameter in Equation 11 is a consistent estimator of the corresponding parameter γ in
Equation 1 or 2.

In the unmatched case-control design, we assume that standard unconditional logistic
regression is used to analyze the data. QUANTO allows for unequal numbers of cases and
controls in the sample, with K denoting the control-to-case ratio. For example, if there will
be two controls for every case, K=2, while if there will be twice as many cases as controls,
K=0.5. In either situation, the sample size reported by QUANTO will refer to the number of
cases (N). It is up to you to determine the number of controls (N×K) based on your chosen
value of K, and the total number of subjects (N + N×K).
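
The arithmetic for converting the reported number of cases into controls and total subjects is shown in the short Python sketch below. The function name and the value N = 200 are hypothetical.

```python
def study_counts(n_cases, K):
    """Return (number of controls, total subjects) for N cases and ratio K."""
    n_controls = n_cases * K
    return n_controls, n_cases + n_controls


print(study_counts(200, 2))    # two controls for every case
print(study_counts(200, 0.5))  # twice as many cases as controls
```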

The likelihood for the logistic model for N cases and N×K controls has form

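\[
L(\alpha, \beta_g, \beta_e, \beta_{ge}) = \prod_{i=1}^{N} \frac{e^{\alpha + \beta_g G_i + \beta_e E_i + \beta_{ge} G_i E_i}}{1 + e^{\alpha + \beta_g G_i + \beta_e E_i + \beta_{ge} G_i E_i}} \; \prod_{j=1}^{N \times K} \frac{1}{1 + e^{\alpha + \beta_g G_j + \beta_e E_j + \beta_{ge} G_j E_j}} \tag{12}
\]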

where the first product is taken over the N cases and the second is taken over the N×K
controls. MLE’s obtained from this model are consistent estimators of the log-odds-ratio
parameters from the logistic model (Equation 1).

In the case-only design, the measure of interaction is derived from analysis of the association
between G and E. For binary E and G, this can be obtained from a standard analysis of a
2×2 contingency table. More generally, this association can be captured utilizing standard
unconditional logistic regression, with form

\[
L(\mu, \beta_{ge}) = \prod_{i=1}^{N} \frac{\left(e^{\mu + \beta_{ge} E_i}\right)^{G_i}}{1 + e^{\mu + \beta_{ge} E_i}}. \tag{13}
\]

As with the case-parent design, βge in this design estimates the log relative-risk ratio
parameter γge from the log-linear model in Equation 2. The main effects of G and E cannot be
estimated in this design. A continuous environmental factor Z or second gene H can be
substituted for E in Equation 13. An assumption in Equation 13 is that G is a binary
indicator variable (as in the dominant or recessive model). If log-additive coding is used for
G (see Table 1), one can switch the positions of G and E in Equation 13 and still obtain a
consistent estimator of βge. However, the likelihood in Equation 13 is not directly applicable
in two situations: 1) analysis of gene-gene interaction with log-additive coding (i.e. 0, 1, 2
coding) for both G and H, and 2) analysis of gene-environment interaction with log-additive
coding for G and a continuous environmental factor Z. In these situations, we can adopt an
allele-based logistic model rather than the genotype-based model in Equation 13.
Specifically, we define Aij to be an indicator of the presence of allele A on chromosome j
(j=1 or 2) in subject i, based on their genotype Gi, i.e.

Genotype Gi     Ai1     Ai2
---------------------------------------------------
AA              1       1
Aa              1       0
aa              0       0
---------------------------------------------------

Then, for analysis of G×Z interaction with log-additive coding of G, the likelihood in
Equation 13 is replaced by
\[
L(\mu, \beta_{ge}) = \prod_{i=1}^{N} \prod_{j=1}^{2} \frac{\left(e^{\mu + \beta_{ge} Z_i}\right)^{A_{ij}}}{1 + e^{\mu + \beta_{ge} Z_i}}. \tag{14}
\]

Analysis of G×H interaction, when log-additive coding is assumed for both G and H, is also
accomplished using Equation 14, with H substituted for Z. The coding for H in this situation
is still 0, 1, or 2 for h=bb, Bb, or BB, respectively.
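
The allele-based likelihood in Equation 14 can be sketched in Python as follows. The genotype strings, the (genotype, Z) data layout, and the function names are assumptions made for this example.

```python
import math


def allele_indicators(g):
    """Return (A_i1, A_i2) for genotype 'AA', 'Aa', or 'aa'."""
    n_A = g.count("A")
    return (1 if n_A >= 1 else 0, 1 if n_A == 2 else 0)


def allele_loglik(mu, beta_ge, cases):
    """Log of Equation 14; cases is a list of (genotype, Z) pairs."""
    total = 0.0
    for g, z in cases:
        eta = mu + beta_ge * z
        for a in allele_indicators(g):
            # Each of the two chromosomes contributes one logistic term.
            total += a * eta - math.log(1.0 + math.exp(eta))
    return total


cases = [("AA", 0.3), ("Aa", -1.2), ("aa", 0.8)]  # made-up data
print(allele_loglik(0.1, 0.5, cases))
```

The two indicators for a subject sum to the log-additive covariate G (2, 1, or 0), which is why this allele-based form accommodates 0, 1, 2 coding.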

2.3.2 Quantitative Trait

We assume that genotypic and/or exposure data will be collected from all individuals in the
independent-individuals design, and from all offspring in the parent-offspring design.
Genotypes are assumed to be available for both parents in the latter design. We assume that
linear regression will be utilized for both designs. For the independent individuals design,
this model has the same form as the underlying, theoretical model in Equation 6 or 7,
depending on which model is chosen for specifying parameters. MLE’s from either model
are consistent estimators of the corresponding slope parameters in the true underlying model.

For the parent-offspring design, Gauderman (see reference 4, title page) has described an
analytic approach that involves a relatively simple modification to the linear regression
model. Specifically, one first defines categories based on parental mating type. Without
distinguishing maternal and paternal genotypes, there are six possible mating types for a two-
allele locus: AA×AA, AA×Aa, AA×aa, Aa×Aa, Aa×aa, and aa×aa. We let M=1,...,6 denote
these categories. The linear model in Equation 6 is then modified to have form:

Y = αM + βgG + βeE + βgeGE + e, (15)

i.e. the single intercept α in Equation 6 is expanded to include 6 intercepts, α1,…,α6,
corresponding to the six possible parental mating types. An analogous modification can be
made to the marginal model in Equation 7. The inclusion of mating-type specific intercepts
controls for population stratification and yields consistent estimates of the genetic,
environmental, and interaction effects in the presence of population stratification. For a G×G
interaction model, there are 6×6=36 mating-type specific intercepts spanning the joint
genotypes at the two loci for both parents.
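
The mating-type categories can be enumerated with a short Python sketch. The function name `mating_type` and the ordering of the index M are assumptions that follow the order in which the mating types are listed above.

```python
from itertools import combinations_with_replacement, product

GENOTYPES = ["AA", "Aa", "aa"]
# Unordered pairs of parental genotypes, in the order listed in the text.
MATING_TYPES = list(combinations_with_replacement(GENOTYPES, 2))


def mating_type(parent1, parent2):
    """Return M in 1..6 for a pair of parental genotypes, ignoring parent order."""
    pair = tuple(sorted((parent1, parent2), key=GENOTYPES.index))
    return MATING_TYPES.index(pair) + 1


labels = ["×".join(pair) for pair in MATING_TYPES]
print(labels)
print(mating_type("Aa", "AA"), mating_type("AA", "Aa"))

# For a G×G model, intercepts are indexed by the mating type at each locus.
two_locus = list(product(range(1, 7), repeat=2))
print(len(two_locus))
```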

Gauderman (reference 4, title page) denotes tests based on the model in Equation 15 by
QTDTM (Quantitative Transmission Disequilibrium Test, with Mating type indicators), and
compares the efficiency of this approach to alternative QTDT tests applied to parent-
offspring data.

2.4 Calculation of Power and Sample Size

QUANTO computes the power for a given number (N) of sampling units, or the required
number of sampling units to achieve a given power. A sampling unit is defined as:

1. A case in the unmatched case-control design (required number of controls in this design
is then given by N×K, see Section 2.3.1),
2. A case-control pair in the matched case-control design,
3. A case-sibling pair in the case-sibling design,
4. A complete case-parent trio in the case-parent design,
5. A single case in the case-only design,
6. A single subject in the independent-individuals design, or
7. A complete parent-offspring trio in the parent-offspring design.

2.4.1 Basic Approach

We assume that the null hypothesis (H0) of interest is:

H0: βg = 0 for a test of the gene only, or
H0: βe = 0 for a test of the environment only, or
H0: βge = 0 for a test of G×E or G×Z interaction, or
H0: βgh = 0 for a test of G×H interaction, or
H0: βg = 0, βge = 0 for a 2 df joint test of genetic main effect and G×E interaction.

The alternative hypothesis (H1) may either be two-sided (e.g. βge ≠ 0) or one-sided (e.g. βge >
0). For the last test listed above (the 2 df joint test), QUANTO assumes a 2-sided alternative
hypothesis. In the situation of testing G×E interaction for a disease trait, we let
Λ = {α, γg, γe, γge} denote the assumed parameter values for the logistic (Equation 1) or
log-linear (Equation 2) model. For a quantitative trait, Λ = {α, βg, βe, βge} under the
main-effects model in Equation 6, while Λ = {α*, β̄g, β̄e, βge} under the marginal-effects
model in Equation 7. The other model parameters are denoted by Ω = {qA, pE, φ}. The specific
parameters included in Λ and Ω will vary with the chosen design and hypothesis being tested.
To estimate power or sample size, we use the following approach, described in the context of
testing for G×E interaction:

1. Maximize the expected log-likelihood L1 = ln[L(βg,βe,βge)], where the expectation is
taken with respect to the distribution of observable genotype and exposure data conditional
on the true parameters Λ and Ω. We denote the expected MLE’s by β̂¹ = {β̂g¹, β̂e¹, β̂ge¹}
and the corresponding maximum of the expected log-likelihood by L̂1. Details of the
computation of these expected values are provided in Sections 2.4.2 (disease trait) and
2.4.3 (quantitative trait).

2. In an analogous fashion, maximize the expected log-likelihood L0 = ln[L(βg,βe)], i.e. the
expectation of the log-likelihood when βge is fixed to its null value of zero. The expected
MLE’s from this model are denoted β̂⁰ = {β̂g⁰, β̂e⁰} and the corresponding maximum of
the expected log-likelihood is denoted L̂0.

3. Define $\Delta = 2(\hat{L}_1 - \hat{L}_0)$, i.e. the likelihood ratio test statistic for a single sampling unit
based on the expected maximum log-likelihoods under H1 and H0. For a given number N
of sampling units, the quantity NΔ is the non-centrality parameter of the chi-squared
distribution under the alternative hypothesis.

4. For a one-sided alternative hypothesis, the required number of sampling units is computed as

$$N = \frac{(z_a + z_b)^2}{\Delta} \tag{16}$$

where a is the significance level, 1 – b is the desired power, and zu denotes the (1 – u)th
percentile of the standard normal distribution. For a two-sided alternative hypothesis, za is
replaced by za/2. The use of Equation 16 presumes a one-degree-of-freedom test of H0, which
results when G and E are each represented by a single covariate.

5. Alternatively, for a given N, power for a 1-sided alternative hypothesis is computed as

$$1 - b = \Phi\left(\sqrt{N\Delta} - z_a\right),$$

where Φ(u) is the cumulative standard normal distribution evaluated at u. For a 2-sided test,

$$1 - b = \Phi\left(\sqrt{N\Delta} - z_{a/2}\right) + \Phi\left(-\sqrt{N\Delta} - z_{a/2}\right).$$

An analogous procedure is used to compute sample size or power for the other types of
hypotheses. For the 2 df joint test of H0: βg = 0, βge = 0, L1 = ln[L(βg,βe,βge)], L0 = ln[L(βe)],
and steps 4 and 5 above are modified as described in Kraft et al. (Human Heredity, 2007).
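
Steps 4 and 5 can be checked numerically once Δ is known. The following Python sketch is illustrative only: the function names are hypothetical, it is not taken from QUANTO itself, and it covers only the 1 df case to which Equation 16 applies.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def z_upper(u):
    """Return z_u, the value with upper-tail probability u."""
    return STD_NORMAL.inv_cdf(1.0 - u)


def required_units(delta, alpha=0.05, power=0.80, two_sided=True):
    """Equation 16: number of sampling units for a 1 df test."""
    a = alpha / 2.0 if two_sided else alpha
    b = 1.0 - power
    return (z_upper(a) + z_upper(b)) ** 2 / delta


def power_for_units(n, delta, alpha=0.05, two_sided=True):
    """Step 5: power for n sampling units, given per-unit delta."""
    root = sqrt(n * delta)
    if two_sided:
        z = z_upper(alpha / 2.0)
        return STD_NORMAL.cdf(root - z) + STD_NORMAL.cdf(-root - z)
    return STD_NORMAL.cdf(root - z_upper(alpha))


if __name__ == "__main__":
    delta = 0.01  # hypothetical per-unit likelihood ratio statistic
    n = required_units(delta)
    print(round(n, 1), round(power_for_units(n, delta), 4))
```

In practice the computed N would be rounded up to the next whole sampling unit. Feeding that N back into the power function gives a simple consistency check between the two formulas.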

2.4.2 Calculation of the expected log-likelihood: Disease Trait

The key step in the calculation of power or required sample size is the computation of the
expected log-likelihood. The form of the expected log-likelihood is

$$E(\ln[L(\beta)]) = \sum_{g} \sum_{E} \ln[L(\beta; G, E)] \, f(g, E \mid D, \Lambda, \Omega) \tag{17}$$

where the summation is over all possible observable genotypes (g) and exposure (E) in a
sampling unit. For gene-gene interaction, the summation over E is replaced by a summation
over h. For a continuous environmental covariate, the summation over E is replaced with an
integral over the distribution of Z. The factor L(β;G,E) is the contribution to the likelihood in
Equation 11, 12, 13, or 14 (depending on the design) for a sampling unit with specific
genotypes g and exposures E. The factor f(g, E | D, Λ, Ω) is the probability distribution for
the observable data in a sampling unit, conditional on the disease-based ascertainment rule
and the assumed model parameters. For a case-control or case-sib study, D = {D1, D2} with
the disease status D1 = 1 in the case and D2 = 0 in the control. In the case-parent and case-
only designs, D = D1, i.e. only the disease status in the case.

The joint density of genotypes and exposures has the form:

$$\Pr(g, E \mid D, \Lambda, \Omega) = \frac{\Pr(D \mid G, E, \Lambda) \, \Pr(g \mid \Omega) \, \Pr(E \mid \Omega)}{\sum_{g} \sum_{E} \Pr(D \mid G, E, \Lambda) \, \Pr(g \mid \Omega) \, \Pr(E \mid \Omega)} \tag{18}$$

The first factor in the numerator of Equation 18 is the disease-risk function, which we
assume is based on the logistic model in Equation 1 for the case-control and case-sib designs,
and on the log-linear model in Equation 2 in the case-parent and case-only designs. We make
the assumption that disease status is independent within each sampling unit, conditional on
the genotypes and exposures, so that Pr(D | G, E, Λ) = Πi Pr(Di | Gi, Ei, Λ). Substitution of
Equation 18 into Equation 17 yields the following expression for the expected log-likelihood:

$$E(\ln[L(\beta)]) = \frac{\sum_{g} \sum_{E} \ln[L(\beta; G, E)] \, \Pr(D \mid G, E, \Lambda) \, \Pr(g \mid \Omega) \, \Pr(E \mid \Omega)}{\sum_{g} \sum_{E} \Pr(D \mid G, E, \Lambda) \, \Pr(g \mid \Omega) \, \Pr(E \mid \Omega)} \tag{19}$$
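
To make the structure of Equation 18 concrete, the following Python sketch evaluates the joint density of genotype and exposure for the simplest sampling unit, a single case. It is an illustration under stated assumptions rather than QUANTO's implementation. It assumes Hardy-Weinberg genotype frequencies, a binary exposure independent of genotype, a dominant coding of G, and a log-linear risk of the form P0 × Rg^G × Re^E × Rge^(G·E). The function names and the numerical inputs are hypothetical.

```python
from itertools import product


def genotype_probs(q_a):
    """Hardy-Weinberg probabilities for 0, 1 and 2 copies of allele A."""
    return {0: (1 - q_a) ** 2, 1: 2 * q_a * (1 - q_a), 2: q_a ** 2}


def case_only_density(q_a, p_e, p0, r_g, r_e, r_ge):
    """Pr(g, E | D = 1) for one case, following the form of Equation 18."""
    exposure_probs = {0: 1 - p_e, 1: p_e}
    weights = {}
    for (g, pr_g), (e, pr_e) in product(genotype_probs(q_a).items(),
                                        exposure_probs.items()):
        carrier = 1 if g >= 1 else 0  # assumed dominant coding of G
        # Assumed log-linear disease risk for this genotype and exposure.
        risk = p0 * r_g ** carrier * r_e ** e * r_ge ** (carrier * e)
        weights[(g, e)] = risk * pr_g * pr_e  # numerator of Equation 18
    total = sum(weights.values())             # denominator of Equation 18
    return {key: w / total for key, w in weights.items()}


if __name__ == "__main__":
    density = case_only_density(q_a=0.2, p_e=0.3, p0=0.01,
                                r_g=2.0, r_e=1.5, r_ge=1.8)
    for (g, e), pr in sorted(density.items()):
        print(g, e, round(pr, 4))
```

When all three effect sizes equal 1, the density reduces to the population distribution Pr(g)Pr(E), which is a convenient check on the normalization.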

Additional components of Equation 19 are now described for each of the five study designs.
In all of the following, we let the subscripts 1, 2, m, and f denote the case, control, mother,
and father respectively, and we assume the environmental factor is a binary indicator of
exposure.

1. Unmatched Case-control: Assuming 1 control per case, the random variables include
g = {g1, g2} and E = {E1, E2}. Since case and controls are assumed to be unrelated,
Pr(g|Ω) = Pr(g1|qA)Pr(g2|qA) and Pr(E|Ω)=Pr(E1|pE)Pr(E2|pE). Furthermore, the
likelihood contributions for a case and control are also distinct (see Section 2.3.1),
and so the above summations can be broken into separate summations for the case
and for the control. For a 1:K design (K controls per case), the contribution for the
control is simply multiplied by K.

2. Matched Case-control: The random variables include g = {g1, g2} and E = {E1, E2},
so that there are 3 × 3 × 2 × 2 = 36 possible joint covariate profiles for the pair. Since case
and controls are assumed to be unrelated, Pr(g|Ω) = Pr(g1|qA)Pr(g2|qA) and
Pr(E|Ω)=Pr(E1|pE)Pr(E2|pE). Unlike the unmatched design, the likelihood
contribution is specific to the case-control pair, rather than to each individual.

3. Case-sib: The random variables again include g = {g1, g2} and E = {E1, E2}, and there
are 3 × 3 × 2 × 2 = 36 possible joint covariate profiles. The joint distributions of
genotypes,

$$\Pr(g \mid \Omega) = \sum_{g_m} \sum_{g_f} \Pr(g_1 \mid g_f, g_m) \, \Pr(g_2 \mid g_f, g_m) \, \Pr(g_m \mid q_A) \, \Pr(g_f \mid q_A),$$

and of exposures, Pr(E|Ω) = Pr(E1|E2, φ)Pr(E2|pE), are the determinants of power
differences relative to the matched case-control design.

4. Case-parent: The random variables include three genotypes g = {g1, gm, gf} and one
exposure E = {E1}, so that in theory there would be 3 × 3 × 3 × 2 = 54 possible
covariate profiles. However, the joint distribution
Pr(g|Ω) = Pr(g1| gm,gf) Pr(gm | qA) Pr(gf | qA) is nonzero for only 10 of the 27
combinations of joint genotypes, due to Mendelian rules of gene transmission. There
are thus only 20 observable joint genotype and exposure profiles in this design. The
exposure distribution is simply Pr(E|Ω) = Pr(E1|pE).

5. Case-only: The random variables include only the data for the case, i.e. g = {g1} and
E = {E1}. The corresponding probability distributions are given by
Pr(g | Ω) = Pr(g1 | qA) and Pr(E | Ω)=Pr(E1 | pE).
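
The genotype counts quoted for the case-parent design (item 4) can be reproduced by enumeration. The Python sketch below is an illustration only; the names are hypothetical and genotypes are coded as the number of copies of allele A. It suggests that the figure of 10 corresponds to treating the two parents as interchangeable. Keeping mother and father distinct gives 15 Mendelian-compatible combinations out of the 27.

```python
from itertools import product


def transmission_probs(gm, gf):
    """Pr(offspring genotype | parental genotypes) under Mendelian transmission."""
    tm, tf = gm / 2, gf / 2  # probability that each parent transmits allele A
    return {
        0: (1 - tm) * (1 - tf),
        1: tm * (1 - tf) + (1 - tm) * tf,
        2: tm * tf,
    }


# All (offspring, mother, father) combinations with nonzero probability.
ordered_trios = [
    (g1, gm, gf)
    for gm, gf in product(range(3), repeat=2)
    for g1, prob in transmission_probs(gm, gf).items()
    if prob > 0
]

# Collapse trios that differ only in which parent carries which genotype.
unordered_trios = {(g1, tuple(sorted((gm, gf)))) for g1, gm, gf in ordered_trios}

# Cross with the binary exposure of the case.
profiles = [(trio, e) for trio in unordered_trios for e in (0, 1)]

print(len(ordered_trios), len(unordered_trios), len(profiles))
```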

2.4.3 Calculation of the expected log-likelihood: Quantitative Trait

In both the independent-individuals and parent-offspring designs, study subjects are assumed
to be sampled randomly from the population. Thus, the random variables include the
genotype g (and possibly h) and environmental factor E as in the disease-based design, but
also the outcome variable Y. The expected log-likelihood is a function of the intercept(s),
slopes, and residual variance. For the independent-individuals design, this expectation has
the form

$$E(\ln[L(\alpha, \beta, \sigma_R)]) = \sum_{g} \sum_{E} \int_{Y} \ln[L(\alpha, \beta, \sigma_R; Y, g, E)] \, f(Y, g, E \mid \Lambda, \Omega) \, dY$$

For the parent-offspring design, this expectation has the form

$$E(\ln[L(\alpha_M, \beta, \sigma_R)]) = \sum_{g_m, g_f, g} \sum_{E} \int_{Y} \ln[L(\alpha_M, \beta, \sigma_R; Y, g_m, g_f, g, E)] \, f(Y, g_m, g_f, g, E \mid \Lambda, \Omega) \, dY$$

i.e. α is replaced by αM and the first summation is expanded to encompass the genotypes of the
mother, father, and offspring.

3. Using QUANTO
The following describes each menu and dialog of QUANTO.

3.1 Main Menu

File: Standard file-handling options.
Parameters: Specification of model parameter values.
Wizards: Guides you through the various choices under ‘Parameters’ (optional).
View: Choices for what is viewed on screen.
Help: Program information and on-screen viewing of this documentation file.

3.2 Parameters

This is the primary menu in QUANTO, where design choices and parameter settings are
specified. Each menu choice is described briefly below, and then more fully in the
subsequent sections.

The section that describes each choice in detail is given in parentheses.

Outcome/Design: The outcome and study design (3.2.1)
Hypothesis: The hypothesis to be tested (3.2.2)
Gene G: The distribution of G in the population (3.2.3)
Environment: The distribution of environment (E or Z) in the population (3.2.4)
Gene H: The distribution of H in the population (3.2.5)
Outcome Model: The parameters of the trait model (3.2.6)
Power: Choose N or power, and specify significance level (3.2.7)
Calculate: Calculate power or sample size for the current parameter settings
(or click the blue & white calculator button on the toolbar)

3.2.1 Outcome/Design

Choose either ‘Disease’ or ‘Continuous’ outcome, and the specific design you will use for
that outcome. Choosing the unmatched case-control design will also call the following
dialog:

In this dialog, enter the number of controls per case. The default value of 1 assumes that equal
numbers of cases and controls will be sampled. Decimal values are allowed. As an example,
providing a value of 2 corresponds to a 1:2 case:control ratio, while a value of 0.5 would
correspond to a 2:1 case:control ratio.

3.2.2 Hypothesis

Choose a test of the main effect of the gene (gene-only), main effect of the environmental
factor (environment only), interaction of a gene and environmental factor (G×E), or
interaction of two genes (G×G). For the latter two, sample size or power will also be given
for the test of the main effect of each component under the specified model.

3.2.3 Gene G

Available when ‘Gene only’, ‘Gene-Environment interaction’, or ‘Gene-Gene interaction’ is
selected under ‘Hypothesis’. Calls the following dialog:

Enter the frequency of the high risk allele (‘A’) in the population and the assumed
inheritance model. The values in the ‘Susceptibility Frequency’ box are computed
automatically based on the allele frequency and inheritance model. For example, in the
above settings, approximately 2% of individuals in the population carry a high risk genotype
(Aa or AA under a dominant model) given an allele frequency of 1%.

You can optionally provide a range of values for qA by providing a larger value in the ‘to’
box and a non-zero stepsize in the ‘by’ box. QUANTO will compute N/power for each value
described by this range and stepsize. The susceptibility frequencies, however, are only
shown for the primary allele frequency (in the left hand box).
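
The relationship between allele frequency and susceptibility frequency can be sketched in a few lines of Python. This is an illustration that assumes Hardy-Weinberg equilibrium and covers only the dominant and recessive models mentioned in this manual's examples. The function name is hypothetical and is not part of QUANTO.

```python
from math import sqrt


def susceptibility_frequency(q_a, model):
    """Proportion of the population carrying a high-risk genotype."""
    if model == "dominant":    # Aa or AA are at risk
        return 1 - (1 - q_a) ** 2
    if model == "recessive":   # only AA is at risk
        return q_a ** 2
    raise ValueError("model must be 'dominant' or 'recessive'")


if __name__ == "__main__":
    # Allele frequency of 1% under a dominant model: about 2% carriers.
    print(round(susceptibility_frequency(0.01, "dominant"), 4))
    # Going the other way for a recessive model: 40% at-risk genotypes.
    print(round(sqrt(0.40), 4))
```

The second calculation inverts the recessive formula, which is how an allele frequency can be obtained when only the proportion of at-risk homozygotes is known.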

3.2.4 Environment

Available when ‘Environment only’ or ‘Gene-Environment interaction’ is selected under
‘Hypothesis’. Calls the following dialog:

Choose a ‘Binary’ (E) or ‘Continuous’ (Z) environmental factor. Depending on the choice,
enter the following information:

Binary: Enter the proportion of subjects exposed in the population (pE). For the case-
sib design, also enter the sibling exposure-sharing odds ratio (φ). The value of
φ must be greater than zero. Setting φ=1 indicates no correlation in exposure
between siblings.

Continuous: Enter the standard deviation of the covariate. QUANTO assumes the
covariate is normally distributed with mean zero and standard deviation σz.
The mean of the covariate does not influence power or sample size
calculations. You may optionally provide the units (e.g. centimeters, inches,
meters) relevant to the value provided for σz. This unit label does not
influence sample size or power calculations; it is only for your reference. For
the case-sib design, also enter the correlation in the covariate between siblings
(ρ). The range of possible values is –1 < ρ < 1. Setting ρ=0 indicates no
correlation in exposure between siblings.

3.2.5 Gene H

Available when ‘Gene-Gene interaction’ is selected under ‘Hypothesis’. Calls the following
dialog:

Enter the frequency of the high risk allele (‘B’) in the population and the assumed inheritance
model. The values in the ‘Susceptibility Frequency’ box are computed automatically based
on the allele frequency and inheritance model.

3.2.6 Outcome Model

Calls a dialog specific to either a disease (Section 3.2.6.1) or continuous (Section 3.2.6.2)
outcome.

3.2.6.1 Disease Risk Parameters

Baseline Risk/Population Risk: Choose one of these and provide a value. Baseline risk (P0)
is defined following Equations 1 and 2 (Section 2.2.1), while the population risk (Kp) is
defined in Equation 3 (Section 2.2.2).

Environmental Effect: Choose whether to specify the Main or Marginal effect. This effect
represents the odds ratio in the case-control (matched or unmatched) or case-sibling design,
and the relative risk in the case-parent or case-only design. ‘Main’ refers to the
environmental effect in genetically normal (G=0) subjects (i.e. Re, see Equations 1 and 2 in
Section 2.2.1). ‘Marginal’ refers to the environmental effect averaged over the distribution
of genotypes ($\bar{R}_e$, see Equation 4, Section 2.2.2). The value provided must be greater than
zero. A few notes:

• The choice of Main or Marginal is constrained to be the same for the Environmental and
Genetic Effects.

• When Marginal is selected, Baseline Risk/Population Risk is constrained to be Population
Risk.

• When gene-gene interaction is selected under ‘Hypothesis’, ‘Environmental Effect’ will
be replaced by ‘Gene H Effect’.

• When a continuous environmental factor (Z) is assumed, the effect size provided should
be relative to an increase of 1 unit on the scale of measurement on which the standard
deviation σz was given.

Genetic Effect: Choose either Main or Marginal. ‘Main’ refers to the genetic effect in
unexposed (E=0) subjects (i.e. Rg, see Equations 1 and 2 in Section 2.2.1). ‘Marginal’ refers
to the genetic effect averaged over the distribution of the environmental factor ($\bar{R}_g$, see
Equation 5, Section 2.2.2). The value provided must be greater than zero. When ‘Gene-
Gene interaction’ is selected, this refers to the effect of Gene G.

Interaction: The assumed magnitude of the interaction effect (Rge, see Equations 1 and 2 in
Section 2.2.1). The value provided must be greater than zero. You can optionally provide a
range of values for Rge by providing a larger value in the ‘to’ box and a non-zero stepsize in
the ‘by’ box. QUANTO will compute N/power for each value described by this range and
stepsize.

OR Summary (RR Summary for Case-Parent or Case-only design): Each box provides
the OR (or RR) for the corresponding genotype/exposure combination, relative to unexposed
(E=0) noncarriers (G=0), based on the assumed values of the disease-risk parameters.

Update: Updates the quantities in the OR or RR Summary box. In the case-parent and case-
only designs, QUANTO checks that the disease-risk probabilities do not exceed 1.0 for any
combination of gene and/or environment. If a continuous environmental covariate is
assumed, QUANTO checks that the disease-risk probabilities do not exceed 1.0 when Z is
four standard deviations from its mean. If this range check fails, a message will appear and
you will be required to modify one or more parameter values before proceeding.

OK: Updates the parameters and exits the dialog.

3.2.6.2 Continuous Trait Parameters

Population Mean: Provide the mean of Y in the population. This value has no effect on
N or power calculations.

Population Std. Dev: The population standard deviation of Y. This value has no effect on N
or power calculations if R-squareds are provided, but it will have an
effect if β’s are specified.

Choose either ‘Main effect’, ‘Marginal effect’, or ‘Marginal R Squared’

Main Effect: Provide the values of βg, βe, and βge under the main effects model (see
Equation 6).

Marginal Effects: Provide the values of $\bar{\beta}_g$, $\bar{\beta}_e$, and βge under the marginal effects
model (see Equation 7).

Marginal R Squared: Provide the values of Rg², Re², and Rge² based on the marginal effects
model (e.g. see Equation 10).

You can optionally provide a range of values for the interaction parameter (βge or Rge²) by
providing a larger value in the ‘to’ box and a non-zero stepsize in the ‘by’ box. QUANTO
will compute N or power for each value described by this range and stepsize.

Summary Box: Summarizes the expected values of Y for various combinations of G
and E. These expectations are based on the specified β’s or R-
squareds, and the underlying values of the distributional parameters
for G and E (e.g. qA and pE). The overall mean reported in the lower
right box (e.g. 100 in above dialog) is always equal to the ‘Population
Mean’ you specify. The difference in marginal means for E (e.g.
109.49 – 98.95) is given by the specified value of $\bar{\beta}_e$ (e.g. 10.54), and
similarly for the G marginal means. The values within the 2×2 box
are determined by the values of βg, βe, and βge and Equation 7. For
example, in the case of a binary E and dominant G as in the above
dialog, these values are given by:

                      G
               aa            Aa,AA
    E    0     α             α + βg
         1     α + βe        α + βg + βe + βge

where α is determined so that the overall mean is equal to the
specified ‘Population Mean’.

Update: Updates the quantities in the Summary box and checks for inconsistencies.

OK: Updates the parameters, checks for inconsistencies, and exits the dialog.

The dialog is slightly different for the Gene Only and Environment Only models: it offers
only the main effect and R-squared options, and it provides boxes for entering a range of
values for each of these.
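
The way α is pinned down in the Summary box can be illustrated with a short Python sketch. It is a sketch under assumptions, not QUANTO's code. It treats the β's as the coefficients that appear in the 2×2 table above, assumes G (coded 0/1 for noncarrier/carrier) and binary E are independent in the population, and uses hypothetical names and input values.

```python
def summary_cells(pop_mean, p_g, p_e, beta_g, beta_e, beta_ge):
    """Return alpha and the four cell means of the 2x2 Summary box.

    p_g is the proportion of carriers (G=1), p_e the proportion exposed (E=1).
    """
    # Choose alpha so that the population-weighted average of the four
    # cells equals the requested population mean.
    alpha = pop_mean - (beta_g * p_g + beta_e * p_e + beta_ge * p_g * p_e)
    cells = {
        (g, e): alpha + beta_g * g + beta_e * e + beta_ge * g * e
        for g in (0, 1)
        for e in (0, 1)
    }
    return alpha, cells


if __name__ == "__main__":
    alpha, cells = summary_cells(pop_mean=100.0, p_g=0.19, p_e=0.3,
                                 beta_g=5.0, beta_e=8.0, beta_ge=4.0)
    for (g, e), mean in sorted(cells.items()):
        print(f"G={g} E={e}: {mean:.2f}")
```

Averaging the four cells with weights Pr(G)Pr(E) recovers the population mean, which mirrors the constraint described for the lower right box.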

3.2.7 Power

Calls one of the following dialogs:

The left dialog is called if the chosen hypothesis is ‘Gene only’, ‘Environment only’, or
‘Gene-gene interaction’, while the right dialog is called if the chosen hypothesis is ‘Gene-
environment interaction’. See below for a description of the 2 df test.

Power or Sample Size: Select one of these. If ‘Power’ is selected, required sample size (N)
will be computed for the specified power. If ‘Sample Size’ is selected, power will be
computed for the specified N. Power must be between 0.0 and 1.0, and sample size must be
greater than zero. Sample size refers to the number of sampling units, where a sampling unit
is defined for each design in Section 2.4. You can optionally provide a range of values for
either N or power by providing a larger value in the ‘to’ box and a non-zero stepsize in the
‘by’ box. QUANTO will compute N/power for each value described by this range and
stepsize.

Type I error rate: Specify the desired type I error rate, which must be between 0.0 and 1.0.
Specify whether the alternative hypothesis is 1- or 2-sided.

Perform 2 df test: Available only when the chosen hypothesis is ‘Gene-environment
interaction’. If this box is checked, QUANTO will compute N or power for the 2 df test of
the joint null hypothesis H0: βg = 0 and βge = 0, in addition to the 1 df tests that are normally
produced. You must assume a 2-sided alternative hypothesis if you choose to do this 2 df
test. For additional details of this test, see Kraft et al., Human Heredity, 2007.

3.3 Wizards

The wizards guide you automatically through the options under ‘Parameters’. Choose ‘New’
to initialize parameter values to the program defaults. Choose ‘Modify’ to initialize
parameter values to those used in the last calculation.

3.4 View

Toolbar: Option to view the toolbar at the top of the screen.
Status Bar: Option to view the status bar at the bottom of the screen.
Current Settings: The default window that shows the current model parameter settings and
most recent calculation of N or power. Each model and corresponding
calculations are automatically appended to the session log.
Session Log: A continuous log of all model settings and calculations during a
QUANTO session. There are a few new options in version 1.1
including:
• Display is automatically moved to the most recent results as new
models are executed
• The ‘Home’ and ‘End’ keys can be used to move to the beginning and
end, respectively, of the log window
• Results in the log window can be cleared at any time during a session
by choosing ‘Clear Log’ under the File menu. Note that cleared
results cannot be recovered, so make sure to save the log results if
you will want them at a future time.
Font: Select the font size used by a printer. This will also typically change the
font size in the display window. However, since the screen display is
machine dependent, changing the font size may not change the displayed
font.

3.5 Help

Documentation: Opens Adobe Acrobat and displays this documentation file.
About Quanto: Displays the version number and acknowledgments.

Adobe Acrobat or Acrobat Reader needs to be installed to view the documentation. Acrobat
Reader is available free of charge at http://www.adobe.com. An error message will be
displayed if the program is unable to find Acrobat Reader on the computer.

4. Examples
4.1 Testing the main effect of a gene (G) for a disease trait

Suppose we are planning a study to test for association between a disease and a gene. As an
example, let’s assume the disease is asthma and the gene is Glutathione S-Transferase M1
(GSTM1). We will make the following assumptions:

1. The prevalence of the disease in the population (Kp) is 10%.

2. There are two alleles at the GSTM1 locus, denoted ‘Null’ (allele ‘A’) and ‘non-Null’
(allele ‘a’). It is thought that only those with the ‘Null/Null’ genotype are at increased
risk for disease. The proportion of subjects in the population with the Null/Null genotype
is estimated to be 40%. Assuming Hardy-Weinberg equilibrium, the frequency of the
‘Null’ allele in the population is then qA = √0.40 = 0.6325. The inheritance model is
recessive, since only the homozygous carrier is assumed to be at increased risk.

3. The relative risk for Null/Null carriers, compared to normals, is 2.0 (Rg).

4. Desired power is 80%, at a significance level of 0.05 with a 2-sided alternate hypothesis.

5. A matched case-control design will be used.

The following steps would be used to compute the required number of case-control pairs for
such a study:

• Choose ‘Disease’ and ‘Case-control (Matched)’ under Outcome/Design and ‘Gene-only’
under Hypothesis on the Parameters menu.

• Select ‘Gene G’ and supply the following values:

• Select ‘Outcome Model’ and supply the following values:

Note: This specifies that calculations will be performed only for a genetic odds ratio of 2.0,
since the values in the ‘to’ and ‘by’ boxes do not describe a range of values.

• Select ‘Power’ and supply the following values:

• Finally, choose ‘Calculate’ or click the blue calculator on the tool bar to request the
sample size computations. The results will look like:

Thus, we will need to sample 135 case-control pairs to achieve the desired power for
detecting a true genetic odds-ratio of 2.0 in this setting.

A few notes:

1. This result will automatically be copied to the log window. To view the log window,
choose ‘View’ and select ‘Session Log’.

2. To recalculate sample size for another design, hypothesis, and/or parameter setting, make
the desired change(s) and click on the calculator. Once a change is made, the previous
sample size calculation will disappear from the main ‘Settings’ window (it is still saved
in the Session Log). The Settings window always displays the current settings of all
model parameters.

3. The sample size provided always refers to the number of sampling units required (see
Section 2.4). For example, changing the Design from ‘Case-control’ to ‘Case-parent’ in
the above situation, and recalculating sample size, would yield:

Thus, in this design, 119 case-parent trios would be required.

4. You may save the current settings of your model by choosing ‘Save’ or ‘Save As’ under
the File menu and providing a file name. The default extension ‘.qpp’ will automatically
be applied to any filename you provide. The contents of the Session Log at the time you
save your model settings will also be saved as part of the ‘.qpp’ file. The format of a
‘.qpp’ file is specific to QUANTO and will not be readable by other programs.

5. You may also save the contents of your Session Log by choosing the ‘Save Log’ option and
providing a filename. The default extension ‘.txt’ will be applied. This file is a standard
text (ASCII) file that is readable by other software, including any word processor.

4.2 Specifying a range of effect sizes

Consider the case-control study described in Example 4.1, but suppose we want to obtain
sample sizes for a range of genetic odds ratios, say from 1.25 to 3.0 in increments of 0.25.
To achieve this, we would modify the ‘Outcome Model’ dialog to

Clicking the calculator will produce:

This option is useful for finding a minimum detectable effect for a given sample size. You
may also provide a range of values for power (or N) to obtain calculations across ranges of
both effect size and power (or N).
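
As an illustration of what a ‘from / to / by’ specification describes, the following Python sketch lists the values implied by the range in this example. How QUANTO itself expands a range, for instance when the stepsize does not divide the interval evenly, is not described here, so the behavior below is an assumption. The function name is hypothetical.

```python
def value_grid(start, stop, step):
    """List the values described by a 'from / to / by' range.

    If the 'to' and 'by' entries do not describe a range, only the
    starting value is returned.
    """
    if step <= 0 or stop <= start:
        return [start]
    # Small tolerance guards against floating-point round-off in the count.
    count = int((stop - start) / step + 1e-9) + 1
    return [start + i * step for i in range(count)]


if __name__ == "__main__":
    print(value_grid(1.25, 3.0, 0.25))  # the eight odds ratios in this example
    print(value_grid(2.0, 0.0, 0.0))    # a single value, as in Example 4.1
```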

4.3 Testing the main effect of a binary environmental factor (E)

Suppose that instead of a gene, we want to test for association between asthma and an
environmental risk factor, e.g. environmental tobacco smoke (ETS) exposure. We’ll assume
30% of subjects in the population are exposed, and that the odds of asthma are 1.5 times
higher for an exposed subject compared to an unexposed subject (i.e. Re=1.5). We again
assume a matched case-control design will be used. The following steps would be used to
compute the required number of case-control pairs for such a study:

• Choose ‘Disease’ and ‘Case-control (Matched)’ under Outcome/Design, and
‘Environment-only’ under Hypothesis.

• Select ‘Environment’ and supply the following values:

• Select ‘Outcome model’ and supply Kp=0.10 and Re=1.5.

• Choose 80% power, 0.05 significance level, and a 2-sided alternate as above.

• Clicking the calculator will then produce the following:

Thus, 433 case-control pairs would be needed to achieve 80% power under this setting.

4.4 Testing G×E interaction for a disease trait

Assume that in addition to testing the main effects of GSTM1 and ETS on asthma, we also
want to test for GSTM1 × ETS interaction. Assume again that we will use a matched case-
control design and that the allele frequency and exposure prevalence are as described above
in Sections 4.1 and 4.3. We will assume an interaction odds ratio of 1.8 (Rge = 1.8). The
following steps would be used to compute the required number of case-control pairs for such
a study:

1. Use the settings for allele frequency, inheritance model, and exposure prevalence
described in Sections 4.1 and 4.3.

2. Select ‘Case-control (Matched)’ from Outcome/Design and ‘G×E interaction’ from Hypothesis.

3. Specify the following values in the ‘Outcome model’ dialog:

A few notes about these settings:

• The population disease risk is still constrained to be 10% (Kp=0.10).

• In Section 4.1, we made the assumption that the genetic odds ratio was 2.0, and in
Section 4.3 that the environmental odds ratio was 1.5. In the settings above, we
choose ‘Marginal’ to constrain the marginal effects of these factors to 2.0 and 1.5,
respectively (i.e. $\bar{R}_g$ = 2.0 and $\bar{R}_e$ = 1.5). Under these constraints, the ‘OR
summary’ table shows that the genetic odds ratio in unexposed subjects is Rg =
1.62 and the environmental odds ratio in genetic noncarriers is Re = 1.09. The
odds ratio for exposed carriers (shown in the lower right corner of the OR
summary table) is computed as Rg×Re×Rge = 1.62 × 1.09 × 1.8 = 3.19.

• If desired, you can specify the values of Re and Rg by selecting the ‘Main’ button.
QUANTO will then compute the marginal effects conditional on these values and
the specified value of Rge.

4. Choose 80% power, 0.05 significance level, and 2-sided alternative as above.

5. Clicking the calculator will produce the following:

Thus, 899 case-control pairs are required to detect an interaction effect of 1.8 in the
above setting. When an interaction hypothesis is requested, QUANTO also computes the
required sample sizes for the tests of the marginal effect of each component, in this case
the effects of the Gene by itself and the Environment by itself. The sample sizes for these
tests are based on the size of the marginal effects $\bar{R}_g$ and $\bar{R}_e$, respectively. In this example,
the sample sizes are 135 and 433, respectively, the same as observed in Sections 4.1 and
4.3, a result of our fixing the marginal genetic and environmental effect sizes to the
values specified in those sections.
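
The layout of the ‘OR summary’ table discussed in step 3 can be sketched as follows. This Python fragment is illustrative only and uses a hypothetical function name. It takes the main effects as inputs and does not reproduce QUANTO's conversion from marginal to main effects. With the rounded values Rg = 1.62 and Re = 1.09 quoted above, the product is about 3.18; the 3.19 shown in the dialog presumably reflects unrounded main effects, which is an assumption of this note.

```python
def or_summary(r_g, r_e, r_ge):
    """Odds ratios relative to unexposed noncarriers, keyed by (G, E)."""
    return {
        (0, 0): 1.0,               # reference: noncarrier, unexposed
        (1, 0): r_g,               # carrier, unexposed
        (0, 1): r_e,               # noncarrier, exposed
        (1, 1): r_g * r_e * r_ge,  # carrier, exposed
    }


if __name__ == "__main__":
    table = or_summary(r_g=1.62, r_e=1.09, r_ge=1.8)
    for (g, e), odds_ratio in sorted(table.items()):
        print(f"G={g} E={e}: OR = {odds_ratio:.2f}")
```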

4.5 Testing a continuous environmental factor (Z)

Assume now that the environmental factor (Z) is continuous and normally distributed, rather
than being a binary exposure variable. For example, assume that Z measures the number of
cigarettes smoked by a child’s parent each day. We demonstrate calculation of required
sample size for a test of the environmental factor alone. The following selections will
achieve the task:

1. Select ‘Case-control (Matched)’ and ‘Environment only’.

2. Choose ‘Environment’ and specify the following:

In this situation, we assume that the standard deviation is 15 cigarettes. We provide a
descriptive label to indicate that this standard deviation is measured on the ‘cigarette’
scale of measurement. The units label is for descriptive purposes only; it is not used in
the calculations.

3. Choose ‘Outcome Model’ and specify the following:

This setting specifies that the odds of disease increase by a factor of 1.03 for each
increase of 1 cigarette smoked by the parent.

4. Clicking the calculator will produce a required sample size of 85 matched pairs.

A Note:

You may specify the standard deviation on any scale of measurement you want in the
‘Environment’ dialog. The effect size of Re will always be referable to a 1 unit increase in Z
on that scale of measurement. For example, we could have chosen to measure Z as the
number of packs of cigarettes (20 per pack) smoked each day. The following specifications
of σZ and Re would both produce the same required sample size:

Units                  σZ       Re
----------------------------------------------------------------
per cigarette          15.0     1.03
per pack (20 cigs)     0.75     1.806 (= 1.03^20)
----------------------------------------------------------------

When Z is measured on a ‘per pack’ basis, a 1 unit increase in Z refers to an increase of 1
pack (20 cigarettes). The specified value for Re per pack must then be larger than when Z is
measured on a ‘per cigarette’ basis. It is essential that σZ and Re refer to the same scale
of measurement for Z.
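
The conversion between the two rows of the table can be written as a small Python helper. The function name is hypothetical and the sketch is illustrative only. It also shows that the log odds ratio per standard deviation of Z, σZ × ln(Re), is the same on both scales, which is consistent with both specifications describing the same model.

```python
from math import log


def rescale(sigma_z, r_e, units_per_new_unit):
    """Convert (sigma_Z, Re) to a scale where one new unit = several old units."""
    return sigma_z / units_per_new_unit, r_e ** units_per_new_unit


if __name__ == "__main__":
    sigma_pack, r_pack = rescale(15.0, 1.03, 20)  # cigarettes to packs of 20
    print(sigma_pack, round(r_pack, 3))
    # Log odds ratio per standard deviation on each scale.
    print(round(15.0 * log(1.03), 6), round(sigma_pack * log(r_pack), 6))
```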

4.6 Testing correlation between Z and a quantitative outcome (Y)

Assume that we will collect a sample of independent individuals with the goal of determining
whether a continuous environmental factor (e.g. number of cigarettes smoked) is correlated
with a continuous outcome (e.g. systolic blood pressure, or SBP). We will demonstrate
required N for detecting a hypothesized correlation of R=0.10, using the following selections:

1. Select Continuous and Independent Individuals from Outcome/Design, and
Environment Only from Hypothesis.

2. Specify a continuous exposure in the Environment dialog as described above, with some
standard deviation (e.g. 0.75, on the per pack scale, see Example 4.5).

3. Choose Outcome Model and specify:

Notes:
• The dialog requests the value of the squared correlation between Y and Z. Since we
desire a correlation of R = 0.10, we input the value 0.01 (= 0.10²) in this dialog. The
value of βE (2.67, in grey font) is computed from the value of RE² and the
assumed standard deviations of both Y and Z as $\beta_E = \sqrt{R_E^2 \, \mathrm{Var}(Y) / \mathrm{Var}(Z)}$.

• The values at the top of the dialog summarize the expected mean values of Y for
Z=0.0 and Z=1.0, subject to the constraint that the population mean is the value
provided in the ‘Population Mean’ box.

• The value provided in the ‘Population Mean’ box has no influence on sample
size/power calculations; its use is as a guide for setting the magnitudes of R-squared
or βE that are appropriate for your application.

4. Clicking the calculator will produce:

Thus, we require N = 781 individuals to detect a squared correlation of R² = 0.01.
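
The figure of 781 can be approximately reproduced from Equation 16. The Python sketch below is illustrative and rests on two assumptions that are not stated in this example. First, it takes the per-subject likelihood ratio statistic for a normal linear model to be Δ = −ln(1 − R²). Second, it assumes the power settings are 80% power at a 2-sided significance level of 0.05, as in the earlier examples. The function name is hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def n_for_r_squared(r2, alpha=0.05, power=0.80):
    """Sample size from Equation 16 with delta = -ln(1 - R^2) (assumed)."""
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)  # 2-sided critical value
    z_b = std_normal.inv_cdf(power)
    delta = -log(1 - r2)
    return (z_a + z_b) ** 2 / delta


if __name__ == "__main__":
    print(ceil(n_for_r_squared(0.01)))
```

Because Δ depends only on R² in this sketch, the result does not change with the standard deviations of Y or Z.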

Additional Notes:

• Sample size/power in the independent-individuals design is determined by the size of
R². Thus, changing the standard deviations of Z or Y will not affect the above sample
size requirement, provided RE² is held to 0.01. Likewise, changing to a binary exposure
variable (E) will not affect the required N, again provided RE² is held to 0.01.

• The values provided for the standard deviations of Z and Y will impact sample
size/power calculations if you choose to specify a value for $\beta_E$ (rather than $R_E^2$) on the
Outcome Dialog.
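
Both notes follow from the simple-regression identity R^2 = beta^2 Var(Z) / Var(Y). The sketch below illustrates this with hypothetical helper names. The standard deviation of 20 for Y is an assumption made here for illustration only. It was chosen because, together with SD(Z)=0.75 and R^2=0.01, it gives the slope of 2.67 shown in the dialog.

```python
import math


def beta_from_r2(r2, sd_z, sd_y):
    """Slope implied by a squared correlation r2 and the two standard deviations."""
    return math.sqrt(r2) * sd_y / sd_z


def r2_from_beta(beta, sd_z, sd_y):
    """Squared correlation implied by a slope and the two standard deviations."""
    return (beta * sd_z / sd_y) ** 2


SD_Z = 0.75   # from step 2 of this example (per-pack scale)
SD_Y = 20.0   # ASSUMED value for illustration, not stated in the text

# Specifying R^2: the slope adapts to the standard deviations, R^2 stays at 0.01.
print(round(beta_from_r2(0.01, SD_Z, SD_Y), 2))

# Specifying beta: R^2 (and therefore N) now depends on the standard deviations.
print(round(r2_from_beta(2.67, SD_Z, SD_Y), 4))
print(round(r2_from_beta(2.67, 2 * SD_Z, SD_Y), 4))
```

With the slope held fixed, doubling SD(Z) multiplies the implied R^2 by four, which is why the standard deviations matter only when the slope rather than R^2 is specified.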

4.7 Testing G×E interaction for a quantitative outcome

Assume that we will collect a sample of independent individuals with the goal of determining
whether a gene (e.g. G=GSTM1) and environmental factor (e.g. E=ETS) interact in their
effect on lung function (e.g. Y=forced expiratory volume in the first second, FEV1) in a
sample of 10-year old children. Assume the following settings:

1. GSTM1 is recessive with allele frequency qA=0.6325 (see Example 4.1).

2. ETS is a binary exposure variable with 30% exposed (see Example 4.3).

3. The population mean FEV1 in 10-year old children is 2,000 milliliters (2 liters), with
standard deviation 300 milliliters.

4. The marginal difference in mean FEV1 between GSTM1 carriers and noncarriers is 50
milliliters, i.e. in the general population, carriers have mean FEV1 that is 50 ml lower
than noncarriers.

5. The marginal difference in mean FEV1 between ETS exposed and unexposed is 75
milliliters, i.e. in the general population, ETS exposed subjects have mean FEV1 that is
75 ml lower than unexposed subjects.

6. The interaction effect size is 100 milliliters, i.e. subjects that are both genetically
susceptible and exposed to ETS have an additional deficit of 100 milliliters in FEV1
beyond the deficits predicted from the main (not marginal) effects of GSTM1 and ETS.

To calculate sample size, use the following steps:

• Choose ‘Continuous’ and ‘Independent-individuals’ under Outcome/Design

• Choose ‘Gene-Environment Interaction’ under Hypothesis.

• Under Gene G, set the allele frequency to 0.6325 and choose ‘Recessive’ (see Example
4.1).

• Under Environment, set the exposure prevalence to 0.3 (see Example 4.3).

• Under Outcome Model, provide the following settings:

Conditional on the assumed population mean of 2,000 ml, standard deviation of 300 ml, and
the assumed distributions of G and E, the Summary table at the top shows:

• A marginal difference of –50 ml by G (1,970 vs. 2,020).
• A marginal difference of –75 ml by E (1,947.5 vs. 2,022.5).
• A main effect of G (i.e. G effect when E=0) of –20 ml (2,010.5 vs. 2,030.5).
• A main effect of E (i.e. E effect when G=0) of –35 ml (1,995.5 vs. 2,030.5).
• An interaction effect of –100 ml, based on the comparison of expected mean Y when
G=1 and E=1 (1,875.5) to the mean when G=0 and E=0 (2,030.5). This difference in
mean is –155 ml, which is 100 ml (the interaction effect size) lower than what would be
predicted from the sum of the main effects of G and E (–20 + –35 = –55 ml).
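
The Summary values can be checked by hand. The following sketch is an illustration, not the program's own code. It assumes that G and E are independent in the population, that the recessive carrier prevalence is the squared allele frequency, and that the main effects are obtained from the marginal effects by subtracting the average interaction contribution. All variable names are hypothetical.

```python
# Settings taken from this example.
q_a = 0.6325                 # allele frequency
p_g = q_a ** 2               # recessive model: P(G=1) = q^2, about 0.40
p_e = 0.30                   # exposure prevalence
pop_mean = 2000.0            # population mean FEV1, ml
marg_g, marg_e = -50.0, -75.0   # marginal effects, ml
b_ge = -100.0                # interaction effect, ml

# Main effects implied by the marginal effects (assumes G and E independent).
b_g = marg_g - p_e * b_ge    # about -20
b_e = marg_e - p_g * b_ge    # about -35

# Intercept chosen so that the weighted mean over all four cells is pop_mean.
intercept = pop_mean - (p_g * b_g + p_e * b_e + p_g * p_e * b_ge)


def cell_mean(g, e):
    """Expected FEV1 (ml) for genotype class g and exposure class e (0 or 1)."""
    return intercept + b_g * g + b_e * e + b_ge * g * e


for g in (0, 1):
    for e in (0, 1):
        print(f"G={g} E={e}: {cell_mean(g, e):.1f}")

# Marginal means by G, averaging over E.
mean_g1 = (1 - p_e) * cell_mean(1, 0) + p_e * cell_mean(1, 1)
mean_g0 = (1 - p_e) * cell_mean(0, 0) + p_e * cell_mean(0, 1)
print(f"Marginal G: {mean_g1:.1f} vs. {mean_g0:.1f}")
```

To one decimal place the four cell means and the marginal means by G agree with the values listed above.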

• Clicking the Calculator will produce:

So we require N=1,370 individuals to detect the interaction effect. Also provided are the
sample sizes required to detect the marginal effects of G (N=1,173) and E (N=594). In the
independent-individuals design, these three sample sizes depend on the magnitudes of
the corresponding marginal $R^2$ effect sizes, i.e. larger $R^2$ translates into smaller N.
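
The three sample sizes can be approximated from the variance explained by each effect. This sketch rests on several assumptions that are not stated in this example: a two-sided significance level of 0.05, 80% power, independence of G and E, and a likelihood-ratio noncentrality of N times ln(residual variance under the null / residual variance under the alternative). The marginal tests are taken to compare a model with the single factor against an intercept-only model, and the interaction test to compare the full model against the main-effects model. Names are hypothetical.

```python
import math
from statistics import NormalDist

sd_y = 300.0
p_g, p_e = 0.6325 ** 2, 0.30
marg_g, marg_e, b_ge = -50.0, -75.0, -100.0

var_y = sd_y ** 2
# Variance in Y explained by each (mutually orthogonal) component.
v_g = marg_g ** 2 * p_g * (1 - p_g)
v_e = marg_e ** 2 * p_e * (1 - p_e)
v_ge = b_ge ** 2 * p_g * (1 - p_g) * p_e * (1 - p_e)

print(f"R2_G={v_g / var_y:.4f}  R2_E={v_e / var_y:.4f}  R2_GE={v_ge / var_y:.4f}")

# Noncentrality needed for 80% power at two-sided alpha = 0.05 (ASSUMED settings).
z = NormalDist()
ncp = (z.inv_cdf(1 - 0.05 / 2) + z.inv_cdf(0.80)) ** 2


def n_required(var_null, var_alt):
    """Approximate N from residual variances under the null and alternative models."""
    return ncp / math.log(var_null / var_alt)


n_g = n_required(var_y, var_y - v_g)
n_e = n_required(var_y, var_y - v_e)
resid_full = var_y - v_g - v_e - v_ge          # residual variance, full model
n_ge = n_required(resid_full + v_ge, resid_full)

print(round(n_g), round(n_e), round(n_ge))  # 1173 594 1370
```

Rounded to the nearest integer, these approximations agree with the three values reported above, which suggests the assumed settings are close to those used in the example.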

One might also consider a Parent-offspring design to test the above hypotheses. Changing
the Design to ‘Parent-offspring’ but keeping all other settings the same would produce:

For the parent-offspring design, we still require 1,370 sampling units to detect the interaction,
but here the sampling unit is a parent-offspring trio rather than an independent individual.

The number of trios required to detect the marginal G effect (N=2,106) is larger than the
requirement for the interaction (N=1,370), even though $R_G^2$ (0.0067) is larger than $R_{GE}^2$
(0.0056). This apparently anomalous result is due to the fact that in the parent-offspring trio
design, tests of genetic effects draw only from within-trio information, while the interaction
test draws from both within- and between-trio information (see Gauderman 2003, reference 4 on
the title page, for more on this).
