Professional Documents
Culture Documents
64:1186–1193, 1999
1186
Weinberg: Analysis of Genetic Studies with Partial Triads 1187
There are important parallels between the LRT and then repeats the maximization (the M step) of the like-
the TDT. Under a strict null hypothesis that the allele lihood on the basis of the newly revised, pseudocomplete
under study is neither linked to nor associated with the data. The method then revises the provisional assign-
disease, the parameters b1 and b2 in table 1 are 0, and ments of the incomplete triads (repeating the E step).
the LRT statistic has a central 2-df x2 distribution, This two-step alternating computation is repeated until
whereas the TDT statistic has a central 1-df x2 distri- convergence has been achieved. The likelihood then con-
bution. If there is association in the population but no sidered must be that based on the observed data (not
linkage, the null distributions of both test statistics are the pseudocomplete data) but determined by use of the
unaffected. If there is linkage but no association, and, estimated parameters. The likelihood contribution for
in contrast with the methods described by Ewens and dyads is based on the sum of probabilities across the
Spielman (1995), random mating is not assumed for the cells with which the observed data are compatible; in
parental generation, neither distribution will be x2, but the example given, this likelihood contribution would
both tests would have little power. Thus, equation (1) be based on the sum across the first two cells of table
(and its corresponding likelihood) provides a method for 1. Details are provided in Appendix A. Statistical theory
statistically testing for linkage disequilibrium (via the b guarantees that the observed-data likelihood (including
parameters), in that rejection of the null hypothesis sug- the partial information from dyads and monads) in-
gests the presence of both association and linkage. creases to a maximum via the algorithm; thus, an LRT
In earlier work, my colleagues and I performed sim- can be performed validly (Dempster et al. 1977). My
ulations to study the operating characteristics of the 2- experience with simulated data has been that conver-
df x2 LRT of the null hypothesis that b1 ⫽ b 2 ⫽ 0 (Wein- gence under the EM is achieved reliably in !40 iterations,
berg et al. 1998). The testing was performed under the by means of this simple multinomial application of the
broad class of genetic alternatives, without constraining method. Thus, simulations have become feasible.
the alternative to be, for example, a dominant model.
The simulated population was composed of two ad- Simulation Methods
mixed subpopulations that had very different gene prev-
alences and different baseline risks of disease (risks As in earlier work (Weinberg et al. 1998), the simu-
among individuals who carry no copies of the allele). lations were set up on the basis of an admixture of two
On the basis of those simulations, when both tests could subpopulations. For a 20% subpopulation, the gene
be applied (i.e., the genetic mechanism was not mater- prevalence was .3, and the background risk was .05; for
nally mediated), the LRT outperformed the TDT under the remaining 80% of the population, the gene preva-
either a recessive or a dominant genetic model but was lence was .1, and the background risk was .01. For sim-
slightly underpowered, compared with the TDT, under plicity, each of the two subpopulations was assumed to
the gene-dose model in which R 2 ⫽ R21. be in Hardy-Weinberg equilibrium, even though the re-
sulting mixed population was not and mating was not
Handling Missing Parents by Use of the Expectation- random in the mixed population. By this construction,
Maximization (EM) Algorithm there is a strong, positive (and etiologically meaningless)
association between the allele and the disease in the pop-
Now suppose that, for a given triad, the father is miss- ulation, even when R1 ⫽ R 2 ⫽ 1. All simulations and
ing and M ⫽ 2 ⫽ C. This father must be either F ⫽ 2 analyses were performed by use of the GLIM package
or F ⫽ 1, but the value for F cannot be known. Assume (Baker and Nelder 1978).
that “missingness” is unrelated to (M, F, C) in that the For each genetic scenario considered, 1,000 studies
probability that a father is missing is not related to the were simulated, each of which began with 100 case-
allele under study. If the parameters in table 1 are parent triads. If a proportion p of 1,000 simulated stud-
known, the probability that the father is in fact F ⫽ 2 ies reject the null hypothesis, then the empirical standard
is m 1 /(m 1 ⫹ m 2 ). Thus, roughly that proportion of dyads error for the rejection rate (size under the null hypothesis
of the form (M, F, C) ⫽ (2, unknown, 2) would have or power under alternatives to the null hypothesis) can
come from the first row of table 1. This is the idea be approximated by 冑p(1 ⫺ p)/1,000. Approximate
underlying the statistical method to be applied. 95% confidence intervals for the true rejection rate then
The approach proposed is a standard and easily ap- can be constructed by adding and subtracting twice this
plied statistical method for handling missing data, called number to p. For simplicity, the initial simulations as-
the “EM algorithm” (Dempster et al. 1977). In the pres- sumed that only the father could be missing, but this
ent context, the method fractionally assigns the incom- scenario was extended later to allow mothers to be miss-
plete triads to their theoretically possible cells, on the ing as well. The missingness of fathers was assumed to
basis of current estimates of the parameters (the E step), be unrelated to the allele under study and was assigned
Weinberg: Analysis of Genetic Studies with Partial Triads 1189
Results of Simulations
Figure 1 Simulation-based estimated powers for various testing
procedures, as a function of the percentage of fathers with missing
Simulation under the null model yielded rejection rates
data. The points plotted indicate the empirical proportion of tests (by
use of a .05 level of significance) that would have rejected the null for both the full-data LRT and the incomplete-data
hypothesis that b1 ⫽ b2 ⫽ 0 (equivalently, R1 ⫽ R2 ⫽ 1.0), among EM-LRT that were consistent with the .05 level. How-
1,000 simulated studies, each of which included 100 families. In the ever, see Appendix B for a minor correction that is help-
simulation of these data sets, the inherited genotype was assumed to ful, especially when a large proportion of triads are
influence risk via a recessive model (in which R1 ⫽ 1 and R2 ⫽ 3). The
incomplete.
dashed lines correspond to the full-data analyses using the 2-df LRT
or the TDT. The triangles plot the power resulting from application Figure 1 displays results for the recessive model, under
of the EM algorithm to include information from the incomplete triads. a scenario in which the gene effect is inherited. As in an
earlier report (Weinberg et al. 1998), the LRT (power
of .7) outperformed the TDT (power !.5) when the tri-
randomly by a Bernoulli mechanism (by use of uniformly ads were complete. With an increasing proportion of
distributed random numbers), with a probability, in in- missing fathers, the power of both the LRT and the TDT
crements of .10 within the range .00–.50. The propor- declined sharply when only the completely observed tri-
tion missing was assumed to be ⭐.50, primarily because, ads were used in the analysis. When incomplete triads
for only 100 families, the type I error rate is 1.05 when were included by means of the EM-LRT, however, the
a higher proportion of fathers is missing. power stayed high, evidently recapturing most of the
For comparison, under scenarios in which R1 1 1 or information. Figure 2 shows corresponding results under
R 2 1 1, each simulated study was analyzed with full a dominant model. Again, the EM recaptured most of
data, under both the LRT and the TDT. The analyses the lost power, with remarkably little loss even when
then were repeated, this time without the case-parent half the fathers were missing, compared with the full-
triads in which the father was flagged as missing and data analysis.
with only the completely observed triads. Then, the EM The TDT corresponds to the score statistic for the
algorithm was applied, as described above, to recapture gene-dose model (R 2 ⫽ R12; Schaid and Sommer 1994)
information from the partially observed triads, that is, and outperformed the 2-df LRT, under this scenario.
the dyads comprising only the mother and the child. Nevertheless, as shown in figure 3, the power of the TDT
Simulations were performed under the null hypothesis dropped rapidly when fathers were missing and fell be-
of no linkage disequilibrium (R1 ⫽ 1 ⫽ R 2), to verify low that of the EM-LRT when ⭓30% were missing.
that the empirical size of a level .05 test is consistent Again, the EM recaptured most of the power that would
with the nominal level. Then, data including triads with have been sacrificed by exclusion of the triads with miss-
missing fathers were simulated under a dominant model ing data.
(R1 ⫽ R 2 ⫽ 3), a recessive model (R1 ⫽ 1 and R 2 ⫽ 3), Results for genetic mechanisms that operate via the
and a gene-dose model (R1 ⫽ 2 and R 2 ⫽ 4). Simula- mother revealed more modest gains in power, for the
tions also were performed under a dominant model EM algorithm (the TDT is not shown, because the TDT
1190 Am. J. Hum. Genet. 64:1186–1193, 1999
Discussion
ever, this may not always be true. Nevertheless, param- as a literal representation of a multiplicative alternative
eters corresponding to mating-type–specific probabilities to the null hypothesis. In such a context, the use of the
that the father is missing could be incorporated (and EM algorithm will serve not only to provide a more
estimated), and EM-based maximization using the re- powerful statistical test but also to improve the precision
sulting likelihood could be performed (this model would of estimation for the relative risk parameters R1, R2, S1,
allow a test of the hypothesis that the six rates of miss- and S2. By contrast, if the gene under study is a marker
ingness are equal, via a 5-df x2 test). However, if different that is related to risk of the disease only through linkage
mating-type–specific rates of missingness were required disequilibrium with a nearby risk-conferring gene, then
for the father and the mother, the corresponding model the alternative model in table 1 is not quite correct, be-
would be overparameterized and intractable; thus, sim- cause the relative risk in the offspring will also depend
plification of assumptions sometimes is required. For ex- on the parental genotype and not only on the inherited
ample, the problem becomes statistically identifiable if genotype, owing to recombination. Nevertheless, the
a single rate of missingness for mothers is applied across EM-LRT still provides a valid test of linkage disequi-
all six mating-type categories. An alternative that might librium, relative to the null model specified by R1 ⫽
work well for some populations is to stratify on self- R 2 ⫽ 1.
defined ethnicity, under the assumption of possibly dif-
ferent rates of missingness across the strata. Of course,
there is always the option to use the analysis of complete
Acknowledgments
triads, either via the TDT or the LRT, since neither I wish to thank Drs. Norman Kaplan, Laura Scott, Marcy
method is invalidated by possibly differential rates of Speer, David Umbach, Richard Weinberg, and Allen Wilcox
missing data across the mating types. for their helpful comments.
Sun et al. (1998) recently discussed the remarkable
fact that linkage can be studied by use of only one parent Appendix A
per proband. The single-parent design can have practical
advantages, because the mother may be easier to recruit For simplicity of exposition, I assume that only fathers
than the father, and, in general, questions about parental are missing, but the method described can be easily gen-
status would not be of concern. Although Sun et al. eralized to accommodate missing mothers (and missing
(1998) described noniterative approaches to estimation, probands). Let Nijk denote the observed number of triads
maximum-likelihood methods such as the EM-LRT in which the mother, the father, and the child carry i, j,
would be expected to provide optimal efficiency. My and k copies, respectively, of the variant allele. Let Mi?k
simulations confirmed that a design based on mother/ denote the number(s) of dyads in which the mother has
offspring dyads, analysis of which would not be possible i copies, the child has k copies, and the father could not
by use of the TDT, provides good statistical power when be studied and therefore carries an unknown number of
analysis is based on the EM-LRT and when risk depends copies. Let N denote the total number of families stud-
on the inherited genotype. A 1-df LRT can be applied ied. Let pijk denote the cell probability for cell
to test for maternally mediated effects. (M, F, C) ⫽ (i, j, k)—that is, the expected count of table
The single-parent design, however, imposes notewor- 1 divided by N. If complete data were available, the
thy limitations on inference related to linkage. Genetic logarithm of the multinomial likelihood would be
mating symmetry cannot be verified on the basis of data
from dyads, and the full model (eq. [1]) cannot be si-
multaneously fit to distinguish between effects of the
log (L) ⫽ 冘
possible i,j,k
Nijk log (pijk) . (A1)
pression of (A1) on the basis of the estimated data. The away from 0. To avoid this problem, the iterations can
maximization can be done by use of standard software. be bounced away from this boundary, by checking for
r
Let Yijk denote the estimate for the count for cell (i, j, fitted values !.01, substituting .01 in place of such ex-
r
k), at iteration r, and let pijk denote the estimate for the tremely low values, and resuming the iteration.
probability for cell (i, j, k), at iteration r. Then, the E
step performs the next “estimation” as follows:
References
r
p Baker RJ, Nelder JA (1978) The GLIM system, release 3. Nu-
r⫹1
⫽ Nijk ⫹ M i?k ijk
Yijk
冘
possible j
r
pijk
.
merical Algorithms Group, Oxford
Curtis D, Sham PC (1995) A note on the application of the
transmission disequilibrium test when a parent is missing.
r⫹1
For the M step, Yijk is used in place of Nijk in equa- Am J Hum Genet 56:811–812
r⫹1
tion (A1), and the next set of pijk is obtained by Dempster AP, Laird NM, Rubin DB (1977) Maximum like-
maximization. lihood from incomplete data via the EM algorithm. J R Stat
These two steps are repeated in their turn until the Soc Ser B 39:1–22
parameter estimates converge. The theory guarantees Ewens WJ, Spielman RS (1995) The transmission/disequilib-
(Dempster et al. 1977) that, at each stage of iteration, rium test: history, subdivision, and admixture. Am J Hum
the likelihood given in equation (A2)—that is, the ob- Genet 57:455–464
Falk C, Rubinstein P (1987) Haplotype relative risks: an easy,
served-data likelihood—will increase. Following con-
reliable way to construct a proper control sample for risk
vergence to its maximum, the log likelihood then is cal- calculations. Ann Hum Genet 51:227–233
culated on the basis of equation (A2). When a model Schaid DJ, Sommer SS (1993) Genotype relative risks: methods
containing only the mating-type–stratum parameters is for design and analysis of candidate-gene association studies.
compared with one that also includes parameters cor- Am J Hum Genet 53:1114–1126
responding to R1 and R2, the change in twice the max- ——— (1994) Comparison of statistics for candidate-gene as-
imized log likelihood provides a test statistic with an sociation studies using cases and parents. Am J Hum Genet
approximately (large samples) 2-df x2 distribution (un- 55:402–409
der the null hypothesis), which then permits an LRT for Self S, Longton G, Kopecky K, Liang KY (1991) On estimating
linkage disequilibrium. Other nested models also can be HLA-disease association with application to a study of
compared, by contrasting, for example, the fits (via a 1- aplastic anemia. Biometrics 47:53–61
Spielman RS, Ewens WJ (1996) The TDT and other family-
df x2 test) for a model that allows separately for R1 and
based tests for linkage disequilibrium and association. Am
R2, compared with the more constrained dominance J Hum Genet 59:983–989
model specifying that R1 ⫽ R 2 or with a recessive model Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission
specifying that R1 ⫽ 1. test for linkage disequilibrium: the insulin gene region and
insulin-dependent diabetes mellitus (IDDM). Am J Hum Ge-
net 52:506–516
Appendix B Sun F, Flanders W, Yang Q, Khoury M (1998) A new method
The EM algorithm can be shown to have the property for estimating the risk ratio in studies using case-parental
that, at each iteration, the estimated likelihood function control design. Am J Epidemiol 148:902–909
Weinberg CR, Wilcox AJ, Lie RT (1998) A log-linear approach
increases. However, in certain applications in which
to case-parent–triad data: assessing effects of disease genes
probabilities can be estimated to be 0, the iterations can that act either directly or through maternal effects and that
become stuck at 0 for the component of the parameter may be subject to parental imprinting. Am J Hum Genet 62:
vector with a current estimate of 0. For example, if there 969–978
are no observed triads in which all three members are Wilcox AJ, Weinberg CR, Lie RT (1998) Distinguishing the
homozygous for the variant allele, then the fitted values effects of maternal and offspring genes through studies of
for mating type 1 will begin at 0 and can never move “case-parent triads.” Am J Epidemiol 148:893–901