Ann. Hum. Genet. (1999), 63, 535-538


Printed in Great Britain

Hardy-Weinberg quality control


I. GOMES, A. COLLINS, C. LONJOU, N. S. THOMAS, J. WILKINSON, M. WATSON
AND N. MORTON
Human Genetics, University of Southampton, Duthie Building, Southampton General Hospital,
Tremona Road, Southampton, U.K.
(Received 10.6.99. Accepted 15.10.99)

An efficient test of deviation from Hardy-Weinberg frequencies with one degree of freedom was
applied to 44 marker loci in a genome scan, and 7 loci had a significant excess of apparent
homozygotes ($\chi^2_1 > 6$) suggestive of typing error. In this example evidence for linkage did not
increase when outliers were censored. Statistical quality control is an essential part of genotyping,
and the effect of mistyping and map error should be considered in evaluating any genome scan.

Researchers on blood groups, isozymes, and DNA polymorphisms compare their phenotype
frequencies with binomial expectations under the Hardy-Weinberg law (Mourant et al. 1976).
Significant discrepancy, if not a type I error, is evidence of some complication. Inclusion of
related individuals must be gross to be detected, and inbreeding is almost never great enough to
be significant. Unrecognized null alleles must be rare to escape homozygosity and spurious
parentage exclusion. Epistasis is an unfamiliar joy. The most likely complication is systematic
mistyping of heterozygotes as homozygotes or vice versa. Quality control provided by a
Hardy-Weinberg test should be an essential part of any genome scan or other application of DNA
typing, and the claim that a new typing method is better than an old one should not be published
without evidence of acceptable fit to Hardy-Weinberg frequencies.

Correspondence: N. Morton, Human Genetics, University of Southampton, Duthie Building, Southampton
General Hospital, Tremona Road, Southampton, U.K.
Tel: +44 (0) 23 8079 6535; Fax: +44 (0) 23 8079 4264.
E-mail: nem@soton.ac.uk

In recent years we conducted a genome scan for asthma using an Applied Biosystems (ABI)
sequencer on dinucleotide repeats in nuclear families of Caucasian race and Wessex residence
(Watson et al. 1995; Doull et al. 1996; Wilkinson et al. 1998). Hardy-Weinberg quality control was
used to identify questionable genotypes for review and possible retyping. Some markers gave
such poor fit that their utility in multipoint mapping is questionable, but the effect of marker
error has not been studied.
Li & Horvitz (1953) gave several methods to estimate the inbreeding coefficient for loci with
codominant multiple alleles. These included the inefficient estimate from pooled homozygotes
that has been adopted in evolutionary genetics (Nei, 1972) as well as the fully efficient maximum
likelihood estimate $\hat\alpha$ that was later studied in detail by Yasuda (1968) and reviewed by
Robertson & Hill (1984) and Morton & Teague (1996). In brief, let $L$ be the likelihood for a
random sample of $N$ individuals at a locus with $R$ codominant allele frequencies $Q_r$
($r = 1, \ldots, R$). Then the maximum likelihood score is $U = \partial \ln L / \partial\alpha$
with information $K = N E(U^2)$. Under the null hypothesis, $H_0$, $\alpha = 0$ and then
$K = (R-1)N$ and in large samples $U^2/K$ has a $\chi^2_1$ distribution. The value of $\hat\alpha$
is constrained by $Q_r(1-\hat\alpha) + \hat\alpha \ge 0$, and so
$-Q_r/(1-Q_r) \le \hat\alpha \le 1$. The number of possible genotypes is $R(R+1)/2$, and $R$
parameters must be estimated under $H_0$ ($R-1$ gene frequencies and $N$), giving $R(R-1)/2$ degrees of
freedom (d.f.). Estimation of $\alpha$ under $H_1$ leaves $(R-2)(R+1)/2$ d.f. We examined both
quadratic (Pearson) and likelihood ratio (Shannon) forms of $\chi^2$. Our program for this analysis
is included in the ALLASS suite (http://cedar.genetics.soton.ac.uk/public_html/). Residual
$\chi^2$ is inflated when this analysis is applied to families, and then departure from
Hardy-Weinberg frequencies should be tested by an F ratio. Transmission in families provides
complementary quality control over mistyping, but parents, families with untested parents, and
heterozygous children from diallelic intercrosses are omitted and the evidence is dispersed over
multiple alleles and loci.

Among 44 loci typed for chromosomes 11, 12, 13 and 16 there were 7 with $\chi^2_1 > 6$, of which 2
exceeded 130 (D12S342 and D16S420). The remaining 37 loci give $\hat\alpha = U/K = 0.001$, a
typical value in countries without preferential consanguineous marriage (Morton, 1992). There
is no suggestion of heterogeneity among loci (Table 1, $\chi^2_{36} = 38$). Residual variation among
genotypes within loci is highly significant, but the F statistic is only 1.17 for the Shannon test.
Many possible genotypes were observed rarely if at all, making $\chi^2$ tests with many degrees of
freedom unreliable (Agresti, 1990, pp. 246-247). The inflation of F must reflect non-independence
of family members, since parents and children were pooled as if they were independent
observations. Heterogeneity among the 7 loci with $U^2/K > 6$ is highly significant
($\chi^2_6 = 231$), but residual heterogeneity is non-significant ($\chi^2_{912} = 911$). The pooled
residual is $\chi^2_{3386} = 3806$, F = 1.12. Inclusion of family members has a highly
significant, but small, effect that is no greater for deviant loci.

Table 1. Hardy-Weinberg quality control

                                         $\chi^2$ under $H_0$        $\chi^2$ under $H_1$
Source         Loci  Score U  Weight K   d.f.  Pearson  Shannon   d.f.  Pearson  Shannon  $U^2/K$
$U^2/K < 6$      37      309   262 454   2511     3500     3022   2474     3442     2896       38
$U^2/K > 6$       7     4304   207 999    919     3078     1041    912     1394      911      320

We examined the effect of censoring outliers on evidence for linkage of asthma score to distal
chromosome 12 (Table 2). The most significant deviations correspond to $\alpha = 0.085$ and 0.025,
respectively. If there is a locus in this region contributing to asthma, some information about
its position must be lost through mistyping. To our surprise the evidence decreased when these
markers were omitted, although the D12S342 marker with gross discrepancy is near the lod
peak. Apparently each marker (even if sometimes mistyped) contributed information that is lost
when deviant markers are censored. There was little change in the estimate of effect and
maximum likelihood location.

Table 2. Censoring outliers from multipoint model (Wilkinson et al. 1998)

Source             ML location (cM)   Effect    lod
All markers                   173.6     0.18   3.06
Minus 2 outliers              172.1     0.19   2.89
Minus 4 outliers              173.3     0.16   2.10
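
The summary figures quoted in the text can be recovered from the Table 1 totals. The Python sketch below is an illustrative check, not part of the original analysis. It assumes that the final column of Table 1 is the sum over loci of $U^2/K$, and that heterogeneity among loci is that sum minus the pooled $(\sum U)^2/\sum K$; both readings are assumptions, although they reproduce the reported values. The dictionary keys and the function name `summarize` are hypothetical.

```python
# Totals transcribed from Table 1 (37 loci with U^2/K < 6, 7 loci with U^2/K > 6).
rows = {
    "U2/K < 6": dict(loci=37, U=309, K=262454, sum_u2k=38, df_h1=2474, shannon_h1=2896),
    "U2/K > 6": dict(loci=7, U=4304, K=207999, sum_u2k=320, df_h1=912, shannon_h1=911),
}


def summarize(row):
    """Return (pooled alpha estimate, heterogeneity chi-square, residual F ratio)."""
    alpha_hat = row["U"] / row["K"]               # pooled estimate U/K
    pooled_chi2 = row["U"] ** 2 / row["K"]        # 1 d.f. for the pooled alpha
    heterogeneity = row["sum_u2k"] - pooled_chi2  # assumed to carry (loci - 1) d.f.
    f_ratio = row["shannon_h1"] / row["df_h1"]    # residual Shannon chi-square per d.f.
    return alpha_hat, heterogeneity, f_ratio


for name, row in rows.items():
    alpha_hat, het, f_ratio = summarize(row)
    print(f"{name}: alpha_hat={alpha_hat:.3f}, heterogeneity={het:.0f} "
          f"on {row['loci'] - 1} d.f., F={f_ratio:.2f}")

# Pooled residual over all 44 loci (Shannon form under H1).
pooled_f = (2896 + 911) / (2474 + 912)
print(f"pooled F={pooled_f:.2f}")
```

On these totals the 37 well-behaved loci give an estimate of about 0.001 with heterogeneity near 38 on 36 d.f., the 7 deviant loci give heterogeneity near 231 on 6 d.f., and the pooled F is about 1.12.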

A remarkable result of this analysis is that all the departure from Hardy-Weinberg frequencies
is concentrated in the single degree of freedom for $\alpha$, despite the large number of residual
degrees of freedom. This parsimony greatly increases power (Agresti, 1990, pp. 182-183) as well as
reliability in sparse tables. In this example the effect of sampling relatives is negligible, the
nominal $\chi^2$ value of 6 corresponding to 5.3 when residual variation is allowed for. The
Pearson $\chi^2$ is more sensitive to small expected numbers and gives a larger residual $\chi^2$.
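
A minimal sketch of the one-degree-of-freedom score test may make the parsimony concrete. It is not the ALLASS implementation. It assumes the usual inbreeding model, with homozygote probability $Q_r^2 + \alpha Q_r(1-Q_r)$ and heterozygote probability $2Q_rQ_s(1-\alpha)$, which is consistent with the constraint on $\hat\alpha$ given earlier but is not written out in the text. Allele frequencies are estimated by gene counting, $U/K$ is used as a first-step estimate of $\alpha$, and the function name `hw_score_test` and the `f_ratio` argument are hypothetical.

```python
from collections import Counter


def hw_score_test(genotype_counts, f_ratio=1.0):
    """Score test of alpha = 0 from counts keyed by (allele, allele) tuples.

    Returns (U, K, chi-square with 1 d.f., U/K). The chi-square is divided by
    f_ratio to allow for residual variation (f_ratio=1.0 means no allowance).
    """
    n = sum(genotype_counts.values())
    allele_counts = Counter()
    for (a, b), count in genotype_counts.items():
        allele_counts[a] += count
        allele_counts[b] += count
    q = {allele: c / (2 * n) for allele, c in allele_counts.items()}

    # d ln P / d alpha at alpha = 0 under the assumed model:
    # (1 - Q_r) / Q_r for a homozygote rr, and -1 for any heterozygote.
    u = 0.0
    for (a, b), count in genotype_counts.items():
        u += count * ((1 - q[a]) / q[a] if a == b else -1.0)

    k = (len(q) - 1) * n  # K = (R - 1) N under H0
    chi2 = u * u / k / f_ratio
    return u, k, chi2, u / k


# Hypothetical diallelic sample with a deficit of heterozygotes.
u, k, chi2, alpha_hat = hw_score_test({("A", "A"): 30, ("A", "B"): 40, ("B", "B"): 30})
print(u, k, chi2, alpha_hat)

# Allowance for residual variation: nominal chi-square of 6 with the pooled F ratio.
print(round(6 / (3806 / 3386), 1))
```

For the hypothetical sample the heterozygote deficit is 20% (40 observed against 50 expected), and the score test returns the same value as $U/K$. Dividing the nominal $\chi^2$ of 6 by the pooled F ratio from Table 1 gives about 5.3, as stated above.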
Other data we examined show similar frequencies of presumptive errors as an excess of
apparent homozygotes. The most critical stage
in genotyping by gel electrophoresis is not in
binning (Perlin et al. 1995 ; Ghosh et al. 1997 ;
Idury & Cardon, 1997) but in the subjective
judgement that only one peak density is significant. Sophisticated software would retain
secondary peaks and optimize haplotype segregation in families, but current software retains
the essential features of sequencing and does not
consider haplotypes or family relationships.
Even enhanced software might not cope with
secondary peaks (usually of larger alleles) that,
under sub-optimal conditions, may in some
individuals be imperceptible or smaller than
stutter bands. Tri- and tetranucleotide repeats
are less likely to confound a similar allele with a
stutter band, but do not eliminate the subjective
judgement between one and two peaks. Error in
genotyping single nucleotide polymorphisms
(SNPs) may be reduced by their phenotypic
simplicity, but error detection in families is less
efficient than for multiallelic loci (Gordon et al.
1999).
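
The link between missed secondary peaks and an apparent inbreeding coefficient can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration rather than a model fitted to these data: it supposes true Hardy-Weinberg proportions, alleles listed in increasing size, and a single probability `dropout` that the larger allele of any heterozygote is missed, so that the individual is scored as homozygous for the smaller allele. The function name `apparent_alpha` is hypothetical.

```python
from itertools import combinations


def apparent_alpha(allele_freqs, dropout):
    """Apparent heterozygote deficit, 1 - H_obs / H_exp, after allele dropout.

    H_exp is computed from the apparent (post-dropout) allele frequencies,
    as it would be in practice.
    """
    r = len(allele_freqs)
    homo = [q * q for q in allele_freqs]  # true Hardy-Weinberg homozygotes
    half_het = [0.0] * r                  # half of retained heterozygotes per allele
    het_total = 0.0
    for i, j in combinations(range(r), 2):  # i < j, so j is the larger allele
        het = 2 * allele_freqs[i] * allele_freqs[j]
        kept = (1 - dropout) * het
        homo[i] += dropout * het          # misread as homozygous for the smaller allele
        half_het[i] += kept / 2
        half_het[j] += kept / 2
        het_total += kept
    apparent = [h + x for h, x in zip(homo, half_het)]
    expected_het = 1 - sum(p * p for p in apparent)
    return 1 - het_total / expected_het


print(apparent_alpha([0.5, 0.5], 0.10))
```

In this toy model the apparent coefficient is close to the dropout rate itself, so a mistyping rate of a few per cent among heterozygotes would be enough to produce values far above those expected from population structure.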
Our experience with chromosome 12 suggests
that even mistyped loci contribute some evidence, but this may not always be true. We have
not yet examined effects of map errors on
evidence for primary and secondary peaks (Concannon et al. 1998). In a candidate region a lod
depressed at one locus by mistyping may create
a secondary peak at a neighbouring locus, and
there is no credible test of significance for a
secondary peak on the flank of a strong candidate
region (Mein et al. 1998).
An apparent inbreeding coefficient of 0.01 or
greater within one ethnic group and populous
locality conflicts with all estimates from gene
frequencies, isonymy, pedigrees, and migration,
and therefore cannot plausibly be attributed to
population structure. Null alleles caused by
primer polymorphism or template deletion are
not so easily dismissed in the absence of family
studies, proven homozygotes, or sequencing.
Significantly negative estimates of $\alpha$ are less
frequent, but can arise through type I error,
interracial crossing, bivalent alleles, or systematic
mistyping of homozygotes as heterozygotes.
Any mistyping is most convincingly demonstrated by repeat testing, which is usually not
comprehensive and therefore cannot exclude
error frequencies of several per cent. This places
great weight on statistical quality control for
genetic epidemiology and studies of population
diversity.
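
The limited reach of partial retyping can be put in rough numbers. The sketch below is a hypothetical illustration, not a calculation from this study: it assumes that errors are independent, that retyping would reveal every error (systematic mistyping that repeats on retesting would make matters worse), and that all `n_retyped` genotypes agreed with the original calls. The function name and the sample sizes are illustrative.

```python
def max_undetected_error_rate(n_retyped, confidence=0.95):
    """Upper confidence bound on the error rate when all retyped genotypes agree.

    Solves (1 - e) ** n_retyped = 1 - confidence for e.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_retyped)


for n in (50, 100, 300):
    print(n, round(max_undetected_error_rate(n), 3))
```

With 100 concordant retypings the bound is still about 3%, which is the sense in which repeat testing of a subset cannot exclude error frequencies of several per cent.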

Agresti, A. (1990). Categorical data analysis. John Wiley and Sons, New York.
Concannon, P. et al. (1998). A second-generation screen of the human genome for
susceptibility to insulin-dependent diabetes mellitus. Nat. Genet. 19, 292-296.
Doull, I. J. M. et al. (1996). Allelic association of markers on chromosomes 5q and 11q with
atopy and bronchial hyperresponsiveness. Am. J. Respir. Crit. Care Med. 153, 1280-1284.
Ghosh, S. et al. (1997). Methods for precise sizing, automated binning of alleles, and
reduction of error rates in large-scale genotyping using fluorescently labeled dinucleotide
markers. Genome Res. 7, 165-178.
Gordon, D. C. et al. (1999). True pedigree errors more frequent than apparent errors for
single nucleotide polymorphisms. Hum. Hered. 49, 65-70.
Idury, R. M. & Cardon, L. R. (1997). A simple method for automated allele binning in
microsatellite markers. Genome Res. 7, 1104-1109.
Li, C. C. & Horvitz, D. G. (1953). Some methods of estimating the inbreeding coefficient.
Am. J. Hum. Genet. 5, 107-117.
Mein, C. A. et al. (1998). A search for type 1 diabetes susceptibility genes in families from
the United Kingdom. Nat. Genet. 19, 297-300.
Morton, N. E. (1992). Genetic structure of forensic populations. Proc. Natl. Acad. Sci. USA
89, 2556-2560.
Morton, N. E. & Teague, J. W. (1996). Kinship, inbreeding, and matching probabilities. In:
Molecular biology and human diversity (Ed. A. J. Boyce & C. G. N. Mascie-Taylor), pp. 51-61.
Cambridge University Press.
Mourant, A. E. et al. (1976). The distribution of the human blood groups and other
polymorphisms. Oxford: Oxford University Press.
Nei, M. (1972). Genetic distance between populations. Am. Nat. 106, 283-292.
Perlin, M. W. et al. (1995). Toward fully automated genotyping: genotyping microsatellite
markers by deconvolution. Am. J. Hum. Genet. 57, 1199-1210.
Robertson, A. & Hill, W. G. (1984). Deviations from Hardy-Weinberg proportions: sampling
variances and use in estimation of inbreeding coefficients. Genetics 107, 703-718.
Watson, M. et al. (1995). Exclusion from proximal 11q of a common gene with megaphenic
effect on atopy. Ann. Hum. Genet. 59, 403-411.
Wilkinson, J. et al. (1998). Linkage of asthma to markers on chromosome 12 in a sample of
240 families using quantitative phenotype scores. Genomics 53, 251-259.
Yasuda, N. (1968). Estimation of the inbreeding coefficient from phenotype frequencies by a
method of maximum likelihood scoring. Biometrics 24, 915-35.
